Interpretable phenotype decoding from multicondition sequencing data with ALPINE

  Vicky Yao 2,3,4

  1. Systems, Synthetic, and Physical Biology, Rice University, Houston, Texas 77005, USA
  2. Department of Computer Science, Rice University, Houston, Texas 77005, USA
  3. Ken Kennedy Institute, Rice University, Houston, Texas 77005, USA
  4. Rice Synthetic Biology Institute, Rice University, Houston, Texas 77005, USA
  5. These authors contributed equally to this work.

  • Corresponding author: vy@rice.edu
  • Abstract

    As sequencing techniques advance in precision, affordability, and diversity, an abundance of heterogeneous sequencing data has become available, encompassing a wide range of phenotypic features and biological perturbations. Unfortunately, increased resolution comes at the cost of increased complexity of the biological search space, even at the individual study level, as perturbations are now often examined across many dimensions simultaneously, including different donor phenotypes, anatomical regions and cell types, and time points. Furthermore, broad integration across studies offers a unique opportunity to explore the molecular underpinnings of distinct healthy and disease states beyond the original scope of any individual study. To fully realize the promise of both individual higher-resolution studies and large cross-study integrations, we need a robust methodology that can disentangle the influence of technical and nonrelevant phenotypic factors, isolating relevant condition-specific signals from shared biological information while also providing interpretable insights into the genetic effects of these conditions. Current methods typically excel in only one of these areas. To address this gap, we have developed ALPINE, a supervised nonnegative matrix factorization (NMF) framework that effectively separates both technical and nontechnical factors while simultaneously offering direct interpretability of condition-associated genes. Through simulations across four different scenarios, we demonstrate that ALPINE outperforms existing methods in both isolating the effect of different phenotypic conditions and prioritizing condition-associated genes. Furthermore, ALPINE has favorable performance in batch effect removal compared with state-of-the-art integration methods. When applied to real-world case studies, we showcase how ALPINE can be used to extract insights into the biological mechanisms that underlie differences between phenotypic conditions.
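    For readers unfamiliar with NMF, the following is a minimal sketch of plain, unsupervised NMF using the standard multiplicative updates for the Frobenius-norm objective. It is not ALPINE's algorithm: the abstract does not specify how supervision enters the factorization, so none is modeled here. The function name `nmf`, its parameters, and the toy matrix are illustrative assumptions.

```python
import numpy as np

def nmf(X, k, n_iter=200, eps=1e-10, seed=0):
    """Plain NMF sketch: approximate nonnegative X (n x m) as W @ H.

    Illustrative only; this is generic unsupervised NMF, not ALPINE.
    """
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Random nonnegative initialization of both factors.
    W = rng.random((n, k)) + eps
    H = rng.random((k, m)) + eps
    for _ in range(n_iter):
        # Multiplicative updates keep W and H nonnegative by construction.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy example: a hypothetical 20 x 30 nonnegative matrix of exact rank 3.
rng = np.random.default_rng(42)
X = rng.random((20, 3)) @ rng.random((3, 30))
W, H = nmf(X, k=3)
relative_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
```

    Because both factors stay nonnegative, each component contributes additively to the reconstruction, which is the general property that makes NMF factors readable as parts of the data, for example as sets of genes with nonzero loadings.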

    Footnotes

    • Received February 15, 2025.
    • Accepted October 8, 2025.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    Genome Res. © 2025 Lee et al.; Published by Cold Spring Harbor Laboratory Press
