Aggregation of recount3 RNA-seq data improves inference of consensus and tissue-specific gene coexpression networks

    • 1 Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA;
    • 2 Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA;
    • 3 Department of Biostatistics, Johns Hopkins School of Public Health, Baltimore, Maryland 21205, USA;
    • 4 Department of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21287, USA;
    • 5 Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, Maryland 21218, USA;
    • 6 Data Science and AI Institute, Johns Hopkins University, Baltimore, Maryland 21218, USA
Published July 17, 2025. Genome Research, Vol 35 Issue 9, pp. 2087-2103. https://doi.org/10.1101/gr.280808.125

Abstract

Gene coexpression networks (GCNs) describe relationships among genes that maintain cellular identity and homeostasis. However, typical RNA-seq experiments often lack sufficient sample sizes for reliable GCN inference. recount3, a data set with 316,443 processed human RNA-seq samples, provides an opportunity to improve network reconstruction. However, GCN inference from public data is challenged by confounders and inconsistent labeling. To address this, we develop a pipeline to annotate samples based on cell-type composition. By comparing aggregation strategies, we find that regressing confounders within studies and prioritizing larger studies optimizes network reconstruction. We apply these findings to infer three consensus networks (universal, cancer, noncancer) and 27 context-specific networks. Central genes in consensus networks are enriched for evolutionarily constrained genes and ubiquitous biological pathways, whereas context-specific central nodes include tissue-specific transcription factors. The increased statistical power from data aggregation facilitates the derivation of variant annotations from context-specific networks, which are significantly enriched for complex-trait heritability independent of overlap with baseline functional genomic annotations. Although data aggregation led to strictly increasing held-out log-likelihood, we observe diminishing marginal improvements, suggesting that integrating complementary modalities, such as Hi-C and ChIP-seq, can further refine network reconstruction. Our approach outlines best practices for GCN inference and highlights both the strengths and limitations of data aggregation.
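
The aggregation idea summarized above (regress confounders within each study, then give larger studies more weight) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `residualize` and `aggregate_covariance`, the use of a single known confounder column per study, the weights proportional to sample size, and the simulated data are all assumptions made for illustration. A network estimator would then be fit to the aggregated covariance, a step omitted here.

```python
import numpy as np

def residualize(expr, confounders):
    """Regress confounders out of one study's expression matrix (samples x genes)."""
    # An intercept column is included, so the residuals are also mean-centered.
    design = np.column_stack([np.ones(expr.shape[0]), confounders])
    coef, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ coef

def aggregate_covariance(studies):
    """Average per-study covariances, weighting each study by its sample count.

    `studies` is a list of (expression, confounders) pairs. Weighting by
    sample size is an assumed way of "prioritizing larger studies".
    """
    total = sum(expr.shape[0] for expr, _ in studies)
    n_genes = studies[0][0].shape[1]
    agg = np.zeros((n_genes, n_genes))
    for expr, conf in studies:
        resid = residualize(expr, conf)            # correction happens within the study
        cov = resid.T @ resid / resid.shape[0]     # residuals are already centered
        agg += (expr.shape[0] / total) * cov
    return agg

# Simulated data: two studies of different size, three independent genes,
# and one hypothetical confounder that shifts every gene in a sample.
rng = np.random.default_rng(0)
studies = []
for n_samples in (200, 50):
    batch = rng.normal(size=(n_samples, 1))
    signal = rng.normal(size=(n_samples, 3))
    studies.append((signal + 2.0 * batch, batch))

agg = aggregate_covariance(studies)
pooled = np.cov(np.vstack([expr for expr, _ in studies]), rowvar=False)
print(np.round(agg, 2))     # off-diagonal entries should be small
print(np.round(pooled, 2))  # uncorrected pooling keeps the spurious covariance
```

In this toy setup the genes are independent by construction, so any sizable off-diagonal covariance in the uncorrected pooled matrix comes from the confounder; removing it within each study before aggregating is what suppresses it.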
