Aggregation of recount3 RNA-seq data improves inference of consensus and tissue-specific gene coexpression networks
- Prashanthi Ravichandran (1),
- Princy Parsana (2),
- Rebecca Keener (1),
- Kasper D. Hansen (1,3,4) and
- Alexis Battle (1,2,4,5,6)
- (1) Department of Biomedical Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA;
- (2) Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA;
- (3) Department of Biostatistics, Johns Hopkins School of Public Health, Baltimore, Maryland 21205, USA;
- (4) Department of Genetic Medicine, Johns Hopkins University, Baltimore, Maryland 21287, USA;
- (5) Malone Center for Engineering in Healthcare, Johns Hopkins University, Baltimore, Maryland 21218, USA;
- (6) Data Science and AI Institute, Johns Hopkins University, Baltimore, Maryland 21218, USA
Abstract
Gene coexpression networks (GCNs) describe relationships among genes that maintain cellular identity and homeostasis. However, typical RNA-seq experiments often lack sufficient sample sizes for reliable GCN inference. recount3, a data set with 316,443 processed human RNA-seq samples, provides an opportunity to improve network reconstruction. However, GCN inference from public data is challenged by confounders and inconsistent labeling. To address this, we develop a pipeline to annotate samples based on cell-type composition. By comparing aggregation strategies, we find that regressing confounders within studies and prioritizing larger studies optimizes network reconstruction. We apply these findings to infer three consensus networks (universal, cancer, noncancer) and 27 context-specific networks. Central genes in consensus networks are enriched for evolutionarily constrained genes and ubiquitous biological pathways, whereas context-specific central nodes include tissue-specific transcription factors. The increased statistical power from data aggregation facilitates the derivation of variant annotations from context-specific networks, which are significantly enriched for complex-trait heritability independent of overlap with baseline functional genomic annotations. Although data aggregation led to strictly increasing held-out log-likelihood, we observe diminishing marginal improvements, suggesting that integrating complementary modalities, such as Hi-C and ChIP-seq, can further refine network reconstruction. Our approach outlines best practices for GCN inference and highlights both the strengths and limitations of data aggregation.
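One way to picture the aggregation strategy the abstract favors (regressing confounders within each study, then giving larger studies more weight) is the minimal Python sketch below. It is an illustration under stated assumptions, not the authors' pipeline: the function names `residualize` and `aggregate_covariance` are hypothetical, the confounders are assumed to be supplied as a per-study covariate matrix, and sample-size weighting of per-study covariance matrices is only one simple reading of "prioritizing larger studies".

```python
import numpy as np


def residualize(expr, covariates):
    """Remove the linear effect of covariates from an expression matrix.

    expr:       (samples x genes) array for ONE study
    covariates: (samples x k) array of assumed confounders for that study
    """
    # Design matrix with an intercept, so residuals are also mean-centered.
    design = np.column_stack([np.ones(expr.shape[0]), covariates])
    beta, *_ = np.linalg.lstsq(design, expr, rcond=None)
    return expr - design @ beta


def aggregate_covariance(studies):
    """Sample-size-weighted average of per-study gene covariance matrices.

    studies: list of (expr, covariates) pairs, one pair per study.
    Confounders are regressed out within each study before pooling
    (an assumption of this sketch, mirroring the abstract's description).
    """
    weighted_sum = None
    total_samples = 0
    for expr, covariates in studies:
        resid = residualize(expr, covariates)
        n = resid.shape[0]
        cov = np.cov(resid, rowvar=False)  # genes x genes
        weighted_sum = n * cov if weighted_sum is None else weighted_sum + n * cov
        total_samples += n
    return weighted_sum / total_samples
```

The aggregated covariance matrix could then be passed to a sparse network estimator; the abstract does not name one, so that choice is left open here.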
Footnotes
- [Supplemental material is available for this article.]
- Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.280808.125.
- Freely available online through the Genome Research Open Access option.
- Received April 18, 2025.
- Accepted July 8, 2025.
This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.











