Aggregation of recount3 RNA-seq data improves inference of consensus and tissue-specific gene coexpression networks

Abstract

Gene coexpression networks (GCNs) describe relationships among genes that maintain cellular identity and homeostasis. However, typical RNA-seq experiments often lack the sample sizes needed for reliable GCN inference. The recount3 resource, which provides 316,443 processed human RNA-seq samples, offers an opportunity to improve network reconstruction. GCN inference from public data is nonetheless challenged by confounders and inconsistent sample labeling. To address these challenges, we developed a pipeline that annotates samples based on cell type composition. By comparing aggregation strategies, we found that regressing out confounders within each study and prioritizing larger studies yielded the best network reconstruction. We applied these findings to infer three consensus networks (universal, cancer, non-cancer) and 27 context-specific networks. Central genes in the consensus networks were enriched for evolutionarily constrained genes and ubiquitous biological pathways, whereas central nodes in the context-specific networks included tissue-specific transcription factors. The increased statistical power from data aggregation made it possible to derive variant annotations from the context-specific networks, and these annotations were significantly enriched for complex-trait heritability independent of their overlap with baseline functional genomic annotations. Although held-out log-likelihood increased strictly as more data were aggregated, the marginal improvements diminished, suggesting that integrating complementary modalities, such as Hi-C and ChIP-seq, could further refine network reconstruction. Our approach outlines best practices for GCN inference and highlights both the strengths and the limitations of data aggregation.
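
As a rough illustration of the aggregation strategy summarized above, the following Python sketch removes confounders within each study and then combines per-study covariance matrices with weights proportional to sample size. It is a minimal sketch under stated assumptions, not the pipeline used in this work: the function names `residualize` and `aggregate_covariance` are hypothetical, the top principal components of each study stand in for its confounders, and "prioritizing larger studies" is read here as sample-size weighting.

```python
import numpy as np


def residualize(expr, n_pcs=2):
    """Remove the top principal components from a samples x genes matrix.

    Assumption: the leading PCs act as a proxy for within-study confounders.
    """
    centered = expr - expr.mean(axis=0, keepdims=True)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    k = min(n_pcs, len(s))
    # Subtract the rank-k reconstruction spanned by the top k components.
    return centered - (u[:, :k] * s[:k]) @ vt[:k]


def aggregate_covariance(studies, n_pcs=2):
    """Sample-size-weighted average of per-study gene covariance matrices.

    Assumption: larger studies are prioritized by weighting each study's
    covariance by its number of samples.
    """
    weighted_sum = 0.0
    total_samples = 0
    for expr in studies:
        resid = residualize(expr, n_pcs)
        n = resid.shape[0]
        cov = resid.T @ resid / (n - 1)
        weighted_sum = weighted_sum + n * cov
        total_samples += n
    return weighted_sum / total_samples


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two simulated studies of different sizes over the same 5 genes.
    small = rng.normal(size=(10, 5))
    large = rng.normal(size=(90, 5))
    print(aggregate_covariance([small, large], n_pcs=1).shape)
```

The aggregated covariance matrix could then be passed to a sparse network estimator of the reader's choice; that step is omitted here because the abstract does not specify it.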