Deep structural clustering reveals hidden systematic biases in RNA sequencing data
- Qiang Su1,2,3,9,
- Yi Long4,9,
- Deming Gou5,
- Junmin Quan3,
- Xiaoming Zhou6 and
- Qizhou Lian1,2,7,8
- 1Faculty of Synthetic Biology, Shenzhen University of Advanced Technology, Shenzhen 518107, China;
- 2State Key Laboratory of Quantitative Synthetic Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China;
- 3State Key Laboratory of Chemical Oncogenomics, School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, Shenzhen 518055, China;
- 4Institute of Chemical Biology, Shenzhen Bay Laboratory, Shenzhen 518132, China;
- 5Shenzhen Key Laboratory of Microbial Genetic Engineering, Vascular Disease Research Center, College of Life Sciences and Oceanography, Shenzhen University, Shenzhen 518060, China;
- 6School of Life Sciences, MOE Key Laboratory of Laser Life Science and Guangdong Provincial Key Laboratory of Laser Life Science, College of Biophotonics, South China Normal University, Guangzhou 510631, China;
- 7Cord Blood Bank, Guangzhou Institute of Eugenics and Perinatology, Guangzhou Women and Children's Medical Center, Guangzhou Medical University, Guangzhou 511436, China;
- 8State Key Laboratory of Pharmaceutical Biotechnology and Department of Medicine, the University of Hong Kong, Hong Kong SAR, China
- 9These authors contributed equally to this work.
Abstract
RNA sequencing (RNA-seq) is a pivotal tool for transcriptomic analysis, enabling comprehensive exploration of gene expression across diverse biological contexts. However, RNA-seq data are susceptible to various biases that can significantly compromise the accuracy and reliability of transcript quantification. This study investigates the influence of high-dimensional RNA structures on local sequencing efficiency using an unsupervised variational autoencoder-Gaussian mixture model (VAE-GMM). The VAE-GMM captures high-dimensional k-mer structural similarities by learning compact latent representations, which reduces dimensionality while preserving the structural features needed for bias identification. This modeling allows local RNA-read conversion dynamics to be tracked and complex, often overlooked bias sources to be identified. We validate the performance and robustness of the VAE-GMM against conventional machine learning techniques, including Gaussian mixture models (GMM-only), principal component analysis-based GMMs, k-means clustering, and hierarchical clustering. These validations use a diverse array of data sets, including synthetic RNA constructs, various human cell lines, and authentic tissue samples, and consistently demonstrate the model's superior versatility and accuracy across different biological systems. Furthermore, in silico simulations of the sequencing process closely align with actual sequencing data, reinforcing the role of high-dimensional RNA structures in determining sequencing efficiency and their impact on data quality. Our findings offer insights into the mechanisms underlying RNA structure–mediated sequencing bias. This understanding enables more accurate and reliable RNA-seq analyses and is expected to improve the interpretation of transcriptomic data in future genomic studies.
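To make the clustering setup more concrete, here is a minimal sketch of k-means clustering, one of the conventional baselines named in the abstract. It is not the authors' VAE-GMM, and it does not use their structural features. It assumes, purely for illustration, that each RNA fragment is represented by its normalized k-mer frequency vector. The function names (`kmer_profile`, `kmeans`) and the toy fragments are hypothetical.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=2):
    """Return the normalized k-mer frequency vector of an RNA sequence.

    Assumption: sequence composition stands in for the structural
    features used in the study, which the abstract does not specify.
    """
    kmers = ["".join(p) for p in product("ACGU", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers) or 1  # avoid division by zero
    return [counts[m] / total for m in kmers]


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, centroids, n_iter=20):
    """Lloyd's algorithm with caller-supplied initial centroids."""
    labels = []
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c, old in enumerate(centroids):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(old)  # keep an empty cluster's centroid
        if updated == centroids:  # converged
            break
        centroids = updated
    return labels, centroids


# Toy fragments, made up for illustration: three GC-rich, three AU-rich.
fragments = [
    "GGCGCCGGCGCG", "GCGGCGCCGGCC", "CCGGCGGCGCGG",
    "AUUAUAUAAUUA", "UAUAAUUAUAUU", "AAUUAUAUUAAU",
]
profiles = [kmer_profile(f, k=2) for f in fragments]

# Deterministic initialization: one seed centroid from each end of the list.
labels, _ = kmeans(profiles, [profiles[0], profiles[-1]])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

In this toy case the two groups share no dinucleotides, so they separate in one assignment step. A latent-variable approach such as the VAE-GMM described above would instead cluster in a learned low-dimensional representation rather than in the raw feature space.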
Footnotes
- [Supplemental material is available for this article.]
- Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.280713.125.
- Received March 27, 2025.
- Accepted September 15, 2025.
This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.











