k-mer manifold approximation and projection for visualizing DNA sequences
- Chengbo Fu1,
- Einari A. Niskanen2,
- Gong-Hong Wei3,4,
- Zhirong Yang5,6,
- Marta Sanvicente-García7,
- Marc Güell7,8 and
- Lu Cheng1,2
- 1Department of Computer Science, School of Science, Aalto University, 02150 Espoo, Finland;
- 2Institute of Biomedicine, University of Eastern Finland, 70211 Kuopio, Finland;
- 3Fudan University Shanghai Cancer Center & MOE Key Laboratory of Metabolism and Molecular Medicine and Department of Biochemistry and Molecular Biology of School of Basic Medical Sciences, Shanghai Medical College of Fudan University, 200032 Shanghai, China;
- 4Disease Networks Research Unit, Faculty of Biochemistry and Molecular Medicine, Biocenter Oulu, University of Oulu, 90220 Oulu, Finland;
- 5Department of Computer Science, Norwegian University of Science and Technology, 7491 Trondheim, Norway;
- 6Jinhua Institute of Zhejiang University, 321032 Zhejiang, China;
- 7Department of Medicine and Life Sciences, Universitat Pompeu Fabra, 08003 Barcelona, Spain;
- 8Institució Catalana de Recerca i Estudis Avançats, ICREA, 08003 Barcelona, Spain
Abstract
Identifying and illustrating patterns in DNA sequences are crucial tasks in various biological data analyses. In these tasks, patterns are often represented by sets of k-mers, the fundamental building blocks of DNA sequences. To visually unveil these patterns, one could project each k-mer onto a point in two-dimensional (2D) space. However, this projection poses challenges owing to the high-dimensional nature of k-mers and their unique mathematical properties. Here, we establish a mathematical system to address the peculiarities of the k-mer manifold. Leveraging this k-mer manifold theory, we develop a statistical method named KMAP for detecting k-mer patterns and visualizing them in 2D space. We apply KMAP to three distinct data sets to showcase its utility. KMAP achieves performance comparable to that of the classical method MEME, with ∼90% similarity in motif discovery from HT-SELEX data. In the analysis of H3K27ac ChIP-seq data from Ewing sarcoma (EWS), we find that BACH1, OTX2, and KCNH2 might affect EWS prognosis by binding to promoter and enhancer regions across the genome. We also observe potential colocalization of BACH1, OTX2, and the motif CCCAGGCTGGAGTGC in ∼70 bp windows in the enhancer regions. Furthermore, we find that FLI1 binds to the enhancer regions after ETV6 degradation, indicating competitive binding between ETV6 and FLI1. Moreover, KMAP identifies four prevalent patterns in gene editing data of the AAVS1 locus, aligning with findings reported in the literature. These applications underscore that KMAP can be a valuable tool across various biological contexts.
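
For readers unfamiliar with the objects discussed above, the following minimal Python sketch illustrates what a k-mer is and one common way to compare two k-mers, the Hamming distance. It is not the KMAP method, whose details are not given in this abstract. The function names `count_kmers` and `hamming`, the toy reads, and the choice of Hamming distance as the comparison are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def count_kmers(sequences, k):
    """Count every length-k substring (k-mer) across the input sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # A sequence of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def hamming(a, b):
    """Number of mismatching positions between two equal-length k-mers."""
    if len(a) != len(b):
        raise ValueError("k-mers must have the same length")
    return sum(x != y for x, y in zip(a, b))


# Toy input, invented for illustration; not data from the article.
reads = ["ACGTACGT", "ACGTTCGT"]
counts = count_kmers(reads, k=4)

# Compare the most frequent k-mers pairwise.
top = [kmer for kmer, _ in counts.most_common(3)]
for a, b in combinations(top, 2):
    print(a, b, hamming(a, b))
```

Because there are 4^k possible DNA k-mers (256 for k = 4, more than a million for k = 10), pairwise comparisons of this kind quickly describe a large discrete space, which is one reason a 2D projection is nontrivial.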
Footnotes
- [Supplemental material is available for this article.]
- Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.279458.124.
- Freely available online through the Genome Research Open Access option.
- Received April 12, 2024.
- Accepted February 20, 2025.
This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.











