Fast sequence alignment for centromeres with RaMA

  Yansu Wang 1,2
  1 Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China
  2 Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China
  3 School of Computer Science and Technology, Xidian University, Xi'an 710126, Shaanxi, China
  • Corresponding author: wangyansu{at}uestc.edu.cn
  Abstract

    The release of the first draft of the human pangenome has revolutionized genomic research by enabling access to complex regions such as centromeres, which are composed of extra-long tandem repeats (ETRs). However, current methods cannot produce sequence alignments that effectively capture genetic events within ETRs, highlighting a pressing need for improved alignment tools. Inspired by UniAligner, we developed a rare match aligner (RaMA), which uses rare matches as anchors and a two-piece affine gap cost to generate complete pairwise alignments that better capture genetic evolution. RaMA also employs parallel computing and the wavefront algorithm to accelerate anchor discovery and sequence alignment, achieving up to 13.66 times faster processing while using only 11% of UniAligner's memory. Downstream analysis of simulated data and of the CHM13 and CHM1 higher-order repeat (HOR) arrays demonstrates that RaMA achieves more accurate alignments that effectively capture true HOR structures. RaMA also introduces two methods for defining reliable alignment regions, further improving the accuracy of centromeric alignment statistics.
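
The two-piece affine gap cost mentioned in the abstract can be illustrated with a minimal sketch. In the common formulation, a gap of length k is charged the smaller of two affine penalties, so short gaps are scored by one piece and long gaps by a cheaper-per-base second piece. The function name and the default parameter values below are assumptions chosen for illustration only; they are not taken from RaMA.

```python
def two_piece_affine_gap_cost(length, open1=4, ext1=2, open2=24, ext2=1):
    """Cost of a single gap of `length` bases under a two-piece affine model.

    The four penalty values are hypothetical defaults for illustration,
    not RaMA's actual parameters.
    """
    if length <= 0:
        return 0  # no gap, no penalty
    # Piece 1: low opening cost, higher per-base extension (favors short gaps).
    short_piece = open1 + ext1 * length
    # Piece 2: high opening cost, lower per-base extension (favors long gaps).
    long_piece = open2 + ext2 * length
    return min(short_piece, long_piece)


if __name__ == "__main__":
    for k in (1, 5, 20, 100):
        print(k, two_piece_affine_gap_cost(k))
```

With these illustrative values the two pieces cross at a gap length of 20, so longer insertions or deletions, such as the gain or loss of whole repeat units, are penalized less steeply than they would be under a single affine cost.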

    Footnotes

    • Received July 8, 2024.
    • Accepted February 6, 2025.

    This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see https://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
