RAmbler resolves complex repeats in human Chromosomes 8, 19, and X

  Stefano Lonardi (1)
  1. Department of Computer Science and Engineering, University of California, Riverside, California 92521, USA
  2. Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19103, USA
  • Corresponding author: stelo{at}cs.ucr.edu
  • Abstract

    Repetitive regions in eukaryotic genomes often contain important functional or regulatory elements. Despite significant algorithmic and technological advances in genome sequencing and assembly over the past three decades, modern de novo assemblers still struggle to accurately reconstruct highly repetitive regions. In this work, we introduce RAmbler (Repeat Assembler), a reference-guided assembler specialized for the assembly of complex repetitive regions exclusively from Pacific Biosciences (PacBio) HiFi reads. RAmbler (1) identifies repetitive regions by detecting regions of unusually high coverage after mapping HiFi reads to the draft genome assembly, (2) finds single-copy k-mers in the HiFi reads (i.e., k-mers that are expected to occur only once in the genome), (3) uses the relative location of single-copy k-mers to barcode each HiFi read, (4) clusters HiFi reads based on their shared barcodes, (5) generates contigs by assembling the reads in each cluster, and (6) generates a consensus assembly from the overlap graph of the assembled contigs. Here, we show that RAmbler can reconstruct human centromeres and other complex repeats to a quality comparable to the manually curated Telomere-to-Telomere human genome assembly. Across more than 250 synthetic data sets, RAmbler outperforms hifiasm, LJA, HiCanu, and Verkko over a range of repeat lengths, numbers of repeats, heterozygosity rates, and sequencing depths.
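    Steps (2) through (4) can be illustrated with a toy sketch. The code below is not RAmbler's implementation: the function names, the coverage tolerance `tol`, the threshold `min_shared`, and the two 12-base repeat copies are all hypothetical, chosen only to show the idea. It treats a k-mer as single-copy when its count across the reads is close to the sequencing depth, barcodes each read with the positions of its single-copy k-mers, and groups reads that share barcode k-mers. For simplicity, the clustering compares only which k-mers are shared and ignores their relative locations, and the reads are full-length, error-free, and all on the same strand.

```python
from collections import Counter
from itertools import combinations


def kmers(seq, k):
    """Return (position, k-mer) pairs for every k-mer in seq."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)]


def single_copy_kmers(reads, k, coverage, tol=0.5):
    """k-mers whose read count is near the depth expected for one genomic copy.

    `coverage` and `tol` are illustrative parameters, not RAmbler options.
    """
    counts = Counter(kmer for read in reads for _, kmer in kmers(read, k))
    lo, hi = coverage * (1 - tol), coverage * (1 + tol)
    return {kmer for kmer, c in counts.items() if lo <= c <= hi}


def barcode(read, unique, k):
    """Barcode a read as the ordered (position, k-mer) list of its single-copy k-mers."""
    return [(i, kmer) for i, kmer in kmers(read, k) if kmer in unique]


def cluster_reads(barcodes, min_shared=1):
    """Group reads sharing at least `min_shared` barcode k-mers (union-find)."""
    parent = list(range(len(barcodes)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    sets = [{kmer for _, kmer in b} for b in barcodes]
    for a, b in combinations(range(len(barcodes)), 2):
        if len(sets[a] & sets[b]) >= min_shared:
            parent[find(a)] = find(b)

    groups = {}
    for i in range(len(barcodes)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Two hypothetical repeat copies that differ at a single base (index 6).
copy_a = "ACGTTGCAAGCT"
copy_b = "ACGTTGGAAGCT"
reads = [copy_a, copy_a, copy_b, copy_b]  # each copy sequenced at depth 2

unique = single_copy_kmers(reads, k=4, coverage=2)
barcodes = [barcode(r, unique, k=4) for r in reads]
clusters = cluster_reads(barcodes)
print(clusters)  # [[0, 1], [2, 3]]
```

    In this toy input, the k-mers shared by both copies appear four times and are discarded, while the k-mers that span the differing base appear twice and act as the barcode, so the reads separate by repeat copy.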

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.279308.124.

    • Freely available online through the Genome Research Open Access option.

    • Received March 13, 2024.
    • Accepted February 6, 2025.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International license), as described at http://creativecommons.org/licenses/by-nc/4.0/.
