Estimating the size of long tandem repeat expansions from short reads with ScatTR

  Gamze Gürsoy (1,2,3)
  1. Department of Biomedical Informatics, Columbia University, New York, New York 10032, USA
  2. New York Genome Center, New York, New York 10013, USA
  3. Department of Computer Science, Columbia University, New York, New York 10027, USA
  • Corresponding author: gamze.gursoy@columbia.edu
  Abstract

    Tandem repeats (TRs) are sequences of DNA in which ≥2 bp are repeated back-to-back at specific locations in the genome. TR expansions, in which the number of repeat units exceeds the normal range, have been implicated in more than 50 conditions. However, accurately measuring the copy number of TRs is challenging, especially when their expansions are larger than the fragment sizes used in standard short-read genome sequencing. Here, we introduce ScatTR, a novel computational method that leverages a maximum likelihood framework to estimate the copy number of large TR expansions from short-read sequencing data. ScatTR calculates the likelihood of different alignments between sequencing reads and reference sequences that represent various TR lengths and employs a Monte Carlo technique to find the best match. In simulated data, ScatTR outperforms state-of-the-art methods, particularly for TRs with longer motifs and those with lengths that greatly exceed typical sequencing fragment sizes. When applied to data from the 1000 Genomes Project, ScatTR detects potential large TR expansions that other methods missed, highlighting its ability to better characterize genome-wide TR variation.
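
To make the maximum likelihood idea concrete, the following is a deliberately simplified toy sketch and not ScatTR's actual model. It assumes uniform coverage, treats the number of reads originating in the repeat tract as Poisson distributed, and replaces the Monte Carlo search over alignments with an exhaustive grid over candidate copy numbers. All function names and parameter values are hypothetical.

```python
import math


def expected_tract_reads(copy_number, motif_len, read_len, coverage):
    """Expected number of reads starting inside the repeat tract.

    Assumption: reads start uniformly along the genome, so a tract of
    length T bp yields about coverage * T / read_len reads.
    """
    tract_len = copy_number * motif_len
    return coverage * tract_len / read_len


def poisson_log_likelihood(observed, expected):
    """Log of the Poisson probability of `observed` given mean `expected`."""
    return observed * math.log(expected) - expected - math.lgamma(observed + 1)


def estimate_copy_number(observed_reads, motif_len, read_len, coverage,
                         max_copies=5000):
    """Return the candidate copy number with the highest log-likelihood.

    A grid search stands in for the Monte Carlo optimization described
    in the abstract; this toy ignores alignment uncertainty entirely.
    """
    return max(
        range(1, max_copies + 1),
        key=lambda c: poisson_log_likelihood(
            observed_reads,
            expected_tract_reads(c, motif_len, read_len, coverage),
        ),
    )
```

Under these assumptions, with a 3 bp motif, 150 bp reads, and 30x coverage, observing 600 tract-derived reads gives `estimate_copy_number(600, 3, 150, 30)` equal to 1000 copies, a 3000 bp tract that is far longer than a single read. The sketch only illustrates why a likelihood over candidate lengths can recover sizes that no single read spans.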

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.280563.125.

    • Freely available online through the Genome Research Open Access option.

    • Received February 15, 2025.
    • Accepted August 15, 2025.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    Genome Res. © 2025 Al-Abri and Gürsoy; Published by Cold Spring Harbor Laboratory Press
