Accurate detection of tandem repeats from error-prone sequences with EquiRep
Abstract
A tandem repeat is a sequence of nucleotides that appear as multiple contiguous, near-identical copies arranged consecutively. Tandem repeats are widespread across natural genomes, play critical roles in genetic diversity, gene regulation, and are associated with various neurological and developmental disorders. They can also arise in sequencing reads generated by certain technologies, such as those used for sequencing circular molecules. A key challenge in analyzing tandem repeats is reconstructing the sequence of the underlying repeat unit. While several methods exist, they often exhibit low accuracy when the repeat unit length increases or the number of copies is low. Furthermore, methods capable of handling highly mutated sequences remain scarce, highlighting a significant opportunity for improvement. We introduce EquiRep, a tool for accurate detection of tandem repeats from erroneous sequences. EquiRep estimates the likelihood of positions originating from the same location in the unit through self-alignment, followed by a novel refinement approach. The resulting equivalence classes and consecutive position information are then used to build a weighted graph. A cycle in this graph with maximum bottleneck weight covering most nucleotide positions is identified to reconstruct the repeat unit. We test EquiRep on two applications, identifying repeat units from satellite DNAs and reconstructing circular RNAs from rolling-circular long-read sequencing data, using both simulated and raw sequencing datasets. Our results show that EquiRep consistently outperforms or matches state-of-the-art methods, demonstrating robustness to sequencing errors and superior performance on long repeat units and low-frequency repeats. These capabilities underscore EquiRep’s broad utility in tandem repeat analysis.
- Received April 4, 2025.
- Accepted July 30, 2025.
- Published by Cold Spring Harbor Laboratory Press
This manuscript is Open Access.
This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International license), as described at http://creativecommons.org/licenses/by-nc/4.0/.











