Multisample motif discovery and visualization for tandem repeats

  1. Henne Holstege1,2,3,5
  1. 1Section Genomics of Neurodegenerative Diseases and Aging, Department of Clinical Genetics, Vrije Universiteit Amsterdam, Amsterdam UMC, 1081HV Amsterdam, The Netherlands;
  2. 2Delft Bioinformatics Lab, Delft University of Technology, 2628CD Delft, The Netherlands;
  3. 3Amsterdam Neuroscience, Neurodegeneration, 1081HV Amsterdam, The Netherlands;
  4. 4Department of Human Genetics, Radboud University Medical Center, 6525GA Nijmegen, The Netherlands;
  5. 5Alzheimer Center Amsterdam, Neurology, Vrije Universiteit Amsterdam, Amsterdam UMC location VUmc, 1081HV Amsterdam, The Netherlands
  • Corresponding author: h.holstege{at}amsterdamumc.nl
  • Abstract

    Tandem repeats (TRs) occupy a significant portion of the human genome and are a source of polymorphisms due to variations in sizes and motif compositions. Some of these variations have been associated with various neuropathological disorders, highlighting the clinical importance of assessing the motif structure of TRs. Moreover, assessing the TR motif variation can offer valuable insights into evolutionary dynamics and population structure. Previously, characterizations of TRs were limited by short-read sequencing technology, which lacks the ability to accurately capture the full TR sequences. As long-read sequencing becomes more accessible and can capture the full complexity of TRs, there is now also a need for tools to characterize and analyze TRs using long-read data across multiple samples. In this study, we present MotifScope, a novel algorithm for the characterization and visualization of TRs based on a de novo k-mer approach for motif discovery. Comparative analysis against established tools reveals that MotifScope can identify a greater number of motifs and more accurately represent the underlying repeat sequences. Moreover, MotifScope has been specifically designed to enable motif composition comparisons across assemblies of different individuals, as well as across long-read sequencing reads within an individual, through combined motif discovery and sequence alignment. We showcase potential applications of MotifScope in diverse fields, including population genetics, clinical settings, and forensic analyses.

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.279278.124.

    • Freely available online through the Genome Research Open Access option.

    • Received March 8, 2024.
    • Accepted October 31, 2024.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

    OPEN ACCESS ARTICLE

    Preprint Server