Haplotype diversity and sequence heterogeneity of human telomeres

  1. Christopher E. Mason1,2,3,9
  1. Department of Physiology and Biophysics, Weill Cornell Medicine, New York, New York 10065, USA;
  2. The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medicine, New York, New York 10021, USA;
  3. The Feil Family Brain and Mind Research Institute, New York, New York 10065, USA;
  4. Institute of Medical Genetics and Applied Genomics, University of Tübingen, 72076 Tübingen, Germany;
  5. NGS Competence Center Tübingen, University of Tübingen, 72076 Tübingen, Germany;
  6. Department of Environmental and Radiological Health Sciences, Colorado State University, Fort Collins, Colorado 80523, USA;
  7. Cell and Molecular Biology Program, Colorado State University, Fort Collins, Colorado 80523, USA;
  8. KBR, Houston, Texas 77002, USA;
  9. The WorldQuant Initiative for Quantitative Prediction, Weill Cornell Medicine, New York, New York 10065, USA
  10. These authors contributed equally to this work.

  • Corresponding authors: susan.bailey{at}colostate.edu, chm2042{at}med.cornell.edu
  • Abstract

    Telomeres are regions of repetitive nucleotide sequences that cap the ends of eukaryotic chromosomes and protect them against deterioration, and their lengths can be correlated with age and adverse health risk factors. Yet, given their length and repetitive nature, telomeric regions are not easily reconstructed from short-read sequencing, which makes telomere sequencing, mapping, and variant resolution challenging problems. Recently, long-read sequencing, with read lengths reaching hundreds of kilobase pairs, has made it possible to routinely read into telomeric regions and inspect their sequence structure. Here, we describe a framework for extracting telomeric reads from whole-genome single-molecule sequencing experiments, including de novo identification of telomere repeat motifs and repeat types, and we also describe their sequence variation. We find that long, complex telomeric stretches and repeats can be accurately captured with long-read sequencing, observe extensive sequence heterogeneity of human telomeres, discover and localize noncanonical telomere sequence motifs (both previously reported and novel), and validate them in short-read sequence data. These data reveal extensive intra- and inter-population diversity of repeats in telomeric haplotypes, show higher paternal inheritance of telomeric variants, and represent the first motif composition maps of multi-kilobase-pair human telomeric haplotypes across three distinct ancestries (Ashkenazi, Chinese, and Utah), which can aid in future studies of genetic variation, aging, and genome biology.
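
The two steps named in the abstract, selecting telomeric reads and identifying repeat motifs de novo, can be illustrated with a minimal Python sketch. This is not the authors' framework: the function names, the 600-base window, and the 0.5 density threshold are hypothetical choices made only for illustration. The sketch assumes reads are plain uppercase DNA strings, treats a read as telomeric when one end is dense in the canonical TTAGGG repeat (or its reverse complement CCCTAA), and ranks k-mers after collapsing rotations of the same repeat unit.

```python
from collections import Counter

CANONICAL = "TTAGGG"  # canonical human telomere repeat unit


def revcomp(seq):
    """Reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def canonical_rotation(kmer):
    """Collapse rotations of a repeat unit (TTAGGG, TAGGGT, ...) to one key."""
    return min(kmer[i:] + kmer[:i] for i in range(len(kmer)))


def motif_density(seq, motif):
    """Fraction of bases covered by non-overlapping occurrences of motif."""
    if not seq:
        return 0.0
    return seq.count(motif) * len(motif) / len(seq)


def is_telomeric(read, window=600, min_density=0.5):
    """Hypothetical filter: one read end is dense in the canonical repeat.

    window and min_density are illustrative assumptions, not published values.
    """
    tail_ok = motif_density(read[-window:], CANONICAL) >= min_density
    head_ok = motif_density(read[:window], revcomp(CANONICAL)) >= min_density
    return tail_ok or head_ok


def top_repeat_motifs(seqs, k=6, n=5):
    """Count all k-mers, grouped by rotation, and return the n most common."""
    counts = Counter()
    for seq in seqs:
        for i in range(len(seq) - k + 1):
            counts[canonical_rotation(seq[i:i + k])] += 1
    return counts.most_common(n)


if __name__ == "__main__":
    read = "ACGTACGTAC" * 20 + "TTAGGG" * 100  # synthetic example read
    print(is_telomeric(read))
    print(top_repeat_motifs([read[-600:]], n=3))
```

A real pipeline would need to tolerate sequencing errors and noncanonical repeat variants, which exact string matching does not; the sketch only shows the overall shape of the approach.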

    Footnotes

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.274639.120.

    • Freely available online through the Genome Research Open Access option.

    • Received November 25, 2020.
    • Accepted May 4, 2021.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.
