Gaps and complex structurally variant loci in phased genome assemblies

  Evan E. Eichler1,10
  1. Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA;
  2. Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany;
  3. Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany;
  4. UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, California 95064, USA;
  5. European Molecular Biology Laboratory (EMBL), Genome Biology Unit, 69117 Heidelberg, Germany;
  6. Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 10115 Berlin, Germany;
  7. Berlin Institute of Health (BIH), 10178 Berlin, Germany;
  8. Charité-Universitätsmedizin, 10117 Berlin, Germany;
  9. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom;
  10. Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA;
  12. Division of Oncology, Department of Internal Medicine, Washington University School of Medicine, St. Louis, MO 63110, USA;
  13. McDonnell Genome Institute, Washington University School of Medicine, St. Louis, MO 63108, USA;
  14. UC Santa Cruz Genomics Institute, University of California, Santa Cruz, Santa Cruz, CA 95064, USA;
  15. Google LLC, Mountain View, CA 94043, USA;
  16. Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA 98195, USA;
  17. European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK;
  18. Department of Human Genetics, McGill University, Montreal, Québec H3A 0C7, Canada;
  19. Canadian Center for Computational Genomics, McGill University, Montreal, Québec H3A 0G1, Canada;
  20. Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto 606-8501, Japan;
  21. Institute of Genetics and Biophysics, National Research Council, Naples 80111, Italy;
  22. Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA;
  23. Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA 02215, USA;
  24. Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215, USA;
  25. Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA;
  26. Arizona State University, Barrett and O'Connor Washington Center, Washington, DC 20006, USA;
  27. Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA 95064, USA;
  28. Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany;
  29. Center for Digital Medicine, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany;
  30. Core Unit Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, 40225 Düsseldorf, Germany;
  31. Howard Hughes Medical Institute, Chevy Chase, MD 20815, USA;
  32. Vertebrate Genome Laboratory, The Rockefeller University, New York, NY 10065, USA;
  33. National Institutes of Health (NIH)–National Human Genome Research Institute, Bethesda, MD 20892, USA;
  34. Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA;
  35. Center for Computational and Genomic Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA;
  36. Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Copenhagen DK-2200, Denmark;
  37. Institute for Society and Genetics, College of Letters and Science, University of California, Los Angeles, Los Angeles, CA 90095, USA;
  38. Institute for Precision Health, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA;
  39. Division of General Internal Medicine and Health Services Research, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA 90095, USA;
  40. Department of Biomolecular Engineering, University of California, Santa Cruz, Santa Cruz, CA 95064, USA;
  41. Dovetail Genomics, Scotts Valley, CA 95066, USA;
  42. Quantitative Life Sciences, McGill University, Montreal, Québec H3A 0C7, Canada;
  43. Genomics Research Centre, Human Technopole, Milan 20157, Italy;
  44. Department of Genetics, Yale University School of Medicine, New Haven, CT 06510, USA;
  45. Center for Genomic Health, Yale University School of Medicine, New Haven, CT 06510, USA;
  46. Quantitative Biology Center (QBiC), University of Tübingen, 72076 Tübingen, Germany;
  47. Biomedical Data Science, Department of Computer Science, University of Tübingen, 72076 Tübingen, Germany;
  48. Tree of Life, Wellcome Sanger Institute, Hinxton, Cambridge CB10 1SA, UK;
  49. Northeastern University, Boston, MA 02115, USA;
  50. Laboratory of Neurogenetics of Language, The Rockefeller University, New York, NY 10065, USA;
  51. Division of Oncology, Department of Medicine, Stanford University School of Medicine, Stanford, CA 94305, USA;
  52. Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY 10029, USA;
  53. Program in Bioethics and Institute for Human Genetics, University of California, San Francisco, San Francisco, CA 94143, USA;
  54. European Molecular Biology Laboratory, Genome Biology Unit, 69117 Heidelberg, Germany;
  55. Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD 20892, USA;
  56. Division of Biology and Biomedical Sciences, Washington University School of Medicine, St. Louis, MO 63110, USA;
  57. Computer Sciences Department, Barcelona Supercomputing Center, 08034 Barcelona, Spain;
  58. Departament d'Arquitectura de Computadors i Sistemes Operatius, Universitat Autònoma de Barcelona, 08193 Barcelona, Spain;
  59. Material Measurement Laboratory, National Institute of Standards and Technology, Gaithersburg, MD 20877, USA;
  60. Coriell Institute for Medical Research, Camden, NJ 08103, USA;
  61. Department of Computer Science, University of Pisa, Pisa 56127, Italy;
  62. Department of Public Health Sciences, University of California, Davis, Davis, CA 95616, USA;
  63. Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA;
  64. Department of Ecology and Evolutionary Biology, University of California, Santa Cruz, Santa Cruz, CA 95064, USA;
  65. Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 10115 Berlin, Germany;
  66. National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA;
  67. Center for Health Data Science, University of Copenhagen, 2200 Copenhagen, Denmark;
  68. Al Jalila Genomics Center of Excellence, Al Jalila Children's Specialty Hospital, Dubai, UAE;
  69. Center for Genomic Discovery, Mohammed Bin Rashid University of Medicine and Health Sciences, Dubai, UAE;
  70. Division of Medical Genetics, University of Washington School of Medicine, Seattle, WA 98195, USA;
  71. Center for Computational Biology, Johns Hopkins University, Baltimore, MD 21218, USA;
  Corresponding author: eee@gs.washington.edu

  Abstract

    There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still contains more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6–7 Mbp of DNA is misorientated per haplotype, irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.

    Footnotes

    • 11 A complete list of contributing Consortium members appears at the end of this paper.

    • [Supplemental material is available for this article.]

    • Article published online before print. Article, supplemental material, and publication date are at https://www.genome.org/cgi/doi/10.1101/gr.277334.122.

    • Freely available online through the Genome Research Open Access option.

    • Received September 19, 2022.
    • Accepted December 7, 2022.

    This article, published in Genome Research, is available under a Creative Commons License (Attribution 4.0 International), as described at http://creativecommons.org/licenses/by/4.0/.

    Genome Res. 33: 496–510. © 2023 Porubsky et al.; Published by Cold Spring Harbor Laboratory Press