Gaps and complex structurally variant loci in phased genome assemblies

David Porubsky; Mitchell R. Vollger; William T. Harvey; Allison N. Rozanski; Peter Ebert; Glenn Hickey; Patrick Hasenfeld; Ashley D. Sanders; Catherine Stober; Human Pangenome Reference Consortium; Jan O. Korbel; Benedict Paten; Tobias Marschall; Evan E. Eichler

doi:10.1101/gr.277334.122

Research

- ¹Department of Genome Sciences, University of Washington School of Medicine, Seattle, Washington 98195, USA;
- ²Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University, 40225 Düsseldorf, Germany;
- ³Center for Digital Medicine, Heinrich Heine University, 40225 Düsseldorf, Germany;
- ⁴UC Santa Cruz Genomics Institute, University of California Santa Cruz, Santa Cruz, California 95064, USA;
- ⁵European Molecular Biology Laboratory (EMBL), Genome Biology Unit, 69117 Heidelberg, Germany;
- ⁶Berlin Institute for Medical Systems Biology, Max Delbrück Center for Molecular Medicine in the Helmholtz Association, 10115 Berlin, Germany;
- ⁷Berlin Institute of Health (BIH), 10178 Berlin, Germany;
- ⁸Charité-Universitätsmedizin, 10117 Berlin, Germany;
- ⁹European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, CB10 1SD, United Kingdom;
- ¹⁰Howard Hughes Medical Institute, University of Washington, Seattle, Washington 98195, USA
- ¹¹A complete list of contributing Consortium members appears at the end of this paper.

Published May 10, 2023. https://doi.org/10.1101/gr.277334.122

Abstract

There has been tremendous progress in phased genome assembly production by combining long-read data with parental information or linked-read data. Nevertheless, a typical phased genome assembly generated by trio-hifiasm still contains more than 140 gaps. We perform a detailed analysis of gaps, assembly breaks, and misorientations from 182 haploid assemblies obtained from a diversity panel of 77 unique human samples. Although trio-based approaches using HiFi are the current gold standard, chromosome-wide phasing accuracy is comparable when using Strand-seq instead of parental data. Importantly, the majority of assembly gaps cluster near the largest and most identical repeats (including segmental duplications [35.4%], satellite DNA [22.3%], or regions enriched in GA/AT-rich DNA [27.4%]). Consequently, 1513 protein-coding genes overlap assembly gaps in at least one haplotype, and 231 are recurrently disrupted or missing from five or more haplotypes. Furthermore, we estimate that 6–7 Mbp of DNA are misorientated per haplotype irrespective of whether trio-free or trio-based approaches are used. Of these misorientations, 81% correspond to bona fide large inversion polymorphisms in the human species, most of which are flanked by large segmental duplications. We also identify large-scale alignment discontinuities consistent with 11.9 Mbp of deletions and 161.4 Mbp of insertions per haploid genome. Although 99% of this variation corresponds to satellite DNA, we identify 230 regions of euchromatic DNA with frequent expansions and contractions, nearly half of which overlap with 197 protein-coding genes. Such variable and incompletely assembled regions are important targets for future algorithmic development and pangenome representation.
