Gapless assembly of complete human and plant chromosomes using only nanopore sequencing

Table 1.

Assemblies of the reference human genome HG002

| Assembly | Contig NG50 (Mb) | Scaffold NG50 (Mb) | Contig NGA50 (Mb) | Hamming error (%) | QV | Dup gene | Missing gene | T2T ctgs | T2T scfs |
|---|---|---|---|---|---|---|---|---|---|
| Downsampled (50× Duplex + 30× ONT UL) | | | | | | | | | |
| Verkko + Illumina trio | 103.00 | 135.21 | 57.87 | 0.75 | 55.77 | 200 | 292 | 16 | 27/46 |
| Verkko + Pore-C | 86.69 | 136.00 | 51.99 | 0.75 | 55.72 | 232 | 361 | 13 | 26/46 |
| Full-coverage (70× Duplex) | | | | | | | | | |
| Verkko + Illumina trio | 59.40 | 133.48 | 39.41 | 0.70 | 57.00 | 296 | 309 | 1 | 23/46 |
| Verkko + Pore-C | 43.16 | 113.59 | 31.06 | 0.77 | 56.49 | 290 | 310 | 4 | 17/46 |
| HiFi (43× + 30× ONT UL) (Cheng et al. 2024) | | | | | | | | | |
| Verkko + Illumina trio | 101.76 | 121.21 | 69.19 | 0.17 | 59.33 | 206 | 314 | 8 | 16/46 |
| hifiasm + Illumina trio | 101.21 | N/A | 60.49 | 0.20 | 60.37 | 182 | 287 | 7 | N/A/46 |
  • Contig NG50: the length of the shortest contig such that half of the genome is in contigs of this length or greater. No gaps are allowed, and sequences are split wherever a gap of at least three Ns is present. The genome size is defined as 6.08 Gbp based on the reference HG002 assembly (https://github.com/marbl/HG002/blob/main/README.md).
  • Scaffold NG50: same as contig NG50, but without splitting at gaps. The hifiasm assemblies from Cheng et al. (2024) do not include scaffolds, so N/A is used in the scaffold NG50 column.
  • Contig NGA50: the length of the shortest alignment such that half of the genome is covered by alignments of this length or greater. Calculated using Q100 (https://github.com/nhansen/q100bench) versus HG002 v1.0.1.
  • Hamming error: the haplotype error rate computed using yak (Liao et al. 2023) and parent short-read sequence databases, measuring the consistency of each scaffold with a single haplotype; lower is better.
  • QV: the Phred (Ewing and Green 1998) log-scaled quality score calculated using Merqury (Rhie et al. 2020); higher is better.
  • Dup/Missing gene: duplicated or missing genes computed using compleasm (Huang and Li 2023) with the OrthoDB v10 (Waterhouse et al. 2018; Zdobnov et al. 2021) primate database; lower is better. Each haplotype was measured independently, and the missing and duplicated gene counts reported are the sum over both haplotypes. Since single-copy genes from Chromosome X are expected to be missing on the paternal haplotype and some genes may be true duplications, we also measured gene completeness on the HG002 v1.1 assembly (https://github.com/marbl/HG002/blob/main/README.md) (Supplemental Table 2) as a baseline. This assembly has 178 duplicated and 288 missing genes and a Hamming error rate of 0.10%.
  • T2T ctgs: the count of telomere-to-telomere (T2T) contigs for each assembly. A contig is defined as T2T if it has the canonical (TTAGGG) telomere sequence within 10 kbp of both the start and the end and has no gaps; higher is better.
  • T2T scfs: same as T2T ctgs, but gaps are allowed; higher is better.

Bold values denote the best result for each metric and sequencing combination.
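
The contig and scaffold NG50 definitions in the table notes can be illustrated with a minimal Python sketch. This is not the pipeline used to produce Table 1; the function names (`split_at_gaps`, `ng50`, `phred_qv`), the toy sequences, and the toy genome size are assumptions made for illustration only. The gap rule (split at runs of three or more Ns) and the use of a fixed genome size follow the notes. The last helper shows only the generic Phred scaling between an error rate and a QV, not how Merqury estimates the error rate.

```python
import math
import re

# A gap is a run of at least three Ns, as stated in the table notes.
GAP = re.compile(r"N{3,}", re.IGNORECASE)


def split_at_gaps(scaffold):
    """Split a scaffold into contigs at every run of three or more Ns."""
    return [piece for piece in GAP.split(scaffold) if piece]


def ng50(lengths, genome_size):
    """Return the shortest length L such that sequences of length >= L
    together cover at least half of genome_size.

    Returns None if the sequences cover less than half of the genome.
    """
    covered = 0
    for length in sorted(lengths, reverse=True):
        covered += length
        if 2 * covered >= genome_size:
            return length
    return None


def phred_qv(error_rate):
    """Convert a per-base error rate to a Phred-scaled quality value."""
    return -10 * math.log10(error_rate)


# Toy example (hypothetical data): three scaffolds, one containing a gap.
scaffolds = ["ACGT" * 5 + "NNN" + "ACGT" * 2, "ACGTACGTAC", "ACGTAC"]
toy_genome_size = 50  # stand-in for the 6.08 Gbp used for HG002

contig_lengths = [len(c) for s in scaffolds for c in split_at_gaps(s)]
scaffold_lengths = [len(s) for s in scaffolds]

contig_ng50 = ng50(contig_lengths, toy_genome_size)
scaffold_ng50 = ng50(scaffold_lengths, toy_genome_size)
print("contig NG50:", contig_ng50)
print("scaffold NG50:", scaffold_ng50)
```

Because NG50 is normalized by a fixed genome size rather than by the assembly's own total length, the values are comparable across assemblies of different total size, which is why the notes fix the genome size from the reference HG002 assembly.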

Genome Res. 34: 1919-1930