Telomere-to-telomere assembly by preserving contained reads

Table 2.

Evaluation of the RAFT-hifiasm workflow for computing haplotype-resolved assembly

| Data set | Method | Size (Gbp) | NG50 (Mbp) | Switch error (%) | T2T contigs | Multicopy genes retained (%) | Gene completeness: complete (%) | Gene completeness: duplicated (%) |
|---|---|---|---|---|---|---|---|---|
| D1: HiFi (36×) | hifiasm | 3.04/3.04 | 59.0/45.4 | 1.09/0.96 | 0 | 83.37/76.42 | 97.59/97.70 | 0.45/0.31 |
| | NaiveCut-hifiasm | 3.06/2.96 | 51.1/48.7 | 1.12/0.95 | 0 | 80.50/80.42 | 97.66/97.57 | 0.45/0.39 |
| | RAFT-hifiasm | 3.04/2.98 | 44.9/62.4 | 1.06/1.01 | 1 | 79.22/82.09 | 97.87/97.45 | 0.34/0.39 |
| D2: ONT Duplex (32×) | hifiasm | 2.99/3.04 | 42.2/51.0 | 2.37/1.66 | 2 | 80.98/81.45 | 96.94/96.63 | 1.06/1.37 |
| | NaiveCut-hifiasm | 3.01/3.00 | 61.2/56.3 | 2.04/2.08 | 2 | 83.61/77.22 | 97.61/97.64 | 0.54/0.42 |
| | RAFT-hifiasm | 2.96/3.04 | 80.3/52.6 | 2.02/1.90 | 6 | 83.53/80.98 | 97.72/97.59 | 0.50/0.50 |
| D3: HiFi (36×) + ONT Duplex (32×) | hifiasm | 3.13/2.99 | 49.3/53.3 | 0.82/1.00 | 1 | 83.85/77.78 | 95.69/95.79 | 2.42/2.10 |
| | NaiveCut-hifiasm | 3.04/3.01 | 81.9/82.1 | 1.02/1.08 | 2 | 83.61/78.74 | 97.93/97.76 | 0.40/0.42 |
| | RAFT-hifiasm | 3.03/3.02 | 89.6/89.3 | 0.94/1.10 | 7 | 83.13/80.50 | 97.79/97.98 | 0.42/0.41 |
| D4: ONT high-acc UL (40×) | hifiasm | 3.44/3.43 | 16.2/20.8 | 2.49/1.69 | 0 | 74.02/78.18 | 67.72/70.41 | 19.88/19.82 |
| | NaiveCut-hifiasm | 3.02/3.11 | 45.2/51.7 | 1.96/2.05 | 0 | 81.14/79.54 | 97.07/96.91 | 0.55/0.60 |
| | RAFT-hifiasm | 3.05/3.07 | 81.3/49.1 | 2.19/1.87 | 1 | 81.45/77.45 | 97.21/96.94 | 0.71/0.51 |
  • Assembly quality statistics were measured separately for each haplotype and are reported as haplotype 1/haplotype 2. In the NaiveCut-hifiasm method, we fragment all reads to the same length as RAFT, regardless of whether the read contains a repetitive region. NG50 is the length of the shortest contig in the smallest set of longest contigs that together cover 50% of the genome length; we assumed a genome length of 3.1 Gbp. Switch error is the percentage of adjacent SNP sites that are incorrectly phased. A contig is defined as T2T if it contains the telomeric repeat unit “TTAGGG” within 1 kbp of both ends and aligns with a reference chromosome with more than 95% identity. “Multicopy genes retained” is the percentage of multicopy genes in GRCh38 (i.e., genes with multiple mapping positions at ≥99% sequence identity) that occur multiple times in the assembly. In the gene completeness statistics, complete genes are those that occur only once in the assembly and only once in GRCh38 (at 99% sequence identity), and duplicated genes are those that occur multiple times in the assembly but only once in GRCh38. The tools and commands used to measure the assembly statistics are available in Supplemental Note S3.
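
The NG50 and T2T definitions in the footnote can be illustrated with a minimal Python sketch. This is not the authors' evaluation pipeline (their tools and commands are in Supplemental Note S3). The function names `ng50`, `reverse_complement`, and `has_telomeres_at_both_ends` are hypothetical. Accepting the reverse-complement unit "CCCTAA" is an assumption made here, because a contig can be read from either strand. The alignment-identity criterion (more than 95% identity to a reference chromosome) is not checked.

```python
from typing import Sequence

# Translation table for complementing DNA bases (upper case only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def ng50(contig_lengths: Sequence[int], genome_length: int) -> int:
    """Length of the shortest contig in the smallest set of longest contigs
    that together cover at least half of the assumed genome length.
    Returns 0 if the contigs never reach that threshold."""
    half = genome_length / 2
    covered = 0
    for length in sorted(contig_lengths, reverse=True):
        covered += length
        if covered >= half:
            return length
    return 0


def reverse_complement(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def has_telomeres_at_both_ends(seq: str, unit: str = "TTAGGG",
                               window: int = 1000) -> bool:
    """True if the repeat unit occurs within `window` bases of both ends.

    Assumption: the reverse-complement unit is accepted as well.
    A single occurrence of the unit is enough here, which is a very
    permissive reading of the footnote's criterion.
    """
    units = (unit, reverse_complement(unit))
    head = seq[:window].upper()
    tail = seq[-window:].upper()
    return any(u in head for u in units) and any(u in tail for u in units)
```

For the table above, `genome_length` would be 3_100_000_000 (the assumed 3.1 Gbp), and each haplotype's contig lengths would be passed to `ng50` separately.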

This Article

  1. Genome Res. 34: 1908-1918
