HiCanu: accurate assembly of segmental duplications, satellites, and allelic variants from high-fidelity long reads

Complete and accurate genome assemblies form the basis of most downstream genomic analyses and are of critical importance. Recent genome assembly projects have relied on a combination of noisy long-read sequencing and accurate short-read sequencing, with the former offering greater assembly continuity and the latter providing higher consensus accuracy. The recently introduced Pacific Biosciences (PacBio) HiFi sequencing technology bridges this divide by delivering long reads (>10 kbp) with high per-base accuracy (>99.9%). Here we present HiCanu, a modification of the Canu assembler designed to leverage the full potential of HiFi reads via homopolymer compression, overlap-based error correction, and aggressive false overlap filtering. We benchmark HiCanu with a focus on the recovery of haplotype diversity, major histocompatibility complex (MHC) variants, satellite DNAs, and segmental duplications. For diploid human genomes sequenced to 30× HiFi coverage, HiCanu achieved superior accuracy and allele recovery compared to the current state of the art. On the effectively haploid CHM13 human cell line, HiCanu achieved an NG50 contig size of 77 Mbp with a per-base consensus accuracy of 99.999% (QV50), surpassing recent assemblies of high-coverage, ultralong Oxford Nanopore Technologies (ONT) reads in terms of both accuracy and continuity. This HiCanu assembly correctly resolves 337 out of 341 validation BACs sampled from known segmental duplications and provides the first preliminary assemblies of nine complete human centromeric regions. Although gaps and errors still remain within the most challenging regions of the genome, these results represent a significant advance toward the complete assembly of human genomes.

Supplementary Note 2: Using k-mers for assembly evaluation and identifying haplotype-blocks within diploid assemblies We used Merqury (Rhie et al 2020) to estimate assembly k-mer level QV, completeness, and phase block statistics. We computed an optimal k-mer size of 18 for the D. melanogaster and 21 for human, as in Fofanov et al., with a collision rate of 0.001 (automated in Merqury's best_k.sh). Since QV varies depending on the k-mer size used we also calculated QV for all genomes using a k-mer size of 31. The 31-mer QV was lower than 18-and 21-mer, by as many as 4.5QV points. However, the relative ordering of the assemblies remained stable.

QV and k-mer completeness
The similarity of an assembly to a read set is defined as the number of k-mers shared by both the assembly and the reads divided by the total number of k-mers found only in the assembly. This can be converted to a similarity as in Ondov et al. 2019: i = (Kshared / Ktotal) 1/k which is akin to percent sequence identity and is converted to the Phred scale as: QV = -10 log10(1-i) Completeness is measured as the fraction of "reliable" k-mers in the read set present in an assembly. To determine reliable k-mers, we build the histogram of k-mer counts (multiplicity = number of times a k-mer is seen in the read set). The minimum reliable k-mer threshold is set to the lowest multiplicity with a positive slope in the histogram (automated in Merqury, with build/filt.sh)

Phase blocks
Haplotype-specific markers were determined using parental specific k-mers as previously done for trio binning (Koren et al. 2018). K-mer databases were built for each parental read and subtracted to obtain parent-specific markers. Reliable k-mers for each parent-specific database are selected as above. These databases were used for the D. melanogaster. For human datasets, the parent-specific k-mer databases were further intersected with the child's F1 Illumina data to only include parent-specific markers which were inherited by the child.
Phase blocks were obtained by querying each k-mer found in the assembly to each parental kmer database. A phase block is defined as having at least 2 haplotype-specific markers from the same haplotype, allowing short-range "switches" to the other haplotype. We defined a shortrange switch as at most 100 consecutive markers within a 20 kbp region. When more than 100 markers from the other haplotype are found or they span more than 20 kbp, a new block is created. The new block starts at the first inconsistent marker found. The switch error rate is defined as the fraction of markers from the wrong haplotype within all phase blocks.

Supplementary Note 3: CHM13 Challenge BAC Validation
We inspected all 15 of the CHM13 BACs which were not correctly resolved by the HiCanu assembly. We mapped all 20 kbp HiFi reads to the BACs and to the HiCanu assembly with minimap2: minimap2 -t 32 -ax map-pb -r 2000 -m 3000 <reference.fasta> *.fastq.gz We also checked the BACs for sequence similarity to the vector and E. coli sequences. We concluded that 11 of the sequences were incorrect or highly suspicious: • AC278245.  Clipped identically in GRCh38.p13 and HiFi and UL assemblies (likely error), Supplementary Figure 5 The adjusted CHM13 BAC resolution stats can be found in Supplementary The HiCanu assembled contigs were also mapped to the T2T assembly with the same commands. We identified regions with <= 2-fold HiFi coverage coinciding with a break in the contig mapping. The regions were visually inspected in IGV to confirm simple-sequence repeats.

Supplementary Note 5: Human repeat modeling
We previously developed a model of human assembly continuity to predict the effects of read length and accuracy. Increased read length allows for more repeats to be spanned by reads that are uniquely anchored on either side, while increased read accuracy allows for more repeats to be separated based on sequence variants within the repeat. Using PacBio CLR or Nanopore data, existing algorithms have been able to separate repeats that are up to 98% identical at the sequence level by detecting variants within the repeats. Assuming a 20 kbp read length, this model predicts that improving repeat separation from 98% to 99.9% identical repeats would boost assembly NG50 by 1.8-fold (N such that 50% of the genome size is assembled in contigs of this size or greater). Thus, because of their improved ability to separate repeats, highly accurate 20 kbp reads are predicted to rival noisy 200 kbp reads in terms of assembly continuity (Supplementary Figure 7).

Supplementary Note 7: Estimate of AWS costs
Assuming a 30-hr movie (https://www.pacb.com/products-and-services/sequel-system/latestsystem-release/), acquiring CHM13 data on four SMRTcells would require 120 hours. We obtained 3/6 NA12878 raw bam files for the subreads and converted them to HiFi using the CCS command v3.4.1: ccs --maxLength 21000 --minPasses 3 --numThreads 32 --polish --minPredictedAccuracy 0.99 input.bam output.bam The jobs took on average 7,136 CPU h and 13 GB memory. Thus, we estimated a total of 42,817 CPU h to convert all six cells to HiFi reads. This corresponds to 199 hours on a 36-core node per cell. Using a c5d.9xlarge (1.728/hr) instance to allow storing large BAM files locally, this would cost $2,055. Finally, we estimate the HiCanu runtime on AWS to be 29 hours versus 2 hours for Peregrine. The total runtime with HiCanu is 348 and with Peregrine is 321, a difference of 8%. For cost, we ignore reagent and consumable cost and focus only on computation. We estimated above the cost to generate HiFi reads on AWS as $2,055 and HiCanu as $400 vs $10 for Peregrine. The total cost with HiCanu is $2,455 and with Peregrine $2,065, a difference of 19%.

Supplementary Figure 7. Predicted human assembly continuity based on sequencing read length and accuracy.
Assembly continuity (NG50 contig size) is dependent on read length (x-axis) and read accuracy (colored curves). Model curves are plotted for three hypothetical assemblers able to separate repeats with 98%, 99.5%, and 99.99% average sequence identity. Points represent real assemblies from Peregrine HiFi (Chin and Khalak 2019), HiCanu HiFi (this work), and Canu Nanopore (Miga et al. 2020). The dashed line corresponds to the continuity of human reference genome GRCh38 (Schneider et al. 2017). Real assemblies typically perform better than predicted by the model, due to the unknown divergence level of many large human repeats which are considered as perfect repeats within the model. In reality, many of the repeat instances appear to have enough variants to be effectively resolved by the assembler.

Supplementary Figure 8. HiFi mapping coverage of HiCanu centromere assemblies
In addition to Chromosome 19, HiCanu produced draft assemblies for eight other centromeric satellite arrays of the CHM13 genome. RepeatMasker (RM) annotation reveals the location of α-satellite HOR arrays (marked with a black bar) in each contig. Alignment of HiFi data to each contig reveals even coverage, except for rare dips and spikes in coverage within contigs from Chromosomes 8, 10, 12, and 16, which may indicate mis-mapping, mis-assembly, and/or collapse in sequence. Further validation is required to determine the accuracy of these centromeric assemblies.

Supplementary Figure 9. Sequence identity between the HiCanu D19Z1 or D19Z3 αsatellite arrays and the D19Z1 or D19Z3 α-satellite sequences.
A) Plot showing the sequence identity of each α-satellite repeat within the HiCanu D19Z1 array and the previously identified D19Z1 10-mer (left; Hulsebos et al. 1988;Puechberty et al. 1999; pG-A16, AJ295045.1) or the newly identified D19Z1 13-mer (right). The cluster of low-sequence-identity α-satellite repeats on the left is a result of poor sequence alignment to the 10-mer. When the three newly-identified α-satellite repeats are included to form the 13mer, this cluster has increased sequence identity, and the overall mean sequence identity increases (94.87 +/-6.32% vs. 97.51 +/-1.91%). B) Plot showing the sequence identity of each α-satellite repeat within the HiCanu D19Z3 array and the previously identified D19Z3 dimer (left; Baldini et al. 1989; pC1.8, M26919 and M26920) or the newly identified D19Z3 dimer (right). Almost all α-satellite repeats within the HiCanu D19Z3 array have higher sequence identity to the newly identified dimer than to the previously published dimer (95.74 +/-1.05% vs. 93.02 +/-1.37%). Mean +/-SD is shown.

Supplementary Figure 10. Chromosome 19 D19Z1, D19Z2, and D19Z3 α-satellite HOR array structures.
Dot plots showing the high sequence identity and structure of the D19Z1 array (A), D19Z2 array (B), and edges of the D19Z3 array (C, D). The D19Z2 array has two complex HOR structures that can be observed in panel B (see Main Text for details). Because the HOR array structures in this region do not match the expected pG-A16 repeat structure, we designate it here as 'D19Z2?'. Dot plots were generated with a word length of 100.

Supplementary Figure 11. Segmental duplications associated with contig ends.
Shown are the sequence identity and length of all segmental duplications in GRCh38 (gray). Red data points indicate the largest segmental duplication within 10 kbp of the 95 contig ends located within segmental duplications in the CHM13 20 kbp HiCanu assembly.

Supplementary Figure 12. Coverage drop in defensin reference.
Сontig 'break' in the HiCanu 20 kbp assembly (highlighted in red) corresponds to a coverage drop in both the 10 kbp and 20 kbp HiFi library. The region immediately upstream of the gap contains a >200 bp (AAAGG) simple-sequence repeat. HiCanu has the lowest rate of extensive mis-assemblies (5) versus Canu HiFi (11), Peregrine (10), Canu ONT (32), and Flye (14). HiCanu also has the most continuous reconstruction of the CENX array, splitting it across two contigs. Canu HiFi has the CENX array in three contigs while the other assemblers have no long contigs which can be aligned to the centromere.

Supplementary Figure 14. Read alignment identity against ChrX
Separate boxplots are shown for raw HiFi reads as in Figure 1 (init), homopolymer-compressed reads (compressed), OEA-corrected reads (corrected), and corrected reads after ignoring differences in microsatellite repeats (masked). This figure also includes compressed + masked reads to evaluate the impact of our correction. The median read identity, indicated by solid segments, increases from less than 99.9% to 100% (note that the plots show an y-range of 99.65-100%). Corrected + masked reads are comparable to just compressed reads and correction is necessary to achieve the near-perfect read alignments.

Supplementary Figure 15. Bubble contig analysis in HiCanu.
a. Genomic area, containing two regions of relatively high heterozygosity (blue/green) divided by a long homozygous region (black). b. The homozygous region is typically classified as a genomic repeat due to the reads at boundaries between homozygous and flanking heterozygous regions. c. The large pseudo-haplotype contig is broken at the region boundaries. d. In HiCanu, bubble contigs (see main text for the definition) are first detected and the reads crossing the regions boundary are no longer considered in repeat detection. e. The pseudo-haplotype contig is not split, leaving a continuous pseudo-haplotype.

Supplementary Tables
Supplementary Table 1 The analysis used 20 kbp HiFi CHM13 library. All reads were aligned to the recently finished ChrX reference. Reads were output from HiCanu after compression, after compression and ignoring errors within microsatellite regions, after OEA-correction, and all steps combined (compression + OEA-correction + microsatellite mask). CHM13 collapses were evaluated using SDA as in Vollger et al. 2019. All contigs in the assembly are included for analysis. The collapsed bases are the assembly bases with "higher than expected coverage, while the expanded bases is the estimate of how much sequence is contained in the collapsed regions, had they been correctly assembled.