ABySS 2.0: resource-efficient assembly of large genomes using a Bloom filter

The assembly of DNA sequences de novo is fundamental to genomics research. It is the first of many steps toward elucidating and characterizing whole genomes. Downstream applications, including analysis of genomic variation between species and between or within individuals, critically depend on robustly assembled sequences. In the span of a single decade, the sequence throughput of leading DNA sequencing instruments has increased drastically; coupled with established and planned large-scale, personalized medicine initiatives to sequence genomes in the thousands and even millions, the development of efficient, scalable and accurate bioinformatics tools for producing high-quality reference draft genomes is timely. With ABySS 1.0, we originally showed that assembling the human genome using short 50-bp sequencing reads was possible by aggregating the half terabyte of compute memory needed over several computers using a standardized message-passing system (MPI). We present here its redesign, which departs from MPI and instead implements algorithms that employ a Bloom filter, a probabilistic data structure, to represent a de Bruijn graph and reduce memory requirements. We benchmarked ABySS 2.0 human genome assembly using a Genome in a Bottle data set of 250-bp Illumina paired-end and 6-kbp mate-pair libraries from a single individual. Our assembly yielded an NG50 (NGA50) scaffold contiguity of 3.5 (3.0) Mbp using <35 GB of RAM, a modest memory requirement by today's standards that is often available on a single computer. We also investigate the use of BioNano Genomics and 10x Genomics Chromium data to further improve the scaffold NG50 (NGA50) of this assembly to 42 (15) Mbp.


Effect of Bloom Filter False Positive Rate
To assess the effect of the Bloom filter false positive rate (FPR) on ABySS 2.0 assemblies, we conducted assemblies of the C. elegans N2 strain DRR008444 dataset (Illumina GA IIx sequencing of 2x100 bp reads from 300 bp fragments, to 75-fold coverage) under a range of Bloom filter FPR values and assessed the resulting NG50, number of misassemblies, and wallclock time. We note that the Bloom filter false positive rate is determined by a combination of the Bloom filter size, the number of Bloom filter hash functions, and the number of distinct k-mers in the dataset, as per Equation 1 in the main text. However, prediction of the Bloom filter FPR is further complicated by the use of a cascading chain of Bloom filters to remove low-occurrence k-mers, as detailed in the assembly algorithm description in Methods. For the purposes of our experiment, we fixed all parameters affecting FPR except the Bloom filter memory allocation, which was used as the driving parameter for the experiment. In particular, we fixed the number of Bloom filter hash functions at 1, the number of cascading Bloom filters at 4, and the k-mer size at 64, and varied the Bloom filter memory allocation from 250 MB to 3000 MB with a step size of 250 MB. For example, the ABySS 2.0 assembly for a Bloom filter memory allocation of 250 MB was run with the command abyss-pe c=4 k=64 H=1 B=250M in='DRR008444_1.fastq DRR008444_2.fastq', where c=4 specifies the use of 4 cascading Bloom filters (i.e. a minimum k-mer count threshold of 4), k=64 specifies a k-mer size of 64, and H=1 specifies that the Bloom filter should use a single hash function. The runs for other Bloom filter sizes used the same parameter values with the exception of B (Bloom filter memory allocation). The wallclock time of the assemblies was measured with /usr/bin/time, and the false positive rates corresponding to each Bloom filter size were obtained from the ABySS 2.0 log files. All assemblies were run with 12 threads on an isolated machine with 48 GB RAM and two Xeon X5650 CPUs.
We used QUAST 3.2 to calculate the NG50 and misassembly metrics for the experiment, using the C. elegans Bristol N2 strain as the reference genome (NCBI BioProject PRJNA158). Expanding on the results presented in Fig. 3 of the main text, Figs. S1, S2, and S3 depict the changes to FPR, wallclock time, and a variety of QUAST contiguity and misassembly metrics that result from changing the Bloom filter allocation, while Tables S1, S2, and S3 provide the corresponding data. We observe that as the Bloom filter memory decreases from 3000 MB to 500 MB (FPR values of 1.91% and 10.9%, respectively), the majority of assembly metrics remain stable. However, large changes in the metrics occur when the Bloom filter allocation is decreased further from 500 MB to 250 MB (FPR values of 10.9% and 20.7%, respectively). We similarly observe a steep increase in wallclock time, from 57 min to 152 min, when decreasing the Bloom filter allocation from 500 MB to 250 MB. These results indicate that a target FPR between 5% and 10% provides the best trade-off between assembly quality, wallclock time, and memory usage.
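The FPR figures quoted above can be approximated from the standard Bloom filter formula (Equation 1 of the main text). The sketch below is illustrative only: it assumes a single filter level (the cascading chain is ignored) and uses a made-up distinct k-mer count.

```python
import math

def bloom_fpr(mem_bytes, num_hashes, num_distinct_kmers):
    """Approximate Bloom filter false positive rate,
    p = (1 - e^(-h*n/m))^h, where m is the filter size in bits,
    h the number of hash functions, and n the distinct k-mers inserted."""
    m = mem_bytes * 8  # bytes -> bits
    h = num_hashes
    return (1.0 - math.exp(-h * num_distinct_kmers / m)) ** h

# With a hypothetical 10^8 distinct k-mers and a single hash function,
# halving the filter memory roughly doubles the FPR.
fpr_500mb = bloom_fpr(500 * 1024**2, 1, 10**8)
fpr_250mb = bloom_fpr(250 * 1024**2, 1, 10**8)
```

Because ABySS 2.0 splits the B allocation across the cascading filter levels, the FPR reported in its log files will differ from this single-filter estimate.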

Figure S1: N50, number of misassemblies, NG50, number of relocation misassemblies, NGA50, and number of misassembled contigs reported by QUAST 3.2 for ABySS 2.0 assemblies of C. elegans dataset DRR008444, using the C. elegans Bristol N2 strain as the reference genome (NCBI BioProject PRJNA158). Results are shown for Bloom filter memory allocations ranging between 250 MB and 3000 MB with a step size of 250 MB. Numbers of translocation misassemblies and inversion misassemblies are omitted because their count was zero across all Bloom filter sizes.

Figure S2: Sum length of misassembled contigs, number of indels, number of local misassemblies, number of short indels, number of mismatches, and number of long indels reported by QUAST 3.2 for ABySS 2.0 assemblies of C. elegans dataset DRR008444, using the C. elegans Bristol N2 strain as the reference genome (NCBI BioProject PRJNA158). Results are shown for Bloom filter allocations ranging between 250 MB and 3000 MB with a step size of 250 MB.

Figure S3: Sum length of indels, Bloom filter false positive rate, reconstruction, and wallclock time for ABySS 2.0 assemblies of C. elegans dataset DRR008444, using Bloom filter memory allocations ranging between 250 MB and 3000 MB with a step size of 250 MB. Sum length of indels and reconstruction were computed by QUAST 3.2 using the C. elegans Bristol N2 strain as the reference genome (NCBI BioProject PRJNA158). The reconstruction figure corresponds to the "Total_length" column reported by QUAST, which is the sum length of all assembled sequences >= 500 bp. The dashed line of the reconstruction plot indicates the length of the reference genome sequence.

Assembler Comparison Details

Sealer Gap Filling Results
In addition to comparing the contiguity and correctness of the contig sequences in Fig. 3A, we also assessed the contiguity improvements produced by closing scaffold gaps with Sealer, prior to splitting the sequences at 'N's. Sealer is a tool that fills scaffold gaps by searching for a connecting path between gap flanks in the de Bruijn graph, using multiple k-mer sizes.

Sequence Identity and Genome Coverage
We aligned the contigs to the reference genome using BWA-MEM and filtered out secondary alignments (SAM flag 0x100). We calculated the number of reference nucleotides covered by contigs and the total number of aligned contig nucleotides using samtools depth. We calculated the total number of mismatching nucleotides as the sum of the SAM NM tags (number of mismatches) of the alignments. The percent identity is calculated as one minus the number of mismatches divided by the total number of aligned contig nucleotides. The percent genome coverage is calculated as the number of reference genome positions covered by an aligned contig divided by the number of non-N reference nucleotides, 2,937,639,113 bp.
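As a minimal sketch of the identity arithmetic above, applied to SAM text records (a toy stand-in for the actual BWA-MEM/samtools pipeline; it assumes NM tags are present and makes the simplifying choice of counting aligned contig bases from CIGAR M/=/X operations):

```python
import re

CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")

def percent_identity(sam_lines):
    """Percent identity as defined above: 1 - sum(NM) / aligned bases,
    skipping secondary alignments (SAM flag 0x100)."""
    aligned = mismatches = 0
    for line in sam_lines:
        if line.startswith("@"):       # skip header records
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[1]) & 0x100:     # secondary alignment
            continue
        # count aligned contig bases from the CIGAR string
        aligned += sum(int(n) for n, op in CIGAR_RE.findall(fields[5])
                       if op in "M=X")
        for tag in fields[11:]:        # locate the NM:i: tag
            if tag.startswith("NM:i:"):
                mismatches += int(tag[5:])
                break
    return 100.0 * (1.0 - mismatches / aligned)
```

For example, a single primary 100M alignment with NM:i:2 yields 98% identity, while any flag-0x100 records are ignored.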

K-mer Size Sweeps
For most of the assemblers, we conducted assemblies across a range of k-mer sizes and selected the optimal k-mer size based on the trade-off between maximizing contiguity (NGA50/NG50) and minimizing the number of breakpoints when aligning the sequences to the reference genome GRCh38 (Fig. S5, Tables S5, S6, S7, S8, S9). We note that Minia did not support k-mer sizes greater than 128 and BCALM 2 did not support k-mer sizes larger than 63. Assemblers for which we did not perform k-mer size optimization were DISCOVAR de novo, MEGAHIT, and SGA. In the case of DISCOVAR de novo, the software determines a suitable k-mer size automatically from the input data (author communication). In the case of MEGAHIT, the algorithm assembles across multiple k-mer sizes simultaneously. To better cover the full read length of 250 bp, we extended the default range of k-mer sizes for MEGAHIT from 21, 41, 61, 81, 99 to 17, 45, 73, 101, 129, 157, 185, 213, 241, achieving improved contiguity (NG50 of 8293 bp vs. 4058 bp) at the expense of additional running time (25.6 hours vs. 15.5 hours) with no significant increase in memory usage (196.9 GB vs. 194.5 GB). In the case of SGA, the assembly follows the string graph paradigm (Myers 2005), which accommodates variable-size overlaps, and so k-mer size optimization was not needed.
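The k selection described above was made by inspecting the sweep tables; as a purely illustrative stand-in, a simple ratio score over hypothetical sweep results might look like:

```python
def pick_k(sweep):
    """Choose the k-mer size with the best contiguity-to-breakpoint
    trade-off. `sweep` maps k -> (NGA50, breakpoints); this ratio score
    is a hypothetical stand-in for the by-inspection choice in the text."""
    return max(sweep, key=lambda k: sweep[k][0] / max(sweep[k][1], 1))

# Hypothetical sweep results (not the paper's numbers):
best_k = pick_k({96: (18000, 2200), 128: (20500, 2100), 144: (21000, 1900)})
```

Any single scalar score necessarily hides the trade-off, which is why the sweep tables report both metrics separately.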

Additional Benchmarking of ABySS
All assemblies for the assembler comparison of Fig. 3 were run on servers with 4 Xeon E7-8867 v3 CPUs running at 2.50 GHz, with a total of 64 cores and 2.5 TB of RAM. In addition to the main runs on the Xeon E7 machines, we also conducted additional performance tests for ABySS 1.0 and ABySS 2.0 on alternate architectures.
To measure the performance of ABySS 1.0 in a cluster environment, we benchmarked an MPI assembly job distributed across 11 nodes, each having 48 GB RAM and 2 Xeon X5650 CPUs running at 2.67 GHz. Each cluster node provided a total of 12 CPU cores, and the cluster nodes were interconnected via Infiniband. Table S10 compares the wallclock times of the distributed ABySS 1.0 job and the main ABySS 1.0 run from Fig. 3, which was run on a single 64-core Xeon E7 machine. For the sake of comparison, we set the number of MPI processes for the cluster assembly job to 64 (abyss-pe parameter "np=64"), even though 132 CPU cores were available across the 11 cluster nodes. Wallclock times in Table S10 are broken down by ABySS assembly stage. We note that only the first (unitig) stage of the ABySS assembly pipeline is distributed across nodes with MPI, whereas the contig and scaffold stages are multithreaded and run on a single node. As a result, the contig stage ran much more slowly for the cluster-based ABySS 1.0 job than for the single-machine Xeon E7 run (14.0 hours vs. 3.3 hours). The scaffold stage, which is not as computationally intensive as the contig stage, ran in roughly the same wallclock time in both cases (4.5 hours vs. 4.8 hours). The overall wallclock time for the distributed ABySS 1.0 assembly was 25.4 hours vs. 14.3 hours for the E7 run. In practice, the distributed ABySS 1.0 job also required more memory than the single-machine E7 run. While the E7 run had a peak memory requirement of 418 GB RAM, the cluster job required 528 GB of aggregate RAM (11 nodes with 48 GB per node). Although the actual memory used by ABySS 1.0 was the same in both cases, the cluster job required extra headroom because the distribution of k-mer data was not perfectly even across MPI processes. To test the performance of ABySS 2.0 on a low-memory machine, we benchmarked ABySS 2.0 on a node with 48 GB RAM and 2 Xeon X5650 CPUs running at 2.67 GHz, with a total of 12 cores. As expected, the peak RAM usage was the same as for the E7 run (34 GB), while the wallclock time was approximately 4 times longer (80 hours vs. 20 hours). We attribute the longer wallclock time to the use of 12 threads rather than 64 threads, due to the lower number of cores available on this machine in comparison to the Xeon E7 server.

Assemblies with Raw and BFC-corrected Reads
To assess the impact of using BFC-corrected reads in our assembly comparison of Fig. 3, we ran equivalent assemblies on the uncorrected reads and compared the contig NGA50, contig NG50, alignment breakpoints, peak memory usage, and wallclock time to the values measured for BFC-corrected reads (Fig. S6, Tables S11, S12). For each assembler, we used identical command line parameters for the uncorrected-reads assembly as were used on the BFC-corrected reads (see Methods). We note that the BFC-corrected assemblies generally required less time and memory and produced improved assembly contiguity in comparison to the uncorrected reads. Two minor exceptions were: (i) DISCOVAR de novo, whose contig NG50 and NGA50 were 1.9% and 2.0% lower, respectively, with the BFC-corrected reads, and (ii) SGA, which ran slightly faster on uncorrected reads (60 hours vs. 65 hours). For consistency, we used the assemblies of BFC-corrected reads for all assemblers in Fig. 3 of the main text.
Table S11: The sequence contiguity and number of breakpoints of assemblies using raw and BFC-corrected reads from the GIAB HG004 dataset. NGA50 and number of breakpoints were calculated by aligning the sequences to GRCh38 using BWA-MEM.

Software
Most software used in these analyses was installed from the Homebrew-Science software collection using Linuxbrew with the command brew install abyss allpaths-lg bcalm bfc bwa discovardenovo masurca megahit nxtrim samtools seqtk sga soapdenovo. The following three tools were installed manually.

Assembler Scripts and Configuration Files
For ABySS 1.0 (Simpson et al. 2009), we installed version 1.9.0 and assembled the paired-end and mate-pair reads with the command shown in Supplemental Listing S1, where the files pe400.in and mp6k+unknown.in are lists of the locations of compressed FASTQ files.
For ABySS 2.0, we assembled the paired-end and mate-pair reads with the command shown in Supplemental Listing S2. In comparison to the ABySS 1.0 assembly command, three Bloom filter-specific assembly parameters were added (B=26G H=4 kc=3), which specify the total memory allocated to the Bloom filters, the number of Bloom filter hash functions, and the number of cascading Bloom filter levels, respectively. We determined the values for total memory size (B) and number of hash functions (H) by counting distinct 144-mers with ntCard (Mohamadi et al. 2017) and targeting a false positive rate of 5% for the first level of the cascading Bloom filter. We deemed 5% to be a suitable upper bound for Bloom filter FPR based on the results of our C. elegans experiment above, which indicated good performance in the range of 5-10% FPR. We determined the optimal number of cascading Bloom filter levels by running assemblies with kc=2, kc=3, and kc=4, and choosing the assembly with the highest NG50 and lowest number of breakpoints. Note that the parameter kc of the final release version of ABySS 2.0 was originally named c in the prerelease version tagged bloom-abyss-preview evaluated in this paper.
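The sizing of B for a target first-level FPR can be sketched by inverting the FPR formula, under the simplifying assumption of a single filter level. The distinct k-mer count below is a placeholder; in the paper it came from counting distinct 144-mers with ntCard.

```python
import math

def bloom_bytes_for_fpr(num_distinct_kmers, num_hashes, target_fpr):
    """Invert p = (1 - e^(-h*n/m))^h for the filter size m (in bits)
    that achieves a target false positive rate, and return it in bytes."""
    h = num_hashes
    n = num_distinct_kmers
    bits = -h * n / math.log(1.0 - target_fpr ** (1.0 / h))
    return bits / 8.0

# Placeholder distinct k-mer count; H=4 hash functions, 5% target FPR.
mem_bytes = bloom_bytes_for_fpr(2.5e9, 4, 0.05)
```

In practice the B allocation must also cover the additional cascading levels, so the value chosen for abyss-pe is larger than this single-filter estimate.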
For ALLPATHS-LG (Gnerre et al. 2010), we installed version 52488 and attempted to assemble the paired-end and mate-pair reads with the command shown in Supplemental Listing S3 and the configuration files in_libs.csv and in_groups.csv shown in Supplemental Listings S4-S5. We terminated the ALLPATHS-LG job after it ran for more than a month without completing.
For BCALM 2 (Chikhi et al. 2016), we installed version 2.0.0 and assembled the paired-end reads with the command shown in Supplemental Listing S6.The largest value of k supported by BCALM 2 is 63.
For DISCOVAR de novo, the whole genome de novo assembly successor of DISCOVAR (Weisenfeld et al. 2014), we installed version 52488, assembled the paired-end reads, and scaffolded this assembly using three standalone scaffolding tools, ABySS-Scaffold 1.9.0, BESST 2.2.4 (Sahlin et al. 2016), and LINKS 1.8.2 (Warren et al. 2015), with the command shown in Supplemental Listing S7.
For MaSuRCA (Zimin et al. 2013), we installed version 3.1.3 and attempted to assemble the paired-end and mate-pair reads with the command shown in Supplemental Listing S8 and the configuration file config.txt shown in Supplemental Listing S9. MaSuRCA ran for five days and failed with a segmentation fault in the program gatekeeper.
For MEGAHIT (Li et al. 2016), we installed version 1.0.6-3-gfb1e59b and assembled the paired-end reads with the command shown in Supplemental Listing S10.
For Minia (Chikhi et al. 2013), we installed version 3.0.0-alpha1 and assembled the paired-end reads with the command shown in Supplemental Listing S11. The largest value of k supported by Minia was 128.
For SGA (Simpson and Durbin 2011), we installed version 0.10.14 and assembled the paired-end reads with the command shown in Supplemental Listing S12.
For SOAPdenovo2 (Luo et al. 2012), we installed version 2.04 and assembled the paired-end and mate-pair reads with the command shown in Supplemental Listing S13 and the configuration file hsapiens.config shown in Supplemental Listing S14. We used the BioNano optical map to further scaffold the ABySS 1.0, ABySS 2.0, and DISCOVAR de novo assemblies (the latter scaffolded with ABySS-Scaffold, BESST, and LINKS), using IrysSolve 2.1.5063 with the command shown in Supplemental Listing S15, according to the document "Theory Of Operation: Hybrid Scaffolding" available online at http://bit.ly/bionano-scaffolding. The configuration files were used unmodified as distributed by BioNano Genomics and are available online at https://github.com/bcgsc/abyss-2.0-giab/tree/master/bionano.
We used 10x Genomics Chromium data to scaffold the ABySS 2.0 + BioNano scaffolds with ARCS (Yeo et al. 2017) and LINKS 1.8.2 (Warren et al. 2015). The version of ARCS used in the paper is available from https://github.com/bcgsc/arcs/tree/arcs-prerelease. We aligned the Chromium reads to the ABySS 2.0 + BioNano scaffolds using BWA-MEM with default settings and ran ARCS and LINKS with the commands shown in Supplemental Listing S16.

Figure S4: Percent genome coverage and percent sequence identity of contigs aligned to the reference genome using BWA-MEM.

Figure S6: Comparison of assembly results for uncorrected reads and BFC-corrected reads for the GIAB HG004 dataset using ABySS 1.0, ABySS 2.0, BCALM 2, DISCOVAR de novo, Minia, SGA, and SOAPdenovo2. (A) Peak memory usage and wallclock times of each assembler when run on raw and BFC-corrected reads. (B) NG50 and number of breakpoints for contig sequences generated from raw and BFC-corrected reads. The number of breakpoints was calculated by aligning the assembled sequences to the reference genome GRCh38 with BWA-MEM 0.7.13. For assemblies with scaffolding stages, the contigs were extracted by splitting the sequences at 'N' characters. While the SOAPdenovo2 assembly of BFC-corrected reads completed successfully, the SOAPdenovo2 assembly of uncorrected reads failed with a segmentation fault, and thus only the BFC-corrected result is shown.

Figure S7: A Circos Assembly Consistency Plot for the ABySS 1.0 + BioNano Assembly. Scaftigs from the largest scaffolds that compose 90% of the genome are aligned to GRCh38 using BWA-MEM. GRCh38 chromosomes are displayed on the left and the scaffolds on the right. Connections show the aligned regions between the genome and scaffolds. Contigs are included as part of the same region if they are within 1 Mbp of each other on either side of the connection, and regions shorter than 100 kbp are not shown. The black regions on the chromosomes indicate gaps in the reference, and the circles indicate the centromere location on each chromosome.

Figure S8: A Circos Assembly Consistency Plot for the ABySS 2.0 + BioNano Assembly. Scaftigs from the largest scaffolds that compose 90% of the genome are aligned to GRCh38 using BWA-MEM. GRCh38 chromosomes are displayed on the left and the scaffolds on the right. Connections show the aligned regions between the genome and scaffolds. Contigs are included as part of the same region if they are within 1 Mbp of each other on either side of the connection, and regions shorter than 100 kbp are not shown. The black regions on the chromosomes indicate gaps in the reference, and the circles indicate the centromere location on each chromosome.

Figure S9: A Circos Assembly Consistency Plot for the ABySS 2.0 + BioNano + Chromium Assembly. Scaftigs from the largest scaffolds that compose 90% of the genome are aligned to GRCh38 using BWA-MEM. GRCh38 chromosomes are displayed on the left and the scaffolds on the right. Connections show the aligned regions between the genome and scaffolds. Contigs are included as part of the same region if they are within 1 Mbp of each other on either side of the connection, and regions shorter than 100 kbp are not shown. The black regions on the chromosomes indicate gaps in the reference, and the circles indicate the centromere location on each chromosome.

Figure S10: A Circos Assembly Consistency Plot for the DISCOVAR de novo + ABySS-Scaffold + BioNano Assembly. Scaftigs from the largest scaffolds that compose 90% of the genome are aligned to GRCh38 using BWA-MEM. GRCh38 chromosomes are displayed on the left and the scaffolds on the right. Connections show the aligned regions between the genome and scaffolds. Contigs are included as part of the same region if they are within 1 Mbp of each other on either side of the connection, and regions shorter than 100 kbp are not shown. The black regions on the chromosomes indicate gaps in the reference, and the circles indicate the centromere location on each chromosome.

Figure S11: A Circos Assembly Consistency Plot for the DISCOVAR de novo + BESST + BioNano Assembly. Scaftigs from the largest scaffolds that compose 90% of the genome are aligned to GRCh38 using BWA-MEM. GRCh38 chromosomes are displayed on the left and the scaffolds on the right. Connections show the aligned regions between the genome and scaffolds. Contigs are included as part of the same region if they are within 1 Mbp of each other on either side of the connection, and regions shorter than 100 kbp are not shown. The black regions on the chromosomes indicate gaps in the reference, and the circles indicate the centromere location on each chromosome.

Figure S12: A Circos Assembly Consistency Plot for the DISCOVAR de novo + LINKS + BioNano Assembly. Scaftigs from the largest scaffolds that compose 90% of the genome are aligned to GRCh38 using BWA-MEM. GRCh38 chromosomes are displayed on the left and the scaffolds on the right. Connections show the aligned regions between the genome and scaffolds. Contigs are included as part of the same region if they are within 1 Mbp of each other on either side of the connection, and regions shorter than 100 kbp are not shown. The black regions on the chromosomes indicate gaps in the reference, and the circles indicate the centromere location on each chromosome.

Table S1 :
Bloom filter memory, N50, NG50, NGA50, number of misassemblies, number of relocation misassemblies, and number of misassembled contigs reported by QUAST 3.2 for ABySS 2.0 assemblies of C. elegans dataset DRR008444, using the C. elegans Bristol N2 strain as the reference genome (NCBI BioProject PRJNA158). Results are shown for Bloom filter memory allocations ranging between 250 MB and 3000 MB with a step size of 250 MB. Numbers of translocation misassemblies and inversion misassemblies are omitted because their count was zero across all Bloom filter sizes.

Table S2 :
Bloom filter memory, sum length of misassembled contigs, number of local misassemblies, number of mismatches, number of indels, number of short indels, and number of long indels reported by QUAST 3.2 for ABySS 2.0 assemblies of C. elegans dataset DRR008444, using the C. elegans Bristol N2 strain as the reference genome (NCBI BioProject PRJNA158). Results are shown for Bloom filter memory allocations ranging between 250 MB and 3000 MB with a step size of 250 MB.

Table S3 :
Bloom filter memory, sum length of indels, reconstruction, Bloom filter false positive rate, and wallclock time for ABySS 2.0 assemblies of C. elegans dataset DRR008444, using Bloom filter memory allocations ranging between 250 MB and 3000 MB with a step size of 250 MB. Sum length of indels and reconstruction were computed by QUAST 3.2 using the C. elegans Bristol N2 strain as the reference genome (NCBI BioProject PRJNA158). The reconstruction figure corresponds to the "Total_length" column reported by QUAST, which is the sum length of all assembled sequences >= 500 bp. The dashed line of the reconstruction plot indicates the length of the reference genome sequence.

Table S4 :
Percent genome coverage, percent sequence identity and the corre-

Table S5 :
Scaffold contiguity and number of breakpoints for ABySS 1.0 assemblies of the GIAB HG004 dataset, conducted across a range of k-mer sizes. NGA50 and number of breakpoints were calculated by aligning the sequences to GRCh38 using BWA-MEM.

Table S6 :
Scaffold contiguity and number of breakpoints for ABySS 2.0 assemblies of the GIAB HG004 dataset, conducted across a range of k-mer sizes. NGA50 and number of breakpoints were calculated by aligning the sequences to GRCh38 using BWA-MEM.

Table S7 :
Scaffold contiguity and number of breakpoints for BCALM 2 assemblies of the GIAB HG004 dataset, conducted across a range of k-mer sizes. NGA50 and number of breakpoints were calculated by aligning the sequences to GRCh38 using BWA-MEM.

Table S8 :
Scaffold contiguity and number of breakpoints for Minia assemblies of the GIAB HG004 dataset, conducted across a range of k-mer sizes. NGA50 and number of breakpoints were calculated by aligning the sequences to GRCh38 using BWA-MEM.

Table S10 :
Breakdown of wallclock time for two ABySS 1.0 assemblies of the Genome in a Bottle HG004 data set, run on different platforms.

Table S12 :
The peak memory usage and wallclock run time of assemblies using raw and BFC-corrected reads from the GIAB HG004 dataset. Each assembler was run with 64 threads.