RT Journal A1 Carvalho, Antonio Bernardo A1 Dupim, Eduardo G. A1 Goldstein, Gabriel T1 Improved assembly of noisy long reads by k-mer validation JF Genome Research JO Genome Research YR 2016 FD December 01 VO 26 IS 12 SP 1710 OP 1720 DO 10.1101/gr.209247.116 UL http://genome.cshlp.org/content/26/12/1710.abstract AB Genome assembly depends critically on read length. Two recent technologies, from Pacific Biosciences (PacBio) and Oxford Nanopore, produce read lengths >20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However, the very high error rates of these two new technologies (∼15% per base) makes assembly imprecise at repeats longer than the read length and computationally expensive. Here we show that the contiguity and quality of the assembly of these noisy long reads can be significantly improved at a minimal cost, by leveraging on the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in Illumina reads (which account for ∼95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ∼5% of k-mers that are error free, read overlap sensitivity is dramatically increased. Of equal importance, the validation procedure can be extended to exclude repetitive k-mers, which prevents read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure using one long-read technology (PacBio) and one assembler (MHAP/Celera Assembler), but it is very likely to yield analogous improvements with alternative long-read technologies and assemblers, such as Oxford Nanopore and BLASR/DALIGNER/Falcon, respectively.