Improved assembly of noisy long reads by k-mer validation

  1. Gabriel Goldstein
  1. Universidade Federal do Rio de Janeiro
  1. * Corresponding author; email: bernardo1963{at}gmail.com

Abstract

Genome assembly depends critically on read length. Two recent technologies, PacBio and Oxford Nanopore, produce read lengths greater than 20 kb, which yield de novo genome assemblies with vastly greater contiguity than those based on Sanger, Illumina, or other technologies. However, the very high error rates of these two new technologies (~15% per base) makes assembly imprecise at repeats longer than the read length and computationally expensive. Here we show that the contiguity and quality of the assembly of these noisy long reads can be significantly improved at a minimal cost, by leveraging on the low error rate and low cost of Illumina short reads. Namely, k-mers from the PacBio raw reads that are not present in Illumina reads (which account for ~95% of the distinct k-mers) are deemed sequencing errors and ignored at the seed alignment step. By focusing on the ~5% of k-mers that are error-free, read overlap sensitivity is dramatically increased. Of equal importance, the validation procedure can be extended to exclude repetitive k-mers, which prevents read miscorrection at repeats and further improves the resulting assemblies. We tested the k-mer validation procedure using one long-read technology (PacBio) and one assembler (MHAP/ Celera Assembler), but it is very likely to yield analogous improvements with alternative long-read technologies and assemblers, such as Oxford Nanopore and BLASR/DALIGNER/Falcon, respectively.

  • Received May 1, 2016.
  • Accepted September 29, 2016.

This article is distributed exclusively by Cold Spring Harbor Laboratory Press for the first six months after the full-issue publication date (see http://genome.cshlp.org/site/misc/terms.xhtml). After six months, it is available under a Creative Commons License (Attribution-NonCommercial 4.0 International), as described at http://creativecommons.org/licenses/by-nc/4.0/.

Articles citing this article

ACCEPTED MANUSCRIPT

This Article

  1. Genome Res. gr.209247.116 Published by Cold Spring Harbor Laboratory Press

Article Category

ORCID

Share

Preprint Server