Efficient frequency-based de novo short-read clustering for error trimming in next-generation sequencing

Wei Qu; Shin-ichi Hashimoto; Shinichi Morishita

doi:10.1101/gr.089151.108

¹ Department of Computational Biology, Graduate School of Frontier Sciences, The University of Tokyo, Kashiwa 277-0882, Japan;
² Department of Molecular Preventive Medicine, Graduate School of Medicine, The University of Tokyo, Tokyo 113-0033, Japan;
³ Bioinformatics Research and Development (BIRD), Japan Science and Technology Agency (JST), Tokyo 102-8666, Japan

Abstract

Novel massively parallel sequencing technologies provide highly detailed structures of transcriptomes and genomes by yielding deep coverage of short reads, but their utility is limited by inadequate sequencing quality and short read lengths. Trimming sequencing errors from short reads is therefore a vital step that could improve the rate of successful reference mapping and polymorphism detection. Toward this aim, we report a frequency-based, de novo short-read clustering method that organizes erroneous short sequences originating from a single abundant sequence into a tree structure; in this structure, each “child” sequence is considered to be stochastically derived from its more abundant “parent” sequence by one mutation introduced through sequencing error. The root node is the most frequently observed sequence and represents all erroneous reads in the entire tree, so that the reliable representative read can be aligned to the genome without the risk of mapping erroneous reads to false-positive positions. This method complements base calling and error correction based on direct alignment to the reference genome, and it improves the overall accuracy of short-read alignment by consulting the inherent relationships among the entire set of reads. The algorithm runs efficiently with linear time complexity. In addition, an error rate evaluation model can be derived from bacterial artificial chromosome sequencing data obtained in the same run as a control. In two clustering experiments using small RNA and 5′-end mRNA read data sets, we confirmed a remarkable increase (∼5%) in the percentage of short reads aligned to the reference sequence.
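The parent-child tree described above can be illustrated with a minimal Python sketch. It is not the FreClu implementation: it assumes fixed-length reads over the alphabet A/C/G/T, treats "one mutation" as a single substitution, and links a read to a parent whenever a strictly more frequent one-mismatch neighbor exists, omitting the stochastic error model that the abstract mentions. The function names (`one_mismatch_neighbors`, `build_parent_map`, `find_root`, `cluster_counts`) and the example reads are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT"  # assumption: reads contain only these four bases


def one_mismatch_neighbors(seq):
    """Yield every sequence that differs from seq by exactly one substitution."""
    for i, base in enumerate(seq):
        for alt in ALPHABET:
            if alt != base:
                yield seq[:i] + alt + seq[i + 1:]


def build_parent_map(counts):
    """Map each read to its most frequent one-mismatch neighbor that is
    strictly more frequent than the read itself, or to None for a root."""
    parent = {}
    for seq, freq in counts.items():
        best = None
        for nb in one_mismatch_neighbors(seq):
            nb_freq = counts.get(nb, 0)
            if nb_freq > freq and (best is None or nb_freq > counts[best]):
                best = nb
        parent[seq] = best
    return parent


def find_root(seq, parent):
    """Follow parent links upward; frequencies strictly increase, so this ends."""
    while parent[seq] is not None:
        seq = parent[seq]
    return seq


def cluster_counts(counts):
    """Return the parent map and the total read count credited to each root."""
    parent = build_parent_map(counts)
    totals = Counter()
    for seq, freq in counts.items():
        totals[find_root(seq, parent)] += freq
    return parent, totals


# Hypothetical read frequencies: one abundant read, two likely error
# variants of it, and one unrelated abundant read.
reads = Counter({"ACGTACGT": 100, "ACGTACGA": 3, "ACGTTCGA": 1, "TTTTGGGG": 50})
parent, totals = cluster_counts(reads)
print(parent)
print(dict(totals))
```

In this toy input, `ACGTTCGA` attaches to `ACGTACGA`, which in turn attaches to the root `ACGTACGT`, so the root is credited with all 104 reads while `TTTTGGGG` remains its own root with 50. Because each read inspects only its 3L one-mismatch neighbors through hash lookups, the sketch scales linearly with the number of distinct reads for a fixed read length L; whether this matches the source of the linear time complexity in FreClu is an assumption.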

Footnotes

⁴ Corresponding author.

E-mail moris@cb.k.u-tokyo.ac.jp; fax 81-4-7136-3977.
[Supplemental material is available online at www.genome.org. The frequency-based de novo short-read clustering software program, FreClu, is freely available from http://mlab.cb.k.u-tokyo.ac.jp/~quwei/DeNovoShortReadClustering/. Complete data sets are available at the NCBI Short Read Archive (http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi) under accession no. SRA003629.]
Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.089151.108.
- Received November 14, 2008.
- Accepted April 27, 2009.
Freely available online through the Genome Research Open Access option.