genBlastA: Enabling BLAST to identify homologous gene sequences

Rong She (1,3), Jeffrey S.-C. Chu (2,3), Ke Wang (1), Jian Pei (1), and Nansheng Chen (2,4)

  1. School of Computing Science, Simon Fraser University, Burnaby, British Columbia, V5A 1S6 Canada
  2. Department of Molecular Biology and Biochemistry, Simon Fraser University, Burnaby, British Columbia, V5A 1S6 Canada
  3. These authors contributed equally to this work.

Abstract

BLAST is an extensively used local similarity search tool for identifying homologous sequences. When a gene sequence (either a protein sequence or a nucleotide sequence) is used as a query to search for homologous sequences in a genome, the search results, represented as a list of high-scoring pairs (HSPs), are fragments of candidate genes rather than full-length candidate genes. Relevant HSPs (“signals”), which represent candidate genes in the target genome sequences, are buried within a report that also contains hundreds to thousands of random HSPs (“noise”). Consequently, BLAST results are often overwhelming and confusing, even to experienced users. To use BLAST effectively, a program is needed that extracts from the full HSP report the relevant HSPs representing candidate homologous genes. To achieve this goal, we have designed a graph-based algorithm, genBlastA, which automatically filters HSPs into well-defined groups, each representing a candidate gene in the target genome. The novelty of genBlastA is an edge length metric that reflects a set of biologically motivated requirements, so that each shortest path corresponds to an HSP group representing a homologous gene. We have demonstrated that this novel algorithm is both efficient and accurate for identifying homologous sequences, and that it outperforms existing approaches with similar functionality.
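To make the shortest-path idea concrete, the following is a minimal sketch and not genBlastA's actual method. The abstract does not define the edge length metric, so the one used here is an assumption chosen for illustration: an edge between two HSPs exists only if they are co-linear in both query and genome and separated by at most a hypothetical `max_intron` bases, and its length is the number of query residues skipped between them. The names `HSP`, `edge_length`, and `best_hsp_group` are likewise hypothetical. Strand handling is ignored, and only a single best group is returned.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HSP:
    q_start: int  # query start (0-based, half-open)
    q_end: int    # query end
    t_start: int  # target (genome) start
    t_end: int    # target end
    score: float  # BLAST score (kept for reference, unused by this toy metric)


def edge_length(a, b, max_intron=5000):
    """Illustrative edge length from HSP a to HSP b (an assumption, not the
    published metric). Returns None when b cannot follow a."""
    if b.q_start < a.q_end or b.t_start < a.t_end:
        return None  # not co-linear in query and genome
    if b.t_start - a.t_end > max_intron:
        return None  # too far apart to belong to the same gene
    return b.q_start - a.q_end  # query residues left uncovered between a and b


def best_hsp_group(hsps, query_len, max_intron=5000):
    """Shortest source-to-sink path over the HSP graph.

    Source -> HSP costs the query residues skipped before the HSP;
    HSP -> sink costs the query residues skipped after it. The path
    length is therefore the total number of uncovered query residues.
    """
    if not hsps:
        return [], query_len
    # Every edge goes from a smaller to a larger t_start, so sorting by
    # t_start gives a valid topological order for the DAG.
    order = sorted(hsps, key=lambda h: (h.t_start, h.q_start))
    dist = [h.q_start for h in order]  # direct source -> HSP edges
    prev = [None] * len(order)
    for j, b in enumerate(order):
        for i in range(j):
            w = edge_length(order[i], b, max_intron)
            if w is not None and dist[i] + w < dist[j]:
                dist[j] = dist[i] + w
                prev[j] = i
    # Close each path with its HSP -> sink edge and keep the shortest.
    totals = [dist[i] + (query_len - h.q_end) for i, h in enumerate(order)]
    end = min(range(len(order)), key=totals.__getitem__)
    group, k = [], end
    while k is not None:
        group.append(order[k])
        k = prev[k]
    return group[::-1], totals[end]
```

With three co-linear HSPs that tile a 300-residue query and one distant HSP, the sketch returns the three tiling HSPs as the group and leaves the distant one out, because no edge within `max_intron` connects it to the others.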

Footnotes

  • 4 Corresponding author.

    E-mail chenn{at}sfu.ca; fax (778) 782-5583.

  • [Supplemental material is available online at www.genome.org.]

  • Article published online before print. Article and publication date are at http://www.genome.org/cgi/doi/10.1101/gr.082081.108.

    • Received June 9, 2008.
    • Accepted September 29, 2008.