Jacques D. Retief; Kevin R. Lynch; William R. Pearson

Box 1.

The FAST_PAN Strategy

1. A set of protein query sequences that reflects the evolutionary range of interest is assembled. To find entirely unrecognized families, a very diverse set of related sequences should be chosen. To find new paralogs of known families, known paralogs from a specific organism (e.g., mouse or human) should be assembled.

2. Each of these query sequences is used to search an EST or genomic DNA database using the tfastx3 or tfasty3 program. (The tfastx3 program is preferred for its speed when alignment quality is not paramount; tfasty3 should be used when searching for new paralogs) (Fig. 1A).

3. The list of alignments from each search is parsed to extract the DNA library (EST) name, expectation value, percent identity, and boundaries of the alignment (Fig. 1B).

4. The parsed table of alignment information is combined for all of the query sequences and sorted based on DNA library (EST) sequence identifier (Fig. 1C).

5. The sorted table of library scores is rescanned to calculate a sum–score that combines the log(expectation) scores for the alignments between the same library sequence and each of the query sequences that “found” the library sequence (Fig. 1D).

6. The table of library scores is re-sorted by sum–score and sequence name, so that the library sequence that obtained the best total expectation value score is shown first, the next best second, etc.

7. The final sorted list is used to produce (a) a PostScript plot summarizing the expectation value, extent of alignment, and percent identity (Fig. 3); and (b) an HTML file that combines all of the alignments between the DNA library sequence and the different query sequences assembled in step 1 (Fig. 6A).
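
Steps 4 through 6 can be illustrated with a short Python sketch. It is not the FAST_PAN code itself: the `Hit` record, its field names, and the example identifiers are hypothetical, and the sum–score is assumed here to be the sum of -log10(expectation) over every query that found a given library sequence, which is one plausible reading of "combines the log(expectation) scores". The sketch also assumes that each query contributes at most one alignment per library sequence and that the parsing in step 3 has already been done.

```python
import math
from collections import defaultdict
from typing import NamedTuple


class Hit(NamedTuple):
    """One parsed alignment from step 3 (field names are hypothetical)."""
    query: str
    library_id: str
    expect: float
    percent_identity: float
    start: int
    end: int


def sum_scores(hits, floor=1e-300):
    """Steps 4-5: group hits by library sequence and add up -log10(E).

    The floor keeps an expectation value reported as 0 from breaking log10.
    """
    totals = defaultdict(float)
    # Sorting by library identifier mirrors step 4; the dictionary does the
    # per-sequence accumulation described in step 5.
    for hit in sorted(hits, key=lambda h: h.library_id):
        totals[hit.library_id] += -math.log10(max(hit.expect, floor))
    return dict(totals)


def rank_library_sequences(hits):
    """Step 6: best sum-score first, ties broken by sequence name."""
    totals = sum_scores(hits)
    return sorted(totals.items(), key=lambda item: (-item[1], item[0]))


# Made-up example data: three library sequences found by up to two queries.
hits = [
    Hit("queryA", "EST001", 1e-20, 45.0, 10, 300),
    Hit("queryB", "EST001", 1e-10, 38.0, 25, 290),
    Hit("queryA", "EST002", 1e-25, 52.0, 1, 310),
    Hit("queryC", "EST003", 1e-3, 30.0, 50, 120),
]
ranking = rank_library_sequences(hits)
for library_id, score in ranking:
    print(f"{library_id}\t{score:.1f}")
```

In this made-up data, EST001 ranks above EST002 even though its best single alignment is weaker, because two different queries found it. That is the behavior the sum–score is meant to capture: library sequences supported by several members of the query family rise to the top of the list.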

Panning for Genes—A Visual Strategy for Identifying Novel Gene Orthologs and Paralogs
