Vol. 8, Issue 3, 163-167, March 1998

INSIGHT/OUTLOOK
Phylogenomics: Improving Functional Predictions for Uncharacterized Genes by Evolutionary Analysis

Jonathan A. Eisen1

Department of Biological Sciences, Stanford University, Stanford, California 94305-5020 USA


The ability to accurately predict gene function based on gene sequence is an important tool in many areas of biological research. Such predictions have become particularly important in the genomics age in which numerous gene sequences are generated with little or no accompanying experimentally determined functional information. Almost all functional prediction methods rely on the identification, characterization, and quantification of sequence similarity between the gene of interest and genes for which functional information is available. Because sequence is the prime determining factor of function, sequence similarity is taken to imply similarity of function. There is no doubt that this assumption is valid in most cases. However, sequence similarity does not ensure identical functions, and it is common for groups of genes that are similar in sequence to have diverse (although usually related) functions. Therefore, the identification of sequence similarity is frequently not enough to assign a predicted function to an uncharacterized gene; one must have a method of choosing among similar genes with different functions. In such cases, most functional prediction methods assign likely functions by quantifying the levels of similarity among genes. I suggest that functional predictions can be greatly improved by focusing on how the genes became similar in sequence (i.e., evolution) rather than on the sequence similarity itself. It is well established that many aspects of comparative biology can benefit from evolutionary studies (Felsenstein 1985), and comparative molecular biology is no exception (e.g., Altschul et al. 1989; Goldman et al. 1996). In this commentary, I discuss the use of evolutionary information in the prediction of gene function. To appreciate the potential of a phylogenomic approach to the prediction of gene function, it is necessary to first discuss how gene sequence is commonly used to predict gene function and some general features about gene evolution.

Sequence Similarity, Homology, and Functional Predictions

To make use of the identification of sequence similarity between genes, it is helpful to understand how such similarity arises. Genes can become similar in sequence either as a result of convergence (similarities that have arisen without a common evolutionary history) or descent with modification from a common ancestor (also known as homology). It is imperative to recognize that sequence similarity and homology are not interchangeable terms. Not all homologs are similar in sequence (i.e., homologous genes can diverge so much that similarities are difficult or impossible to detect) and not all similarities are due to homology (Reeck et al. 1987; Hillis 1994). Similarity due to convergence, which is likely limited to small regions of genes, can be useful for some functional predictions (Henikoff et al. 1997). However, most sequence-based functional predictions are based on the identification (and subsequent analysis) of similarities that are thought to be due to homology. Because homology is a statement about common ancestry, it cannot be proven directly from sequence similarity. In these cases, the inference of homology is made based on finding levels of sequence similarity that are thought to be too high to be due to convergence (the exact threshold for such an inference is not well established).

Improvements in database search programs have made the identification of likely homologs much faster, easier, and more reliable (Altschul et al. 1997; Henikoff et al. 1998). However, as discussed above, in many cases the identification of homologs is not sufficient to make specific functional predictions because not all homologs have the same function. The available similarity-based functional prediction methods can be distinguished by how they choose the homolog whose function is most relevant to a particular uncharacterized gene (Table 1). Some methods are relatively simple: many researchers use the highest scoring homolog (as determined by programs like BLAST or BLAZE) as the basis for assigning function. While highest hit methods are very fast, can be automated readily, and are likely accurate in many instances, they do not take advantage of any information about how genes and gene functions evolve. For example, gene duplication and subsequent divergence of function of the duplicates can result in homologs with different functions being present within one species. Specific terms have been created to distinguish homologs in these cases (Table 2): Genes of the same duplicate group are called orthologs (e.g., β-globin from mouse and humans), and different duplicates are called paralogs (e.g., α- and β-globin) (Fitch 1970). Because gene duplications are frequently accompanied by functional divergence, dividing genes into groups of orthologs and paralogs can improve the accuracy of functional predictions. Recognizing that the one-to-one sequence comparisons used by most methods do not reliably distinguish orthologs from paralogs, Tatusov et al. (1997) developed the COG clustering method (see Table 1). Although the COG method is clearly a major advance in identifying orthologous groups of genes, it is limited in its power because clustering is a way of classifying levels of similarity and is not an accurate method of inferring evolutionary relationships (Swofford et al. 1996). Thus, as sequence similarity and clustering are not reliable estimators of evolutionary relatedness, and as the incorporation of such phylogenetic information has been so useful to other areas of biology, evolutionary techniques should be useful for improving the accuracy of predicting function based on sequence similarity.

                              
Table 1.   Methods of Predicting Gene Function When Homologs Have Multiple Functions
                              
Table 2.   Types of Molecular Homology

Phylogenomics

There are many ways in which evolutionary information can be used to improve functional predictions. Below, I present an outline of one such phylogenomic method (see Fig. 1), and I compare this method to nonevolutionary functional prediction methods. This method is based on a relatively simple assumption: because gene functions change as a result of evolution, reconstructing the evolutionary history of genes should help predict the functions of uncharacterized genes. The first step is the generation of a phylogenetic tree representing the evolutionary history of the gene of interest and its homologs. Such trees are distinct from clusters and other means of characterizing sequence similarity because they are inferred by special techniques that help convert patterns of similarity into evolutionary relationships (see Swofford et al. 1996). After the gene tree is inferred, biologically determined functions of the various homologs are overlaid onto the tree. Finally, the structure of the tree and the relative phylogenetic positions of genes of different functions are used to trace the history of functional changes, which is then used to predict functions of uncharacterized genes. More detail of this method is provided below.


Figure 1   Outline of a phylogenomic methodology. In this method, information about the evolutionary relationships among genes is used to predict the functions of uncharacterized genes (see text for details). Two hypothetical scenarios are presented and the path of trying to infer the function of two uncharacterized genes in each case is traced. (A) A gene family has undergone a gene duplication that was accompanied by functional divergence. (B) Gene function has changed in one lineage. The true tree (which is assumed to be unknown) is shown at the bottom. The genes are referred to by numbers (which represent the species from which these genes come) and letters (which in A represent different genes within a species). The thin branches in the evolutionary trees correspond to the gene phylogeny and the thick gray branches in A (bottom) correspond to the phylogeny of the species in which the duplicate genes evolve in parallel (as paralogs). Different colors (and symbols) represent different gene functions; gray (with hatching) represents either unknown or unpredictable functions.

Identification of Homologs

The first step in studying the evolution of a particular gene is the identification of homologs. As with similarity-based functional prediction methods, likely homologs of a particular gene are identified through database searches. Because phylogenetic methods benefit greatly from more data, it is useful to augment this initial list by using identified homologs as queries for further database searches or using automatic iterated search methods such as PSI-BLAST (Altschul et al. 1997). If a gene family is very large (e.g., ABC transporters), it may be necessary to only analyze a subset of homologs. However, this must be done with extreme care, as one might accidentally leave out proteins that would be important for the analysis.

Alignment and Masking

Sequence alignment for phylogenetic analysis has a particular purpose: it is the assignment of positional homology. Each column in a multiple sequence alignment is assumed to include amino acids or nucleotides that have a common evolutionary history, and each column is treated separately in the phylogenetic analysis. Therefore, regions in which the assignment of positional homology is ambiguous should be excluded (Gatesy et al. 1993). The exclusion of certain alignment positions (also known as masking) helps to give phylogenetic methods much of their discriminatory power. Phylogenetic trees generated without masking (as is done in many sequence analysis software packages) are less likely to accurately reflect the evolution of the genes than trees with masking.
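
As a minimal sketch of masking, the Python below drops alignment columns in which the fraction of gaps exceeds a chosen threshold. Gap content is only a crude stand-in for ambiguous positional homology, which in practice is often judged by inspecting the alignment. The function name mask_alignment, the 0.5 threshold, and the toy sequences are assumptions made for illustration and are not part of the method described in the text.

```python
def mask_alignment(alignment, max_gap_fraction=0.5):
    """Return (masked alignment, kept column indices).

    `alignment` maps sequence names to aligned strings of equal length.
    A column is kept only if its fraction of gap characters ("-") is at
    most `max_gap_fraction` (a hypothetical, illustrative criterion).
    """
    names = list(alignment)
    seqs = [alignment[name] for name in names]
    length = len(seqs[0])
    if any(len(seq) != length for seq in seqs):
        raise ValueError("all aligned sequences must have the same length")

    kept = [
        col for col in range(length)
        if sum(seq[col] == "-" for seq in seqs) / len(seqs) <= max_gap_fraction
    ]
    masked = {
        name: "".join(alignment[name][col] for col in kept) for name in names
    }
    return masked, kept


# Toy alignment (invented): column 3 is a gap in three of four sequences.
toy_alignment = {
    "gene_1": "MKT-AL",
    "gene_2": "MKTQAL",
    "gene_3": "MR--AL",
    "gene_4": "MKT-AV",
}
masked, kept = mask_alignment(toy_alignment)
print(kept)    # [0, 1, 2, 4, 5]
print(masked)  # gene_3 becomes "MR-AL": column 2 survives (1 gap in 4)
```

Returning the kept column indices alongside the masked sequences makes it possible to report exactly which positions were excluded from the tree-building step.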

Phylogenetic Trees

For extensive information about generating phylogenetic trees from sequence alignments, see Swofford et al. (1996). In summary, there are three methods commonly used: parsimony, distance, and maximum likelihood (Table 3), and each has its advantages and disadvantages. I prefer distance methods because they are the quickest when using large data sets. Before using any particular tree it is important to estimate the robustness and accuracy of the phylogenetic patterns it shows (through techniques such as the comparison of trees generated by different methods and bootstrapping). Finally, in most cases, it is also useful to determine a root for the tree.
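
Distance methods start from a matrix of pairwise evolutionary distances. The sketch below shows one way such a matrix might be computed for nucleotide sequences: the proportion of differing sites (p-distance) followed by the Jukes-Cantor correction for multiple substitutions. The text does not name a particular distance measure, so the choice of Jukes-Cantor, the function names, and the toy sequences are assumptions for illustration only; the tree-building step itself (e.g., neighbor joining) is not shown.

```python
import math
from itertools import combinations


def p_distance(seq_a, seq_b):
    """Fraction of differing sites, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(seq_a, seq_b) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    return sum(x != y for x, y in pairs) / len(pairs)


def jukes_cantor(p):
    """Jukes-Cantor corrected distance: d = -(3/4) * ln(1 - 4p/3)."""
    if p >= 0.75:
        raise ValueError("p-distance too large for the Jukes-Cantor correction")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def distance_matrix(alignment):
    """Map each unordered pair of names (sorted tuple) to a corrected distance."""
    return {
        (a, b): jukes_cantor(p_distance(alignment[a], alignment[b]))
        for a, b in combinations(sorted(alignment), 2)
    }


# Toy nucleotide alignment (invented sequences).
toy = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "ACGAACTA"}
for pair, dist in distance_matrix(toy).items():
    print(pair, round(dist, 4))
```

The correction matters because raw percent difference underestimates the amount of change between diverged sequences; corrected distances grow faster than p-distances as sequences diverge.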
                              
Table 3.   Molecular Phylogenetic Methods

Functional Predictions

To make functional predictions based on the phylogenetic tree, it is necessary to first overlay any known functions onto the tree. There are many ways this "map" can then be used to make functional predictions, but I recommend splitting the task into two steps. First, the tree can be used to identify likely gene duplication events in the past. This allows the division of the genes into groups of orthologs and paralogs (e.g., Eisen et al. 1995). Uncharacterized genes can be assigned a likely function if the function of any ortholog is known (and if all characterized orthologs have the same function). Second, parsimony reconstruction techniques (Maddison and Maddison 1992) can be used to infer the likely functions of uncharacterized genes by identifying the evolutionary scenario that requires the fewest functional changes over time (Fig. 1). The incorporation of more realistic models of functional change (and not just minimizing the total number of changes) may prove to be useful, but the parsimony minimization methods are probably sufficient in most cases.
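
The second step, parsimony reconstruction, can be sketched as follows. For each uncharacterized gene, the code tries every known function in turn and counts the minimum number of functional changes the tree then requires (using Fitch's counting procedure, with other uncharacterized genes treated as missing data); the functions that give the lowest count are the prediction. This is an illustrative sketch, not the procedure of Maddison and Maddison (1992). The nested-tuple tree encoding, the gene labels (loosely modeled on Fig. 1A), and the function names "repair" and "recombination" are all hypothetical.

```python
def leaf_names(tree):
    """Yield leaf names of a rooted binary tree given as nested 2-tuples."""
    if isinstance(tree, str):
        yield tree
    else:
        for child in tree:
            yield from leaf_names(child)


def fitch_score(tree, functions):
    """Return (state set, minimum number of changes) for `tree`.

    A leaf absent from `functions` is missing data, represented by None,
    and places no constraint on its ancestors.
    """
    if isinstance(tree, str):
        state = functions.get(tree)
        return (None if state is None else {state}), 0
    left, right = tree
    left_set, left_cost = fitch_score(left, functions)
    right_set, right_cost = fitch_score(right, functions)
    cost = left_cost + right_cost
    if left_set is None:
        return right_set, cost
    if right_set is None:
        return left_set, cost
    shared = left_set & right_set
    if shared:
        return shared, cost
    return left_set | right_set, cost + 1  # disagreement costs one change


def predict_functions(tree, known):
    """Map each uncharacterized leaf to its set of most parsimonious functions."""
    candidates = sorted(set(known.values()))
    predictions = {}
    for leaf in leaf_names(tree):
        if leaf in known:
            continue
        cost = {f: fitch_score(tree, {**known, leaf: f})[1] for f in candidates}
        best = min(cost.values())
        predictions[leaf] = {f for f, c in cost.items() if c == best}
    return predictions


# Hypothetical gene tree: a duplication gave paralog groups A and B in species 1-3.
gene_tree = ((("1A", "2A"), "3A"), (("1B", "2B"), "3B"))
known_functions = {"1A": "repair", "3A": "repair",
                   "1B": "recombination", "3B": "recombination"}
print(predict_functions(gene_tree, known_functions))
# {'2A': {'repair'}, '2B': {'recombination'}}
```

Predictions are returned as sets so that ties are reported rather than hidden: when two functions are equally parsimonious for a gene, both appear, which corresponds to the "unpredictable" category in Fig. 1.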

Is the Phylogenomic Method Worth the Trouble?

Phylogenomic methods require many more steps and usually much more manual labor than similarity-based functional prediction methods. Is the phylogenomic approach worth the trouble? Many specific examples exist in which gene function has been shown to correlate well with gene phylogeny (Eisen et al. 1995; Atchley and Fitch 1997). Although no systematic comparisons of phylogenetic versus similarity-based functional prediction methods have been done, there are a variety of reasons to believe that the phylogenomic method should produce more accurate predictions than similarity-based methods. In particular, there are many conditions in which similarity-based methods are likely to make inaccurate predictions but which can be dealt with well by phylogenetic methods (see Table 4).

                              
Table 4.   Examples of Conditions in Which Similarity Methods Produce Inaccurate Predictions of Function

A specific example helps illustrate a potential problem with similarity-based methods. Molecular phylogenetic methods show conclusively that mycoplasmas share a common ancestor with low-GC Gram-positive bacteria (Weisburg et al. 1989). However, examination of the percent similarity between mycoplasmal genes and their homologs in bacteria does not clearly show this relationship. This is because mycoplasmas have undergone an accelerated rate of molecular evolution relative to other bacteria. Thus, a BLAST search with a gene from Bacillus subtilis (a low-GC Gram-positive species) will result in a list in which the mycoplasma homologs (if they exist) score lower than genes from many species of bacteria less closely related to B. subtilis. When amounts or rates of change vary between lineages, phylogenetic methods are better able to infer evolutionary relationships than similarity methods (including clustering) because they allow for evolutionary branches to have different lengths. Thus, in those cases in which gene function correlates with gene phylogeny and in which amounts or rates of change vary between lineages, similarity-based methods will be more likely than phylogenomic methods to make inaccurate functional predictions (see Table 4).
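
The effect of rate variation can be illustrated with a toy calculation. In the sketch below, four genes sit on a tree in which one lineage (labeled "mycoplasma") has a long branch. Ranking by distance alone picks a more distant gene as the closest match to the "bacillus" query, whereas the four-point condition, a simple distance-based test that allows branches of unequal length, recovers the true sister pair. All branch lengths are invented for illustration and are not measurements; the labels, function names, and the use of the four-point condition are assumptions not taken from the text.

```python
from itertools import combinations

# Invented branch lengths on the tree ((bacillus, mycoplasma), (distant_1, distant_2)).
BRANCH = {"bacillus": 0.1, "mycoplasma": 0.9, "distant_1": 0.1, "distant_2": 0.15}
INTERNAL = 0.2  # length of the branch separating the two pairs
SISTER_PAIRS = ({"bacillus", "mycoplasma"}, {"distant_1", "distant_2"})


def path_length(a, b):
    """Tree distance between two genes: sum of branches on the path joining them."""
    crosses_internal = {a, b} not in SISTER_PAIRS
    return BRANCH[a] + BRANCH[b] + (INTERNAL if crosses_internal else 0.0)


dist = {frozenset(pair): path_length(*pair) for pair in combinations(BRANCH, 2)}


def closest_by_distance(query, dist):
    """'Highest hit': the gene with the smallest distance to the query."""
    others = sorted({name for pair in dist for name in pair} - {query})
    return min(others, key=lambda name: dist[frozenset((query, name))])


def four_point_partner(query, dist):
    """Sister of `query` among four taxa: the pairing with the smallest distance sum."""
    taxa = sorted({name for pair in dist for name in pair})
    if len(taxa) != 4 or query not in taxa:
        raise ValueError("expected exactly four taxa including the query")
    others = [t for t in taxa if t != query]

    def pairing_sum(partner):
        x, y = [t for t in others if t != partner]
        return dist[frozenset((query, partner))] + dist[frozenset((x, y))]

    return min(others, key=pairing_sum)


print(closest_by_distance("bacillus", dist))  # distant_1 (misled by the long branch)
print(four_point_partner("bacillus", dist))   # mycoplasma (the true sister)
```

The four-point test works here because it compares sums of distances across alternative pairings, so the extra length on one branch is added to every pairing that separates the true sisters and does not favor them.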

Another major advantage of phylogenetic methods over most similarity methods comes from the process of masking (see above). For example, a deletion of a large section of a gene in one species will greatly affect similarity measures but may not affect the function of that gene. A phylogenetic analysis including these genes could exclude the region of the deletion from the analysis by masking. In addition, regions of genes that are highly variable between species are more likely to undergo convergence and such regions can be excluded from phylogenetic analysis by masking. Masking thus allows the exclusion of regions of genes in which sequence similarity is likely to be "noisy" or misleading rather than a biologically important signal. The pairwise sequence comparisons used by most similarity-based functional prediction methods do not allow such masking. Phylogenetic methods have been criticized because of their dependence (for most methods) on multiple sequence alignments that are not always reliable and unbiased. However, multiple sequence alignments also allow for masking, which is probably more valuable than the cost of depending on alignments.

The conditions described above and highlighted in Table 4 are just some examples of conditions in which evolutionary methods are more likely to make accurate functional predictions than similarity-based methods. Phylogenetic methods are particularly useful when the history of a gene family includes many of these conditions (e.g., multiple gene duplications plus rate variation) or when the gene family is very large. The principle is simple: the more complicated the history of a gene family, the more useful it is to try to infer that history. Thus, although the phylogenomic method is slow and labor intensive, I believe it is worth using if accuracy is the main objective. In addition, information about the evolutionary relationships among gene homologs is useful for summarizing relationships among genes and for putting functional information into a useful context.

Despite the evolution of these methods, and likely continued improvements in functional predictions, it must be remembered that the key word is prediction. All methods are going to make inaccurate predictions of functions. For example, none of the methods described can perform well when gene functions can change with little sequence change as has been seen in proteins like opsins (Yokoyama 1997). Thus, sequence databases and genome researchers should make clear which functions assigned to genes are based on predictions and which are based on experiments. In addition, all prediction methods should use only experimentally determined functions as their grist for predictions. This will hopefully limit error propagation that can happen by using an inaccurate prediction of function to then predict the function of a new gene, which is a particular problem for the highest hit methods, as they rely on the function of only one gene at a time to make predictions (Eisen et al. 1997). Despite these and other potential problems, functional predictions are of great value in guiding research and in sorting through huge amounts of data. I believe that the increased use of phylogenetic methods can only serve to improve the accuracy of such functional predictions.

    FOOTNOTES

1 E-MAIL jeisen{at}leland.stanford.edu; FAX (650) 725-1848.

WWW: http://www-leland.stanford.edu/~jeisen.



8:163-167 ©1998 by Cold Spring Harbor Laboratory Press  ISSN 1088-9051/98 $5.00