Vol. 8, Issue 3, 163-167, March 1998
Department of Biological Sciences, Stanford University, Stanford, California 94305-5020 USA
The ability to accurately predict gene function based on gene sequence
is an important tool in many areas of biological research. Such
predictions have become particularly important in the genomics age in
which numerous gene sequences are generated with little or no
accompanying experimentally determined functional information. Almost
all functional prediction methods rely on the identification,
characterization, and quantification of sequence similarity between the
gene of interest and genes for which functional information is
available. Because sequence is the prime determining factor of function,
sequence similarity is taken to imply similarity of function. There is
no doubt that this assumption is valid in most cases. However, sequence
similarity does not ensure identical functions, and it is common for
groups of genes that are similar in sequence to have diverse (although
usually related) functions. Therefore, the identification of sequence
similarity is frequently not enough to assign a predicted function to an
uncharacterized gene; one must have a method of choosing among similar
genes with different functions. In such cases, most functional
prediction methods assign likely functions by quantifying the levels of
similarity among genes. I suggest that functional predictions can be
greatly improved by focusing on how the genes became similar in sequence
(i.e., evolution) rather than on the sequence similarity itself. It is
well established that many aspects of comparative biology can benefit
from evolutionary
studies (Felsenstein 1985), and comparative molecular biology is no
exception (e.g., Altschul et al. 1989; Goldman et al. 1996). In this
commentary, I discuss the use of evolutionary information in the
prediction of gene function. To appreciate the potential of a
phylogenomic approach to the prediction of gene function, it is
necessary to first discuss how gene sequence is commonly used to predict
gene function and some general features about gene evolution.
Sequence Similarity, Homology, and Functional Predictions
To make use of the identification of sequence similarity between genes,
it is helpful to understand how such similarity arises. Genes can become
similar in sequence either as a result of convergence (similarities that
have arisen without a common evolutionary history) or descent with
modification from a common ancestor (also known as homology). It is
imperative to recognize that sequence similarity and homology are not
interchangeable terms. Not all homologs are similar in sequence (i.e.,
homologous genes can diverge so much that similarities are difficult or
impossible to detect) and not all similarities are due to homology
(Reeck et al. 1987; Hillis 1994). Similarity due to convergence, which
is likely limited to small regions of genes, can be useful for some
functional predictions (Henikoff et al. 1997). However, most
sequence-based functional predictions are based on the identification
(and subsequent analysis) of similarities that are thought to be due to
homology. Because homology is a statement about common ancestry, it
cannot be proven directly from sequence similarity. In these cases, the
inference of homology is made based on finding levels of sequence
similarity that are thought to be too high to be due to convergence (the
exact threshold for such an inference is not well established).
Improvements in database search programs have made the identification
of likely homologs much faster, easier, and more reliable (Altschul et
al. 1997; Henikoff et al. 1998). However, as discussed above, in many
cases the identification of homologs is not sufficient to make specific
functional predictions because not all homologs have the same function.
The available similarity-based functional prediction methods can be
distinguished by how they choose the homolog whose function is most
relevant to a particular uncharacterized gene (Table 1). Some methods
are relatively simple: many researchers use the highest scoring homolog
(as determined by programs like BLAST or BLAZE) as the basis for
assigning function. While highest hit methods are very fast, can be
automated readily, and are likely accurate in many instances, they do
not take advantage of any information about how genes and gene functions
evolve. For example, gene duplication and subsequent divergence of
function of the duplicates can result in homologs with different
functions being present within one species. Specific terms have been
created to distinguish homologs in these cases (Table 2): Genes of the
same duplicate group are called orthologs (e.g., β-globin from mouse and
humans), and different duplicates are called paralogs (e.g., α- and
β-globin) (Fitch 1970). Because gene duplications are frequently
accompanied by functional divergence, dividing genes into groups of
orthologs and paralogs can improve the accuracy of functional
predictions. Recognizing that the one-to-one sequence comparisons used
by most methods do not reliably distinguish orthologs from paralogs,
Tatusov et al. (1997) developed the COG clustering method (see Table 1).
Although the COG method is clearly a major advance in identifying
orthologous groups of genes, it is limited in its power because
clustering is a way of classifying levels of similarity and is not an
accurate method of inferring evolutionary relationships (Swofford et al.
1996). Thus, as sequence similarity and clustering are not reliable
estimators of evolutionary relatedness, and as the incorporation of such
phylogenetic information has been so useful to other areas of biology,
evolutionary techniques should be useful for improving the accuracy of
predicting function based on sequence similarity.
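To make the ortholog/paralog distinction concrete, the sketch below labels pairs of genes on a small gene tree. It is an illustration under stated assumptions, not a method described in this commentary: the tree is a hypothetical hand-written globin example, and the duplication test is the simple "species overlap" heuristic (a node is treated as a duplication if the same species appears on both sides of it).

```python
# Hypothetical gene tree as nested tuples; each leaf is (gene_name, species).
GENE_TREE = (
    (("alpha-globin", "human"), ("alpha-globin", "mouse")),
    (("beta-globin", "human"), ("beta-globin", "mouse")),
)


def is_leaf(node):
    # A leaf is a (gene_name, species) pair of strings.
    return isinstance(node[0], str)


def leaves(node):
    # Collect all leaves below a node.
    if is_leaf(node):
        return [node]
    return [leaf for child in node for leaf in leaves(child)]


def classify_pairs(node, result=None):
    """Label each gene pair by the event inferred at its last common ancestor."""
    if result is None:
        result = {}
    if is_leaf(node):
        return result
    left, right = node  # assumes a strictly binary tree
    left_leaves, right_leaves = leaves(left), leaves(right)
    left_species = {species for _, species in left_leaves}
    right_species = {species for _, species in right_leaves}
    # Species-overlap heuristic (an assumption of this sketch):
    # shared species on both sides means the node is a duplication.
    label = "paralogs" if left_species & right_species else "orthologs"
    for a in left_leaves:
        for b in right_leaves:
            result[frozenset([a, b])] = label
    classify_pairs(left, result)
    classify_pairs(right, result)
    return result


pairs = classify_pairs(GENE_TREE)
for pair, label in pairs.items():
    print(sorted(pair), label)
```

In this toy tree the root is inferred to be a duplication, so every alpha/beta pair is labeled as paralogous, while the human and mouse copies within each globin group are labeled as orthologs.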
Table 1. Methods of Predicting Gene Function When Homologs Have Multiple Functions
Table 2. Types of Molecular Homology
Phylogenomics
There are many ways in which evolutionary information can be used to
improve functional predictions. Below, I present an outline of one such
phylogenomic method (see Fig. 1), and I compare this method to
nonevolutionary functional prediction methods. This method is based on a
relatively simple assumption: because gene functions change as a result
of evolution, reconstructing the evolutionary history of genes should
help predict the functions of uncharacterized genes. The first step is
the generation of a phylogenetic tree representing the evolutionary
history of the gene of interest and its homologs. Such trees are
distinct from clusters and other means of characterizing sequence
similarity because they are inferred by special techniques that help
convert patterns of similarity into evolutionary relationships (see
Swofford et al. 1996). After the gene tree is inferred, biologically
determined functions of the various homologs are overlaid onto the tree.
Finally, the structure of the tree and the relative phylogenetic
positions of genes of different functions are used to trace the history
of functional changes, which is then used to predict functions of
uncharacterized genes. More detail of this method is provided below.
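The last two steps of this outline (overlaying known functions on the tree, then tracing functional changes to predict unknown functions) can be sketched in a few lines. The example below is only one possible reading: it uses Fitch parsimony, which is my choice for illustration and is not named in the text, and the tree, gene names, and function labels are all hypothetical.

```python
def fitch_sets(node, known, all_states, sets):
    """Bottom-up pass: compute the candidate function set for every node."""
    if isinstance(node, str):  # a leaf is a gene name
        state = known.get(node)
        # Uncharacterized genes are treated as compatible with any function.
        sets[node] = {state} if state is not None else set(all_states)
        return sets[node]
    left, right = (fitch_sets(child, known, all_states, sets) for child in node)
    shared = left & right
    # Fitch rule: intersection if non-empty, otherwise union.
    sets[node] = shared if shared else left | right
    return sets[node]


def assign(node, parent_state, sets, out):
    """Top-down pass: pick one function per node, preferring the parent's."""
    options = sets[node]
    state = parent_state if parent_state in options else min(options)
    if isinstance(node, str):
        out[node] = state
    else:
        for child in node:
            assign(child, state, sets, out)
    return out


def predict_functions(tree, known):
    """Return a function label for every leaf, including uncharacterized ones."""
    all_states = set(known.values())
    sets = {}
    fitch_sets(tree, known, all_states, sets)
    return assign(tree, None, sets, {})


# Hypothetical gene tree (nested tuples) and experimentally known functions.
tree = ((("gene1", "gene2"), "query"), ("gene3", "gene4"))
known = {"gene1": "function_A", "gene2": "function_A",
         "gene3": "function_B", "gene4": "function_B"}
print(predict_functions(tree, known)["query"])
```

Here the uncharacterized gene groups with the function_A genes, so that function is predicted for it regardless of how similar it is to any single homolog. Moving the same gene next to the function_B clade changes the prediction accordingly.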
Identification of Homologs
The first step in studying the evolution of a particular gene is the identification of homologs. As with similarity-based functional prediction methods, likely homologs of a particular gene are identified through database searches. Because phylogenetic methods benefit greatly from more data, it is useful to augment this initial list by using identified homologs as queries for further database searches or using automatic iterated search methods such as PSI-BLAST (Altschul et al. 1997).
Alignment and Masking
Sequence alignment for phylogenetic analysis has a particular purpose:
it is the assignment of positional homology. Each column in a multiple
sequence alignment is assumed to include amino acids or nucleotides that
have a common evolutionary history, and each column is treated
separately in the phylogenetic analysis. Therefore, regions in which the
assignment of positional homology is ambiguous should be excluded
(Gatesy et al. 1993). This exclusion of alignment regions is what is
meant here by masking.
Phylogenetic Trees
For extensive information about generating phylogenetic trees from sequence alignments, see Swofford et al. (1996).
Functional Predictions
To make functional predictions based on the phylogenetic tree, it is necessary to first overlay any known functions onto the tree. There are many ways this "map" can then be used to make functional predictions, but I recommend splitting the task into two steps. First, the tree can be used to identify likely gene duplication events in the past. This allows the division of the genes into groups of orthologs and paralogs (e.g., Eisen et al. 1995).
Is the Phylogenomic Method Worth the Trouble?
Phylogenomic methods require many more steps and usually much more
manual labor than similarity-based functional prediction methods. Is the
phylogenomic approach worth the trouble? Many specific examples exist in
which gene function has been shown to correlate well with gene phylogeny
(Eisen et al. 1995; Atchley and Fitch 1997). Although no systematic
comparisons of phylogenetic versus similarity-based functional
prediction methods have been done, there are a variety of reasons to
believe that the phylogenomic method should produce more accurate
predictions than similarity-based methods. In particular, there are many
conditions in which similarity-based methods are likely to make
inaccurate predictions but which can be dealt with well by phylogenetic
methods (see Table 4).
A specific example helps illustrate a potential problem with
similarity-based methods. Molecular phylogenetic methods show
conclusively that mycoplasmas share a common ancestor with low-GC
Gram-positive bacteria (Weisburg et al. 1989). However, examination of
the percent similarity between mycoplasmal genes and their homologs in
bacteria does not clearly show this relationship. This is because
mycoplasmas have undergone an accelerated rate of molecular evolution
relative to other bacteria. Thus, a BLAST search with a gene from
Bacillus subtilis (a low-GC Gram-positive species) will result in a list
in which the mycoplasma homologs (if they exist) score lower than genes
from many species of bacteria less closely related to B. subtilis. When
amounts or rates of change vary between lineages, phylogenetic methods
are better able to infer evolutionary relationships than similarity
methods (including clustering) because they allow for evolutionary
branches to have different lengths. Thus, in those cases in which gene
function correlates with gene phylogeny and in which amounts or rates of
change vary between lineages, similarity-based methods will be more
likely than phylogenomic methods to make inaccurate functional
predictions (see Table 4).
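The effect of unequal rates can be shown with a small numerical sketch. The distances below are hypothetical: they were derived by hand from an assumed four-taxon tree in which the mycoplasma branch is ten times longer than the others, and they are not measured values. The tree-aware step uses the four-point condition for additive distances, which stands in here for the full phylogenetic methods discussed in the text.

```python
# Hypothetical distances implied by an assumed unrooted tree:
#   B_subtilis (branch 1) and mycoplasma (branch 10) join at one internal node,
#   species_X (branch 1) and species_Y (branch 1) join at the other,
#   and the internal branch has length 1.
DIST = {
    frozenset(["B_subtilis", "mycoplasma"]): 11,
    frozenset(["B_subtilis", "species_X"]): 3,
    frozenset(["B_subtilis", "species_Y"]): 3,
    frozenset(["mycoplasma", "species_X"]): 12,
    frozenset(["mycoplasma", "species_Y"]): 12,
    frozenset(["species_X", "species_Y"]): 2,
}


def d(a, b):
    # Look up the distance between two taxa, in either order.
    return DIST[frozenset([a, b])]


def closest_by_distance(query, others):
    """Similarity-style answer: the taxon with the smallest distance to the query."""
    return min(others, key=lambda other: d(query, other))


def quartet_split(a, b, c, e):
    """Four-point condition: for additive distances, the true split
    has the smallest sum of within-pair distances."""
    splits = [((a, b), (c, e)), ((a, c), (b, e)), ((a, e), (b, c))]
    return min(splits, key=lambda s: d(*s[0]) + d(*s[1]))


taxa = ["mycoplasma", "species_X", "species_Y"]
print(closest_by_distance("B_subtilis", taxa))
print(quartet_split("B_subtilis", "mycoplasma", "species_X", "species_Y"))
```

Ranking by distance alone places a more distantly related species closest to B. subtilis, whereas the four-point test groups B. subtilis with the mycoplasma because it does not require all branches to be the same length.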
Another major advantage of phylogenetic methods over most similarity methods comes from the process of masking (see above). For example, a deletion of a large section of a gene in one species will greatly affect similarity measures but may not affect the function of that gene. A phylogenetic analysis including these genes could exclude the region of the deletion from the analysis by masking. In addition, regions of genes that are highly variable between species are more likely to undergo convergence and such regions can be excluded from phylogenetic analysis by masking. Masking thus allows the exclusion of regions of genes in which sequence similarity is likely to be "noisy" or misleading rather than a biologically important signal. The pairwise sequence comparisons used by most similarity-based functional prediction methods do not allow such masking. Phylogenetic methods have been criticized because of their dependence (for most methods) on multiple sequence alignments that are not always reliable and unbiased. However, multiple sequence alignments also allow for masking, which is probably more valuable than the cost of depending on alignments.
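A minimal sketch of the deletion case follows. The alignment is a made-up toy example, and the masking rule (drop any column whose gap fraction exceeds a threshold) is an assumed stand-in for the judgment about ambiguous positional homology; the text does not prescribe a specific criterion.

```python
def mask_columns(alignment, max_gap_fraction=0.0):
    """Drop alignment columns whose gap fraction exceeds the threshold."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = [
        i for i in range(length)
        if sum(alignment[name][i] == "-" for name in names) / len(names)
        <= max_gap_fraction
    ]
    return {name: "".join(alignment[name][i] for i in keep) for name in names}


def percent_identity(seq_a, seq_b):
    """Percent of aligned columns with identical characters.
    A gap opposite a residue counts as a mismatch."""
    matches = sum(x == y for x, y in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical toy alignment; gene_b carries a four-residue deletion.
alignment = {
    "gene_a": "MKTAYIAKQR",
    "gene_b": "MKT----KQR",
    "gene_c": "MKSAYLAKQR",
}

before = percent_identity(alignment["gene_a"], alignment["gene_b"])
masked = mask_columns(alignment)
after = percent_identity(masked["gene_a"], masked["gene_b"])
print(before, after)
```

With the deleted region included, gene_a and gene_b look only 60% identical; once the gapped columns are masked, the remaining columns are identical. The choice of threshold is a design decision, and a stricter or looser rule would keep a different set of columns.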
The conditions described above and highlighted in Table 4 are just some
examples of conditions in which evolutionary methods are more likely to
make accurate functional predictions than similarity-based methods.
Phylogenetic methods are particularly useful when the history of a gene
family includes many of these conditions (e.g., multiple gene
duplications plus rate variation) or when the gene family is very large.
The principle is simple: the more complicated the history of a gene
family, the more useful it is to try to infer that history. Thus,
although the phylogenomic method is slow and labor intensive, I believe
it is worth using if accuracy is the main objective. In addition,
information about the evolutionary relationships among gene homologs is
useful for summarizing relationships among genes and for putting
functional information into a useful context.
Despite the evolution of these methods, and likely continued
improvements in functional predictions, it must be remembered that the
key word is prediction. All methods are going to make inaccurate
predictions of functions. For example, none of the methods described can
perform well when gene functions can change with little sequence change,
as has been seen in proteins like opsins (Yokoyama 1997). Thus, sequence
databases and genome researchers should make clear which functions
assigned to genes are based on predictions and which are based on
experiments. In addition, all prediction methods should use only
experimentally determined functions as their grist for predictions. This
will hopefully limit error propagation that can happen by using an
inaccurate prediction of function to then predict the function of a new
gene, which is a particular problem for the highest hit methods, as they
rely on the function of only one gene at a time to make predictions
(Eisen et al. 1997). Despite these and other potential problems,
functional predictions are of great value in guiding research and in
sorting through huge amounts of data. I believe that the increased use
of phylogenetic methods can only serve to improve the accuracy of such
functional predictions.
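The recommendation to draw only on experimentally determined functions can be illustrated with a small highest hit sketch. Everything in it is assumed for illustration: the `Hit` record, the scores, and the gene and function names are hypothetical and do not reflect the output format of BLAST or any other search program.

```python
from typing import NamedTuple, Optional


class Hit(NamedTuple):
    gene: str
    score: float
    function: str
    experimental: bool  # True if the function was determined by experiment


def highest_hit_function(hits, experimental_only=True) -> Optional[str]:
    """Transfer the function of the top-scoring hit.

    With experimental_only=True, hits whose function is itself only a
    prediction are ignored, so predictions are never built on predictions.
    """
    candidates = [h for h in hits if h.experimental or not experimental_only]
    if not candidates:
        return None
    return max(candidates, key=lambda h: h.score).function


# Hypothetical search results for one uncharacterized gene.
hits = [
    Hit("geneP", 320.0, "function_X", experimental=False),
    Hit("geneQ", 290.0, "function_Y", experimental=True),
]

print(highest_hit_function(hits))
print(highest_hit_function(hits, experimental_only=False))
```

In this toy case the top-scoring hit carries only a predicted function, so the unrestricted call would pass that prediction along, while the restricted call falls back to the best hit with experimental support. The sketch still relies on a single gene at a time, which is the limitation of highest hit methods noted above.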
FOOTNOTES
1 E-MAIL jeisen{at}leland.stanford.edu; FAX (650) 725-1848. WWW: http://www-leland.stanford.edu/~jeisen.
REFERENCES