Qualifying the relationship between sequence conservation and molecular function

Gregory M. Cooper (1,3,4) and Christopher D. Brown (2,3)

1 Department of Genome Sciences, University of Washington, Seattle, Washington 98195, USA
2 Institute for Genomics and Systems Biology, University of Chicago, Chicago, Illinois 60637, USA
3 These authors contributed equally to this work.

Abstract

Quantification of evolutionary constraints via sequence conservation can be leveraged to annotate genomic functional sequences. Recent efforts addressing the converse of this relationship have identified many sites in metazoan genomes with molecular function but without detectable conservation between related species. Here, we discuss explanations and implications for these results considering both practical and theoretical issues. In particular, phylogenetic scope influences the relationship between sequence conservation and function. Comparisons of distantly related species can detect constraint with high specificity due to the loss of conserved neutral sequence, but sensitivity is sacrificed as a result of functional changes related to lineage-specific biology. The strength of natural selection operating on functional sequence is also important. Mutations to functional sequences that result in small fitness effects are subject to weaker constraints. Therefore, particularly when comparing highly divergent species, functional sequences that are degenerate or biologically redundant will be prone to turnover, wherein functional sequences are replaced by effectively equivalent, but nonorthologous counterparts. Finally, considering the size and complexity of metazoan genomes and the fact that many nonconserved sequences are associated with sequence-degenerate, low-level molecular functions, we find it likely that there exist many biochemically functional sequences that are not under constraint. This hypothesis does not lead to the conclusion that huge amounts of vertebrate genomes are functionally important, but rather that such “functionality” represents molecular noise that has weak or no effect on organismal phenotypes.

Introduction

The identification of functional elements within large complex genomes has been aided by comparative genomics, in particular, via the quantification of evolutionary constraints (Pennacchio et al. 2001, 2006; Göttgens et al. 2002; Kellis et al. 2003). Recently, however, high-throughput functional genomics techniques have allowed for an initial assessment of the converse relationship, namely, the quantification of selective constraint on large, unbiased collections of functional elements. These studies include cell-based assays on a genomic scale (Kim et al. 2005; Borneman et al. 2007; The ENCODE Project Consortium 2007; Heintzman et al. 2007; Xi et al. 2007) and in vivo assays for developmentally important functions in individual loci in animal model organisms (Fisher et al. 2006; Brown et al. 2007). Interestingly, they have demonstrated that there are large numbers of functional sequences that are not detectably conserved across both distant (McGaughey et al. 2008, this issue) and close (Moses et al. 2006; Margulies et al. 2007) evolutionary timescales. This lack of conservation has several explanations in principle, each of which has distinct implications for functional annotation of complex genomes and a better understanding of genomic evolution. Here, we address these possibilities in light of variation in phylogenetic scope and the quantitative relationship between sequence function and evolutionary rate.

The basic premise

The application of comparative sequence analysis to annotate genomic functional sequences is dependent upon the basic principles laid out by Kimura in the neutral theory of molecular evolution (Kimura 1983). Most evolutionary change between species is the result of mutations with minimal or no functional impact that are fixed via random genetic drift. In contrast, mutations in functional elements (e.g., exons, cis-regulatory elements) are likely to impair function, be deleterious to the organism, and subsequently be eliminated by purifying selection. The detection of sequences affected by purifying selection, which are said to be under evolutionary constraint, can therefore be used to annotate functional sites in genomes. Detection and quantification of constraint is usually accomplished through statistical evaluations of interspecific genomic sequence conservation. We note that it is important to distinguish “conservation,” which is an observation of similarity, from “constraint,” which is a hypothesis about the effects of purifying selection. Conservation, when observed to be in excess of the levels predicted by a neutral model, can be used to infer constraint. However, the presence of conservation does not necessarily imply constraint nor does its absence imply a lack of constraint. This distinction is critical to the interpretation of results from comparative genomic analyses. Indeed, conservation statistics should never be utilized in the absence of the context provided by the levels of neutral sequence conservation/divergence.
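
The distinction between conservation and constraint can be illustrated numerically. The following Python sketch is not drawn from any published method. It assumes that aligned sites are independent and that a single per-site probability of identity under neutrality applies to the whole window (`p_neutral_identity`, a hypothetical parameter that would in practice be estimated from putatively neutral sequence). It computes the one-sided binomial probability of observing at least as many identical sites as were seen.

```python
from math import comb


def binomial_upper_tail(n_sites, n_identical, p_neutral_identity):
    """P(X >= n_identical) for X ~ Binomial(n_sites, p_neutral_identity).

    A small value means the observed conservation exceeds what the
    (assumed) neutral model predicts, which is the basis for inferring
    constraint.
    """
    return sum(
        comb(n_sites, k)
        * p_neutral_identity ** k
        * (1.0 - p_neutral_identity) ** (n_sites - k)
        for k in range(n_identical, n_sites + 1)
    )


# Same observation (45 of 50 aligned sites identical), two illustrative
# neutral backgrounds. Both identity values are assumptions for the example.
p_close = binomial_upper_tail(50, 45, 0.95)    # closely related species
p_distant = binomial_upper_tail(50, 45, 0.65)  # more divergent species

print(f"neutral identity 0.95: P = {p_close:.3g}")
print(f"neutral identity 0.65: P = {p_distant:.3g}")
```

Under these assumptions, the same observation is unremarkable when neutral identity is 0.95 but highly unexpected when it is 0.65. This is why a conservation statistic carries little information about constraint without the neutral context.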

Phylogenetic scope

One of the most important parameters of a comparative genomics study is phylogenetic scope, defined as the minimal evolutionary span that captures all of the included species. For example, analyses comparing sequence from human, mouse, and dog have a placental mammalian scope. Because constraint analyses require an assumption of orthology (or at the very least homology), the phylogenetic scope of an analysis limits its sensitivity to those functional sequences that were present in the last common ancestor of the included species. Phylogenetic scope is also correlated with levels of genomic sequence divergence, defined in this context as the average number of nucleotide changes affecting neutral sites. Since the inference of constraint requires a statistically significant difference between the conservation seen for neutral sites and that seen for constrained sites, the level of neutral sequence divergence is a direct contributor to the specificity of a sequence comparison.

Phylogenetic scope thus has direct, predictable consequences on specificity and sensitivity for a comparative analysis. It is difficult to measure the effects of selection on any given nucleotide of the human genome when comparing only closely related ape genome sequences (Eddy 2005; Stone et al. 2005), for example, as the vast majority of neutral nucleotides remain conserved between these species. On the other hand, comparisons between human and more distant vertebrates like fishes, or even among distantly related fishes like zebrafish and Fugu, are so divergent that neutral sites have been completely saturated with nucleotide changes (both substitutions and deletions), and any sequence that is reliably aligned between these species is almost certainly under constraint. However, such comparisons are known to miss a large number of highly constrained lineage-specific functional elements (Cooper et al. 2005). Thus, it should not be regarded as surprising that many functional elements are not conserved when comparing extremely distant species (e.g., as seen in McGaughey et al. 2008).
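
The effect of neutral divergence on specificity can be sketched with a simple substitution model. The Python example below assumes the Jukes-Cantor (JC69) model, under which the expected identity at a neutral site separated by d substitutions per site is 1/4 + (3/4)exp(-4d/3). The divergence values are illustrative assumptions rather than estimates for any particular species pair. The sketch ignores insertions, deletions, rate heterogeneity, and alignment error.

```python
from math import exp


def jc69_neutral_identity(d):
    """Expected per-site identity at neutral sites under JC69,
    for a divergence of d substitutions per site."""
    return 0.25 + 0.75 * exp(-4.0 * d / 3.0)


def chance_fully_conserved(d, window):
    """Probability that `window` independent neutral sites are all
    identical by chance at divergence d."""
    return jc69_neutral_identity(d) ** window


# Illustrative divergences (assumed values): a narrow, an intermediate,
# and a very wide phylogenetic scope.
for label, d in [("narrow", 0.01), ("intermediate", 0.5), ("wide", 3.0)]:
    print(
        f"{label:>12}: neutral identity = {jc69_neutral_identity(d):.3f}, "
        f"P(20 bp fully conserved by chance) = {chance_fully_conserved(d, 20):.3g}"
    )
```

At low divergence, most neutral 20-bp windows remain fully conserved, so conservation carries little evidence of constraint. At high divergence, chance conservation of such a window becomes vanishingly rare, in line with the high specificity described above.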

Sequence function and evolutionary rate

The sensitivity of constraint-based methods to identify functional sequence is also dependent on the quantitative relationship between sequence function and evolutionary rate, which is mediated by the strength and efficacy of natural selection. In general, nucleotides with important molecular functions will evolve more slowly than the rate predicted by a neutral model. However, this is not a discrete phenomenon. The selection coefficient, a quantitative measure of the effects of selective pressure, varies continuously in relation to both the sensitivity of the molecular function to nucleotide change (degeneracy) and the importance of the molecular function to survival and reproductive success (dispensability). Quantitative variation in selection coefficients in turn produces quantitative variation in the rate of sequence change. That this is a generalizable property of both protein-coding and noncoding sequences is supported by several lines of evidence.
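
The continuous dependence of evolutionary rate on selection strength can be made concrete with the standard diffusion approximation for the fixation probability of a new mutation. The Python sketch below assumes additive (genic) selection in a diploid population and a small selection coefficient s, so that the substitution rate relative to neutrality depends only on the scaled coefficient S = 4Ns, where N is the effective population size, and equals S / (1 - exp(-S)). The function name and the example values of S are hypothetical choices for illustration.

```python
from math import exp, expm1


def relative_substitution_rate(scaled_s):
    """Substitution rate relative to the neutral rate for mutations with
    scaled selection coefficient S = 4*N*s (diffusion approximation,
    additive selection, small s). Negative S means deleterious."""
    if scaled_s == 0.0:
        return 1.0  # neutral limit
    if scaled_s > 0.0:
        # S / (1 - exp(-S)), written with expm1 for numerical accuracy.
        return scaled_s / (-expm1(-scaled_s))
    # Algebraically identical form that avoids overflow for very negative S.
    return scaled_s * exp(scaled_s) / expm1(scaled_s)


# Illustrative (assumed) values spanning neutral to strongly deleterious.
for s_scaled in [0.0, -0.1, -1.0, -5.0, -10.0, -50.0]:
    print(f"S = {s_scaled:6.1f}  relative rate = {relative_substitution_rate(s_scaled):.3g}")
```

Under this approximation the relative rate declines smoothly from 1 as S becomes more negative. Weakly deleterious changes (S near -1) still accumulate at an appreciable fraction of the neutral rate, which is consistent with the point that constraint is not a discrete phenomenon.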

With respect to coding DNA, it is well established that proteins evolve at vastly different rates. Protein expression level, functional category, structural characteristics, and participation in intermolecular interactions have all been suggested to contribute to this evolutionary rate variation (Li 1997; Pal et al. 2001; Wall et al. 2005; Drummond et al. 2006; Kim et al. 2006). In addition, within a given protein, the rates of evolution of individual amino acids vary greatly, largely as a result of the structure-function requirements for a given amino acid at a particular position within that protein. For example, active sites of enzymes, DNA-binding domains of transcription factors, and residues important for structural maintenance evolve slowly, as substitutions in these residues are particularly deleterious (Suckow et al. 1996; Simon et al. 2002).

With respect to other classes of functional sequence, recent estimates suggest that 70% of the nucleotides evolving under purifying selection in mammalian genomes are not within exons of protein-coding genes (“noncoding”) and, except for the extreme constraint seen on some critical proteins (e.g., histones), the range of selection coefficients affecting these positions appears similar to that for protein-coding DNA (Mouse Genome Sequencing Consortium 2002; Rat Genome Sequencing Project Consortium 2004; King et al. 2007). Furthermore, analysis of the regulatory function and biochemical specificity of individual transcription-factor binding sites also supports the presence of a quantitative spectrum of selective strength in noncoding functional sequences. Across transcription-factor binding sites, sites that contribute more to the total regulatory activity of a cis-regulatory element accumulate fewer substitutions than those that contribute less (Brown et al. 2007). In addition, nucleotide-by-nucleotide binding specificity within a transcription-factor binding site is inversely proportional to the evolutionary rate of the position (Mirny and Gelfand 2002; Moses et al. 2003).

Interpreting nonconserved genomic functionality

Results from constraint-based comparative genomic analyses should be interpreted in light of the principles described above in addition to practical considerations. The discovery of many nonconserved functional sequences in metazoan genomes can thus be explained by several nonexclusive possibilities, including technical challenges, divergent biology related to phylogenetic scope, loss of conservation resulting from weak constraints, and unconstrained molecular functionality. We address each of these explanations in turn.

Technical challenges

Some constrained functional elements are likely to be misclassified as nonconserved (“false negatives”) by comparative sequence analyses due to technical challenges. Consider a small functional element (<10 bp) present within a long stretch of neutral sequence. Even if the element itself is highly constrained and persistent across a wide phylogenetic scope, without similar sequence nearby to provide a reliable alignment “anchor” (Batzoglou 2005), such an element would likely not manifest as a conserved sequence. While such obstacles are more problematic when comparing highly divergent species, they are not restricted to comparisons in wide scopes. Genomic sequence alignment, a prerequisite to any constraint-based analysis, remains a challenging problem even for relatively closely related species (Pollard et al. 2006; Margulies et al. 2007).

Experimental limitations also contribute to false negatives. For example, many functional genomic datasets are plagued by poor resolution: Transcription factor “binding sites” identified by “ChIP-chip” experiments, for example, often span hundreds of nucleotides, while the extent of a functional sequence is likely to be substantially smaller. This problem has been shown to obscure the relationship between constraint and function (Brown et al. 2007; Margulies et al. 2007) since the conservation signal indicative of constraint on the functional nucleotides is diluted by the noise resulting from the inclusion of many nonfunctional and neutrally evolving sites. In addition, nearly all sequence comparisons of functional sites derive functional annotation from only one species. Simultaneous annotation of function independently in multiple species can significantly clarify the relationship between sequence conservation and molecular function, contrasting conservation that may simply be obscured due to technical challenges (Brown et al. 2007) from legitimate primary sequence turnover of functional binding sites (Borneman et al. 2007; Odom et al. 2007).
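
The dilution effect of poor experimental resolution can be shown with simple arithmetic. The Python sketch below assumes a single functional site embedded in an experimentally defined region, with one per-site identity for constrained positions and another for neutral positions. All lengths and identity values are hypothetical and chosen only to illustrate the weighted average.

```python
def mean_region_identity(region_length, functional_length,
                         constrained_identity, neutral_identity):
    """Average per-site identity over a region in which only
    `functional_length` sites are constrained and the rest are neutral."""
    if not 0 <= functional_length <= region_length or region_length == 0:
        raise ValueError("functional_length must lie within a non-empty region")
    neutral_length = region_length - functional_length
    total = (functional_length * constrained_identity
             + neutral_length * neutral_identity)
    return total / region_length


# Assumed values: a 10-bp binding site with identity 0.95 against a
# neutral background with identity 0.65.
low_resolution = mean_region_identity(500, 10, 0.95, 0.65)   # 500-bp region
high_resolution = mean_region_identity(20, 10, 0.95, 0.65)   # 20-bp region

print(f"500-bp region: mean identity = {low_resolution:.3f}")
print(f" 20-bp region: mean identity = {high_resolution:.3f}")
```

With these assumed numbers, the 500-bp region is nearly indistinguishable from the neutral background, whereas the same site annotated at 20-bp resolution stands out clearly.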

Divergent biology

Pathway modularity and functional exaptation notwithstanding, functional elements that relate to environmental, developmental, physiological, or other biological factors that are not common to the entire phylogenetic scope of an analysis are likely to be systematically missed. Indeed, it has been shown that many regulatory elements in the human genome are restricted to particular clades and are likely to play important, but clade-specific roles (King et al. 2007); sequences involved in the articulation of digits in the developing mammalian limb bud are unlikely to be systematically captured in a human–fish comparison, for example. Even for those elements present in the common ancestral genome, changes in genomic or biological context that alter the strength of selection are likely to be major contributors to a loss of sensitivity in the detection of constraint. Lineage-specific loss of function, for example, can have a major effect on sensitivity even when only a minor subset of the analyzed lineages is affected (Stone et al. 2005). Additionally, even for functionality that is persistent across the entire phylogenetic scope, changes in genomic context can lead to decreased sensitivity. Duplication events, a prominent feature in the evolution of genomes (Ohno 1970; Wolfe and Shields 1997; Dehal and Boore 2005), in principle, allow for relaxed constraint on one or both copies of a duplicated functional element (Lynch and Conery 2000; Kondrashov et al. 2002). As such, inclusion of only one member of a lineage-specific duplicated sequence, as is routinely done by the popular genomic sequence alignment tools (Margulies et al. 2007), will provide an incomplete picture of the constraint–function relationship.

Weak constraints

The strength of selection operating on any particular genomic sequence is related to both the sequence degeneracy and organismal importance of its molecular function. As such, it is anticipated that functional sequences that have a small influence on organismal fitness or are sequence-degenerate will be under weaker evolutionary constraints and thus more likely to change or “turn over” as the amount of neutral divergence increases. For example, enhancer sequences that contribute only a small portion of the total regulatory information for a given gene have been shown to evolve more swiftly than enhancers with larger effect, even when they regulate genes with critical developmental function (Brown et al. 2007). Additionally, consider transcriptional promoters of human protein-coding genes (Trinklein et al. 2003; Kim et al. 2005): While these regions are important for transcriptional regulation and strongly enriched for constrained sequences, many individual promoters lack strong sequence conservation, even among placental mammals (The ENCODE Project Consortium 2007), and may be influenced by a significant level of individual binding site changes (Odom et al. 2007). This is likely a consequence of flexibility in the sequences that can give rise to promoter function, relating to either the sequence degeneracy or the redundancy of individual functional elements. An additional possibility is the need for secondary structural or other characteristics that are only indirectly related to primary sequence (e.g., Greenbaum et al. 2007). Altogether, these observations suggest that promoter sequences are generally under constraint and as a class evolve more slowly than neutral DNA, but possess enough sequence degeneracy that they are nonetheless affected by a significant level of nucleotide divergence.

Unconstrained molecular functionality

We speculate that there may be many sequences capable of molecular function in complex genomes, but lacking any significant effect on organismal fitness. Such sequences would evolve neutrally and therefore contribute to the discovery of nonconserved functional sequences. For example, recent studies in human cells describe extensive but low-level transcriptional activity spread across the genome, the vast majority of which yields no detectable signals of evolutionary constraint in mammalian genomic sequence (The ENCODE Project Consortium 2007; Kapranov et al. 2007; Margulies et al. 2007). While it certainly is possible that some of these functional sequences are under constraint but appear to be false negatives for reasons described above, two observations support the idea that many are truly not under constraint. First, some classes of experimentally annotated functional sequences fail to show enrichment for constrained nucleotides (Margulies et al. 2007). If these elements were truly, but weakly, constrained, some enrichment would be expected, as is seen for promoters of protein-coding genes. Second, bulk distribution analyses comparing rates of evolution in ancient mobile element insertion fragments (“ancestral repeats” or “ARs”) to those in unique sequence find that there are unlikely to be a large number of truly constrained bases in the human genome that are not currently annotated (The ENCODE Project Consortium 2007). While it is clear that some ARs include functionally constrained DNA (Cooper et al. 2005; Bejerano et al. 2006; Xie et al. 2006), most are unlikely to possess specific and important molecular functions. Considering then that they can often be recognized as orthologous, alignable DNA amongst related mammals, ARs are likely to constitute a good empirical model for neutral evolution. This hypothesis is supported by the global regional correlations between rates of evolution at these sites and synonymous sites in protein-coding genes, and also by a strong concordance of results between AR-based and independently constructed null models (Mouse Genome Sequencing Consortium 2002; Hardison et al. 2003; Margulies et al. 2007).

If these functional sequences are under little to no constraint, it becomes critical to characterize their origins. One possibility is a result of the interplay between functional degeneracy and genome size and complexity. Indeed, given the impossibility of perfect molecular fidelity, we speculate that such “molecular noise” must be a common phenomenon, particularly for those functions that would arise frequently in large genomes at random and have a very minimal impact on the overall molecular activity of the cell. For example, given the variety of primary sequences that can give rise to their function, there are likely to be many transcriptional promoters occurring at random in the human genome; in fact, mobile elements like Alus are capable of some promoter function, and randomly selected fragments of the human genome often show at least minimal promoter activity (Smit 1996; Khambata-Ford et al. 2003). Furthermore, such events may even show reproducible spatiotemporal specificity due to differential local chromatin regulation (Thurman et al. 2007). Thus, it is plausible, if not likely, to expect low levels of reproducible transcriptional activity and weak protein–DNA binding widely distributed across large complex genomes with no particular purpose.
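
The argument that degenerate functional sequences should arise frequently at random in large genomes can be illustrated with a back-of-the-envelope calculation. The Python sketch below assumes uniform and independent base composition, which real genomes violate, and counts both strands separately, which overcounts palindromic motifs. The motif definitions are hypothetical, and the genome length of 3 x 10^9 bp is an approximate figure for the human haploid genome.

```python
from math import prod


def expected_chance_matches(allowed_bases_per_position, genome_length,
                            both_strands=True):
    """Expected number of chance matches to a motif in random sequence.

    `allowed_bases_per_position` lists, for each motif position, how many
    of the four bases are tolerated (1 = fully specified, 4 = any base).
    """
    k = len(allowed_bases_per_position)
    p_match = prod(a / 4.0 for a in allowed_bases_per_position)
    n_positions = max(genome_length - k + 1, 0)
    strands = 2 if both_strands else 1
    return strands * n_positions * p_match


GENOME_LENGTH = 3_000_000_000  # approximate human haploid genome size (assumed)

exact_8mer = [1] * 8                          # hypothetical fully specified motif
degenerate_8mer = [1, 1, 2, 1, 4, 1, 2, 1]    # hypothetical degenerate motif

print(f"exact 8-mer:      {expected_chance_matches(exact_8mer, GENOME_LENGTH):,.0f}")
print(f"degenerate 8-mer: {expected_chance_matches(degenerate_8mer, GENOME_LENGTH):,.0f}")
```

Even under these crude assumptions, short or degenerate motifs are expected to occur by chance many thousands of times in a genome of this size, which is consistent with the expectation of widespread low-level binding and transcriptional activity.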

Conclusions

Large amounts of sequence data have provided a wealth of insights into the evolution of genes and genomes from the perspective of mutation and divergence. Improvements in functional genomics technologies and the development of appropriate model systems promise to provide similar insights from the perspective of molecular function. Recent efforts adopting an unbiased approach to discover functional sequences in complex genomes are already providing a glimpse of such insights. While of tremendous interest, we argue that the discovery of nonconserved functional sequences is largely in line with expectations.

First, we note that these results highlight gaps in our current data and analytical tools and the need for careful study design. Improved computational techniques related to sequence alignment and genomic sequence data from additional species will significantly boost the sensitivity to detect constrained and, therefore, functional sequences (Boffelli et al. 2003). Comparative studies of model organisms that are currently restricted to extreme phylogenetic scopes would benefit tremendously from additional genome sequences from more closely related species, such as the recent effort to surround the Drosophila melanogaster genome sequence with data from many other Drosophilids (Drosophila 12 Genomes Consortium 2007). Additionally, higher-resolution functional annotations and the development of experimental platforms for model organism “sister” species are also likely to clarify this relationship (Brown et al. 2007; Margulies et al. 2007). Second, these results also point to the influence of functional sequence turnover (Ludwig et al. 2000; Moses et al. 2006; Odom et al. 2007). We note that this phenomenon may apply to even developmentally important functionality, particularly for comparisons of distantly related species to discover elements that are individually minor contributors to the overall functional output (McGaughey et al. 2008).

Finally, we speculate that there are many functional sequences that are unlikely to have a major phenotypic effect and are therefore of minimal or no relevance to organismal fitness. It is important to keep in mind that we are not suggesting that such “molecular noise” is irrelevant to biology. Quite to the contrary, beyond the fact that characterizing these functions is necessary for a more complete understanding of biology, it seems possible that such “background” functionality serves some more general role. Synonymous sites in protein-coding DNA are often considered to be neutral (Kimura 1983), for example, but serve the abstract, yet critical function of generating a richer genetic code. Additionally, sequences with subtle molecular functionality may constitute a set of elements adaptable for the generation of novel genes or regulatory elements; mobile element activity may play a role in recruiting new genes to particular regulatory networks (Wang et al. 2007), for example, and there exists at least one example of a “promoter-like” sequence that is turned into a novel functional element (albeit pathogenic) via a single-nucleotide change in humans (De Gobbi et al. 2006). In any case, the accumulation of neutral “functional” changes is likely to be a common and important biological phenomenon. This idea has already received support from analyzing transcriptional “drift” in the evolution of humans and chimpanzees (Khaitovich et al. 2004). Much as the neutral theory of molecular evolution emphasized the role of chance in the evolution of genomic sequences, such a model seems appropriate as the default interpretation for the evolution of genomic function.

Acknowledgments

We thank Mark Rieder, Arend Sidow, Nadia Singh, and three anonymous reviewers for helpful comments on the manuscript. G.M.C. is supported by a Merck, Jane Coffin Childs Memorial Fund Fellowship.
