Locating mammalian transcription factor binding sites: A survey of computational and experimental techniques

  1. Laura Elnitski1,4,
  2. Victor X. Jin2,
  3. Peggy J. Farnham2, and
  4. Steven J.M. Jones3
  1. Genomic Functional Analysis Section, National Human Genome Research Institute, National Institutes of Health, Rockville, Maryland 20878, USA;
  2. Genome and Biomedical Sciences Facility, University of California–Davis, Davis, California 95616-8645, USA;
  3. Genome Sciences Centre, British Columbia Cancer Research Centre, Vancouver, British Columbia, Canada V5Z-4E6

Abstract

Fields such as genomics and systems biology are built on the synergism between computational and experimental techniques. This type of synergism is especially important in accomplishing goals like identifying all functional transcription factor binding sites in vertebrate genomes. Precise detection of these elements is a prerequisite to deciphering the complex regulatory networks that direct tissue-specific and lineage-specific patterns of gene expression. This review summarizes approaches for in silico, in vitro, and in vivo identification of transcription factor binding sites. A variety of techniques useful for both locus-specific and high-throughput analyses are discussed here, with emphasis on aspects of data generation and verification.

One documented goal of the National Human Genome Research Institute (NHGRI) is the identification of all functional noncoding elements in the human genome (ENCODE Project Consortium 2004). Studies by ENCODE Consortium members and other investigators in the field have demonstrated that a mixture of computational and experimental approaches is required for the genome-wide elucidation of cis-acting transcriptional regulatory elements. These include promoters, enhancers, and repressor elements, along with structural components like origins of replication and boundary elements. For instance, experimentally based oligo-capping methods represent technical advances toward defining the precise 5′ ends of mRNA transcripts (Suzuki et al. 2002; Shiraki et al. 2003; for review, see Harbers and Carninci 2005), enabling the robust prediction of proximal promoter regions and their components (Trinklein et al. 2003; Cooper et al. 2006). In addition, sensitive and comprehensive microarray-based analyses of human RNAs are providing a detailed map of the transcribed regions of the human genome (ENCODE Consortium, in prep.). Methods to characterize replication origins on a genome-wide scale are also in development. Techniques like microarrays are providing details on the coordinated timing of replication by detecting twofold increases in DNA copy number, or heavy isotope incorporation into newly synthesized DNA (for review, see Schwob 2004; MacAlpine and Bell 2005; ENCODE Consortium, in prep.). Along with newly emerging techniques, a few historically proven approaches still provide reliable indicators of functional regions. These include the detection of altered chromatin structure using DNaseI hypersensitivity (Weisbrod and Weintraub 1979; ENCODE Consortium, in prep.) and sequence conservation as found in pairwise- or multi-species comparisons (for review, see Miller et al. 2004; ENCODE Consortium, in prep.).

The function of promoters, enhancers, replication origins, and other regulatory elements is mediated by DNA/protein interactions. Thus, one major step in the characterization of the functional elements of the human genome is the identification of all the protein binding sites, which serve as the atomic units of functional activity (Collins 2003). Recent studies focused on the analysis of transcription factor binding sites in one percent of the human genome (ENCODE Consortium, in prep.) have revealed the need for integrated computational and experimental approaches in the identification of genome-scale sets of transcriptional regulatory elements. Given the amount of noncoding sequence that is under selective constraint (∼3.5% of the human genome; Waterston et al. 2002; Chiaromonte et al. 2003), the anticipated number of DNA binding factors (∼1962; Messina et al. 2004), the complexity of finding a suitable biological assay to detect a given functional activity, and the cost of pursuing such efforts, synergistic collaborations between computational prediction and high-throughput experimental validation remain a critical necessity.

The techniques summarized herein are meant to provide a comprehensive overview of the complementary aspects of computational prediction and experimental validation of functional sites. The methods described are often considered fundamental by investigators working within a given field but may be unfamiliar to outsiders, such as a biologist wanting to predict the identity of transcription factor binding sites (TFBSs) or a computer programmer trying to validate the prediction of a biological feature. In that spirit, we include techniques that address both locus-specific and high-throughput methodologies. Supplemental tables list the Web servers available for computational predictions and provide URLs for protocols of experimental techniques. Additionally, readers are encouraged to examine the journal Nucleic Acids Research for its Annual Review of Bioinformatics Web Sites (2006, http://nar.oxfordjournals.org/).

Although broad in scope, this review of computational and experimental techniques is intended to elucidate aspects of their interdependence. Computational techniques, by definition, are predictive and vary in performance quality. Experimental results provide a spectrum of information, ranging from implied functional relevance to validation of protein identity. In this review, we will begin with a description of computational approaches used to identify transcription factor binding sites, define a need for additional experimental data sets, and introduce several experimental methodologies for identifying regulatory elements. We will then end with a description of how computational analyses of these experimental data sets can provide new insights into transcriptional regulation. Figure 1 illustrates this interplay between computational and experimental strategies. Whether initiating from a locus-specific or high-throughput perspective, all indicated pathways lead to the ultimate goal of validation of a biological mechanism.

Computational techniques

A computational approach to studying transcriptional regulatory networks requires the analysis of large and complex data sets. These data sets often include such diverse yet interdependent data as (1) gene expression profiles, (2) locations of promoters and computationally predicted transcription factor binding sites, (3) experimentally identified target genes of specific transcription factor families, and (4) sequence conservation (for review, see Qiu 2003). Using such data sets, investigators have produced, for example, a computational catalog of high-quality putative regulatory elements from vertebrates (Prakash and Tompa 2005). Also, ab initio approaches using the techniques of conservation, overrepresentation, and coregulation are successfully being applied to determine cohorts of expression groups within the genome (Cora et al. 2005). Methods to identify tissue-specific factors have evolved from detecting single factors that regulate expression in tissues such as liver (Krivan and Wasserman 2001) or muscle (Wasserman and Fickett 1998) to comprehensively identifying novel motifs that confer tissue-specific expression patterns (Qian et al. 2005; Blanchette et al. 2006; Huber and Bulyk 2006). As transcription factors often work cooperatively, binding in close physical proximity, recent computational approaches have used the presence of co-occurring motifs to identify putative regulatory modules (Kreiman 2004; Zhou and Wong 2004; Zhu et al. 2005). The recent analysis by Blanchette and colleagues (2006) predicted more than 118,000 such regulatory modules in the human genome. Clearly, these (and other) computational approaches used to identify transcription networks are providing new insights into transcriptional regulation. However, in this review, we will focus only on the various computational methodologies used to predict transcription factor binding sites. An in-depth discussion of all the computational techniques used to predict binding sites is beyond the scope of this review; however, a survey of tools available as Web-based resources is documented in Supplemental Table 1. Also, evaluations of some computational tools are available in other publications (Roulet et al. 1998; Tompa et al. 2005).

The myriad approaches to the computational prediction of functional binding sites are all based on either pattern matching or pattern detection (Supplemental Table 1). Pattern matching utilizes prior knowledge of all characterized DNA binding sites for a given protein. Finding these patterns within the genome allows one to identify putative protein binding sites that might represent uncharacterized regulatory elements (van Helden 2003). Pattern matching requires that the known binding sites for a given protein be represented as a consensus of the collection or as a matrix of acceptable nucleotides at each position. The use of consensus patterns is convenient but may exclude a significant subset of a binding site repertoire because of omission of important variable regions (Roulet et al. 1998). Some of the missing information can be captured through IUPAC strings (an alphabet of characters other than ACGT) that provide a means to indicate alternative choices at each position (e.g., Y = C or T). However, IUPAC strings, while conveying more information than a core consensus, do not provide information as to the relative frequency of the alternative nucleotides. The information content of regulatory sites can be more accurately represented by position weight matrices (PWMs) (Stormo et al. 1982; Harr et al. 1983) or position-specific scoring matrices (PSSMs), which incorporate pattern variability by recording the frequencies of nucleotides at each site or by assigning penalties to nucleotides that should not appear within a factor binding site. Although PWM pattern matching represents an improvement over consensus mapping in sensitivity (i.e., it has a lower false negative rate), when used as the sole means of identifying protein-binding sites it still suffers from the limited amount of training data available (Roulet et al. 1998) and often results in a high rate of false-positive predictions (Tompa et al. 2005; Jolly et al. 2005).
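
To make the distinction concrete, the following minimal Python sketch (not taken from any published matrix or tool) builds a log-odds position weight matrix from a small set of hypothetical binding-site counts and scans a query sequence for windows scoring above a threshold; the counts, pseudocount, background frequency, and threshold are all illustrative assumptions.

    # Minimal sketch of PWM-based pattern matching (illustrative values, not a real factor's matrix).
    # Counts for a hypothetical 4-bp binding site are converted to a log-odds position weight matrix
    # and slid along a query sequence; windows scoring above a chosen threshold are reported.

    import math

    counts = {  # observed nucleotide counts at each of 4 positions, from hypothetical training sites
        'A': [ 8,  0,  1, 10],
        'C': [ 1,  0,  9,  0],
        'G': [ 1, 12,  1,  1],
        'T': [ 2,  0,  1,  1],
    }
    n_sites = 12
    background = 0.25          # uniform background nucleotide frequency
    pseudocount = 0.5          # avoids log(0) for unobserved nucleotides

    width = len(counts['A'])
    pwm = {
        base: [math.log2((counts[base][i] + pseudocount) / (n_sites + 4 * pseudocount) / background)
               for i in range(width)]
        for base in 'ACGT'
    }

    def score_window(seq):
        return sum(pwm[base][i] for i, base in enumerate(seq))

    def scan(sequence, threshold=4.0):
        hits = []
        for start in range(len(sequence) - width + 1):
            window = sequence[start:start + width]
            s = score_window(window)
            if s >= threshold:
                hits.append((start, window, round(s, 2)))
        return hits

    print(scan("TTAGCATTGAGCAACC"))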

A number of algorithmic approaches have been developed for de novo pattern detection (i.e., the discovery of unknown motifs), many of which search for recurring or overrepresented patterns in DNA. Examples include Hidden Markov Models (Pedersen and Moult 1996), Gibbs sampling (Lawrence et al. 1993), greedy alignment algorithms (e.g., CONSENSUS, Hertz and Stormo 1999), expectation-maximization (MEME, Bailey and Elkan 1995), probabilistic mixture modeling (NestedMica, Down and Hubbard 2005) and exhaustive enumeration (i.e., detecting the set of all nucleotide n-mers, then reporting the most frequent or overrepresented; e.g., Weeder, Pavesi et al. 2004). Alternatively, variations of a pattern can be modeled using information theory (Schneider 2000). Using this approach, the frequencies of nucleotides at each position give insight into whether a protein binds to the major or minor groove of the DNA helix. Once these patterns are determined for a particular protein, the range of variation in target binding sequences can be modeled and matched to the genome (Gadiraju et al. 2003; Vyhlidal et al. 2004).
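
As a toy illustration of the exhaustive-enumeration strategy, the sketch below counts every k-mer in a small set of hypothetical promoter fragments and ranks k-mers by their enrichment over shuffled versions of the same sequences. The sequences, the value of k, and the simple mononucleotide shuffle are assumptions made for brevity; published tools typically preserve dinucleotide composition and apply more rigorous statistics.

    # Toy sketch of exhaustive enumeration for de novo motif detection: count every k-mer in a set of
    # (hypothetical) co-regulated promoter fragments and rank k-mers by how overrepresented they are
    # relative to shuffled versions of the same sequences.

    import random
    from collections import Counter

    def kmer_counts(sequences, k):
        counts = Counter()
        for seq in sequences:
            for i in range(len(seq) - k + 1):
                counts[seq[i:i + k]] += 1
        return counts

    def shuffled(seq, rng):
        bases = list(seq)
        rng.shuffle(bases)          # mononucleotide shuffle; real tools often preserve dinucleotides
        return ''.join(bases)

    def overrepresented(sequences, k=6, n_shuffles=50, seed=0):
        rng = random.Random(seed)
        observed = kmer_counts(sequences, k)
        expected = Counter()
        for _ in range(n_shuffles):
            expected.update(kmer_counts([shuffled(s, rng) for s in sequences], k))
        ranked = []
        for kmer, obs in observed.items():
            exp = expected[kmer] / n_shuffles + 0.1   # pseudocount keeps the ratio finite
            ranked.append((obs / exp, obs, kmer))
        return sorted(ranked, reverse=True)[:5]

    promoters = ["ACGTGCTTTGACGTCAGG", "TTTGACGTCATGCAAAT", "GGGTGACGTCATTTCCC"]
    for ratio, obs, kmer in overrepresented(promoters):
        print(f"{kmer}  observed={obs}  enrichment={ratio:.1f}")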

The use of orthologous sequences, also referred to as phylogenetic footprinting, introduces the filtering power of evolutionary constraint to identify putative regulatory regions that stand apart from the background sequence conservation (Tagle et al. 1988). The search for both known binding sites (pattern matching) and overrepresented novel motifs (pattern detection) can be improved through the analysis of data sets containing orthologous or coregulated genes (summarized in Frith et al. 2004). In one case, regions that colocalized as high-scoring PWM matches and conserved regions in human–mouse–rat genomic alignments provided a 44-fold increase in the specificity of the predictions compared with pattern matching alone (Gibbs et al. 2004). In a study involving pattern detection, Xie et al. (2005) report the first comprehensive screen for regulatory motifs in human promoters by identifying motifs that are enriched above background and are conserved in human, mouse, rat, and dog genomes. Analysis tools that have been refined by incorporating cross-species conservation include Gibbs sampling (e.g., CompareProspector, Liu et al. 2004; PhyloGibbs, Siddharthan et al. 2004, 2005); expectation maximization (PhyloME, Sinha et al. 2003; orthoMEME, Sinha et al. 2003; EMnEM, Moses et al. 2004) and greedy alignment algorithms (PhyloCon, Wang and Stormo 2003). A variation of phylogenetic footprinting known as phylogenetic shadowing uses the collective divergence time of a relatively large number of closely related species (Boffelli et al. 2003). This has the advantage of identifying functional elements that are specific to a lineage from within an unambiguously aligned set of sequences. The disadvantage of this approach lies in the fact that the number of genomic sequences required for such an analysis is currently prohibitive for most investigators.
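
The filtering role of conservation described above can be sketched very simply: predicted sites are retained only when they fall inside conserved blocks derived from an alignment. The coordinates below are invented for illustration and stand in for PWM hits and alignment-derived conserved intervals.

    # Sketch of the phylogenetic-footprinting filter: predicted binding sites (e.g., PWM hits) are
    # retained only when they fall entirely inside conserved blocks from a multi-species alignment.

    predicted_sites = [(120, 128), (455, 463), (1010, 1018)]   # (start, end) of candidate sites
    conserved_blocks = [(100, 200), (1000, 1100)]              # conserved intervals from an alignment

    def within(site, block):
        return block[0] <= site[0] and site[1] <= block[1]

    filtered = [site for site in predicted_sites
                if any(within(site, block) for block in conserved_blocks)]

    print(filtered)   # -> [(120, 128), (1010, 1018)]; the unconserved hit at 455-463 is discarded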

The recent refinements of computational techniques for identifying binding sites have evoked considerable interest from the field in the development of follow-up or validation analyses. For example, evolutionary constraint has been used not only to identify sites but also to distinguish real motifs from false positives (Blanchette and Sinha 2001) and to discern potentially functional sites from neutral DNA (King et al. 2005). Other validation analyses capitalize on properties of regulatory elements such as the presence of spaced dyads (pairs of short words separated by a fixed distance) and the propensity for palindromic content (van Helden et al. 2000), as well as the interdependence of bases at specific positions within a motif (Wang et al. 2005). Also, the information content and binding preferences of known motifs have been used to identify binding sites of new family members (Keles et al. 2003). Currently, the limiting factor in confirmation and refinement of in silico predictions is a lack of experimental data (Vavouri and Elgar 2005). Despite our best efforts at predicting functional sites, the cellular environment dictates which events can and cannot occur by imposing the selective constraint of higher-order chromatin structure; consequently, experimental confirmation remains the highest form of validation. Described below are various experimental techniques that can be used in conjunction with the computational approaches depicted above.

Experimental techniques

Experimental approaches to identifying transcription factor binding sites are necessary to understand their contributions to biological function, to address the complexity of tissue-specific and temporal stage-specific effects on gene expression (Levine and Tjian 2003), and to continue refinement of computational predictions. Experimental techniques useful for identifying transcription factor binding sites include those that, although not directly measuring transcription factor/DNA interactions, can lead to the identification of regulatory elements. Such techniques include analysis of alterations of chromatin structure and experimental manipulation of defined DNA segments, both of which are advantageous in helping to locate a functional element when the exact regulatory protein(s) involved is not known. Other techniques, which directly measure protein/DNA interactions, provide more precise information but are only useful after the identity of the critical transcription factor has been established. Examples of these two types of approaches, each of which can range in scope from localized, site-specific analyses to high-throughput assays that generate broad conclusions about binding site preferences and regulation of gene expression, are described below. Protocols for these experimental assays are available in the references and in Supplemental Table 2.

DNaseI hypersensitivity

DNaseI hypersensitivity provides a method to map changes in chromatin structure. The degree of response of a DNA sequence to DNaseI is classified as generalized sensitivity or hypersensitivity. Generalized nuclease sensitivity is a property inherent in all actively expressed genes (Gazit and Cedar 1980), correlating closely with the presence of acetylated histones. Whether a cause or an effect, the presence of acetylated histones accompanies the appearance of open chromatin extending over tens to hundreds of kilobases and is documented in the chicken lysozyme, chicken ovalbumin, human apolipoprotein B, c-fos, and chicken β-globin loci (Lawson et al. 1982; Goodwin et al. 1985; Jantzen et al. 1986; Feng and Villeponteau 1992; Hebbes et al. 1994). In contrast, hypersensitivity refers to extreme sensitivity to the cleavage effects of the enzyme that is localized to short stretches of DNA ranging from 100 to 400 bp in length (Gross and Garrard 1988). Extreme sensitivity serves as a marker for functional regions that fall in noncoding sequences; these include promoters, enhancers, silencers, origins of replication, recombination elements, and structural sites of telomeres and centromeres (Cereghini et al. 1984; Gross and Garrard 1988). Early observations suggested that hypersensitivity is associated with the removal of nucleosomes (Almer et al. 1986), whereas more recent analyses can detect the presence of histones in modified form (Gui and Dean 2003), such as acetylated histones H3 and H4 and H3 methylated at lysine 4 (K4; Jenuwein and Allis 2001), at the hypersensitive sites (HSs). The modifications on the histones reduce the affinity of DNA for the nucleosome (Bode et al. 1980), facilitating the interaction of DNA with trans-acting factors (Vettese-Dadey et al. 1996). Thus, hypersensitivity, which was first noted as a feature of already-characterized functional sites, has now evolved into a predictive indicator for the presence of a functional site. Furthermore, the impermanent nature of nuclease hypersensitivity provides insight into the temporal and tissue-specific stages of activity of the underlying elements when assayed using representative biological samples.

Many studies have focused on a locus-specific analysis of nuclease hypersensitive sites. In such studies, the resolution with which one can identify the location of a DNaseI HS varies by approach, ranging from ±500 bp using the indirect end-labeling technique (Wu 1980) to nearly nucleotide resolution using PCR assessment (Yoo et al. 1996) and quantitative PCR (McArthur et al. 2001). Thus, the interpretation of results is dependent on the exact method used for analysis. High-throughput approaches to assess DNase hypersensitivity address the appearance and disappearance of functional sites on a genome-wide scale. Comparisons can be made between cells from different tissues, or within the same type of cell to measure a response to changes in the cellular environment. Two new experimental techniques that have emerged as promising technologies in the ENCODE project (ENCODE Project Consortium 2004) are quantitative chromatin profiling (Dorschner et al. 2004) and massively parallel signature sequencing (Crawford et al. 2005). Additionally, Yuan et al. (2005) have used tiled microarrays to identify translational positioning of nucleosomes in Saccharomyces cerevisiae, revealing that 69% of nucleosomal DNA contained positioned nucleosomes, whereas transcription start sites tended to be nucleosome-free regions. Although the scope and expense of such experiments could limit the pace at which new cell lines or tissue types are investigated, comparisons across samples hold the promise of identifying regions that appear in a specific lineage of cells and thus could provide a systematic means of profiling functional sites that describe a cellular phenotype. Importantly, recent computational advances, which use a sequence-based classification algorithm, have relied on experimental data sets to model hypersensitive sites in silico (Noble et al. 2005). In this approach, a support vector machine is trained to discriminate between experimentally validated HSs and non-HSs. Experimental validation of the genome-wide probability scores shows 70% predictive accuracy, providing support for the extension of this application to additional tissue types. Clearly, additional cycling between experimental validation and computational predictions will continue to improve identification of HSs.
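
To illustrate the flavor of such sequence-based classification, the sketch below represents sequences as k-mer frequency vectors and trains a support vector machine to separate hypothetical HS from non-HS examples. It assumes scikit-learn is available and uses invented placeholder sequences rather than real training data, so it conveys the idea rather than any published model.

    # Minimal sketch in the spirit of a sequence-based SVM classifier for hypersensitive sites:
    # sequences are represented as k-mer frequency vectors, and an SVM is trained to separate
    # (hypothetical) HS examples from non-HS examples. Requires scikit-learn.

    from itertools import product
    from sklearn.svm import SVC

    K = 3
    KMERS = [''.join(p) for p in product('ACGT', repeat=K)]

    def kmer_vector(seq):
        index = {kmer: i for i, kmer in enumerate(KMERS)}
        counts = [0] * len(KMERS)
        for i in range(len(seq) - K + 1):
            counts[index[seq[i:i + K]]] += 1
        total = max(sum(counts), 1)
        return [c / total for c in counts]       # frequencies, so sequence-length differences wash out

    # Placeholder training data; a real analysis would use experimentally mapped HS regions.
    hs_seqs    = ["CGCGGGCGCCCGGGCG", "GCGCCCGGGCGCGCGG", "CCGGGCGCGCCCGCGG"]
    nonhs_seqs = ["ATATTTAAATTTATAT", "TTTAAATATATTTAAA", "AATATTTATAAATTTA"]

    X = [kmer_vector(s) for s in hs_seqs + nonhs_seqs]
    y = [1] * len(hs_seqs) + [0] * len(nonhs_seqs)

    model = SVC(kernel='linear').fit(X, y)
    print(model.predict([kmer_vector("GCGCGGGCCCGCGGGC")]))   # 1 = predicted HS, 0 = predicted non-HS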

Promoter analyses

Gene expression assays measure changes in the production of a reporter protein in response to cis-acting regulatory signals. For instance, promoter sequences placed upstream of a firefly-luciferase reporter gene (de Wet et al. 1987) or green fluorescent protein (GFP; Tsien 1998) can be introduced into a sample of cultured cells and subsequently assayed in a 24- to 48-h time period, generating reproducible results. Promoters and enhancers can be tested in short-term reactions known as transient transfections, in which the test plasmid remains unintegrated (episomal) in the nucleus. The introduction of an enhancer element creates a “gain-of-function” result, whereas “loss-of-function” assays result from mutations of functional nucleotides in the target region. Alternatively, long-term assays, or stable transfections, use a linearized plasmid that integrates into the genomic DNA and hence is subject to effects conferred by the surrounding chromatin environment. Stable transfections are frequently used to identify sequences that protect against both positive and negative influences of surrounding chromatin (such as boundary elements) and to provide a biologically relevant view of the functional activity as measured within a living cell. High-throughput approaches to cell transfection include the use of cationic lipids or electroporation units that work in a 96-well plate format (Strauss 1996; Ovcharenko et al. 2005; Siemen et al. 2005). One assessment of high-throughput gene expression focused on putative promoters in one percent of the human genome, assayed in multiple cell lines (Trinklein et al. 2003; Cooper et al. 2006). Such large-scale promoter/enhancer assays provide insight into the features commonly found in promoters and serve to verify the functional capability of computationally predicted elements.

The immortalized cell lines used in most experiments rarely recapitulate a “normal” cellular environment (Worton et al. 1977). Nevertheless, they provide a suitable environment in which to rapidly initiate studies of the mechanisms of gene regulation. In contrast, although more technically difficult, in vivo expression assays using animal models supply a means of assessing functional elements within a biologically relevant, tissue-specific context. Analyses using fish, frogs, chickens, and mice (Khokha and Loots 2005; Poulin et al. 2005; Shin et al. 2005; Hallikas et al. 2006; Takemoto et al. 2006) have shown that an element or binding site can act in a defined biological pathway; such conclusions could not have been made with cultured cells. For example, Hallikas et al. (2006) used a computational approach to identify mammalian enhancers and then showed that several of the identified enhancer elements display striking developmental and tissue-specific activity.

Intraspecies comparative approaches in Ciona intestinalis highlight the versatility of this model organism. Boffelli et al. (2004) identified candidate regulatory regions with the slowest mutation rates relative to the surrounding sequence and tested them for functional activity in transgenic tadpoles. The work identified a set of noncoding elements that act as tissue-specific enhancers in the notochord, endoderm, and neural tube. The availability of genomic sequence from a closely related species, Ciona savignyi, provides opportunities to identify additional candidate regulatory elements through interspecies comparisons. A summary of Web resources and experimental data available for Ciona is provided in Shi et al. (2005).

Protein binding assays

EMSAs and DNaseI protection

The traditional approach to defining protein–DNA interactions was the electrophoretic mobility shift assay (EMSA) (Fried and Crothers 1981; Garner and Revzin 1981). By utilizing the sieving power of nondenaturing polyacrylamide gels to separate a protein-bound DNA molecule from one that is unbound, the in vitro “gel-shift” assay is ideal for verifying the ability of an unknown protein to recognize and bind a target DNA sequence. DNaseI protection (also known as DNaseI footprinting) (Galas and Schmitz 1978) is another technique for the precise localization of protein binding sites that does not require knowledge of protein identity. The technique combines the binding reaction of an EMSA with the cleavage reaction of DNaseI. When the radiolabeled probe is visualized on a denaturing polyacrylamide gel, sites protected from cleavage create a blank image in the otherwise semicontinuous ladder of nucleotide positions. As an in vitro assessment of protein binding, DNaseI protection uses a longer probe than EMSA (500 bp vs. 25 bp), elucidating the positions of numerous proteins on the probe simultaneously. A variation of the in vitro assay uses chemical cleavage to produce a uniform cleavage pattern, overcoming limitations of the DNaseI enzyme and simplifying the interpretation of results (Drouin et al. 2001). With both gel-shift and DNase footprinting assays, unintended DNA–protein interactions are often detected. This can result from interference by nonspecific DNA binding proteins, such as DNA repair proteins, which can bind to the ends of DNA probes in a binding reaction (Klug 1997).

Technical advances related to in vitro binding assays include the replacement of radioactive labels with fluorescent labels (Onizuka et al. 2002) and scaling up for high-throughput approaches. For example, SELEX (systematic evolution of ligands by exponential enrichment) (Tuerk and Gold 1990) and CASTing (cyclic amplification and selection of targets) (Wright et al. 1991) screen large pools of short, random oligonucleotide probes for recognition by a specific protein. The JASPAR database of nonredundant PWMs contains binding site information obtained with this in vitro approach (Sandelin et al. 2004). Other high-throughput in vitro approaches include DIP–chip (DNA immunoprecipitation) (Liu et al. 2005) and double-stranded DNA microarray chips (Bulyk et al. 1999; Mukherjee et al. 2004; Bai et al. 2005).

ChIP assays

Use of gel shift assays and in vitro DNase footprinting is quickly giving way to use of assays that capture binding as it happens in the in vivo environment. For instance, the development of in vivo footprinting now allows the study of DNA/protein events within a living cell. This assay uses ligation-mediated PCR (Mueller and Wold 1989) to capture the fractured pieces of genomic DNA that flank the sites protected by protein. Such in vivo assays are informative and can provide tissue-specific information concerning transcription factor binding, yet they can be technically challenging (Komura and Riggs 1998). Also, they do not provide information as to the identity of the involved protein(s). In contrast, the in vivo technique of chromatin immunoprecipitation (ChIP) is especially useful when the protein of interest is known. Reliable protocols for this procedure are listed in Supplemental Table 3. ChIP assays represent a modification of “pull-down” assays in which target proteins are precipitated from solution using an antibody coupled to a retrievable tag. In contrast with standard protein immunoprecipitation assays, ChIP assays capture in vivo protein–DNA interactions by cross-linking proteins to their DNA recognition sites using formaldehyde. Before precipitation by a transcription factor-specific antibody, the DNA is fragmented into small pieces averaging 100–500 bp. After precipitation, reversal of the cross-linking reaction releases the DNA for subsequent detection by PCR amplification. Caveats to the ChIP assay include an inability to detect precise contacts of binding within the 100–500-bp region of the DNA probe and the potential for recovering indirect interactions created by protein–protein contact rather than protein–DNA interactions. Kang et al. (2002) have proposed a method to combine ChIP with DNase protection to address the limitations of both assays, thereby identifying the interacting protein in addition to its interaction site. Although most ChIP assays are performed using tissue culture cells, modifications of the assay have been developed to allow analysis in mammalian tissues (Kirmizis et al. 2003; Chaya and Zaret 2004). A significant challenge that remains for tissue-ChIP assays is in gaining enough tissue for use in the assay, especially if the source tissue is rare (such as for human tumor samples).

ChIP–chip

High-throughput variations of the ChIP technique use ligation-mediated PCR to amplify the pool of DNA sequences as uniformly as possible, generating many copies of all genomic binding sites for a given protein. The assortment of DNA binding sites recovered in a ChIP assay can then be visualized by hybridization to a microarray of genomic sequences. This approach, called ChIP–chip, has been used to interrogate protein–DNA interactions in intact cells (Ren et al. 2000) and is well documented in many comprehensive reviews (see, e.g., Hanlon and Lieb 2004). Just as in cDNA microarray experiments, DNA that has undergone a ChIP assay is labeled with the fluorophore Cy5 and, after hybridization to an array of target sequences, its signal is compared with that from an equal amount of total input DNA labeled with Cy3. The relative enrichment of immunoprecipitated DNA over total input DNA (Cy5/Cy3) is used to identify putative binding sites. As the technology improves, the number and complexity of target sequences represented on the microarray continue to grow. For instance, early applications using intergenic regions in yeast (Ren et al. 2000) led the way for analyses of putative promoter regions in humans (Li et al. 2003; Odom et al. 2004). More complex targets were developed using CpG islands associated with promoters (Weinmann and Farnham 2002; Mao et al. 2003; Heisler et al. 2005), converging on increasingly refined platforms that represent the nonrepetitive portion of the human genome. A recent study used a series of arrays containing ∼14 million 50-mer oligonucleotides, designed to represent all the nonrepetitive DNA in the human genome at 100-bp resolution, to define a genome-wide map of active promoters in human fibroblasts (Kim et al. 2005).
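
The basic enrichment calculation described above can be sketched as follows: for each array probe, the immunoprecipitated (Cy5) signal is compared with the total input (Cy3) signal as a log2 ratio. The probe intensities below are invented, and real pipelines add background correction and normalization steps that are omitted here.

    # Sketch of the basic ChIP-chip enrichment calculation on hypothetical probe intensities.

    import math

    probes = [
        # (probe_id, genomic_start, cy5_IP_signal, cy3_input_signal)
        ("chr1_probe_001", 10000, 1800.0,  950.0),
        ("chr1_probe_002", 10100, 5200.0,  980.0),
        ("chr1_probe_003", 10200, 4900.0, 1010.0),
        ("chr1_probe_004", 10300,  900.0,  940.0),
    ]

    for probe_id, start, cy5, cy3 in probes:
        log_ratio = math.log2(cy5 / cy3)             # enrichment of IP over input
        flag = "enriched" if log_ratio > 1.0 else ""
        print(f"{probe_id}\t{start}\t{log_ratio:.2f}\t{flag}")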

One of the key issues in processing raw ChIP–chip data is to identify the “best” binding sites among the collection of potential DNA targets, pointing to the need for computational scientists to join experimental teams. Several statistical approaches have been developed to detect such regions, which are summarized eloquently in Buck and Lieb (2004). They include a median percentile rank (Lieb et al. 2001), single-array error models (Ren et al. 2000; Li et al. 2003), and a sliding window analysis (Buck and Lieb 2004). The tool ChIPOTle, which uses a sliding window approach, is publicly available for the analysis and interpretation of ChIP–chip data (Buck et al. 2005). Bieda et al. (2006) describe both theoretical and statistical approaches to ChIP–chip data analysis, bringing new insights into the role of the protein E2F1, which acts at a large fraction of human promoters without recognizing a consensus motif. Additional methods are described in a series of recent reports including variance stabilization (Gibbons et al. 2005), enrichment detection (Cawley et al. 2004), and model-based methods (Kim et al. 2005). Programs to identify the most significant regions for protein binding from a ChIP–chip analysis include MPEAK (Kim et al. 2005) and PEAKFinder (Glynn et al. 2004). URLs for these Web sites are listed in Supplemental Table 1.
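
A stripped-down version of the sliding-window idea is sketched below: the log2 ratios of probes falling in each fixed-size window are averaged, and windows exceeding a threshold are reported as putative binding regions. This illustrates the general approach rather than reimplementing ChIPOTle or any other published tool; the window size, step, threshold, and probe data are arbitrary, and overlapping windows would normally be merged and assigned significance values.

    # Simplified sliding-window peak calling on ChIP-chip log2 ratios, in the spirit of
    # window-averaging approaches (not a reimplementation of any published tool).
    # Probes are assumed sorted by position; data are invented for illustration.

    def sliding_window_peaks(probes, window=500, step=100, threshold=1.0):
        """probes: list of (position, log2_ratio). Returns (win_start, win_end, mean_ratio)."""
        if not probes:
            return []
        peaks = []
        start = probes[0][0]
        last_pos = probes[-1][0]
        while start <= last_pos:
            end = start + window
            in_window = [r for pos, r in probes if start <= pos < end]
            if in_window:
                mean_ratio = sum(in_window) / len(in_window)
                if mean_ratio >= threshold:
                    peaks.append((start, end, round(mean_ratio, 2)))
            start += step
        return peaks

    probes = [(10000, 0.2), (10100, 1.8), (10200, 2.4), (10300, 2.1), (10400, 0.3), (10500, 0.1)]
    for peak in sliding_window_peaks(probes):
        print(peak)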

Experimental conditions and reagent quality greatly affect ChIP–chip results. Published work by Oberley and Farnham (2003) and related papers in the same issue of the journal Methods in Enzymology (Allis and Wu 2003) provide guidance on these issues. Antibodies suitable for ChIP–chip applications are summarized in the ChIP-on-chip database (see Supplemental Table 1), as are additional resources for experimental protocols. ChIP–chip analyses conducted through February 2005 are summarized in Sikder and Kodadek (2005). Repositories and genomic servers listed in Supplemental Table 3 contain additional ChIP–chip data sets, many of which were contributed through the efforts of ENCODE project participants (www.genome.gov/10005107).

Computational follow-up experiments

The ChIP–chip assays described above allow the identification of the genomic region to which a particular protein is bound. However, because of limitations of the assays, it is difficult to identify the exact site within the region to which the protein is bound. Certain computational tools such as MEME (Bailey and Gribskov 1997) and AlignACE (Roth et al. 1998) (see Supplemental Table 1) have proven useful in the follow-up analysis of ChIP–chip data. In addition, other computational approaches have been applied to ChIP–chip data. For example, MDScan (Liu et al. 2002) is an ab initio motif discovery method that applies a Bayesian statistical score function to refine candidate motifs enumerated from a set of the top ChIP–chip sequences. Other approaches combine the use of a weight matrix model, which incorporates prior knowledge of PWMs, with statistical classification methods to identify the TFBSs. To illustrate this point, a CART model (classification and regression tree) has been used to identify estrogen receptor α target genes (Jin et al. 2004), and a MARS model (MARSMotifs), which uses multivariate adaptive regression splines, was used to discover liver target genes (Smith et al. 2005). Hong et al. (2005) adopted a confidence-rated boosting algorithm that takes advantage of ChIP–chip technology to distinguish a set of positive data (binding sequences) from a set of negative data (nonbinding sequences). Several studies have increased the sensitivity of motif detection by building motif modules (cis-regulatory modules) based on interacting motifs (Zhou and Wong 2004; Gupta and Liu 2005; Wang et al. 2005; Li et al. 2006). Furthermore, integrating pattern detection of interacting transcription factors, phylogenetic footprinting, and statistical learning methods has provided a substantial increase in the specificity of detecting estrogen receptor α (Cheng et al. 2006) and E2F1 target genes (Jin et al. 2006).
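
As a simple illustration of coupling binding-site features with a statistical classifier of the kind described above, the sketch below trains a CART-style decision tree on a few invented per-promoter features (best PWM score, number of matches, conservation of the best match) to separate hypothetical ChIP-positive from ChIP-negative promoters. It assumes scikit-learn and is not a reproduction of any of the published models cited above.

    # Sketch of a CART-style classification of promoters as factor targets vs. nontargets.
    # Feature values and labels are invented; a real study would derive positives and
    # negatives from ChIP-chip data and use many more examples.

    from sklearn.tree import DecisionTreeClassifier, export_text

    feature_names = ["best_pwm_score", "n_matches", "best_match_conservation"]

    X = [
        [12.1, 3, 0.92],   # ChIP-positive promoters (hypothetical)
        [10.8, 2, 0.88],
        [11.5, 4, 0.95],
        [ 4.2, 1, 0.40],   # ChIP-negative promoters (hypothetical)
        [ 5.0, 0, 0.35],
        [ 3.8, 1, 0.55],
    ]
    y = [1, 1, 1, 0, 0, 0]

    tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
    print(export_text(tree, feature_names=feature_names))      # inspect the learned splits
    print(tree.predict([[11.0, 2, 0.90], [4.5, 1, 0.30]]))     # classify new promoters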

An example of integrating ChIP–chip data with phylogenetic conservation and experimental analyses is shown in Harbison et al. (2004). The authors combined binding data from 203 transcriptional regulators in yeast assayed under more than one growth condition. Six motif discovery methods (including MDScan and MEME) were used to find highly significant motifs for 116 regulatory proteins. The process identified promoter architectures that give clues to regulatory mechanisms defined by the presence of single or repetitive motifs, multiple occurrences of motifs having mixed identities, and co-occurring motifs.

Databases and Web sites for genomic analysis

The databases listed in Supplemental Table 3 serve as repositories for whole-genome high-throughput ChIP–chip binding data. Flexible query and output options in these databases allow one to filter data sets to meet user-specified thresholds (e.g., certain P-values on ChIP–chip data), to pass data to interconnected databases, and to retrieve the DNA sequences that underlie the regions of interest. The Galaxy2 repository (Giardine et al. 2005; Blankenberg et al. 2007) provides mathematical tools, known as set operations, for use on any genomic data sets represented as coordinate-based intervals. These include operations for intersection, union, subtraction, and complement. Importantly, the server can handle extremely large data files. Additional tools include operations for finding all regions that are proximal to a feature data set, merging regions that have overlapping coordinates, and clustering regions that are located within a specified distance. These tools were used to predict MEF2 and MyoD binding sites based on previous knowledge that these sites are known to cluster (Fickett 1996). A complementary approach to predicting target genes from genomic data sets aims to identify target genes using a combination of ChIP–chip and gene expression arrays (Kirmizis and Farnham 2004). Database repositories specializing in both of these data types include GEO (Barrett et al. 2005) and ArrayExpress (Brazma et al. 2003) along with others listed in Supplemental Table 3.
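
The interval set operations mentioned above reduce to simple coordinate arithmetic, as sketched below for two hypothetical feature sets given as (start, end) intervals on one chromosome; the intersection and subtraction functions here ignore chromosome names and strand for brevity and assume half-open, non-overlapping intervals within each set.

    # Sketch of coordinate-interval set operations (intersection, subtraction) of the kind
    # described above, applied to invented ChIP peak and conserved-site intervals.

    def intersect(a, b):
        """Regions covered by both interval sets."""
        out = []
        for s1, e1 in a:
            for s2, e2 in b:
                lo, hi = max(s1, s2), min(e1, e2)
                if lo < hi:
                    out.append((lo, hi))
        return sorted(out)

    def subtract(a, b):
        """Portions of intervals in a not covered by any interval in b."""
        out = []
        for s1, e1 in a:
            pieces = [(s1, e1)]
            for s2, e2 in b:
                pieces = [seg for s, e in pieces
                          for seg in ((s, min(e, s2)), (max(s, e2), e))
                          if seg[0] < seg[1]]
            out.extend(pieces)
        return sorted(out)

    chip_peaks      = [(100, 300), (500, 700), (900, 1000)]
    conserved_sites = [(250, 550), (950, 975)]

    print(intersect(chip_peaks, conserved_sites))   # peak regions overlapping conserved sites
    print(subtract(chip_peaks, conserved_sites))    # peak regions outside conserved sites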

After a set of binding sites and/or target promoters are obtained, further analyses are used to place the information into a wider context. Conservation information is available as alignments of the sequenced mammalian genomes at the UCSC Genome Browser, Vista, and Ensembl Web sites (Karolchik et al. 2003; Frazer et al. 2004; Birney et al. 2006). Additional alignments containing up to 25 mammalian sequences (including pre-eutherian species) are available in the finished ENCODE and ZOOSEQ target regions and are viewable at the Genome Browser. Conversely, stand-alone tools for pairwise or multispecies alignments allow users to create statistically robust alignments of their own target sequences (Brudno et al. 2003; Blanchette et al. 2004; Cooper et al. 2005; for review, see Dubchak and Frazer 2003; Frazer et al. 2003). Highly conserved genomic segments are typically embraced as candidates for experimental and computational predictions of transcription factor binding sites. Accordingly, tools to detect conserved regions, such as Galaxy2, the UCSC Table Browser (Karolchik et al. 2004), MCS Browser (Margulies et al. 2003), and ECR browser (Ovcharenko et al. 2004) allow a user to define and extract conserved sequences from a multi-species alignment. In addition to conservation, the Genome Browser provides predictive measures of regulatory regions such as 5-way regulatory potential (RP) (Kolbe et al. 2004) and PhastCons scores (Siepel et al. 2005), both of which are useful for identifying putative functional regions under selective constraint.

Finally, it is becoming clear that the collection and visualization of data sets comprising both experimental and computational analyses from multiple independent research groups will allow broader insights than individually analyzed data sources (see Supplemental Table 3). Recent studies focused on the experimental and computational analysis of one percent of the human genome (ENCODE Consortium, in prep.) have led to the identification of a large number of novel regulatory elements initially termed GEMMS (genomic elements identified by multiple methods). Visualization of such collected data can be performed using the UCSC browser, which allows the display of information concerning known and predicted genes, protein binding sites, promoter activities, transcription factor motifs, sequence conservation, and DNaseI hypersensitivity sites. Continuing developments focused on the integration and dissemination of combined experimental and computational information are critical for the future.

Conclusions

In this review, we have attempted to demonstrate that the interdependence of experimental and computational approaches allows an iterative refinement process, with each side benefiting from collaboration with the other. Experimentalists may choose to begin a project with in silico analyses or to expand an experimental observation into a genome-wide predictive analysis. Programmers need to verify predictions of binding sites and improve their predictive pipelines using experimental data. Although many tools for predicting binding sites are freely available via the Internet, allowing nearly universal access, many experimentalists are not well trained in the programming skills needed for insightful application of these analysis tools. Similarly, despite the availability of robust protocols for genome-scale experimental identification of transcription factor binding sites, these experiments are technically challenging and time consuming. Because programmers are frequently more familiar with the intricacies of tools for binding site prediction, and biologists are better trained in the collection and interpretation of experimental data sets, collaborative interactions and cross-training will serve both communities well.

Acknowledgments

We thank the members of our laboratories for productive discussions and anonymous reviewers for helpful comments. We apologize to colleagues whose original contributions could not be cited, given the space constraints. The Intramural Program of the NIH, NHGRI, supports research in the laboratory of L.E.; S.J. is a scholar of the Michael Smith Foundation for Health Research; and P.J.F. and V.X.J. are supported by grants from the NIH (CA45240, HG003129, and DK067889).

Footnotes

  • 4 Corresponding author. E-mail: elnitski@mail.nih.gov; fax (301) 435-6170.

  • [Supplemental material is available online at www.genome.org.]

  • Article published online before print. Article and publication date at http://www.genome.org/cgi/doi/10.1101/gr.4140006

  • Freely available online through the Genome Research Open Access option.
