Eukaryotic Promoter Recognition

James W. Fickett1,3 and Artemis G. Hatzigeorgiou2
1Bioinformatics, SmithKline Beecham Pharmaceuticals, King of Prussia, Pennsylvania 19406; 2Synaptic Ltd., 13671 Acharnai, Greece

Computational analysis of polymerase II (Pol II) promoters may contribute to improved gene identification and to prediction of the expression context of genes. Before assessing the state of computational promoter recognition per se in the main body of this review, we will provide a context by giving a brief overview of these two problems.

Partitioning a Genome into Genes

Only recently has it become common to determine eukaryotic genomic sequences large enough to contain several genes. With these data comes a new problem for gene finding programs: to partition a set of exons correctly among several genes.

One line of development in eukaryotic gene identification begins with coding region identification by statistical means and adds pattern recognition for sites of transcriptional, splicing, and translational control to produce algorithms capable of suggesting overall gene structure (for review, see Gelfand 1995; Fickett 1996a). To date, most development effort has focused on integration of the various kinds of pattern information in the relatively simple case where a single complete gene is present in the input sequence. In this case, current algorithms usually suggest a putative protein translation similar to that in the literature, though there is still significant room for improvement (Burset and Guigo 1996). The extension of these algorithms to deal with a sequence containing multiple or partial genes is just beginning (Burge and Karlin 1997; http://gnomic.stanford.edu/~chris/GENSCANW.html). Because the signals that control the start and stop of transcription and translation, and the location of splicing, are still not very well understood, it is not uncommon for a gene-finding algorithm to confuse internal with initial and terminal exons, thus wrongly partitioning the exons. The problem is compounded by our incomplete understanding of alternative splicing control elements.

Another line of development in gene identification is based on homology (e.g., Gish and States 1993; Gelfand et al. 1996). If there is a close homolog in the databases to one of the genes in the sequence under analysis, sequence similarity will usually group the exons for this gene correctly. Still, in many cases there is no close homolog, and even when some homolog exists there is no guarantee that the encoded protein lacks insertions or deletions.

Clearly, some means of recognizing the beginnings of genes, probably via the promoter, or the ends, probably by means of the polyadenylation signal or translation termination signal (e.g., Kondrakhin et al. 1994; Wahle and Keller 1996; Dalphin et al. 1997; Solovyev and Salamov 1997), would enable a major advance. The promoter seems to be a much richer signal than the 3′ processing signals, though, as we shall see below, it is not easy to take advantage of the information in the promoter.

Determining the Correct Protein Translation

Of course, the single most important goal in gene identification is to correctly deduce the protein product(s) of the gene. After partitioning the genome into genes, the greatest difficulty in eukaryotes is correctly determining the splicing structure. Locating the correct initiation codon is also a difficult and important step in this case. If the transcription start site (TSS) is known, and there is no intron interrupting the 5′-untranslated region, Kozak’s (1996) rules can probably locate the correct initiation codon in most cases.

In prokaryotes the problem is of a different nature. Because splicing is normally absent, dividing the genome into gene units is ordinarily straightforward. This does not make the correct deduction of protein product trivial, however, for finding the correct initiation codon within an open reading frame (ORF) is difficult. In this case, promoter location, though useful, does not provide the key information that it does for eukaryotes because of the existence of multicistronic operons. Rather, for prokaryotes, the key need is reliable localization of the ribosome binding site (Shine and Dalgarno 1974).

Determination of Expression Context

Many experimental techniques are being developed for cataloging the expression context of genes (e.g., Prashar and Weismann 1996 and references therein). Development of computer algorithms to predict expression context from genomic sequence has received much less attention but may represent an important opportunity.

Gene expression is regulated at many levels, including chromatin packing (for review, see Kingston et al. 1996), transcription initiation (see below), polyadenylation (for review, see Wahle and Keller 1996), splicing (for review, see McKeown 1992), mRNA stability (e.g., Decker and Parker 1994), translation initiation (for review, see Kozak 1992), and others. But it is generally thought that the single most important point of regulation is at transcription initiation. The initiation of transcription seems to be regulated in large part by coordinate binding of many proteins to the promoter and, for some genes, to one or more enhancers. Specific combinations of binding sites, then, may provide the information necessary to suggest a particular expression context, and it is here that computational work to date has focused.

In most cases, researchers in this area have taken the locations of transcriptional regulatory regions (promoters and enhancers) as given and, in attempting to define those patterns in the DNA (combinations of binding sites) that determine expression context, have only attempted to give patterns with sufficient information content to sort regulatory regions into those that are active in a particular context and those that are not (e.g., Claverie and Sauvaget 1985; Fondrat and Kalogeropoulos 1994; Pedersen et al. 1996; Rosenblueth et al. 1996). For this approach to be successful in the long run, reliable algorithms must be developed for the recognition of promoters and enhancers in general. Another approach to the problem is to attempt to define patterns with very high information content, capable of distinguishing regulatory regions active in a specific context from all the other DNA in the genome (e.g., Fickett 1996b; Tronche et al. 1997). With this approach, one can imagine that general promoter recognition would eventually consist of separately recognizing a large number of specific cases. It is too early to clearly define the benefits of either strategy, and in any case, techniques developed with one approach will almost certainly transfer in part to the other.

Eukaryotic Promoter Recognition

In the rest of the paper we concentrate on the key problem of general eukaryotic promoter recognition. First, we review a few salient points from recent advances in biochemical understanding of transcription initiation. Next, the core computational resources and techniques are discussed, and then currently available tools are described. To give some feeling for the current state of the art, the application of these tools to some recently determined promoter sequences is also described. Finally, we discuss prospects for the future.

Eukaryotic Transcription Initiation

The biochemical mechanisms controlling transcription initiation in eukaryotes are currently under intense investigation. Recent advances are reviewed in, for example, Burley and Roeder (1996); Chao and Young (1996); Kaiser and Meisterernst (1996); Kornberg (1996); Novina and Roy (1996); Roeder (1996); Stargell and Struhl (1996); Verrijzer and Tjian (1996); Ptashne and Gann (1997); Smale (1997). Here we will attempt to summarize the conclusions most relevant to sequence analysis.

The so-called preinitiation complex (PIC) recognizes the core promoter and initiates transcription. The PIC includes, besides Pol II, the general initiation factors (or general transcription factors, GTFs) TFIIA, TFIIB, TFIID, TFIIE, TFIIF, and TFIIH. Each of these may itself be a multiprotein complex. TFIID, which consists of TATA-binding protein [TBP; the so-called TATA box is ∼25 bp upstream of the transcription start site (TSS) in metazoans] and several TBP-associated factors (TAFs), is the only one of these known to have site-specific DNA-binding ability (though several other GTFs are known to be in close contact with the DNA; cf. Coulombe et al. 1994). TBP is one of the major determinants of this DNA-binding specificity, and the consensus sequence or position weight matrix (PWM) often used to recognize the TATA box (Bucher 1990) is probably characterizing the DNA-binding specificity of TBP (see Singer et al. 1990; Wiley et al. 1992).

Around the TSS there is a loosely conserved initiator region (abbreviated Inr; for review, see Kaufmann et al. 1996; Smale 1997) that is one determinant of promoter strength and, in the absence of a TATA box, can determine the location of the TSS. To some extent, the TATA box and the Inr are interchangeable. For example, TFIID containing a mutated TBP defective in DNA binding cannot function on TATA-only promoters, but supports transcription from Inr-containing promoters (Martinez et al. 1995). There is evidence that several different proteins can bind to the Inr. Some of these seem to be capable of directing the initiation of transcription even in the absence of TBP (e.g., YY1; cf. Usheva and Schenk 1994). Javahery et al. (1994) (see also Purnell et al. 1994; Kraus et al. 1996) compare the sequence requirements for Inr activity in mammals to those for DNA binding of several proteins and to the initiation site characterization derived by Bucher (1990) and conclude that in most cases basic Inr activity is probably mediated by a single protein within the TFIID complex, though possibly modulated by others. On the other hand, TFIID (via TAFII150 or TAFII250), TFII-I, and Pol II all seem to have Inr-specific binding capacity and possible involvement in mediating Inr specificity of transcription initiation (for review, see Smale 1997).

Drosophila TAFII150 contacts the DNA as far as 35 bp 3′ of the transcription start site (Verrijzer et al. 1994) and could perhaps also be involved in functionally important patterns downstream of the Inr. Ince and Scotto (1995) identified a conserved region 20–45 bp downstream of the 3′-most TSS in a set of 14 promoters lacking both a TATA box and an Inr, and having a similar pattern of multiple start sites. This site, with consensus GCTCCS, was found to bind two proteins in a sequence-specific manner and, by mutation, was found to be essential for the pattern of TSS in at least one of the genes. Larsen et al. (1995) found a conserved motif, CTNCNG, at about +8 in a large-scale alignment of mammalian promoters. Burke and Kadonaga (1996) found an RGWCGTG motif at about +30 in a number of TATA-less Drosophila promoters. Mutation analysis demonstrated function, and footprinting showed TFIID binding. At present, the generality of these patterns is unknown.

To a first approximation, it seems that gene expression is controlled by a proximal promoter, which with the PIC determines the location of transcription initiation, together with a number of specific regulatory regions (often, but not always, 5′ to the proximal promoter), that specify the tissue, developmental stage, or biochemical context of gene expression (for an overview, see Tjian 1995). Usually each such regulatory region contains binding sites for a number of specific transcription factors, sometimes called activators or repressors, that seem to act synergistically. There may be many such regions, and they may either enhance or repress expression of the gene in particular circumstances (see Yuh and Davidson 1996 for an elegant example). Often these specific regulatory regions are active even if their location or orientation is changed, in which case they are termed enhancers. Enhancers may be located up to tens of thousands of base pairs from the TSS.

Transcription factor binding sites are typically 5–15 bp long. The nucleotide specificity at different positions within the site varies. For a site n long, the information content of the binding specificity is typically much less than the maximal 2n bits. Note that if a protein is to be sufficiently discriminatory to have a binding site only once every N bases, its binding specificity must have information content at least log2 N bits (cf. Schneider et al. 1986).
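The information-content argument can be made concrete with a minimal sketch. It assumes equiprobable background bases, so that a fully specified position carries 2 bits, and it uses a made-up table of base frequencies. The function names are illustrative and are not taken from Schneider et al. (1986).

```python
import math

def position_information(freqs):
    """Information (bits) at one position, assuming a uniform background.

    freqs maps each base to its observed frequency at this position.
    A fully specified position gives 2 bits; an unconstrained one gives 0.
    """
    return 2.0 + sum(f * math.log2(f) for f in freqs.values() if f > 0)

def site_information(columns):
    """Total information of a site: the sum over its positions."""
    return sum(position_information(col) for col in columns)

def required_bits(n_bases):
    """Minimum information needed for a site to occur once per n_bases."""
    return math.log2(n_bases)

# Hypothetical 4-position site (frequencies invented for illustration).
example_site = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},      # fully specified
    {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0},      # two-fold degenerate
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},  # unconstrained
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},      # fully specified
]

if __name__ == "__main__":
    n = len(example_site)
    print("site information (bits):", site_information(example_site))
    print("maximal information 2n (bits):", 2 * n)
    print("bits needed for one site per 10^6 bases:", required_bits(1e6))
```

In this toy case the site carries 5 of a possible 8 bits, far short of the roughly 20 bits needed to single out one position per million bases.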

Protein–protein interactions mediating synergistic action of multiple transcription factors may impose spacing constraints on the protein–DNA-binding sites. To take one example from among many, insertion of 5 bp (CCAAC) between a MyoD site and the TATA box in the desmin promoter was found to reduce myotube expression to 45% of normal, whereas insertion of 10 bp (CGGAGTGTCG) gave 85% of normal expression (Li and Capetanaki 1994).

There is also dependence between the DNA sequence at the binding site of one transcription factor and the ability of that factor to interact with another. For example, there has been evidence for over a decade that activator inducibility probably depends on the sequence of the core promoter (e.g., Struhl 1986). Emami et al. (1995) reviewed the field and tested various chimeric transcription factors with synthetic promoters containing a TATA box, an Inr, both, or neither. Among a number of interesting conclusions, they found that Sp1 contains multiple activation domains, one of which preferentially interacts with a core promoter containing an Inr. Another example of Inr/TATA differences is found in the FcγR1b gene, which contains a canonical Inr but not a TATA box. FcγR1b is normally expressed only in myeloid cells, and is γ-interferon (IFN-γ)- but not IFN-α-inducible. When a 3-bp mutation introduced a TATA box 30 bp upstream of the transcription initiation site, the altered gene responded to IFN-α as well as IFN-γ, and cell type specificity was lost (Eichbaum et al. 1994). In a few cases, detailed studies have shown that point mutations in the TATA box destroy the ability of an upstream enhancer binding transcription factor to up-regulate expression (e.g., Harbury and Struhl 1989; Diagana et al. 1997).

The mechanism by which core promoter sequence differences are translated into different receptivity to specific transcription factors remains unclear. In some cases, a conformational change may be involved. Diagana et al. (1997) showed that when base changes in the TATA box destroy muscle-specific activation of MyHC, the contacts between TBP and the TATA box also change. In some cases, the mechanism may be differing composition of the PIC. Human TAFII30 was found by Jacq et al. (1994) to be present in only some TFIID complexes and to be required for activation by the AF-2 containing region E of the human estrogen receptor. Similarly, some TAFs are almost certainly subject to alternative splicing (e.g., Weinzierl et al. 1993). It would be surprising if the core promoter sequence did not influence the makeup of the PIC and, hence, the possibility of activation by specific transcription factors.

There are transcription factors not part of, but very frequently acting in concert with, the PIC. For example, on the order of half of all vertebrate promoters contain a somewhat conserved sequence element with a core sequence similar to CCAAT (Benoist et al. 1980; Efstratiadis et al. 1980). There seem to be a large number of factors that interact with CCAAT-like sequences, not all of which are known to actually influence transcription initiation (see Tsutsumi et al. 1993 for a list). CCAAT box-binding factor (CBF, also called NFY and CP1) is a trimeric transcription factor that is known to be involved in the activity of a number of promoters (see Sinha et al. 1996 for an overview). CBF may recruit other common factors to many promoters as well (Wright et al. 1994). Consensus sequences for the DNA-binding sites of CBF match well a mathematical derivation (PWM) of CCAAT commonality between many promoters, so that CBF may be the major factor involved in CCAAT-box function (Bucher 1990). The heavily studied CCAAT/enhancer-binding protein (C/EBP) family (for overviews, see Zhao et al. 1993; Osada et al. 1996) contains at least six members with very similar DNA-binding specificity (Osada et al. 1996) and is known to activate transcription through the CCAAT box of at least some promoters (Cao et al. 1991). There are also repressors known to act through the CCAAT box (e.g., Pattison et al. 1997).

CpG islands (also known as HTF islands and MFIs) are regions of vertebrate genomes defined primarily by the lack of methylation at CpG doublets (for an overview, see Bird 1987). CpG islands are strongly associated with TSS, a fact that gives rise to experimental procedures for isolating promoters (e.g., Shago and Giguere 1996). 5-Methyl-C often mutates to T, so that in most vertebrate DNA CpG occurs at less than one-fourth the frequency expected from the C + G content. However, in CpG islands CpG is much less under-represented. This, together with a somewhat higher than average C + G-content, may allow discrimination of CpG islands in typical DNA sequence data, where the methylation pattern is unknown (e.g., Gardiner-Garden and Frommer 1987).
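As a rough sketch of how such discrimination might look in practice, the following computes the C + G content and the observed/expected CpG ratio of a window. The expected count is taken as (number of C × number of G) / length. The thresholds (length at least 200 bp, C + G above 50%, ratio above 0.6) are those commonly attributed to Gardiner-Garden and Frommer (1987) and are used here as assumptions, not as a prescription from the present text.

```python
def cpg_stats(seq):
    """Return (C+G fraction, observed/expected CpG ratio) for a sequence."""
    seq = seq.upper()
    n = len(seq)
    if n == 0:
        return 0.0, 0.0
    c = seq.count("C")
    g = seq.count("G")
    observed = seq.count("CG")   # "CG" cannot overlap itself
    expected = c * g / n         # expectation from base composition alone
    gc_fraction = (c + g) / n
    ratio = observed / expected if expected > 0 else 0.0
    return gc_fraction, ratio

def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply the assumed thresholds to a single candidate window."""
    gc_fraction, ratio = cpg_stats(seq)
    return len(seq) >= min_len and gc_fraction > min_gc and ratio > min_ratio

if __name__ == "__main__":
    rich = "CG" * 100           # CpG-rich toy window
    depleted = "CCCCGGGG" * 25  # same composition, few CpG doublets
    print(cpg_stats(rich), looks_like_cpg_island(rich))
    print(cpg_stats(depleted), looks_like_cpg_island(depleted))
```

The two toy windows have identical base composition and differ only in how often C is followed by G, which is exactly the property the ratio is meant to capture.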

Any model fully describing determinants of the transcription initiation site (and rate) will include not only discriminatory patterns in DNA sequence but also three-dimensional structure. Compare, for example, the partial explanation of sequence specificity in the TATA box based on the structure of the DNA–TBP complex (Juo et al. 1996); the competition between histones and transcription factors in gene activation/repression (for review, see Kingston et al. 1996); and the existence of transcription factors whose function seems to be reshaping the DNA to bring distant sites into proximity (see, e.g., Wolffe 1994). Unfortunately, the data available on the structural aspects of transcription initiation, particularly the data of general predictive value, remains minuscule compared to relevant data on sequence specificity of protein–DNA contacts, so that transcription factor binding sites will probably remain the focus of promoter recognition algorithms for some time.

Techniques and Resources

Because transcription initiation seems to be brought about by the cooperative binding of a number of proteins to the DNA, the primary computational approach to promoter recognition has been to combine modules recognizing individual binding sites, using some overall description of how these sites should be spatially arranged.

Sometimes binding specificity is characterized using consensus sequences, that is, by giving the most preferred base at each position within a site. But this approach loses much of the information and is of marginal utility. For example, the DNA-binding specificity of the (very large) basic helix–loop–helix family of transcription factors (e.g., Kadesch 1993) is often specified as CAnnTG. However, this pattern occurs about once every 256 bp. If all the factors of this family really bound so frequently and without differing specificity, they could certainly not accomplish their role of controlling terminal differentiation of many different tissue types. In fact, their binding is more specific and differs from factor to factor (e.g., cf. Hsu et al. 1994 and Wright et al. 1991).
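The figure of one match per 256 bp follows from the four specified positions of CAnnTG, each matching with probability 1/4 in random DNA with equiprobable bases. A minimal sketch of that arithmetic, together with a simple consensus scan, is given below. The helper names and the test sequence are hypothetical, and only the symbols A, C, G, T, and N are handled.

```python
import re

# Number of bases each consensus symbol accepts (a small subset of IUPAC).
DEGENERACY = {"A": 1, "C": 1, "G": 1, "T": 1, "N": 4}

def expected_spacing(consensus):
    """Average distance (bp) between matches in random, equiprobable DNA."""
    p = 1.0
    for symbol in consensus.upper():
        p *= DEGENERACY[symbol] / 4
    return 1 / p

def find_matches(consensus, seq):
    """0-based start positions of all (possibly overlapping) matches."""
    pattern = "".join("[ACGT]" if s == "N" else s for s in consensus.upper())
    # A lookahead lets the scan report overlapping occurrences as well.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", seq.upper())]

if __name__ == "__main__":
    print(expected_spacing("CAnnTG"))
    print(find_matches("CAnnTG", "ttCAGCTGaaCACGTGcc"))
```

Because CAnnTG is its own reverse complement, scanning one strand is enough for this particular pattern; a non-palindromic consensus would need both strands.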

A PWM assigns a weight to each possible nucleotide at each position of a putative binding site and gives as a site score the sum of these weights. It has been shown that in at least some cases this score approximates the energy of protein binding (Berg and von Hippel 1988 and references therein; cf. also Barrick et al. 1994). It is widely recognized that a PWM is a more informative description of a protein’s DNA-binding specificity than is a consensus sequence, and PWMs are often used where enough information is available to build them. Frech et al. (1997a,b) have reviewed both tools for building the PWM (specialized multiple local alignment algorithms) and tools used to search for putative transcription factor binding sites. The statistical significance of PWM match scores has been treated by Hofmann and Bucher (1995) and Claverie and Audic (1996).
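A minimal sketch of PWM construction and scanning follows. It assumes log-odds weights against a uniform background with a pseudocount, which is one common choice and not necessarily the weighting used by any of the cited authors. The aligned sites are invented for illustration.

```python
import math

def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build log2-odds weights from aligned sites of equal length."""
    length = len(sites[0])
    total = len(sites) + 4 * pseudocount
    pwm = []
    for i in range(length):
        column = [site[i] for site in sites]
        weights = {}
        for base in "ACGT":
            freq = (column.count(base) + pseudocount) / total
            weights[base] = math.log2(freq / background)
        pwm.append(weights)
    return pwm

def score_site(pwm, site):
    """Site score: the sum of the per-position weights."""
    return sum(column[base] for column, base in zip(pwm, site))

def scan(pwm, seq):
    """Score every window of the sequence; returns (start, score) pairs."""
    width = len(pwm)
    return [(i, score_site(pwm, seq[i:i + width]))
            for i in range(len(seq) - width + 1)]

if __name__ == "__main__":
    # Hypothetical aligned TATA-like sites (not taken from EPD).
    toy_sites = ["TATAAA", "TATAAA", "TATATA", "TATAAG"]
    pwm = build_pwm(toy_sites)
    best_start, best_score = max(scan(pwm, "GGCTATAAAGGC"), key=lambda x: x[1])
    print(best_start, round(best_score, 2))
```

Summing the weights is exactly the independence assumption: each position contributes to the score without regard to its neighbors.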

The PWM methodology is predicated on the hypothesis that different positions within the site make independent contributions to binding. Although a number of cases are known where this approximation seems to be a reasonable one (e.g., Berg and von Hippel 1988 and references therein; Fickett 1996c), most who have used PWMs know of cases where the method gave poor results. This could be attributable to many reasons, for example, the existence of multiple isoforms of the protein, leading to different classes of sites (e.g., Andres et al. 1995), or alternative protein conformations induced by the DNA structure (e.g., Bonven et al. 1995), leading to correlated preferences at different positions. It will probably be important to apply nonlinear methods of separation (and perhaps develop new ones) for this problem. Nonlinear methods have been successfully applied in the recognition of splicing junctions. Brunak et al. (1991) used multilayer neural nets; Burge and Karlin (1997) used decision trees; and a number of investigators have used position-specific oligonucleotide counts (e.g., Solovyev and Salamov 1997 and references therein).

To build any model of the DNA-binding specificity of a protein, one needs a number of known sites (it would be valuable to have the strength of the sites as well, but this information is rarely available). For core promoter elements the best data source may be the Eukaryotic Promoter Database (EPD; Bucher and Trifonov 1986), a collection of experimentally mapped TSSs and surrounding sequences. For other transcription factors, one traditional data source has been the Transcription Factor Database (TFD; Ghosh 1990), but this database is no longer maintained. Currently maintained collections include TRANSFAC (Wingender et al. 1996) and the Transcription Regulatory Region Database (TRRD; Kel et al. 1994). If one is interested in a particular factor, there is no substitute for reading the literature to find both natural sites and random oligonucleotide selection data (for an overview, see Wright and Funk 1993), and understanding the degree of evidence for each putative site. For hundreds of recently discovered transcription factors, binding site data may be scarce or absent. In some cases, it may be possible to predict the specificity of a new factor from that of a closely related factor whose specificity is known (e.g., Choo and Klug 1994; Suzuki and Yagi 1994).

Bucher (1990) constructed PWMs for several core promoter elements; these are widely used in promoter recognition algorithms. PWMs for many specific transcription factors have been collected in TRANSFAC and TRRD (see also Chen et al. 1995). Because some of the sites used to build these matrices have questionable experimental support, one should exercise caution in applying them.

Most of the work in this area has centered around characterizing transcription factor binding sites and their relative localization. Approaching a different aspect of the problem, Benham (1996) has described methods to predict regions of helix destabilization, likely to coincide with certain gene features, including transcriptional regulatory regions. Also, the advent of large-scale model organism sequencing allows one to identify functionally important regions of all kinds (though not to differentiate between the different possible functions) by means of sequence conservation. The application of this technique, termed phylogenetic footprinting, to the discovery of gene regulatory regions has been reviewed by Duret and Bucher (1997).

Available Promoter Prediction Tools

In this section we describe publicly available software tools for locating promoters in DNA sequence. To gain some idea of how the tools perform in practice, we tested them on a small sample of recently determined sequences in which the transcription initiation site has been experimentally mapped. We collected 18 published mammalian sequences containing 24 promoters (Table 1) in a total of 33120 bp. Two of these sequences were not found in GenBank (as of February 20, 1997); the others were dated no earlier than May 16, 1996. None of them matches a sequence in EPD (either at the level of identity or at the level of clear homology). Thus, we believe that these represent an independent test set, not overlapping in any significant way the sequences used in the development of the tools described below.

Table 1.

Mapped and Predicted Transcription Start Sites

Each tool was used with the default settings and was tested in early March 1997 (most of the on-line services do not give version numbers). The computer predictions are given alongside the mapped TSS in Table 1. It is difficult to summarize the degree of agreement of the computer predictions with experimental results, because of ambiguities in the results on both sides. Experimental accuracy may be impacted by mRNA degradation, which can lead to the mapped location of the TSS being 3′ to its true location. Some programs aim to locate the TSS exactly, tolerating a high false-positive rate, with the idea that the approximate location will already be known. Some are intended to analyze large genomic sequences and have as their goal the approximate localization of promoters or gene starts. We evaluated only the ability to approximately locate the TSS itself. If a program gave a promoter prediction but not an explicit TSS, we took the 3′ end of any predicted promoter window as the predicted TSS. The predicted TSS, explicit or implicit, was counted as correct if it was within 200 bp 5′, or 100 bp 3′, of any experimentally mapped TSS. Given these criteria, accuracy results are summarized in Table 2. Because of the limited sample size and the possibly skewed nature of the sample (discussed below), results should be taken as provisional and perhaps pessimistic.
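The scoring criterion can be sketched as follows. The sketch assumes that the 200 bp and 100 bp limits are inclusive, that a prediction must lie on the same strand as the mapped TSS, and that several predictions hitting the same TSS count as a single identified promoter. None of these details is spelled out in the text, and the function names are illustrative.

```python
def is_correct(predicted, mapped_tss, strand="+", upstream=200, downstream=100):
    """True if the prediction lies within `upstream` bp 5' or `downstream`
    bp 3' of the mapped TSS, measured along the given strand."""
    offset = predicted - mapped_tss
    if strand == "-":
        offset = -offset  # on the minus strand, 5' means higher coordinates
    return -upstream <= offset <= downstream

def evaluate(predictions, mapped_sites):
    """Count identified TSSs and false-positive predictions.

    Both arguments are lists of (position, strand) tuples.
    """
    found = set()
    false_positives = 0
    for pos, strand in predictions:
        hits = [i for i, (tss, tss_strand) in enumerate(mapped_sites)
                if tss_strand == strand and is_correct(pos, tss, strand)]
        if hits:
            found.update(hits)
        else:
            false_positives += 1
    return len(found), false_positives

if __name__ == "__main__":
    mapped = [(1000, "+"), (5000, "-")]                      # hypothetical
    preds = [(850, "+"), (1150, "+"), (5150, "-"), (3000, "+")]
    print(evaluate(preds, mapped))
```

The asymmetric window tolerates more error on the 5′ side than on the 3′ side.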

Table 2.

Program Accuracy

Audic/Claverie

Audic and Claverie (1997) construct Markov models of vertebrate promoter sequences (based on EPD) and nonpromoter sequences (based on regions adjacent to the promoters used). For an arbitrary test window a Bayesian choice is then made between the promoter and nonpromoter hypotheses. This program (available at audic{at}newton.cnrs-mrs.fr) identified 5 (21%) of the true promoters and reported 33 false positives, or 1/1004 bp (here and below it is base pairs, not single-strand bases, that are counted).
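The general idea of a Bayesian choice between two Markov models can be sketched as below. This is an illustration of the technique only, not a reconstruction of the Audic and Claverie program: the model order, the pseudocount, the prior odds, and the toy training sequences are all assumptions.

```python
import math
from collections import defaultdict

def train_markov(seqs, order=1, pseudocount=1.0):
    """Return a function giving log P(base | preceding `order` bases)."""
    counts = defaultdict(lambda: defaultdict(float))
    for seq in seqs:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1

    def log_prob(context, base):
        row = counts.get(context, {})
        total = sum(row.values()) + 4 * pseudocount
        return math.log((row.get(base, 0.0) + pseudocount) / total)

    return log_prob

def log_likelihood(log_prob, seq, order=1):
    """Log-likelihood of a window under one model (initial bases ignored)."""
    return sum(log_prob(seq[i - order:i], seq[i])
               for i in range(order, len(seq)))

def classify(window, promoter_model, background_model, order=1,
             log_prior_odds=0.0):
    """Choose 'promoter' when the posterior log-odds are positive."""
    llr = (log_likelihood(promoter_model, window, order)
           - log_likelihood(background_model, window, order))
    return llr + log_prior_odds > 0, llr

if __name__ == "__main__":
    # Toy training sets, invented for illustration only.
    promoter_like = ["GCGCGCGCGCGC", "CGCGGCGCGCCG"]
    background_like = ["ATATTATAATAT", "TTATATAATTAA"]
    pm = train_markov(promoter_like)
    bm = train_markov(background_like)
    print(classify("GCGCGCGC", pm, bm))
    print(classify("ATATATAT", pm, bm))
```

In genomic use the prior odds would strongly favor the nonpromoter hypothesis, which corresponds to a negative `log_prior_odds` and a stricter decision.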

Autogene

Autogene (available by ftp from ftp.bionet.nsc.ru; directory pub/biology/aug) includes a module for promoter recognition (Kondrakhin et al. 1995). The program utilizes a set of 136 consensus sequences for transcription factor binding sites collected by Faisst and Meyer (1992). A training set of 472 promoters was taken from the EMBL Database, based on annotation in EPD and EMBL. The occurrence frequencies for each of the consensus sequences in ∼50 fixed length subregions of the promoters were determined. In a test sequence, an occurrence of one of the consensus sequences in one of the subregions was weighted according to the frequency with which it occurred in that subregion in a certain subset of the training set (determined by a clustering algorithm based on the consensus site occurrences) and the expected frequency of occurrence in random DNA. In most cases, the program suggested a range of a few base pairs, of which we took the last as the prediction. Autogene identified 7 (29%) of the true promoters and reported 51 false positives, or 1/649 bp.

GeneID/Promoter1.0

An unpublished promoter-finding algorithm, developed by S. Knudsen (Technical University of Denmark), is included in the GeneID e-mail server (send “help” to geneid{at}darwin.bu.edu). According to the on-line documentation, “Promoters are predicted by a program called promoter1.0. It has been developed as an evolution of simulated transcription factors that interact with sequences in promoter regions.” In our tests promoter1.0 identified 10 (42%) of the promoters, and reported 51 false positives (1/649 bp).

NNPP

NNPP (M. Reese, http://www-hgc.lbl.gov/inf/nnpp-abstract.html) combines recognition of the TATA box and the Inr, using the time delay neural net architecture, which allows for variable spacing between the features. We tested the algorithm using the on-line service at http://www-hgc.lbl.gov/projects/promoter.html. When tested on our data set NNPP identified 13 of the 24 promoters (54%) and reported 72 false positives (1/460 bp). [At the optional threshold 0.9, 7 (29%) of the promoters were identified, and 31 false positives (1/1068 bp) were reported.]

PromFind

PromFind (Hutchinson 1996) is not based on any collection of putative transcription factor binding sites but, rather, on the differences in nucleotide hexamer frequencies (following Claverie and Bougueleret 1986) between promoters, protein coding regions, and noncoding regions downstream of the first coding exon. Training and testing sets were taken from some of the GenBank sequences with corresponding entries in EPD. Among all sites in an input sequence where the promoter versus coding region discriminant exceeds a certain threshold, the site where the promoter versus noncoding region discriminant reaches its maximum (over the input sequence) is taken as a promoter. PromFind (taken from the ftp site iubio.bio.indiana.edu, directory molbio/ibmpc; for future versions, see also http://www.rabbithutch.com) identified 7 of the 24 promoters (29%) and reported 29 false positives (1/1142 bp).

PromoterScan

PromoterScan (Prestridge 1995) recognizes primate promoters by means of (1) the TATA PWM from Bucher (1990), and (2) the density of specific transcription factor binding sites. In calibration, occurrences of each transcription factor binding site listed in TFD were counted in EPD primate sequences and in primate nonpromoter sequences from GenBank. The ratio of the densities of occurrence in each of these two sets is used as a weighting factor for that site. Then in application, the weighting factors for those sites occurring in the test sequence are combined with a TATA box score. The algorithm sometimes suggests a TSS and sometimes only gives a 250-bp window within which a core promoter sequence is thought to occur. In the latter case, we took the end of the window as the predicted TSS. In our tests (at http://biosci.cbs.umn.edu/software/proscan/promoterscan.ht) PromoterScan identified three (13%) of the known promoters and predicted six apparent false positives, or 1/5520 bp.
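The density-ratio weighting can be illustrated with a minimal sketch. The text does not say how the weights are combined with the TATA score, so simple addition is used here as an assumption, as is the choice to skip sites never seen in the nonpromoter set. The site names and counts are hypothetical, and the code is not the PromoterScan implementation.

```python
def site_weights(promoter_counts, nonpromoter_counts,
                 promoter_bp, nonpromoter_bp):
    """Weight per site: density in promoters / density in nonpromoters.

    Counts are occurrences per site name; *_bp are total set lengths.
    Sites with zero nonpromoter density are skipped (an assumption).
    """
    weights = {}
    for site, n_promoter in promoter_counts.items():
        density_promoter = n_promoter / promoter_bp
        density_other = nonpromoter_counts.get(site, 0) / nonpromoter_bp
        if density_other > 0:
            weights[site] = density_promoter / density_other
    return weights

def window_score(sites_in_window, weights, tata_score=0.0):
    """Combine site weights with a TATA score by summation (assumed)."""
    return tata_score + sum(weights.get(site, 0.0) for site in sites_in_window)

if __name__ == "__main__":
    # Hypothetical calibration counts.
    weights = site_weights({"Sp1": 30, "AP-2": 10},
                           {"Sp1": 10, "AP-2": 20},
                           promoter_bp=10_000, nonpromoter_bp=20_000)
    print(weights)
    print(window_score(["Sp1", "Sp1", "AP-2"], weights, tata_score=2.0))
```

A site equally dense in both sets gets weight 1 and so carries no discriminating evidence under a ratio interpretation, which is one reason a real implementation might combine weights differently.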

TATA

Because many investigators rely heavily on the TATA box to help locate a possible promoter, we also tested the TATA PWM from Bucher (1990) as an independent predictor. Bucher found that most TATA boxes were centered at a point 20–36 bp upstream of the TSS, so we took the point 28 bp downstream of the center of the putative TATA box as the predicted TSS. At the recommended cutoff score (−8.16) the TATA PWM gave 159 predictions in our test set. We used a more restrictive cutoff, namely −6.5, that gave 54 predictions, more in line with the other methods. With these parameters the TATA PWM identified 6 (25%) of the known promoters and predicted 47 apparent false positives (1/705 bp).

TSSG and TSSW

TSSG and TSSW (Solovyev and Salamov 1997) both use the same underlying algorithm, which uses a linear discriminant function combining (1) a TATA box score, (2) triplet preferences around the TSS, (3) hexamer preferences in the regions −1 to −100, −101 to −200, and −201 to −300 relative to the TSS, and (4) potential transcription factor binding sites. TSSG is based on the promoter.dat file derived from TFD by Prestridge (1995), whereas TSSW is based on TRANSFAC. TSSG and TSSW were accessed at the site http://dot.imgen.bcm.tmc.edu:9331/gene-finder/gf.html. TSSG correctly predicted 7 (29%) of the true promoters and predicted 25 false positives (1/1325 bp). TSSW correctly predicted 10 (42%) of the true promoters and gave 42 false positives (1/789 bp).

Algorithms Not Included in the Test Results

GRAIL includes promoter recognition as one component of integrated gene structure prediction (Matis et al. 1996). The promoter recognition module combines matrix scores for the TATA-, GC- and CAAT-boxes, the Inr, and the translation start site with constraints on the distances between these elements, using a neural network. Then several rules are applied to combine this independent evidence for a promoter with the expected location of a promoter based on predicted coding exons. The independent promoter component is not available separately; we tested the integrated algorithm using the XGRAIL interface (ftp arthur.epm.ornl.gov, directory pub/xgrail), but these results cannot be compared directly with those for the tools considered above. In the test set used here, GRAIL was unable to find the promoters because the coding regions were not included. In sequences with complete genes, GRAIL performed better than the other algorithms (data not shown), but it is difficult to judge how well this reflects the performance of the promoter module per se. The program of Chen et al. (1997) also makes predictions that are not comparable with the others, being non-strand-specific. The method of Crowley et al. (1997) was published after the benchmarking here had been carried out. Descriptions of other possible promoter recognition methods may be found in Larsen et al. (1995); Hatzigeorgiou et al. (1996); and Pedersen et al. (1996).

DISCUSSION

The accuracies of the various programs are plotted in Figure 1, where it may be seen that the number of true positives is approximately a constant fraction of the total number of predictions. For comparison we also show a line on which the accuracy of completely random predictions would fall.

Figure 1.

Each point plotted represents the accuracy of one program, with the abscissa being the total number of predictions made by the program, and the ordinate being the number of correct predictions. For comparison the line y = 0.11x is plotted. 0.11 is the fraction of all bases in the test set where a prediction would be counted as correct, so that points on the line would reflect the accuracy, on average, of random predictions.
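The comparison with the random-prediction line can be worked through numerically for the programs whose counts are quoted in this section. In the sketch below, the totals for PromoterScan, TSSG, and TSSW are assumed to equal correct predictions plus reported false positives, whereas the TATA PWM total of 54 is the figure stated in the text. Programs described elsewhere in the review are omitted.

```python
RANDOM_HIT_FRACTION = 0.11  # slope of the random-prediction line in Figure 1

# (total predictions, correct predictions) per program.
reported = {
    "PromoterScan": (9, 3),   # total assumed to be 3 correct + 6 false positives
    "TATA PWM": (54, 6),      # total as stated in the text
    "TSSG": (32, 7),          # total assumed to be 7 correct + 25 false positives
    "TSSW": (52, 10),         # total assumed to be 10 correct + 42 false positives
}


def compare_to_random(total, correct, fraction=RANDOM_HIT_FRACTION):
    """Return (expected correct calls at random, observed / expected)."""
    expected = fraction * total
    return expected, correct / expected


for name, (total, correct) in reported.items():
    expected, ratio = compare_to_random(total, correct)
    print(f"{name}: {correct} correct of {total} predictions; "
          f"about {expected:.1f} expected at random; ratio {ratio:.1f}")
```

A ratio near 1 corresponds to a point lying on the line y = 0.11x, and larger ratios correspond to points above it.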

The results presented here should not be used to compare the various programs among themselves (except perhaps to note that no technique used to date is obviously superior to the others), in part because the test set is small for this purpose. Also, the programs use somewhat different definitions of the problem and are not really directly comparable. Our tests were in some sense unfair for each program, usually in a unique way for each. For example, PromFind is intended to locate the promoter when one already knows the approximate gene location and the coding strand, and so it makes exactly one prediction, on the strand presented, in each sequence it is given to analyze; but we had multiple promoters in some sequences, and we tested both strands of each sequence with each program. An examination of the test results in light of each program’s design goals will still show, however, that our conclusions about the general state of the field are not materially affected.

At the default settings, the algorithms we tested found 13%–54% of the true promoters in our test set. However, in the test sets used by the developers the correct prediction rates were higher, and it must be noted that the test set we used was perhaps not representative. It is possible that the way we chose the test set, namely searching recent issues of journals with a focus on transcriptional regulation, retrieved promoters that are active in very specialized contexts. Furthermore, in two cases there are fewer nucleotides upstream of the experimentally mapped TSS than are required for the analysis window of some of the programs. Nevertheless, investigators do need to analyze sequences like the ones in our test set, and the test results do suggest that the challenge of finding all promoters reliably is far from being met.

The programs reported on the order of one false positive per kilobasepair. On the surface, this suggests that if they were applied to a mammalian genome as a whole (with approximately one gene per few tens of kilobases), they would give a few tens of false positives for each real gene. This too may be misleading, however. Because most of the algorithms make use of transcription factor binding site density, they may be expected to give a high signal on enhancers as well as promoters. And although enhancers may be found anywhere up to tens of kilobases away from the TSS, they tend to be more concentrated near the promoter. Thus, it is quite possible that current tools have simply not developed far enough to differentiate reliably between promoters and enhancers and that some of the false positives are in fact true transcriptional regulatory regions. On the other hand, it is also possible that some of the true positives in this set, where the promoter density is high, are attributable to chance and that the false-positive rate would be higher in general genomic DNA.
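The naive extrapolation at the start of the preceding paragraph is simple arithmetic, sketched below using the false-positive rates quoted earlier in this section. The spacing of 30 kb per gene is only an illustrative choice within "a few tens of kilobases", and the calculation ignores all of the caveats about enhancers and promoter density just discussed.

```python
# Base pairs per apparent false positive, as quoted for each program.
BP_PER_FALSE_POSITIVE = {
    "PromoterScan": 5520,
    "TATA PWM": 705,
    "TSSG": 1325,
    "TSSW": 789,
}

ASSUMED_BP_PER_GENE = 30_000  # illustrative assumption, not a measured value


def false_positives_per_gene(bp_per_false_positive, bp_per_gene=ASSUMED_BP_PER_GENE):
    """Expected false positives per gene if the test-set rate held genome-wide."""
    return bp_per_gene / bp_per_false_positive


for name, bp_per_fp in BP_PER_FALSE_POSITIVE.items():
    print(f"{name}: about {false_positives_per_gene(bp_per_fp):.1f} "
          f"false positives per gene")
```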

Although our current knowledge of transcription initiation is still far from complete, it is clear that considerable information is available that has not yet found its way into current algorithms. Given the advances in our understanding of promoters gained from experimental methods in the last few years, there are grounds for cautious optimism that better algorithms can, in fact, be developed.

Wherever a consensus sequence, a PWM, or other recognition module is built to discern the binding sites of a protein, it is probably worth taking the time to fully evaluate the experimental data available, as well as using the latest computational techniques. To quote Frech et al. (1997b), “perhaps more time and effort should be invested in improving the quality of matrix libraries rather than in developing new algorithms to calculate matrix scores.”

However, it will be many years before the majority of transcription factors and their DNA-binding specificities become known. One natural way to try to improve promoter prediction would be to concentrate on the core promoter elements. For example, (1) an evaluation of the Bucher TATA matrix on a large number of TATA boxes with proven function would be valuable. Also, given the dependence of activator function on TATA sequence, it would be worth attempting nonlinear recognition methods, such as neural nets or quadratic discriminant analysis. (2) The very low information content of the overall Inr consensus (Javahery et al. 1994), together with the evidence for involvement of multiple protein families and the existence of conserved elements that occur in some but not all sequences downstream of promoters, suggests that it might be worthwhile to attempt either cluster analysis or nonlinear discrimination of proven, functional Inr sequences. (3) The CCAAT box pattern most used in current algorithms, namely that of Bucher (1990), was derived not from a biological definition, but from a computational one. Bucher’s algorithm was, very roughly, to find a linearly definable pattern common to many promoters and with a strong similarity to CCAAT. Now that several proteins are known to recognize a similar pattern and to be involved in transcription initiation, it seems worth investigating whether there are different classes of CCAAT boxes corresponding to the different proteins.

Acknowledgments

This work was supported by SmithKline Beecham Pharmaceuticals, Synaptic Ltd., and U.S. Public Health Service grant HG00981-01A1 from the National Center for Human Genome Research. We thank P. Agarwal, J.-M. Claverie, M. Gelfand, I. Grosse, R. Guigo, W. Wasserman, and M. Zhang for valuable comments on the work.

Footnotes

  • 3 Corresponding author.

  • E-MAIL ficketjw{at}molbio.sbphrd.com; FAX (610) 270-5580.

REFERENCES
