The Significance of Digital Gene Expression Profiles

Stéphane Audic; Jean-Michel Claverie

doi:10.1101/gr.7.10.986

Abstract

Genes differentially expressed in different tissues, during development, or during specific pathologies are of foremost interest to both basic and pharmaceutical research. “Transcript profiles” or “digital Northerns” are generated routinely by partially sequencing thousands of randomly selected clones from relevant cDNA libraries. Differentially expressed genes can then be detected from variations in the counts of their cognate sequence tags. Here we present the first systematic study on the influence of random fluctuations and sampling size on the reliability of this kind of data. We establish a rigorous significance test and demonstrate its use on publicly available transcript profiles. The theory links the threshold of selection of putatively regulated genes (e.g., the number of pharmaceutical leads) to the fraction of false positive clones one is willing to risk. Our results delineate more precisely and extend the limits within which digital Northern data can be used.

Very large-scale, single-pass partial sequencing of cDNA clones from a large number of libraries has led to the identification of ∼50,000 human genes (Adams et al. 1995; Aaronson et al. 1996;Hillier et al. 1996). However, a precise function or a complete transcript sequence are known for <5000 of these (Adams et al. 1995; Boguski and Schuler 1995). In the absence of functional clues for most of the newly identified genes, evidence of differential expression is the most important criteria to prioritize the exploitation of anonymous sequence data in both basic and pharmaceutical (Nowak 1995; Adams 1996; Bains 1996; Editorial 1996) research. For example, the study of expression profiles in various tumors is central to the new Cancer Genome Anatomy project (Kuska 1996;O’Brien 1997). In contrast to functional assays, the quantitative analysis of gene expression level lends itself to large-scale implementation. Two main approaches have been proposed (1) “analog” methods based on hybridization to arrayed cDNA libraries (Lennon and Lehrach 1991; Gress et al. 1992; Nguyen et al. 1995; Schena et al. 1995; Zhao et al. 1995) or oligonucleotide “chips” (Fodor et al. 1991; Southern et al. 1992; Guo et al. 1994; Matson et al. 1995); and (2) “digital” methods, based on the generation of sequence tags. This paper focuses on the latter. The sequence tag-based method (Okubo et al. 1992; Matsubara and Okubo 1994) consists of generating a large number (thousands) of expressed sequence tags (ESTs) (Adams et al. 1991; Wilcox et al. 1991; Adams et al. 1992; Khan et al. 1992) from 3′-directed regional non-normalized cDNA libraries. Recently, Velculescu et al. (1995) have introduced the serial analysis of gene expression (SAGE). Although tags are 100–300 nucleotides in length in the original EST approach, the SAGE method only requires nine nucleotides, therefore allowing a larger throughput. In both protocols, the number of tags is reported to be proportional to the abundance of cognate transcripts in the tissue or cell type used to make the cDNA library. The variation in the relative frequency of those tags, stored in computer databases, is then used to point out the differential expression of the corresponding genes: This is the concept of a “digital Northern” comparison. In the absence of a sound theoretical framework, the validity of the method has only been verified for a handful of genes in the context of two cellular differentiation systems (Lee et al. 1995; Okubo et al. 1995) inducible in vitro. Yet, with a total number of human genes of ∼80,000 or more, it is not intuitive that sequencing a mere few thousand tags (a typical experiment) from highly redundant non-normalized cDNA libraries can produce a useful picture, or realistic “transcript profile,” of a given tissue, development stage, or cell type. What variations in tag numbers allow for a reliable inference about differential expression? How many tags should be generated? Here we present the statistical framework required to answer those questions and analyze transcript profiles in a quantitative manner.

RESULTS

In Methods we establish the probability distribution governing the occurrence of the same rare event in duplicate experiments. This probability distribution is a general result applicable to a wide variety of experimental situations, although this paper focuses on its use to analyze digital gene expression patterns. The main and only mathematical assumption behind the derivation is that the observed events are rare and part of a large population of possible outcomes (the distribution of which is not specified). In the context of a digital Northern, one event is the observation of a given cDNA sequence tag, and the experiment consists of the random picking and partial sequencing of a number N of cDNA clones. Given the usual complexity (i.e., the number of different genes expressed) of cDNA libraries, observing a given cDNA qualifies as a rare event, as the abundance of most individual messages is of the order of a few percents or less.

Random Fluctuation vs. Significant Change in Tag Number: When to Infer Differential Expression

Let us randomly pick N = 1000 clones from a cDNA library and generate the corresponding sequence tags; a given message (e.g., interleukin-2) will be picked x (e.g., two) times, withx in a typical (0–10) range. If we now redo this experiment, that is, again pick 1000 clones and generate the tags, the same message will now be picked y (e.g., 3) times. If the experiments have been duplicated correctly and the clones selected at random, we expectx and y to be close, albeit often different because of random fluctuations. In the Methods section, we show that the expected probability of observing y occurrences of a clone already observed x times is given by the simple formula:

{ label needed for disp-formula[@id='E1'] }

p (y | x) = \frac{(x + y)!}{x! y! 2^{(x + y + 1)}}

Equation { label needed for disp-formula[@id='E1'] } can be used to compute a confidence interval [y_min, y_max]_ε within which we expect to find y with a given probability, noted 1–2ε, where 2ε is the significance level. For ε small (e.g., 2.5% or less), y values falling outside the [y_min, y_max]_ε interval correspond top(y | x) << 1, therefore pointing out very unlikely random fluctuations between the two experiments. The confidence intervals for the usual 1% and 5% significance levels are given in Table 1.

Table 1.

Confidence Intervals in Function of the Value ofx

[i] The value of x (first column), one of the occurrence numbers. The intervals are given for the 95% (2 ɛ = 0.05) and 99% (2ɛ = 0.01) confidence levels. Up to x = 20, the exact boundaries, immediately outside the confidence interval (first significantly different values) are indicated. A star is used when none are possible. For larger values, the boundaries are given as percentages to be subtracted or added to x. Ricker’s confidence interval characterizes the value of λ, not y(see Methods). The use of a flat p (λ) prior distribution results in the most stringent test, as expected. Although the number (N) of clones sampled does not appear in the expression ofp(y|x) (Equation { label needed for disp-formula[@id='E1'] }), its influence shows in the fact that the confidence interval becomes proportionally smaller asx (and y) increases (e.g., 1 ➝ 7 has the same statistical significance as 40 ➝ 60). For the same expression level, larger N will result in larger absolute values forx and y, making the detection of significant differential expression more sensitive.

Open in new tabLink to table

The same confidence intervals listed in Table 1 can in fact be used to analyze the results of sampling N clones from two different libraries. Provided all experimental factors are well replicated, significant discrepancies between x (from one library) andy (from the other) will now characterize differentially expressed genes, for example, the relative abundance of which is unlikely to be the same in the two libraries. Simply reading Table 1, we see that variations in counts such as 7 → 0, or 2 → 12 are significant (P < 0.01) evidence of regulated gene expression, whereas variations such as 3 → 0 or 8 → 16 are not (P > 0.05). However, we do not advocate the use of rigid significance thresholds to analyze digital transcript profiles, as discussed below.

Influence of the Sampling Size

Surprisingly at first, p(y | x) in Equation { label needed for disp-formula[@id='E1'] } does not involve the sampling size N, that is, the total number of picked clones. The fluctuation probabilities, and confidence intervals, depend only on the values of the observed counts. To understand why, we must remember that Equation { label needed for disp-formula[@id='E1'] } governs the results of strictly duplicated experiments. Given N clones are sampled, the most likely tags to be picked up are, intuitively, those corresponding to cDNA, the abundance of which is of the order of 1/N, or larger (according to Equation { label needed for disp-formula[@id='E3'] }, the probability of finding a given cDNA with 1/N abundance while picking upN clones is 0.63, see also Equation { label needed for disp-formula[@id='E13'] }). Choosing a sampling size therefore corresponds to targeting a given subset of genes, the level of expression of which allows their tags to occur at reasonable frequencies.

As expected, more reliable inferences can be made on clones corresponding to larger absolute frequencies (i.e., the ones more often picked up). For example (see Table 1), a variation in counts from 1–3 (threefold increase) is not indicative of a significant (P < 0.05) increase, whereas a variation from 4–12 is significant at P < 0.05, and a variation from 7–21 is significant at P < 0.01. For a gene expressed at a given rate, increasing the sampling size N leads to higher tag counts, and allows more stringent statistical inference to be made, for the same proportional variation.

Most often in practice one wishes to compare digital Northerns or gene profiles that have been computed from the random picking of different numbers of clones, N ₁ and N ₂. The mathematical problem is now to establish the probability for a given cDNA (e.g., interleukin-2) to be picked up x times when the sampling size was N ₁ and y times when the sampling size was N ₂. Equation { label needed for disp-formula[@id='E1'] } then becomes (see Methods):

{ label needed for disp-formula[@id='E2'] }

p (y | x) = {(\frac{N_{2}}{N_{1}})}^{y} \frac{(x + y)!}{x! y! {(1 + \frac{N_{2}}{N_{1}})}^{(x + y + 1)}}

Whereas Equation { label needed for disp-formula[@id='E1'] } applied to the analysis of fluctuation in counts in strictly identical experiments, Equation { label needed for disp-formula[@id='E2'] } now applies to the analysis of counts in experiments only differing by the total number of clones randomly picked up. In practice, Equation { label needed for disp-formula[@id='E2'] } will be used to analyze experiments performed on two different libraries, using different sampling sizes. As for Equation { label needed for disp-formula[@id='E1'] }, smallp(y | x) are expected to characterize the genes exhibiting regulated expression, the relative abundance of which is unlikely to be the same in the two libraries.

Comparison with Fisher’s (2 × 2) Exact Test

The (2 × 2) contingency tables arising from treatment versus control experiments are traditionally analyzed with Fisher’s exact test (Siegel 1956; Agresti 1996). Differential EST count data can be presented in a tabulated form so as to suggest the use of this test, as follows: $\begin{matrix} Brain cDNA library & Liver cDNA library \\ Number of actin ESTs & 2 & 11 \\ Number of other ESTs & 998 & 1189 \\ Total clones sampled & 1000 & 1200 \end{matrix}$

The statistical significance according to Fisher’s exact test for such a result is 4.6% (two-tail P-value, i.e., the probability for such a table to occur in the hypothesis that actin EST frequencies are independent of the cDNA libraries). In comparison, theP-value computed from the cumulative form (Equation 9, see Methods) of Equation { label needed for disp-formula[@id='E2'] } (i.e., for the relative frequency of actin ESTs to be the same in both libraries, given that at least 11 cognate ESTs are observed in the liver library after two were observed in the brain library) is 1.6%. Fisher’s (2 × 2) exact test is always more conservative than our test (e.g., Fisher’s P-value of 1.6% requires a 2 → 13 EST count transition in the above setting). Besides being too conservative, there is a more fundamental difficulty in using this test to analyze EST count data. The sampling scheme assumed by Fisher’s exact test in principle requires the total number of data values in the contingency table to be fixed, as well as both the row marginal total and the column marginal totals. In our prospective experimental situation, only the column marginals (i.e., the numbers of clones sampled from each library) are fixed. The extension of Fisher’s exact test to cases where only one set of marginal totals is fixed (Tocher 1950) is still controversial. In the context of the above EST counting results, there is an additional problem with the lack of homogeneity in the definition of the “other EST” category. This category represents different subsets of transcripts for different libraries.

The use of Fisher’s (2 × 2) exact test is more natural for a different type of EST data analysis: the study of library-dependent alternative transcripts of the same gene (i.e., splice or polyadenylation variants) (D. Gautheret, O. Poirot, F. Lopez, S. Audic, and J.-M. Claverie, in prep.). Here, the results for an hypothetical gene G₁ may look as follows: $\begin{matrix} G_{1} related transcripts in brain library & G_{1} related transcripts in liver library \\ Long form mRNA & 2 & 10 \\ Short form mRNA & 8 & 3 \\ Total G_{1} related clones & 10 & 13 \end{matrix}$

where the alternative categories are unambiguously defined and refer to the same objects. For example, the above results constitute good evidence that G₁ is expressed in different forms in those tissues (Fisher’s exact test two-tailP-value = 1.2%).

False Leads in the Selection of Candidate Genes

A crucial measure of the power of statistical significance tests is their rate of false alarm, that is, how often random fluctuations are expected to be mistaken for significant differences in the results. When analyzing the transcript profiles from two different libraries, a false alarm would cause a gene to be deemed differentially transcribed, whereas in fact it is not. The rate of false alarm is therefore a direct estimate of the fraction of false leads, when searching for differentially expressed genes on the basis of differences in tag counts. The rates of false alarm associated with theP < 0.01 and P < 0.05 confidence intervals listed in Table 1 have been computed by Monte-Carlo simulation on the basis of two experimental sequence tag distributions (Table2; Fig. 1). The rate of false alarms associated with the use of Equation { label needed for disp-formula[@id='E1'] } (in fact, its cumulative form Equation 9, see Methods) is very small for genes represented by small tag counts and slowly increases for higher tag counts, without ever exceeding the selected significance level. Such good behavior validates the use of the confidence intervals (Table 1) computed from Equation { label needed for disp-formula[@id='E1'] } and Equation 9 to assess the statistical significance of variations in digital Northern data. The curves labeled “window” characterize the very similar behavior of a slightly less conservative derivation of the same test (see Methods, Equation { label needed for disp-formula[@id='E15'] }). For comparison, Figure 1 also presents the behavior of another test, based on an inappropriate application of Ricker’s confidence intervals (Ricker 1937) (see Methods).

Table 2.

Publicly Available Distributions of Sequence Tags

[i] (Left) Data from Velculescu et al. (1995): Frequency of occurrence of each of the 428 transcript species represented in 840 SAGE tags randomly generated from a 3′-directed cDNA library from human pancreas. (Right) Data from Okubo et al. (1992): Frequency of occurrence of each of 641 transcript species represented in 982 randomly sequenced clones from a 3′-directed cDNA library from human liver cell line HepG2.

Open in new tabLink to table

Figure 1.

Rate of false alarm computed according to the confidence intervals listed in Table 1. (Top) Monte-Carlo simulation of the random sampling of 840 tags distributed according to the data from Velculescu et al. (1995; see Table 2). (Bottom) Monte-Carlo simulation of the random sampling of 982 ESTs distributed according to the data fromOkubo et al. (1992; see Table 2). The frequency of false alarm was computed for two significance levels (2ε = 5%, leftand 2ε = 1%, right) and plotted in function of the tag class size (from 1–64 for Velculescu et al., from 1–22 for Okubo et al.). In all cases, the rate of false alarm increases up to a plateau for larger class sizes. The test (cumulative form of Equation{ label needed for disp-formula[@id='E1'] }) derived from the flat p(λ) prior shows perfect behavior with a maximal rate of false alarm always less than the significance levels (broken lines). The test (cumulative form of Equation { label needed for disp-formula[@id='E15'] }) derived from the window p(λ) prior exhibits a slightly higher rate of false alarms. Both versions of the test exhibit conservative behaviors for class size <5, with a false alarm rate even less than expected. In contrast, Ricker’s confidence intervals (Equation 12) are grossly inadequate and lead to false alarm rates up to four times the significance level. Graphs are computed from the analysis of 1000 repetitions of each experiment.

Open in new tab Download PowerPointLink to figure

DISCUSSION

An appropriate statistical test is now at our disposal to begin analyzing digital gene expression profiles in a more quantitative way. For example, the test can be used to determine how many genes appear regulated at various confidence levels using the data from a typical experiment (e.g., sampling a thousand clones). We analyzed the data gathered by Okubo et al. (1995) on the human promyelocytic leukemia cell line HL60 induced by dimethylsulfoxide (DMSO) or tetradecanoylphorbolacetate (TPA). Table 3 shows the 21 EST classes the occurrences of which exhibit significant variations at the 1% level. Most of the corresponding genes make biological sense in term of differentiation along the granulocyte or monocyte pathways.

Table 3.

List of ESTs Exhibiting Significant (P < 0.01) Differences in Abundance in the HL60 Cell Line Induced by DMSO or TPA

EST ID	HL60	HL60 + TPA	HL60 + DMSO	Significance
418	22	10	1	3 × 10⁻⁷
211	24	10	2	4 × 10⁻⁷
19	8	23	2	8 × 10⁻⁷
356	16	2	0	3 × 10⁻⁶
380	12	1	0	6 × 10⁻⁵
135	4	12	0	6 × 10⁻⁵
285	14	8	1	1 × 10⁻⁴
2015	0	11	0	2 × 10⁻⁴
244	0	1	14	3 × 10⁻⁴
293	13	6	1	3 × 10⁻⁴
292	11	0	1	5 × 10⁻⁴
650	14	5	2	5 × 10⁻⁴
335	15	3	3	9 × 10⁻⁴
444	10	4	1	2 × 10⁻³
1674	0	8	1	4 × 10⁻³
155	0	8	3	4 × 10⁻³
861	6	1	0	7 × 10⁻³
305	6	2	0	7 × 10⁻³
1806	0	6	0	7 × 10⁻³
1808	0	6	0	7 × 10⁻³
1766	0	6	0	7 × 10⁻³

[i] Only the probability (computed according to Equations { label needed for disp-formula[@id='E7'] } and { label needed for disp-formula[@id='E8'] }) corresponding to the most significant transition (numbers in bold) is listed (Okubo et al. 1995). The total EST numbers sampled from the HL60, HL60 + TPA and HL60 + DMSO cDNA libraries are 845, 845, and 1058, respectively. ESTs 418, 211, 356, 285, 293, 292, 650, 335, 444, 861, 305 corresponding to ribosomal proteins, and EST 380, a tag to an unkown gene, exhibit a marked reduction of expression level in the DMSO- and/or TPA-induced differentiated states. In constrast, ESTs 135 (ferritin), 2015 (LD78/macrophage inflammatory protein), 1674 (methionine adenosyltransferase), 155 (thymosin β-4), 1806 (lipocortin), 1808 (thymosin β-10), and 1766 (a metallothionein) appear more abundant in the TPA-induced state, also highly enriched in EST 19 (the ubiquitous elongation factor 1-α). β-Actin (EST 244), is the only markedly increased tag in the DMSO-induced state. EST numbers, abundance data, and protein assignments are from the “body map” public expression data repository athttp://www.imcb.osaka-u.ac.jp (K. Okubo and K. Matsubara).

Open in new tabLink to table

This example serves to discuss a subtle point in the interpretation of the P values computed from Equation { label needed for disp-formula[@id='E1'] }, 2, and 9. Rigorously, these equations apply to the case where a given gene (e.g., lipocortin) would have been selected for scrutiny before looking at the differences in cognate tag counts between libraries. When comparing two libraries without specifying in advance the transcripts we want to follow, and then focusing a posteriori on any of those exhibiting significant variations, the average number of expected false positiveN _false isN _false = PN _species, whereN _species is the number of different transcript species encountered and p is a given significance level. For instance, in the experiment analyzed in Table 3,N _species is of the order of 600 (Okubo et al. 1995). It is therefore possible that up to four (600 × 7 × 10⁻³) out of the 21 transcript species listed in Table 3 are not truly differentially expressed.

Therefore, when two libraries are compared without prior gene selection, the use of a predetermined significance threshold is not advisable. The P values computed from Equation { label needed for disp-formula[@id='E1'] }, 2, and 9 should simply be used to rank all observed variations by order of decreasing statistical significance (analogous to how “similarity hits” are listed after database searches). The end-users can then make their own choice about the number of candidate target genes to be retained from the top of the list, bearing in mind the corresponding number of expected false positives.

Although the present interpretation of a digital Northern focuses on the genes exhibiting the most spectacular differential expressions, there is already ample evidence that small changes can cause drastic effects. Disease states caused by haploinsufficiency and trisomy suggest that 2 → 1 or 2 → 3 proportional changes in expression level may be of biological significance. Table 1 shows that there is no theoretical limit to the detection of such small variations from the comparison of digital expression patterns. Simply, the sampling size has to be increased enough for the required numbers of cDNA tags to reach a significance threshold (for instance 40 → 60, for a confidence level of 95%).

Analog hybridization-based methods (Fodor et al. 1991; Lennon and Lehrach 1991; Gress et al. 1992; Southern et al. 1992; Guo et al. 1994;Matson et al. 1995; Nguyen et al. 1995; Schena et al. 1995; Zhao et al. 1995) are traditionally opposed to digital tag-counting methods (Okubo et al. 1992; Matsubara and Okubo 1994; Lee et al. 1995; Okubo et al. 1995; Velculescu et al. 1995) for the analysis of differential gene expression. Both types of methods are sensitive to the quality of the original messenger RNA preparation and/or cDNA libraries. Analog methods promise higher throughput, lower cost, and have the capacity of studying transcripts on a much wider scale of abundance. They are therefore expected to supersede digital methods. On the down side, however, hybridization signals are not easily reproducible, and can be affected by many unknown properties such as the cDNA library complexity, as well as clone and sequence specific features (e.g., insert size, nucleotide composition, presence of repeats, secondary structure, triple helix interaction, etc.). Therefore, the hybridization-based methods require an estimation of the dispersion of the signal associated with each clone (i.e., enough repetitions of each experiment), and multiple standardization and calibration procedures to allow the meaningful comparison of hybridization patterns obtained from various sources (tissues, cell types, etc.) or from different membranes or chips. This is far from routine and has yet to be worked out. In contrast, and thanks to the unique properties of the Poisson distribution, digital methods have the capacity of providing a quantitative assessment of differential expression without the repetition or the standardization of individual tag-counting experiments. The statistical analysis presented here provides an objective method to analyze digital transcript profile data, and adapts it to fit (1) the number of leads one wants to be followed; (2) the fraction of false clues to be tolerated; and (3) the level of modulation in gene expression considered of biological interest.

A program is available on our web site (http://igs-server.cnrs-mrs.fr) to compute the confidence intervals corresponding to arbitrary significance levels and sampling sizeN ₁ and N ₂.

METHODS

Let us denote p(x) the probability to observex sequence tags of the same gene (i.e., from the 3′ end of the same transcript) when N cDNA clones are picked randomly. For each transcript representing a small (i.e., less than 5%) fraction of the library and N ⩾ 1000, p(x) will closely follow the Poisson distribution:

{ label needed for disp-formula[@id='E3'] }

p (x) = \frac{e^{- λ} λ^{x}}{x!}

where λ is the actual (albeit unknown) number of transcript of this type per N clones in the library. If we duplicate this experiment (i.e., once again randomly pick N clones of the same library and generate sequence tags), we will now observey occurrences of the same transcript. What is the probability of the various y values? An approximate solution consists in using x as the maximum likelihood estimate for λ and compute the probability for y occurrences given a Poisson distribution of mean λ = x:

{ label needed for disp-formula[@id='E4'] }

p (y | x) = \frac{e^{- x} x^{y}}{y!}

Equation { label needed for disp-formula[@id='E4'] } is not symmetrical in x and y.This is an obvious flaw as the probability should not depend on which of the x or y values were observed first.p(y | x) = p(x | y) should hold provided that an equal number N of clones is sampled in both experiments. Equation { label needed for disp-formula[@id='E4'] } is not the correct formula, because we have not yet taken into account the fluctuation ofx around the unknown mean λ. To account for the fact that the actual value of λ is unknown, we have to integrate Equation { label needed for disp-formula[@id='E4'] }over all possible λ values:

{ label needed for disp-formula[@id='E5'] }

p (y | x) = \int_{0}^{\infty} d λ p (d = λ | x) p (y | d = λ)

p(d = λ | x) in Equation { label needed for disp-formula[@id='E5'] } is the probability that the actual abundance of a given transcript is λ given that x occurrences of a cognate tag have been observed in one experiment. The second term in the integral is the probability of drawing y occurrences given a Poisson distribution of mean λ:

p (y | d = λ) = \frac{e^{- λ} λ^{y}}{y!}

Using Bayes’ theoremp(d = λ | x) can be written as

{ label needed for disp-formula[@id='E7'] }

p (d = λ | x) = \frac{p (x | d = λ) p (d = λ)}{\int_{0}^{\infty} d λ ‘ p (x | d = λ ‘) p (d = λ ‘)}

To evaluate Equation { label needed for disp-formula[@id='E7'] }, we need to define the prior distribution p(d = λ). The least constrained hypothesis (i.e., with the least information content), is to attribute an equal a priori probability to all λ values in the [0, ∞] range. Incorporating such a flat prior in Equation { label needed for disp-formula[@id='E5'] } leads to

{ label needed for disp-formula[@id='E8'] }

p (y | x) = \frac{1}{x! y!} \int_{0}^{\infty} d λ e^{- 2 λ} λ^{(x + y)}

From the definition of the Γ function for integer arguments we observe that

\int_{0}^{\infty} d λ e^{- 2 λ} λ^{(x + y)} = \frac{(x + y)!}{2^{(x + y + 1)}}

and finally obtain the expression given in Results:

p (y | x) = \frac{(x + y)!}{x! y! 2^{(x + y + 1)}}

This equation can be used in a wide variety of experimental situations. Equation { label needed for disp-formula[@id='E1'] } defines the probability of observing xand y occurrences of the same rare event in duplicated experiments, regardless of the detailed probability distribution of those events among the set of possible outcomes. In particular, in the context of transcription profiles,p(y | x) can be evaluated regardless of the distribution of each transcript (provided it is rare) within a cDNA library.

To compute the confidence intervals listed in Table 1, we made use of the cumulative distributions:

C (y ⩽ y_{min} | x) = \sum_{y = 0}^{y ⩽ y_{min}} p (y | x)

D (y ⩾ y_{max} | x) = \sum_{y = y_{max}}^{\infty} p (y | x)

These equations allow the computation of an interval [y _min, y _max]_εsuch asC(y y _min | x) ε andD(y ⩾ y _max | x) ε. Given that an event is observed x times in one experiment, the number y of occurrences of this event in a duplicate experiment is expected to fall within the interval [y _min, y _max]_εwith a probability of 1–2ε. Equation 9, a and b, can therefore serve as a significance test when comparing, for instance, the results of sampling N clones from two different libraries. For 2ε small (e.g., 5% or less), y values falling outside the [y _min, y _max]_εinterval correspond top(y | x) << 1, and point out significant differences between the two experiments. They should include differentially expressed genes, for example, for which λ is different in the two libraries.

Generalization to Different Sampling Sizes

When different numbers of clones N ₁ andN ₂ are sequenced from the same library, Equation { label needed for disp-formula[@id='E5'] }becomes

p (y | x) = \int_{0}^{\infty} d λ_{2} \int_{0}^{\infty} d λ_{1} p (d_{1} = λ_{1} | x) p (y | d_{2} = λ_{2}) δ (λ_{2} - \frac{N_{2}}{N_{1}} λ_{1})

where the two abundance values λ₁ and λ₂ are forced in the same ratio as N ₁ andN ₂. Using the same bayesian argument as before (Equation { label needed for disp-formula[@id='E7'] }) leads to

p (y | x) = \frac{1}{x! y!} {(\frac{N_{2}}{N_{1}})}^{y} \int_{0}^{\infty} d λ_{1} e^{- λ_{1} (1 + \frac{N_{2}}{N_{1}})} λ_{1}^{(x + y)}

the last integral is simply

\frac{(x + y)!}{(1 + \frac{N_{2}}{N_{1}})^{(x + y + 1)}}

leading to the formula presented in the Results section:

p (y | x) = {(\frac{N_{2}}{N_{1}})}^{y} \frac{(x + y)!}{x! y! {(1 + \frac{N_{2}}{N_{1}})}^{(x + y + 1)}}

Ricker’s Confidence Interval

The confidence interval computed from Equation { label needed for disp-formula[@id='E1'] } (and its cumulative form, Equation 9, a and b) is different from one introduced previously by Ricker (1937) although, at first, the two may appear to be related.

Given x occurrences of a sequence tag, Ricker’s formula defines a confidence interval [λ_min, λ_max]_x for λ (again the actual number of transcripts of this type per N clones in the library) such as

p (k ⩽ x) = \sum_{k = 0}^{x} \frac{e^{- λ_{max}} λ_{{max}^{k}}}{k!} ⩽ \frac{α}{2}

and

p (k ⩾ x) = \sum_{k = x}^{\infty} \frac{e^{- λ_{min}} λ_{{min}^{k}}}{k!} ⩽ \frac{α}{2}

where α is typically 5% or 1%. Ricker’s confidence intervals for various values of x are given in Table 1. Those intervals are close to those computed from Equation { label needed for disp-formula[@id='E1'] }, but delineate the range of likely λ values, not y (the number of occurrences of the same event in a duplicated experiment). It is possible for xand y to fall outside each other’s Ricker’s confidence interval [λ_min, λ_max], while still being nonsignificant fluctuations around the same λ value. The confidence intervals computed from Equation 12, a and b, are therefore too narrow to properly define significant discrepancies between x andy. The false alarm rate associated with the use of Ricker’s confidence intervals is too high (Fig. 1).

However, an interesting use of Equation 12, a and b, is the estimation of the range of possible frequencies [λ_min, λ_max]_x = 0 for cDNAs not yet encountered after picking N clones. For example, the 95% confidence interval is given by:

{ label needed for disp-formula[@id='E13'] }

0 < N λ < 3.7

That is, the abundance of a cDNA not picked up among one thousand clones is unlikely (5% chance) to be larger than 3.7/1000.

Influence of the Prior Distribution

In the bayesian context, it is prudent to assess the influence of the prior hypothesis used to derive Equation { label needed for disp-formula[@id='E1'] } and Equation { label needed for disp-formula[@id='E2'] }. The flatp(λ) prior allowing equiprobable λ values in the [0 , ∞] range might appear too broad and unrealistic. Nevertheless, it is the most intuitively neutral distribution one can use. The quick convergence of the Poisson distribution rends the contribution of extreme λ values negligible as soon as | λ − x | or | λ − y | increase. To verify this point, more reasonable distributions forp(λ) can be constructed by confining the accessible λ values within a window [λ_min, λ_max]_x centered around the already observed value x. Such a window can for instance be Ricker’s confidence intervals as defined in the previous section (Equation 12, a and b). We then confine the only permitted values of λ to be in this interval, with an equal probability; therefore,

p (d = λ) = \frac{H (λ - λ_{min}) \times [1 - H (λ - λ_{max})]}{λ_{max} - λ_{min}}

where H denotes the Heaviside function, the value of which is 1 for positive argument and 0 otherwise. Equation { label needed for disp-formula[@id='E1'] } then becomes

{ label needed for disp-formula[@id='E15'] }

p (y | x) = \frac{F_{2 λ_{min}, 2 λ_{max}} (x + y)}{F_{λ_{min}, λ_{max} (x) y! 2^{(x + y + 1)}}}

with

F_{λ_{min}, λ_{max}} (x) = \int_{λ_{min}}^{λ_{max}} d λ e^{- λ} λ^{x}

When λ_min → 0 and λ_max → ∞, we recover the initial Equation { label needed for disp-formula[@id='E1'] }. We note in passing that the other limit, λ_min = λ_max = x, corresponds to the most stringent “Dirac prior”:

p (d = λ | x) = δ (x - λ)

forcing λ = x and turningp(y | x) into the simple Poisson distribution (Equation { label needed for disp-formula[@id='E4'] }).

The confidence intervals for the usual 1% and 5% significance levels are given in Table 1 for both the flat and the window priorp(d = λ). There is little difference, with the test derived from using a flat prior being a bit more conservative, as expected. On the down side, Figure 1 shows that the test derived from the window p(λ) prior gives rise to a higher rate of false alarm.

We thank Drs. J. Weissenbach, D. Gautheret, R. Ewing, and C. Abergel for critically reading the manuscript. This work was sponsored by a collaborative research grant from Incyte Pharmaceuticals, Inc.

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Notes

[4] Corresponding author.

Notes

[5] E-MAIL [email protected]; FAX 334 91 16 45 49.

REFERENCES

↵
J.S. AaronsonB. EckmanR.A. BlevinsJ.A. BorkowskiJ. MyersonS. ImranK.O. Elliston(1996) Toward the development of a gene index to the human genome: An assessment of the nature of high-throughput EST sequence data. Genome Res. 6:829–845.
Google Scholar CrossRef Abstract/Full Text PubMed Web of Science
↵
M.D. Adams(1996) Progress towards a complete set of human genes. in Genomes, molecular biology and drug discovery, eds M.J. BrowneP.L. Thurby(Academic Press, London, UK).
Google Scholar
↵
M.D. AdamsJ.M. KelleyJ.D. GocayneM. DubnickM.H. PolymeropoulosH. XiaoC.R. MerrilA. WuB. OldeR.F. Moreno(1991) Complementary DNA sequencing: Expressed sequence tags and human genome project. Science 252:1651–1656.
Google Scholar CrossRef PubMed Web of Science
↵
M.D. AdamsM. DubnickA.R. KerlavageR. MorenoJ.M. KelleyT.R. UtterbackJ.W. NagleC. FieldsJ.C. Venter(1992) Sequence identification of 2,375 human brain genes. Nature 355:632–634.
Google Scholar CrossRef PubMed Web of Science
↵
M.D. AdamsA.R. KerlavageR.D. FleischmannR.A. FuldnerC.J. BultN.H. LeeE.F. KirknessK.G. WeinstockJ.D. GocayneO. White(1995) Initial assessment of human gene diversity and expression patterns based upon 83 million nucleotides of cDNA sequence. (The Genome Directory, Suppl.) Nature 377:3–174.
Google Scholar PubMed
↵
A. Agresti(1996) An introduction to categorical data analysis. (John Wiley, New York, NY).
Google Scholar
↵
W. Bains(1996) Virtually sequenced: The next genomic generation. Nature Biotechnol. 14:711–713.
Google Scholar CrossRef PubMed Web of Science
↵
M.S. BoguskiG.D. Schuler(1995) ESTablishing a human transcript map [News]. Nature Genet. 10:369–371.
Google Scholar CrossRef PubMed Web of Science
↵
Editorial (1996) Capitalizing on the genome. Nature Genet. 13:1–5.
Google Scholar
↵
S.P.A. FodorJ.L. ReadM.C. PirrungL. StryerA.T. LuD. Solas(1991) Light-directed, spatially addressable parallel chemical synthesis. Science 251:467–470.
Google Scholar
↵
T.M. GressJ.D. HoheiselG.G. LennonG. ZehetnerH. Lehrach(1992) Hybridization fingerprinting of high-density cDNA-library arrays with cDNA pools derived from whole tissues. Mamm. Genome 3:609–619.
Google Scholar CrossRef PubMed Web of Science
↵
Z. GuoR.A. GuilfoyleA.J. ThielR. WangL.M. Smith(1994) Direct fluorescence analysis of genetic polymorphisms by hybridization with oligonucleotides arrays on glass supports. Nucleic Acids. Res. 22:5456–5465.
Google Scholar CrossRef PubMed Web of Science
↵
A.S. KhanA.S. WilcoxM.H. PolymeropoulosJ.A. HopkinsT.J. StevensM. RobinsonA.K. OrpanaJ.M. Sikela(1992) Single pass sequencing and physical and genetic mapping of human brain cDNAs [see Comments]. Nature Genet. 2:180–185.
Google Scholar CrossRef PubMed Web of Science
↵
B. Kuska(1996) Cancer genome anatomy project set for take-off. J. Natl. Cancer Inst. 88:1801–1803.
Google Scholar CrossRef PubMed
↵
L. HillierG. LennonM. BeckerM.F. BonaldoB. ChiapelliS. ChissoeN. DietrichT. DuBuqueA. FavelloW. Gish(1996) Generation and analysis of 280,000 human expressed sequence tags. Genome Res. 6:807–828.
Google Scholar CrossRef Abstract/Full Text PubMed Web of Science
↵
N.H. LeeK.G. WeinstockE.F. KirknessJ.A. Earle-HuguesR.A. FuldnerS. MarmarosA. GlodekJ.D. GocayneM.D. AdamsA.R. Kerlavage(1995) Comparative expressed-tag analysis of differential gene expression profiles in PC-12 cells before and after nerve growth factor treatment. Proc. Natl. Acad. Sci. 92:8303–8307.
Google Scholar CrossRef PubMed Web of Science
↵
G.G. LennonH. Lehrach(1991) Hybridization analyses of arrayed cDNA libraries. Trends Genet. 7:314–317.
Google Scholar PubMed Web of Science
↵
R.S. MatsonJ. RampalS.L. Pentoney JrP.D. AndersonP. Coassin(1995) Biopolymer synthesis on polypropylene supports: Oligonucleotide arrays. Anal. Biochem. 224:110–116.
Google Scholar CrossRef PubMed Web of Science
↵
K. MatsubaraK. Okubo(1994) Identification of new genes by systematic analysis of cDNAs and database construction. Curr. Opin. Biotechnol. 4:672–677.
Google Scholar
↵
C. NguyenD. RochaS. GranjeaudM. BalditK. BernardP. NaquetB.R. Jordan(1995) Differential gene expression in the murine thymus assayed by quantitative hybridization of arrayed cDNA clones. Genomics 29:207–216.
Google Scholar CrossRef PubMed Web of Science
↵
R. Nowak(1995) Entering the postgenome era. Science 270:368–369.
Google Scholar PubMed Web of Science
↵
C. O’Brien(1997) Cancer genome anatomy project launched. Mol. Med. Today 3:94.
Google Scholar
↵
K. OkuboN. HoriR. MatobaT. NiiyamaA. FukushimaY. KojimaK. Matsubara(1992) Large scale cDNA sequencing for analysis of quantitative and qualitative aspects of gene expression. Nature Genet. 2:173–179.
Google Scholar CrossRef PubMed Web of Science
↵
K. OkuboK. ItohA. FukushimaJ. YoshiiK. Matsubara(1995) Monitoring cell physiology by expression profiles and discovering cell type-specific genes by compiled expression profiles. Genomics 30:178–186.
Google Scholar CrossRef PubMed Web of Science
↵
W.E. Ricker(1937) The concept of confidence or fiducial limits applied to the Poisson frequency distribution. J. Am. Statist. Assoc. 32:349–357.
Google Scholar CrossRef Web of Science
↵
S. Siegel(1956) Nonparametric methods for the behavioral sciences. (McGraw-Hill, New York, NY).
Google Scholar
↵
M. SchenaD. ShalonR.W. DavisP.O. Brown(1995) Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science 270:467–470.
Google Scholar CrossRef PubMed Web of Science
↵
E.M. SouthernU. MaskosJ.K. Elder(1992) Analyzing and comparing nucleic acid sequences by hybridization to arrays of oligonucleotides: Evaluation using experimental models. Genomics 13:1008–1017.
Google Scholar CrossRef PubMed Web of Science
↵
K.D. Tocher(1950) Extension of the Neyman-Pearson theory of tests to discontinuous variates. Biometrika 37:130–144.
Google Scholar CrossRef PubMed Web of Science
↵
V.E. VelculescuL. ZhangB. VogelsteinK.W. Kinzler(1995) Serial analysis of gene expression. Science 270:484–487.
Google Scholar CrossRef PubMed Web of Science
↵
A.S WilcoxA.S. KhanJ.A. HopkinsJ.M. Sikela(1991) Use of 3′ untranslated sequences of human cDNAs for rapid chromosome assignment and conversion to STSs: Implications for an expression map of the genome. Nucleic Acids Res. 19:1837–1843.
Google Scholar CrossRef PubMed Web of Science
↵
N. ZhaoH. HashidaN. TakahashiY. MisumiY. Sakaki(1995) High-density cDNA filter analysis: A novel approach for large-scale, quantitative analysis of gene expression. Gene 156:207–213.
Google Scholar CrossRef PubMed Web of Science

RESEARCH

The Significance of Digital Gene Expression Profiles

Current Issue:

Abstract

RESULTS

Random Fluctuation vs. Significant Change in Tag Number: When to Infer Differential Expression

Influence of the Sampling Size

Comparison with Fisher’s (2 × 2) Exact Test

False Leads in the Selection of Candidate Genes

DISCUSSION

METHODS

Generalization to Different Sampling Sizes

Ricker’s Confidence Interval

Influence of the Prior Distribution

Notes

Notes

REFERENCES

Article contents

Announcement(s)

RESEARCH

The Significance of Digital Gene Expression Profiles

Cite this article

Share

Current Issue:

Abstract

RESULTS

Random Fluctuation vs. Significant Change in Tag Number: When to Infer Differential Expression

Influence of the Sampling Size

Comparison with Fisher’s (2 × 2) Exact Test

False Leads in the Selection of Candidate Genes

DISCUSSION

METHODS

Generalization to Different Sampling Sizes

Ricker’s Confidence Interval

Influence of the Prior Distribution

Notes

Notes

REFERENCES

Article contents

Announcement(s)