An Efficient and Robust Statistical Modeling Approach to Discover Differentially Expressed Genes Using Genomic Expression Profiles
Abstract
We have developed a statistical regression modeling approach to discover genes that are differentially expressed between two predefined sample groups in DNA microarray experiments. Our model is based on well-defined assumptions, uses rigorous and well-characterized statistical measures, and accounts for the heterogeneity and genomic complexity of the data. In contrast to cluster analysis, which attempts to define groups of genes and/or samples that share common overall expression profiles, our modeling approach uses known sample group membership to focus on expression profiles of individual genes in a sensitive and robust manner. Further, this approach can be used to test statistical hypotheses about gene expression. To demonstrate this methodology, we compared the expression profiles of 11 acute myeloid leukemia (AML) and 27 acute lymphoblastic leukemia (ALL) samples from a previous study (Golub et al. 1999) and found 141 genes differentially expressed between AML and ALL with a 1% significance at the genomic level. Using this modeling approach to compare different sample groups within the AML samples, we identified a group of genes whose expression profiles correlated with that of thrombopoietin and found that genes whose expression associated with AML treatment outcome lie in recurrent chromosomal locations. Our results are compared with those obtained using t-tests or Wilcoxon rank sum statistics.
The development of oligonucleotide microarray technologies allows scientists to monitor the mRNA transcript levels of thousands of genes in a single experiment. Indeed, several groups have already begun to simultaneously examine the expression profiles of entire genomes for organisms such as yeast whose complete DNA sequences are known (Lashkari et al. 1997; Chu et al. 1998; Spellman et al. 1998;Ferea et al. 1999). This power of examination and discovery moves well beyond the traditional experimental approach of focusing on one gene at a time. Nevertheless, the tremendous amount of data that can be obtained from microarray studies presents a challenge for data analysis (Brent 2000).
At present, the most commonly used computational approach for analyzing microarray data is cluster analysis. Cluster analysis groups genes or samples into “clusters” based on similar expression profiles and provides clues to the function or regulation of genes or similarity of samples via shared cluster membership (Tamayo et al. 1999; Tavazoie et al. 1999; Gaasterland and Bekiranov 2000). Several clustering methods have been usefully applied to analyzing genome-wide expression data and can be classified largely into three categories. The tree-based approach uses distance measures between genes such as correlation coefficients to group genes into a hierarchical tree (Eisen et al. 1998). The second category clusters genes so that within-cluster variation is minimized and between-cluster variation is maximized (Tamayo et al. 1999; Tavazoie et al. 1999). The third category groups genes into blocks, in which the correlation is maximized and between which the correlation is minimized (Ben-Dor et al. 1999).
The power of cluster analysis for microarray studies lies in discovering gene transcripts or samples that show similar expression profiles. Examples include identification of transcripts that appear to be coregulated over a time course (Chu et al. 1998; Spellman et al. 1998), or uncovering previously unknown sample groupings (Alon et al. 1999; Alizadeh et al. 2000). However, identification of “like” groups is not necessarily the objective in a microarray study. For example, microarrays present a high-throughput method to discover genes that are differentially expressed between predefined sample groups, such as normal versus cancerous tissues (Alon et al. 1999; Coller et al. 2000). Cluster analysis is not a sensitive method for this type of study because it focuses on group similarities, not differences within each individual gene. Furthermore, clustering algorithms such as those listed above are also unable to take advantage of preexisting knowledge of the data, such as the sample groupings.
The technique that has been most commonly applied for group comparisons from microarray studies is to simply look for genes with a twofold or higher difference between the mean intensities for each group (DeRisi et al. 1997). However, relative mean comparisons fail to account for sample variation, may require ad hoc data manipulation (e.g., to avoid divide-by-zero errors), and ignore the fact that differences in expression level of <100% can exert meaningful biological effects. Indeed, scientists would rarely use similar criteria when focusing their analysis on a single gene, such as comparing a panel of Northern blots or enzymatic assays between healthy and cancer tissue samples.
Classic statistical approaches used for detecting differences between two groups include the parametric t-test and the nonparametric Wilcoxon rank sum (Snedecor and Cochran 1980). Recently, thet-test was used to compare expression profiles in microarray experiments (Arfin et al. 2000; Tanaka et al. 2000). One must bear in mind three important issues when applying such standard statistical tests to microarray data analysis. First, the t-test assumes normality and constant variance for every gene across all samples. These assumptions are certainly inappropriate for a subset of genes despite any given transformation. Second, these tests cannot take advantage of the genomic data when correcting for heterogeneity between samples. Third, it is essential to correct for the high false-positive rate resulting from multiple comparisons. Otherwise, if a typicalP-value of 0.05 were used to signify differential expression for individual genes between two groups, one would expect to find 50 positives for every 1000 genes under examination, even though none of these genes are differentially expressed.
In this manuscript, we introduce a well-founded and robust statistical procedure that compares the expression profiles of individual genes between two sample groups while taking into consideration the complexity of the genomic data. This methodology makes no distributional assumptions about the data and accounts for high false-positive error rate resulting from multiple comparisons. To demonstrate the statistical modeling technique, we examined expression profiles from 38 leukemia patients, 27 of whom were diagnosed with acute lymphoblastic leukemia (ALL) and 11 of whom were diagnosed with acute myeloid leukemia (AML) (Golub et al. 1999). Our results are compared with those obtained with the t-test or Wilcoxon rank sum. The findings show that our statistical modeling approach provides a sensitive and robust means to extract relevant information from DNA microarrays.
RESULTS
Methodology
The first step in our statistical analysis of oligonucleotide-array expression profiles is preprocessing and/or transformation of the data. In the present work this includes removal of the spiked oligonucleotide controls. The second step is to estimate correction factors for sample-specific heterogeneity, as well as for chip-specific heterogeneity, and to use these factors to normalize the data. The final step is to perform a regression analysis to estimate the relevant model parameters (equation 1 in Methods) for each gene transcript using robust statistical techniques. The results are ranked by the absolute value of the Z-score for each transcript. The higher theZ-score, the greater the confidence level that the corresponding gene is differentially expressed between the two groups.
Our methodology is implemented in a software program. Interested investigaors may contact L.P.Z. for details.
Multiple Comparisons
At issue when performing a large number of statistical tests is the high occurrence rate of false positives resulting from the multiple comparisons. To address this concern, we propose to raise the statistical threshold for declaring a transcript differentially expressed to ensure that the significance level is applicable on the genomic scale. A conservative choice to adjust the significance is the Bonferroni's correction, which divides the desired significance, for example, 1% or P-value = 0.01, by the total number of statistical tests performed. In this work, we calculated the significance value (i.e., P-value) for each probe set using a modified Bonferroni's correction as proposed by Hochberg (Hochberg 1988) (see Methods for details).
Applying Bonferroni's correction to data from Affymetrix Hu6800 GeneChip oligonucleotide arrays, which contain 7070 noncontrol probe sets for 6817 individual genes, the adjusted significance level for each probe set is 0.01/7070. Assuming that the Z-score follows the normal distribution, the corresponding 1% significance threshold at the genomic level is a Z-score of 4.8. Alternatively, one may adjust the significance by the total number of genes rather than the total number of probe sets. However, different probe sets for the same gene may yield dissimilar results, and either level of correction results in a rounded Z-score of 4.8 at the 1% significance level.
Leukemia Study
A previous study examined mRNA expression profiles from 38 leukemia patients (27 ALL and 11 AML) to develop an expression-based classification method for acute leukemia (Golub et al. 1999). Affymetrix Hu6800 GeneChips were used in the study. The data set from this study was ideal for illustrating our modeling technique as it contains a large number of patients and has been well characterized (Golub et al. 1999). Furthermore, there is a great deal of literature concerning leukemia from which we can assess the validity of our findings.
Our statistical modeling approach identified 141 probe sets that were differentially expressed between AML and ALL with a Z-score of 4.8 or higher. Twenty-four of these were detected at higher levels in AML and the remainder were expressed preferentially in ALL. Tables 1and 2 list the top 25 differentially expressed probe sets in either sample group. These tables also include the corresponding P-values and ordering of the statistics given to each probe set by t-tests with either equal or unequal variance, and by the Wilcoxon rank sum. As expected, the ranked significance given to each gene by any of the statistical tests did not appear to correlate with either relative or absolute mean expression level differences. Tables 1 and 2 show that parametric t-tests under equal variances yielded rather different test statistics and ordering than our modeling approach. In contrast, the ordering of the probe sets by t-tests performed assuming unequal variances was very similar to that obtained in our regression analysis. Although t-tests are efficient under the assumption of equal variances, the results of this analysis appeared very sensitive to this assumption. In cases of discrepancies betweent-tests with unequal variances and Z-scores, the latter are considered to be more robust because the assumptions of homogeneous variances within groups and normality made by thet-test may be violated. Note that the differences ofP-values between the two statistics are associated with distributions; the t-distribution with heavy tails gives more conservative values than the asymptotic normal distribution we used to translate Z-scores to P-values. The Wilcoxon rank sum failed to identify any genes as differentially expressed at the 1% significance level. These findings are not surprising because nonparametric statistics may be too robust to yield any significant results.
Top 25 Genes More Highly Expressed in AML Than in ALL
Top 25 Genes More Highly Expressed in ALL Than in AML
We next applied the statistical modeling method to examine expression profiles within subgroups of the 11 AML patients. Thrombopoietin (TPO) is the major cytokine responsible for the transition of myeloid progenitors into megakaryocytes (Caen et al. 1999), but also plays a more general role in the differentiation of hematopoietic stem cells into all types of progenitors (Kaushansky 1999). Furthermore, TPO is known to be expressed in a number of AML cell lines (Graf et al. 1996). We noticed a sharp delineation of TPO expression profiles between patients 28, 30, 32, 34, 36, and 38 versus patients 29, 31, 33, 35, and 37 and therefore compared these patient groups using our statistical modeling technique. This approach identified eight transcripts with aZ-score >4.8, with TPO itself yielding the highest ranking (Table 3). In contrast, neithert-tests nor Wilcoxon rank sum identified any gene with a genomic significance level of 1% (Table 3). Of the 15 highest ranking mRNAs from our analysis, three of the corresponding gene products are known to be influenced by or interact directly with TPO, two have not been characterized heavily but are highly homologous to proteins that interact with TPO, and eight others are involved in myeloid hematopoiesis. Although we have no evidence for any biological significance of the patient groups used in this comparison other than TPO transcript level, we noted that the groupings appear to fall along the lines of samples with high or low percentage of blasts (seehttp://www.genome.wi.mit.edu/MPR). Interestingly, TPO can stimulate the proliferation of AML blasts (Motoji et al. 1996; Luo et al. 2000).
Top 15 Genes Whose Expression Profiles Correlated with Differential Expression of TPO Among 11 AML Samples
We next examined the association of gene expression with the success or failure of treatment. Among the 11 AML patients, 6 patients did not respond to treatment (patients 28–33) and five patients survived (patients 34–38) (see www.genome.wi.mit.edu/MPR [Golub et al. 1999]). The 25 transcripts with the highest Z-scores from the comparison of these groups are listed in Table4, five of which had a Z-score greater than 4.8. As above, neither t-tests nor Wilcoxon rank sum identified any genes as differentially expressed between these groups at a 1% significance level (Table 4). We examined the chromosomal locations of the corresponding genes because chromosomal abnormalities are prevalent in leukemia and often have prognostic implications (El-Rifai et al. 1997; Rowley 2000). Almost all of the genes listed in Table 4 lie in regions that have been identified previously to contain abnormalities in AML or other forms of leukemia. Furthermore, three of the genes are encoded within 5q11–31, four are in the 2q region, two are within 1q32–26, and two others are found at 6p12–p11 (Table 4). The identification of five “mini-clusters” of chromosomal locales in the top 25 genes from a random pool of 6800+ genes is striking. Of note, the region 5q11–31 is frequently lost in AML and known to influence prognosis (Shipley et al. 1996; El-Rifai et al. 1997; Van den Berghe and Michaux 1997). Furthermore, Set(Li et al. 1996) and HoxA9 (Lawrence et al. 1999) are known to play a role in AML progression, and COL4A4 (Verfaillie et al. 1992), thioredoxin (Nilsson et al. 2000; Soderberg et al. 2000), caspase-8 (Pervaiz et al. 1999), integrin beta5 (Feng et al. 1999), α-tubulin (Hirose and Takiguchi 1995), and SPS2 (Soderberg et al. 2000) may well contribute to the disease. Although it should be kept in mind that clinical outcome is influenced by a number of nongenetic factors, including patient age, time of diagnosis, and treatment protocol, the above findings are promising for the discovery of prognostic indicators using genome-wide microarray analysis.
Top 25 Genes Differentially Expressed between AML Patients Who Lived or Died After Treatment
DISCUSSION
The Z-scores we propose for testing differences of mean expression levels between two groups are connected closely with classical t-tests or Wilcoxon rank sum statistics, but it is important to realize that there are subtle differences between these tests. The t-test requires that expression levels be normally distributed and homogeneous within groups, and may also require equal variances between the groups. In contrast, the estimating equation technique we used to calculate Z-scores does not require any distributional assumptions or homogeneity of variances (see Methods for details). In practice, Z-scores are expected to be similar tot-test statistics, particularly those calculated assuming unequal variances, when the distribution of expression levels can be approximated by the normal distribution. When these assumptions are violated, Z-scores will differ from t-statistics and will be more reliable for making statistical inferences. On the other hand, the Wilcoxon statistic for two-group comparisons is nonparametric and thus robust. However, its power is reduced, which could be of concern in light of small sample sizes in typical array studies. Indeed, the Wilcoxon test did not detect any genes as differentially expressed between AML and ALL at the 1% genomic significance level (Tables 1 and 2, data not shown). Finally, there is no obvious method, besides ad hoc corrections of the expression values, to adjust for heterogeneity among samples when using the t-test or Wilcoxon statistics. The regression paradigm we propose provides a natural correction for heterogeneity using all expression values.
It is important to note that we analyzed the leukemia data without applying any questionable filtering methods to the Affymetrix data. For example, we did not subtract a background noise level from the data, rescale any values other than to correct for between-chip heterogeneity, or remove genes based on fluorescent signal intensities or Affymetrix present/absent calls. These filtering techniques may be required to make the strongest associations when clustering data or when calculating fold changes in means. However, ad hoc filtering could remove potential genes of interest, especially those with modest expression levels, and therefore reduce the power of discovery. For example, the difference of only a few transcripts to zero transcripts per cell may become undetectable after applying filtering techniques, but could nevertheless have a very real biological significance or present a considerable opportunity to target a cell specifically for therapeutic treatment. To illustrate this point, we note that TPO was called absent by the Affymetrix software in every sample in the leukemia data set. Nevertheless, by dichotomizing the AML samples along the lines of TPO expression values, we were able to uncover a group of proteins that interact directly with TPO or perform similar cellular functions.
Another distinct advantage of statistical modeling is that these tests take advantage of the random variations (i.e., “noise”) in the data. For example, the mean expression level of activation-induced C-type lectin (AICL) was threefold higher in AML than ALL, and the absolute mean difference was substantial at 826 units. Considering that AICL is expressed in a variety of hematopoietic-derived cell lines (Hamann et al. 1997), one might reasonably conclude that AICL was indeed overexpressed in AML based on this evidence. However, our modeling approach gave AICL a Z-score of 0.91. This apparent discrepancy is explained by the fact that one of the AICL samples in the AML set had an intensity value more than fivefold higher than any other. Excluding just this one sample, the relative and absolute mean differences for AICL between AML and ALL were 1.3-fold and −94 ± 216, respectively. Clearly, simple comparisons of fold changes are insufficient for drawing proper conclusions.
Our modeling approach can be extended. First, we can incorporate nonlinear models or apply other transformations to the observed expression levels to account for nonlinearity in fluorescent intensity. Second, the model (equation 1 in Methods) can be extended naturally to incorporate additional covariates. For example, in a clinical study of multiple patients, one may be interested in assessing the association of expression profiles with several clinical variables. Third, one may extend the model (equation 1) by incorporating nonparametric smoothing function for a continuous covariate, for example, in the assessment of nonlinear dose-response relationship. Fourth, as our knowledge accumulates about the genetic regulatory circuitry of multiple genes, we may be able to formulate a functional relationship among genes, via postulating a “high-level” model for regression coefficients α(π) = (α1,α2,…,αJ) and β(π) = (β1,β2,…,βJ), in which π could be a common set of parameters characterizing the entire genetic regulatory circuitry. One may then test how well such a genetic circuitry model fits the data using estimating equations.
The main limitation of the current approach is associated with the calculation of P-values. As noted earlier, a Z-score of 4.8 is chosen to ensure that the genome-wide significance is controlled at 1% for the Affymetrix 6800 GeneChips. However, the calculation of the corresponding P-value relies on the asymptotic normal distribution for Z-scores. With small to modest sample sizes this normality may be questionable, and such a threshold value is overly conservative. Currently, we are developing simulation-based methods to evaluate the exact significance level. It is also important to note that for the purpose of discovery science with small sample sizes, the Z-score 4.8 threshold value should be treated as a tentative guideline. In the context of testing associations with a specific candidate gene, the accepted threshold value to ensure the false-error rate of 1% for a single gene is aZ-score of 2.58. Finally, we note that the Bonferroni's correction or modifications thereof do not take into account covariation of gene expression levels, resulting in conservative estimates for the P-values. Our future research will improve on Bonferroni's correction by acknowledging expression dependencies among genes.
The capability of simultaneously assessing the expression of thousands of gene transcripts provides an opportunity of monitoring cellular activity at the genomic level. We can therefore begin addressing complex pathways of basic physiology and disease etiology, the foundation of functional genomics. The development of the statistical method described here provides a tool for researchers to pursue functional genomics systematically and rigorously. Modeling can also be used to aid the design of efficient and robust functional genomic studies, and to develop methods that estimate sample sizes and powers required for expression studies. The use of rigorous statistical tools will help functional genomic studies yield much-needed information in understanding human biology and pathology.
METHODS
Leukemia Study
The Affymetrix 6800 GeneChip oligonucleotide arrays contain a combined total of 7070 oligonucleotide probe sets (excluding controls) for 6817 individual genes. Investigators at the Massachusetts Institute of Technology gathered blood samples from 38 leukemia patients (27 ALL and 11 AML) and used Affymetrix Hu6800 GeneChip oligonucleotide arrays to assess gene expression profiles for each patient (Golub et al. 1999). We used the training data set exclusively in this work. Experimental protocols used to perform the microarray analysis and the data values obtained are available to the public at (http://waldo.wi.mit.edu/MPR/pubs.html).
Regression Model
An array of gene expression profiles may be conceptualized as a vector of outcomes. Let Yk
= (Y
1k,Y
1k,…, YJk
)‘ denote the array, where Yjk
denotes the expression of the jth gene in the kth sample (j = 1,2,…,J;k = 1,2,…, K). Let xk
denote a covariate associating with each kth sample. For example, xk
= 1 for the presence of a marker gene and xk
= 0 for its absence. We propose a regression model for the expression level of the jth gene in the kth sample:
in which (aj, bj
) are gene-specific regression coefficients, (δk,λk) are the sample-specific additive and multiplicative heterogeneity factors, respectively, and ɛjk is a random variable reflecting variation due to sources other than the one identified by the known covariate and the systematic
heterogeneity between samples. Because xk
is binary, aj
measures the mean expression level of the jth gene in normal samples (xk
= 0), and bj
measures the difference of averaged expression levels of thejth gene between the two sample groups.
The heterogeneity factors, (δk,λk), are introduced to account for variations in preparing multiple mRNA samples. Such corrections have been well conceived
in comparing two samples. Under the null hypothesis of no overall differential expression between these two samples, one can
adjust this heterogeneity by normalizing the sample data to fall on the diagonal line, a common technique (Wodicka et al. 1997). An intercept may also be estimated to ensure the numerical stability. If the intercept is different from zero, the diagonal
line is adjusted to compensate. Formalizing this correction, one may assume that typical genome-wide expression patterns are
stable, and hence may use a linear model, μjk = δk + λk
aj
, to characterize average expression values for every gene in every sample. These heterogeneity factors are then estimated
via the weighted least square method (Carroll and Ruppert 1988). Estimated heterogeneity factors are used to adjust the observed expression level as
, and corrected expression values are then used for further analysis under the above model (equation 1).
The random variation, ɛjk, is used to depict variations due to all unknown sources. Specifically, this variation may be associated with sampling preparations, cross-hybridization of genes, or other anomalies on microarrays. The stochastic distribution of these random variations is typically unknown and is unlikely to follow any familiar distributions, such as the normal distribution. Hence, no distribution assumption is made.
Analytic Strategy
The first step in the statistical analysis of oligonucleotide-array expression profiles is preprocessing of the data, which
includes elimination of control genes and transformation of the data (e.g., logarithmic transformation) as desired. The second
step is to examine heterogeneity among samples by estimating additive and multiplicative heterogeneity factors, (δk,λk). The estimate is obtained via minimizing the weighted least square, ∑j,k (Yjk
− δk − λk
aj
)2
w
,where the summation is over all genes and samples (Carroll and Ruppert 1988). The weight is chosen so that the contribution of every gene is standardized between 0 and 1. Consequently, the above weighted
least square equals the number of genes when samples are homogeneous. The estimated parameters (δk,λk) are used to correct the data. Because we do not impose distributional assumptions about residuals, the third step is to
use the weighted least square (Huber 1967) to estimate gene-specific parameters (aj, bj
) in the model (equation 1). The corresponding robust standard errors for each gene are calculated using estimating equation
theory (Godambe 1960; Liang and Zeger 1986;Prentice and Zhao 1991). Z-scores for each gene are computed as the ratio of mean difference between the two groups for each gene,bj
, over the standard error for the corresponding gene, S.E.j.
Statistical Significance and Multiple Comparisons
To measure the significance of the findings, we translatedZ-scores into P-values under asymptotic normality. To address the multiple comparison issue, we adjusted the threshold for declaring genes differentially expressed using a modified Bonferroni’s correction proposed by Hochberg (1988). The Hochberg stepdown method divides the P-values by the total number of comparisons with equal or lesser test statistics; for 7070 probe sets, the 1% genomic significance level for the probe set with the highest test statistic is 0.01/7070, the genomic significance threshold for the probe set with the second highest test statistic is 0.01/7069, etc.
t-test and Wilcoxon Rank Sum Test
The t-test and Wilcoxon rank sum test were performed after correcting the data for heterogeneity using our regression approach.t-tests were performed assuming both equal and unequal variances between the sample groups. The functions used were those built into MATLAB (MathWorks). The P-values derived from these tests were adjusted using the modified Bonferroni's correction described above.
Acknowledgments
We thank Tracy Bergemann, Chun Cheng, Robert Eisenman, and Jerry Radich for comments on this manuscript. We also thank T.R. Golub and colleagues at MIT for making their excellent AML/ALL data set (Golub et al. 1999) available in the public domain. This work was supported by National Institute of Health grants HG02283, GM58897, and CA53996.
The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.
Footnotes
-
↵3 Corresponding author.
-
E-MAIL lzhao{at}fhcrc.org; FAX (206) 667-2437.
-
Article and publication are at www.genome.org/cgi/doi/10.1101/gr.165101.
-
- Received September 15, 2000.
- Accepted April 11, 2001.
- Cold Spring Harbor Laboratory Press











