Bioinformatics Advance Access originally published online on May 31, 2007
Bioinformatics 2007 23(16):2063-2072; doi:10.1093/bioinformatics/btm289
Statistical assessment of functional categories of genes deregulated in pathological conditions by using microarray data
1Istituto di Studi sui Sistemi Intelligenti per l'Automazione, CNR, Via Amendola 122/D-I, 70126 Bari, 2Unità Operativa di Gastroenterologia, IRCCS, Casa Sollievo della Sofferenza-Ospedale, Viale Cappuccini, 71013 San Giovanni Rotondo (FG), 3Istituto di Tecnologie Biomediche-Sezione di Bari, CNR, Via Amendola 122/D, 70126 Bari, 4Servizio di Genetica Medica, IRCCS, Casa Sollievo della Sofferenza-Ospedale, Viale Cappuccini, 71013 San Giovanni Rotondo (FG) and 5Dipartimento di Biochimica e Biologia Molecolare - Università di Bari, Via E. Orabona 4, 70126 Bari, Italy
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: A major challenge in current biomedical research is the identification of cellular processes deregulated in a given pathology through the analysis of gene expression profiles. To this end, predefined lists of genes, coding specific functions, are compared with a list of genes ordered according to their values of differential expression measured by suitable univariate statistics.
Results: We propose a statistically well-founded method for measuring the relevance of predefined lists of genes and for assessing their statistical significance starting from their raw expression levels as recorded on the microarray. We use prediction accuracy as a measure of relevance of the list. The rationale is that a functional category, coded through a list of genes, is perturbed in a given pathology if it is possible to correctly predict the occurrence of the disease in new subjects on the basis of the expression levels of the genes belonging to the list only. The accuracy is estimated with a multiple random validation strategy and its statistical significance is assessed against two null hypotheses, by using two independent permutation tests. The utility of the proposed methodology is illustrated by analyzing the relevance of Gene Ontology terms belonging to the biological process category in colon and prostate cancer, by using three different microarray data sets and by comparing it with current approaches.
Availability: Source code for the algorithms is available from the authors upon request.
Contact: ancona{at}ba.issia.cnr.it
Supplementary information: Colon cancer data set and a complete description of experimental results are available at: ftp://bioftp:76bioftpxxx@marx.ba.issia.cnr.it/supp-info.htm
| 1 INTRODUCTION |
|---|
The advent of DNA microarray technology has driven an epochal change in genomic research. It provides a genome-wide expression snapshot of the samples analyzed and yields meaningful insights into biological mechanisms (Schena et al., 1995). Typically, gene expression profiles relative to samples belonging to two distinct categories (e.g. diseased patients versus healthy controls, or patients in two different stages of the same pathology) are collected (Alon et al., 1999; Barrier et al., 2006). Subsequently, suitable univariate statistics are used for finding those genes which are differentially expressed in the experimental conditions analyzed (Golub et al., 1999; Guyon et al., 2002). The common approach is to use the list of selected genes for identifying and understanding the main cellular processes relevant in the given experimental conditions. Such processes are coded through lists of genes, defined on the basis of a priori biological knowledge, composed of those genes which are co-expressed in the particular cellular mechanism or function (Ashburner et al., 2000; Kanehisa et al., 2002; Khatri et al., 2002). So the problem of identifying biological processes (BPs) correlated to the phenotype reduces to the problem of comparing lists of genes (Subramanian et al., 2005; Tian et al., 2005). Currently, this approach is the de facto standard for the secondary analysis of gene expression profiles and many tools have been proposed for this purpose (Khatri and Draghici, 2005).
This approach has a few major limitations. (a) Single gene analysis provides a limited view of the phenomena under examination since it does not take into account interactions among genes and is unable to uncover the correlation between groups of genes and phenotype. Many different genes contribute to a given disorder with no particular gene having a remarkably large effect (Risch, 2000). Thus, a specific phenotype may result from the combination of effects by a large number of moderately contributing genes. (b) The information embedded in genes weakly connected with the phenotype may be lost due to both the statistic adopted and the correction for multiple hypothesis testing. (c) The secondary analysis of a particular functional category is decoupled from the actual values of the gene expression levels measured in the microarray experiments. In any biological phenomenon, different genes are regulated to different extents, and the differential expression of the genes can be useful for ranking the BPs according to their relevance with respect to the phenotype (Khatri and Draghici, 2005).
To overcome the aforementioned limitations, we propose a statistically well-founded method for measuring the relevance of predefined lists of genes in given experimental conditions and for assessing their statistical significance, by analyzing their raw expression levels as recorded on the microarrays. We use the prediction accuracy of the phenotype as a measure of relevance or correlation of the list. The rationale is that a functional category coded through a list of genes is perturbed in a particular disease if it is possible to correctly predict the occurrence of the pathology in new subjects, on the basis of the expression levels of those genes only. In other words, a functional category is informative for or is deregulated in a disease if the expression levels of the genes involved in the category are useful for training classifiers able to generalize, i.e. able to correctly predict the status of new subjects (Vapnik, 1995). So, the generalization ability of predictors trained by using the expression levels of the genes cooperating in a given cellular mechanism or function can be seen as a measure of the relevance of the function in the pathology at hand. The phenotype is predicted through regularized least squares (RLS) classifiers (Ancona et al., 2005, 2006; Maglietta et al., 2007; Rifkin et al., 2003), a valuable alternative to support vector machine (SVM) classifiers (Vapnik, 1995) for tumor classification by DNA microarray data.
The prediction accuracy of the phenotype is estimated by using a multiple random validation strategy which provides a statistically significant estimate of the generalization error of outcome cancer predictors (Michiels et al., 2005; Mukherjee et al., 2003). The statistical significance of the measured accuracy is assessed against two null hypotheses by using two independent permutation tests (Good, 1994). The first one aims at measuring how much of the estimated prediction accuracy is due to the actual correlation existing between the expression levels of the genes in the list and the phenotype, and how much is due to chance. The second one aims at evaluating how the accuracy depends on the identity of the genes present in the list. In particular, it aims at assessing whether lists of the same size composed of genes randomly selected from the ones present on the microarray produce comparable prediction accuracies. Moreover, to account for multiple hypothesis testing, we adjust the estimated significance level. In particular, we control the proportion of false positives by calculating the false discovery rate (FDR) (Storey and Tibshirani, 2003), defined as the proportion of false findings among the alternative hypotheses accepted at a given level of statistical significance.
As an application of the proposed method, we measure the relevance of Gene Ontology (GO) terms (Ashburner et al., 2000) belonging to the category of BPs in colon and prostate cancer and discuss the biological implications of those terms found deregulated in the analyzed pathologies with high statistical significance.
| 2 MATERIALS AND METHODS |
|---|
2.1 Data set description
Three DNA microarray data sets were used. The first data set was collected in Casa Sollievo della Sofferenza Hospital, Foggia, Italy (Ancona et al., 2006). The data set is made up of 22 normal and 25 tumor specimens of patients affected by colon cancer, profiled using the Affymetrix (Santa Clara, CA, USA) HGU133A GeneChip (22 283 probe sets), provided as Supplementary Material. The second data set is composed of 52 tumor and 50 normal specimens of patients affected by prostate cancer (Singh et al., 2002), profiled using the Affymetrix (Santa Clara, CA, USA) HGU95Av2 GeneChip (12 625 probe sets). The third data set concerns patients affected by stage II colon cancer, and it is composed of 50 specimens: 25 specimens came from patients who developed a metachronous metastasis, whereas the other 25 specimens came from patients who remained disease-free for at least 5 years (Barrier et al., 2006). All the specimens were profiled using the Affymetrix (Santa Clara, CA, USA) HGU133A GeneChip (22 283 probe sets).
2.2 Algorithms
2.2.1 Mapping probes to gene lists
Let $L = \{g_1, g_2, \ldots, g_l\}$ be a list of l genes supposed to be involved in a given cellular mechanism. We build the set $P_L = \{p_1, p_2, \ldots, p_n\}$ composed of all the probes present on the microarray associated to the genes in L. Denoting by $x \in \mathbb{R}^d$ a given microarray composed of d probes, $x_L = (x_{p_1}, x_{p_2}, \ldots, x_{p_n})$ is the set of expression levels of the genes belonging to the list L, where $x_L \in \mathbb{R}^n$ and $n \ll d$.
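As an illustration, the mapping from a gene list to the expression levels of its probes can be sketched in Python as follows. The annotation dictionary `probe_to_genes`, the probe identifiers and the gene symbols are hypothetical placeholders; in practice they would come from an annotation source such as NetAffx.

```python
def probes_for_list(gene_list, probe_to_genes):
    """Return the sorted probe IDs annotated to at least one gene of the list."""
    genes = set(gene_list)
    return sorted(p for p, gs in probe_to_genes.items() if genes & set(gs))


def restrict(sample, probes):
    """Keep only the expression levels of the selected probes (x -> x_L)."""
    return [sample[p] for p in probes]


# Hypothetical annotation: probe ID -> gene symbols it measures
probe_to_genes = {
    "probe_1": ["GENE_A"],
    "probe_2": ["GENE_B"],
    "probe_3": ["GENE_A", "GENE_C"],
    "probe_4": ["GENE_D"],
}
# Hypothetical expression levels of one microarray (d = 4 probes)
sample = {"probe_1": 7.2, "probe_2": 5.1, "probe_3": 9.4, "probe_4": 3.3}

P_L = probes_for_list(["GENE_A", "GENE_C"], probe_to_genes)
x_L = restrict(sample, P_L)
```

Note that a gene may be measured by several probes, so the number n of probes can differ from the number l of genes in the list.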
2.2.2 Estimating prediction accuracy of gene lists
We are given a data set $S = \{(x_1, y_1), (x_2, y_2), \ldots, (x_\ell, y_\ell)\}$ composed of $\ell$ labeled specimens, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$ for $i = 1, 2, \ldots, \ell$. Let us suppose we have $\ell_+$ positive and $\ell_-$ negative examples, such that $\ell = \ell_+ + \ell_-$. Moreover, we are given a list $L = \{g_1, g_2, \ldots, g_l\}$ of l genes. The objective is to measure the accuracy or generalization ability of predictors from the available data in S by using the expression levels of the genes present in L only. To this end, we build a reduced data set $S_L$ composed of $\ell$ examples whose components are the expression levels of probes on the microarray relative to the genes in L, i.e. $S_L = \{(x_{L,1}, y_1), (x_{L,2}, y_2), \ldots, (x_{L,\ell}, y_\ell)\}$, where $x_{L,i} = (x_{i,p})_{p \in P_L}$, for $i = 1, 2, \ldots, \ell$ and $P_L$ is the set of all the probes present on the microarray associated to the genes in L. Subsequently, a multiple random cross validation strategy is adopted for estimating the accuracy of RLS predictors. In this approach s pairs $(D_h, T_k)$ of training and test sets are built by random sampling without replacement from the data set $S_L$, with h and k examples respectively, where $\ell = h + k$. In the training/test split of the data, the same proportion of positive and negative examples as in $S_L$ is preserved. For every random split, an RLS classifier is trained by using the examples in $D_h$ and its error rate ei is evaluated by testing the classifier on $T_k$. The selection of the parameter on which the classifier depends is carried out by using the examples in Dh only. In particular, the $\lambda$ parameter in RLS is selected by minimizing the leave-one-out (LOO) error. Note that in the case of RLS, the evaluation of the LOO error requires just one training (Ancona et al., 2005). This procedure for selecting the parameter ensures that ei is unbiased as it involves only the data belonging to the training set. Finally, the error rate eL associated to the gene list L is given by $e_L = \frac{1}{s} \sum_{i=1}^{s} e_i$.
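The validation scheme of this section can be sketched with NumPy as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the dual RLS system $(K + \lambda h I)c = y$ and the closed-form LOO values follow a common formulation of RLS, the grid of $\lambda$ values is hypothetical, the per-split errors are averaged, and the data are synthetic.

```python
import numpy as np


def rls_fit_loo(K, y, lam):
    """Fit kernel RLS; return (coefficients, leave-one-out error rate).

    Solves (K + lam * n * I) c = y. The LOO prediction for example i is
    y_i - c_i / G_inv[i, i], so a single training run is enough.
    """
    n = len(y)
    G_inv = np.linalg.inv(K + lam * n * np.eye(n))
    c = G_inv @ y
    loo_pred = y - c / np.diag(G_inv)
    return c, float(np.mean(np.sign(loo_pred) != y))


def stratified_split(y, h, rng):
    """Random split with about h training examples, preserving class proportions."""
    train = []
    for label in (-1, 1):
        idx = rng.permutation(np.flatnonzero(y == label))
        n_train = int(round(h * len(idx) / len(y)))
        train.extend(idx[:n_train].tolist())
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test


def list_error_rate(X, y, h, s=100, lambdas=(1e-3, 1e-2, 1e-1, 1.0, 10.0), seed=0):
    """Estimate e_L as the mean test error over s random splits (linear kernel)."""
    rng = np.random.default_rng(seed)
    errors = []
    for _ in range(s):
        tr, te = stratified_split(y, h, rng)
        K = X[tr] @ X[tr].T
        # lambda is selected on the training set only, by minimum LOO error
        lam = min(lambdas, key=lambda v: rls_fit_loo(K, y[tr], v)[1])
        c, _ = rls_fit_loo(K, y[tr], lam)
        pred = np.sign(X[te] @ X[tr].T @ c)
        errors.append(np.mean(pred != y[te]))
    return float(np.mean(errors))


# Synthetic stand-in for a reduced data set S_L: 22 negative and 25 positive
# specimens, n = 10 probes, of which only the first one carries signal.
rng = np.random.default_rng(1)
y = np.array([-1.0] * 22 + [1.0] * 25)
X = rng.normal(size=(47, 10))
X[:, 0] += 3.0 * y
e_L = list_error_rate(X, y, h=30, s=50)
```

Because the test examples never enter the selection of $\lambda$, each split error is computed on data the classifier has not seen.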
2.2.3 Assessing the statistical significance of eL
The assessment of the statistical significance of the measured eL is carried out by performing two independent permutation tests (Fig. 1). The first one (T1) aims at measuring how much of eL is due to the actual correlation between the genes in L and the phenotype and how much is due to chance. To this end, we estimate the empirical probability density function of eL under the null hypothesis $H_0$ in which $x_L$ and y are supposed to be independent random variables. Specifically, $\pi_1$ random permutations of the phenotypic labels of the examples in $S_L$ are performed and the relative error rates $e_i^y$, $i = 1, 2, \ldots, \pi_1$, are evaluated. The nominal P-value py relative to eL is then given by the percentage of random errors $e_i^y$ smaller than eL: $p_y = \frac{1}{\pi_1} \sum_{i=1}^{\pi_1} I(e_i^y < e_L)$, where I is the indicator function.
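A minimal Python sketch of the label permutation test T1 follows. The function `toy_error` is a hypothetical stand-in for the cross-validated RLS error rate, and the data are synthetic. The strict comparison mirrors the wording "smaller than"; counting ties with `<=` instead would give a more conservative P-value.

```python
import random


def permutation_p_value(observed_error, null_errors):
    """Fraction of null error rates that are smaller than the observed one."""
    return sum(e < observed_error for e in null_errors) / len(null_errors)


def label_permutation_test(X, y, error_fn, n_perm=1000, seed=0):
    """T1-style test: recompute the error after shuffling the phenotype labels."""
    rng = random.Random(seed)
    e_obs = error_fn(X, y)
    null_errors = []
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)
        null_errors.append(error_fn(X, y_perm))
    return e_obs, permutation_p_value(e_obs, null_errors)


def toy_error(X, y):
    """Hypothetical error estimator: predict the class from the sign of feature 0."""
    return sum((row[0] > 0) != (label > 0) for row, label in zip(X, y)) / len(y)


# Synthetic data: one feature that follows the label, except for two specimens
y = [-1] * 10 + [1] * 10
X = [[2.0 * label] for label in y]
X[0][0], X[19][0] = 2.0, -2.0

e_L, p_y = label_permutation_test(X, y, toy_error, n_perm=500)
```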
The second permutation test (T2) aims at evaluating how much eL depends on the n genes cooperating in the biological function coded by the list L and how much it depends only on the size of the list. In particular, in this test we assess whether lists of the same size as L, composed of genes randomly selected from the ones present on the microarray, produce error rates smaller than eL. To this end, we estimate the empirical probability density function of eL under a different null hypothesis: $\pi_2$ lists Li*, $i = 1, 2, \ldots, \pi_2$, are generated, composed of n probes randomly drawn from the ones available, and the corresponding error rate is estimated for each of them. The nominal P-value pn is the fraction of these random lists whose error rate is smaller than eL.
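The random-list test T2 can be sketched in the same spirit. Here `sum_sign_error` is a hypothetical stand-in for the cross-validated RLS error rate, and the six-probe data set is synthetic: probes 0 and 1 follow the phenotype, probes 2 to 5 alternate independently of it.

```python
import random


def random_list_test(X, y, list_probes, error_fn, n_lists=1000, seed=0):
    """T2-style test: compare e_L with the errors of random probe lists of equal size."""
    rng = random.Random(seed)
    n_total = len(X[0])

    def select(cols):
        return [[row[j] for j in cols] for row in X]

    e_obs = error_fn(select(list_probes), y)
    n_smaller = 0
    for _ in range(n_lists):
        cols = rng.sample(range(n_total), len(list_probes))
        if error_fn(select(cols), y) < e_obs:
            n_smaller += 1
    return e_obs, n_smaller / n_lists


def sum_sign_error(X, y):
    """Hypothetical error estimator: predict the class from the sign of the row sum."""
    return sum((sum(row) > 0) != (label > 0) for row, label in zip(X, y)) / len(y)


y = [-1] * 10 + [1] * 10
X = [[2 * lab, 2 * lab] + [1 if i % 2 == 0 else -1] * 4 for i, lab in enumerate(y)]

e_inf, p_n_inf = random_list_test(X, y, [0, 1], sum_sign_error, n_lists=500)
e_noise, p_n_noise = random_list_test(X, y, [4, 5], sum_sign_error, n_lists=500)
```

In this toy setting the informative list cannot be beaten by any random list, whereas the uninformative list is beaten by every random list that happens to contain an informative probe.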
2.2.4 Multiple hypothesis testing
To account for multiple hypothesis testing, an estimate of the FDR is computed for the lists with a nominal P-value py (pn) in a given rejection region. Here, we describe the procedure for computing the FDRy relative to py. The same procedure is adopted for computing an estimate of FDRn relative to pn. The estimate of FDR is based on a permutation procedure (Barry et al., 2005). Let
$p_1, p_2, \ldots, p_m$ be the nominal P-values of the lists, where m is the total number of lists. Then $R(p) = \#\{j : p_j \le p\}$ indicates the number of lists called significant at level p. Moreover, let Ey be the $\pi_1 \times m$ matrix of error rates $e_{ij}^y$ estimated in the ith random permutation of the labels, relative to the jth gene list. Such a matrix is converted into a $\pi_1 \times m$ matrix of permuted P-values with elements
|
|
|
|
Note that the quantity in square brackets in the denominator represents an estimate of the number of true positive lists at a given level of statistical significance.
| 3 RESULTS |
|---|
We used the annotation facilities provided by NetAffx Analysis Center1 for mapping probe sets of HGU133A and HGU95Av2 technologies to the GO terms relative to the BPs.
3.1 Colon cancer
In the colon cancer data set (Ancona et al., 2006), we found 2158 BPs represented in the HGU133A platform, with a number of probes ranging from 2 to 16 523. The error rate eL relative to a list L was estimated by performing s = 1000 cross-validations of the reduced data set $S_L$ composed of the expression levels of the genes belonging to L only. In each cross-validation, we used 30 examples for training and the remaining 17 for testing RLS classifiers with linear kernel. We found 759 BPs with an error rate
. To assess the statistical significance of eL, the permutation test T1 was carried out. To this end, for each list, $\pi_1 = 5000$ random permutations of the phenotypic labels were performed and the relative error rates $e_i^y$, $i = 1, 2, \ldots, \pi_1$, were evaluated. The test T1 revealed 666 statistically significant BPs (
) having error rates
.
The second phase of the analysis aimed at determining whether the accuracy estimated for a particular BP was due to the identity of the genes cooperating in the given cellular mechanism, or simply to the number of genes present in the list. To this end, we carried out the permutation test T2. Specifically, denoting by n the size of a list L, $\pi_2$ lists Li*, $i = 1, 2, \ldots, \pi_2$, were generated, with $\pi_2 = 1000$, composed of n probes randomly drawn from the ones available on the microarray. The corresponding error rate of each random list was estimated by performing 200 random cross-validations. Such analysis revealed 51 BPs (
) having an error rate
(see Table 1). Note that the BP with the maximum number of probes detected by our method is localization composed of
probes, with eL = 13%,
. This number provides an upper bound on the size of the statistically significant lists detectable in the current experimental conditions. Many BPs correlated with the phenotype in our data set with a greater number of probes do not have any statistical significance. For example, regulation of BP, composed of n = 5318 probes, provides an error rate eL = 15% with py = 0.0008,
, but we cannot claim that such a category is really deregulated in our data set. In fact, as its second P-value shows (pn = 0.245), many lists composed of the same number of probes provide error rates smaller than eL. This highlights that the current experimental conditions do not allow us to detect whether categories composed of more than 4000 probes are deregulated in the given pathology. This experimental evidence is illustrated in Figure 2, which depicts the distribution of the error rate of random lists as a function of the list size. As the picture shows, the higher the value of n, the more peaked around the median the distribution of the values of en becomes. As a consequence, the error rate eL associated to lists composed of a large number n of genes is not statistically significant because many lists composed of the same number of genes produce comparable accuracies, independently of the identity of the genes belonging to the list. On the other hand, our methodology is able to detect BPs significantly involved in the pathology that are composed of a small number of probes. For example, lipoxygenase pathway, n = 4 (eL = 13%, py = 0.001,
and arachidonic acid metabolism, n = 10 (eL = 17%, py = 0.001,
, highly deregulated in our data set and strongly correlated to colon cancer (Ulrich et al., 2006) would not have been detected by methods which limit the analysis to those categories composed of an a priori defined minimum number of genes (Barry et al., 2005).
The BP most correlated to colon cancer in our data set is lipid metabolism. This GO term is represented by n = 844 probes on the chip and provides an error rate of eL = 12% (py = 0.0008,
To understand the relevance of each single gene in the pathology at hand and to elucidate the importance of simultaneously considering genes acting in concert for outcome cancer prediction, we measured the prediction accuracy associated to each probe (Maglietta et al., 2007). This is equivalent to applying our method to lists composed of one probe only. In this case, for assessing the statistical significance of the measured accuracy, only the permutation test T1 of the phenotypic labels was carried out. We found 471 probes
with error rate lower than 28%. In Table 1 for each BP, we report the minimum value of the error rate
associated to the probes belonging to the lists selected by our method. This table sheds light on the importance of simultaneously considering genes cooperating in the same biological function for cancer prediction. Importantly, this table reveals that simultaneously considering genes weakly connected with the phenotype increases the prediction accuracy. For example, the minimum error rate measured in cell migration is
. Moreover, the mean error of the probes in this BP is 44%. Nevertheless, the error rate estimated by using all the genes involved in this BP is eL = 13%, which is equivalent to a reduction of 23% of
. This indicates the importance of also considering weakly connected or moderately deregulated genes in measuring cancer outcome prediction as well as for evaluating the degree of correlation of the category with the phenotype.
3.2 Prostate cancer
In the prostate cancer data set (Singh et al., 2002), we found 2509 BPs represented, having a number of probes ranging from 2 to 10 568. In the analysis of this data set, we used the same parameters as for colon, with the exception of the sizes of training and test sets: 65 examples for training and 37 for testing were used in multiple cross-validation. The T1 permutation test revealed 1109 statistically significant BPs (
) having error rates
. The T2 permutation test reduced the number of statistically significant BPs to 18 (
) having error rates
(see Table 2 and Supplementary Material). The differences between normal and tumor specimens are sharper in prostate than in colon cancer. In fact, the smallest error rate is 8% in prostate and 12% in colon (see Supplementary Material). The importance of jointly considering genes cooperating in the same biological function for cancer prediction is confirmed in this data set. In the anti-apoptosis BP, for example, the minimum error rate decreases by 50% when genes weakly connected with the phenotype are also considered.
3.3 Stage II colon cancer
In the analysis of the stage II colon cancer data set (Barrier et al., 2006), we used the same parameters as for the first data set. As these two data sets were obtained by using the same technology, we had the same number of BPs represented. The T1 permutation test showed 69 statistically significant BPs (
| 4 BIOLOGICAL ANALYSIS |
|---|
Among the BPs most deregulated in the colon cancer data set, we found Lipid metabolism, which is important for its ability to modulate tumorigenesis and for applications in the treatment and chemoprevention of neoplastic diseases. Long-chain fatty acids are essential constituents of membrane lipids and are important substrates for energy metabolism in the cells. Lipogenic gene expression is mainly regulated by sterol regulatory element-binding proteins at the transcriptional level (Consolazio et al., 2006). During tissue differentiation, lipogenic gene expression is linked to exit from the cell cycle. Similar regulation is active in many proliferative human cancers, including colorectal carcinoma, and in fetal tissues, including the bowel, with linkage to proliferation. These observations suggest the existence of a coordinate regulatory mechanism for lipogenic gene expression in the context of proliferation, which might function normally during fetal development and become abnormally activated during carcinogenesis (Li et al., 2000). Recently, a possible correlation between diet and colon cancer risk, connected to the deregulation of the Fatty Acid Metabolism and Organic Acid Metabolism BPs, has been described. Diets high in energy and saturated fat, with high glycemic index carbohydrate and low levels of fiber and n–3 fatty acids, lead to insulin resistance with hyperinsulinemia, hyperglycemia and hypertriglyceridemia. These findings suggest that insulin, the related insulin-like growth factors, triglycerides and non-esterified fatty acids could lead to increased growth of colon cancer precursor lesions and the development of colorectal cancer (Bruce et al., 2000a). These circulating factors subject colonic epithelial cells to a proliferative stimulus and also expose them to reactive oxygen intermediates. These long-term exposures result in the promotion of colon cancer (Bruce et al., 2000b).
Another critical and indispensable component of tumor progression is inflammation, which includes Arachidonic Acid Metabolism and Lipoxygenase pathway, two other BPs highlighted by our analysis. Arachidonic acid is a substrate for the production of leukotrienes and prostaglandins found to be expressed in neoplastic tissue, including colon cancer (Cornelia et al., 2006). Recent evidence shows that colon cancers target the prostaglandin biogenesis pathway by ubiquitously abrogating expression of a prostaglandin-degrading enzyme that is highly expressed in normal colon mucosa but ubiquitously lost in human colon cancers, and that physiologically antagonizes cyclooxygenase-2 (COX-2), a key enzyme of inflammation (Myung et al., 2006; Yan et al., 2004). Epidemiological studies have provided interesting and promising data regarding prevention of tumorigenesis.
Among our BPs strongly associated to prevention of colon cancer and correlated with one another, we found Epidermal Growth Factor Receptor (EGFR) signaling pathway, Angiogenesis and Maintenance of localization. Multiple growth factor receptors are deregulated in colon cancer, presenting potential targets for therapeutic intervention. Growth factor activation of receptors, together with increased COX-2 expression and the consequent up-regulation of prostaglandin, could promote both deregulation of angiogenesis and motility-related characteristics, such as cell cytoskeleton and cell–cell junction structure, and hence the invasiveness of cancer cells (Kumar, 2005).
In the prostate cancer data set (Table 2), among the most deregulated BPs we found the Retinol, Retinoid and Vitamin A metabolisms. The term vitamin A is used to denote retinol or all-trans-retinol and a family of biologically active retinoids derived from it. Retinoids exert potent apoptotic effects both in development and in cancer cells. Induction of apoptosis by retinoids has been observed in various prostate cancer cells in vitro and in vivo and appears to be associated with down-regulation of Bcl-2 expression, induction of insulin-like growth factor-binding protein-3 (IGFBP-3) and tissue transglutaminase, an enzyme that accumulates in cells undergoing apoptosis (Zhang, 2002).
Genomic instability represents an important mechanism in cancer and in particular in colon cancer. The first evidence of the accumulation of mutations in colorectal cancer in specific genes which control cell division, apoptosis and DNA repair was published many years ago (Kinzler and Vogelstein, 1996). In many forms of cancer, genomic instability creates a permissive state in which a potential cancer cell can acquire enough mutations to become a cancer cell. In colorectal cancer, three types of genetic instability have been identified (Lengauer et al., 1998). The majority of colorectal and most other solid cancers have chromosomal instability (CIN). CIN refers to an increased rate of losing or gaining whole chromosomes or large parts of chromosomes during cell division. The consequence of CIN is an imbalance in chromosome number (aneuploidy) and an increased rate of loss of heterozygosity (LOH). The so-called CIN genes are involved in spindle assembly checkpoint, DNA recombination, checkpoint control of the cell cycle and transcription, and they belong to the Cell Proliferation and Regulation of cell proliferation BPs (Milner et al., 1997; Rajagopalan et al., 2004; Yarden et al., 2002). In a small fraction of colorectal cancers, a defect in DNA mismatch repair (Lee et al., 2004; Southey et al., 2005) results in an elevated mutation rate at the nucleotide level and consequent widespread microsatellite instability (MIN) (Kinzler and Vogelstein, 1996). Among the BPs responsible for this mechanism, we have Regulation of DNA replication and Negative Regulation of progression through cell cycle (Anand et al., 2002; Grady and Markowitz, 2000; Kane et al., 1997). Epigenetic silencing (CIMP) is now recognized as a third pathway in the model of colorectal cancer tumorigenesis and can affect gene function without genetic changes.
DNA methylation within gene promoters and alterations in histone modifications appear to be the primary mediators of epigenetic inheritance in cancer cells (Kondo and Issa, 2004). In particular, in the stage II colon cancer data set we observe some BPs that are involved in epigenetic processes, such as Negative Regulation of Histone Modification, Regulation of Histone Acetylation, Regulation of Histone Modification and Histone Acetylation (Table 3). Recent evidence shows the correlation between these BPs and colon cancer progression (Kondo and Issa, 2004; Orr and Hamilton, 2007). Moreover, our analysis confirms the major role of genes coding ribosomal proteins in colon cancer progression, as hypothesized in Barrier et al. (2006). In fact, we found the Ribosome Biogenesis and Assembly BP deregulated in their data set (see Supplementary Material), although with poor statistical significance.
Importantly, many BPs commonly associated to development and progression of tumors such as apoptosis, anti-apoptosis, cell proliferation and DNA repair (Evan and Vousden, 2001) have been detected by our method (see Table 4 and Supplementary Material). In fact, as we can deduce from the values of py and FDRy reported in Table 4, these BPs are strongly deregulated in all three data sets considered. Nevertheless, the actual experimental conditions, i.e. the heterogeneity of the samples analyzed, the size of the data set and the quality of the gene lists considered, do not allow any statistical significance to be assigned to these findings, as the values of pn and FDRn indicate.
| 5 DISCUSSION AND CONCLUSIONS |
|---|
In this article, we have proposed a new method for measuring the correlation between gene lists, defined on the basis of prior biological knowledge, and the phenotype by using gene expression profiles. As a measure of correlation of the list we used the prediction accuracy of classifiers trained to predict the phenotypic labels of subjects in two different experimental conditions, by using the raw expression levels of the genes belonging to the list, as recorded by the microarray. The analysis does not involve only the differentially expressed genes (if any) in a list. On the contrary, all the genes belonging to a given BP contribute to determining the deregulation level of the process in the disease. Such an approach, which can be thought of as a natural extension of the one proposed by Maglietta et al. (2007), in which prediction accuracy is used for selecting relevant genes in microarray experiments, overcomes many of the limitations and drawbacks connected to the current methods based on the comparison between lists of genes. GSEA (Subramanian et al., 2005), for example, evaluates the correlation with the phenotype of a list L, coding a given cellular mechanism, by comparing it with an ordered list S of genes. The ranking is done according to the values of a univariate statistic which measures the differential expression of genes in the two classes. Univariate statistics do not take into account the co-expression of different genes and they are not able to evaluate the connection between groups of genes and the phenotype. As a consequence, some information could be lost by not considering genes jointly. What matters in this method is the rank of the genes in S and the corresponding values of the statistic. Genes weakly connected with the phenotype will always appear in the central positions of the list S, without contributing significantly to the final score of L.
Our method, on the contrary, simultaneously uses the expression levels of all the genes in L for measuring the correlation of L with the phenotype. The results shown in the tables and Supplementary Material suggest that genes weakly connected with the phenotypic labels y also contribute to the estimate of the correlation because they significantly influence the prediction accuracy of the outcome. In fact the error rate eL is always smaller than the mean error of the probes in the list, and in many cases lower than the minimum error
The idea of using prior biological information coded through lists of genes has been previously suggested for class finding (Redestig et al., 2006) and for class prediction (Lottaz and Spang, 2005) without systematically addressing the problem of the statistical significance of the biological findings determined in the experimental conditions analyzed.
The problem of assessing whether the correlation of a given list L with the phenotype depends on the identity of the genes belonging to the list or only on the number of genes in L has been addressed in Tian et al. (2005). The authors highlighted that when a significant proportion of genes is associated with the phenotype, a gene set will contain associated genes even if it is a purely random subset of the entire gene list. Although this aspect is not new, the method presented in Tian et al. (2005) is completely different from the one we propose here. The authors used a univariate statistic (t-test) t for measuring the correlation of each gene with the phenotype and measured the correlation of L with y by summing the t-values of the genes belonging to L: $T = \frac{1}{n}\sum_{i \in L} t_i$, where n is the size of L. The rationale is that the higher the sum T, the stronger the correlation of L with y. The statistical significance of the list L was measured by permuting the t-values and comparing the new sum with T. This is equivalent to randomly drawing n genes from the ones available on the microarray and summing the corresponding t-values. Our approach does not use any univariate statistic, and the estimate of the correlation of L is not simply a sum of univariate statistics. As a measure of correlation of a list L with the phenotype we use an estimate of the generalization error $e_L$, obtained by using the expression levels of all the genes belonging to L and present on the microarray. The statistical significance of L is assessed by randomly drawing genes from the microarray and using their expression levels for training classifiers which predict the actual phenotypic labels.
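For comparison, the list statistic attributed above to Tian et al. (2005) can be sketched in a few lines. This is an illustrative reading of the description in the text, not their code: the Welch form of the t-statistic, the two-sided comparison on |T|, the number of permutations and the simulated data are assumptions, and the function names are hypothetical.

```python
import numpy as np

def t_values(X, y):
    """Welch two-sample t-statistic for every gene (column of X); y in {0, 1}."""
    a, b = X[y == 1], X[y == 0]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return (a.mean(axis=0) - b.mean(axis=0)) / se

def gene_set_T(t, idx):
    """T = (1/n) * sum of the t-values of the n genes in the list."""
    return float(np.mean(t[idx]))

def gene_permutation_pvalue(t, idx, n_perm=2000, seed=0):
    """Compare T with the same statistic on n genes drawn at random from the array."""
    rng = np.random.default_rng(seed)
    T_obs = gene_set_T(t, idx)
    n = len(idx)
    T_null = np.array([
        t[rng.choice(len(t), size=n, replace=False)].mean()
        for _ in range(n_perm)
    ])
    # Two-sided comparison (an assumption of this sketch).
    p = (1 + np.sum(np.abs(T_null) >= abs(T_obs))) / (n_perm + 1)
    return T_obs, float(p)

# Toy data (assumed): 200 genes, the first 10 form the list L and are
# shifted upwards in class 1.
rng = np.random.default_rng(0)
n_samples, n_genes = 40, 200
y = np.repeat([0, 1], n_samples // 2)
X = rng.standard_normal((n_samples, n_genes))
L = np.arange(10)
X[y == 1, :10] += 1.0

t = t_values(X, y)
T_obs, p_value = gene_permutation_pvalue(t, L)
print("T =", round(T_obs, 3), "p =", round(p_value, 4))
```

Note that every quantity here is built from per-gene statistics; no classifier is trained, so joint effects among the genes of L cannot influence T.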
These last considerations allow us to underline an important aspect of our approach. The statistic that we use for measuring the correlation between L and y is meaningful: it is not simply a number, but an estimate of the generalization error, i.e. the expected fraction of misclassified subjects, or the probability of making an error.
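One common way of writing this quantity is given below. The notation is assumed for illustration and is not taken from the original text: $f_L$ is a classifier trained on the genes of L, $x_L$ is an expression profile restricted to those genes, and the estimate averages the test error over K random train/test splits with test sets $V_k$.

```latex
e_L = P\bigl(f_L(x_L) \neq y\bigr)
    = \mathbb{E}\bigl[\mathbf{1}\{f_L(x_L) \neq y\}\bigr],
\qquad
\hat{e}_L = \frac{1}{K}\sum_{k=1}^{K}\frac{1}{|V_k|}
            \sum_{j \in V_k}\mathbf{1}\bigl\{f_L^{(k)}(x_{L,j}) \neq y_j\bigr\}
```

Here $f_L^{(k)}$ denotes the classifier trained on the k-th training set, so $\hat{e}_L$ lies in [0, 1] and can be read directly as an error probability.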
| ACKNOWLEDGEMENTS |
|---|
R.M. is a PhD student of Dipartimento Interateneo di Fisica, Bari, associated with Istituto Nazionale di Fisica Nucleare, sez. di Bari, and with the Center of Innovative Technologies for Signal Detection and Processing (TIRES), Università degli Studi di Bari, Italy. This work was supported by grants from Regione Puglia, Progetto Strategico PS_012, and the Ministry of Health, No. RC0402GA16.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Trey Ideker
1 http://www.affymetrix.com/analysis/index.affx
Received on April 5, 2007; revised on May 14, 2007; accepted on May 21, 2007
| REFERENCES |
|---|
Alon U, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA (1999) 96:6745–6750.
Anand GM, et al. Down-regulation of HLA-A expression correlates with a better prognosis in colorectal cancer patients. Lab. Invest (2002) 82:1725–1733.
Ancona N, et al. Regularized least squares cancer classifiers from DNA microarray data. BMC. Bioinformatics (2005) 6(Suppl. 4):S2.
Ancona N, et al. On the statistical assessment of classifiers using DNA microarray data. BMC. Bioinformatics (2006) 7:387.
Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet (2000) 25:25–29.
Barry WT, et al. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics (2005) 21:1943–1949.
Barrier A, et al. Stage II colon cancer prognosis prediction by tumor gene expression profiling. J. Clin. Oncol (2006) 24:4691.
Bruce WR, et al. Mechanisms linking diet and colorectal cancer: the possible role of insulin resistance. Nutr. Cancer (2000a) 37:19–26.
Bruce WR, et al. Possible mechanisms relating diet and risk of colon cancer. Cancer Epidemiol Biomarkers Prev (2000b) 9:1271–1279.
Consolazio A. Overexpression of fatty acid synthase in ulcerative colitis. Am. J. Clin. Pathol (2006) 126:113–118.
Cornelia MU, et al. Non-steroid anti-inflammatory drugs for cancer prevention: promise, perils and pharmacogenetics. Nat. Rev. Cancer (2006) 6:130–140. Review.
Evan GI, Vousden KH. Proliferation, cell cycle and apoptosis in cancer. Nature (2001) 411:342–348.
Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999) 286:531–537.
Good P. Permutation Tests: a Practical Guide to Resampling Methods for Testing Hypotheses. (1994) New York: Springer Verlag.
Grady WM, Markowitz S. Genomic instability and colorectal cancer. Curr. Opin. Gastroenterol (2000) 16:62–67.
Guyon I, et al. Gene selection for cancer classification using support vector machines. Mach. Learn (2002) 46:389–422.
Kane MF, et al. Methylation of the hMLH1 promoter correlates with lack of expression of hMLH1 in sporadic colon tumors and mismatch repair-defective human tumor cell lines. Cancer Res (1997) 57:808–811.
Kanehisa M, et al. The KEGG databases at GenomeNet. Nucleic Acids Res (2002) 30:42–46.
Khatri P, et al. Profiling gene expression using onto-express. Genomics (2002) 79:266–270.
Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics (2005) 21:3587–3595.
Kinzler KW, Vogelstein B. Lessons from hereditary colorectal cancer. Cell (1996) 87:159–170.
Kondo Y, Issa JP. Epigenetic changes in colorectal cancer. Cancer Metastasis Rev (2004) 23:29–39.
Kumar R. Commentary: targeting colorectal cancer through molecular biology. Semin. Oncol (2005) 32(Suppl. 9):S37–S39.
Lee S, et al. Aberrant CpG island hypermethylation of multiple genes in colorectal neoplasia. Lab. Invest (2004) 84:884–893.
Lengauer C, et al. Genetic instabilities in human cancers. Nature (1998) 396:623–649.
Li JN, et al. Sterol regulatory element-binding protein-1 participates in the regulation of fatty acid synthase expression in colorectal neoplasia. Exp. Cell Res (2000) 261:159–165.
Lottaz C, Spang R. Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics (2005) 21:1971–1978.
Maglietta R, et al. Selection of relevant genes in cancer diagnosis based on their prediction accuracy. Artif. Intell. Med (2007) 40:29–44.
Michiels S, et al. Predictor of cancer outcome with microarrays: a multiple random validation strategy. Lancet (2005) 365:488–492.
Milner J, et al. Transcriptional activation functions in BRCA2. Nature (1997) 386:772–773.
Myung SJ, et al. 15-Hydroxyprostaglandin dehydrogenase is an in vivo suppressor of colon tumorigenesis. Proc. Natl Acad. Sci. USA (2006) 103:12098–12102.
Mukherjee S, et al. Estimating dataset size requirements for classifying DNA microarray data. J. Comput. Biol (2003) 10:119–142.
Orr JA, Hamilton PW. Histone acetylation and chromatin pattern in cancer. A review. Anal. Quant. Cytol. Histol (2007) 1:17–31.
Rajagopalan H, et al. Inactivation of hCDC4 can cause chromosomal instability. Nature (2004) 428:77–81.
Redestig H, et al. Integrating functional knowledge during sample clustering for microarray data using unsupervised decision trees. Biom. J (2006) 48:1–16.
Rifkin R, et al. Regularized least squares classification. In: Suykens, et al., eds. Advances in Learning Theory: Methods, Model and Applications. (2003) Amsterdam: IOS Press. 153. Vol. 190, NATO Science Series III: Computer and Systems Sciences.
Risch NJ. Searching for genetic determinants in the new millennium. Nature (2000) 405:847–856.
Schena M, et al. Quantitative monitoring of gene-expression patterns with a complementary-DNA microarray. Science (1995) 270:467–470.
Singh D, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell (2002) 1:203–209.
Southey MC, et al. Use of molecular tumor characteristics to prioritize mismatch repair gene testing in early-onset colorectal cancer. J. Clin. Oncol (2005) 23:6524–6532.
Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci (2003) 100:9440–9445.
Subramanian A, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci (2005) 102:15545–15550.
Tian L, et al. Discovering statistically significant pathways in expression profiling studies. Proc. Natl Acad. Sci (2005) 102:13544–13549.
Ulrich CM, et al. Non-steroidal anti-inflammatory drugs for cancer prevention: promise, perils and pharmacogenetics. Nat. Rev. Cancer (2006) 6:130–140.
Vapnik V. The Nature of Statistical Learning Theory. (1995) New York: Springer Verlag.
Yan M, et al. 15-Hydroxyprostaglandin dehydrogenase, a COX-2 oncogene antagonist, is a TGF-beta-induced suppressor of human gastrointestinal cancers. Proc. Natl Acad. Sci. USA (2004) 101:17468–17473.
Yarden RI, et al. BRCA1 regulates the G2/M checkpoint by activating Chk1 kinase upon DNA damage. Nat. Genet (2002) 30:265–269.
Zhang XK. Vitamin A and apoptosis in prostate cancer. Endocr. Relat. Cancer (2002) 9:87–102.