

Bioinformatics Advance Access originally published online on May 31, 2007
Bioinformatics 2007 23(16):2063-2072; doi:10.1093/bioinformatics/btm289

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Statistical assessment of functional categories of genes deregulated in pathological conditions by using microarray data

R. Maglietta 1, A. Piepoli 2, D. Catalano 3, F. Licciulli 3, M. Carella 4, S. Liuni 3, G. Pesole 3,5, F. Perri 2 and N. Ancona 1,*

1Istituto di Studi sui Sistemi Intelligenti per l'Automazione, CNR, Via Amendola 122/D-I, 70126 Bari, 2Unità Operativa di Gastroenterologia, IRCCS, ‘Casa Sollievo della Sofferenza’-Ospedale, Viale Cappuccini, 71013 San Giovanni Rotondo (FG), 3Istituto di Tecnologie Biomediche-Sezione di Bari, CNR, Via Amendola 122/D, 70126 Bari, 4Servizio di Genetica Medica, IRCCS, ‘Casa Sollievo della Sofferenza’-Ospedale, Viale Cappuccini, 71013 San Giovanni Rotondo (FG) and 5Dipartimento di Biochimica e Biologia Molecolare - Università di Bari, Via E. Orabona 4, 70126 Bari, Italy

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS
 4 BIOLOGICAL ANALYSIS
 5 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: A major challenge in current biomedical research is the identification of cellular processes deregulated in a given pathology through the analysis of gene expression profiles. To this end, predefined lists of genes, coding specific functions, are compared with a list of genes ordered according to their values of differential expression measured by suitable univariate statistics.

Results: We propose a statistically well-founded method for measuring the relevance of predefined lists of genes and for assessing their statistical significance, starting from their raw expression levels as recorded on the microarray. We use prediction accuracy as a measure of the relevance of a list. The rationale is that a functional category, coded through a list of genes, is perturbed in a given pathology if it is possible to correctly predict the occurrence of the disease in new subjects on the basis of the expression levels of the genes belonging to the list only. The accuracy is estimated with a multiple random validation strategy, and its statistical significance is assessed against two null hypotheses by using two independent permutation tests. The utility of the proposed methodology is illustrated by analyzing the relevance of Gene Ontology terms belonging to the biological process category in colon and prostate cancer, using three different microarray data sets, and by comparing it with current approaches.

Availability: Source code for the algorithms is available from the authors upon request.

Contact: ancona{at}ba.issia.cnr.it

Supplementary information: Colon cancer data set and a complete description of experimental results are available at: ftp://bioftp:76bioftpxxx@marx.ba.issia.cnr.it/supp-info.htm


    1 INTRODUCTION
 
The advent of DNA microarray technology has driven an epochal change in genomic research. It provides a genome-wide expression snapshot of the samples analyzed and yields meaningful insights into biological mechanisms (Schena et al., 1995). Typically, gene expression profiles of samples belonging to two distinct categories (e.g. diseased patients versus healthy controls, or patients in two different stages of the same pathology) are collected (Alon et al., 1999; Barrier et al., 2006). Subsequently, suitable univariate statistics are used to find the genes that are differentially expressed in the experimental conditions analyzed (Golub et al., 1999; Guyon et al., 2002). The common approach is to use the list of selected genes to identify and understand the main cellular processes relevant in the given experimental conditions. Such processes are coded through lists of genes, defined on the basis of a priori biological knowledge and composed of those genes which are co-expressed in the particular cellular mechanism or function (Ashburner et al., 2000; Kanehisa et al., 2002; Khatri et al., 2002). The problem of identifying biological processes (BPs) correlated with the phenotype thus reduces to the problem of comparing lists of genes (Subramanian et al., 2005; Tian et al., 2005). Currently, this approach is the de facto standard for the secondary analysis of gene expression profiles, and many tools have been proposed for this purpose (Khatri and Draghici, 2005).

This approach has a few major limitations. (a) Single-gene analysis provides a limited view of the phenomena under examination, since it does not take into account interactions among genes and is unable to uncover the correlation between groups of genes and the phenotype. Many different genes contribute to a given disorder, with no particular gene having a remarkably large effect (Risch, 2000). Thus, a specific phenotype may result from the combined effects of a large number of moderately contributing genes. (b) The information embedded in genes weakly connected with the phenotype may be lost owing to both the statistic adopted and the correction for multiple hypothesis testing. (c) The secondary analysis of a particular functional category is decoupled from the actual values of the gene expression levels measured in the microarray experiments. In any biological phenomenon, different genes are regulated to different extents, and the differential expression of the genes can be useful for ranking the BPs according to their relevance with respect to the phenotype (Khatri and Draghici, 2005).

To overcome the aforementioned limitations, we propose a statistically well-founded method for measuring the relevance of predefined lists of genes in given experimental conditions and for assessing their statistical significance, by analyzing their raw expression levels as recorded on the microarrays. We use the prediction accuracy of the phenotype as a measure of the relevance of the list, i.e. of its correlation with the phenotype. The rationale is that a functional category coded through a list of genes is perturbed in a particular disease if it is possible to correctly predict the occurrence of the pathology in new subjects on the basis of the expression levels of those genes only. In other words, a functional category is informative for, or deregulated in, a disease if the expression levels of the genes involved in the category are useful for training classifiers able to generalize, i.e. able to correctly predict the status of new subjects (Vapnik, 1995). Thus, the generalization ability of predictors trained by using the expression levels of the genes cooperating in a given cellular mechanism or function can be seen as a measure of the relevance of the function in the pathology at hand. The phenotype is predicted through regularized least squares (RLS) classifiers (Ancona et al., 2005, 2006; Maglietta et al., 2007; Rifkin et al., 2003), a valuable alternative to support vector machine (SVM) classifiers (Vapnik, 1995) for tumor classification by DNA microarray data.

The prediction accuracy of the phenotype is estimated by using a multiple random validation strategy, which provides a statistically significant estimate of the generalization error of cancer outcome predictors (Michiels et al., 2005; Mukherjee et al., 2003). The statistical significance of the measured accuracy is assessed against two null hypotheses by using two independent permutation tests (Good, 1994). The first one measures how much of the estimated prediction accuracy is due to the actual correlation between the expression levels of the genes in the list and the phenotype, and how much is due to chance. The second one evaluates how much the accuracy depends on the identity of the genes present in the list. In particular, it assesses whether lists of the same size, composed of genes randomly selected from the ones present on the microarray, produce comparable prediction accuracies. Moreover, to account for multiple hypothesis testing, we adjust the estimated significance level. In particular, we control the proportion of false positives by calculating the false discovery rate (FDR) (Storey and Tibshirani, 2003), defined as the proportion of false findings among the alternative hypotheses accepted at a given level of statistical significance.

As an application of the proposed method, we measure the relevance of Gene Ontology (GO) terms (Ashburner et al., 2000) belonging to the category of BPs in colon and prostate cancer and discuss the biological implications of those terms found deregulated in the analyzed pathologies with high statistical significance.


    2 MATERIALS AND METHODS
 
2.1 Data set description
Three DNA microarray data sets were used. The first data set was collected at Casa Sollievo della Sofferenza Hospital, Foggia, Italy (Ancona et al., 2006). It is made up of 22 normal and 25 tumor specimens from patients affected by colon cancer, profiled using the Affymetrix (Santa Clara, CA, USA) HGU133A GeneChip (22 283 probe sets), and is provided as Supplementary Material. The second data set is composed of 52 tumor and 50 normal specimens from patients affected by prostate cancer (Singh et al., 2002), profiled using the Affymetrix (Santa Clara, CA, USA) HGU95Av2 GeneChip (12 625 probe sets). The third data set concerns patients affected by stage II colon cancer and is composed of 50 specimens: 25 from patients who developed a metachronous metastasis and 25 from patients who remained disease-free for at least 5 years (Barrier et al., 2006). All the specimens were profiled using the Affymetrix (Santa Clara, CA, USA) HGU133A GeneChip (22 283 probe sets).

2.2 Algorithms
2.2.1 Mapping probes to gene lists
Let $L$ be a list of $l$ genes supposed to be involved in a given cellular mechanism. We build the set $P_L$ composed of all the probes present on the microarray associated with the genes in $L$. Denoting by $x \in \mathbb{R}^d$ a given microarray composed of $d$ probes, $x_L \in \mathbb{R}^n$ is the set of expression levels of the genes belonging to the list $L$, where $n = |P_L|$ and $n \ll d$.
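The mapping from a gene list to its probes can be sketched in a few lines of Python. This is only an illustration: the annotation format (a list `probe_to_gene` whose j-th entry is the gene of probe j), the function names and the toy values are assumptions made here, not part of the original software.

```python
def probes_for_list(gene_list, probe_to_gene):
    """Return P_L: the indices of all probes annotated to a gene in the list L."""
    genes = set(gene_list)
    return [j for j, gene in enumerate(probe_to_gene) if gene in genes]


def restrict(x, probe_idx):
    """Return x_L: the expression levels of one microarray x on the probes in P_L."""
    return [x[j] for j in probe_idx]


# Toy annotation (hypothetical): d = 6 probes, and one gene may have several probes.
probe_to_gene = ["TP53", "ALOX5", "TP53", "KRAS", "ALOX12", "MYC"]
L = ["ALOX5", "ALOX12", "TP53"]           # l = 3 genes
P_L = probes_for_list(L, probe_to_gene)   # n = 4 probes
x = [7.1, 5.4, 6.9, 8.2, 4.8, 9.0]        # one microarray with d = 6 expression levels
x_L = restrict(x, P_L)
```

Because a gene can be represented by more than one probe, the number of probes n can exceed the number of genes l, as in the toy case above.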

2.2.2 Estimating prediction accuracy of gene lists
We are given a data set $S = \{(x_i, y_i)\}_{i=1}^{\ell}$ composed of $\ell$ labeled specimens, where $x_i \in \mathbb{R}^d$ and $y_i \in \{-1, 1\}$ for $i = 1, 2, \dots, \ell$. Let us suppose we have $\ell_+$ positive and $\ell_-$ negative examples, such that $\ell = \ell_+ + \ell_-$. Moreover, we are given a list $L$ of $l$ genes. The objective is to measure the accuracy, or generalization ability, of predictors built from the available data in $S$ by using the expression levels of the genes present in $L$ only. To this end, we build a reduced data set $S_L$ composed of $\ell$ examples whose components are the expression levels of the probes on the microarray relative to the genes in $L$, i.e. $S_L = \{(x_i^L, y_i)\}_{i=1}^{\ell}$, where $x_i^L$ collects the components of $x_i$ indexed by $P_L$, for $i = 1, 2, \dots, \ell$, and $P_L$ is the set of all the probes present on the microarray associated with the genes in $L$. Subsequently, a multiple random cross-validation strategy is adopted for estimating the accuracy of RLS predictors. In this approach, $s$ pairs $(D_h^i, T_k^i)$, $i = 1, 2, \dots, s$, of training and test sets are built by random sampling without replacement from the data set $S_L$, with $h$ and $k$ examples respectively, where $\ell = h + k$. In the training/test split of the data, the same proportion of positive and negative examples as in $S_L$ is preserved. For every random split, an RLS classifier is trained by using the examples in $D_h^i$, and its error rate $e_i$ is evaluated by testing the classifier on $T_k^i$. The selection of the parameter on which the classifier depends is carried out by using the examples in $D_h^i$ only. In particular, the $\lambda$ parameter in RLS is selected by minimizing the leave-one-out (LOO) error. Note that in the case of RLS, the evaluation of the LOO error requires just one training (Ancona et al., 2005). This procedure for selecting the parameter ensures that $e_i$ is unbiased, as it involves only the data belonging to the training set. Finally, the error rate $e_L$ associated with the gene list $L$ is given by the mean over the splits, $e_L = \frac{1}{s} \sum_{i=1}^{s} e_i$.
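The estimation procedure can be sketched with NumPy as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the linear-kernel RLS without bias term, the scaling of the regularizer by the training-set size, the closed-form LOO residuals, the grid of candidate λ values and all function names are choices made here for the example.

```python
import numpy as np


def rls_fit(K, y, lam):
    """Train kernel RLS on kernel matrix K and return (coefficients, LOO error).

    The LOO predictions come from a single training via the closed form
    y_i - c_i / (G^-1)_ii, with G = K + lam * n * I (assumed scaling).
    """
    n = len(y)
    G_inv = np.linalg.inv(K + lam * n * np.eye(n))
    c = G_inv @ y
    loo_pred = y - c / np.diag(G_inv)
    return c, float(np.mean(np.sign(loo_pred) != y))


def stratified_split(y, h, rng):
    """Draw about h training indices, preserving the class proportions of y."""
    train = []
    for label in (-1, 1):
        idx = np.flatnonzero(y == label)
        n_label = int(round(h * len(idx) / len(y)))  # rounding may shift h by one
        train.extend(rng.choice(idx, size=n_label, replace=False))
    train = np.array(sorted(train))
    test = np.setdiff1d(np.arange(len(y)), train)
    return train, test


def list_error_rate(X, y, probe_idx, h, s=100,
                    lambdas=(1e-3, 1e-2, 1e-1, 1.0, 10.0), seed=0):
    """Estimate e_L as the mean test error over s random training/test splits."""
    rng = np.random.default_rng(seed)
    XL = X[:, probe_idx]                      # reduced data set: probes of L only
    errors = []
    for _ in range(s):
        tr, te = stratified_split(y, h, rng)
        K = XL[tr] @ XL[tr].T                 # linear kernel on the training set
        # lambda is chosen by LOO error on the training examples only
        _, lam = min((rls_fit(K, y[tr], lam)[1], lam) for lam in lambdas)
        c, _ = rls_fit(K, y[tr], lam)
        pred = np.sign(XL[te] @ XL[tr].T @ c)
        errors.append(np.mean(pred != y[te]))
    return float(np.mean(errors))
```

Here `X` is the specimens-by-probes expression matrix, `y` a NumPy array of labels in {-1, 1}, and `probe_idx` the probe indices of the list. The test examples never enter the choice of λ, which is the point of the selection rule described above.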

2.2.3 Assessing the statistical significance of eL
The assessment of the statistical significance of the measured $e_L$ is carried out by performing two independent permutation tests (Fig. 1). The first one (T1) measures how much of $e_L$ is due to the actual correlation between the genes in $L$ and the phenotype, and how much is due to chance. To this end, we estimate the empirical probability density function of $e_L$ under the null hypothesis in which $x_L$ and $y$ are supposed to be independent random variables. Specifically, $\pi_1$ random permutations of the phenotypic labels of the examples in the reduced data set are performed and the relative error rates $e_y^i$, $i = 1, 2, \dots, \pi_1$, are evaluated. The nominal P-value $p_y$ relative to $e_L$ is then given by the fraction of random errors $e_y^i$ smaller than $e_L$: $p_y = \frac{1}{\pi_1} \sum_{i=1}^{\pi_1} I(e_y^i < e_L)$, where $I$ is the indicator function.
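The label permutation test T1 can be sketched as below. The error estimator is passed in as a function; to keep the example self-contained, a toy leave-one-out nearest-centroid rule on a single expression value stands in for the RLS multiple random validation estimate. The function names, the toy data and the strict comparison (taken from the wording "smaller than") are assumptions of this sketch.

```python
import random


def loo_centroid_error(x, y):
    """Toy stand-in for the error estimate: leave-one-out error of a
    nearest-centroid rule on one expression value per specimen."""
    errors = 0
    for i in range(len(x)):
        pos = [x[j] for j in range(len(x)) if j != i and y[j] == 1]
        neg = [x[j] for j in range(len(x)) if j != i and y[j] == -1]
        mu_pos, mu_neg = sum(pos) / len(pos), sum(neg) / len(neg)
        pred = 1 if abs(x[i] - mu_pos) < abs(x[i] - mu_neg) else -1
        errors += pred != y[i]
    return errors / len(x)


def label_permutation_pvalue(error_fn, x, y, n_perm=1000, seed=0):
    """Return (e_L, p_y): the observed error and the fraction of label
    permutations whose error is strictly smaller than e_L."""
    rng = random.Random(seed)
    e_L = error_fn(x, y)
    smaller = 0
    for _ in range(n_perm):
        y_perm = list(y)
        rng.shuffle(y_perm)          # breaks any link between x and y
        if error_fn(x, y_perm) < e_L:
            smaller += 1
    return e_L, smaller / n_perm


# Toy data (hypothetical): two well separated groups of five specimens.
x = [0.1, 0.3, 0.2, 0.4, 0.0, 2.1, 2.3, 1.9, 2.2, 2.0]
y = [-1] * 5 + [1] * 5
e_L, p_y = label_permutation_pvalue(loo_centroid_error, x, y)
```

In a real analysis `error_fn` would be the multiple random validation estimate restricted to the probes of the list, and the number of permutations would match the value of π1 used in the experiments.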


Fig. 1. Statistical assessment of the measured error rate $e_L$ relative to the list $L$. In the first permutation test, the actual phenotypic labels $y$ are randomly permuted and the probes belonging to $L$ are used for predicting the permuted labels. In the second permutation test, random lists $L^*$ composed of the same number of probes as $L$ are built and used for predicting the actual phenotypic labels $y$.

 
The second permutation test (T2) evaluates how much $e_L$ depends on the identity of the $n$ genes cooperating in the biological function coded by the list $L$, and how much it depends only on the size of the list. In particular, in this test we assess whether lists of the same size as $L$, composed of genes randomly selected from the ones present on the microarray, produce error rates smaller than $e_L$. To this end, we estimate the empirical probability density function of $e_L$ under a different null hypothesis, which depends on the size $n$ of $L$. Under this hypothesis, we assume that any set $L^*$ of $n$ probes provides an error rate less than or equal to $e_L$ for predicting the actual phenotypic labels. For testing this hypothesis, $\pi_2$ lists $L_i^*$, $i = 1, 2, \dots, \pi_2$, are generated, each composed of $n$ probes randomly drawn from the ones available, and the corresponding error rates $e_n^i$ are evaluated. The nominal P-value $p_n$ relative to $e_L$ is estimated as the fraction of errors $e_n^i$ smaller than $e_L$: $p_n = \frac{1}{\pi_2} \sum_{i=1}^{\pi_2} I(e_n^i < e_L)$.
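The random-list test T2 differs from T1 only in what is randomized: the labels stay fixed and the probe set is redrawn. A minimal sketch, assuming an error estimator `error_fn` that maps a list of probe indices to an error rate; the toy estimator and the per-probe errors below are invented for illustration and do not come from the data sets of the article.

```python
import random


def random_list_pvalue(error_fn, e_L, n, d, n_lists=1000, seed=0):
    """Return p_n: the fraction of random lists of n probes (out of d)
    whose error is strictly smaller than the observed e_L."""
    rng = random.Random(seed)
    smaller = 0
    for _ in range(n_lists):
        random_probes = rng.sample(range(d), n)   # n distinct probes
        if error_fn(random_probes) < e_L:
            smaller += 1
    return smaller / n_lists


# Toy stand-in (hypothetical): each probe has a fixed error and a list's
# error is the mean over its probes. Only probes 0..3 are informative.
d = 100
probe_error = [0.1] * 4 + [0.5] * (d - 4)


def toy_error(probe_idx):
    return sum(probe_error[j] for j in probe_idx) / len(probe_idx)


p_n_informative = random_list_pvalue(toy_error, toy_error([0, 1, 2, 3]), n=4, d=d)
p_n_uninformative = random_list_pvalue(toy_error, toy_error([10, 11, 12, 13]), n=4, d=d)
```

In the toy case no random list beats the informative one, while a list of uninformative probes is beaten by every random list that happens to include an informative probe, so its P-value is far from zero.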

2.2.4 Multiple hypothesis testing
To account for multiple hypothesis testing, an estimate of the FDR is computed for the lists with a nominal P-value $p_y$ ($p_n$) in a given rejection region. Here, we describe the procedure for computing $\mathrm{FDR}_y$ relative to $p_y$; the same procedure is adopted for computing an estimate of $\mathrm{FDR}_n$ relative to $p_n$. The estimate of the FDR is based on a permutation procedure (Barry et al., 2005). Let $p_y^1, p_y^2, \dots, p_y^m$ be the nominal P-values of the lists, where $m$ is the total number of lists. Then $R(p) = \sum_{j=1}^{m} I(p_y^j \le p)$ indicates the number of lists called significant at level $p$. Moreover, let $E_y$ be the $\pi_1 \times m$ matrix of error rates $e_y^{ij}$ estimated in the $i$th random permutation of the labels, relative to the $j$th gene list. This matrix is converted into a $\pi_1 \times m$ matrix of permuted P-values with elements


$$p_y^{ij} = \frac{1}{\pi_1} \sum_{k=1}^{\pi_1} I\left(e_y^{kj} < e_y^{ij}\right).$$

Note that $p_y^{ij}$ is the P-value associated with the error rate $e_y^{ij}$. From this matrix we estimate the number of false positives $V_i(p) = \sum_{j=1}^{m} I(p_y^{ij} \le p)$ at level $p$ in the $i$th random permutation, $i = 1, 2, \dots, \pi_1$, as well as the mean number of false positives at level $p$: $\bar{V}(p) = \frac{1}{\pi_1} \sum_{i=1}^{\pi_1} V_i(p)$. Then, for a rejection region $[0,p]$, an estimate of the FDR is given by:


Formula

Note that the quantity in square brackets in the denominator represents an estimate of the number of true positive lists at a given level of statistical significance.
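The bookkeeping behind a permutation-based FDR estimate can be sketched as follows. The exact estimator of the article is not reproduced here: this sketch assumes the simplest variant, the mean number of false positives divided by the number of lists called significant, and it assumes that permuted P-values are obtained by ranking each permuted error within its own column. Function and variable names are illustrative.

```python
def permuted_pvalues(E):
    """E[i][j] is the error of list j under label permutation i.
    Each entry is converted to the fraction of permuted errors of the
    same list that are strictly smaller (assumed convention)."""
    n_perm, m = len(E), len(E[0])
    P = [[0.0] * m for _ in range(n_perm)]
    for j in range(m):
        column = [E[i][j] for i in range(n_perm)]
        for i in range(n_perm):
            P[i][j] = sum(e < column[i] for e in column) / n_perm
    return P


def fdr_estimate(p_nominal, E, p):
    """Simplified FDR estimate for the rejection region [0, p]:
    mean false positives over permutations divided by the number of
    lists called significant (an assumption, see the lead-in)."""
    R = sum(p_j <= p for p_j in p_nominal)        # lists called significant
    if R == 0:
        return 0.0
    P = permuted_pvalues(E)
    V = [sum(p_ij <= p for p_ij in row) for row in P]  # false positives per permutation
    V_bar = sum(V) / len(V)
    return min(1.0, V_bar / R)


# Toy example (hypothetical): 4 permutations, 2 gene lists.
E = [[0.5, 0.4], [0.4, 0.5], [0.6, 0.6], [0.3, 0.45]]
fdr = fdr_estimate([0.0, 0.1], E, p=0.25)
```

The column-wise ranking is quadratic in the number of permutations, which is acceptable for a sketch; sorting each column once would be the natural optimization for thousands of permutations.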


    3 RESULTS
 
We used the annotation facilities provided by the NetAffx Analysis Center to map the probe sets of the HGU133A and HGU95Av2 platforms to the GO terms relative to the BPs.

3.1 Colon cancer
In the colon cancer data set (Ancona et al., 2006), we found 2158 BPs represented in the HGU133A platform, with a number of probes ranging from 2 to 16 523. The error rate $e_L$ relative to a list $L$ was estimated by performing $s = 1000$ random cross-validations on the reduced data set composed of the expression levels of the genes belonging to $L$ only. In each cross-validation, we used 30 examples for training and the remaining 17 for testing RLS classifiers with a linear kernel. We found 759 BPs with an error rate Formula . To assess the statistical significance of $e_L$, the permutation test T1 was carried out: for each list, $\pi_1 = 5000$ random permutations of the phenotypic labels were performed and the relative error rates $e_y^i$, $i = 1, 2, \dots, \pi_1$, were evaluated. The test T1 revealed 666 statistically significant BPs (Formula ) having error rates Formula .

The second phase of the analysis aimed at determining whether the accuracy estimated for a particular BP was due to the identity of the genes cooperating in the given cellular mechanism, or simply to the number of genes present in the list. To this end, we carried out the permutation test T2. Specifically, denoting by $n$ the size of a list $L$, $\pi_2 = 1000$ lists $L_i^*$, $i = 1, 2, \dots, \pi_2$, were generated, each composed of $n$ probes randomly drawn from the ones available on the microarray. The corresponding error rate $e_n^i$ was estimated by performing 200 random cross-validations. This analysis revealed 51 BPs (Formula ) having an error rate Formula Formula (see Table 1). Note that the BP with the maximum number of probes detected by our method is localization, composed of Formula probes, with $e_L = 13\%$, Formula . This number provides an upper bound on the size of the statistically significant lists detectable in the current experimental conditions. Many BPs with a greater number of probes that are correlated with the phenotype in our data set do not have any statistical significance. For example, regulation of BP, composed of $n = 5318$ probes, provides an error rate $e_L = 15\%$ with $p_y = 0.0008$, Formula , but we cannot maintain that such a category is really deregulated in our data set. In fact, as its second P-value shows ($p_n = 0.245$), many lists composed of the same number of probes provide error rates smaller than $e_L$. This highlights that the current experimental conditions do not allow us to detect whether categories composed of more than 4000 probes are deregulated in the given pathology. This experimental evidence is illustrated in Figure 2, which depicts the distribution of the error rate of random lists as a function of the list size. As the figure shows, the higher the value of $n$, the more peaked around the median the distribution of the values of $e_n$ becomes. As a consequence, the error rate $e_L$ associated with lists composed of a large number $n$ of genes turns out not to be statistically significant, because many lists composed of the same number of genes produce comparable accuracies, independently of the identity of the genes belonging to the list. On the other hand, our methodology is able to detect BPs significantly involved in the pathology that are composed of a small number of probes. For example, lipoxygenase pathway, $n = 4$ ($e_L = 13\%$, $p_y = 0.001$, Formula Formula ), and arachidonic acid metabolism, $n = 10$ ($e_L = 17\%$, $p_y = 0.001$, Formula Formula ), highly deregulated in our data set and strongly correlated with colon cancer (Ulrich et al., 2006), would not have been detected by methods which limit the analysis to those categories composed of an a priori defined minimum number of genes (Barry et al., 2005).


 
Fig. 2. Empirical probability density functions of en determined in the T2 permutation test for different values of n. For each value of n the minimum, maximum, lower quartile, median and upper quartile values of en are reported.

 

 
Table 1. BPs most correlated with the phenotype in colon cancer data set

 
The BP most correlated with colon cancer in our data set is lipid metabolism. This GO term is represented by $n = 844$ probes on the chip and provides an error rate of $e_L = 12\%$ ($p_y = 0.0008$, Formula Formula ), lower than the error of $e = 15\%$ ($p_y = 0.027$) obtained in the same experimental conditions by using all the probes simultaneously (Ancona et al., 2006).

To understand the relevance of each single gene in the pathology at hand, and to elucidate the importance of simultaneously considering genes acting in concert for cancer outcome prediction, we measured the prediction accuracy associated with each probe (Maglietta et al., 2007). This is equivalent to applying our method to lists composed of one probe only. In this case, for assessing the statistical significance of the measured accuracy, only the permutation test T1 of the phenotypic labels was carried out. We found 471 probes Formula with an error rate lower than 28%. In Table 1, for each BP, we report the minimum value of the error rate Formula associated with the probes belonging to the lists selected by our method. This table sheds light on the importance of simultaneously considering genes cooperating in the same biological function for cancer prediction. Importantly, it reveals that simultaneously considering genes weakly connected with the phenotype increases the prediction accuracy. For example, the minimum error rate measured in cell migration is Formula Formula . Moreover, the mean error of the probes in this BP is 44%. Nevertheless, the error rate estimated by using all the genes involved in this BP is $e_L = 13\%$, which is equivalent to a reduction of 23% of Formula . This indicates the importance of also considering weakly connected or moderately deregulated genes in measuring cancer outcome prediction, as well as in evaluating the degree of correlation of the category with the phenotype.

3.2 Prostate cancer
In the prostate cancer data set (Singh et al., 2002), we found 2509 BPs represented, with a number of probes ranging from 2 to 10 568. In the analysis of this data set, we used the same parameters as for colon cancer, with the exception of the sizes of the training and test sets: 65 examples for training and 37 for testing were used in the multiple cross-validation. The T1 permutation test revealed 1109 statistically significant BPs (Formula ) having error rates Formula . The T2 permutation test reduced the number of statistically significant BPs to 18 (Formula ) having error rates Formula (see Table 2 and Supplementary Material). The differences between normal and tumor specimens are sharper in prostate than in colon cancer. In fact, the smallest error rate is 8% in prostate and 12% in colon (see Supplementary Material). The importance of jointly considering genes cooperating in the same biological function for cancer prediction is confirmed in this data set. In the anti-apoptosis BP, for example, the minimum error rate decreases by 50% when genes weakly connected with the phenotype are considered.


 
Table 2. BPs most correlated with the phenotype in prostate cancer data set

 
3.3 Stage II colon cancer
In the analysis of the stage II colon cancer data set (Barrier et al., 2006), we used the same parameters as for the first data set. As these two data sets were obtained by using the same technology, we had the same number of BPs represented. The T1 permutation test showed 69 statistically significant BPs (Formula ) having error rates Formula . The T2 permutation test reduced the number of statistically significant BPs to 52 (Formula ) having error rates Formula (see Table 3 and Supplementary Material). The small number of deregulated lists found by the method and the high estimated prediction error, in the range [16%, 25%], indicate how subtle the differences between the two phenotypes analyzed are. In this data set, all the subjects are affected by stage II colon cancer, the only difference being the recurrence of the disease.


 
Table 3. BPs most correlated with the phenotype in stage II colon cancer data set

 

    4 BIOLOGICAL ANALYSIS
 
Among the BPs most deregulated in the colon cancer data set, we found Lipid metabolism, which is important for its ability to modulate tumorigenesis and for its applications in the treatment and chemoprevention of neoplastic diseases. Long-chain fatty acids are essential constituents of membrane lipids and are important substrates for energy metabolism in cells. Lipogenic gene expression is mainly regulated by sterol regulatory element-binding proteins at the transcriptional level (Consolazio et al., 2006). During tissue differentiation, lipogenic gene expression is linked to exit from the cell cycle. Similar regulation is active in many proliferative human cancers, including colorectal carcinoma, and in fetal tissues, including the bowel, with linkage to proliferation. These observations suggest the existence of a coordinate regulatory mechanism for lipogenic gene expression in the context of proliferation, which might function normally during fetal development and become abnormally activated during carcinogenesis (Li et al., 2000). Recently, a possible correlation between diet and colon cancer risk, connected to the deregulation of the Fatty Acid Metabolism and Organic Acid Metabolism BPs, has been described. Diets high in energy and saturated fat, with high glycemic index carbohydrate and low levels of fiber and n–3 fatty acids, lead to insulin resistance with hyperinsulinemia, hyperglycemia and hypertriglyceridemia. This suggests that insulin, the related insulin-like growth factors, triglycerides and non-esterified fatty acids could lead to increased growth of colon cancer precursor lesions and the development of colorectal cancer (Bruce et al., 2000a). These circulating factors subject colonic epithelial cells to a proliferative stimulus and also expose them to reactive oxygen intermediates. These long-term exposures result in the promotion of colon cancer (Bruce et al., 2000b).

Another critical and indispensable component of tumor progression is inflammation, which involves Arachidonic Acid Metabolism and the Lipoxygenase pathway, two other BPs highlighted by our analysis. Arachidonic acid is a substrate for the production of leukotrienes and prostaglandins, found to be expressed in neoplastic tissue, including colon cancer (Cornelia et al., 2006). Recent evidence shows that colon cancers target the prostaglandin biogenesis pathway by ubiquitously abrogating the expression of a prostaglandin-degrading enzyme that is highly expressed in normal colon mucosa and physiologically antagonizes cyclooxygenase-2 (COX-2), a key enzyme of inflammation (Myung et al., 2006; Yan et al., 2004). Epidemiological studies have provided interesting and promising data regarding the prevention of tumorigenesis.

Among our BPs strongly associated with the prevention of colon cancer and correlated with one another, we found Epidermal Growth Factor Receptor (EGFR) signaling pathway, Angiogenesis and Maintenance of localization. Multiple growth factor receptors are deregulated in colon cancer, presenting potential targets for therapeutic intervention. Growth factor activation of receptors, together with increased COX-2 expression and the consequent up-regulation of prostaglandins, could promote both the deregulation of angiogenesis and motility-related characteristics, such as cell cytoskeleton and cell–cell junction structure, and hence the invasiveness of cancer cells (Kumar, 2005).

In the prostate cancer data set (Table 2), among the most deregulated BPs we found the ‘Retinol, Retinoid and Vitamin A metabolisms’. The term ‘vitamin A’ is used to denote retinol, or all-trans-retinol, and a family of biologically active retinoids derived from it. Retinoids exert potent apoptotic effects both in development and in cancer cells. Induction of apoptosis by retinoids has been observed in various prostate cancer cells in vitro and in vivo and appears to be associated with down-regulation of Bcl-2 expression and with induction of insulin-like growth factor-binding protein-3 (IGFBP-3) and of tissue transglutaminase, an enzyme that accumulates in cells undergoing apoptosis (Zhang, 2002).

Genomic instability represents an important mechanism in cancer, and in colon cancer in particular. The first evidence of the accumulation of mutations in colorectal cancer in specific genes which control cell division, apoptosis and DNA repair was published many years ago (Kinzler and Vogelstein, 1996). In many forms of cancer, genomic instability creates a permissive state in which a potential cancer cell can acquire enough mutations to become a cancer cell. In colorectal cancer, three types of genetic instability have been identified (Lengauer et al., 1998). The majority of colorectal and most other solid cancers have chromosomal instability (CIN). CIN refers to an increased rate of losing or gaining whole chromosomes or large parts of chromosomes during cell division. The consequence of CIN is an imbalance in chromosome number (aneuploidy) and an increased rate of loss of heterozygosity (LOH). The so-called ‘CIN genes’ are involved in the spindle assembly checkpoint, DNA recombination, checkpoint control of the cell cycle and transcription, and they belong to the Cell Proliferation and Regulation of cell proliferation BPs (Milner et al., 1997; Rajagopalan et al., 2004; Yarden et al., 2002). In a small fraction of colorectal cancers, a defect in DNA mismatch repair (Lee et al., 2004; Southey et al., 2005) results in an elevated mutation rate at the nucleotide level and consequent widespread microsatellite instability (MIN) (Kinzler and Vogelstein, 1996). Among the BPs responsible for this mechanism, we have Regulation of DNA replication and Negative Regulation of progression through cell cycle (Anand et al., 2002; Grady and Markowitz, 2000; Kane et al., 1997). Epigenetic silencing (CIMP) is now recognized as a ‘third pathway’ in the model of colorectal cancer tumorigenesis and can affect gene function without genetic changes. DNA methylation within gene promoters and alterations in histone modifications appear to be the primary mediators of epigenetic inheritance in cancer cells (Kondo and Issa, 2004). In particular, in the stage II colon cancer data set we observe some BPs that are involved in epigenetic processes, such as Negative Regulation of Histone Modification, Regulation of Histone Acetylation, Regulation of Histone Modification and Histone Acetylation (Table 3). Recent evidence shows the correlation between these BPs and colon cancer progression (Kondo and Issa, 2004; Orr and Hamilton, 2007). Moreover, our analysis confirms the major role of genes coding ribosomal proteins in colon cancer progression, as hypothesized in Barrier et al. (2006). In fact, we found the Ribosome Biogenesis and Assembly BP deregulated in their data set (see Supplementary Material), even though with poor statistical significance.

Importantly, many BPs commonly associated with the development and progression of tumors, such as apoptosis, anti-apoptosis, cell proliferation and DNA repair (Evan and Vousden, 2001), have been detected by our method (see Table 4 and Supplementary Material). In fact, as we can deduce from the values of py and Formula reported in Table 4, these BPs are strongly deregulated in all three data sets considered. Nevertheless, the actual experimental conditions, i.e. the heterogeneity of the samples analyzed, the size of the data set and the quality of the gene lists considered, do not allow us to assign any statistical significance to these findings, as the values of pn and Formula indicate.


Table 4 Statistical assessment of some BPs commonly deregulated in cancer by using colon, prostate and stage II colon cancer data sets

 

    5 DISCUSSION AND CONCLUSIONS
 
In this article, we have proposed a new method for measuring the correlation between gene lists, defined on the basis of prior biological knowledge, and the phenotype by using gene expression profiles. As a measure of the correlation of a list, we used the prediction accuracy of classifiers trained to predict the phenotypic labels of subjects in two different experimental conditions from the raw expression levels of the genes belonging to the list, as recorded by the microarray. The analysis is not restricted to the differentially expressed genes (if any) in a list; on the contrary, all the genes belonging to a given BP contribute to determining the deregulation level of the process in the disease. This approach can be thought of as a natural extension of the one proposed by Maglietta et al. (2007), in which prediction accuracy is used for selecting relevant genes in microarray experiments, and it overcomes many of the limitations and drawbacks of current methods based on the comparison between lists of genes. For example, GSEA (Subramanian et al., 2005) evaluates the correlation with the phenotype of a list L, coding a given cellular mechanism, by comparing it with an ordered list S of genes. The ranking is done according to the values of a univariate statistic that measures the differential expression of genes in the two classes. Univariate statistics do not take into account the co-expression of different genes and cannot evaluate the connection between groups of genes and the phenotype. As a consequence, some information may be lost by not considering genes jointly. What matters in GSEA is the rank of the genes in S and the corresponding values of the statistic. Genes weakly connected with the phenotype will always appear in the central positions of the list S, without contributing significantly to the final score of L.
Our method, on the contrary, simultaneously uses the expression levels of all the genes in L to measure the correlation of L with the phenotype. The results shown in the tables and Supplementary Material also suggest that genes weakly connected with the phenotypic labels y contribute to the estimate of the correlation as well, because they significantly influence the prediction accuracy of the outcome. In fact, the error rate eL is always smaller than the mean error of the probes in the list, and in many cases lower than the minimum error Formula . We applied GSEA to the three data sets analyzed in this article. For the prostate (Singh et al., 2002) and stage II colon cancer (Barrier et al., 2006) data sets, GSEA did not find any deregulated BPs Formula . Applied to our colon cancer data set, GSEA found 205 differentially expressed BPs Formula . Comparing this list with the 666 BPs determined by our method after the T1 permutation test, we found 128 common BPs, showing that more than 60% of the BPs determined by GSEA are also detected by our method. Moreover, comparing the GSEA list with the one obtained by our method after the T2 permutation test, we found 10 common BPs, indicating that many of the BPs revealed by GSEA depend not on the identity of the genes in the list but on the size of the list.
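The comparison between the list error eL and the errors of single probes can be illustrated with a minimal sketch. It assumes a nearest-centroid classifier and leave-one-out cross-validation as simple stand-ins for the classifiers and error estimate used in the article (neither choice is taken from the text), and it runs on synthetic expression data. The function names loo_error, centroid and sq_dist are hypothetical.

```python
import random

def centroid(rows):
    """Component-wise mean of equal-length expression vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

def loo_error(X, y, genes):
    """Leave-one-out error rate of a nearest-centroid classifier that
    sees only the columns of X listed in `genes`.
    Assumes every class has at least two subjects."""
    sub = [[row[g] for g in genes] for row in X]
    mistakes = 0
    for i in range(len(sub)):
        cents = {}
        for label in sorted(set(y)):
            # Centroid of each class, computed without the held-out subject i.
            rows = [sub[j] for j in range(len(sub)) if j != i and y[j] == label]
            cents[label] = centroid(rows)
        pred = min(cents, key=lambda lab: sq_dist(sub[i], cents[lab]))
        mistakes += pred != y[i]
    return mistakes / len(sub)

# Synthetic data (an assumption for illustration): 40 subjects, 10 genes,
# each gene shifted by only half a standard deviation between the classes.
rng = random.Random(0)
n_genes = 10
y = [-1] * 20 + [1] * 20
X = [[rng.gauss(0.25 * label, 1.0) for _ in range(n_genes)] for label in y]

L = list(range(n_genes))
e_L = loo_error(X, y, L)                       # all genes used jointly
single = [loo_error(X, y, [g]) for g in L]     # one classifier per gene

print("e_L =", e_L)
print("mean single-gene error =", sum(single) / len(single))
print("min single-gene error =", min(single))
```

On real data, X would hold the expression levels of the probes and L the indices of the probes annotated to a given BP; whether the joint error falls below the single-gene errors depends on the data set.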

The idea of using prior biological information coded through lists of genes has been previously suggested for class finding (Redestig et al., 2006) and for class prediction (Lottaz and Spang, 2005), but without systematically addressing the problem of the statistical significance of the biological findings determined in the experimental conditions analyzed.

The problem of assessing whether the correlation of a given list L with the phenotype depends on the identity of the genes belonging to the list or only on the number of genes in L has been addressed in Tian et al. (2005). The authors highlighted that, when a significant proportion of genes is associated with the phenotype, a gene set will contain associated genes even if it is purely a random subset of the entire gene list. Although this issue is not new, the method presented in Tian et al. (2005) is completely different from the one we propose here. The authors used a univariate statistic (t-test) t for measuring the correlation of each gene with the phenotype and measured the correlation of L with y by summing the t-values of the genes belonging to L: $T = \frac{1}{n}\sum_{i \in L} t_i$, where n is the size of L. The rationale is that the higher the sum T, the stronger the correlation of L with y. The statistical significance of the list L was measured by permuting the t-values Formula and comparing the new sum Formula with T. This is equivalent to randomly drawing n genes from those available on the microarray and summing the corresponding t-values. Our approach does not use any univariate statistic, and the estimate of the correlation of L is not simply a sum of univariate statistics. As a measure of the correlation of a list L with the phenotype, we use an estimate of the generalization error eL, obtained by using the expression levels of all the genes belonging to L and present on the microarray. The statistical significance of L is assessed by randomly drawing genes from the microarray and using their expression levels to train classifiers that predict the actual phenotypic labels.
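The random-draw logic described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure of either paper: the one-sided p-value with a +1 correction is a common convention assumed here, the t-values are made-up numbers, and the names mean_t and random_list_pvalue are hypothetical.

```python
import random

def mean_t(t_values, genes):
    """List statistic T = (1/n) * sum of t_i over the genes in the list."""
    return sum(t_values[i] for i in genes) / len(genes)

def random_list_pvalue(statistic, n_genes, gene_list, n_draws=1000,
                       larger_is_better=True, seed=0):
    """Compare statistic(gene_list) with its value on random lists of equal size.

    Returns (observed, p), where p is the fraction of random draws that do
    at least as well as the observed list, with a +1 correction.
    """
    rng = random.Random(seed)
    observed = statistic(gene_list)
    hits = 0
    for _ in range(n_draws):
        # Draw len(gene_list) distinct genes from the whole array.
        draw = rng.sample(range(n_genes), len(gene_list))
        value = statistic(draw)
        if (value >= observed) if larger_is_better else (value <= observed):
            hits += 1
    return observed, (hits + 1) / (n_draws + 1)

# Toy example: 200 genes, the first 10 carry large t-values (made-up numbers).
toy_rng = random.Random(1)
t_values = [toy_rng.gauss(3.0, 1.0) if i < 10 else toy_rng.gauss(0.0, 1.0)
            for i in range(200)]
L = list(range(10))

T, p = random_list_pvalue(lambda g: mean_t(t_values, g), len(t_values), L)
print(f"T = {T:.2f}, p = {p:.4f}")
```

The same helper covers a classifier-based score: passing a function that returns an error estimate for a gene list, together with `larger_is_better=False`, compares the error of L with the errors obtained from random lists of the same size.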

These last considerations allow us to underline an important aspect of our approach. The statistic that we use for measuring the correlation between L and y is meaningful. It is not simply a number. It provides an estimate of the generalization error, i.e. the expected value of misclassified subjects, or the probability of making an error:


Formula

where fL is the predictor trained by using the expression levels of the genes in L. This is different from the correlation measures proposed in Subramanian et al. (2005) and Tian et al. (2005), which do not provide any estimate of prediction error. From this perspective, our approach overcomes the problem of gene selection and aims to evaluate predefined gene lists on the basis of their prediction accuracy. Finally, our method sheds light on the importance of having accurate and well-defined gene lists, and provides statistically significant experimental evidence of the effective co-expression of genes in particular cellular mechanisms through gene expression profiles.
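In standard notation, and assuming the usual 0-1 loss, the generalization error of the predictor fL and a leave-one-out estimate of it are commonly written as below. The symbols $\ell$ (number of subjects), $(x_i, y_i)$ (expression profile and label of subject $i$) and $f_L^{(-i)}$ (the predictor trained without subject $i$) are notation assumed here for illustration, and leave-one-out is only one possible choice of estimator.

```latex
e_L = \mathbb{E}_{(x,y)}\big[\mathbf{1}\{f_L(x) \neq y\}\big] = \Pr\big(f_L(x) \neq y\big),
\qquad
\hat{e}_L = \frac{1}{\ell}\sum_{i=1}^{\ell} \mathbf{1}\{f_L^{(-i)}(x_i) \neq y_i\}
```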


    ACKNOWLEDGEMENTS
 
R.M. is a PhD student at the Dipartimento Interateneo di Fisica, Bari, associated with the Istituto Nazionale di Fisica Nucleare, sez. di Bari, and with the Center of Innovative Technologies for Signal Detection and Processing (TIRES), Università degli Studi di Bari, Italy. This work was supported by grants from Regione Puglia, Progetto Strategico PS_012, and the Ministry of Health, N°: RC0402GA16.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Trey Ideker

1 http://www.affymetrix.com/analysis/index.affx

Received on April 5, 2007; revised on May 14, 2007; accepted on May 21, 2007

    REFERENCES
 

    Alon U, et al. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. (1999) 96:6745–6750.

    Anand GM, et al. Down-regulation of HLA-A expression correlates with a better prognosis in colorectal cancer patients. Lab. Invest (2002) 82:1725–1733.

    Ancona N, et al. Regularized least squares cancer classifiers from DNA microarray data. BMC Bioinformatics (2005) 6(Suppl. 4):S2.

    Ancona N, et al. On the statistical assessment of classifiers using DNA microarray data. BMC Bioinformatics (2006) 7:387.

    Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet (2000) 25:25–29.

    Barry WT, et al. Significance analysis of functional categories in gene expression studies: a structured permutation approach. Bioinformatics (2005) 21:1943–1949.

    Barrier A, et al. Stage II colon cancer prognosis prediction by tumor gene expression profiling. J. Clin. Oncol (2006) 24:4691.

    Bruce WR, et al. Mechanisms linking diet and colorectal cancer: the possible role of insulin resistance. Nutr. Cancer (2000a) 37:19–26.

    Bruce WR, et al. Possible mechanisms relating diet and risk of colon cancer. Cancer Epidemiol Biomarkers Prev (2000b) 9:1271–1279.

    Consolazio A. Overexpression of fatty acid synthase in ulcerative colitis. Am. J. Clin. Pathol (2006) 126:113–118.

    Cornelia MU, et al. Non-steroid anti-inflammatory drugs for cancer prevention: promise, perils and pharmacogenetics. Nat. Rev. Cancer (2006) 6:130–140.

    Evan GI, Vousden KH. Proliferation, cell cycle and apoptosis in cancer. Nature (2001) 411:342–348.

    Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999) 286:531–537.

    Good P. Permutation Tests: a Practical Guide to Resampling Methods for Testing Hypotheses. (1994) New York: Springer Verlag.

    Grady WM, Markowitz S. Genomic instability and colorectal cancer. Curr. Opin. Gastroenterol (2000) 16:62–67.

    Guyon I, et al. Gene selection for cancer classification using support vector machines. Mach. Learn (2002) 46:389–422.

    Kane MF, et al. Methylation of the hMLH1 promoter correlates with lack of expression of hMLH1 in sporadic colon tumors and mismatch repair-defective human tumor cell lines. Cancer Res (1997) 57:808–811.

    Kanehisa M, et al. The KEGG databases at GenomeNet. Nucleic Acids Res (2002) 30:42–46.

    Khatri P, et al. Profiling gene expression using onto-express. Genomics (2002) 79:266–270.

    Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics (2005) 21:3587–3595.

    Kinzler KW, Vogelstein B. Lessons from hereditary colorectal cancer. Cell (1996) 87:159–170.

    Kondo Y, Issa JP. Epigenetic changes in colorectal cancer. Cancer Metastasis Rev (2004) 23:29–39.

    Kumar R. Commentary: targeting colorectal cancer through molecular biology. Semin. Oncol (2005) 32(Suppl. 9):S37–S39.

    Lee S, et al. Aberrant CpG island hypermethylation of multiple genes in colorectal neoplasia. Lab. Invest (2004) 84:884–893.

    Lengauer C, et al. Genetic instabilities in human cancers. Nature (1998) 396:623–649.

    Li JN, et al. Sterol regulatory element-binding protein-1 participates in the regulation of fatty acid synthase expression in colorectal neoplasia. Exp. Cell Res (2000) 261:159–165.

    Lottaz C, Spang R. Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics (2005) 2(1):1978.

    Maglietta R, et al. Selection of relevant genes in cancer diagnosis based on their prediction accuracy. Artif. Intell. Med (2007) 40:29–44.

    Michiels S, et al. Predictor of cancer outcome with microarrays: a multiple random validation strategy. Lancet (2005) 365:488–492.

    Milner J, et al. Transcriptional activation functions in BRCA2. Nature (1997) 386:772–773.

    Myung SJ, et al. 15-Hydroxyprostaglandin dehydrogenase is an in vivo suppressor of colon tumorigenesis. Proc. Natl Acad. Sci. USA (2006) 103:12098–12102.

    Mukherjee S, et al. Estimating dataset size requirements for classifying DNA microarray data. J. Comput. Biol (2003) 10:119–142.

    Orr JA, Hamilton PW. Histone acetylation and chromatin pattern in cancer. A review. Anal. Quant. Cytol. Histol (2007) 1:17–31.

    Rajagopalan H, et al. Inactivation of hCDC4 can cause chromosomal instability. Nature (2004) 428:77–81.

    Redestig H, et al. Integrating functional knowledge during sample clustering for microarray data using unsupervised decision trees. Biom. J (2006) 48:1–16.

    Rifkin R, et al. Regularized least squares classification. In: Advances in Learning Theory: Methods, Model and Applications—Saykens, et al, eds. (2003) Amsterdam: IOS Press. 153. Vol.190, NATO Science Series III: Computer and Systems Sciences.

    Risch NJ. Searching for genetic determinants in the new millennium. Nature (2000) 405:847–856.

    Schena M, et al. Quantitative monitoring of gene-expression patterns with a complementary-DNA microarray. Science (1995) 270:467–470.

    Singh D, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell (2002) 1:203–209.

    Southey MC, et al. Use of molecular tumor characteristics to prioritize mismatch repair gene testing in early-onset colorectal cancer. J. Clin. Oncol (2005) 23:6524–6532.

    Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc. Natl Acad. Sci (2003) 100:9440–9445.

    Subramanian A, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci (2005) 102:15545–15550.

    Tian L, et al. Discovering statistically significant pathways in expression profiling studies. Proc. Natl. Acad. Sci (2005) 102:13544–13549.

    Ulrich CM, et al. Non-steroidal anti-inflammatory drugs for cancer prevention: promise, perils and pharmacogenetics. Nat. Rev. Cancer (2006) 6:130–140.

    Vapnik V. The Nature of Statistical Learning Theory. (1995) New York: Springer Verlag.

    Yan M, et al. 15-Hydroxyprostaglandin dehydrogenase, a COX-2 oncogene antagonist, is a TGF-beta-induced suppressor of human gastrointestinal cancers. Proc. Natl Acad. Sci. USA (2004) 101:17468–17473.

    Yarden RI, et al. BRCA1 regulates the G2/M checkpoint by activating Chk1 kinase upon DNA damage. Nat. Genet (2002) 30:265–269.

    Zhang XK. Vitamin A and apoptosis in prostate cancer. Endocr. Relat. Cancer (2002) 9:87–102.

