Skip Navigation


Bioinformatics Advance Access originally published online on June 2, 2005
Bioinformatics 2005 21(16):3385-3393; doi:10.1093/bioinformatics/bti526
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/16/3385    most recent
bti526v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (13)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Chu, W.
Right arrow Articles by Wild, D. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chu, W.
Right arrow Articles by Wild, D. L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Biomarker discovery in microarray gene expression data with Gaussian processes

Wei Chu 1, Zoubin Ghahramani 1, Francesco Falciani 2 and David L. Wild 3,*

1Gatsby Computational Neuroscience Unit, University College London UK
2School of Biosciences, University of Birmingham UK
3Keck Graduate Institute of Applied Life Sciences Claremont, CA 91711, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 

Motivation: In clinical practice, pathological phenotypes are often labelled with ordinal scales rather than binary, e.g. the Gleason grading system for tumour cell differentiation. However, in the literature of microarray analysis, these ordinal labels have been rarely treated in a principled way. This paper describes a gene selection algorithm based on Gaussian processes to discover consistent gene expression patterns associated with ordinal clinical phenotypes. The technique of automatic relevance determination is applied to represent the significance level of the genes in a Bayesian inference framework.

Results: The usefulness of the proposed algorithm for ordinal labels is demonstrated by the gene expression signature associated with the Gleason score for prostate cancer data. Our results demonstrate how multi-gene markers that may be initially developed with a diagnostic or prognostic application in mind are also useful as an investigative tool to reveal associations between specific molecular and cellular events and features of tumour physiology. Our algorithm can also be applied to microarray data with binary labels with results comparable to other methods in the literature.

Availability:The source code was written in ANSI C, which is accessible at www.gatsby.ucl.ac.uk/~chuwei/code/gpgenes.tar

Contact: wild{at}kgi.edu


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
Microarray technologies now enable the simultaneous interrogation of the expression level of thousands of genes to obtain a quantitative assessment of their differential activity in a given tissue or cell. The development of these technologies has also motivated interest of their use in clinical trials and diagnosis. For instance, a key aim of many investigators is to identify genomic factors that are prognostic for survival or relapse-free survival, and that predict those patients who respond to treatment. Typically, such experiments investigate on the order of dozens of samples from different patients. The samples are usually labelled with some information about the disease. Many studies have attempted to find subsets of genes that distinguish well between samples with different labels. A minimal subset of these relevant genes, often referred to as ‘biomarkers’, may be useful in segregating patients in diagnosis, prognosis and for appropriate therapeutic selection in clinical management.

The increasing use of gene expression profiles in these types of study requires computational methods of high accuracy for solving feature selection and classification problems associated with these data. Although the cases of binary labels, e.g. healthy/diseased, have been extensively studied in the literature (Alon et al., 1999; Furey et al., 2000; Golub et al., 1999; Guyon et al., 2002; Li et al., 2002; Shevade and Keerthi, 2003), the observed or measured labels are often ordinal in routine clinical practice, such as the TNM system for staging prostate cancer and the Gleason grading system for tumour cell differentiation. These ordinal scales are discrete and finite, differing from continuous variables, and metric distances between the adjacent ordinal scales are not defined. In contrast to the labels of multiple classes, ordinal scales are rank-ordered, e.g. ‘low’, ‘medium’ and ‘high’. The learning task of predicting ordinal variables is known as ordinal regression. Interestingly, the popular binary label is a special case of the ordinal variable with only two ranks. Singh et al. (2002) studied gene expression patterns that are correlated with the Gleason score and built an expression-based model to predict patients’ clinical outcome. However, the ordinal nature of the Gleason score has not previously been treated in a principled way.

In this paper, we propose a feature selection algorithm based on Gaussian processes (Williams and Barber, 1998) to identify biomarkers for tasks with ordinal (or binary) labels. The important advantage of Gaussian process models is the explicit probabilistic framework that can efficiently take into account the uncertainty in microarray data. The automatic relevance determination (ARD) parameters1 can be embedded into the covariance function, which represents the correlation between samples, to control the contribution from individual features. After Bayesian inference, the optimal values of the ARD parameters can be used as the indicator of the relevance level of a particular gene. A relatively large ARD parameter indicates that the associated gene is more correlated with the sample labels, while a gene weighted with a very small ARD parameter implies that this gene is irrelevant. Genes can then be sorted downwards from relevant to irrelevant according to the optimal values of these ARD parameters. A forward selection procedure can be further employed to determine the minimal set of relevant genes as biomarkers. We apply this ARD technique to the publicly available microarray gene expression datasets. The usefulness of these biomarkers are validated by reference to the biological literature.

The paper is organized as follows. In Section 2, we describe the Gaussian processes model for ordinal regression and then present our algorithm in detail. The experimental results on three publicly accessible datasets are reported and discussed in Section 3. We conclude in Section 4.


    2 METHODOLOGY
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
Consider a gene expression dataset composed of n samples from different patients. Each sample is represented by the expression level of the d genes, denoted as a column vector , and labelled by an ordinal scale . These labels are denoted as consecutive integers, , that keep the known ordering information.

2.1 Bayesian framework
The main idea is to assume an unobservable latent function associated with a sample xi in a Gaussian process, and the label yi dependent on the latent function f(xi) by modelling the ordinal scales as intervals on the real line (Chu and Ghahramani, 2004 www.gatsby.ucl.ac.uk/~chuwei/paper/gpor.pdf).

2.1.1 Prior probability
The values of the latent function {f(xi)} are assumed to be the realizations of random variables in a zero-mean Gaussian process. The covariance between the function values corresponding to the inputs xi and xj can be defined as

(1)
where {kappa}{ell} > 0. denotes the {ell}-th gene expression level of the i-th sample and {kappa}{ell} is the ARD variable for the {ell}-th gene that controls the contribution of this gene in the modelling. For simplicity, we have chosen the covariance (1) that corresponds to a prior on functions, where f(x) is a linear function of x. Many other covariance functions could be used (MacKay, 1998). The prior probability of these latent function values {f(xi)} is a multivariate Gaussian,

(2)
where f = [f(x1), f(x2),...,f(xn)]T and {Sigma} is the n x n covariance matrix whose ij-th element is defined as in Equation (1). This covariance matrix is positive semi-definite.

2.1.2 Ordinal likelihood
The likelihood is the joint probability of observing the sample labels given the the latent function values. The likelihood can be evaluated as a product of the likelihood function on individual observations:

A standard likelihood function for ordinal labels is obtained from the difference of two cumulative normals,

(3)
where , and . The noise level {sigma} > 0 is unknown and reflects the measurement noise in the microarray experiments. b0 = –{infty} and br = +{infty} are defined as auxiliary variables, and we impose the inequality b1 < b2 < ... < br–1 on these thresholds. The role of the thresholds is to divide the real line into r contiguous intervals, these intervals map the real function value f(xi) into the discrete variable yi while enforcing the ordinal constraints. As a special case with r = 2, the ordinal likelihood function (3) becomes the probit function for binary classification.2

2.1.3 Model evidence
The Bayesian framework described above is conditional on the model parameters including the ARD parameters {kappa}{ell} in the covariance function (1), the threshold parameters {b1,b2,...,br–1} and the noise level {sigma} in the likelihood function (3). All these parameters can be collected into {theta}, which is the model parameter vector. The quantity , more exactly , is known as the evidence for {theta}, a yardstick for model selection. The optimal values of the model parameters {theta} can be inferred by maximizing theevidence .3

A popular idea for computing the evidence is to approximate the posterior distribution as a Gaussian by applying the Laplace approximation at the maximum a posteriori (MAP) estimate of f, and then the evidence can be calculated by an explicit formula. The MAP estimate on the latent functions is the mode point of the posterior distribution, i.e. fMAP = arg maxf . This is a convex programming problem that guarantees a unique solution. The Laplace approximation refers to carrying out the Taylor expansion for at the MAP point and retaining the terms up to the second order (MacKay, 1994). The evidence can then be approximated as an explicit expression analytically:

(4)
where , I is an n x n identity matrix and {Lambda}MAP is a diagonal matrix whose ii-th entry is at the MAP estimate.

The gradients of the approximate evidence (4) with respect to the model parameters {theta} can be derived analytically [refer to Chu and Ghahramani (2004) for detailed formulae]. Gradient-based optimization methods can then be employed to search for the maximizer of the evidence, {theta}* = arg max{theta} . Since there might be several local maxima on the curve of , it is possible that the optimization problem may stick at local maxima in the determination of {theta}. We can avoid poor local maxima by maximizing (4) several times starting from several different initial states, and simply choosing the one with the highest evidence as our preferred choice {theta}*.

2.2 Prediction
At the optimal model parameters {theta}*, let us take a test sample xt for which the target yt is unknown. The correlations between the test case {xt} and the training samples xi are defined by the covariance function as in Equation (1). The predictive distribution over ordinal labels yt is

(5)
where µt = kT {Sigma}–1 fMAP, and . The predictive label is decided as

(6)

2.3 Forward selection
The optimal values of {kappa}{ell}'s can be determined by the maximizer of the evidence {theta}*, denoted as , which indicates the relevance level of the genes to the labels. Based on these values, , we can sort the genes in descending order from relevant to irrelevant accordingly.

It is desirable to further select a minimal subset of the top-ranked genes as the biomarkers for modelling, denoted as , while keeping the accuracy of the resulting model and reducing the computational overhead. For this purpose, we need to define a quality criterion for the quality of a particular biomarker set. The leave-one-out (LOO) validation error is popularlyused4 which is evaluated as

(7)
where {sum}t means the sum over all LOO validation cases, is defined as in Equation (6) and {delta}(s) is 1 if s is true, otherwise 0.

We can carry out LOO validation on a progressively larger biomarker set, adding one gene at a time as ordered by the gene ranking. Here a linear covariance function, defined as without ARD parameters, is employed in the Gaussian process modelling. The inclusion of a relevant gene should result in a decrease of the LOO error criterion (7). The gene set that yields the minimal LOO error is identified as the set of biomarkers that contain the most informative genes for predicting target labels.

2.4 Algorithm
The optimal values of ARD parameters are estimated by maximizing the approximate evidence, which is also known as Type-II maximum-likelihood estimate. Qi et al. (2004) have shown that the evidence optimization can lead to overfitting by picking one from numerous linear classifiers that can correctly classify the limited training data. This potential difficulty becomes more serious on gene expression datasets with only dozens of samples. To address this problem, we propose a resampling procedure as the outer loop of our algorithm. The outline of our algorithm is given in Table 1. We found this algorithm to be robust both to overfitting and local minima problems.


View this table:
[in this window]
[in a new window]
 
Table 1 The outline of our algorithm for gene selection

 
Given a gene expression dataset, we randomly generated k folds after pre-processing. One fold was left out in turn and evidence optimization was carried out using the samples in the remaining k – 1 folds. We maximized the evidence (4) several times starting from different initial states, and simply chose the one with the highest evidence as the optimal {theta}*. Based on the optimal values of the ARD parameters, the genes were sorted in the descending order from relevant to irrelevant, accordingly. In forward selection, we added one top-ranked gene each time into the gene subset and then carried out LOO cross-validation using the linear covariance function on the training samples in the remaining k 1 folds. The minimal subset that yielded the minimal LOO error was identified as . This procedure was repeated k times, and k subsets were obtained. The number of times each gene was selected in the k subsets was used as the final criterion for gene ranking, which we refer to as number of hits. Genes with same number of hits are further ranked by the average ARD values. We carried out forward selection again based on the final gene rank to identify the minimal subset of relevant genes .


    3 RESULTS AND DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
Three publicly accessible gene expression datasets, related to colon, leukaemia and prostate cancer, were analysed using our algorithm. In all the cases, the expression levels of each sample were first normalized to zero-mean and unit variance and then the expression levels of each gene were again normalized to zero-mean and unit variance over all the samples. We tackled two kinds of tasks with our algorithm, i.e. normal versus tumour (binary classification) and Gleason score prediction (ordinal regression).

3.1 Normal versus tumour
Many popular gene ranking methods employ the t-statistic as a criterion to measure the variance of the expression levels in different classes for each gene (Alon et al., 1999; Furey et al., 2000). Variants of the t-statistic, such as the measure of correlation proposed by Golub et al. (1999) and Fisher's discriminant criterion adapted by Pavlidis et al. (2001) have also been extensively applied. The t-statistic-like methods make the assumption that the data are described by a Gaussian distribution. However, according to Deng et al. (2004) and others, the normality condition often cannot be met in real gene expression datasets with very limited samples. Non-parametric tests, e.g. the Wilcoxon rank sum test, are superior to the t-test in this case.

As a pre-processing step, we used the Wilcoxon rank sum test on the normalized expression data to remove the most uninformative genes. The significance level was fixed at P = 0.01, and the P-values were calculated using all the samples.5 We then generated 10 folds of the whole dataset for the resampling step in Table 1. The detailed results on these three datasets are reported in the following.

The colon cancer dataset, originally analysed by Alon et al. (1999) contains expression levels of d = 2000 genes from 40 tumour and 22 normal colon tissues (http://microarray.princeton.edu/oncology/affydata/index.html). There are 373 genes significantly differentially expressed in the rank sum test at the significance level of P = 0.01.

The leukaemia dataset, originally studied by Golub et al. (1999) (http://www.genome.wi.mit.edu/MPR), contains expression values of d = 7129 genes from 47 samples of acute myeloid leukaemia (AML) and 25 samples of acute lymphoblastic leukaemia (ALL). There are 1169 genes significantly differentially expressed at the significance level of P = 0.01.

Singh et al. (2002) carried out microarray expression analysis on 12 600 genes to identify genes that are correlated with the distinction of prostate tumour from the normal (the dataset is available at http://www.genome.wi.mit.edu/MPR/prostate). Fifty-two samples of prostate tumour and fifty samples of normal cells were investigated. There are 2717 genes significantly differentially expressed at the significance level P = 0.01.

The left part of Figure 1 represents the results of the LOO error for the 50 top-ranked genes sorted by the number of hits of our algorithm, along with those for the genes ranked by the P-values of the rank sum test in the right part. A lower LOO error can be achieved using the gene rank of our algorithm, although this may involve the use of more genes than when using the P-value rankings on the colon and leukaemia data.



View larger version (34K):
[in this window]
[in a new window]
 
Fig. 1 The leave-one-out error using the top-ranked genes of the three datasets. The top-ranked 50 genes are progressively used in the modelling and the corresponding LOO error (7) are shown as circles. In the left-hand figures the genes are ranked by the number of hits of our algorithm, while in the right-hand figures the genes are ranked by their P-values of the Wilcoxon rank sum test. The closed circles indicate the set of selected genes with minimal LOO error.

 
The selected genes are listed in Tables 24 with more descriptions. From Table 2, we found that all the eight genes selected by Shevade and Keerthi (2003) and six of the seven genes selected by Guyon et al. (2002) are also in our list. In Table 3, eight of the nine genes selected by Shevade and Keerthi (2003) are also selected by our algorithm. The six genes in boldface were also identified by Golub et al. (1999) as being part of their 50 gene signature that distinguished AML from ALL. The gene selections are further visualized in Figure 2 by representing the covariance matrices in grey scale. The covariance matrices turn out to be clearly blocked using the selected genes. The samples in the same class are generally positively correlated, whereas the samples in different classes are negatively correlated.


View this table:
[in this window]
[in a new window]
 
Table 2 The selected 26 genes in colon cancer data

 

View this table:
[in this window]
[in a new window]
 
Table 4 The selected 13 genes in the prostate cancer data

 

View this table:
[in this window]
[in a new window]
 
Table 3 The selected 14 genes in the leukaemia data

 


View larger version (136K):
[in this window]
[in a new window]
 
Fig. 2 The covariance matrices for the binary classification tasks. The covariance matrix is the n x n covariance matrix whose ij-th elements are defined by the linear covariance function . In the left-hand figures the covariance matrices were evaluated over all the original genes, whereas in the right-hand figures the covariance matrices were evaluated over the genes selected by our algorithm. The samples have been grouped by their labels. The pairs in rows from top to bottom are for the colon, leukaemia and prostate datasets, accordingly. Arrows are used to indicate the range of the blocks.

 
To estimate the predictive accuracy of our algorithm, we report in Table 5 the test error rates of a 10-fold cross-validation experiment. One fold was left out for test in turn, and a Gaussian process model was trained on the remaining nine folds using a gene subset selected by the rank sum test or the proposed algorithm separately. Note that the gene selection was carried out by using the samples in the nine training folds only, and then tested on the unused fold. We observed that the validation results using hundreds of genes selected by rank sum test are always better than those using all the original genes. The improvement is especially significant on the leukaemia dataset. Our algorithm can further reduce the number of selected genes to <50, and yields competitive performance on the colon and leukaemia datasets and much better results on the prostate dataset.


View this table:
[in this window]
[in a new window]
 
Table 5 Test error rates in the 10-fold cross-validation experiments

 
3.2 Gleason score prediction
The Gleason score is based exclusively on the architectural pattern of the glands of the prostate tumour. It evaluates how effectively the cells of any particular cancer are able to structure themselves into glands resembling those of the normal prostate. The ability of a tumour to mimic a normal gland architecture is called its differentiation. The Gleason grading from very well differentiated (grade 1) to very poorly differentiated (grade 5) is usually done for most parts by viewing a low magnification microscopic image of the tumour. There are two types of Gleason scores, type I and type II, both of which have five scales. Hereafter, Gleason score refers to the sum of the grades of the two types.

Singh et al. (2002) investigated 52 samples of prostate tumour to identify a subset of the 12 600 genes correlated with pathological features. For each sample, the Gleason score given by the pathologist ranges from 6 to 10. Singh et al. (2002) treated the Gleason scores as continuous variables in their analysis. We argue that the Gleason score are ordinal variables in nature rather than continuous variables, as the grades are ordered as ranks, and the metric distances between the adjacent grades are not defined. Predicting the Gleason score from the gene expression data is thus a typical ordinal regression problem. In our experiments, as only six samples had a score >7, we merged them as the top level, leading to three levels {=6,=7,≥8} with 26, 20 and 6 samples, respectively. We generated six folds in the resampling procedure, and presented the quality criteria for the top 50 genes ranked by the number of hits as given in Figure 3a. The minimal LOO error number was observed when the top 21 genes were used. The selected 21 genes are listed in Table 6 with detailed descriptions. We further visualized the selected genes, as given in Figure 4, by presenting the covariance matrices in grey scale. We observed three clearly blocked regions for the three ordinal scales in the covariance matrices using the selected genes. Moreover, the samples of the level 6 are strongly negatively correlated to the samples of level ≥8.



View larger version (17K):
[in this window]
[in a new window]
 
Fig. 3 The leave-one-out error for the task of predicting the Gleason score, using the top-ranked gene sets of the prostate dataset. The top-ranked genes are progressively used in the modelling and the corresponding LOO error numbers are presented in the graphs (a) and (b), respectively.

 

View this table:
[in this window]
[in a new window]
 
Table 6 The selected 21 genes in the prostate cancer data for predicting the Gleason score

 


View larger version (60K):
[in this window]
[in a new window]
 
Fig. 4 The covariance matrices for the task of predicting Gleason score. As in Figure 2, the left graph represents the covariance matrix evaluated over all the original genes, while the right graph represents the matrix evaluated over the selected genes by our proposed algorithm. The samples have been grouped by their ordinal scales, and arrows are used to indicate the range of the blocks.

 
Cuzick's test is a Wilcoxon-like test for trend across ordered groups (Lehmann, 1998). The informative genes can be selected based on the P-values of the Cuzick test. The LOO error numbers using the 100 top-ranked genes are presented in Figure 3b. When more than 80 genes are used in modelling, the LOO error becomes smaller than that obtained using the top-ranked gene only. A much lower LOO error was obtained by our algorithm using the top 21 ranked genes. We also tried the Kruskal–Wallis rank sum test, which is designed for the case of multiple categories (Lehmann, 1998). Since this test is insensitive to the ordering information among the ordinal scales, the LOO errors are always greater than that using the first top-ranked gene. This observation also implies that multi-classification methods should not be generally applied to tackle ordinal regression problems.

3.3 Discussion
The models we have developed to discriminate between normal and tumour tissues (prostate and colon cancer datasets) and between AML and ALL are very promising and reflect to some degree about what is known of the biology of these systems. A representative case is hepsin in Table 4 (a gene selected in the signature discriminating between normal and tumour prostate samples). Hepsin is a cell surface serine protease that is known to be markedly upregulated in human prostate cancer. Overexpression of hepsin in a mouse model of non-metastasizing prostate cancer has no impact on cell proliferation, but causes disorganization of the basement membrane and promotes primary prostate cancer progression and metastasis to liver, lung and bone (Klezovitch et al., 2004).

Of particular interest are the models linking the degree of differentiation of prostate tumour (Gleason score) to the molecular state of tumour cells. In their original attempt Singh et al. (2002) have identified genes whose expression was correlated to this pathological variable. There are two major limitations in their approach. First, genes are selected individually rather than in combination. The second limitation is that the Gleason score is not a continuous variable but a categorical one, as mentioned earlier.

Our approach significantly improves on the previous study by providing a statistical model representing the Gleason score as an ordered categorical variable. The molecular signature we have developed is robust and has good explanatory power. Although the signature we have identified does not include any of the genes originally selected by Singh et al. (2002) we have observed some degree of functional overlap. Both signatures, in fact, include genes involved in insulin response [IGF1 in our model, i.e. no. 3 in Table 6, and Insulin-like growth factor binding protein 3 in the model developed by Singh et al. (2002)] and contain members of the complement component pathway (complement component 2 in the original analysis and complement component 7 in our model, i.e. no. 21 in Table 6). Interestingly, the large majority of the genes in the models we have developed to explain the degree of differentiation of the tumour are known to be associated with tumour physiology or are related to molecular functions that are highly informative of the molecular events underlying the pathology. Table 6 shows a functional classification of the selected genes. The most striking feature of our model is that seven genes are either tumour suppressor genes or oncogenes and therefore are known to be directly involved in the neoplastic process. Our signature contains five genes with tumour-suppressor activity. Of these, three have a demonstrated function in prostate cancer. The expression of the lysyl oxidase-like protein (LLP) (no. 13 in Table 6) gene has been reported to be progressively lost in primary prostate cancer and associated metastatic lesions (Ren et al., 1998) and is inactivated by methylation and loss-of-heterozygosity in human gastric cancers (Kaneda et al., 2004). These observations are strongly supportive of the role of LLP as a tumour suppressor gene in solid tumours. The expression of IGF1 is also decreased in human prostate cancer. A clear tumour-suppressive activity in prostate cancer has been demonstrated through an apoptotic mechanism (Mutaguchi et al., 2003). Another gene selected in our model with a demonstrated tumour suppressive activity in prostate cancer is the inducible cAMP early repressor (CREM/ICER, no. 11). This gene is an important mediator of cAMP antiproliferative activity that specifically affects the tumorigenicity of prostate cancer cells without affecting their growth (Memin et al., 2002). Phosphatidylethanolamine N-methyltransferase (PEMT, no. 9) is an enzyme in liver that catalyses the stepwise methylation of phosphatidylethanolamine to phosphatidylcholine. PEMT protein decreased in pre-neoplastic nodules and virtually disappeared in hepatocellular carcinoma induced by aflatoxin B1. Transfection experiments demonstrated that the loss of PEMT function may contribute to malignant transformation of hepatocytes (Tessitore et al., 2000). This enzyme is expressed at similar levels in liver and prostate cells (estimated by looking at the frequency of expressed sequence tags in the Unigene database) and, therefore, it is reasonable to hypothesize that a similar role may be shared in these different organs. Last of the tumour suppressor genes included in the model is the RbAp48 gene (no. 6). The protein encoded by this gene has been demonstrated to mediate the retinoblastoma protein tumour suppressor activity (Qian et al., 1993). RbA would, in fact, be a component of the histone deacetylase complex that is associated with the retinoblastoma protein (Nicolas et al., 2000). Two genes encoding proteins with oncogenic activity have also been selected in the model. These are ERT (no. 14) and RET(no. 1). The proteins of the ETS family are transcription factors involved in signal transduction, cell cycle progression and differentiation. It has been demonstrated that cell neoplastic transformation is associated with a dramatic increase in ETS transcriptional activity (de Nigris et al., 2001). The RET proto-oncogene encodes a protein that belongs to the tyrosine kinase growth factor receptor family. The RET proto-oncogene is expressed in human prostate cancer xenografts and prostate cancer cell lines (Dawson et al., 1998).

Angiogenesis is another important process in the development of the tumour and it is represented in our model by two genes. These are vascular endothelial cell growth factor (VEGF) (no. 19) and one of its main regulators, the gene encoding for tissue factor (TF) (no. 20). VEGF is the only mitogen that specifically acts on endothelial cells and its function is key to the development of tumour angiogenesis in vivo (Affara and Robertson, 2004). TF, when produced by tumour cells, has been implicated in the regulation of new blood vessels formation through its ability to concurrently induce the expression of angiogenic molecules such as VEGF, while inhibiting the expression of anti-angiogenic molecules. The expression of TF has been directly linked to vascularization in prostate cancer (Abdulkadir et al., 2000). Another molecular function represented in our model and with great relevance in tumour physiology is the ability to develop drug resistance. MRP5/MOAT-C (represented twice in the model we have developed, i.e. no. 16 and no. 12 in Table 6) is a drug resistant gene that has been implicated in the transport of cyclic nucleotides from cultured cells or isolated tissues (Wielinga et al., 2003).

Our model representing the degree of tumour differentiation is particularly interesting since most of the genes are directly linked to the molecular events underlying tumour progression (tumour suppressor genes, oncogenes and vascularization markers) or are related to cellular function relevant to cancer physiology (motility and secretion). The function of genes represented in our models suggests that the ability of tumour cells to aggregate into glandular structures may be correlated to the regulation of proliferation and survival. Of interest is also the link between vascularization and the degree of tumour differentiation. This link is strongly supported by our model (both VEGF and one of its activators have been selected). Ultimately, the ability to develop resistance to anticancer drugs could also be linked to the degree of differentiation of the tumour. Our results demonstrate how the multi-gene markers that may be initially developed with a diagnostic or prognostic application in mind are also useful as an investigative tool to reveal associations between specific molecular and cellular events, and features of tumour physiology.


    4 CONCLUSIONS
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 
We have presented a feature selection algorithm based on Gaussian processes for biomarker discovery associated with ordinal (including binary) clinical phenotypes. This algorithm is clearly superior to the simple ranking method using the rank sum test. Our results on the three microarray datasets are very promising and supported by existing biological knowledge. Moreover, our algorithm can be directly applied for biomarker discovery in large-scale proteomics and metabolomics datasets and this would be a focus of our future work.


    Acknowledgments
 
We would like to thank the Institute for Pure and Applied Mathematics (IPAM) at UCLA, where part of this work was carried out. W.C. would like to thank S. Sathiya Keerthi for many discussions. W.C., Z.G. and D.L.W. would like to acknowledge the support from NIH (Grant Number 1P01GM63208).

Conflict of Interest: none declared.


    Footnotes
 
1The techniques of ARD were originally proposed by MacKay (1994) and Neal (1996) in the context of Bayesian neural networks as a hierarchical prior over the weights. Back

2For multi-label classification problems, the softmax function can be employed as the likelihood function for multinomial labels, as discussed by Williams and Barber (1998). Back

3Monte Carlo sampling methods can provide a good approximation to the posterior distribution of {theta}, but might be prohibitively expensive to use for high-dimensional problems. Back

4For the microarray datasets with dozens of samples, this might result in the same LOO error for multiple biomarker sets, which makes it difficult to discern the difference in performance. It is acceptable to employ other quality criteria such as the predictive probability of misclassifications in LOO, which is defined as , where is computed as in Equation (5) and is defined as in Equation (6). Back

5Since we are using the ranking by P-value as a pre-processing step, it was unnecessary for us to apply any correction for multiple testing or false discovery rate. Back

Received on April 21, 2005; revised on May 19, 2005; accepted on May 31, 2005

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 METHODOLOGY
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 

    Abdulkadir, S.A., et al. (2000) Tissue factor expression and angiogenesis in human prostate carcinoma. Hum. Pathol., 31, 443–447[CrossRef][Web of Science][Medline].

    Affara, N.I. and Robertson, F.M. (2004) Vascular endothelial growth factor as a survival factor in tumor-associated angiogenesis. In Vivo, 18, 525–542[Abstract/Free Full Text].

    Alon, U., et al. (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl Acad. Sci. USA, 96, 6745–6750[Abstract/Free Full Text].

    Technical report Chu, W. and Ghahramani, Z. (2004) Gaussian processes for ordinal regression. , UK Gatsby Unit, University College London.

    Dawson, D.M., et al. (1998) Altered expression of RET proto-oncogene product in prostatic intraepithelial neoplasia and prostate cancer. J. Natl Cancer Inst., 90, 519–523[Abstract/Free Full Text].

    de Nigris, F., et al. (2001) Induction of ETS-1 and ETS-2 transcription factors is required for thyroid cell transformation. Cancer Res., 61, 2267–2275[Abstract/Free Full Text].

    Deng, L., Pei, J., Ma, J., Lee, D.L. (2004) A rank sum test method for informative gene discovery. Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD'04) , Seattle, Washington, USA ACM Press, pp. 410–419.

    Furey, T.S., et al. (2000) Support vector machine classification and validation of cancer tissue samples using microarray expression data. Bioinformatics, 16, 906–914[Abstract/Free Full Text].

    Golub, T., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537[Abstract/Free Full Text].

    Guyon, I., et al. (2002) Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422[CrossRef][Web of Science].

    Kaneda, A., et al. (2004) Lysyl oxidase is a tumor suppressor gene inactivated by methylation and loss of heterozygosity in human gastric cancers. Cancer Res., 64, 6410–6415[Abstract/Free Full Text].

    Klezovitch, O., et al. (2004) Hepsin promotes prostate cancer progression and metastasis. Cancer Cell, 6, 185–195[CrossRef][Web of Science][Medline].

    Lehmann, E.L. Nonparametrics: Statistical Methods Based on Ranks, (1998) rev. 1st edn Prentice Hall.

    Li, Y., et al. (2002) Bayesian automatic relevance determination algorithms for classifying gene expression data. Bioinformatics, 18, 1332–1339[Abstract/Free Full Text].

    MacKay, D.J.C. (1994) Bayesian methods for backpropagation networks. In van Hemmen, J.L., Damany, E., Schulten, K. (Eds.). Proceedings of the Models of Neural Networks III, , New York Chapter 6. Springer-Verlag, pp. 211–254.

    MacKay, D.J.C. (1998) Introduction to gaussian processes. In Bishop, C.M. (Ed.). Neural Networks and Machine Learning, , Berlin, Heidelberg Volume 168 of F: Computer and Systems Sciences NATO Advanced Study Institute, Springer-Verlag, pp. 133–165.

    Memin, E., et al. (2002) ICER reverses tumorigenesis of rat prostate tumor cells without affecting cell growth. Prostate, 53, 225–231[CrossRef][Web of Science][Medline].

    Mutaguchi, K., et al. (2003) Restoration of insulin-like growth factor binding protein-related protein 1 has a tumor-suppressive activity through induction of apoptosis in human prostate cancer. Cancer Res., 63, 7717–7723[Abstract/Free Full Text].

    Neal, R.M. Bayesian Learning for Neural Networks, (1996) , New York Lecture Notes in Statistics. No. 118 Springer-Verlag.

    Nicolas, E., et al. (2000) RbAp48 belongs to the histone deacetylase complex that associates with the retinoblastoma protein. J. Biol. Chem., 275, 9797–9804[Abstract/Free Full Text].

    Pavlidis, P., Weston, J., Cai, J., Grundy, W.N. (2001) Gene functional classification from heterogeneous data. Proceedings of the Fifth International Conference on Computational Molecular Biology , New York, NY, USA ACM Press, pp. 249–255.

    Qi, Y., Minka, T.P., Picard, R.W., Ghahramani, Z. (2004) Predictive automatic relevance determination by expectation propagation. Proceedings of the Twenty-first International Conference on Machine Learning , pp. 671–678.

    Qian, Y.-W., et al. (1993) A retinoblastoma-binding protein related to a negative regulator of Ras in yeast. Nature, 364, 648–652[CrossRef][Medline].

    Ren, C., et al. (1998) Reduced lysyl oxidase messenger RNA levels in experimental and human prostate cancer. Cancer Res., 58, 1285–1290[Abstract/Free Full Text].

    Shevade, S.K. and Keerthi, S.S. (2003) A simple and efficient algorithm for gene selection using sparse logistic regression. Bioinformatics, 19, 2246–2253[Abstract/Free Full Text].

    Singh, D., et al. (2002) Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1, 203–209[CrossRef][Web of Science][Medline].

    Tessitore, L., et al. (2000) Inactivation of phosphatidylethanolamine N- methyltransferase-2 in aflatoxin-induced liver cancer and partial reversion of the neoplastic phenotype by PEMT transfection of hepatoma cells. Int. J. Cancer, 86, 362–367[CrossRef][Medline].

    Wielinga, P.R., et al. (2003) Characterization of the MRP4- and MRP5-mediated transport of cyclic nucleotides from intact cells. J. Biol. Chem., 278, 17664–17671[Abstract/Free Full Text].

    Williams, C.K.I. and Barber, D. (1998) Bayesian classification with Gaussian processes. IEEE Trans. Pattern Anal. Machine Intel., 20, 1342–1351[CrossRef].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
O. Stegle, L. Payet, J.-L. Mergny, D. J. C. MacKay, and J. L. Huppert
Predicting and understanding the stability of G-quadruplexes
Bioinformatics, June 15, 2009; 25(12): i374 - i1382.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
P. Sykacek, R. Clarkson, C. Print, R. Furlong, and G. Micklem
Bayesian modelling of shared gene function
Bioinformatics, August 1, 2007; 23(15): 1936 - 1944.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
G. C. Cawley and N. L. C. Talbot
Gene selection in cancer classification using sparse logistic regression with Bayesian regularization
Bioinformatics, October 1, 2006; 22(19): 2348 - 2355.
[Abstract] [Full Text] [PDF]


Home page
Mol Biol EvolHome page
V. Gowri-Shankar and M. Rattray
On the Correlation Between Composition and Site-Specific Evolutionary Rate: Implications for Phylogenetic Inference
Mol. Biol. Evol., February 1, 2006; 23(2): 352 - 364.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/16/3385    most recent
bti526v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (13)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Chu, W.
Right arrow Articles by Wild, D. L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chu, W.
Right arrow Articles by Wild, D. L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?