Bioinformatics Advance Access originally published online on November 30, 2006
Bioinformatics 2007 23(3):390-391; doi:10.1093/bioinformatics/btl602
Prophet, a web-based tool for class prediction using microarray data
1 Department of Bioinformatics, Centro de Investigación Príncipe Felipe (CIPF), Valencia, E46013, Spain
2 Functional Genomics Node (INB), Centro de Investigación Príncipe Felipe (CIPF), Valencia, E46013, Spain
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Sample classification and class prediction are the aims of many gene expression studies. We present a web-based application, Prophet, which builds prediction rules and allows them to be used for further sample classification. Prophet automatically chooses the best classifier, along with the optimal selection of genes, using a strategy that yields unbiased cross-validated errors. Prophet is linked to different microarray data analysis modules and includes a unique feature: the possibility of performing a functional interpretation of the molecular signature found.
Availability: Prophet can be found at the URL http://prophet.bioinfo.cipf.es/ or within the GEPAS package at http://www.gepas.org/
Contact: jdopazo{at}cipf.es
Supplementary information: http://gepas.bioinfo.cipf.es/tutorial/prophet.html
| BACKGROUND |
|---|
One of the crucial factors behind the success of DNA microarray technologies has been their application to the definition of predictors of clinical outcomes (van 't Veer et al., 2002). Although not free from criticism (Simon, 2005), the practical implications of this particular goal have definitely fuelled the use of microarrays. Common errors in the early proposals of predictors, such as selection bias (Ambroise and McLachlan, 2002; Simon et al., 2003), which causes unrealistic, downward-biased error estimates, are behind the above-mentioned criticisms. Recently, proper strategies for unbiased cross-validation have been proposed. The estimation of the classification error must take into account the gene selection step as well as any other parallel step taken, such as the optimization of the number of selected genes or the selection among various classifiers. However, it is still common to find publications in which this important fact has not been taken into account (Ambroise and McLachlan, 2002). At the root of this widespread conceptual error is, probably, the lack of easy-to-use, accurate and freely available solutions that allow end users to carry out such analyses.
Prophet aims to meet the demand for a simple but powerful tool for prediction purposes in the microarray context. Since web-based solutions are gaining acceptance in the microarray community for data analysis purposes (see for example: http://bioinformatics.ubc.ca/resources/links_directory/index.php?subcategory_id=101), Prophet was conceived to be accessible over the web. To our knowledge, the only other equivalent web-based tool available is M@CBETH (Pochet et al., 2005). However, this program can only handle two-class problems. Moreover, since PCA is used to reduce the dimension of the data, the identity of the genes is lost and only the principal components can be retrieved from the predictor. Finally, only support vector machines (SVMs) and Fisher's discriminant analysis can be used as classification algorithms.
| BUILDING THE PREDICTOR AND PREDICTING |
|---|
Prophet has two main options: train and predict. The first (corresponding to the training step) is used to build the predictor, while the second applies the predictor found to predict class membership for new samples.
Prophet builds a prediction rule based on genes. There are several options for defining the dataset of genes to be used for training the predictors. Prophet accepts user-defined selections of genes or, alternatively, it can find the optimal subset within the whole set of genes. For the second option, also known as the filter approach in the machine learning literature, Prophet pre-selects the genes which will potentially provide more accuracy to the predictor. Two ways of ranking genes for subsequent selection have been implemented: the F-ratio (Dudoit et al., 2002) and the Wilcoxon statistic, a non-parametric test for differences between two classes. These can be used in combination with any of the class-prediction algorithms implemented in Prophet, which have been shown to perform very efficiently with microarray data (Dudoit et al., 2002; Romualdi et al., 2003; Wessels et al., 2005). The methods are: SVM (Vapnik, 1999), k-nearest neighbor (KNN), diagonal linear discriminant analysis (DLDA), SOM (Kohonen, 1997) and shrunken centroids (PAM) (Tibshirani et al., 2002).
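As an illustration of the filter approach, the following minimal Python sketch ranks genes by an F-ratio. It assumes the between-class to within-class sum-of-squares form discussed by Dudoit et al. (2002); whether Prophet computes the statistic in exactly this way is not stated here, and the function names `f_ratio` and `rank_genes` are hypothetical.

```python
from collections import defaultdict
from math import inf


def f_ratio(values, labels):
    """Between-class over within-class sum of squares for a single gene.

    Assumed form of the statistic (BSS/WSS); not taken from Prophet's code.
    """
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    overall_mean = sum(values) / len(values)
    # Spread of the class means around the overall mean, weighted by class size.
    bss = sum(len(g) * (sum(g) / len(g) - overall_mean) ** 2
              for g in groups.values())
    # Spread of the samples around their own class mean.
    wss = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if wss == 0:
        return inf if bss > 0 else 0.0
    return bss / wss


def rank_genes(expression, labels):
    """Return gene identifiers sorted from most to least discriminative.

    `expression` maps each gene identifier to its list of values,
    one value per sample, in the same order as `labels`.
    """
    return sorted(expression,
                  key=lambda gene: f_ratio(expression[gene], labels),
                  reverse=True)
```

A gene whose values separate the classes cleanly gets a large ratio and appears at the top of the ranking, while a gene with constant expression scores zero.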
The train option of the Prophet form implements the strategy for finding the best predictor with the optimal number of genes. A leave-one-out (LOO) cross-validation strategy is implemented here to return the cross-validated error rate of the complete process of building several predictors and then choosing the one with the smallest error rate. The procedure is as follows: a LOO sample is drawn from the training dataset. Genes are ranked by one of the above-mentioned methods (F-ratio or Wilcoxon statistic) and, using the n top genes (n = 2, 5, 10, 20, 35, 50, 75 and 100, by default), a predictor is built with each of the above-mentioned methods (KNN, DLDA, SVM, PAM, SOM or a sub-selection of them). Then, the LOO error is calculated for each method and each set of n genes. Finally, the smallest set of n genes in combination with the method that results in the smallest CV error is reported. The results include a plot of the CV error across the range of sets of n genes for all the classification methods tried, along with the corresponding confusion matrices (very useful for detecting asymmetries in the determination of classes). In addition, the prediction for each LOO sample is provided, which is quite useful for detecting outliers or anomalous misassignments. More detailed information and examples, together with all the supplementary information, are available on the tutorial page at http://gepas.bioinfo.cipf.es/tutorial/prophet.html.
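The key point of the procedure, that genes are re-ranked inside every LOO fold without the held-out sample, can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not Prophet's implementation: it uses only two stand-in classifiers (k-nearest neighbor and a plain nearest-centroid rule in place of DLDA, SVM, PAM and SOM), a BSS/WSS form of the F-ratio, and hypothetical function names. It also does not add an outer validation loop around the final choice of method and n.

```python
from collections import Counter, defaultdict
from math import inf


def f_ratio(values, labels):
    """Between-class over within-class sum of squares (assumed F-ratio form)."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    mean = sum(values) / len(values)
    bss = sum(len(g) * (sum(g) / len(g) - mean) ** 2 for g in groups.values())
    wss = sum((v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g)
    if wss == 0:
        return inf if bss > 0 else 0.0
    return bss / wss


def top_genes(samples, labels, n):
    """Column indices of the n genes with the largest F-ratio."""
    scores = [f_ratio([s[j] for s in samples], labels)
              for j in range(len(samples[0]))]
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:n]


def sq_dist(a, b, genes):
    """Squared Euclidean distance restricted to the selected genes."""
    return sum((a[j] - b[j]) ** 2 for j in genes)


def knn_predict(train, labels, x, genes, k=3):
    """Majority vote among the k training samples closest to x."""
    order = sorted(range(len(train)), key=lambda i: sq_dist(train[i], x, genes))
    return Counter(labels[i] for i in order[:k]).most_common(1)[0][0]


def centroid_predict(train, labels, x, genes):
    """Assign x to the class whose mean profile is closest (stand-in method)."""
    by_class = defaultdict(list)
    for sample, label in zip(train, labels):
        by_class[label].append(sample)
    centroids = {y: [sum(s[j] for s in rows) / len(rows) for j in range(len(x))]
                 for y, rows in by_class.items()}
    return min(centroids, key=lambda y: sq_dist(centroids[y], x, genes))


CLASSIFIERS = {"knn": knn_predict, "centroid": centroid_predict}


def loo_error_grid(samples, labels, sizes=(2, 5, 10), classifiers=CLASSIFIERS):
    """LOO error rate for every (method, n) pair, ranking genes inside each fold."""
    errors = {(name, n): 0 for name in classifiers for n in sizes}
    for i in range(len(samples)):
        train = samples[:i] + samples[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # Gene ranking never sees the held-out sample: this avoids selection bias.
        ranking = top_genes(train, train_labels, max(sizes))
        for n in sizes:
            for name, predict in classifiers.items():
                if predict(train, train_labels, samples[i], ranking[:n]) != labels[i]:
                    errors[(name, n)] += 1
    return {key: count / len(samples) for key, count in errors.items()}


def best_predictor(error_grid):
    """Smallest error; ties go to fewer genes, then to the method name."""
    return min(error_grid, key=lambda key: (error_grid[key], key[1], key[0]))
```

Calling `loo_error_grid(samples, labels)` on a list of per-sample expression vectors returns a dictionary of error rates keyed by (method, n), from which `best_predictor` picks the combination to report. Ranking the genes once on the full dataset before the loop would instead leak information from the held-out sample into the selection step.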
Once the optimal predictor (combination of a set of n genes and a classification method) has been found, it can be saved. Then, in the predict option of the form, the predictor can be retrieved and applied to new samples and a class membership prediction will be obtained for them.
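The train-then-predict workflow can be pictured with the following minimal Python sketch, in which a predictor (a set of genes plus a classification method) is built, serialized, restored and applied to a new sample. The JSON layout, the nearest-centroid rule and all function names are assumptions made for illustration; the format in which Prophet actually stores its predictors is not described here.

```python
import json


def build_predictor(samples, labels, genes):
    """Fit per-class centroids restricted to the selected genes.

    Each sample is a dict mapping gene identifier to expression value.
    """
    sums, counts = {}, {}
    for sample, label in zip(samples, labels):
        acc = sums.setdefault(label, {g: 0.0 for g in genes})
        for g in genes:
            acc[g] += sample[g]
        counts[label] = counts.get(label, 0) + 1
    centroids = {label: {g: sums[label][g] / counts[label] for g in genes}
                 for label in sums}
    return {"method": "nearest_centroid", "genes": list(genes),
            "centroids": centroids}


def save_predictor(predictor):
    """Serialize the predictor to a JSON string (hypothetical storage format)."""
    return json.dumps(predictor)


def load_predictor(text):
    """Restore a predictor saved with save_predictor."""
    return json.loads(text)


def predict(predictor, sample):
    """Return the class whose centroid is closest on the predictor's genes."""
    def distance(centroid):
        return sum((sample[g] - centroid[g]) ** 2 for g in predictor["genes"])
    return min(predictor["centroids"],
               key=lambda label: distance(predictor["centroids"][label]))
```

Because only the selected genes and the fitted parameters are stored, a new sample needs values for those genes alone; any other genes it contains are ignored.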
The input file format is quite simple: a tab-delimited text file with genes in rows and experiments in columns. The first column corresponds to the gene identifiers. Individual experiment identifiers as well as class identifiers can be provided in a separate file or within the main file with the corresponding labels (#NAME and #CLASS, respectively, see tutorial and Supplementary information for details).
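A file of this kind could be read with a short Python routine such as the one below. It is a sketch that assumes the #NAME and #CLASS labels appear as the first field of their own tab-delimited rows inside the main file and that all expression values are numeric; the exact layout, including the handling of missing values, is defined in the tutorial, and the function name is hypothetical.

```python
import csv
import io


def parse_expression_table(text):
    """Parse a tab-delimited table with genes in rows and experiments in columns.

    Returns (names, classes, genes): the experiment identifiers, the class
    identifiers, and a dict mapping each gene identifier to its list of values.
    Rows labelled #NAME and #CLASS are assumed to carry the identifiers.
    """
    names, classes, genes = None, None, {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        key, fields = row[0].strip(), row[1:]
        if key == "#NAME":
            names = fields
        elif key == "#CLASS":
            classes = fields
        elif key.startswith("#"):
            continue  # ignore any other label or comment row
        else:
            genes[key] = [float(value) for value in fields]
    return names, classes, genes
```

For example, a file with the rows `#NAME`, `#CLASS` and two gene rows over three experiments yields three experiment names, three class labels and a two-entry gene dictionary.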
Prophet is integrated within the GEPAS (Herrero et al., 2003; Montaner et al., 2006) environment; thus, a complete analysis of the microarray data, from the first steps of normalization and preprocessing, can be performed without the need to switch among different programs with different input/output formats. Another unique feature is the possibility of obtaining a functional interpretation of the genes included in the predictor. This is achieved through tools such as FatiGO+ (Al-Shahrour et al., 2004) and others included in the Babelomics package (Al-Shahrour et al., 2005, 2006), to which Prophet is also linked.
In addition to the web interface, Prophet can be invoked as a web service.
To summarize, Prophet provides an accurate, conceptually correct and easy-to-use framework for building predictors based on microarray gene expression data that can later be used to predict class membership for new samples. Moreover, it is the only web-based tool that builds predictors based on genes and allows further functional interpretation of the results.
| Acknowledgments |
|---|
This work is supported by grants from NRC Canada-SEPOCT Spain, project BIO 2005-01078 from the MEC and the INDIGO EU project. The Functional Genomics node (INB) is supported by Genoma España. Funding to pay the Open Access publication charges for this article was provided by Genoma España.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Chris Stoeckert
Received on October 23, 2006; revised on November 19, 2006; accepted on November 20, 2006
| REFERENCES |
|---|
Al-Shahrour, F., et al. (2004) FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics, 20, 578-580.
Al-Shahrour, F., et al. (2005) BABELOMICS: a suite of web tools for functional annotation and analysis of groups of genes in high-throughput experiments. Nucleic Acids Res., 33, W460-W464.
Al-Shahrour, F., et al. (2006) BABELOMICS: a systems biology perspective in the functional annotation of genome-scale experiments. Nucleic Acids Res., 34, W472-W476.
Ambroise, C. and McLachlan, G.J. (2002) Selection bias in gene extraction on the basis of microarray gene-expression data. Proc. Natl Acad. Sci. USA, 99, 6562-6566.
Dudoit, S., et al. (2002) Comparison of discrimination methods for the classification of tumors using gene expression data. J. Am. Stat. Assoc., 97, 77-87.
Herrero, J., et al. (2003) GEPAS: a web-based resource for microarray gene expression data analysis. Nucleic Acids Res., 31, 3461-3467.
Kohonen, T. (1997) Self-organizing Maps. Springer-Verlag, Berlin.
Montaner, D., et al. (2006) Next station in microarray data analysis: GEPAS. Nucleic Acids Res., 34, W486-W491.
Pochet, N.L., et al. (2005) M@CBETH: a microarray classification benchmarking tool. Bioinformatics, 21, 3185-3186.
Romualdi, C., et al. (2003) Pattern recognition in gene expression profiling using DNA array: a comparative study of different statistical methods applied to cancer classification. Hum. Mol. Genet., 12, 823-836.
Simon, R. (2005) Roadmap for developing and validating therapeutically relevant genomic classifiers. J. Clin. Oncol., 23, 7332-7341.
Simon, R., et al. (2003) Pitfalls in the use of DNA microarray data for diagnostic and prognostic classification. J. Natl Cancer Inst., 95, 14-18.
Tibshirani, R., et al. (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc. Natl Acad. Sci. USA, 99, 6567-6572.
van 't Veer, L.J., et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530-536.
Vapnik, V. (1999) Statistical Learning Theory. John Wiley and Sons, New York.
Wessels, L.F., et al. (2005) A protocol for building and evaluating predictors of disease state based on microarray data. Bioinformatics, 21, 3755-3762.