Bioinformatics Advance Access originally published online on November 8, 2005
Bioinformatics 2006 22(2):245-247; doi:10.1093/bioinformatics/bti760
© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org

PCP: a program for supervised classification of gene expression profiles

Ljubomir J. Buturovic

San Francisco State University, 1600 Holloway Avenue, San Francisco, CA 94132, USA


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ARCHITECTURE
 3 ALGORITHMS AND METHODS
 REFERENCES
 

Summary: PCP (Pattern Classification Program) is an open-source machine learning program for supervised classification of patterns (vectors of measurements). The principal use of PCP in bioinformatics is the design and evaluation of classifiers for use in clinical diagnostic tests based on measurements of gene expression. PCP implements leading pattern classification and gene selection algorithms and incorporates cross-validation estimation of classifier performance. Importantly, the implementation integrates the gene selection and class prediction stages, which is vital for computing reliable performance estimates in small-sample scenarios. Additionally, the program includes automated and efficient model selection (optimization of parameters) for the support vector machine (SVM) classifier. The distribution includes Linux and Windows/Cygwin binaries. The program can easily be ported to other platforms.

Availability: Free download at http://pcp.sourceforge.net

Contact: ljubomir{at}sfsu.edu


    1 INTRODUCTION
Clinical diagnostic tests based on measurements of gene expression are starting to be offered commercially and have the potential to gradually enter widespread clinical practice (van de Vijver et al., 2002; Soonmyung et al., 2004; Moraleda et al., 2004). The research and development of automated diagnostic tools based on genome-wide gene expression patterns includes a gene selection phase and a classifier design phase. The gene selection phase chooses the optimal (according to some suitable criterion) subset of genes thought to be the most relevant for discriminating among the disease categories. The task of the classifier is to assign the specimen being interrogated to one of the previously defined diagnostic classes, using measurements of expression of the selected genes.

The Pattern Classification Program (PCP) package has been designed to assist in the development of these two stages. It is a stand-alone, open-source application which implements leading algorithms for gene selection and classifier learning, prediction and performance evaluation. The principal applications of PCP are evaluation of the inherent discrimination (i.e. diagnostic power) of the datasets under study, identification of optimal gene subsets, and comparison of proprietary or novel algorithms with the more established ones. The program can be used and redistributed without restrictions in binary and source forms.


    2 ARCHITECTURE
The PCP source code is written in the C and C++ programming languages, strictly conforming to the corresponding ANSI standards. The distribution contains pre-compiled binaries for Linux and Windows (the latter requires installation of the free Cygwin environment). The program uses (links with) the LAPACK linear algebra library (Anderson et al., 1999), available on many platforms.

PCP is a desktop application which uses hierarchically organized menus for user interaction. The main menu is shown in Figure 1. The program is started from a command prompt. The menus are controlled interactively from the keyboard, by pressing keys corresponding to the menu actions (e.g. Learning, Cross-validation, Prediction, etc.). All processing parameters are entered from the keyboard in response to the program prompts.


Fig. 1 PCP Main Menu. The functions are activated by pressing the corresponding keyboard keys. For example, to enter the Pattern Classification Menu, press the ‘b’ key.

 
Input data are read from whitespace-delimited text files. The data files are assumed to contain normalized expression values of individual genes for all specimens. The expression values will typically be generated using software specific to the microarray platform. For example, for the Affymetrix GeneChip platform, the gene expression values may be produced from CEL files using the Affymetrix Microarray Suite (MAS) software or an open-source program implementing the RMA algorithm (Irizarry et al., 2003) for normalization and summarization.
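As a minimal sketch of reading such a file, the following Python function parses whitespace-delimited text into a numeric matrix. The layout assumed here (one specimen per line, expression values first, class label last) and the function name are illustrative assumptions only; the actual PCP file format is defined by the program's documentation.

```python
def parse_expression_table(text):
    """Parse whitespace-delimited text into (patterns, labels).

    Assumed layout (hypothetical, not taken from the PCP manual):
    one specimen per line, expression values first, class label last.
    """
    patterns, labels = [], []
    for line in text.splitlines():
        fields = line.split()          # splits on any run of spaces or tabs
        if not fields:
            continue                   # skip blank lines
        patterns.append([float(v) for v in fields[:-1]])
        labels.append(fields[-1])
    # every specimen must carry the same number of gene measurements
    if len({len(p) for p in patterns}) > 1:
        raise ValueError("rows have differing numbers of values")
    return patterns, labels


sample = "7.1 3.2\t9.0 tumor\n6.8 3.5 8.7 normal\n"
patterns, labels = parse_expression_table(sample)
```

Mixed tabs and spaces are handled by the same split call, which is the main convenience of a whitespace-delimited format.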

Results are presented in tabular form on the screen and also saved in text files in easily parsable formats. The files can then be used for graphical display or further analyses by other programs.

PCP also supports batch mode, in which it reads commands (corresponding to the navigation controls and processing parameters) from a text file. This makes it convenient to incorporate PCP processing in a complex data analysis dataflow driven from a scripting language such as Perl. This mode of operation is strengthened by the robust error handling facility, which stores all diagnostics in an easily parsable text file.


    3 ALGORITHMS AND METHODS
The algorithms supported by PCP are shown in Table 1.


Table 1 Algorithms implemented in PCP

 
3.1 Gene selection
PCP offers two groups of algorithms for projecting input gene expression values into a lower-dimensional space (a process often referred to as dimensionality reduction): gene (feature) extraction and gene (feature) selection.

Gene extraction refers to algorithms which build a linear mapping transformation for reducing the dimensionality of the input gene space. Note that a diagnostic test incorporating such a transformation utilizes all of the input gene expression measurements.
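To illustrate why a test based on gene extraction still needs every input measurement, here is a small Python sketch of a linear mapping. The projection matrix W is made up for illustration; in practice it would be learned from training data by whichever extraction algorithm is chosen.

```python
def extract_features(W, x):
    """Apply a linear map: each output feature is a weighted sum of ALL genes."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]


# Hypothetical 2 x 4 projection: 4 input genes -> 2 extracted features.
W = [[0.5, 0.5, 0.0, 0.0],
     [0.25, -0.25, 0.5, -0.5]]
x = [2.0, 4.0, 6.0, 8.0]   # expression values of one specimen
y = extract_features(W, x)
```

Because the second row of W has no zero entries, computing the second feature requires measuring all four genes, even though the output has only two dimensions.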

Gene selection chooses an optimal subset of genes for further processing (classification). In contrast to gene extraction, a diagnostic test incorporating gene selection only utilizes the expression values of the genes in the subset. This permits the use of a more accurate expression measurement technology (e.g. RT–PCR) for the genes in the chosen subset. In addition, the process potentially identifies a biologically meaningful subset of genes. For these reasons, gene selection is usually the preferred dimensionality reduction method in expression-based diagnostics. Nevertheless, on occasion gene extraction provides superior classification performance, and it may be used to evaluate the discrimination power of all available measurements.

Gene selection can further be subdivided into algorithms which evaluate and compare the predictive power of individual genes, and algorithms which compare groups (subsets) of genes. The first group of algorithms is known as Gene Ranking (Su, Y. et al., http://genomics10.bu.edu/yangsu/rankgene). The algorithms in this group assume independence among the genes’ expressions and differ by the gene ranking criterion. The criteria currently available in PCP are listed in Table 1. The second group of gene selection algorithms includes forward selection and backward elimination, both supported in PCP. These methods may be able to identify complex relationships among genes, at the cost of significantly higher computational complexity. PCP uses the 1-NN error rate estimate, the Bayes error estimate and the inter-intra distance as gene subset evaluation criteria for this group of gene selection algorithms.
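The second group can be sketched in a few lines of Python. The example below is a generic greedy forward selection that scores each candidate subset with a leave-one-out 1-NN error rate; it is an illustration under simple assumptions (squared Euclidean distance, toy data, hypothetical function names), not a transcription of the PCP source code.

```python
import math


def loo_1nn_error(X, y, genes):
    """Leave-one-out error rate of a 1-NN classifier restricted to `genes`."""
    errors = 0
    for i in range(len(X)):
        best_j, best_d = None, math.inf
        for j in range(len(X)):
            if j == i:
                continue               # a specimen may not vote for itself
            d = sum((X[i][g] - X[j][g]) ** 2 for g in genes)
            if d < best_d:
                best_j, best_d = j, d
        if y[best_j] != y[i]:
            errors += 1
    return errors / len(X)


def forward_selection(X, y, n_genes):
    """Greedily add the gene that gives the lowest 1-NN error estimate."""
    selected = []
    remaining = list(range(len(X[0])))
    while remaining and len(selected) < n_genes:
        best = min(remaining, key=lambda g: loo_1nn_error(X, y, selected + [g]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: 6 specimens x 3 genes; only gene 2 separates the classes.
X = [[5.0, 1.0, 0.1], [1.0, 1.2, 0.2], [4.0, 3.0, 0.3],
     [2.0, 3.1, 2.1], [5.5, 1.1, 2.2], [1.5, 2.9, 2.3]]
y = [0, 0, 0, 1, 1, 1]
selected = forward_selection(X, y, 1)
```

Each step re-evaluates the criterion once per remaining gene, which is where the higher computational cost relative to gene ranking comes from.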

3.2 Performance evaluation
PCP utilizes cross-validation to evaluate classifier performance. One of the challenges in cross-validation-based estimation of classifier performance is the integration of the gene selection stage. It has been demonstrated in the context of microarray data analysis (Molinari et al., 2005) that a reliable estimate requires repeated gene subset selection for each training resampling subset. PCP rigorously implements this requirement. Thus, for each cross-validation fold, the gene selection is performed anew using the training subset, and the chosen genes are then extracted from the training and test subsets. This processing significantly increases the overall computational complexity, but is vital to avoid severe underestimation of classifier error rates.
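The per-fold procedure can be sketched as follows. This Python example is a simplified illustration: the signal-to-noise ranking criterion, the nearest-centroid classifier, the interleaved fold assignment and all function names are assumptions chosen for brevity, not the algorithms or code used by PCP.

```python
import statistics


def rank_genes(X, y, k):
    """Return indices of the k genes with the largest signal-to-noise score.

    Score = |mean0 - mean1| / (sd0 + sd1); an illustrative two-class criterion.
    """
    def score(g):
        a = [row[g] for row, c in zip(X, y) if c == 0]
        b = [row[g] for row, c in zip(X, y) if c == 1]
        spread = statistics.pstdev(a) + statistics.pstdev(b)
        return abs(statistics.fmean(a) - statistics.fmean(b)) / (spread + 1e-9)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def nearest_centroid(X_train, y_train, genes, x):
    """Predict the class whose centroid (over `genes`) is closest to x."""
    best_c, best_d = None, float("inf")
    for c in sorted(set(y_train)):
        rows = [r for r, lab in zip(X_train, y_train) if lab == c]
        d = sum((statistics.fmean(r[g] for r in rows) - x[g]) ** 2 for g in genes)
        if d < best_d:
            best_c, best_d = c, d
    return best_c


def cross_validate(X, y, n_folds, k):
    """Cross-validation error with gene selection redone inside every fold."""
    errors = 0
    for f in range(n_folds):
        test_idx = [i for i in range(len(X)) if i % n_folds == f]
        train_idx = [i for i in range(len(X)) if i % n_folds != f]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        genes = rank_genes(X_tr, y_tr, k)      # training specimens only
        for i in test_idx:
            if nearest_centroid(X_tr, y_tr, genes, X[i]) != y[i]:
                errors += 1
    return errors / len(X)


# Toy data: 8 specimens x 3 genes; gene 1 carries the class signal.
X = [[0.3, 1.0, 5.0], [0.9, 3.0, 5.2], [0.1, 1.1, 4.9], [0.7, 3.2, 5.1],
     [0.8, 0.9, 5.3], [0.2, 2.9, 4.8], [0.5, 1.2, 5.0], [0.4, 3.1, 5.1]]
y = [0, 1, 0, 1, 0, 1, 0, 1]
cv_error = cross_validate(X, y, n_folds=4, k=1)
```

The essential line is the call to rank_genes inside the fold loop: the test specimens of a fold never influence which genes are chosen for that fold.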

3.3 SVM model selection
Model selection refers to the process of choosing optimal parameters of a classifier. For example, model selection for a multilayer perceptron (MLP) consists of determining the optimal number of hidden nodes. For algorithms with a single discrete-valued parameter, such as k-NN and neural networks, the process is straightforward and amounts to cross-validation of the classifier for a relatively small set of values of the parameter. The optimal parameter is the value which gives the best cross-validation performance. This analysis can often be executed interactively or within a simple driver script.
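For a single discrete parameter, such a driver script can be very short. The following Python sketch selects k for a k-NN classifier by leave-one-out cross-validation; the data, the candidate values and the function names are invented for illustration.

```python
from collections import Counter


def knn_predict(X_train, y_train, x, k):
    """Majority vote among the k nearest training patterns (squared Euclidean)."""
    order = sorted(range(len(X_train)),
                   key=lambda j: sum((a - b) ** 2 for a, b in zip(X_train[j], x)))
    votes = Counter(y_train[j] for j in order[:k])
    return votes.most_common(1)[0][0]


def loo_error(X, y, k):
    """Leave-one-out cross-validation error rate of k-NN."""
    errors = 0
    for i in range(len(X)):
        X_tr = X[:i] + X[i + 1:]
        y_tr = y[:i] + y[i + 1:]
        if knn_predict(X_tr, y_tr, X[i], k) != y[i]:
            errors += 1
    return errors / len(X)


def select_k(X, y, candidates):
    """Return the first candidate k with the lowest cross-validation error."""
    return min(candidates, key=lambda k: loo_error(X, y, k))


# Toy one-gene data with one atypical specimen (1.3, class 1) inside class 0.
X = [[1.0], [1.2], [1.3], [1.4], [1.6], [3.0], [3.2], [3.4], [3.6]]
y = [0, 0, 1, 0, 0, 1, 1, 1, 1]
best_k = select_k(X, y, [1, 3, 5])
```

With k = 1 the atypical specimen misleads its neighbours, so a larger k gives a lower error estimate on this toy set.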

In contrast, SVM presents considerable challenges for model selection. The various incarnations of SVM usually have two or more continuously valued parameters. An exhaustive search of the parameter space is computationally demanding and requires automation. The situation is further complicated by the addition of a gene selection phase to the classifier design. A rigorous SVM model selection involves classifier cross-validation for each set of parameter values, including the repeated gene selection for each cross-validation subset, as explained in Section 3.2. This requirement dramatically increases the computational complexity. Given the complexity of the gene selection algorithms themselves, the total computational burden may be impractical even for small datasets.

PCP efficiently solves the problem of SVM model selection and reduces the complexity to manageable levels. The improvements are achieved by employing a heuristic based on the Simplex algorithm (Press et al., 1992) in the parameter search and by pre-computing the gene selection subsets before starting the parameter search. As a result, PCP makes it practical to use SVM for the analysis of high-dimensional, small-sample datasets typically encountered during the development of gene expression-based diagnostics.
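The two ideas can be sketched together in Python. The example below pairs a bare-bones downhill simplex (Nelder-Mead) search with gene subsets that are computed once per fold before the search starts. It is only an illustration of the general approach: the exact heuristic used by PCP is not described here, and select_genes and fold_error are hypothetical stand-ins (a smooth toy surface replaces actual SVM training and testing).

```python
calls = {"selection": 0}


def select_genes(fold):
    """Stand-in for an expensive gene selection run on one training fold."""
    calls["selection"] += 1
    return (fold, fold + 1)            # hypothetical gene indices


def fold_error(genes, log_c, log_gamma):
    """Stand-in for SVM training and testing on one fold.

    A real implementation would train on the `genes` columns only;
    this toy surface ignores them and has its minimum at (2, -3).
    """
    return 0.1 + 0.01 * ((log_c - 2.0) ** 2 + (log_gamma + 3.0) ** 2)


def nelder_mead(f, start, step=1.0, iters=200):
    """Minimise f with a basic downhill simplex search."""
    n = len(start)
    simplex = [list(start)] + [
        [s + (step if i == j else 0.0) for j, s in enumerate(start)]
        for i in range(n)
    ]
    for _ in range(iters):
        simplex.sort(key=f)
        best, worst = simplex[0], simplex[-1]
        centroid = [sum(p[j] for p in simplex[:-1]) / n for j in range(n)]

        def along(t):
            # point on the line from the worst vertex through the centroid
            return [c + t * (c - w) for c, w in zip(centroid, worst)]

        reflected = along(1.0)
        if f(reflected) < f(best):
            expanded = along(2.0)
            simplex[-1] = expanded if f(expanded) < f(reflected) else reflected
        elif f(reflected) < f(simplex[-2]):
            simplex[-1] = reflected
        else:
            contracted = along(-0.5)
            if f(contracted) < f(worst):
                simplex[-1] = contracted
            else:  # shrink every vertex halfway towards the best one
                simplex = [best] + [
                    [(a + b) / 2 for a, b in zip(best, p)] for p in simplex[1:]
                ]
    return min(simplex, key=f)


# Gene subsets are computed once per fold, before the parameter search.
fold_genes = [select_genes(f) for f in range(5)]


def cv_error(point):
    """Mean error over folds for one (log C, log gamma) pair."""
    log_c, log_gamma = point
    return sum(fold_error(g, log_c, log_gamma) for g in fold_genes) / len(fold_genes)


best = nelder_mead(cv_error, [0.0, 0.0])
```

However many parameter pairs the search visits, gene selection runs only five times here, once per fold. A production version would also cache objective values, since this sketch re-evaluates the objective when sorting.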


    Acknowledgments
 
Sasha Jaksic of San Francisco State University contributed the code to the gene selection functionality and suggested the pre-computation of gene selection subsets. Professor Milan Milosavljevic of the University of Belgrade collaborated on an earlier version of the program.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alvis Brazma

Received on August 10, 2005; revised on September 18, 2005; accepted on November 2, 2005

    REFERENCES

    Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D. (1999) LAPACK Users’ Guide, Third Edition. Society for Industrial and Applied Mathematics, Philadelphia.

    Fukunaga, K. and Hummels, D.M. (1987) Bayes error estimation using Parzen and k-NN procedures. IEEE Trans. Pattern Anal. Mach. Intell., PAMI-9, 634–643.

    Golub, T., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

    Irizarry, R.A., et al. (2003) Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res., 31, 345–349.

    Molinari, A.M., et al. (2005) Prediction error estimation: a comparison of resampling methods. Bioinformatics, 21, 3301–3307.

    Moraleda, J., Grove, N., Tran, Q., Doan, J., Hull, J., Nguyen, L., Pattin, A., Anderson, G. (2004) Gene expression data analytics with interlaboratory validation for identifying anatomical sites of origin of metastatic carcinomas. Proceedings of the American Society of Clinical Oncology Annual Meeting, New Orleans, LA, 23, p. 862.

    Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P. (1992) Numerical Recipes, Second Edition, Section 10.4. Cambridge University Press, Cambridge.

    Soonmyung, P., et al. (2004) A multigene assay to predict recurrence of Tamoxifen-treated, Node-negative breast cancer. N. Engl. J. Med., 351, 2817–2826.

    Su, Y., et al. (2003) RankGene: identification of diagnostic genes based on expression data. Bioinformatics, 19, 1578–1579.

    Theodoridis, S. and Koutroumbas, K. (2003) Pattern Recognition, Second Edition. Academic Press, Amsterdam.

    van de Vijver, M., et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347, 1999–2009.

