Skip Navigation


Bioinformatics Advance Access originally published online on March 10, 2009
Bioinformatics 2009 25(9):1203-1204; doi:10.1093/bioinformatics/btp126
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
25/9/1203    most recent
btp126v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Leek, J. T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Leek, J. T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2009 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

The tspair package for finding top scoring pair classifiers in R

Jeffrey T. Leek

Department of Oncology, Johns Hopkins School of Medicine, Baltimore, MD 21287, USA


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE TSPAIR PACKAGE
 3 AN EXAMPLE SESSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: Top scoring pairs (TSPs) are pairs of genes whose relative rankings can be used to accurately classify individuals into one of two classes. TSPs have two main advantages over many standard classifiers used in gene expression studies: (i) a TSP is based on only two genes, which leads to easily interpretable and inexpensive diagnostic tests and (ii) TSP classifiers are based on gene rankings, so they are more robust to variation in technical factors or normalization than classifiers based on expression levels of individual genes. Here I describe the R package, tspair, which can be used to quickly identify and assess TSP classifiers for gene expression data.

Availability: The R package tspair is freely available from Bioconductor: http://www.bioconductor.org

Contact: jtleek{at}jhu.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE TSPAIR PACKAGE
 3 AN EXAMPLE SESSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Classification of patients into disease groups or subtypes is the most direct way to translate microarray technology into a clinically useful tool (Quackenbush, 2006). A small number of tests based on microarrays have even been approved for clinical use, for example, for diagnosing breast cancer subtypes (Ma et al., 2004; Marchionni et al., 2008; Paik et al., 2004; van't Veer et al., 2002). But standard microarray classifiers are based on complicated functions of many gene expression measurements. This type of classifier is both hard to interpret and depends critically on the platform, pre-processing and normalization steps to be effective (Quackenbush, 2006). Identifying biologically interpretable, robust and cheap classifiers based on small subsets of genes would greatly speed progress in the development of clinical tests from microarray experiments.

Top scoring pairs (TSPs) are pairs of genes that accurately classify patients into clinically relevant groups based on their ranks (Geman et al., 2004; Tan et al., 2005; Xu et al., 2005). The basic idea is to search among all pairs of genes, and look for genes whose ranking most consistently switches between two groups. To understand how the classification scheme works, consider the simulated gene expression data in Figure 1. In this figure there are two groups of arrays, separated by the black line. These groups could represent healthy patients versus cancer patients, or two distinct subtypes of cancer. For all but one array in Group 1, Gene 1 has higher expression than Gene 2, and the reverse is true in Group 2. In this case, Genes 1 and 2 form a classifier based on their relative levels of expression. A new sample where the gene expression for Gene 1 was higher than the gene expression for Gene 2 would be classified as Group 1.


Figure 1
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. An Example of a TSP. In this simulated example, the expression for Gene 1 is higher than the expression for Gene 2 for almost all of the arrays in the group on the left and this relationship reverses for the group on the right.

 
The TSP approach has been successfully applied to identify subtypes of sarcoma, resulting in a RT-PCR-based test that correctly classified 20 independent tumors with perfect accuracy (Price et al., 2007). This early success suggests that it may be possible to identify TSP classifiers for other important diseases and quickly develop new inexpensive diagnostic tests.


    2 THE TSPAIR PACKAGE
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE TSPAIR PACKAGE
 3 AN EXAMPLE SESSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Calculating the TSP for a gene expression dataset is relatively straightforward, but computationally intensive. I have developed an R package tspair that can rapidly calculate the TSP for typical gene expression datasets, with tens of thousands of genes. The TSP can be calculated both in R or with an external C function, which allows both for rapid calculation and flexible development of the tspair package. The tspair package includes functions for calculating the statistical significance of a TSP by permutation test, and is fully compatible with Bioconductor expression sets. The R package is freely available from the Bioconductor web site (www.bioconductor.org).


    3 AN EXAMPLE SESSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE TSPAIR PACKAGE
 3 AN EXAMPLE SESSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Here I present an example session on a simple simulated dataset included in the tspair package. I calculate the TSP, assess the strength of evidence for the classifier with a permutation test, plot the output and show how to predict outcomes for a new dataset. The main function in the tspair package is tspcalc(). This function accepts either (i) a gene expression matrix or an expression set and a group indicator vector, or (ii) an expression set object and a column number, indicating which column of the annotation data to use as the group indicator. The result is a tsp object which gives the TSP score, indices, gene expression data and group labels for the TSP. If there are multiple pairs that achieve the top score, then the tie-breaking score developed by Tan et al. (2005) is reported.

Formula

The function tspsig() can be used to calculate the significance of a TSP classifier by permutation as described in Geman et al. (2004). The class labels are permuted, a new TSP is calculated for each permutation, and the null scores are compared with the observed TSP score to calculate a P-value. Since the maximum score is calculated for each null permutation, tspsig() performs a test of the null hypothesis that no TSP classifier is better than random chance.

Formula

Once a TSP has been calculated, the tspplot() function can be used to visualize the classifier. The resulting TSP figure (Fig. 2) plots the expression for the first gene in the pair versus the expression for the second gene in the pair. The true group difference is indicated by the color of the points, and the score for the TSP classifier is shown in the title of the plot. The black 45{circ} line indicates the classification from the TSP; the better the black line separates the colors the better the accuracy of the TSP.


Figure 2
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. A TSP plot. A TSP plot for the simulated data example in the tspair package. The colors indicate the true groups, and the black line indicates the TSP classification. The black line is the line where expression for ‘Gene 5’ equals the expression for ‘Gene 338’; the classification boundary is not data-driven, it is set in advance.

 
Formula

A major advantage of the TSP approach is that predictions are very simple and can be easily calculated either by hand or using the built-in functionality of the tspair package. In this example, the expression value for ‘Gene5’ is greater than the expression value for ‘Gene338’ much more often for the diseased patients. In a new dataset, when the expression for ‘Gene5’ is greater than the expression for ‘Gene338’ I predict that the patient will be diseased. The tspair package can be used to predict the outcomes of new samples based on new expression data. The new data can take the form of a new expression matrix, or an expression set object. The R function predict() searches for the TSP gene names from the original tspcalc() function call, and based on the row names or featureNames of the new dataset identifies the genes to use for prediction. If multiple TSPs are reported, the default is to predict with the TSP achieving the top tie-breaking score (Tan et al., 2005), but the user may also elect to use a different TSP for prediction.

Formula

In this example, the predict() function finds the genes with labels ‘Gene5’ and ‘Gene338’ in the second dataset and calculates the TSP predictions based on the values of these two genes. The new data matrix need not be defined by a microarray, it could easily be the result of RT-PCR or any other expression assay, imported into R as a tab-delimited text file.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE TSPAIR PACKAGE
 3 AN EXAMPLE SESSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The author acknowledges the useful discussions with Giovanni Parmigiani, Leslie Cope, Dan Naiman and Don Geman.

Funding: National Science Foundation (DMS034211); National Institutes of Health (1UL1RR025005-01).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Joaquin Dopazo

Received on January 11, 2009; revised on February 16, 2009; accepted on March 1, 2009

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE TSPAIR PACKAGE
 3 AN EXAMPLE SESSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Geman D, et al. Classifying gene expression profiles from pairwise mRNA comparisons. Stat. Appl. Genet. Mol. Biol. (2004) 3.

    Ma XJ, et al. A two-gene expression ratio predicts clinical outcome in breast cancer patients treated with tamoxifen. Cancer Cell (2004) 5:607–616.[CrossRef][Web of Science][Medline]

    Marchionni L, et al. Systematic review: gene expression profiling assays in early-stage breast cancer. Ann. Inter. Med. (2008) 148:358–369.[Abstract/Free Full Text]

    Paik S, et al. A multigene assay to predict recurrence of tamoxifen-treated, node-negative breast cancer. N. Engl. J. Med. (2004) 351:2817–2826.[Abstract/Free Full Text]

    Price ND, et al. Highly accurate two-gene classifier for differentiating gastrointestinal stromal tumors and leiomyosarcomas. Proc. Natl Acad. Sci. USA (2007) 104:3414–3419.[Abstract/Free Full Text]

    Quackenbush J. Microarray analysis and tumor classification. N. Engl. J. Med. (2006) 354:2463–2472.[Free Full Text]

    Tan AC, et al. Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics (2005) 21:3896–3904.[Abstract/Free Full Text]

    van't Veer LJ, et al. Gene expression profiling predicts clinical outcome of breast cancer. Nature (2002) 415:530–536.[CrossRef][Medline]

    Xu L, et al. Robust prostate cancer marker genes emerge from direct integration of inter-study microarray data. Bioinformatics (2005) 21:3905–3911.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Cancer Res.Home page
S. K. Patnaik, E. Kannisto, S. Knudsen, and S. Yendamuri
Evaluation of MicroRNA Expression Profiles That May Predict Recurrence of Localized Stage I Non-Small Cell Lung Cancer after Surgical Resection
Cancer Res., January 1, 2010; 70(1): 36 - 45.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
25/9/1203    most recent
btp126v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Leek, J. T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Leek, J. T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?