Bioinformatics Advance Access originally published online on June 22, 2007
Bioinformatics 2007 23(17):2256-2264; doi:10.1093/bioinformatics/btm322

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Annotation-based distance measures for patient subgroup discovery in clinical microarray studies

Claudio Lottaz *, Joern Toedling † and Rainer Spang

Max Planck Institute for Molecular Genetics and Berlin Center for Genome Based Bioinformatics, Ihnestr. 73, D-14195 Berlin, Germany

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Clustering algorithms are widely used in the analysis of microarray data. In clinical studies, they are often applied to find groups of co-regulated genes. Clustering, however, can also stratify patients by similarity of their gene expression profiles, thereby defining novel disease entities based on molecular characteristics. Several distance-based cluster algorithms have been suggested, but little attention has been given to the distance measure between patients. Even with the Euclidean metric, including and excluding genes from the analysis leads to different distances between the same objects, and consequently different clustering results.

Results: We describe a new clustering algorithm, in which gene selection is used to derive biologically meaningful clusterings of samples by combining expression profiles and functional annotation data. According to gene annotations, candidate gene sets with specific functional characterizations are generated. Each set defines a different distance measure between patients, leading to different clusterings. These clusterings are filtered using a resampling-based significance measure. Significant clusterings are reported together with the underlying gene sets and their functional definition.

Conclusions: Our method reports clusterings defined by biologically focused sets of genes. In annotation-driven clusterings, we have recovered clinically relevant patient subgroups through biologically plausible sets of genes as well as new subgroupings. We conjecture that our method has the potential to reveal so far unknown, clinically relevant classes of patients in an unsupervised manner.

Availability: We provide the R package adSplit as part of Bioconductor release 1.9 and on http://compdiag.molgen.mpg.de/software

Contact: claudio.lottaz@molgen.mpg.de


    1 INTRODUCTION
Gene expression profiling using whole genome microarrays has generated large amounts of data in various clinical contexts. One goal of these studies is the discovery of clinically relevant patient subgroups, e.g. groups of patients requiring a particular treatment.

1.1 An example from lymphoma research
Alizadeh et al. (2000) define two new subtypes of diffuse large B-cell lymphoma based on a hierarchical clustering analysis using a functionally restricted set of genes. The two disease entities refer to distinct differentiation stages of B-cells. Monti et al. (2005) postulate a different partitioning of diffuse large B-cell lymphomas supported by genes which have been excluded from the first analysis. Their disease entities reflect proliferation properties of the B-cell malignancies. Neither result can easily be proven wrong. In fact, they do not contradict each other. The two research groups had different a priori notions as to which genes are relevant. This led to two dissimilar but relevant clusterings of samples.

1.2 Different genes—different distances—different results
In the context of class discovery, we cluster patient profiles. For clustering, pairwise distances between these objects are calculated. However, the decision to use the Euclidean or any other metric does not yet uniquely define these distances: which genes to include in the analysis is also very important. Using all measured genes is not a good choice, because several independent molecular characteristics of the patients, like age, gender and disease status, will overlap and obscure the result. Gene selection is called for, but it certainly affects the clustering. Each choice of a gene set defines a particular distance between any two samples. Different gene sets lead to different distances between the same objects, although we always use the Euclidean metric to compute them. In many clinical studies, gene selection is used for unsupervised analysis either in order to reduce noise in the expression data (e.g. Cario et al., 2005) or, in addition, to focus on reproducible features (e.g. Bhattacharjee et al., 2001; Monti et al., 2005). However, little attention has been given to the effect of discarding genes on the resulting disease class definition.

1.3 The concept of our algorithm
Instead of selecting genes according to purely statistical characteristics, we suggest a systematic gene selection approach according to functional annotation. We describe an algorithm that generates a list of alternative clusterings using different gene sets to compute distances between samples. We derive candidate gene sets from functional annotation data, and filter the list by a novel significance measure for clustering strength.

1.4 Previous work
Clustering of gene expression data is routine in bioinformatics. Several methods have been suggested in this field (for a review, see Chapter 4 of Speed, 2003). Various approaches to score the quality of clusterings and to determine the best number of clusters exist (Dudoit and Fridlyand, 2002; Kerr and Churchill, 2001). All these methods have in common that the underlying metrics need to be specified beforehand. Several authors have also suggested ways to judge stability and statistical significance of clusters (Halkidi et al., 2001; Lange et al., 2004; McShane et al., 2002; Monti et al., 2003; Munneke et al., 2005). Semi-supervised clustering approaches include additional clinical information about patients. Bullinger et al. (2004) as well as Bair and Tibshirani (2004) suggest finding classes of patients using a clustering metric derived from the expression data and additional survival times. In a completely unsupervised setting, biclustering (Cheng and Church, 2000; Madeira and Oliveira, 2004; Tanay et al., 2004) and class-finding algorithms (Roth and Lange, 2004; Varma and Simon, 2004; von Heydebreck et al., 2001) combine the gene selection process with the clustering. These methods produce alternative clusterings and characterize them by underlying gene sets. Unfortunately, such methods are rarely used in clinical studies. One reason might be that a large set of alternative clusterings is hard to interpret unless the driving genes have a clear functional focus.

1.5 The role of functional annotations
A major shortcoming of class discovery algorithms is that they treat gene expression levels as anonymous variables. For many genes, however, a lot is known about their function and their role in cellular processes. Such knowledge is stored in databases like the Gene Ontology (GO) (Ashburner et al., 2000), Transpath (Schacherer et al., 2001), Biocarta (http://www.biocarta.com) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) (Kanehisa, 1996). Today, such annotations are routinely used to interpret results produced by statistical analysis. Tools for such a posteriori analysis include those of Beissbarth and Speed (2004), Dennis et al. (2003), Adryan and Schuh (2004), Doniger et al. (2003), Subramanian et al. (2005) and Grossmann et al. (2006).

1.6 A priori use of functional annotations
Unlike a posteriori methods, we propose using annotations within the statistical analysis of the expression data. In different contexts this a priori use of functional annotations has already been investigated. Pavlidis et al. (2002) and Zien et al. (2000) use functional annotations to improve the sensitivity of algorithms for detecting differentially expressed genes. Rahnenführer et al. (2004) apply pathway annotations to investigate metabolic pathways. Subclass finding in complex clinical phenotypes using functional annotations is the topic of Lottaz and Spang (2005). Here, we apply similar concepts to the problem of molecular class discovery in patients.

1.7 Outline of the article
In the next section, we describe the clustering procedure as well as the scoring of clustering results. In Section 3, we illustrate the usefulness of functional gene annotation for producing alternative clusterings of samples on a number of cancer related clinical microarray datasets. Finally, we discuss possible extensions of the method and interpret our observations from a biological perspective in Section 4.


    2 METHODS
The key idea for our class discovery algorithm is to use meaningful gene sets for computing distances between samples. For practical use, it is desirable to have functional rationales characterizing clusterings, such as clusterings related to proliferation or apoptosis. To this end, we define candidate gene sets using functional annotations, and call the resulting clusterings annotation driven.

We use the k-means algorithm to generate clusterings based on candidate gene sets. The quality of these clusterings is assessed using the diagonal linear discriminant (DLD) score (von Heydebreck et al., 2001). In order to determine the statistical significance of scores, we also compute DLD scores for clusterings driven by randomly chosen gene sets. Empirical P-values are calculated and false discovery rates (FDR) computed according to Benjamini and Hochberg (1995). Finally, we filter the list of clusterings for minimal subgroup size and to control the FDR. In a nutshell, the algorithm consists of the following steps:

For each biological term/pathway of interest, denoted B_i:

  1. Find all n_{B_i} genes annotated to B_i and discard all others.
  2. Perform 2-means clustering on the reduced expression matrix. This yields an annotation-driven clustering C_{B_i}.
  3. Compute the DLD score S(C_{B_i}) for this clustering.
  4. Draw 10 000 random gene sets of size n_{B_i} from the set of all measured genes. For each of them, perform Steps 2 and 3. This yields a vector r_{n_{B_i}} of 10 000 scores.
  5. Assign an empirical P-value to the original clustering, denoting the proportion of entries of r_{n_{B_i}} that are greater than or equal to S(C_{B_i}).
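
The five steps above can be sketched in a few lines of Python. This is an illustrative outline only, not the adSplit implementation: the function name `empirical_p_value`, the dictionary layout of `expr`, and the two toy stand-ins `toy_cluster` and `toy_score` are assumptions made for this example. In a real analysis the stand-ins would be replaced by 2-means clustering and the DLD score.

```python
import random


def empirical_p_value(expr, term_genes, cluster_fn, score_fn,
                      n_random=10000, seed=0):
    """Steps 1-5 for one term; expr maps gene -> list of values per sample."""
    rng = random.Random(seed)

    def run(genes):
        reduced = [expr[g] for g in genes]   # Step 1: keep only these genes
        labels = cluster_fn(reduced)         # Step 2: two-group clustering
        return score_fn(reduced, labels)     # Step 3: score the clustering

    annotated = [g for g in term_genes if g in expr]
    observed = run(annotated)
    all_genes = sorted(expr)
    # Step 4: null scores from random gene sets of the same size
    null = [run(rng.sample(all_genes, len(annotated)))
            for _ in range(n_random)]
    # Step 5: proportion of random scores >= the observed score
    return observed, sum(s >= observed for s in null) / n_random


# Toy stand-ins (hypothetical; NOT 2-means and NOT the DLD score).
def toy_cluster(rows):
    col_means = [sum(col) / len(col) for col in zip(*rows)]
    overall = sum(col_means) / len(col_means)
    return [1 if m > overall else 0 for m in col_means]


def toy_score(rows, labels):
    col_means = [sum(col) / len(col) for col in zip(*rows)]
    a = [m for m, l in zip(col_means, labels) if l == 0]
    b = [m for m, l in zip(col_means, labels) if l == 1]
    if not a or not b:
        return 0.0
    return abs(sum(a) / len(a) - sum(b) / len(b))


expr = {
    "g1": [0, 0, 5, 5], "g2": [0, 1, 5, 6],   # genes that split the samples
    "g3": [1, 1, 1, 1], "g4": [2, 2, 2, 2], "g5": [3, 3, 3, 3],
}
observed, p = empirical_p_value(expr, ["g1", "g2"], toy_cluster, toy_score,
                                n_random=200)
```

Passing the clustering and scoring routines as arguments keeps the resampling logic independent of how a single clustering is produced and scored.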

In the following, we provide more details on certain steps of the procedure.

2.1 Annotation data
We suggest the use of annotation data to generate candidate gene sets. Genes in a candidate set have common involvement in biological processes or pathways. To generate such gene sets, pathway databases such as KEGG and GO are particularly adequate. Sets of genes collected for a particular application from literature or a biologist's experience are possible alternatives. Very small gene sets should not be considered, since clusterings supported by very few genes are unlikely to represent a clustering of biological interest. On the other hand, sets containing too many genes are prone to be very unspecific, and thus their results are of little explanatory power.

2.2 Distance metric
K-means clustering is based on pairwise object dissimilarities. Objects in our case are patient expression profiles. We obtain dissimilarity measures from the family of restricted Euclidean metrics, which we define next.

Let (x_i, x_{i'}) be any two expression profiles, both containing measured expression values for P genes. Reducing the expression profiles to a limited set of genes before computing the distance is the same as computing a Euclidean distance specific to gene set G between the original profiles:

$$D_G(x_i, x_{i'}) = \sqrt{\sum_{j=1}^{P} I_{j \in G}\,(x_{ij} - x_{i'j})^2}$$

where I_{j ∈ G} is an indicator variable taking the value 1 if gene j is in set G and 0 otherwise. We call D_G a restricted Euclidean metric on patient space.

By selecting different gene sets before clustering, we choose different measures of distance between any two expression profiles. Since the choice of the distance measure affects the outcome of clustering more strongly than the choice of the clustering algorithm (see Hastie et al., 2001, Chapter 14), clusterings of the same samples with different metrics disagree substantially.
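
A small worked example may help to see how the gene set alone changes the distance between the same two profiles. The function name `restricted_euclidean` and the toy profiles are assumptions made for illustration.

```python
import math


def restricted_euclidean(x, y, gene_set):
    """Euclidean distance between profiles x and y using only genes in gene_set.

    x and y map gene identifiers to expression values.
    """
    return math.sqrt(sum((x[g] - y[g]) ** 2 for g in x if g in gene_set))


# Two toy patient profiles measured on three genes
patient_1 = {"a": 1.0, "b": 2.0, "c": 3.0}
patient_2 = {"a": 4.0, "b": 6.0, "c": 3.0}

d_ab = restricted_euclidean(patient_1, patient_2, {"a", "b"})  # sqrt(9 + 16)
d_c = restricted_euclidean(patient_1, patient_2, {"c"})        # identical on c
```

Under the gene set {a, b} the two patients are far apart, while under {c} they coincide, although the metric is Euclidean in both cases.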

2.3 K-means initialization
K-means clustering critically depends on its initialization step. We derive an initialization based on the first split of a divisive hierarchical clustering (Kaufman and Rousseeuw, 1990, Chapter 6). We compute the centroids of the two resulting clusters, which provide the starting points for the k-means algorithm (MacQueen, 1967). This has been shown to outperform standard k-means with random starting points (Milligan and Sokol, 1980). In other words, k-means is used to refine individual clusters and to correct inappropriate assignments made by the hierarchical method.
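
The initialization can be sketched as follows. This is a simplified illustration under stated assumptions: `diana_first_split` follows the splinter-group procedure of divisive analysis for the first split only, and `two_means` uses plain Lloyd-style reassignment rather than the Hartigan and Wong variant cited in Section 2.7. All function names are hypothetical.

```python
import math


def dist(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def diana_first_split(samples):
    """First divisive split: grow a splinter group from the most distant sample."""
    n = len(samples)
    d = [[dist(samples[i], samples[j]) for j in range(n)] for i in range(n)]
    rest = set(range(n))
    # Seed: the sample with the largest average dissimilarity to all others
    seed = max(rest, key=lambda i: sum(d[i]) / (n - 1))
    splinter = {seed}
    rest.remove(seed)
    while len(rest) > 1:
        def gain(i):
            to_rest = sum(d[i][j] for j in rest if j != i) / (len(rest) - 1)
            to_splinter = sum(d[i][j] for j in splinter) / len(splinter)
            return to_rest - to_splinter
        best = max(rest, key=gain)
        if gain(best) <= 0:          # nobody is closer to the splinter group
            break
        rest.remove(best)
        splinter.add(best)
    return [1 if i in splinter else 0 for i in range(n)]


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def two_means(samples, labels, max_iter=100):
    """Refine an initial two-group labelling by reassigning to nearest centroid."""
    for _ in range(max_iter):
        groups = [[s for s, l in zip(samples, labels) if l == k] for k in (0, 1)]
        if not groups[0] or not groups[1]:
            break
        cents = [centroid(g) for g in groups]
        new = [0 if dist(s, cents[0]) <= dist(s, cents[1]) else 1
               for s in samples]
        if new == labels:
            break
        labels = new
    return labels


# Each sample is one patient profile restricted to the chosen gene set
samples = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [12, 10]]
labels = two_means(samples, diana_first_split(samples))
```

Because the starting centroids come from a deterministic split, repeated runs on the same gene set give the same clustering, which keeps scores comparable across gene sets.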

2.4 Scoring clusterings
For clustering evaluation, we employ the DLD score, adopted from von Heydebreck et al. (2001):

Let X be the reduced expression matrix with rows containing the genes from the set of interest and columns representing the patient samples. Given a clustering C of samples, i.e. a binary vector of class labels for classes A and B, we are interested in the genes that best reflect this class division in their expression. A natural score for this purpose is Student's t-statistic. We keep only the 50 genes with the highest absolute t-statistic; for functional groups with fewer than 50 genes, all genes are kept. Disease entities typically involve expression changes of many genes. We therefore avoid clusterings with very few supporting genes by then discarding the genes with the largest absolute t-statistic. The number of genes discarded is a user-defined parameter of our method, which defaults to 5. Discarding the respective rows (genes) from X yields a shortened expression matrix X*. Now, we project the samples (columns) of X* onto a 1D space using the projection method from the classification step of DLD (Mardia et al., 1979). The DLD score S of a clustering C is the Student's t-statistic of the two clusters of C on this projection.
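
The scoring procedure can be sketched as below. Several details are assumptions rather than facts from the text: the sketch uses the pooled-variance form of Student's t-statistic, projects each sample with weights (mean_A - mean_B) / pooled variance per gene (one common form of the diagonal linear discriminant), reports the absolute value of the final t-statistic, and gives genes with zero within-class variance a weight of 0. The implementation in the isis package may differ in these points, and the function names here are hypothetical.

```python
import math


def mean(v):
    return sum(v) / len(v)


def pooled_var(a, b):
    ma, mb = mean(a), mean(b)
    ss = sum((x - ma) ** 2 for x in a) + sum((x - mb) ** 2 for x in b)
    return ss / (len(a) + len(b) - 2)


def t_stat(a, b):
    s2 = pooled_var(a, b)
    if s2 == 0:                      # simplification: no variance, no score
        return 0.0
    return (mean(a) - mean(b)) / math.sqrt(s2 * (1 / len(a) + 1 / len(b)))


def split(row, labels):
    a = [x for x, l in zip(row, labels) if l == 0]
    b = [x for x, l in zip(row, labels) if l == 1]
    return a, b


def dld_score(X, labels, n_keep=50, n_discard=5):
    """X: list of gene rows (one value per sample); labels: 0/1 per sample."""
    # Rank genes by absolute t-statistic, most differential first
    ranked = sorted(X, key=lambda row: abs(t_stat(*split(row, labels))),
                    reverse=True)
    # Keep the top n_keep genes, then drop the n_discard most differential
    kept = ranked[:n_keep][n_discard:]
    weights = []
    for row in kept:
        a, b = split(row, labels)
        s2 = pooled_var(a, b)
        weights.append((mean(a) - mean(b)) / s2 if s2 > 0 else 0.0)
    # Project every sample onto one dimension
    proj = [sum(w * row[i] for w, row in zip(weights, kept))
            for i in range(len(labels))]
    return abs(t_stat(*split(proj, labels)))


labels = [0, 0, 0, 1, 1, 1]
X = [[1, 2, 3, 11, 12, 13],   # strongly differential gene
     [5, 1, 3, 2, 5, 2]]      # gene unrelated to the split
score_all = dld_score(X, labels, n_discard=0)
score_masked = dld_score(X, labels, n_discard=5)
```

In the toy data the split rests on a single gene, so masking the most differential genes removes all support and the score collapses to zero, which is the behavior the discarding step is meant to produce.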

2.5 Assessing clustering significance
We introduce a new approach to address the question whether an annotation-driven clustering is statistically significant. To this aim, we observe clusterings based on randomly drawn gene sets, which have the same size as the set of functionally related genes but otherwise no restrictions on included genes. For each of these random gene sets, we find the optimal clustering and compute its DLD score as described above. The score derived from the annotation-driven clustering is compared with these random scores.

The DLD scores derived from random gene sets define a null distribution of scores for gene sets of the given size. For each annotation-driven clustering C, we can compute an empirical P-value π_E(C) denoting the proportion of random scores being equal to or greater than the annotation-driven clustering's DLD score. This empirical P-value provides us with a measure of significance for clusterings.

2.6 Multiple testing
The algorithm described so far determines an empirical P-value for each term for which we can find associated genes. Depending on the employed annotation sources and the microarray at hand, hundreds of terms are considered to generate annotation-driven clusterings. Hence, the determination of empirical P-values is subject to multiple testing. A conservative approach to correct for the multiple testing problem is to determine FDRs according to Benjamini and Hochberg (1995). We employ this correction, although its results are to be interpreted with care given the many dependencies between GO and KEGG terms which share commonly associated genes. First attempts to decorrelate overlapping gene sets for the gene set enrichment problem are described in Alexa et al. (2006) and Grossmann et al. (2006) but are difficult to transfer to our method.
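
For readers unfamiliar with the correction, the Benjamini and Hochberg step-up adjustment can be written compactly. The sketch below is a generic textbook version with a hypothetical function name, not code taken from adSplit.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted P-values (FDR estimates) in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest to the smallest P-value, enforcing monotonicity
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# One empirical P-value per term
fdr = benjamini_hochberg([0.01, 0.04, 0.03, 0.5])
```

Each adjusted value is the smallest FDR level at which the corresponding term would still be reported.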

2.7 Implementation
We have implemented annotation-based clustering in the statistical programming language R (Ihaka and Gentleman, 1996; R Development Core Team, 2005). We employ the divisive hierarchical clustering method from the cluster package and the implementation of k-means clustering (Hartigan and Wong, 1979) from R's stats package. The implementation of the DLD score is taken from the isis package (von Heydebreck et al., 2001). We retrieve gene annotations for GO and KEGG from meta-data packages of the Bioconductor project (Gentleman et al., 2004). Our code is available in the R package adSplit (Lottaz et al., 2005) from http://compdiag.molgen.mpg.de/software. The package is also part of Bioconductor release 1.9.


    3 RESULTS
We show results of our method on several cancer related datasets from clinical gene expression studies. We focus on the use of GO and the KEGG for annotations.

3.1 Expression data
We investigate the behavior of our clustering procedure on 15 clinical microarray studies. These studies are concerned with diagnostic and prognostic issues in the context of brain tumors (Freije et al., 2004; Nutt et al., 2003; Pomeroy et al., 2002; Rickman et al., 2001), breast cancer (Huang et al., 2003; West et al., 2001), leukemia (Armstrong et al., 2002; Cheok et al., 2003; Ross et al., 2004; Willenbrock et al., 2004; Yeoh et al., 2002), lung cancer (Beer et al., 2002; Bhattacharjee et al., 2001) and prostate tumors (Singh et al., 2002). All 15 microarray studies are based on Affymetrix GeneChip® technology. Eight datasets are generated using the genome wide HG-U95Av2 microarray based on release 95 of UniGene (Schuler, 1997), four studies are based on the older HU6800 chip, in two studies the newer HG-U133A chip based on release 133 of UniGene was applied, and one team worked with the HG-Focus chip, a microarray holding a subset of the probe-sets of the HG-U133A chip. Table 1 holds further information on the results obtained for these 15 studies.


Table 1. Cancer related datasets used for evaluation

 
For each of these datasets, gene expression profiles were background corrected and normalized on probe level using variance stabilization (Huber et al., 2002) before summarizing the probes into probe-set expression levels using median polish (Tukey, 1977) as suggested in RMA (Irizarry et al., 2003). The preprocessing steps are crucial to the clustering method. The variance stabilization method accounts for labeling effects and hybridization efficiency. Therefore, it improves gene to gene comparability of expression values but avoids blowing up the variance of low-variance genes, as genewise standardization would do. The summarization method using median polish further reduces adverse effects of probe-wise hybridization affinity. We used implementations of the cited preprocessing methods from Bioconductor (Gentleman et al., 2004).

3.2 Annotation data
For the systematic exploration of functional gene annotations, we suggest the use of the GO and KEGG. GO holds 17 601 biological terms, while KEGG comprises 231 pathways. For the considered Affymetrix microarrays, Table 2 states the number of terms and pathways to which more than 20 probe-sets but fewer than 10% of all probe-sets on the chip are annotated.


Table 2. Gene sets defined by GO and KEGG per chip

 
Strikingly, many GO terms have very few genes attributed: more than 75% of all terms hold fewer than 20 probe-sets. On the other hand, very few terms are too general, holding more than 10% of the genes on the microarrays. KEGG also defines some very small gene sets, but roughly two thirds hold more than 20 genes. The cellular function of many genes is still unknown; hence, no functional annotations are available for those genes. At present we can only discard these genes, thus losing information. This loss will become less important as functional knowledge increases.
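
The size thresholds described above amount to a simple filter over the annotation table. The following sketch assumes that annotations are available as a dictionary from term identifier to a list of probe-set identifiers; the function name, the identifiers and the toy chip size are made up for illustration.

```python
def candidate_gene_sets(annotation, n_probesets_on_chip,
                        min_size=20, max_fraction=0.10):
    """Keep terms with more than min_size probe-sets but fewer than
    max_fraction of all probe-sets on the chip."""
    upper = max_fraction * n_probesets_on_chip
    return {term: probesets for term, probesets in annotation.items()
            if min_size < len(set(probesets)) < upper}


# Toy annotation table for a hypothetical chip with 1000 probe-sets
annotation = {
    "term_small": [f"ps{i}" for i in range(5)],     # too few probe-sets
    "term_ok": [f"ps{i}" for i in range(30)],       # within both limits
    "term_broad": [f"ps{i}" for i in range(200)],   # too unspecific
}
candidates = candidate_gene_sets(annotation, n_probesets_on_chip=1000)
```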

On commercial Affymetrix oligonucleotide microarrays, many genes are represented by more than one probe-set, so several rows in an expression matrix give measurements for the same gene. When extracting probe-sets with a common annotation, either all or none of the probe-sets representing the same gene are included. When drawing random sets of probe-sets, we mimic this by actually drawing Entrez Gene IDs and including all probe-sets mapped to these in our random set. In this manner, we make sure that random scores actually correspond to random gene sets rather than random sets of probe-sets.
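
Drawing at the gene level rather than the probe-set level can be sketched as follows. The mapping, the identifiers and the function name are hypothetical, and the sketch assumes that the random set is matched on the number of genes; the text does not state whether set sizes are matched on genes or on probe-sets.

```python
import random


def draw_random_gene_set(gene_to_probesets, n_genes, rng):
    """Draw n_genes gene IDs and return them with ALL their probe-sets."""
    genes = rng.sample(sorted(gene_to_probesets), n_genes)
    probesets = [ps for g in genes for ps in gene_to_probesets[g]]
    return genes, probesets


# Toy mapping from gene ID to the probe-sets that measure that gene
gene_to_probesets = {
    "1001": ["ps_a", "ps_b"],
    "1002": ["ps_c"],
    "1003": ["ps_d", "ps_e", "ps_f"],
    "1004": ["ps_g"],
}
genes, probesets = draw_random_gene_set(gene_to_probesets, 2, random.Random(1))
```

Every gene in the draw contributes either all of its probe-sets or none, mirroring how annotation-derived sets are composed.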

3.3 Annotation-driven clusterings
We observe that many annotation-driven clusterings of patients obtain low empirical P-values. As illustrated in Figure 1 for the leukemia study by Yeoh et al. (2002), the distribution of empirical P-values has a peak close to zero. Apparently, certain gene sets with common functional annotation provide a better basis for clustering samples than random sets of genes. Moreover, the clusterings corresponding to low P-values are of particular interest for the biological focus of their supporting genes.


Fig. 1. Distribution of empirical P-values of annotation-driven clusterings on the gene expression study by Yeoh et al. (2002) on leukemia translocations.

 
Our second observation is that many clusterings with small P-values assign only few samples to one of the two clusters. In addition to a stringent P-value, we therefore also require a minimum group size of at least five samples for interesting clusterings. For the datasets analyzed, we thus obtain the number of interesting clusterings shown in the column ‘#C’ of Table 1.
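
The two filters can be combined in a single pass over the results. In this sketch the record layout (`term`, `fdr`, `labels`) and the function name are assumptions for illustration; the thresholds follow the text, i.e. a minimum group size of at least five samples and, as an example, an FDR below 10%.

```python
def interesting_clusterings(results, max_fdr=0.10, min_group_size=5):
    """Return the terms whose clustering passes the FDR and group-size filters."""
    kept = []
    for r in results:
        # Size of the smaller of the two clusters
        smaller = min(r["labels"].count(0), r["labels"].count(1))
        if r["fdr"] < max_fdr and smaller >= min_group_size:
            kept.append(r["term"])
    return kept


results = [
    {"term": "A", "fdr": 0.01, "labels": [0] * 10 + [1] * 10},  # passes both
    {"term": "B", "fdr": 0.01, "labels": [0] * 18 + [1] * 2},   # tiny subgroup
    {"term": "C", "fdr": 0.50, "labels": [0] * 10 + [1] * 10},  # not significant
]
selected = interesting_clusterings(results)
```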

From the same table, we see that our clustering procedure behaves differently on different datasets. While it finds dozens of annotation-driven clusterings with FDR lower than 10% and size of the small subgroup larger than 5 on most of our evaluation studies, it does not find any clustering in four datasets. In Yeoh et al. (2002) very heterogeneous expression profiles caused by chromosomal aberrations are included, thus leading to a large number of significant annotation-driven clusterings. We observe that our algorithm typically finds fewer annotation-driven clusterings in small datasets, where the minimal group size criterion is more stringent.

The set of annotation-driven clusterings for one project may be quite heterogeneous. Figure 2 illustrates such a case occurring in the study on embryonic brain tumors by Pomeroy et al. (2002). Stratifying these tumors by morphological features is controversial. Hence, they present an interesting field of research for diagnosis on a molecular level. The authors of this study acknowledge that the investigated tumors are very heterogeneous. In accordance with this observation, our method reports clearly differing annotation-driven clusterings. Based on terms widely spread over the whole GO, we determine 23 different gene sets justifying splits of samples into two groups on significantly better grounds than randomly picked genes.


Fig. 2. Annotation-driven clusterings for the study by Pomeroy et al. (2002). Colors code the cluster to which a patient is attributed with respect to the corresponding gene set. In the gene set descriptions to the right of the image, the GO source ontologies of the annotations are indicated by BP (biological process), CC (cellular component) and MF (molecular function). Columns correspond to samples and rows to gene sets. The image is clustered in both directions in order to bring similar clusterings and sample profiles close together. The depicted set of clusterings achieves an FDR of 8.8%.

 
3.4 Sorting out splits with little support
In Section 2.4, we describe in detail how splits are evaluated. The DLD scores used to this end are computed on expression values from the most differential genes in a given split. In order to mask splits that are supported by very few genes, we recommend discarding the n most differential genes when determining DLD scores. We consider such splits to be of little interest and prone to be artifacts. In order to demonstrate the effectiveness of masking, we have analyzed the dataset by Ross et al. (2004) for various settings of this parameter. Figure 3 illustrates our observations.


Fig. 3. Median number of differential genes in top ranked splits. Splits are ranked according to DLD scores computed ignoring the n most differential genes. Rankings are deduced from the dataset by Ross et al. (2004).

 
We have computed DLD scores for each split of the dataset, ignoring 0, 1, ..., 19 genes, respectively. This yields 20 different rankings of splits. For the topmost splits of each ranking, we have computed the number of differentially expressed genes among all genes associated with the corresponding annotation. In Figure 3, the median number of differential genes in the top ranked splits is related to the number of genes ignored when generating the ranking. We see that taking all genes into account in the evaluation of splits places many splits with very few supporting genes at the top of the lists. Ignoring a moderate number of genes improves this behavior, and setting this parameter higher than five no longer improves the wide support of top ranked splits.

3.5 Coherence between clusterings and phenotype
The cited datasets from clinical microarray studies come with clinical information. For instance, in the lung cancer study discussed in Bhattacharjee et al. (2001), histologically defined subtype assignments are provided for the biopsies, while in Ross et al. (2004), cytogenetically determined translocations are given for each patient. In order to assess the clinical relevance of identified significant clusterings, we compare these with clinical parameters, which were not used in the unsupervised class finding procedure. We expect that our method rediscovers clinically relevant patient subgroups and characterizes them with related biological terms. In order to confirm this ability, we have applied χ²-tests to detect annotation-driven clusterings which are highly correlated with categorical clinical parameters.
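
Such a comparison can be sketched with a Pearson χ² test on the contingency table of cluster labels against a categorical clinical parameter. The sketch below is a generic version without continuity correction; the function names and toy data are assumptions, and the tail probability uses the standard closed-form series for integer degrees of freedom.

```python
import math
from collections import Counter


def chi2_statistic(labels, phenotype):
    """Pearson chi-squared statistic and degrees of freedom."""
    n = len(labels)
    rows, cols = Counter(labels), Counter(phenotype)
    observed = Counter(zip(labels, phenotype))
    stat = 0.0
    for r, n_r in rows.items():
        for c, n_c in cols.items():
            expected = n_r * n_c / n
            stat += (observed[(r, c)] - expected) ** 2 / expected
    df = (len(rows) - 1) * (len(cols) - 1)
    return stat, df


def chi2_sf(x, df):
    """Upper tail probability of the chi-squared distribution, integer df."""
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    root = math.sqrt(x)
    total, term = 0.0, root
    for r in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * r + 1)
    return (math.erfc(root / math.sqrt(2))
            + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total)


# Toy example: a clustering that matches a two-level clinical parameter
labels = [0] * 10 + [1] * 10
phenotype = ["subtype_A"] * 10 + ["subtype_B"] * 10
stat, df = chi2_statistic(labels, phenotype)
p_value = chi2_sf(stat, df)
```

With small expected cell counts the χ² approximation is unreliable, so an exact test may be preferable in such cases.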

On several datasets, we observed clusterings of striking correlation with clinical parameters, thus supporting previous findings. For instance, on the acute myeloid leukemia (AML) dataset of Ross et al. (2004), we found 11 patient splits for which the two groups correspond to some phenotypical separation of the samples. Fewer than 10 profiles are attributed inconsistently by these splits to the corresponding phenotypical separation, and χ² contingency table tests yield P-values below 10⁻¹⁰. Seven of these clusterings consistently separate the group of megakaryocytic leukemia profiles plus one other profile described as having an unspecified AML subtype from the other AML subtypes. The seven clusterings stem from gene sets annotated to blood coagulation (GO:0007596) and related GO terms. See Figure 4 for a display of the relationships between the seven GO terms and their ancestors within the GO.


Fig. 4. Clusterings driven by the gene sets associated to the seven emphasized nodes identify acute megakaryocytic leukemia with just one conflicting class assignment in the dataset by Ross et al. (2004). The figure shows the GO subgraph induced by these nodes.

 
On the lung cancer dataset by Bhattacharjee et al. (2001), we identified 17 clusterings showing P-values < 10⁻¹⁰ in the χ²-test and differing by not more than 10 cluster assignments from the corresponding morphological classification of the tumors. Nine of these clusterings separate the group of 20 pulmonary carcinoid tumors from all other tumors. Five of the nine clusterings also assign one or two other profiles to the cluster of carcinoid tumors. The nine clusterings are derived from gene sets annotated to central nervous system development (GO:0007417), ion channel activity (GO:0005216) and related terms.


    4 DISCUSSION
An important goal of clinical microarray studies is the discovery of cohesive subgroups of patients according to molecular criteria. Commonly, unsupervised clustering is employed to this end, although the evaluation of clustering results is notoriously difficult. One way to argue that a clustering is biologically meaningful is to point out that the functional annotations of the genes supporting the clustering are coherent or plausible.

In this article, we propose an algorithm to use functional annotations stored in the GO and the KEGG database of pathways directly to search for cohesive groups of samples. By selecting genes sharing common annotation in GO or KEGG and limiting gene expression profiles to these, we define distinct distances between samples for each term or pathway. Consequently, different clusterings are found for each GO term or KEGG pathway. A notable difference to other approaches to select genes before clustering (e.g. Bullinger et al., 2004) is that the selection stems from independent data, which represent biological expert knowledge and are not affected by experimental variations.

The use of GO and KEGG to extract functional annotations leads to the inclusion of some unreliable data. These databases are always incomplete and their computationally derived annotations may contain errors. However, we expect our approach to be robust against such erroneous annotation data, since clusterings are always supported by several genes with common annotation. Currently, our method is limited to clusterings into two groups. This is mainly done to ensure comparability of clustering scores and to allow for a meaningful statistical analysis. However, our method can easily be applied iteratively taking only the samples in one of the clusters, to see whether a different biological theme could help to divide this one cluster into sub-clusters. Since the iterations of this clustering procedure can rely on different gene sets, the procedure is even more flexible than allowing for more than two clusters in the original algorithm.

In our evaluation on cancer related datasets (see Table 1) we found several significant annotation-driven clusterings, which strongly correlate with clinical patient stratifications. Moreover, the driving genes often confirm previous reports on the biology behind tumor development. For instance, for the AML dataset of Ross et al. (2004), we found a large number of significant clusterings. AML is a heterogeneous disease, comprising abnormal proliferation of the precursors of granulocytes, monocytes and thrombocytes (Jaffe et al., 2001). Thus, it is not surprising to find many significant clusterings separating one type of AML from the rest. For example, seven clusterings that separate AML of the FAB-M7 type, i.e. acute megakaryocytic leukemia, from the other AML types, are based on gene sets attributed to blood coagulation, cell adhesion and five related terms. Since megakaryocytes give rise to thrombocytes, whose primary function is to mediate cell adhesion to damaged endothelium and blood coagulation, they are bound to excel in the expression of genes involved in these processes. Remarkably, one patient profile that was clinically described as having an unspecified AML subtype is consistently assigned to the cluster of FAB-M7 samples. This sample seems to display molecular characteristics of the FAB-M7 subtype, although it would not be assigned to this subtype based on clinical criteria.

In accordance with other studies, Bhattacharjee et al. (2001) have described lung cancer as a general concept comprising very different tumor subtypes. We also observe large biological differences between these subtypes, in the form of significant annotation-driven clusterings. For example, nine clusterings clearly separate pulmonary carcinoid tumors from all other types of lung cancer. These nine clusterings are derived from gene sets annotated to central nervous system development (GO:0007417), ion channel activity (GO:0005216) and seven related terms. Pulmonary carcinoid tumors have previously been reported to be of neuroendocrine origin and to be closely related to brain tumors (Anbazhagan et al., 1999). Our finding that these tumors markedly express nerve-cell-associated genes supports such reports.

In summary, the method presented in this article has the potential to uncover clinically relevant clusterings in gene expression studies. Moreover, such clusterings may be of particular interest due to the biological focus of their supporting genes.


    ACKNOWLEDGEMENTS
 
The authors are grateful to Jochen Jäger, Dennis Kostka, Stefanie Scheid and Stefan Bentink from our work group as well as to our partners Renate Kirschner-Schwabe, Christian Hagemeier and Karl Seeger from the Charité Medical Center for fruitful discussions. This research has been supported by BMBF grants 01GS0445 and 01GR0455 of the German Federal Ministry of Education and the National Genome Research Network.

Conflict of Interest: none declared.


    FOOTNOTES
 
†Present address: EMBL – European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

Received on October 24, 2006; revised on February 8, 2007; accepted on June 11, 2007

    REFERENCES
 

    Adryan B, Schuh R. Gene-ontology-based clustering of gene expression data. Bioinformatics (2004) 20:2851–2852.

    Alexa A, et al. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics (2006) 22:1600–1607.

    Alizadeh A, et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000) 403:503–511.

    Anbazhagan R, et al. Classification of small cell lung cancer and pulmonary carcinoid by gene expression profiles. Cancer Res (1999) 59:5119–5122.

    Armstrong S, et al. MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat. Genet (2002) 30:41–47.

    Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet (2000) 25:25–29.

    Bair E, Tibshirani R. Semi-supervised methods to predict patient survival from gene expression data. PLoS Biol (2004) 2:E108.

    Beer DG, et al. Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat Med (2002) 8:816–824.

    Beissbarth T, Speed T. GOstat: find statistically overrepresented gene ontologies within a group of genes. Bioinformatics (2004) 20:1464–1465.

    Benjamini Y, Hochberg Y. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (1995) 57:289–300.

    Bhattacharjee A, et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl Acad. Sci. USA (2001) 98:13790–13795.

    Bullinger L, et al. Use of gene-expression profiling to identify prognostic subclasses in adult acute myeloid leukemia. N. Engl. J. Med (2004) 350:1605–1616.

    Cario G, et al. Distinct gene expression profiles determine molecular treatment response in childhood acute lymphoblastic leukemia. Blood (2005) 105:821–826.

    Cheng Y, Church G. Biclustering of expression data. In: Intelligent System in Molecular Biology (2000) 93–103.

    Cheok M, et al. Treatment-specific changes in gene expression discriminate in vivo drug response in human leukemia cells. Nat. Genet (2003) 34:85–90.

    Dennis GJ, et al. DAVID: database for annotation, visualization, and integrated discovery. Genome Biol (2003) 4:P3.

    Doniger S, et al. MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biol (2003) 4:R7.

    Dudoit S, Fridlyand J. A prediction-based resampling method for estimating the number of clusters in a dataset. Genome Biol (2002) 3:R36.

    Freije W, et al. Gene expression profiling of gliomas strongly predicts survival. Cancer Res (2004) 64:6503–6510.

    Gentleman R, et al. Bioconductor: open software development for computational biology and bioinformatics. Genome Biol (2004) 5:R80.

    Grossmann S, et al. An improved statistic for detecting over-represented gene ontology annotations in gene sets. In: Research in Computational Molecular Biology: 10th Annual International Conference, Proceedings of RECOMB 2006, Venice, Italy, April 2-5, 2006 (Apostolico A, et al, eds.) (2006) Lecture Notes in Computer Science 3909. Heidelberg: Springer. pp. 85–98.

    Halkidi M, et al. On clustering validation techniques. J. Intell. Inform. Sys (2001) 17:107–145.

    Hartigan J, Wong MA. A k-means clustering algorithm. Applied Statistics (1979) 28:100–104.

    Hastie T, et al. The Elements of Statistical Learning (2001) New York: Springer. Springer Series in Statistics.

    Huang E, et al. Gene expression predictors of breast cancer outcomes. Lancet (2003) 361:1590–1596.

    Huber W, et al. Variance stabilization applied to microarray data calibration and to the quantification of differential expression. Bioinformatics (2002) 18(Suppl. 1):96–104.

    Ihaka R, Gentleman R. R: a language for data analysis and graphics. J. Comput. Graph. Stat (1996) 5:299–314.

    Irizarry R, et al. Exploration, normalization, and summaries of high density oligonucleotide array probe level data. Biostatistics (2003) 4:249–264.

    Jaffe E, et al, eds. World Health Organization Classification of Tumours. Pathology and Genetics of Tumours of Haematopoietic and Lymphoid Tissues (2001) Lyon, France: IARC Press.

    Kanehisa M. Toward pathway engineering: a new database of genetic and molecular pathways. Sci. & Technol Japan (1996) 59:34–38.

    Kaufman L, Rousseeuw PJ. Finding Groups in Data: An Introduction to Cluster Analysis (1990) New York: Wiley.

    Kerr M, Churchill GA. Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proc. Natl Acad. Sci. USA (2001) 98:8961–8965.

    Lange T, et al. Stability-based validation of clustering solutions. Neural Comput (2004) 6:1299–1323.

    Lottaz C, Spang R. Molecular decomposition of complex clinical phenotypes using biologically structured analysis of microarray data. Bioinformatics (2005) 21(9):1971–1978.

    Lottaz C, et al. Annotation-driven class discovery. In: Technical report 2005/02 MPI for molecular genetics (2005) Berlin, Germany.

    MacQueen JB. Some methods for classification and analysis of multivariate observations. In: Symposium on Math, Statistics, and Probability (1967) 1:281–297.

    Madeira S, Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinformatics (2004) 1:24–45.

    Mardia K, et al. Multivariate Analysis (1979) San Diego, CA, USA: Academic Press.

    McShane L, et al. Methods for assessing reproducibility of clustering patterns observed in analyses of microarray data. Bioinformatics (2002) 18:1462–1469.

    Milligan G, Sokol L. A two stage clustering algorithm with robust recovery characteristics. Educ. Psychol. Meas (1980) 40:755–759.

    Monti S, et al. Consensus clustering: a resampling-based method for class discovery and visualization of gene expression microarray data. Mach. Learn (2003) 52:91–118.

    Monti S, et al. Molecular profiling of diffuse large B-cell lymphoma identifies robust subtypes including one characterized by host inflammatory response. Blood (2005) 105:1851–1861.

    Munneke B, et al. Adding confidence to gene expression clustering. Genetics (2005) 107:2003–2011.

    Nutt C, et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res (2003) 63:1602–1607.

    Pavlidis P, et al. Exploring gene expression data with class scores. In: Proceedings of the Pacific Symposium on Biocomputing (2002) 474–485.

    Pomeroy S, et al. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature (2002) 415:436–442.

    R Development Core Team. R: A language and environment for statistical computing (2005) Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0.

    Rahnenführer J, et al. Calculating the statistical significance of changes in pathway activity from gene expression data. Stat. Appl. Genet. Mol. Biol (2004) 3.

    Rickman D, et al. Distinctive molecular profiles of high-grade and low-grade gliomas based on oligonucleotide microarray analysis. Cancer Res (2001) 61:6885–6891.

    Ross M, et al. Gene expression profiling of pediatric acute myelogenous leukemia. Blood (2004) 104:3679–3687.

    Roth V, Lange T. Feature selection in clustering problems. In: Advances in Neural Information Processing Systems 16 (Thrun S, et al, eds.) (2004) Cambridge, MA: MIT Press.

    Schacherer F, et al. The TRANSPATH signal transduction database: a knowledge base on signal transduction networks. Bioinformatics (2001) 17:1053–1057.

    Schuler G. Pieces of the puzzle: expressed sequence tags and the catalog of human genes. J. Mol. Med (1997) 75:694–698.

    Singh D, et al. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell (2002) 1:203–209.

    Speed T, ed. Statistical Analysis of Gene Expression Microarray Data (2003) Florida, USA: Chapman & Hall/CRC.

    Subramanian A, et al. Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl Acad. Sci. USA (2005) 102:15545–15550.

    Tanay A, et al. Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc. Natl Acad. Sci. USA (2004) 101:2981–2986.

    Tukey JW. Exploratory Data Analysis (1977) Reading, MA, USA: Addison-Wesley.

    Varma S, Simon R. Iterative class discovery and feature selection using Minimal Spanning Trees. BMC Bioinformatics (2004) 5:126.

    von Heydebreck A, et al. Identifying splits with clear separation: a new class discovery method for gene expression data. Bioinformatics (2001) 17(Suppl. 1):S107–S114.

    West M, et al. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl Acad. Sci. USA (2001) 98:11462–11467.

    Willenbrock H, et al. Prediction of immunophenotype, treatment response, and relapse in childhood acute lymphoblastic leukemia using DNA microarrays. Leukemia (2004) 18:1270–1277.

    Yeoh E, et al. Classification, subtype discovery, and prediction of outcome in pediatric ALL by gene expression profiling. Cancer Cell (2002) 1:133–143.

    Zien A, et al. Analysis of gene expression data with pathway scores. Proc. Int. Conf. Intell. Syst. Mol. Biol (2000) 8:407–417.

