Bioinformatics Advance Access originally published online on May 30, 2007
Bioinformatics 2007 23(15):1995-2003; doi:10.1093/bioinformatics/btm261

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Clustering microarray-derived gene lists through implicit literature relationships

Mark F. Burkart 1,*, Jonathan D. Wren 2, Jason I. Herschkowitz 3,4, Charles M. Perou 3,4,5 and Harold R. Garner 1

1Departments of Internal Medicine and Biochemistry, The McDermott Center for Human Growth and Development, Division of Translational Research, The University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd., Dallas, Texas 75390, 2Arthritis & Immunology Program, Oklahoma Medical Research Foundation, 825 N.E. 13th Street, Oklahoma City, Oklahoma 73104, 3Lineberger Comprehensive Cancer Center, 4Department of Genetics and 5Department of Pathology & Laboratory Medicine, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA.

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Microarrays rapidly generate large quantities of gene expression information, but interpreting such data within a biological context is still relatively complex and laborious. New methods that can identify functionally related genes via shared literature concepts will be useful in addressing these needs.

Results: We have developed a novel method that uses implicit literature relationships (concepts related via shared, intermediate concepts) to cluster related genes. Genes are evaluated for implicit connections within a network of biomedical objects (other genes, ontological concepts and diseases) that are connected via their co-occurrences in Medline titles and/or abstracts. On the basis of these implicit relationships, individual gene pairs are scored using a probability-based algorithm. Scores are generated for all pairwise combinations of genes, which are then clustered based on the scores. We applied this method to a test set composed of nine functional groups with known relationships. The method scored highly for all nine groups and significantly better than a benchmark co-occurrence-based method for six groups. We then applied this method to gene sets specific to two previously defined breast tumor subtypes. Analysis of the results recapitulated known biological relationships and identified novel pathway relationships unique to each tumor subtype. We demonstrate that this method provides a valuable new means of identifying and visualizing significantly related genes within gene lists via their implicit relationships in the literature.

Contact: mark.burkart{at}utsouthwestern.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
DNA microarray experiments can be used to study the expression levels of thousands of genes for the analysis of cell signaling pathways, disease marker discovery and research and development of therapeutics (Cooper, 2001). However, interpretation of the subsequent results within biological context is often daunting, due to one's limited knowledge of specific genes and the typically complex nature of biological relationships.

Analysis of microarray data sets generally begins with unsupervised clustering of genes based on expression patterns, or supervised analysis to identify gene sets, followed by retrieval of gene list annotations with ontological descriptions (Hosack et al., 2003). However, while ontology definitions, such as those from the Gene Ontology Consortium (Ashburner et al., 2000), can help provide insights into the properties and functions of individual genes, they are often incomplete, lacking information related to specific molecular interactions, disease states or associated phenotypes (Khatri and Draghici, 2005). Electronically available Medline abstracts, on the other hand, offer a more comprehensive source of information that can be mined for more diverse and biologically relevant relationships.

Several methods that use Medline-derived relationships to group functionally related genes have been reported (Shatkay and Feldman, 2003). Jenssen et al. (2001) grouped genes by identification of co-occurring gene pairs in Medline abstracts to construct gene relationship networks. Chaussabel and Sher (2002) and Alako et al. (2005) identified shared terms or concepts that co-occurred with gene names in Medline abstracts and then used those relationships to cluster both the genes (based on their concept score profiles) and the concepts (based on their gene score profiles). Jelier et al. (2005) used both gene co-occurrences and co-occurrences of genes with shared concepts to map genes in a Euclidean space in which distance represented semantic relatedness. Relationships involving indirect linkage of concepts via co-occurrence in the literature have been termed implicit (Swanson, 1986) and are useful for literature-driven discovery (Hristovski et al., 2003; Srinivasan and Libbus, 2004; Weeber et al., 2003; Wren et al., 2004).

Here, we describe a method that uses implicit literature relationships to score pairs of genes for relatedness and subsequently cluster the full gene set based on these scores. In our method, biomedical concepts (objects) are connected to each other in a network by their mutual co-occurrences in Medline titles and abstracts. Within this object network genes are evaluated for implicit connections through other network objects and scored for relatedness to other genes by the ratio of their observed/expected implicit connections in the network using probabilistic methods. The relatedness scores for all gene pairs in the set are then used to cluster the gene set.

Our method used a thesaurus of primary names and synonyms derived from electronically available bioscience-oriented databases to efficiently map terms with spelling variations, synonyms or aliases to a single corresponding database object. Thesauri have been previously used to increase the sensitivity of the analysis (Alako et al., 2005; Jelier et al., 2005). We tested our implicit analysis method by grouping control sets of genes having known functional relationships and then comparing the results to a gene co-occurrence-based method as a benchmark. The implicit analysis compared favorably against the gene co-occurrence-based method in all control sets, indicating the general efficacy of the method.

We next applied this method to microarray-derived gene sets characteristically expressed by Basal-like and Luminal breast tumor subtypes (Hu et al., 2006; Sorlie et al., 2003). The analysis identified gene clusters with functional relationships unique to each tumor subtype. Several of these relationships corresponded to previously described, tumor-specific phenotypes.


    2 METHODS
2.1 Construction of the literature-derived network
Both the literature-derived network of biomedical objects used in this study and the methodology used in filtering and scoring connections have been previously described (Wren et al., 2004). The network consisted of a collection of concepts (biomedical objects) extracted from electronically available, curated biomedical databases, including Locus Link, HGNC, GDB, OMIM and GO. Objects were classified as genes, diseases or ontologies based on the source: genes—Locus Link, HGNC, GDB; diseases—OMIM; ontologies—GO. Classifications allowed for refined filtering of network associations when context-specific associations were desired. Only human genes were used in this study. An object's definition consisted of both a primary name and any synonyms derived from the source database (thesaurus), allowing concepts identified within texts to be matched to corresponding primary database objects.

Titles and abstracts from over 15 million Medline records dating from 1967 to 2005 were processed to catalogue all co-occurrences of objects. Co-occurrences for each pair of objects were totaled and classified as either sentence or abstract co-occurrences. Records were stored in a Microsoft Access 2003 database, and queries were executed by SQL statements with partial automation by VBA macros.
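The cataloguing step can be illustrated with a minimal sketch. It assumes a tiny hypothetical thesaurus (`THESAURUS`), single-token synonyms, a naive sentence splitter and at most one count per record at each level; none of these details come from the original Access/VBA implementation.

```python
import re
from collections import Counter
from itertools import combinations

# Hypothetical thesaurus: lower-case synonym -> primary object name.
THESAURUS = {
    "tp53": "TP53",
    "p53": "TP53",
    "brca1": "BRCA1",
    "apoptosis": "apoptosis",
}

def objects_in(text, thesaurus=THESAURUS):
    """Return the primary objects whose name or synonym appears in the text."""
    tokens = re.findall(r"[a-z0-9\-]+", text.lower())
    return {thesaurus[t] for t in tokens if t in thesaurus}

def pairs_of(objects):
    """All unordered object pairs, in a canonical (sorted) order."""
    return set(combinations(sorted(objects), 2))

def count_cooccurrences(records, thesaurus=THESAURUS):
    """Tally abstract-level and sentence-level co-occurrences per object pair.

    Each record contributes at most one count per pair at each level
    (an assumption made for this sketch).
    """
    abstract_counts, sentence_counts = Counter(), Counter()
    for text in records:
        abstract_counts.update(pairs_of(objects_in(text, thesaurus)))
        sentence_pairs = set()
        # Naive sentence split on terminal punctuation (assumption).
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            sentence_pairs |= pairs_of(objects_in(sentence, thesaurus))
        sentence_counts.update(sentence_pairs)
    return abstract_counts, sentence_counts
```

Mapping every synonym to its primary name before counting is what lets spelling variants such as "p53" and "TP53" contribute to the same object pair.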

2.2 Implicit analysis and clustering process
For a given list of genes, the literature-derived database was queried to identify gene objects matching identifiers in the gene list. The matching identifiers produced a query list of genes that were found to be present in the database. The database was then searched to identify all other objects found to co-occur in Medline records with any of the gene objects in the query list. This produced a list of co-occurrences between the query genes and other objects in the database (Fig. 1A). Filters were applied to remove co-occurrences from the list that did not achieve preset cutoff values (filtering is discussed in depth in Section 2.3).


Fig. 1. Measurement of pairwise observed/expected scores for a set of genes. (A) A gene set (gray circles) was examined for co-occurring biomedical objects (connecting lines to white circles). (B) Individual gene pairs were compared to identify both unshared (white circles) and shared objects (black circles). Shared objects formed implicit relationships. An observed/expected score was calculated based on shared and unshared objects for each gene pair (see Section 2.4). (C) Subsequent pairs were compared and scored. This process was repeated until all gene pairings were scored (D). A pairwise matrix was populated with the scores (gray cells must also be filled; values determined in (B) and (C) are shown filled).

 
An implicit connection between two genes resulted when two genes in the co-occurrence list shared a co-occurrence with a common object (Fig. 1B and C). The common object thus served as the intermediate in the implicit connection between the two genes. This differs from co-occurrence-based methods of identifying related genes, since relationships are made through intermediate, shared objects rather than through a literature co-occurrence of the two genes. For all possible pairings of genes in the query list, the number of observed implicit connections was counted. The number of expected implicit connections between each gene pair was then calculated as the number expected in an equally sized, randomly connected network (scoring function is discussed in Section 2.4). From the two values an observed/expected ratio was determined for each gene pair. Pairs with no observed implicit connections received a score of zero.
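Counting observed implicit connections amounts to intersecting the sets of objects that co-occur with each gene. The sketch below assumes the filtered co-occurrence list is already available as a mapping from gene to a set of object names; the function name and data layout are illustrative only.

```python
from itertools import combinations

def implicit_connections(cooccurring, query_genes):
    """Return, for each gene pair, the set of shared intermediate objects.

    cooccurring maps each query gene to the set of objects it co-occurs
    with after filtering. The two genes of a pair are excluded from their
    own intermediates, so a direct co-occurrence between them is not
    counted as an implicit connection.
    """
    shared = {}
    for a, c in combinations(sorted(query_genes), 2):
        objs_a = cooccurring.get(a, set()) - {a, c}
        objs_c = cooccurring.get(c, set()) - {a, c}
        shared[(a, c)] = objs_a & objs_c
    return shared
```

The observed count for a pair is then `len(shared[(a, c)])`. A pair with an empty set scores zero.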

When the observed/expected scores were rank sorted and graphed, the shape followed a power law distribution in which the top-ranking scores increased exponentially with respect to the rest of the distribution. Scores above the 95th percentile were therefore capped at the observed/expected ratio found at that percentile, to prevent very high values from masking those in the rest of the distribution. Each gene's pairwise self-identity score was set equal to this threshold score to represent the highest attainable score, and a symmetric matrix of observed/expected scores was created from this list (Fig. 1D). Hierarchical clustering (average linkage using standard Pearson correlation) of the pairwise score matrix was performed using Cluster (Eisen et al., 1998) and visualized using Java Treeview (Saldanha, 2004).
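The thresholding and matrix construction can be sketched as follows. The nearest-rank percentile definition and the function names are assumptions; the text does not state how the percentile was computed. The resulting matrix would then be passed to a clustering tool such as Cluster.

```python
import math

def cap_at_percentile(scores, percentile=95.0):
    """Cap pairwise scores at the value found at the given percentile.

    scores maps (gene_a, gene_b) -> observed/expected ratio.
    Nearest-rank percentile is used here (assumption).
    """
    ranked = sorted(scores.values())
    rank = max(1, math.ceil(percentile / 100.0 * len(ranked)))
    threshold = ranked[rank - 1]
    capped = {pair: min(s, threshold) for pair, s in scores.items()}
    return capped, threshold

def score_matrix(genes, capped, threshold):
    """Build a symmetric gene-by-gene matrix with the threshold on the diagonal."""
    index = {g: i for i, g in enumerate(genes)}
    n = len(genes)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        matrix[i][i] = threshold  # self-identity = highest attainable score
    for (a, b), s in capped.items():
        matrix[index[a]][index[b]] = s
        matrix[index[b]][index[a]] = s
    return matrix
```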

2.3 Network connection filters
A set of filters was applied to object connections within the network to specify the types of relationships used in the analysis. Previous studies had shown that pairs of objects that co-occurred more frequently in abstracts were more likely to be meaningfully related and that objects that co-occurred rarely were more likely to have relationships of a trivial nature (Alako et al., 2005; Jenssen et al., 2001; Wren et al., 2004). Wren et al. (2004) experimentally determined false-positive error rates of 42% for abstract co-occurrences and 17% for sentence co-occurrences, similar to rates reported by Ding et al. (2002). Following these studies, object connections resulting from fewer than three abstract or two sentence co-occurrences were omitted from the analysis.

During testing, it was observed that objects with relatively high connection frequencies forming implicit connections between genes were less likely to represent real or significant relationships between the genes. For example, the object ‘chromosome’ has 31 951 connections to other objects in the database, whereas ‘Y-chromosome’ has only 2848. Many genes would be expected to co-occur with the object ‘chromosome’, and any resulting implicit relationship would probably be deemed less interesting than an implicit connection formed via the more specific object, ‘Y-chromosome’. While object frequencies are used to weight connections in the scoring function, very common shared objects can still occasionally result in misleading or uninteresting associations between genes. Therefore, as a trade-off between specificity and sensitivity, a filter was applied to remove implicit connections formed via objects having more than 5000 connections in the network.
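Both filters are absolute cutoffs and can be written as simple predicates. The cutoff values below come from the text. The reading that a connection is kept when it meets either the abstract or the sentence cutoff is an assumption, as are the function names.

```python
MIN_ABSTRACT = 3        # fewer than three abstract co-occurrences: omitted
MIN_SENTENCE = 2        # fewer than two sentence co-occurrences: omitted
MAX_CONNECTIONS = 5000  # intermediates with more connections are removed

def keep_connection(abstract_count, sentence_count):
    """Low-frequency filter. Assumed reading: meeting either cutoff is enough."""
    return abstract_count >= MIN_ABSTRACT or sentence_count >= MIN_SENTENCE

def usable_intermediate(connection_count):
    """High-frequency filter on the shared (intermediate) object."""
    return connection_count <= MAX_CONNECTIONS
```

With the connection counts quoted above, `usable_intermediate(31951)` rejects 'chromosome' while `usable_intermediate(2848)` accepts 'Y-chromosome'.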

Implementation of these filters is similar to the TF*IDF (term frequency × inverse document frequency) weightings used in natural language processing and information retrieval. The low sentence/abstract co-occurrence filter is similar to TF because object frequencies in abstracts can be used to assign importance to objects, and the high connection frequency filter is similar to IDF in that common words are assigned reduced values. However, the filters are applied as absolute cutoffs whereas TF*IDF weightings are relative. Relative weighting is employed within the scoring function for object connections passing the filter settings.

2.4 Scoring function
For a gene list, every possible gene pairing was scored for relatedness based on their shared implicit relationships. Scores were determined using an observed/expected ratio, which was the observed number of implicit connections between the genes in the actual network divided by the number expected in an equally sized, randomly connected network. As previously described by Wren et al. (2004), the probability, P, of a direct connection between a gene (A), and any other object in a random network (B), can be estimated using the formula:


Formula 1

(1)
where KA and KB are the number of network connections for A and B, respectively, and Nt is the total number of objects in the network. An implicit connection between a pair of genes is the combination of two such direct connections. Therefore, the probability that two genes (A and C), will be implicitly connected via another object, B, can be estimated as:


Formula 2

(2)

The expected number of connections, E, for a given gene pair in a random network can be estimated by summing the individual probabilities for each possible implicit connection:


Formula 3

(3)

The number of implicit connections that are possible, Bn, is taken to be the size of the union of all unique Bi connected to either A or C in the real network, where Ba is the set of all B connected to A, and Bc is the set of all B connected to C. For every gene pairing in the list, both the expected number of random connections and the number of real implicit connections are determined to obtain observed/expected scores.
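The structure of the scoring function can be sketched in code. Equation (1) is not reproduced in this text, so `direct_probability` below is an illustrative stand-in built only from the quantities the text names (the connection counts of the two objects and the total number of objects). It is an assumption, not the published formula, and can be swapped out via the `p_direct` argument. Treating the implicit probability as the product of two direct probabilities is also an assumed reading of "combination".

```python
def direct_probability(k_a, k_b, n_total):
    """ASSUMED stand-in for Equation (1): chance of a direct A-B connection."""
    return min(1.0, (k_a * k_b) / float(n_total) ** 2)

def expected_connections(a, c, neighbors, degree, n_total,
                         p_direct=direct_probability):
    """Sum implicit-connection probabilities over all candidate intermediates.

    Candidates are the union of objects connected to A or to C in the real
    network; each term is P(A-B) * P(B-C) (assumed product form).
    """
    candidates = (neighbors[a] | neighbors[c]) - {a, c}
    return sum(
        p_direct(degree[a], degree[b], n_total)
        * p_direct(degree[b], degree[c], n_total)
        for b in candidates
    )

def observed_expected(a, c, neighbors, degree, n_total,
                      p_direct=direct_probability):
    """Observed/expected ratio for one gene pair; zero if nothing is shared."""
    observed = len((neighbors[a] & neighbors[c]) - {a, c})
    if observed == 0:
        return 0.0
    expected = expected_connections(a, c, neighbors, degree, n_total, p_direct)
    return observed / expected
```

Because highly connected intermediates raise the expected count, a pair linked only through very common objects scores lower than a pair linked through specific ones, which is the relative weighting described in Section 2.3.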

2.5 Selection of functionally related gene sets
To test the efficiency of the implicit analysis method, nine control groups of genes with known functional relationships (Table 1) were selected from Biocarta (http://www.biocarta.com), GO (http://www.geneontology.org) and MeSH (http://www.nlm.nih.gov/mesh/meshhome.html). The three lists of genes from each database were non-overlapping and represented those related by function within particular cell signaling pathways (Biocarta), common ontologies (GO) and implication in a particular disease (MeSH). Genes for the MeSH disease categories were derived by selecting the top 10 genes found to co-occur with each disease in Medline records. The control groups were selected based on their diversity of functional relationship types and because they were relatively unambiguous and distinct.


Table 1. Pathways, Gene Ontologies and diseases used as control groups

 
2.6 Evaluation
The methodology used for performance measurements was similar to that used by Jelier et al. (2005) for assessing their associative concept space (ACS) method's performance for grouping genes. The nine control groups were combined into a single group of 115 genes to measure the efficiency of the implicit analysis method in identifying the original control group from which each gene was originally derived. The implicit analysis was performed as described earlier to obtain pairwise observed/expected scores. For each gene, the pairwise Pearson correlations generated by the clustering program were used to rank sort all other genes by the correlation coefficient. For a method to be considered a good measure of relatedness between genes, genes from a gene's particular functional group were expected to be top ranked for that gene.

The implicit analysis method was compared to gene co-occurrence rankings as a performance benchmark. For the gene co-occurrence method, the number of Medline co-occurrences between all gene pairs within the set was determined. For each gene, all other genes were rank sorted from greatest to least co-occurrences with the gene. For the purpose of the comparison, genes from a particular gene's control group were termed ‘positives’ and genes not in the same functional group were termed ‘negatives’. A receiver operating characteristic (ROC), which plots the true-positive rate (correctly identified positives divided by all positives) against the false-positive rate (incorrectly classified negatives divided by all negatives), was calculated for each gene from the rank-sorted list. The area under the ROC curve (AUC) was used as a measure of the method's performance for each gene (Hanley and McNeil, 1982). AUC values range from 0 to 1, with 0.5 representing random sorting, 1 representing perfect sorting (all of a gene's group members are rank sorted in the top ranked slots) and 0 representing the worst possible sorting. To test for significance between the two methods, the AUCs for each group were compared using the Wilcoxon signed-ranks test, a non-parametric alternative to the paired Student's t-test. Because the AUC distributions were weakly dependent and skewed, the P-values from the Wilcoxon test are only approximations. More accurate P-values would normally be obtained through bootstrapping to simulate data independence; however, this was not feasible because of the large number of repeated analyses that would need to be individually and manually analyzed using substituted gene sets.
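The per-gene AUC can be computed directly from a rank-sorted list without drawing the curve, since it equals the fraction of positive/negative pairs in which the positive is ranked higher. The sketch below assumes ties have already been broken by list order, which matters for co-occurrence counts where many genes share a count of zero; the function name is illustrative.

```python
def roc_auc(ranked_labels):
    """AUC for a rank-sorted list of booleans, best-ranked first.

    True marks a gene from the same control group (a positive),
    False marks a gene from any other group (a negative).
    """
    positives = sum(1 for label in ranked_labels if label)
    negatives = len(ranked_labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative")
    negatives_seen = 0
    correctly_ordered = 0
    for is_positive in ranked_labels:
        if is_positive:
            # Every negative not yet seen is ranked below this positive.
            correctly_ordered += negatives - negatives_seen
        else:
            negatives_seen += 1
    return correctly_ordered / (positives * negatives)
```

A list with all positives first gives 1, all positives last gives 0, and an alternating list such as `[True, False, True, False]` gives 0.75.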

2.7 Selection of biologically relevant gene sets
Gene lists used in this study were selected from a recent microarray study of 105 different human breast tumors (Hu et al., 2006). Hierarchical clustering analysis of the expression data resulted in successful identification of four previously defined tumor subtypes: Basal-like, Luminal, HER2+/ER- and Normal Breast-like (see Supplementary Fig. 1 for the complete cluster diagram). We performed two-class unpaired comparisons using significance analysis of microarrays (SAM) (Tusher et al., 2001) for each subtype versus all other tumors individually at a ≤5% false discovery rate to identify genes specific for the subtypes. Subtype-specific genes were selected from the Basal-like and Luminal sets for this study. There were no shared genes between the sets. The data discussed in this publication are available in NCBI's Gene Expression Omnibus (Barrett et al., 2005) (http://www.ncbi.nlm.nih.gov/geo/, GEO series accession number GSE1992).


    3 RESULTS
3.1 Gene set identification by functional relationships
To test the effectiveness of the implicit analysis method for grouping genes, we developed a test set in which nine control gene lists representing different biological pathways, ontologies or disease states were combined (Table 1). The goal of the test was to determine if the genes in the test set could be grouped with functionally related genes from their original control groups. To be assured that a ‘correct’ assignment of a gene would be to the original control group, it was necessary to determine that the original groups were sufficiently distinct that for most genes a correct gene assignment would be to a single group. The rate of gene co-occurrences between control groups was analyzed and found to be 3.81%. This was deemed sufficiently low to assume that positive assignment of genes to their respective categories would be relatively unambiguous. The set was analyzed using both the implicit analysis method and the gene co-occurrence method, in which genes’ relatedness is assessed by the number of co-occurrences observed between gene pairs in abstracts. The co-occurrence method was used as a benchmark because it has been previously used for assessing methods of literature analysis for gene lists (Jelier et al., 2005), and is the most direct approach to finding gene–gene relationships in the literature.

For the combined set of 115 genes, 111 had implicit relationships to other genes in the set, and 102 genes had co-occurrences with other genes in the set. Four genes were not detected by either method, and it was found that these were not discussed in any Medline abstract. Nine genes had only implicit relationships to other genes in the set. These included, e.g. DYRK1B, which is part of the sonic hedgehog pathway (Mao et al., 2002), BRIP1, which is a DNA helicase (Cantor et al., 2001), and CNFN, which is part of the cornified cell envelope (Michibata et al., 2004). Thus, the implicit method was able to capture ~9% more genes for subsequent analysis, because some of the genes could be related only implicitly.

The 102 genes that were identified using both methods were used to compare the ability of the methods for grouping genes in the combined set with those in their original control groups. For each gene, the other 101 genes were ranked by either pairwise correlations determined using the implicit clustering method or by the number of observed pairwise co-occurrences. ROC curves were generated for each gene. In ROC graphs, the true positive (within control group) rate is plotted against the false-positive (not within control group) rate for each rank to generate a curve. Area under the curve (AUC) can be measured to assess the quality of the rankings for each gene (Hanley and McNeil, 1982). AUC values ranged from 0 to 1, with 1 representing perfect ranking of control genes, 0.5 representing random ranking and 0 the worst possible ranking.

The average of the median AUC values for the implicit method exceeded 0.94 for all nine functional groups, with none falling below 0.85 (Fig. 2).


Fig. 2. Performance of the implicit analysis versus co-occurrence methods and random grouping. For each of the nine functional groups, the median AUC score and SD are shown for the implicit (diamonds) and co-occurrence (boxes) methods. Asterisks above columns denote statistical significance for a difference between the two methods. P-values were: apoptosis 0.0005, adhesion 0.3652, sonic hedgehog 0.9375, cornified cell envelope 0.0312, DNA helicase <0.0001, telomere maintenance 0.0078, nephroblastoma 0.002, retinitis pigmentosa 0.002 and chronic pancreatitis 0.1934.

 
Median AUC was 0.75 for co-occurrence with one group scoring just above random, 0.56. The median AUC for co-occurrence was marginally better than the implicit method for only one group (Biocarta adhesion and diapedesis of lymphocytes). This difference, however, was not statistically significant. The implicit method out-performed co-occurrence for six groups at or above the 0.05 significance level, as determined by the Wilcoxon signed-ranks test. A notable example of improved performance was the GO cornified cell envelope category, for which perfect sorting was achieved for genes in the group using the implicit method whereas co-occurrence achieved 0.65.

Gene pairs in each control group were analyzed to determine rates of implicit relationships and co-occurrences. Implicit- or co-occurrence-related gene pairs were summed for each group and expressed as the percentage of all possible pairwise connections. It was found that a higher percentage of genes in each group were implicitly related than were related by co-occurrence (Table 2). All intra-group pairs related by co-occurrence were simultaneously implicitly related with the exception of a single gene pair in the DNA helicase category. The average percentages of gene pairs related implicitly and by co-occurrence were ~89% and ~51%, respectively; thus, an average of ~38% of gene pairs were related only implicitly across all sets. This demonstrates that related genes can frequently be identified through implicit relationships where co-occurrences do not exist. Significantly, it also shows that a large percentage (at least ~38%) of the implicit relationships could not have arisen solely from objects mentioned within the same abstract, i.e. where two genes are mentioned together with some other intermediate object in an abstract.


Table 2. Breakdown of relationships for gene pairs within control groups

 
The median AUC by the co-occurrence method was found to be highly correlated with the percentage of co-occurring gene pairs (r² = 0.96); thus, the performance of the co-occurrence method was generally dependent on the percentage of co-occurring gene pairs within the control group. For those groups in which greater than ~71% of gene pairs co-occurred, the implicit method did not significantly out-perform the co-occurrence method. However, for all other groups the implicit method significantly out-performed the co-occurrence method (Table 2).

Thus, the implicit method appears to be able to infer relationships between genes where less literature directly mentions genes together in abstracts. Since the scientific literature is a perpetual work in progress and many related genes will not co-occur in abstracts, implicit methods may find related genes in their absence.

We next examined the performance of the implicit method while restricting object types used as intermediates in the implicit connections. Not surprisingly, genes sharing functional relationships within a specific context (gene, ontology or disease-related) were most efficiently identified using implicit relationships of the same type (Table 3). For example, gene-type implicit connections worked best for cell signaling pathways and ontology-type implicit connections worked best for GO categories. Interestingly, sensitivity of the method when using all three object types was generally equal to or better (6/9 groups) than the most effective single object type. Thus, using relationships of different contexts to analyze gene sets should in some cases improve performance, particularly when the contexts most relevant to a given set are unknown.


Table 3. Method performance using different object type intermediates

 
3.2 Implicit analysis of breast tumor-specific gene sets
In order to test the implicit analysis using biologically relevant gene sets, we analyzed gene lists specific to two breast tumor subtypes, Basal-like and Luminal, previously identified in gene expression studies (Hu et al., 2006). The lists shared no common genes. Cancer gene sets were chosen because of the central role altered gene expression has in the development and pathology of the disease and because literature relationships between the genes could provide insight into disease mechanisms. Subtype-specific gene sets were compared to determine if insights unique to the specific subtypes could be provided.

The two gene lists were clustered based on pairwise observed/expected scores as described in the Methods section. Clustering on both axes produced symmetric, gene-x-gene arrays in which the magnitude of the observed/expected score was represented by the color intensity of the cells (Fig. 3). Groups of genes having high correlations between their respective pairwise scores were generally situated together along the array diagonal and formed identifiable clusters.


Fig. 3. (A) Basal-like tumor gene set clustered by implicit analysis method with gene-type intermediate connections. Clusters discussed in text were: (a) NF-κB signaling pathway; (b) Wnt signaling; (c) DNA repair; (d) DNA synthesis; (e) cell cycle control; (f) keratins and (g) kallikreins. Proliferation expression cluster genes are represented below the array by green bars; blue bars represent other genes from the Basal-like expression cluster. (B) Luminal tumor gene set clustered with gene-type intermediate connections. Clusters discussed in text were: (a) tight-junction formation; (b) secretion and protein trafficking; (c) androgen receptor related; (d) ERBB2 and AF-6 related; (e) JAK/STAT pathway; (f) EP300/NCOR1 related; (g) GRB2 and PI3K related; (h) Wnt signaling and (i) EGR1 related. (C) Magnified Basal-like DNA repair cluster overlaid with gene co-occurrences: A = fewer than 2 co-occurrences; B = 2 or more co-occurrences. (D) Magnified Luminal EGR1 related cluster shows no co-occurrences of genes.

 
Implicit analysis and clustering were performed separately using each of the three object types (genes, ontologies and diseases) present in the database in order to break down implicit relationships by category, e.g. all gene-type implicit relationships were analyzed simultaneously to identify only pathway relationships. Arrays clustered based upon gene-type implicit relationships for the Basal-like and Luminal sets are shown in Figure 3A and B. Arrays based upon ontology and disease-type implicit relationships for the sets are shown in Supplementary Figures 2 and 3 for Basal-like and Supplementary Figures 4 and 5 for Luminal. Highly correlated clusters (Pearson correlation ≥ 0.4) broken down by genes and the top five implicit objects shared by them are detailed in Supplementary Tables 1 and 2 for Basal-like and Luminal sets, respectively.

3.3 Functionally related clusters in the basal-like set
Clusters having genes with previously known functional relationships within the Basal-like set were identified, e.g. keratin (Fig. 3A–[f]) and kallikrein (Fig. 3A–[g]) families. Several functional clusters not previously described for the set are detailed below.

3.3.1 NF-κB signaling
Genes involved in NF-κB signaling were found separately via all three implicit relationship types (Table 4–[1]) (Fig. 3A–[a], Supplementary Figs 2–[a], 3–[c]). Constitutive activation of the NF-κB pathway can squelch apoptotic induction signals (Shishodia and Aggarwal, 2002) and might confer apoptotic resistance to Basal-like tumors. Notably, TNFRSF21 (also known by its alias, DR6) is induced through NF-κB activation (Kasof et al., 2001) but did not co-occur with any of these genes and could therefore be identified only implicitly.



 
Table 4. Examples of clusters found in Basal-like and Luminal sets

 
3.3.2 Wnt signaling/developmental proteins
Genes involved in cell fate and development decisions, including several members of the Wnt signaling pathway, were identified by gene-type relationships, such as MYOD, BMP4 and other WNT genes (Table 4–[2]) (Fig. 3A–[b]). Wnt signaling is known to be involved in breast cancer and other human cancers (Howe and Brown, 2004). Several of these genes are suppressors of myogenic development (EZH2, MDFI, ID4, NOTCH1) and could inhibit differentiation of Basal-like tumor cells into normal breast myoepithelial cells. This hypothesis is supported by the fact that Basal-like tumors express some markers (keratins 5, 4 and 17) of differentiated myoepithelial cells but do not express several other classic markers (smooth muscle actin, p63 or membrane metallo-endopeptidase—CD10/CALLA) (Livasy et al., 2006). Notably, EZH2, MDFI and ID4 did not co-occur and could have been related only implicitly.

3.3.3 Proliferation signature genes
Three clusters found via gene-type relationships (Fig. 3A–[c–e]), as well as clusters found by ontology- and disease-type relationships (Supplementary Figs 2 and 3), consisted of ‘proliferation signature’ genes originally identified by gene expression-based clustering and typically observed in rapidly proliferating tumors (Whitfield et al., 2006). A large percentage of these genes are highly expressed during the cell cycle (Whitfield et al., 2002) and are predictive of poor prognosis in breast cancer patients (Dai et al., 2005). It is of interest that these genes, originally clustered by expression, are also clustered by implicit literature relationships, because this suggests shared or mutual functionalities. Implicit relationships in these clusters indicated involvement in cell cycle control, DNA synthesis, DNA damage repair and chromosome segregation, detailed subsequently.

3.3.4 DNA damage repair
Most of the genes found in this cluster function in DNA recombination and/or repair. Several have also been shown to repair stalled replication forks and maintain telomere length via homologous recombination (Tarsounas and West, 2005). These clusters may be phenotypically correlated with the high rate of gross chromosomal changes observed in these tumors (Richardson et al., 2006). This is supported by the fact that almost all human BRCA1 mutation carriers develop Basal-like tumors (Sorlie et al., 2003). Interestingly, disease-type relationships included diseases involving chromosomal instability and a predisposition to cancer. These genes were identified with all three object types (Table 4–[3]) (Fig. 3A–[c], Supplementary Figs 2–[d] and 3–[b]).

3.3.5 Chromosome segregation
Genes required for chromosome segregation were identified via ontology-type implicit relationships (Table 4–[4]) (Supplementary Fig. 2–[c]). ORC6L, associated with the origin recognition complex, and MID1, which stabilizes microtubules, had no co-occurrences with other genes in the cluster and could be identified only implicitly.

3.3.6 DNA synthesis, cell cycle control
A large cluster of 34 genes that included many proliferation signature genes was identified via gene-type implicit relationships. Two highly correlated subgroups within this cluster (correlated at 0.8) consisted of DNA synthesis (Fig. 3A–[d]) and cell cycle control (Fig. 3A–[e]) genes. DNA synthesis genes, especially those involved in the origin recognition complex, were also found via ontology-type implicit relationships (Supplementary Fig. 2–[e]). Several of these are regulated by transcription factor E2F-1 during G1/S phase of the cell cycle and form a complex required for initiation of DNA replication (Fang and Han, 2006).

3.4 Functionally related clusters in the Luminal set
Luminal and Basal cells have different developmental fates and are likely to express genes differentially even in the wild-type state. Luminal cells, for example, have an apical surface facing the lumen and show greater expression of proteins required for secretory functions. Indeed, numerous secretion and trafficking-related genes were found in one cluster (Fig. 3B–[b]) and another contained genes involved in the formation of tight junctions found in luminal epithelia (Fig. 3B–[a]). Several functional clusters not previously described in this tumor subtype are detailed subsequently.

3.4.1 Androgen receptor-related
One cluster was found containing genes related by the androgen receptor (AR) (Fig. 3B–[c]) (Table 4–[5]). Some (SPDEF, MAP3K1, TRPS1) regulate transcriptional activity of AR and others (KRT37, TRPS1) are regulated by AR at the transcriptional level. Interestingly, while AR is generally linked to prostate cancer, it may be of special significance in the Luminal subtype, because HER2/NEU, which is expressed in some tumors of the Luminal subtype, has been shown to modulate the activity of AR in prostate and breast cancer cell lines (Mellinghoff et al., 2004). SPDEF, TRPS1 and KRT37 had no co-occurrences with other genes in the cluster and could be identified only implicitly.

3.4.2 JAK/STAT pathway
Signaling through the JAK/STAT pathway (Fig. 3B–[e]) (Table 4–[6]) could be responsible for some of the anti-apoptotic and proliferative features of this subtype. Cyclin D1 (CCND1) and CISH transcription are increased through STAT5 signaling (Matsumura et al., 1999; Mitchell et al., 2003). A false-positive identification, SYT1, resulted because the SYT1 alias, p65, is also an alias for the RELA gene.

3.4.3 EP300/NCOR1
One cluster contained genes that form transcriptional co-activator and co-repressor complexes with nuclear factors including EP300 and NCOR1 (Fig. 3B–[f]) (Table 4–[7]). None of these genes co-occurred. Most of these factors interact with EP300 or NCOR1, which in turn interact with hormone receptors and other factors.

Other luminal clusters identified included genes related by HER2/NEU and AF6 (Fig. 3B–[d]), PI-3-kinase and GRB2 (Fig. 3B–[g]), WNT signaling (Fig. 3B–[h]), and EGR1 (Fig. 3B–[i]).

3.5 Implicit relationships show subtype specificity
Breast tumor subtypes show considerable biological and clinical diversity and may represent distinct disease processes (Hu et al., 2006). It was therefore of interest to determine if groups of functionally interacting genes could be identified that were specific to the individual subtypes. To investigate this, we compared the degree of overlap between implicit relationships in clusters of the Basal-like and Luminal sets. Clusters obtained from gene-type implicit relationships were selected from each set having at least four genes and correlated at 0.4 or higher, resulting in 19 and 11 clusters for the Basal-like and Luminal sets, respectively. All pairwise combinations of the 19 Basal-like and 11 Luminal clusters, 209 unique cluster pairs, were compared for overlap of the top 10 implicit relationships (determined by counting the number of genes in a cluster sharing a given implicit relationship). It was found that 13 of the paired Luminal and Basal clusters had matching implicit relationships within the top 10. However, the overlap of these clusters was low, with three clusters exhibiting 20% overlap (two shared relationships) and 10 clusters exhibiting 10% overlap (one shared). The low degree of overlap indicated that the majority of genes of the respective subtypes were clustered through unique implicit relationships.
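The overlap comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than the code used in the study: the data layout (each cluster as a mapping from gene to its set of implicit relationships), the function names and the alphabetical tie-breaking rule are assumptions introduced for the example.

```python
from collections import Counter
from itertools import product


def top_relationships(cluster, n=10):
    """Return the n implicit relationships shared by the most genes.

    `cluster` maps each gene to the set of implicit relationships it has.
    Ties are broken alphabetically (an assumption, for determinism).
    """
    counts = Counter(rel for rels in cluster.values() for rel in rels)
    ranked = sorted(counts, key=lambda rel: (-counts[rel], rel))
    return set(ranked[:n])


def pairwise_overlap(clusters_a, clusters_b, n=10):
    """Fraction of the top-n relationships shared by every cross-set cluster pair."""
    overlap = {}
    for (name_a, a), (name_b, b) in product(clusters_a.items(),
                                            clusters_b.items()):
        shared = top_relationships(a, n) & top_relationships(b, n)
        overlap[(name_a, name_b)] = len(shared) / n
    return overlap


# Hypothetical toy clusters; gene and relationship labels are placeholders.
basal = {"b1": {"G1": {"R1", "R2", "R3"}, "G2": {"R1", "R2"}}}
luminal = {"l1": {"G3": {"R4", "R2"}, "G4": {"R4"}}}
toy_overlap = pairwise_overlap(basal, luminal, n=2)
```

With 19 clusters in one set and 11 in the other, `pairwise_overlap` yields 19 × 11 = 209 cluster pairs, and two shared relationships out of the top 10 correspond to an overlap of 0.2 (20%).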

The relationships in the cluster pairs with two overlapping relationships were NF-κB and p65 (RelA, NF-κB subunit); STAT1 and STAT5; and RHOA and p85 (PI-3-kinase, regulatory subunit). These represent gene products often functioning as central mediators for several signaling pathways, possibly indicating that differing gene products of the respective tumor subtypes could function through some of the same pathway intermediates.

3.6 Most clusters are identifiable only implicitly
For the Basal-like and Luminal sets, only ~9.4% and ~9.6%, respectively, of those genes that were implicitly connected also co-occurred. To more closely examine specific clusters for co-occurring gene pairs, co-occurrences were mapped onto the arrays of implicitly clustered genes, and it was observed that some of the clusters had gene pairs related by direct co-occurrences (Fig. 3C) and some did not (Fig. 3D). All clusters of four or more genes correlated at 0.4 or greater were manually examined in both sets, and it was found that ~12.2% of gene pairs within these clusters co-occurred in both sets.
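For a single cluster, the proportion of gene pairs that also co-occur directly can be computed as in the following Python sketch. It is an illustration only: the function name, the representation of co-occurrence counts as a dictionary keyed by unordered gene pairs, and the `min_count` parameter (mirroring the "2 or more co-occurrences" cutoff in the legend of Fig. 3C) are assumptions.

```python
from itertools import combinations


def cooccurring_fraction(cluster_genes, cooccurrence_counts, min_count=1):
    """Fraction of gene pairs in a cluster with at least min_count co-occurrences.

    `cooccurrence_counts` maps frozenset({gene_a, gene_b}) to the number of
    abstracts in which the two genes are mentioned together; pairs that are
    absent from the mapping are treated as having zero co-occurrences.
    """
    pairs = list(combinations(sorted(cluster_genes), 2))
    if not pairs:
        return 0.0
    hits = sum(
        1 for a, b in pairs
        if cooccurrence_counts.get(frozenset((a, b)), 0) >= min_count
    )
    return hits / len(pairs)


# Hypothetical four-gene cluster with invented co-occurrence counts.
example_cluster = {"G1", "G2", "G3", "G4"}
example_counts = {
    frozenset(("G1", "G2")): 3,
    frozenset(("G3", "G4")): 1,
}
```

A cluster of four genes has six possible pairs, so in this toy example two of six pairs co-occur at least once, and one of six co-occurs at least twice.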

However, even given the low percentage of direct relationships both within the gene lists and within the clusters themselves, there remained a possibility that significant implicit associations (i.e. those in highly correlated clusters) resulted as a consequence of co-occurring gene pairs occurring simultaneously in the same abstracts with intermediate objects connecting them implicitly.

To explore this possibility, implicit relationship scores between genes that also co-occurred were omitted from the list of pairwise scores to produce a truncated, implicit-only set of scores. Scores from the implicit-only set were used to cluster genes as before, producing clusters via purely implicit, and not simultaneously direct, connections. Unique clusters found correlated at 0.4 or greater were compared to those obtained previously at the same correlation. Original clusters were matched to the most similar clusters (>=50% of genes shared) obtained using the implicit-only results. No clusters had more than one match. Of the original clusters, 39/43 (91%) of the Basal-like and 35/42 (83%) of the Luminal clusters were matched to similar clusters from the truncated set. Many of the lost clusters included those made up of genes having well known relationships, such as a kallikrein cluster (KLK5, 6, 8 and 10) and a keratin cluster (KRT6, 13, 16 and 17). Of the original clusters that matched clusters from the truncated set, 80.3% of Basal-like and 81.3% of Luminal genes were conserved in the matched clusters. Thus, most gene relationships in clusters were formed via purely implicit connections.
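The two bookkeeping steps in this comparison, removing scores for co-occurring pairs and matching original clusters to implicit-only clusters, can be sketched in Python as follows. The clustering itself is not reproduced here. The function names and data layout are hypothetical, and measuring the shared fraction relative to the original cluster's genes is an assumption, since the text states only that at least 50% of genes were shared.

```python
def implicit_only(scores, cooccurring_pairs):
    """Drop the scores of gene pairs that also co-occur directly.

    `scores` maps frozenset({gene_a, gene_b}) to an observed/expected score;
    `cooccurring_pairs` is a set of such frozensets.
    """
    return {pair: value for pair, value in scores.items()
            if pair not in cooccurring_pairs}


def best_match(original, candidates, min_shared=0.5):
    """Find the candidate cluster sharing the most genes with `original`.

    Returns (name, fraction) when the best candidate shares at least
    `min_shared` of the original cluster's genes, else (None, fraction).
    """
    best_name, best_fraction = None, 0.0
    for name, genes in candidates.items():
        fraction = len(original & genes) / len(original)
        if fraction > best_fraction:
            best_name, best_fraction = name, fraction
    if best_fraction >= min_shared:
        return best_name, best_fraction
    return None, best_fraction


# Hypothetical example data; gene labels are placeholders.
all_scores = {frozenset(("G1", "G2")): 5.0, frozenset(("G1", "G3")): 3.0}
direct_pairs = {frozenset(("G1", "G2"))}
truncated_scores = implicit_only(all_scores, direct_pairs)

original_cluster = {"G1", "G2", "G3", "G4"}
implicit_clusters = {"c1": {"G1", "G2", "G3"}, "c2": {"G4", "G5"}}
match = best_match(original_cluster, implicit_clusters)
```

Here the original cluster is matched to `c1`, which retains three of its four genes (75% conserved). An original cluster with no candidate reaching the 50% threshold would be counted as lost.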


    Discussion
In this study, we developed a method for implicit analysis and clustering of gene sets using a literature-derived biomedical object network. We began by showing that implicit analysis can identify functionally related genes with improved performance over the gene co-occurrence method using a control set of genes with known functional relationships. This was particularly true for groups in which literature co-occurrence rates were lower and literature relationships could only be identified implicitly. This should be considered an advantage of the implicit method given that the biomedical literature is a perpetual work in progress. We also showed that genes sharing functional relationships of a specific semantic type (gene, ontology or disease) were efficiently identified using the three object types simultaneously, a potential advantage when relevant relationship contexts for a given gene set are not known ahead of time.

We then applied implicit analysis and clustering to a real biological data set consisting of genes significantly expressed in Basal-like and Luminal breast tumor subtypes. Gene clusters were identified that shared functional relationships reflecting previously observed phenotypes (keratin cluster in Basal-like; secretion, trafficking and tight-junction formation clusters in Luminal). Previously unobserved functionalities were also identified (NF-κB-related anti-apoptotic mechanisms, DNA recombination and repair enzymes in the Basal-like subtype, hormone-activated receptors and JAK-STAT pathways in the Luminal). These represent potentially novel and relevant cancer-related pathway relationships for each subtype. Another interesting finding was that a subset of the Basal-set genes grouped by shared functional relationships consisted of proliferation signature genes previously identified by expression-based clustering. Comparison of implicit relationships shared by genes in clusters of both subtypes showed minimal overlap, and thus identified relatively subtype-specific clusters.

This method differs from previous literature-based gene clustering or grouping methods. In several previous methods (Alako et al., 2005; Chaussabel and Sher, 2002; Jelier et al., 2005), similarity measures between both genes and other concepts were used for subsequent clustering or arranging of genes. In this method, genes are not scored for literature similarity, but are instead compared using an observed/expected ratio of network connections to obtain statistical connection strengths between genes. One possible advantage of this approach is that genes could be considered related without having significantly overlapping literature profiles if they have a sufficiently high ratio of observed/expected connections. This may be useful for genes not conceptually related in the literature but that may still share significant functional relationships.

Since genes are scored prior to the clustering step, each pairwise observed/expected score is represented within a single cell of the clustered matrix. Because the method produces a gene-x-gene matrix, it is possible to overlay the implicit matrix with co-occurrences (Fig. 3C and D) to show which clusters have co-occurring genes and which do not. This could have discovery utilities, such as identifying potentially interacting genes with no literature co-occurrences. While the gene-x-gene matrix presented here simplifies visual comparison of gene pairs, it does somewhat reduce ease of use in identifying shared relationships, which must be identified via an additional query. In some methods (Alako et al., 2005; Chaussabel and Sher, 2002), shared relationships are visible in the clustered matrix since genes are clustered against shared concepts.

One problem observed was that false positive relationships occasionally occurred due to misidentifications of terms with unrelated database objects during text processing. For example, the concept thesaurus would mistake one gene for another if they shared a common synonym or alias. Within the Basal-like and Luminal sets (Supplementary Tables 1 and 2), several genes were incorrectly grouped due to false positives of this type. Time-consuming manual analysis of the related literature abstracts was necessary to identify these errors. Advances in artificially intelligent text processing may be required to address this type of error.


    ACKNOWLEDGEMENTS
For helpful discussions regarding this article we would like to thank John W. Fondon III, Wayne Fisher, My-Hanh T. Nguyen, Kristin Lennox, Cristi L. Galindo, Mounir Errami and Ryan Weil. This research was supported by the P. O’B. Montgomery Distinguished Chair in Human Growth and Development, the Evelyn Hudson Foundation and grants from NIH/NIAID Western Regional Centers of Excellence for Biodefense and Emerging Infectious Diseases (U54AI057156), NIH/NCI (CA096901), NIH/NCI SPORE (50CA70907), and UNC SPORE Breast Cancer (P50-CA58223).

Conflict of interest: none declared.


    FOOTNOTES
 
Associate Editor: Limsoon Wong

Received on December 29, 2006; accepted on May 8, 2007

    REFERENCES

    Alako BTF, et al. CoPub Mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics (2005) 6:1–15.

    Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. (2000) 25:25–29.

    Barrett T, et al. NCBI GEO: mining millions of expression profiles – database and tools. Nucleic Acids Res. (2005) 33:D562–D566.

    Cantor SB, et al. BACH1, a novel helicase-like protein, interacts directly with BRCA1 and contributes to its DNA repair function. Cell (2001) 105:149–160.

    Chaussabel D, Sher A. Mining microarray expression data by literature profiling. Genome Biol. (2002) 3:1–16.

    Cooper CS. Applications of microarray technology in breast cancer research. Breast Cancer Res. (2001) 3:158–175.

    Dai H, et al. A cell proliferation signature is a marker of extremely poor outcome in a subpopulation of breast cancer patients. Cancer Res. (2005) 65:4059–4066.

    Ding J, et al. Mining MEDLINE: abstracts, sentences or phrases? Pac. Symp. Biocomput. (2002) 7:326–337.

    Eisen MB, et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA (1998) 95:14863–14868.

    Fang ZH, Han ZC. The transcription factor E2F: a crucial switch in the control of homeostasis and tumorigenesis. Histol. Histopathol. (2006) 21:403–413.

    Hanley JA, McNeil BJ. A simple generalization of the area under the ROC curve to multiple class classification problems. Radiology (1982) 143:29–36.

    Howe LR, Brown AM. Wnt signaling and breast cancer. Cancer Biol. Ther. (2004) 3:36–41.

    Hristovski D, et al. Using literature-based discovery to identify disease candidate genes. Int. J. Med. Inform. (2005) 74:289–298.

    Hu Z, et al. The molecular portraits of breast tumors are conserved across microarray platforms. BMC Genomics (2006) 7:96.

    Jelier R, et al. Co-occurrence based meta-analysis of scientific texts: retrieving biological relationships between genes. Bioinformatics (2005) 21:2049–2058.

    Kasof GM, et al. Tumor necrosis factor-alpha induces the expression of DR6, a member of the TNF receptor family, through activation of NF-kappaB. (2001) 20:7965–7975.

    Khatri P, Draghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics (2005) 21:3587–3595.

    Khatri P, et al. Profiling gene expression using Onto-Express. Genomics (2002) 79:266–270.

    Livasy CA, et al. EGFR expression and HER2/neu overexpression/amplification in endometrial carcinosarcoma. Gynecol. Oncol. (2006) 100:101–106.

    Lowe HJ, Barnett GO. Understanding and using the Medical Subject Headings (MeSH) vocabulary to perform literature searches. JAMA (1994) 271:1103–1108.

    Mao J, et al. Regulation of Gli1 transcriptional activity in the nucleus by Dyrk1. J. Biol. Chem. (2002) 277:35156–35161.

    Matsumura I, et al. Transcriptional regulation of the cyclin D1 promoter by STAT5: its involvement in cytokine-dependent growth of hematopoietic cells. EMBO J. (1999) 18:1367–1377.

    Mellinghoff IK, et al. HER2/neu kinase-dependent modulation of androgen receptor function through effects on DNA binding and stability. Cancer Cell (2004) 6:517–527.

    Michibata H, et al. Identification and characterization of a novel component of the cornified envelope, cornifelin. Biochem. Biophys. Res. Commun. (2004) 318:803–813.

    Mitchell TJ, et al. Dysregulated expression of COOH-terminally truncated Stat5 and loss of IL2-inducible Stat5-dependent gene expression in Sezary Syndrome. Cancer Res. (2003) 63:9048–9054.

    Richardson AL, et al. X chromosomal abnormalities in basal-like human breast cancer. Cancer Cell (2006) 9:121–132.

    Rouzier R, et al. Breast cancer molecular subtypes respond differently to preoperative chemotherapy. Clin. Cancer Res. (2005) 11:5678–5685.

    Saldanha AJ. Java TreeView – extensible visualization of microarray data. Bioinformatics (2004) 20:3246–3248.

    Shatkay H, Feldman R. Mining the biomedical literature in the genomic era: an overview. J. Comput. Biol. (2003) 10:821–855.

    Shishodia S, Aggarwal BB. Nuclear factor-kappaB activation: a question of life or death. J. Biochem. Mol. Biol. (2002) 35:28–40.

    Sorlie T, et al. Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl Acad. Sci. USA (2003) 100:8418–8423.

    Srinivasan P, Libbus B. Mining MEDLINE for implicit links between dietary substances and diseases. Bioinformatics (2004) 20(Suppl. 1):i290–i296.

    Swanson DR. Fish oil, Raynaud's syndrome, and undiscovered public knowledge. Perspect. Biol. Med. (1986) 30:7–18.

    Tarsounas M, West SC. Recombination at mammalian telomeres: an alternative mechanism for telomere protection and elongation. Cell Cycle (2005) 4:672–674.

    Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA (2001) 98:5116–5121.

    Weeber M, et al. Generating hypotheses by discovering implicit associations in the literature: a case report of a search for new potential therapeutic uses for thalidomide. J. Am. Med. Inform. Assoc. (2003) 10:252–259.

    Whitfield ML, et al. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol. Biol. Cell (2002) 13:1977–2000.

    Whitfield ML, et al. Common markers of proliferation. Nat. Rev. Cancer (2006) 6:99–106.

    Wren JD, et al. Knowledge discovery by automated identification and ranking of implicit relationships. Bioinformatics (2004) 20:389–398.



