Bioinformatics Advance Access originally published online on May 26, 2005
Bioinformatics 2005 21(15):3324-3326; doi:10.1093/bioinformatics/bti503
Published by Oxford University Press 2005

MeSHer: identifying biological concepts in microarray assays based on PubMed references and MeSH terms

Amira Djebbari 1,2, Svetlana Karamycheva 1, Eleanor Howe 5 and John Quackenbush 1,3,4,5,*

1 The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA
2 Department of Computer Science, The George Washington University, Washington, DC, USA
3 Department of Biochemistry, The George Washington University, Washington, DC, USA
4 Department of Statistics, Bloomberg School of Public Health, The Johns Hopkins University, Baltimore, MD, USA
5 Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA

*To whom correspondence should be addressed.


    Abstract

Summary: MeSHer uses a simple statistical approach to identify biological concepts in the form of Medical Subject Headings (MeSH terms) obtained from the PubMed database that are significantly overrepresented within the identified gene set relative to those associated with the overall collection of genes on the underlying DNA microarray platform. As a demonstration, we apply this approach to gene lists acquired from a published study of the effects of angiotensin II (Ang II) treatment on cardiac gene expression and demonstrate that this approach can aid in the interpretation of the resulting ‘significant’ gene set.

Availability: The software is available at http://www.tm4.org

Contact: johnq{at}jimmy.harvard.edu

Supplementary information: Results from the analysis of significant genes from the published Ang II study.


    INTRODUCTION
DNA microarrays have been widely used to survey large numbers of genes and to identify those that correlate with the particular biological processes under study. Although laboratory-based protocols for generating gene expression data have greatly improved in recent years, and methods for data analysis and the identification of ‘significant’ genes have evolved substantially, the interpretation of the functional roles played by these genes remains an ongoing challenge. Tools such as EASE (Hosack et al., 2003) and GoMiner (Zeeberg et al., 2003) allow Gene Ontology assignments (GO terms) (Ashburner et al., 2000) and pathway assignments to be used to identify general classes of genes that are overrepresented in a particular dataset. While such approaches have been widely used, they cannot provide direct associations with disease states or other related phenomena. The primary source of such information remains the corpus of biological literature, and mining the literature remains a laborious process. Although a number of utilities, such as HAPI (Masys et al., 2001) and MedMiner (Tanabe et al., 1999), use the Medical Subject Headings (MeSH terms, http://www.nlm.nih.gov/mesh/meshhome.html) and literature references from the PubMed database (http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed) to offer some insight into what might be represented in a particular gene set, they do not provide estimates of the significance of any particular association derived from such analysis. MeSHer was developed to extend these methods by using Fisher's exact test to estimate the likelihood that a particular MeSH term occurs in a selected gene set by chance, relative to its overall representation among the MeSH terms associated with genes on the array, and to provide a list of literature references that link these terms to specific genes.


    METHODOLOGY
MeSH is the National Library of Medicine's controlled vocabulary for indexing articles in the PubMed database. MeSH terms are assigned by expert curators who attempt to summarize the information presented in each indexed article and the genes described therein. There are 19 000 MeSH terms organized in a hierarchy based on 15 top-level categories that provides a consistent way to retrieve information regarding manuscripts that may use different terminology to describe the same biological or medical concepts. MeSH terms have previously been used effectively to provide insight into results from microarray studies. Masys et al. (2001) applied MeSH terms associated with genes that had been used to discriminate between leukemia subtypes (Golub et al., 1999) and discovered that these genes were not simply markers of hematopoietic lineage but were also involved in cancer pathogenesis, confirming the result uncovered by Golub et al. (1999) in their analysis.
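The hierarchical organization described above can be illustrated with a small sketch. It assumes that each term is addressed by a dotted tree number in which every prefix names an ancestor; the tree numbers used here (`X01.100`, `Y02`, and so on) are hypothetical placeholders rather than real MeSH entries, and the function names are illustrative only.

```python
def build_tree(tree_numbers):
    """Nest dotted tree numbers into a dict-of-dicts keyed by full prefix."""
    root = {}
    for number in tree_numbers:
        node = root
        parts = number.split(".")
        # Walk from the top-level category down, creating missing ancestors.
        for depth in range(1, len(parts) + 1):
            prefix = ".".join(parts[:depth])
            node = node.setdefault(prefix, {})
    return root


def render(tree, labels, highlighted=(), indent=0):
    """Return indented text lines; entries in `highlighted` are marked with '*'."""
    lines = []
    for number in sorted(tree):
        mark = "*" if number in highlighted else " "
        lines.append(f"{'  ' * indent}{mark} {labels.get(number, number)}")
        lines.extend(render(tree[number], labels, highlighted, indent + 1))
    return lines


# Hypothetical tree numbers, for illustration only.
tree = build_tree(["X01.100", "X01.200.300", "Y02"])
for line in render(tree, labels={}, highlighted={"X01.200.300"}):
    print(line)
```

Keying each node by its full prefix keeps look-ups for labels and highlighted entries simple, at the cost of repeating the prefix at every level.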

MeSHer builds on the available annotation for microarray and other genomic resources captured in the RESOURCERER database (Tsai et al., 2001). RESOURCERER represents the vast majority of commercial and cDNA microarray resources used for human, mouse, rat, zebrafish and Xenopus, as well as RefSeq genes for these species and other resources such as collections of mouse embryonic stem cell knock-out lines and databases of sequences with identified coding single nucleotide polymorphisms; a version of RESOURCERER has recently been released for plants. For each resource represented in the RESOURCERER, a variety of annotation is provided, including gene names, genomic locations, links to orthologous genes in other species and relevant PubMed references, which MeSHer uses as the source of MeSH terms for analysis.

In building the RESOURCERER, we faced a number of possible alternatives for associating PubMed references with individual elements represented in any resource. An obvious approach might be to use gene names. While gene names are very useful in manuscripts, computationally they are quite difficult to use, in part because they suffer from the problems of synonymy and polysemy: synonymy means that one gene can be called by several names, and polysemy means that the same name can refer to several genes. Both problems are common. Consequently, we chose a sequence-based approach to link the individual elements in each resource to well-characterized proteins and through those to the PubMed references and their associated MeSH terms. The basis for the links is the TIGR Gene Index (TGI) databases (Quackenbush et al., 2000; 2001; Lee et al., 2005). The TGI are constructed for each of the 83 species represented by collecting high-quality expressed sequence tag and gene sequences, clustering them based on sequence similarity and assembling them into high-confidence Tentative Consensus (TC) sequences. The resulting TCs are then searched against a non-redundant amino acid database, populated with sequences from Swiss-Prot (Boeckmann et al., 2003; http://us.expasy.org/sprot/), GenPept (ftp://ftp.ncbi.nih.gov/genbank/), PIR (Wu et al., 2003; http://pir.georgetown.edu/home.shtml), PDB (Westbrook et al., 2003; http://www.rcsb.org/pdb/) and PRF (http://www4.prf.or.jp/en/), to obtain a list of the top five most significant hits. PubMed identifiers associated with these proteins are then associated with the starting elements in any individual resource.
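The chain of links described above (resource element, TC sequence, top five protein hits, PubMed identifiers, MeSH terms) can be sketched as a series of dictionary look-ups. This is only an illustration under stated assumptions: the function names, the in-memory dictionaries and every identifier below are hypothetical, and the protein hits are assumed to be already sorted with the most significant first.

```python
def pubmed_ids_for_element(element_id, element_to_tc, tc_to_hits,
                           protein_to_pmids, max_hits=5):
    """Collect PubMed IDs reachable from one resource element."""
    tc = element_to_tc.get(element_id)
    if tc is None:
        return set()
    pmids = set()
    # Only the top `max_hits` protein hits are followed (five in the text).
    for protein in tc_to_hits.get(tc, [])[:max_hits]:
        pmids.update(protein_to_pmids.get(protein, ()))
    return pmids


def mesh_terms_for_element(element_id, element_to_tc, tc_to_hits,
                           protein_to_pmids, pmid_to_mesh):
    """Collect the MeSH terms attached to an element's PubMed references."""
    terms = set()
    for pmid in pubmed_ids_for_element(element_id, element_to_tc,
                                       tc_to_hits, protein_to_pmids):
        terms.update(pmid_to_mesh.get(pmid, ()))
    return terms


# Toy data; all identifiers are made up for illustration.
element_to_tc = {"ACC_0001": "TC_1"}
tc_to_hits = {"TC_1": ["P1", "P2", "P3", "P4", "P5", "P6"]}
protein_to_pmids = {"P1": [101], "P2": [102, 103], "P6": [999]}
pmid_to_mesh = {101: ["term_A"], 102: ["term_B"],
                103: ["term_A", "term_C"], 999: ["term_Z"]}

terms = mesh_terms_for_element("ACC_0001", element_to_tc, tc_to_hits,
                               protein_to_pmids, pmid_to_mesh)
print(sorted(terms))  # ['term_A', 'term_B', 'term_C']
```

In this toy example the sixth hit, `P6`, falls outside the top five, so its reference and the term attached to it are not carried over to the element.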

MeSHer assumes that the user has started with one of these resources and has selected a subset of the elements in that resource based on some experimental criteria. Although the most common application is the analysis of gene expression data using a particular microarray platform, the inclusion of RefSeq genes in RESOURCERER allows other analyses, such as genetic linkage studies in which a particular locus—and the genes that map onto that region of the genome—can be compared with the collection of genes in the target species. Having been provided a resource and a subset of genes selected by an appropriate means and denoted by their corresponding GenBank accession numbers, MeSHer performs a statistical analysis of the frequency of MeSH terms associated with the selected subset relative to their frequency in the overall resource, assigning P-values based on the Fisher's exact test. The results are returned to the user either with significant terms ranked based on P-value or as an expandable tree representing the MeSH hierarchy with significant terms highlighted. By default, uninformative MeSH terms such as ‘Geographic Location’ and ‘Information Technology’ are excluded from the analysis, although they can be selected for inclusion by the user.
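The statistical step described above can be sketched as follows. This is a minimal illustration and not the MeSHer source code: it assumes a one-sided Fisher's exact test for overrepresentation (the upper tail of the hypergeometric distribution), which the text does not specify, and the function names, the `alpha` parameter and the toy gene and term identifiers are all hypothetical.

```python
from math import comb


def fisher_overrepresentation_p(k, n, K, N):
    """One-sided Fisher's exact test (upper hypergeometric tail).

    k: selected genes annotated with the term
    n: size of the selected gene set
    K: genes in the whole resource annotated with the term
    N: genes in the whole resource
    Returns P(X >= k) when n genes are drawn without replacement.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


def rank_terms(term_to_genes, selected, background, alpha=0.01):
    """Return (term, P-value) pairs with P <= alpha, smallest P first."""
    background = set(background)
    selected = set(selected) & background
    results = []
    for term, genes in term_to_genes.items():
        annotated = set(genes) & background
        hits = annotated & selected
        if not hits:
            continue  # a term absent from the subset cannot be overrepresented
        p = fisher_overrepresentation_p(len(hits), len(selected),
                                        len(annotated), len(background))
        if p <= alpha:
            results.append((term, p))
    return sorted(results, key=lambda item: item[1])


# Toy data: 20 genes in the resource, 5 selected; identifiers are made up.
background = [f"g{i}" for i in range(20)]
selected = ["g0", "g1", "g2", "g3", "g4"]
term_to_genes = {
    "term_A": ["g0", "g1", "g2", "g3"],                   # all 4 are selected
    "term_B": ["g0", "g10", "g11", "g12", "g13", "g14"],  # 1 of 6 is selected
}

ranked = rank_terms(term_to_genes, selected, background)
for term, p in ranked:
    print(f"{term}\t{p:.3g}")
```

With these toy counts only `term_A` passes the 0.01 threshold. The sketch applies no correction for testing many terms at once; whether and how to adjust the P-values is left open here.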


    APPLICATIONS
As an example of the utility of MeSHer, we analyzed the results of Larkin et al. (2004), who recently used cDNA microarrays to describe the cardiac transcriptional response to short- and long-term exposure to angiotensin II (Ang II) in a mouse model of hypertension. In this study, a subset of genes responsive to Ang II was identified and analyzed using a variety of techniques, including EASE analysis (Hosack et al., 2003) of GO terms assigned to the genes, to develop a biological understanding of the observed response. We used MeSHer to analyze the 15 most significant upregulated and downregulated genes, independent of time of exposure, to identify MeSH terms that were significantly associated (P ≤ 0.01) with each set of genes.

For the top 15 upregulated genes identified by Larkin et al. (2004) we found that many of the most significant MeSH terms were similar or related to the concepts they had previously identified. Ang II is known to induce hypertrophy of the heart and among the most significant MeSH terms were those linked to tissue remodeling, including Extracellular Matrix (P = 1.42 × 10⁻⁷), Vascular Endothelium (1.58 × 10⁻⁶), Fibroblast Growth Factor 1 (5.18 × 10⁻⁸) and Vascular Smooth Muscle (2.74 × 10⁻⁴), as well as a number of terms related to tumorigenesis such as Neoplastic Gene Expression Regulation (2.64 × 10⁻⁶), Brain Neoplasm (7.74 × 10⁻⁵), Neoplasm RNA (2.35 × 10⁻⁴) and Neoplasm Structural Genes (3.61 × 10⁻⁴).

For the top 15 acute and chronic downregulated genes, many of the changes found by Larkin et al. (2004) represented genes associated with fundamental metabolism, and we find corresponding MeSH terms associated with metabolism and mitochondria, including Dietary Carbohydrates (1.17 × 10⁻³), Lactose (5.86 × 10⁻⁴), Dinitrochlorobenzene (6.43 × 10⁻³), Bezafibrate (2.34 × 10⁻³), Ketone Oxidoreductases (5.26 × 10⁻³), Myoblasts (2.92 × 10⁻³), Antilipemic Agents (2.92 × 10⁻³) and Selenium (8.76 × 10⁻³).

Interestingly, we also find apolipoprotein E and amyloid beta-protein precursor, which are usually associated with Alzheimer's disease, to be significant in this analysis. This is of particular note as the study by Larkin et al. (2004) was the first to suggest a potential mechanistic link between hypertension and Alzheimer's, an association found previously in a number of clinical studies (Sparks et al., 2000; Sparks, 1997; Kang et al., 2005). A list of the 15 most significant upregulated and downregulated genes and the associated significant MeSH terms and their P-values, as well as the MeSHer hierarchical output, can be found in the Supplemental Material.


    SUMMARY AND CONCLUSIONS
MeSHer takes a relatively straightforward approach to mining the literature for the biological processes that may be associated with a particular list of ‘significant’ genes selected from a larger population. The advantage of MeSHer relative to previously described approaches is that it provides statistical support for the individual MeSH terms and displays them in the context of the MeSH hierarchy, which makes them easier to interpret and provides an important supplement to the existing tools for the interpretation of array data. The software supporting MeSHer is freely available at http://www.tm4.org, with source code distributed under the open-source Artistic License. In the future, we plan to integrate MeSHer into the MeV (Saeed et al., 2003) microarray data mining software so that the biological concepts found in a given gene list can be viewed side by side with the results of MeV's clustering algorithms, as we believe that such integration would help biologists interpret their microarray datasets more seamlessly.

Conflict of Interest: none declared.

Received on April 4, 2005; revised on May 16, 2005; accepted on May 16, 2005

    REFERENCES

    Ashburner, M., et al. (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet., 25, 25–29.

    Boeckmann, B., et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.

    Golub, T.R., et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.

    Hosack, D.A., et al. (2003) Identifying biological themes within lists of genes with EASE. Genome Biol., 4, R70.

    Kang, J.H., et al. (2005) Apolipoprotein E, cardiovascular disease and cognitive function in aging women. Neurobiol. Aging, 26, 475–484.

    Larkin, J.E., et al. (2004) Cardiac transcriptional response to acute and chronic angiotensin II treatments. Physiol. Genomics, 18, 152–166.

    Lee, Y., et al. (2005) The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res., 33, D71–D74.

    Masys, D.R., et al. (2001) Use of keyword hierarchies to interpret gene expression patterns. Bioinformatics, 17, 319–326.

    Quackenbush, J., et al. (2000) The TIGR gene indices: reconstruction and representation of expressed gene sequences. Nucleic Acids Res., 28, 141–145.

    Quackenbush, J., et al. (2001) The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res., 29, 159–164.

    Saeed, A.I., et al. (2003) TM4: a free, open-source system for microarray data management and analysis. Biotechniques, 34, 374–378.

    Sparks, D.L. (1997) Coronary artery disease, hypertension, ApoE, and cholesterol: a link to Alzheimer's disease? Ann. N. Y. Acad. Sci., 826, 128–146.

    Sparks, D.L., et al. (2000) Link between heart disease, cholesterol, and Alzheimer's disease: a review. Microsc. Res. Tech., 50, 287–290.

    Tanabe, L., et al. (1999) MedMiner: an Internet text-mining tool for biomedical information, with application to gene expression profiling. Biotechniques, 27, 1210–1214, 1216–1217.

    Tsai, J., et al. (2001) RESOURCERER: a database for annotating and linking microarray resources within and across species. Genome Biol., 2, software0002.0001–0002.0004.

    Westbrook, J., et al. (2003) The Protein Data Bank and structural genomics. Nucleic Acids Res., 31, 489–491.

    Wu, C.H., et al. (2003) The Protein Information Resource. Nucleic Acids Res., 31, 345–347.

    Zeeberg, B.R., et al. (2003) GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biol., 4, R28.

