Bioinformatics Advance Access originally published online on November 29, 2007
Bioinformatics 2008 24(3):428-429; doi:10.1093/bioinformatics/btm588
TOM: enhancement and extension of a tool suite for in silico approaches to multigenic hereditary disorders
1Dipartimento di Elettronica, Informatica e Sistemistica (DEIS), University of Bologna, Viale Risorgimento 2, 40136 Bologna, 2Functional Genomics Laboratory and Telethon Facility – DAMA Data Mining for Analysis of DNA Microarrays, Dipartimento di Morfologia ed Embriologia, Via Fossato di Mortara 64b, 44100 Ferrara and 3Unità di Genetica Medica, Policlinico S. Orsola, via Massarenti 9, 40138 Bologna, Italy
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: The study of complex hereditary diseases is a very challenging area of research. The expanding set of in silico approaches offers a flourishing ground for the acceleration of meaningful findings in this area by exploitation of rich and diverse sources of omic data. These approaches are cheap, flexible, extensible, often complementary and can continuously integrate new information and tests to improve the selection of genes responsible for hereditary diseases. Following this principle, we improved and extended our web-service TOM for the identification of candidate genes in the study of complex hereditary diseases.
Availability: Our tool is freely available online at http://www.micrel.deis.unibo.it/~tom/.
Contact: daniele.masotti{at}unibo.it
Supplementary information: Manuals and sample data are available in the Help section of the tool's web page.
| 1 INTRODUCTION |
|---|
The study of complex hereditary diseases is a fundamental area of research that is likely to shed light on some of the most common, severe and costly illnesses. With the recent technological improvements that offer high-throughput molecular information, the impact of bioinformatics in this field has become crucial, since it is unfeasible to handle the amount of available knowledge and data with manual approaches. Therefore, automated statistical and computational tools (in silico approaches) are required to infer novel information about gene associations relevant to diverse types of biological queries. These tools are extremely flexible, as they can continuously and rapidly take advantage of new biological discoveries and embed them in the algorithm design for data mining, for example in the form of new statistical tests on the data. Because of these advantages, the research community has made several efforts to devise in silico approaches able to extract disease candidate genes (Oti and Brunner, 2007). We recently developed TOM (Rossi et al., 2006), a tool that uses gene expression data available in the public Gene Expression Omnibus repository (GEO; Edgar et al., 2002) to extract disease candidate genes given one or two linkage regions (also called loci) of interest. These inputs represent regions of the genome suspected of containing genes relevant to a disease. Briefly, TOM accepts two possible types of input and outputs a list of candidate genes for the disease under study. The first option, called One Locus, is used when one or more genes (seeds of the search) related to the disease are already known. In this case, the known gene(s) and one locus of interest are used as input. The second option, Two Loci, is used when no gene is known, but two loci are supposed to be relevant to the disease.
The program identifies genes that are co-expressed in the two loci of interest, or co-expressed with the gene of interest selected as seed, based on pre-computed correlation scores that measure the similarity between expression profiles (co-expression analysis). The resulting genes can then be submitted to functional analysis based on Gene Ontology (The Gene Ontology Consortium, 2001). We describe here three main improvements to TOM's original algorithm: (i) enhancement of the statistical scores that define the associations; (ii) addition of murine expression data for more comprehensive analyses; (iii) introduction of advanced flexible enrichment analysis.
| 2 STATISTICAL SCORES ENHANCEMENT |
|---|
Besides the p-value for assessing the significance of the correlation between two of the n gene profiles in the experiment (with enhanced correction for multiple hypotheses over all the n × n couples), this enhanced version offers another statistic, here called R, that measures the robustness of the result. The rationale of this score is the same as that of enrichment (Zhang et al., 2005): it relies on the assumption that the more often a gene is found to be related to another gene, the more robust the association between the two can be assumed to be. However, the frequency of positive results needs to be normalized by the total number of tests performed on the couple of genes. In fact, if a couple of genes appears to be statistically related in a small number of experiments that nevertheless make up all the experiments tested on those two genes, the result is more relevant than if the same number represents only a fraction of the tests performed on them. Namely, it is defined as $R = |\{\rho_{x,y} \mid p_{x,y} < \alpha\}| \,/\, |\{\rho_{x,y}\}|$, where $x$ and $y$ are the expression profiles of two genes, $\rho$ is the correlation score, $p$ the p-value corresponding to $\rho$, and $\alpha$ the user-defined statistical significance threshold. This statistic broadens the analysis by taking into account the robustness of the relationship found, based on its redundancy across different sets of experiments.
| 3 MURINE-HUMAN DATA |
|---|
Besides adding more human data, we enriched the database with murine expression data. As for the human data, the information consists of pre-computed correlations among gene expression profiles, for series obtained from the GEO database. Once the query is performed on one or two loci, TOM extracts two lists of candidate genes. These two lists represent either the correlating genes on the two loci or the list of seeds plus the correlating genes on the single locus. To identify more stringent correlations, TOM then runs a further query based on murine data. Including the mouse expression data makes it possible to identify new correlations that might not be visible from the human data alone. Extensive work has been done in mice in order to study human disorders, and current technology makes it possible to target virtually any mouse candidate gene that has a human homologue (Capecchi, 2005). Using the correlations in mice, TOM retrieves the human homologues (from HomoloGene at NCBI) and searches for them in the list identified from the human array data.
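As a rough illustration of the two steps discussed so far, the following Python sketch computes a robustness score as the fraction of tests on a gene pair whose p-value falls below a threshold, and then keeps only the human candidates whose mouse homologue also shows a correlation. The function names, the default threshold, the toy p-values and the small homologue table are assumptions made for illustration; TOM's actual implementation and its HomoloGene lookup are not described at this level of detail in the text.

```python
from typing import Dict, Iterable, Set


def robustness(p_values: Iterable[float], alpha: float = 0.05) -> float:
    """Fraction of tests on one gene pair that are significant (p < alpha).

    `p_values` holds one p-value per experiment series in which the pair
    was tested; `alpha` stands in for the user-defined threshold.
    """
    tested = list(p_values)
    if not tested:
        return 0.0  # assumption: a pair that was never tested scores 0
    significant = sum(1 for p in tested if p < alpha)
    return significant / len(tested)


def confirmed_by_mouse(
    human_candidates: Set[str],
    mouse_hits: Iterable[str],
    mouse_to_human: Dict[str, str],
) -> Set[str]:
    """Keep human candidates whose mouse homologue is also a hit.

    `mouse_to_human` is a hypothetical mouse-symbol -> human-symbol table;
    mouse genes without a known human homologue are ignored.
    """
    mapped = {mouse_to_human[g] for g in mouse_hits if g in mouse_to_human}
    return human_candidates & mapped


if __name__ == "__main__":
    # Toy data: a pair significant in 2 of 4 series vs. 2 of 2 series.
    print(robustness([0.01, 0.20, 0.03, 0.50]))
    print(robustness([0.01, 0.02]))

    human = {"BRCA1", "BARD1", "TP53"}
    mouse = ["Brca1", "Bard1", "Xyz1"]
    table = {"Brca1": "BRCA1", "Bard1": "BARD1"}
    print(sorted(confirmed_by_mouse(human, mouse, table)))
```

In this toy run the second pair has the same number of significant tests as the first but a higher score, because those tests are all the tests available for it, which is the normalization argument made in Section 2.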
| 4 EXTENDED ENRICHMENT ANALYSIS |
|---|
Finally, we integrated and improved a second tool, FIT (Nardini et al., 2006), to help geneticists understand the role of the most significant genes. FIT measures the similarity between any list of candidate genes extracted with TOM (test list) and any number of reference lists, which may be extracted in the same way or may represent a signature or a pathway obtained from the literature, custom defined, or annotated in KEGG (Kanehisa and Goto, 2000) or GenMAPP (Salomonis et al., 2007). This measure of similarity consists of a sequence of three statistical tests (enrichment, specificity of the enrichment and fit) for the quantification and ranking of the relationship between any two sets of genes. Statistical significance of enrichment ($p_e$) is evaluated by means of the hypergeometric distribution (Sokal and Rohlf, 2003). It assesses whether the number of relevant items in a set is greater than would be obtained by chance. The specificity of the enrichment ($p_s$) assesses whether the enrichment is specific to the given category. Namely, specificity informs the researcher whether the meaning, besides being statistically significant, is specific to a given set of genes or is shared with others. In particular, it can tell not only whether the number of items falling in a given category is greater than expected by chance, but also whether it is unique to a given set of genes. To do so, the candidate gene list is represented as a distribution of all its genes across the bins defined by the categories (references) to compare against (sub-ontologies, pathways, other custom sets). The same is done for each reference list, which is generally represented as an impulsive distribution (almost all the genes fall in the same bin). The specificity is then defined as a significant value of correlation between the distribution profiles. This score also helps to disambiguate particular cases with identical enrichment but different distributions of the genes (Nardini et al., 2006). The significance of the final fit score is obtained from the Fisher inverse $\chi^2$ method (Hedges and Olkin, 1985) and is defined as $p_f = -2(\log(p_e) + \log(p_s))$.
Globally, this analysis makes it possible to define statistical scores that rank, and thus help disambiguate, the enrichment of the list of candidate genes for meaningful sets of known or annotated genes. To better delimit the genomic area, it is then central to compare these same genes to others that might contribute to the same cellular pathway or are part of the same expression set. FIT allows this quantitative automated comparison and can list, for example, the $p$-values for the enrichment comparison with all KEGG pathways. The user can then, for example, choose to give priority in the candidate gene list to the genes related to the most enriched function and thus make the analysis more efficient. Automated approaches are crucial to make the high-throughput quantification of these comparisons feasible. Given these necessities, we expect this approach to provide an integrative and efficient tool for enhanced, effective hypothesis-driven research.
Conflict of Interest: none declared.
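The enrichment and fit steps described in Section 4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not FIT's code: the function names and the toy numbers are hypothetical, the enrichment p-value is taken as the upper tail of the hypergeometric distribution, the logarithm is assumed to be natural, and the conversion of the combined statistic into a p-value uses the standard result for Fisher's method (a chi-square distribution with 4 degrees of freedom for two independent p-values), which the text does not spell out. The specificity test is not reproduced here; its p-value is simply passed in.

```python
import math


def hypergeom_enrichment_p(N: int, K: int, n: int, k: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= k).

    N: genes in the background, K: genes in the reference list,
    n: genes in the test list, k: genes shared by test and reference.
    """
    total = math.comb(N, n)
    tail = sum(
        math.comb(K, i) * math.comb(N - K, n - i)
        for i in range(k, min(n, K) + 1)
    )
    return tail / total


def fisher_statistic(p_e: float, p_s: float) -> float:
    """Combined statistic -2 * (log(p_e) + log(p_s)), natural log assumed."""
    for p in (p_e, p_s):
        if not 0.0 < p <= 1.0:
            raise ValueError("p-values must lie in (0, 1]")
    return -2.0 * (math.log(p_e) + math.log(p_s))


def fisher_combined_p(p_e: float, p_s: float) -> float:
    """P-value of the combined statistic under chi-square with 4 d.o.f.

    For 4 degrees of freedom the survival function has the closed form
    exp(-x/2) * (1 + x/2), so no external statistics library is needed.
    Assumes the two input p-values are independent.
    """
    x = fisher_statistic(p_e, p_s)
    return math.exp(-x / 2.0) * (1.0 + x / 2.0)


if __name__ == "__main__":
    # Toy numbers: 40 of 200 test genes fall in a 500-gene reference set,
    # out of a 20000-gene background.
    p_e = hypergeom_enrichment_p(N=20000, K=500, n=200, k=40)
    p_s = 0.01  # placeholder for the specificity p-value
    print(p_e, fisher_statistic(p_e, p_s), fisher_combined_p(p_e, p_s))
```

Ranking reference lists by the combined value is one way to prioritize candidates as described above; whether FIT ranks on the statistic itself or on its chi-square p-value is not stated in the text.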
| FOOTNOTES |
|---|
Associate Editor: Joaquin Dopazo
Received on September 18, 2007; revised on November 11, 2007; accepted on November 23, 2007
| REFERENCES |
|---|
Capecchi MR. Gene targeting in mice: functional analysis of the mammalian genome for the twenty-first century. Nat. Rev. Genet (2005) 6:507–512.
Edgar R, et al. Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res (2002) 30:207–210.
Farrer MJ. Genetics of Parkinson disease: paradigm shifts and future prospects. Nat. Rev. Genet (2006) 7:306–318.
The Gene Ontology Consortium. Creating the gene ontology resource: design and implementation. Genome Res (2001) 11:1425–1433.
Hedges L, Olkin I. Statistical Methods for Meta-Analysis. (1985) New York: Academic Press.
Irminger-Finger I, Jefford CE. Is there more to BARD1 than BRCA1? Nat. Rev. Cancer (2006) 6:382–391. doi:10.1038/nrc1878.
Kanehisa M, Goto S. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res (2000) 28:27–30.
Khatri P, Drăghici S. Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics (2005) 21:3587–3595. doi:10.1093/bioinformatics/bti565.
Nardini C, et al. Mining gene sets for measuring similarities. (2006) Proceedings of IEEE Symposium on Computers and Communications (ISCC). 227–232.
Oti M, Brunner HG. The modular nature of genetic diseases. Clin Genet (2007) 71:1–11.
Rossi S, et al. TOM: a web-based integrated approach for efficient identification of candidate disease genes. Nucleic Acids Res (2006) 34(Web Server issue):W285–W292. doi:10.1093/nar/gkl340.
Salomonis N, et al. GenMAPP 2: new features and resources for pathway analysis. BMC Bioinformatics (2007) 8:217–229. doi:10.1186/1471-2105-8-217.
Sokal RR, Rohlf FJ. Biometry. (2003) New York: Freeman.
Zhang B, et al. WebGestalt: an integrated system for exploring gene sets in various biological contexts. Nucleic Acids Res (2005) 33(Web Server issue):W741–W748. doi:10.1093/nar/gki475.