Bioinformatics Advance Access originally published online on January 19, 2007
Bioinformatics 2007 23(9):1132-1140; doi:10.1093/bioinformatics/btm001

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

A computational system to select candidate genes for complex human traits

Kyle J. Gaulton 1,2,3,*, Karen L. Mohlke 3 and Todd J. Vision 4

1Curriculum in Genetics and Molecular Biology, 2Bioinformatics and Computational Biology Training Program, Departments of 3Genetics and 4Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27516

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Identification of the genetic variation underlying complex traits is challenging. The wealth of information publicly available about the biology of complex traits and the function of individual genes permits the development of informatics-assisted methods for the selection of candidate genes for these traits.

Results: We have developed a computational system named CAESAR that ranks all annotated human genes as candidates for a complex trait by using ontologies to semantically map natural language descriptions of the trait with a variety of gene-centric information sources. In a test of its effectiveness, CAESAR successfully selected 7 out of 18 (39%) complex human trait susceptibility genes within the top 2% of ranked candidates genome-wide, a subset that represents roughly 1% of genes in the human genome and provides sufficient enrichment for an association study of several hundred human genes. This approach can be applied to any well-documented mono- or multi-factorial trait in any organism for which an annotated gene set exists.

Availability: CAESAR scripts and test data can be downloaded from http://visionlab.bio.unc.edu/caesar/

Contact: kgaulton{at}email.unc.edu


    1 INTRODUCTION
Unlike Mendelian traits, in which a mutation in one gene is causative, or oligogenic traits, where several genes are sufficient but not necessary, complex traits are caused by variation in multiple genetic and environmental factors, none of which are sufficient to cause the trait (Peltonen and McKusick, 2001). The contribution of any given gene to a complex trait is usually modest. In addition, complex traits often encompass a variety of phenotypes and biological mechanisms, making it difficult to determine which genes to study (Newton-Cheh and Hirschhorn, 2005).

As a result, traditional methods of genetic discovery, such as linkage analysis and positional cloning, while widely successful in identifying the genes for Mendelian traits, have had more limited success in identifying genes for complex traits. Candidate gene studies have had encouraging success, yet this approach requires an effective method for deciding a priori which genes have the greatest chance of influencing susceptibility to the trait (Dean, 2003). Recent advances in genotyping technology have provided researchers with the ability to test association in hundreds of genes relatively quickly, and even the entire genome through a genome-wide association study. Genome-wide association studies are promising, yet not always economically feasible or statistically desirable (Thomas, 2006). Therefore, one of the greatest challenges in disease association study design remains the intelligent selection of candidate genes.

To this end, we have developed a computational methodology, named CAESAR (CAndidatE Search And Rank), that uses text and data mining to rank genes according to potential involvement in a complex trait. CAESAR exploits the knowledge of complex traits in literature by using ontologies to semantically map the trait information to gene and protein-centric information from several different public data sources, including tissue-specific gene expression, conserved protein domains, protein–protein interactions, metabolic pathways and the mutant phenotypes of homologous genes. CAESAR uses four possible methods of integration to combine the results of data searches into a prioritized candidate gene list. In effect, CAESAR mimics the steps a researcher would undertake in selecting candidate genes, albeit faster, potentially more thoroughly, and in a more quantitative manner.

CAESAR represents a novel selection strategy in that it combines text and data mining to associate genetic information with extracted trait knowledge in order to prioritize candidate genes. In contrast to a number of existing approaches (Adie et al., 2006; Turner et al., 2003; van Driel et al., 2003), gene selection is not limited to one or more genomic regions, as all genes annotated in one of our databases are potential candidates. CAESAR is ultimately designed for traits in which the relevant biological processes may not be well understood and potentially hundreds of reasonable candidate genes exist.

The potential benefits to a researcher in adopting a computational approach to gene selection such as CAESAR include the ability to quickly and systematically process several hundred thousand biological annotations, many of which require highly specialized domain expertise to interpret. This benefit will continue to grow in importance as the volume and technical detail of annotation data increase. Relevant gene annotations can easily escape human consideration due to biases that investigators bring to the task of prioritization and that are difficult to overcome even by conscious effort. A systematic approach is therefore particularly valuable for complex traits, which may be affected by a wide array of biological processes, some of which may not have been directly implicated by previous studies. CAESAR also reports the evidence supporting the prioritization rank of each gene, allowing an investigator to trace the line of reasoning and to exercise his or her own judgment as to its validity. Thus, it can be seen as a sophisticated aid to manual prioritization.

Though designed to help with the design of an association study involving a few hundred genes, CAESAR can also be used to prioritize a smaller number of candidates within a region of linkage, or to prioritize among polymorphisms annotated with ranked genes that show significant association in a genome-wide study.

We have tested CAESAR on 18 susceptibility genes for 11 common complex traits in humans including type 1 and type 2 diabetes mellitus, schizophrenia, Parkinson's disease, cardiovascular disease, age-related macular degeneration, rheumatoid arthritis and celiac disease. Test genes were ranked higher than 95.7% of all ranked genes on average, and higher than 99.7% in the best case.


    2 METHODS
CAESAR comprises three main steps. First, previously implicated genes mentioned in the input text are identified, and ontology terms are ranked based on their similarity to the input text. Second, genes are ranked for each data source independently based on the relevance of the ontology terms with which they are annotated. Third, the individual gene lists are integrated to provide a single ranked list of candidate genes that combines evidence from all data sources. We refer to these three steps as text mining, data mining and data integration, respectively. The approach of CAESAR is presented as a schematic diagram in Figure 1a.


Fig. 1. CAESAR overview. (a) Text mining is used to extract gene symbols and ontology terms from the input. In the data-mining step, genes within each gene-centric data source are ranked based on the relevance to the trait-centric terms. In the data-integration step, the results from each source are combined into a single ranked list of candidates. Db = database. (b) Eight types of functional information (GO molecular function and biological process listed together) are queried using extracted genes and anatomy, phenotype and gene ontology terms. Genomic regions of interest represent optional user input. See text for abbreviations.

 
2.1 Text mining
CAESAR requires a user-defined body of text (referred to as a corpus) as input. This text is ideally an authoritative and comprehensive source of biological knowledge about the trait of interest. If an Online Mendelian Inheritance in Man (OMIM) (Hamosh et al., 2005) identifier is supplied, CAESAR will use the OMIM record as input. Alternatively, the user can provide any other body of text, for instance one or more review articles.

Since the corpus is written in natural language, the information must be converted to machine-readable form. This is done in two ways. First, human gene symbols are identified within the corpus. If an OMIM record is used as input, gene identifiers can be extracted directly from the OMIM database. Otherwise, gene symbols are extracted by matching to a reference list. Genes are weighted based on frequency of occurrence in the corpus, $f_g$, where the weight $c_g$ of extracted gene g is calculated as $f_g$ divided by the sum of all $f_g$ across the n total extracted genes, $c_g = f_g / \sum_{k=1}^{n} f_k$. The reference list of standard names, symbols, database identifiers and corresponding mouse homologs for each gene is compiled from Entrez Gene (Maglott et al., 2005) and Ensembl (Birney et al., 2006). The extracted genes are assumed to be relevant to the biology of the trait, but do not necessarily contribute to the genetic variation of the trait.
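The gene-weighting rule above can be illustrated with a minimal Python sketch. It is not the CAESAR implementation (which was written in Perl and Java); the function name `gene_weights`, the simple tokenizer, and the toy corpus and reference set are assumptions made for illustration only.

```python
import re
from collections import Counter

def gene_weights(corpus, reference_symbols):
    """Return {symbol: c_g}, where c_g = f_g / (sum of f_g over extracted genes).

    Hypothetical helper: symbols are found by exact, case-sensitive token
    matching against a reference set, which is a simplifying assumption.
    """
    # Split the text into runs of letters, digits and hyphens.
    tokens = re.findall(r"[A-Za-z0-9-]+", corpus)
    # f_g: number of occurrences of each reference symbol in the corpus.
    counts = Counter(t for t in tokens if t in reference_symbols)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gene: f / total for gene, f in counts.items()}

# Toy input (invented text, not taken from any real corpus).
corpus = "Variants in INS and CTLA4 affect risk; INS expression is reduced."
weights = gene_weights(corpus, {"INS", "CTLA4", "PTPN22"})
```

In this toy case INS occurs twice and CTLA4 once, so their weights are 2/3 and 1/3; PTPN22 is in the reference set but absent from the text, so it receives no weight.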

Second, the corpus is used to quantify the relevance of terms within several different biomedical ontologies. Four ontologies are used as part of CAESAR: the gene ontology biological process (GO bp) and molecular function (GO mf) (Harris et al., 2004), the mammalian phenotype ontology (MP) (Smith et al., 2005) and the eVOC anatomical ontology (Kelso et al., 2003) (Table 1). Relevance is quantified using a similarity search under a vector-space model (Salton et al., 1975), as follows (Fig. 2). For each ontology, the individual terms are split into separate documents containing the term name and, if available, the term description. These documents together comprise a document database, or search space, against which the corpus is queried (Fig. 2a). The corpus and each document are converted to vectors $v_i = \langle w_{i1}, w_{i2}, \ldots, w_{in} \rangle$ with dimensionality equal to the size of the word space n, which is the total number of unique words in the document database. Commonly used stop words such as ‘and’ and ‘the’ are removed from the word space. Each element of the vector for document i is calculated as $w_{ij} = e_{ij}$, where $e_{ij}$ is the number of occurrences of word j in the document.


Table 1. Data sources and ontologies used in CAESAR

 

Fig. 2. Vector-space similarity search. (a) Each ontology term and its description comprise a document, as in this example of three terms from the mammalian phenotype ontology. (b) The word space consists of all unique words. For illustration, here the word space is (‘insulin’, ‘resistance’, ‘glucose’). Each document, including the corpus, describes a vector in word space, where the elements of the vector are weighted counts within the document of each word in the word space. (c) The similarity of each of the documents to the corpus is measured as the cosine of the angle formed by the document and corpus vectors. High-ranking ontology terms have document vectors that are similar in direction to the corpus vector. In this example, MP:0005331 is the highest-ranking document.

 
The similarity of the corpus to each document is calculated as the cosine of the angle between the vectors, which is equal to the dot product of the vectors divided by the product of the magnitudes of the vectors. A larger cosine indicates vectors with greater similarity. Using this measure, ontology terms are weighted based on their similarity to the corpus (Fig. 2c), where the weight ct of term t is directly equal to the cosine.
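The vector-space search can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the modified Search::VectorSpace module used by CAESAR; the stop-word list, the term identifiers `T1` to `T3`, the term texts and the corpus are all invented for the example.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"and", "the", "of", "to", "in", "a"}  # illustrative subset only

def term_vector(text):
    """Count occurrences of each non-stop word, i.e. w_ij = e_ij."""
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOP_WORDS)

def cosine(u, v):
    """Dot product of two sparse vectors divided by the product of their magnitudes."""
    dot = sum(u[w] * v[w] for w in u if w in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_terms(corpus, term_documents):
    """Return [(term_id, c_t)] sorted by decreasing cosine similarity to the corpus."""
    doc_vectors = {tid: term_vector(doc) for tid, doc in term_documents.items()}
    # The word space is the set of unique words in the document database;
    # corpus words outside it are dropped.
    word_space = set().union(*doc_vectors.values())
    corpus_vec = Counter(
        {w: c for w, c in term_vector(corpus).items() if w in word_space}
    )
    weights = {tid: cosine(corpus_vec, vec) for tid, vec in doc_vectors.items()}
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical ontology terms and corpus text.
terms = {
    "T1": "insulin resistance",
    "T2": "abnormal glucose tolerance",
    "T3": "abnormal bone density",
}
corpus = "Insulin resistance and impaired glucose uptake precede the disease."
ranked = rank_terms(corpus, terms)
```

Here `T1` shares two of the three retained corpus words and ranks first, `T2` shares one, and `T3` shares none and receives a weight of zero.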

2.2 Data mining
Eight sources of gene-centric information are used to map ranked ontology terms to the genes annotated with them (Fig. 1b). The resulting output is eight lists of gene scores, one for each functional category.

Mammalian phenotype ontology terms are used to query the mouse genome database (MGD) (Blake et al., 2003) for genes producing a given phenotype when mutated and to query the genetic association database (GAD) (Becker et al., 2004) for genes showing positive evidence of association with a phenotype in a human population. The eVOC anatomical ontology terms are used to query the UniProt database (Bairoch et al., 2005) for genes expressed in a given tissue. Gene ontology terms are used to query the gene ontology annotation database (GOA) (Camon et al., 2003) for genes annotated with a given gene ontology biological process or molecular function term. Finally, the extracted genes are used to query the biomolecular interaction network database (BIND) (Alfarano et al., 2005) and the human protein reference database (HPRD) (Peri et al., 2004) for genes encoding proteins that interact with the protein products of the extracted genes, query the Kyoto encyclopedia of genes and genomes (KEGG) pathway database (Kanehisa et al., 2004) for other genes involved in the same human cellular pathways and query the InterPro protein domain database (IPro) (Apweiler et al., 2000) for genes sharing conserved protein domains with the extracted genes.

The user may also optionally input one or several genomic sequence regions to include genes in chromosomal regions implicated through genetic linkage as an additional list of genes (Fig. 1b).

The score rij of gene i for source j is then calculated as either the maximum, sum or mean of the weights of the k matching ontology terms or extracted genes c1 ... ck . The three alternatives weigh the combined evidence for relevance in different ways, as described below for data integration from multiple sources.
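As a rough sketch of this scoring step, the Python fragment below computes a per-source score for each gene from the weights of its matching terms. The function name `source_scores`, the gene names and the weights are hypothetical, and genes with no matching term are simply left unscored, which is an assumption about how missing annotations are handled.

```python
from statistics import mean

# The three alternative ways of combining the k matching weights c_1 ... c_k.
COMBINERS = {"max": max, "sum": sum, "mean": mean}

def source_scores(annotations, weights, method="max"):
    """Return {gene: r_ij} for one data source.

    annotations: {gene: iterable of ontology term IDs (or extracted genes)}
    weights:     {term ID: weight c_t from the text-mining step}
    """
    combine = COMBINERS[method]
    scores = {}
    for gene, terms in annotations.items():
        matched = [weights[t] for t in terms if t in weights]
        if matched:  # genes without any matching term stay unscored
            scores[gene] = combine(matched)
    return scores

# Invented term weights and annotations.
weights = {"T1": 0.8, "T2": 0.3}
annotations = {"GENE_A": ["T1", "T2"], "GENE_B": ["T2"], "GENE_C": ["T9"]}

by_max = source_scores(annotations, weights, "max")
by_sum = source_scores(annotations, weights, "sum")
by_mean = source_scores(annotations, weights, "mean")
```

For `GENE_A` the three methods give 0.8, 1.1 and 0.55, which shows how the sum rewards many matching terms while the mean is pulled down by a weak match.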

2.3 Data integration
The gene scores from the eight sources are integrated to produce one combined score for each gene. Integration is accomplished using one of four methods. Each method represents a different approach that an investigator might choose when manually prioritizing candidate genes on the basis of evidence from several data sources.

The first three methods involve taking the maximum, sum or mean of the z-transformed rij scores for each gene. The maximum favors genes with strong evidence from one data source, the sum favors genes with evidence in many data sources and the mean favors genes with strong evidence only, penalizing genes with any weak evidence. The maximum, mean and sum are referred to as int1, int2 and int3, respectively. Transformed scores are calculated as $z_{ij} = (r_{ij} - \bar{r}_j)/s_j$, where $\bar{r}_j$ is the mean and $s_j$ the SD of the scores from source j. The combined score $\varphi_{\cdot,i}$ is then obtained by calculating the maximum

$$\varphi_{\cdot,i} = \max_j z_{ij}$$

average

$$\varphi_{\cdot,i} = \operatorname{mean}_j z_{ij}$$

or sum

$$\varphi_{\cdot,i} = \sum_j z_{ij}$$

of the transformed scores for gene i.
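The first three integration methods can be sketched as follows. This is a minimal illustration under stated assumptions, not the published implementation: the z-transform is taken to be (score minus source mean) divided by the sample SD, and the maximum, mean or sum is taken only over the sources in which a gene has a score. Function names and data are hypothetical.

```python
from statistics import mean, stdev

def z_transform(scores):
    """Standardize one source's scores as (r - mean) / SD (sample SD assumed)."""
    values = list(scores.values())
    if len(values) < 2:
        return {gene: 0.0 for gene in scores}  # SD undefined for a single score
    mu, sd = mean(values), stdev(values)
    if sd == 0:
        return {gene: 0.0 for gene in scores}  # all scores identical
    return {gene: (r - mu) / sd for gene, r in scores.items()}

def integrate(per_source_scores, method="max"):
    """Combine z-scores across sources: 'max' (int1), 'mean' (int2) or 'sum' (int3)."""
    combine = {"max": max, "mean": mean, "sum": sum}[method]
    z_by_gene = {}
    for scores in per_source_scores:
        for gene, z in z_transform(scores).items():
            z_by_gene.setdefault(gene, []).append(z)
    return {gene: combine(zs) for gene, zs in z_by_gene.items()}

# Two invented sources; gene C is scored in the first source only.
source_1 = {"A": 3.0, "B": 1.0, "C": 2.0}
source_2 = {"A": 0.0, "B": 4.0}

int1 = integrate([source_1, source_2], "max")
int2 = integrate([source_1, source_2], "mean")
int3 = integrate([source_1, source_2], "sum")
```

Gene A scores well in the first source and poorly in the second, so it leads under the maximum but is pulled toward zero under the mean and the sum, mirroring the trade-offs described above.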

The fourth method, referred to as int4, differs from the other three by considering both the score of a gene within a data source as well as the number of genes returned for that data source. First, a transformed score sij is obtained.


Formula

The transformed gene scores are then summed together to provide a final score for each gene.


Formula

where gj is the number of genes returned for source j and


Formula

2.4 Implementation
The CAESAR algorithms were written using Perl version 5.8.1 and Java version 1.4.2. The vector space similarity searches were performed using a modified version of the Perl module Search::VectorSpace by Maciej Ceglowski (http://www.perl.com/pub/a/2003/02/19/engine.html). Databases and ontology schemas were downloaded and parsed into XML under a custom XML schema. Intermediate text and data-mining results were also stored as XML under the same schema.

2.5 Selection of the tests for complex traits
To assess the ability of CAESAR to choose valid candidates, 18 test genes were selected from recently published reports providing strong evidence of statistical association with known complex human disorders. The test genes included CTLA4 (Ueda et al., 2003), PTPN22 (Bottini et al., 2004), PTPN22 (Begovich et al., 2004), SUMO4 (Guo et al., 2004), FCRL3 (Kochi et al., 2005), ENTH (Pimm et al., 2005), EN2 (Gharani et al., 2004), TCF7L2 (Grant et al., 2006), CFH (Klein et al., 2005), LOC387715 (Rivera et al., 2005), LTA4H (Helgadottir et al., 2006), C2 (Gold et al., 2006), CFB (Gold et al., 2006), NPSR1 (Laitinen et al., 2004), MYO9B (Monsuur et al., 2005), IL2RA (Vella et al., 2005), SEMA5A (Maraganore et al., 2005) and LOC439999 (Grupe et al., 2006).

Each disorder required a custom corpus, either an OMIM record or one or more review articles describing the biology of the disorder (Table 2). Review articles were selected by searching PubMed (Wheeler et al., 2006) for articles published before the year of discovery of each gene association. Where multiple suitable review articles were available, the texts were concatenated to produce the corpus. We removed any direct reference to the test gene in the input text. In addition, entries in the GAD containing the test genes were removed. Thus, the input data closely mimicked the state of knowledge prior to the discovery of the positive association between the disease and the test gene.


Table 2. Tests using susceptibility genes for complex human traits

 
In the case of age-related macular degeneration (ARMD), positive associations for the two test genes, CFB and C2, were reported after the discovery of CFH as a susceptibility gene for the disease. Due to the absence of a suitable review article incorporating the discovery of CFH, results for these two test genes employ the ARMD OMIM corpus only.

A common way of summarizing the performance of previous candidate gene selection algorithms is to calculate ‘fold enrichment’, which is the total number of ranked genes divided by the rank of the test gene. Fold enrichment must be interpreted with caution, because it is not calculated relative to random expectation. Nonetheless, we report this statistic in order to facilitate comparison with other methods.
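For concreteness, fold enrichment can be computed as in the short Python sketch below. The companion function `percentile_above` reflects an assumed reading of statements of the form "ranked higher than X% of all ranked genes"; both function names and the numbers in the example are hypothetical.

```python
def fold_enrichment(rank, total_ranked):
    """Total number of ranked genes divided by the rank of the test gene."""
    return total_ranked / rank

def percentile_above(rank, total_ranked):
    """Percentage of ranked genes placed below the test gene (assumed definition)."""
    return 100.0 * (total_ranked - rank) / total_ranked

# Hypothetical example: a test gene ranked 50th among 15 000 ranked genes.
enrichment = fold_enrichment(50, 15000)   # 300-fold
percentile = percentile_above(50, 15000)  # about 99.67
```

Because the statistic depends only on the rank and the list length, a gene ranked in the middle of the list still shows 2-fold enrichment, which is why it should not be read as enrichment over random expectation.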


    3 RESULTS
3.1 Testing of recently discovered complex trait genes
We tested the performance of the algorithm on a set of test genes previously reported to be associated with 11 complex human diseases (Table 2). For each disease, we selected one or more genes for which recent population genetic studies have reported a significant association with the disease phenotype. Nearly 15 000 genes had sufficient information from one or more data sources to be ranked. Table 2 summarizes results of the 18 test genes by separately considering tests using review articles and OMIM records as input, although not all genes were tested using both input types. In order to report the success of CAESAR using all 18 genes, we combined review article tests for 16 genes with OMIM record tests for 2 genes, CFB and C2, which were not tested using review articles (see Methods section). The following results using all 18 test genes are thus not summarized in Table 2.

First, we evaluated the choice of data-mining method for determining the score rij of each gene i for each data source j (see Methods section). The distributions of the ranks are shown in Figure 3a. Each data-mining method used the int4 integration method (data for other integration methods not shown). The maximum method had a smaller median rank (549.5) than both the sum (1353) and mean (1020) methods.


Fig. 3. Box and whisker plot distributions of the ranks of 18 test genes in Table 2 using different CAESAR parameters. Ranks are plotted on a log scale. Plots are constructed so that the bounds of the box are the upper and lower quartiles, the line inside the box is the median, the whiskers extend to the most extreme value within 1.5 times the length of the box, and all remaining values are outliers. (a) Distribution of ranks using the max, sum and mean data-mining methods (int4 method for integration). (b) Distribution of ranks using the four different integration methods (max data-mining method).

 
Second, we evaluated the four different methods for the integration of data from different sources (Fig. 3b). Int4 yielded the smallest median rank (549.5) compared to the results for int1 (max), int2 (mean) and int3 (sum), which were 1488, 2594 and 1201, respectively. Furthermore, int4 had smaller upper and lower quartile ranks than int1, int2 and int3. We thus report the results for the maximum data-mining and int4 integration method in what follows.

Overall, 16 of 18 test genes were ranked with a median rank of 549.5 and 67-fold average enrichment. Seven of the 18 test genes (39%) were ranked higher than 98% of all ranked genes for the trait in question, while five (28%) ranked in the 99th percentile. The highest rank seen in our tests was 44 for CFB, a susceptibility gene for age-related macular degeneration, which corresponds to a 293-fold enrichment. Two of the genes, LOC387715 and LOC439999, were unranked due to a lack of information on these genes in any of the data sources.

We compared the observed distribution of the ranks for the 18 test genes to that expected by chance, which is a minimal test for the effectiveness of the method. The expected mean percentile for a random gene would be 50. The observed mean percentile is 80.5 and, under a binomial expectation, the 95% confidence interval is 66–95. Thus, the observed distribution of ranks for the test genes is significantly displaced relative to random expectation.

3.2 Comparison of input texts
We next examined the effect of the choice of corpus on the ranks for the test genes. Using review article corpus tests only, 14 of 16 test genes were ranked, with a median rank of 725 and 54-fold average enrichment. Six of the 16 test genes (37.5%) ranked in the 98th percentile, while four (25%) ranked in the 99th percentile (Table 2).

For comparison, we selected for each disease the relevant records from the OMIM database. For all tests the int4 method was used (Table 1). The test for candidate genes of myocardial infarction was omitted because the OMIM record for this disease is only ~100 words in length, which would be insufficient for reliably scoring a large number of ontology terms. Of the remaining 17 genes tested, 15 had sufficient information to be ranked. The median rank was 879 with an average 43-fold enrichment. The best performance was observed for CFB, with 293-fold enrichment. Three of the 17 test genes (17.6%) ranked in the 98th percentile of all ranked genes, while 2 of 17 (11.8%) ranked in the 99th percentile. Only one gene, SEMA5A, had a dramatically improved rank relative to that obtained using a corpus of published review articles. Thus, the ranks for the test genes using OMIM records, while still clearly an improvement over random expectation, are in most cases inferior to those obtained using review articles.

We examined whether the length of the input text could help explain the difference in performance between the two types of input text. The length of each corpus was measured as the number of words excluding stop words and non-word characters. There was no significant correlation between the length of the corpus and the rank obtained for each test gene (Spearman's ρ = –0.21, P = 0.27).
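A rank correlation of this kind can be computed as in the following standard-library sketch, which ranks both variables (averaging tied ranks) and then takes the Pearson correlation of the ranks. The data are invented for illustration and are not the corpus lengths or gene ranks from this study; no P-value is computed here.

```python
from statistics import mean

def average_ranks(values):
    """Assign 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented corpus lengths (words) and test-gene ranks.
corpus_lengths = [1200, 3400, 800, 5600, 2100]
gene_ranks = [900, 300, 1500, 700, 450]
rho = spearman_rho(corpus_lengths, gene_ranks)
```

A negative value, as in this toy data set, would mean that longer corpora tend to give numerically smaller (better) ranks.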

3.3 Analysis of bias
CAESAR is dependent on available annotations to rank genes. Therefore, the preferential ranking of well-annotated genes is a potential source of bias in the results. We addressed this issue in two ways, by measuring the effect of both breadth and depth of annotation on gene rank. We first measured the correlation between gene rank and the breadth of annotation, or the number of sources for which a gene is annotated, across each integration method. Using the default methods (max and int4), there is a strong correlation (ρ = –0.75), as shown in Figure 4. By comparison, again using the max method, int2 (ρ = –0.15) and int3 (ρ = –0.06) showed little correlation, while int1 showed modest correlation (ρ = –0.47).


Fig. 4. The relationship between the rank of a gene and the number of data sources in which it is annotated, using the max and int4 methods. Ranks are plotted on a log scale. Box and whisker plots were constructed as described for Figure 3.

 
We next addressed the correlation between gene rank and annotation depth by considering the number of GO annotations (biological process + molecular function) per gene. For each data-mining method, and using int4 for data integration, we calculated the mean number of GO terms for genes ranked in the 98th percentile or higher (max: 7.2 ± 4.1; mean: 6.2 ± 3.7; sum: 9.8 ± 5.3) and found this to be significantly higher than the mean number of GO terms across all ranked genes (4.6 ± 2.9) for all three data-mining methods (two-tailed, unpaired t-tests, P < 2 × 10^−16).

Data sources used by CAESAR include diverse available sources of gene-centric information; however, non-independence among data sources could also potentially bias the results. To address this issue, we measured the average correlation between the ranked gene lists for each tested trait using the review article corpus (Table 3). The majority of the sources show a mild, yet significant, correlation. No two data sources show a correlation greater than ρ = 0.43. Several pairs of sources show very weak negative correlations.


Table 3. Independence of CAESAR data sources

 

    4 DISCUSSION
The extraordinary amount of biological information available in the published literature and in publicly available databases about complex human diseases, on the one hand, and genes and their protein products, on the other, is well suited to the in silico identification of candidate genes for disease. The approach is enabled by ontologies that provide a semantic mapping between the natural language description of diseases and traits, and the functional annotation of genes and their products. It is further enabled by the availability of well-curated pathway and protein-interaction datasets, and a wide variety of functional information about not only the genes themselves, but also their homologs in model organisms. The approach implemented in CAESAR can, in principle, be applied to any complex trait in any organism for which similar information resources exist.

CAESAR relies on human expert knowledge in order to function effectively, but it does not require that the user actually possess all of this knowledge. At a minimum, the user needs to select a relevant corpus, but much more user intervention is possible. The user may manually modify the scores from the text-mining step and/or introduce genes in addition to those that were extracted from the corpus. The final rankings may be modified based on user perceptions of the importance of particular data sources. The user may also restrict the algorithm to consider only certain genomic regions or particular sets of genes. While it is not advisable to eliminate human judgment and oversight of the candidate gene selection process, due to the volume and the complexity of the information involved, semi-automated methods such as CAESAR may well outperform an unaided expert. At the very least, CAESAR provides a quantitative starting point for which the assumptions are clear and the user's biases are minimized.

The success of CAESAR in any given instance is due both to factors that are, at least to some extent, under the user's control and those that are not. The user's choice of a corpus that accurately reflects the biology of the trait is clearly of critical importance. In our experiments, we found that review articles generally, though not always, yielded better results than OMIM records. The explanation for this difference is not clear; it does not appear to be due to differences in corpus length.

Other factors under the user's control are algorithmic, e.g. how to calculate a score for a gene within a data source and how to rank genes across multiple data sources. The variety of simple methods used here can, in some cases, lead to substantially different rankings. One example is NPSR1, which had ranks of 749 and 2751 using int1 and int2, respectively. Four different data sources (GO bp, GO mf, IPro and tissue) report information on NPSR1, and the scores vary from high to low. Int1, which calculates the maximum, favors genes with a high score in one data source regardless of the others, whereas the low scores are detrimental to the final rank using int2, which calculates the average. Each of the methods can be justified (see Methods section), and it is not clear a priori which should be superior.

Overall, we found that the best results on the test set were obtained using a corpus of review articles, the maximum method for combining scores for a gene within a data source, and the int4 method for data integration across multiple sources. However, other combinations of parameters were superior for particular test genes. On the basis of our test results, we have selected the ‘max’ data-mining and ‘int4’ data-integration methods to be the default settings for CAESAR. The OMIM record, if available, is used as the input text by default, though our results suggest that one or more review articles should be used instead, or in addition, when possible.

A number of factors affecting CAESAR's success are outside of the user's control. One is the depth of biological knowledge about the complex trait under study and the extent to which this knowledge has been recorded. Another is the extent to which ontologies can be used to mediate between trait-centric and gene-centric information sources. For example, anatomical ontologies are available for mammals, but not yet for all organisms. Even where an ontology exists, certain terms may be missing, may lack listed synonyms, or may not be sufficiently well defined.

The process of extracting gene names from unstructured text is also error-prone (Hirschman et al., 2005), especially when using older bodies of text containing outdated gene names and symbols. Gene extraction is complicated further by the fact that genes often share symbols with other genes and non-gene acronyms.

Perhaps most importantly, CAESAR depends on the availability of functional information. Approximately half of the unique entries in our reference set remained unranked for any trait due to lack of annotation, including two of the test genes, LOC387715 and LOC439999. As the total number of ranked genes depends on the number of ontology terms that are mapped from the corpus, the success of CAESAR for a given trait depends on the information content of the corpus. One tested trait, myocardial infarction, did not have a sufficiently informative OMIM record. Therefore, CAESAR is limited to genes and traits for which there is sufficient information in the form of annotations and text descriptions, respectively. To the extent that this reflects incomplete knowledge of genes and traits, it is a limitation shared by all candidate gene approaches. The lack of gene-centric information, at least, can be partially overcome by including additional data sources from map-based studies, systematic functional genomic screens and other model systems in which homologs may have been characterized.

Given the importance of including a wide variety of functional information, CAESAR could be enhanced by the inclusion of additional data sources. A particularly valuable source would be data from transcription profiling experiments, which would provide information on a large proportion of genes that are lacking information from other sources. Inclusion of this data will be challenging, however, as the datasets available are diverse and heterogeneous, and it is not clear how best to score the relevance of a particular expression pattern to a trait.

Inclusion of additional data sources could potentially raise the issue of non-independence among them. Although no two data sources used in this study are highly correlated, most of them show a weak but statistically significant correlation. CAESAR does not currently correct for non-independence during the data-integration step.

A variety of in silico methods for candidate gene selection have previously been reported, though most have been designed and tested to prioritize positional candidates. GeneSeeker (van Driel et al., 2003) selected candidates in a given genomic region through web-based data mining of expression and phenotype databases. This approach enriched for disease genes in 10 monogenic disorders, providing 7-fold enrichment on average and 25-fold enrichment at best. POCUS (Turner et al., 2003) exploited functional similarities between genes at two or more loci to predict candidates, requiring no user input beyond the genomic regions of interest. It provided 12-, 29- and 42-fold enrichment on average for three test loci of increasing size and at best provided 81-fold enrichment. Perez-Iratxeta et al. (2002) used literature mining to associate pathology with GO terms and then used these terms to rank candidate genes. The authors created artificial loci containing an average of 300 genes for testing and found 10-fold enrichment on average and, at best, 38-fold enrichment. The correct disease gene was present in their enriched set for ~50% of the loci. Freudenberg and Propping (2002) computed similarity-based clusters of known disease genes based on phenotypic sharing between diseases. Their method selected the correct disease gene in roughly two-thirds of the cases, on average resulting in 10-fold enrichment, and in the top one-third of the cases resulting in 33-fold enrichment. Franke et al. (2006) developed a functional network of human genes to select candidate genes found in pathways with known disease genes. They constructed artificial loci that contained on average 100 genes, and found 20- and 10-fold enrichment on average in 27 and 34% of tested genes, respectively.

More recently, SUSPECTS (Adie et al., 2006) and ENDEAVOUR (Aerts et al., 2006) have been developed for application to more complex traits. Both of these systems prioritized genes using a combination of annotation and sequence features based on similarity to a training set. SUSPECTS was able to identify a test gene in artificial loci on average within the top 13% of candidates, a 7-fold enrichment. In half the cases, the test gene was in the top 5% of candidates, a 20-fold enrichment. ENDEAVOUR tested both monogenic and polygenic (complex) disorders using a test set of 200 genes. Over all tested disorders, ENDEAVOUR provided 9-fold enrichment on average and 200-fold enrichment at best. Considering polygenic disorders only, ENDEAVOUR provided 5-fold enrichment on average and 18-fold enrichment at best.
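
The relation between rank percentile and fold enrichment implied by the SUSPECTS figures can be sketched in a few lines of Python. This assumes that fold enrichment is the reciprocal of the fraction of ranked candidates that must be examined to reach the test gene. That reading is inferred from the numbers quoted above (top 5% corresponds to 20-fold) and is not a definition taken from the cited papers; the function name is hypothetical.

```python
def fold_enrichment(top_fraction):
    """Reciprocal of the fraction of ranked candidates needed to reach the gene.

    Assumed definition, inferred from the figures quoted in the text.
    """
    if not 0 < top_fraction <= 1:
        raise ValueError("top_fraction must be in (0, 1]")
    return 1.0 / top_fraction


# The two SUSPECTS percentiles mentioned in the text.
for fraction in (0.13, 0.05):
    print(f"top {fraction:.0%}: {fold_enrichment(fraction):.1f}-fold")
```

Under this assumption the top 13% works out to a little under 8-fold, which the text reports as 7-fold, so the published values may have been rounded down or computed slightly differently.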

The measure of success for an approach such as CAESAR ultimately depends on the specific application. Our goal has been the enrichment of candidates within the top 2% of ranked genes, which represents roughly the top 1% of genes in the human genome. Given the number of functionally annotated human genes, this corresponds to 250–300 genes, which is a reasonable number to include in a high-resolution SNP association study for a complex disease in human populations. Our results suggest that approximately one-third to one-half of the genes previously associated with complex human disease would be included in this enriched candidate set. With a complex trait, for which the true effectors are only partially known, it is difficult to quantify the number of true and false positives. Nonetheless, assuming all genes outside of our test set are negatives, we can calculate sensitivity as TP/(TP+FN) and specificity as TN/(TN+FP), where TP is the number of true positives, TN is the number of true negatives, FP is the number of false positives and FN is the number of false negatives. Considering positives to be the top 2% of ranked genes, we obtained an overall sensitivity of 39% and specificity of 98% for our test set. Other measures of success may be relevant for different applications, such as prioritizing SNPs for follow-up work from a genome-wide association study. By standard measures, CAESAR compares favorably with other methods, even though we use a test set of genes associated with complex rather than monogenic or oligogenic diseases. The highest (293) and average (67) fold enrichment obtained with CAESAR are greater than those reported for other systems.
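
As a worked illustration of the two formulas, the following Python sketch computes sensitivity and specificity from a confusion table. The counts are hypothetical, chosen only so that 2% of 10 000 ranked genes are called positive; they are not the counts behind the 39% and 98% reported above, and the function names are introduced here for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of true disease genes that fall inside the positive set."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Fraction of presumed negatives that fall outside the positive set."""
    return tn / (tn + fp)


# Hypothetical counts (not the paper's data): 10 000 ranked genes, of which
# the top 2% (200 genes) are called positive, and 10 test genes in total.
tp, fn = 4, 6        # test genes inside / outside the top 2%
fp, tn = 196, 9794   # all other genes inside / outside the top 2%

sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
print(f"sensitivity = {sens:.1%}, specificity = {spec:.1%}")
```

When the positive set is fixed at the top 2% and the test genes are few compared with the number of ranked genes, specificity under the all-others-are-negative assumption stays close to 98% almost regardless of where the test genes rank. Sensitivity is therefore likely the more informative of the two figures.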

CAESAR makes use of a relatively small trait-specific corpus, comprised of one to several review articles, and a large body of gene-centric information. A similar approach could be useful for other applications involving semantic mediation between larger corpora or sets of corpora.

In conclusion, CAESAR can successfully mine large amounts of biological information to guide the selection of candidate genes for complex diseases in humans. Applications include selection of candidate genes for association or re-sequencing studies, prioritization of candidates for functional genomics experiments, or evaluation of results from linkage and genome-wide association studies. The approach may be extended to select candidates for complex traits in other organisms for which similar informatic resources are available. No computational system can select candidate genes with certainty; however, when used as a guide, CAESAR is a useful tool for candidate gene prioritization.


    ACKNOWLEDGEMENTS
 
This work was supported by the National Institutes of Health (DK72193 to K.L.M.), the National Science Foundation (0227314 to T.J.V.) and a Burroughs Wellcome Career Award in the Biomedical Sciences (K.L.M.). Funding to pay the Open Access publication charges was provided by 0227314, DK72193 and the UNC Office of the Vice Chancellor for Research and Economic Development.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on October 30, 2006; revised on January 2, 2007; accepted on January 8, 2007

    REFERENCES

    Adie E, et al. Suspects: enabling fast and effective prioritization of positional candidates. Bioinformatics (2006) 22:773–774.

    Aerts S, et al. Gene prioritization through genomic data fusion. Nat. Biotechnol. (2006) 24:537–544.

    Alfarano C, et al. The biomolecular interaction database and related tools 2005 update. Nucleic Acids Res. (2005) 33:D418–D424.

    Apweiler R, et al. Interpro-an integrated documentation resource for protein families, domains and functional sites. Bioinformatics (2000) 16:1145–1150.

    Bairoch A, et al. The universal protein resource (Uniprot). Nucleic Acids Res. (2005) 33:D154–D159.

    Becker K, et al. The genetic association database. Nat. Genet. (2004) 36:431–432.

    Begovich A, et al. A missense single-nucleotide polymorphism in a gene encoding a protein tyrosine phosphatase (PTPN22) is associated with rheumatoid arthritis. Am. J. Hum. Genet. (2004) 75:330–337.

    Birney E, et al. Ensembl 2006. Nucleic Acids Res. (2006) 34:D556–D561.

    Blake J, et al. MGD: the mouse genome database. Nucleic Acids Res. (2003) 31:193–195.

    Bottini N, et al. A functional variant of lymphoid tyrosine phosphatase is associated with type 1 diabetes. Nat. Genet. (2004) 36:337–338.

    Camon E, et al. The gene ontology annotation (GOA) project: implementation of GO in swiss-prot, trembl and interpro. Genome Res. (2003) 13:662–672.

    Dean M. Approaches to identify genes for complex human diseases: lessons from mendelian disorders. Hum. Mutat. (2003) 22:261–274.

    Franke L, et al. Reconstruction of a functional human gene network, with an application for prioritizing positional candidate genes. Am. J. Hum. Genet. (2006) 78:1011–1025.

    Freudenberg J, Propping P. A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics (2002) 18:S110–S115.

    Gharani N, et al. Association of the homeobox transcription factor, ENGRAILED 2, 3, with autism spectrum disorder. Mol. Psychiatry (2004) 5:474–484.

    Gold B, et al. Variation in factor B (BF) and complement component 2 (C2) genes is associated with age-related macular degeneration. Nat. Genet. (2006) 38:458–462.

    Grant S, et al. Variant of transcription factor 7-like 2 (TCF7L2) gene confers risk of type 2 diabetes. Nat. Genet. (2006) 38:320–323.

    Grupe A, et al. A scan of chromosome 10 identifies a novel locus showing strong association with late-onset alzheimer disease. Am. J. Hum. Genet. (2006) 78:78–88.

    Guo D, et al. A functional variant of SUMO4, a new I kappa B alpha modifier, is associated with type 1 diabetes. Nat. Genet. (2004) 36:837–841.

    Hamosh A, et al. Online mendelian inheritance in man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res. (2005) 33:D514–D517.

    Harris M, et al. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. (2004) 32:D258–D261.

    Helgadottir A, et al. A variant of the gene encoding leukotrine A4 hydrolase confers ethnicity-specific risk of myocardial infarction. Nat. Genet. (2006) 38:68–74.

    Hirschman L. Overview of BioCreAtIvE task 1B: normalized gene lists. BMC Bioinformatics (2005) 6:S11.

    Kanehisa M, et al. The KEGG resource for deciphering the genome. Nucleic Acids Res. (2004) 32:D277–D280.

    Kelso J, et al. eVOC: a controlled vocabulary for unifying gene expression data. Genome Res. (2003) 13:1222–1230.

    Klein R, et al. Complement factor H polymorphism in age-related macular degeneration. Science (2005) 308:385–389.

    Kochi Y, et al. A functional variant in FCRL3, encoding fc receptor-like 3, is associated with rheumatoid arthritis and several autoimmunities. Nat. Genet. (2005) 37:478–485.

    Laitinen T, et al. Characterization of a common susceptibility locus for asthma-related traits. Science (2004) 304:300–304.

    Maglott D, et al. Entrez gene: gene-centric information at NCBI. Nucleic Acids Res. (2005) 33:D54–D58.

    Maraganore D, et al. High-resolution whole-genome association study of parkinson's disease. Am. J. Hum. Genet. (2005) 77:685–693.

    Monsuur A, et al. Myosin IXB variant increases the risk of celiac disease and points toward a primary intestinal barrier defect. Nat. Genet. (2005) 37:1341–1344.

    Newton-Cheh C, Hirschhorn J. Genetic association studies of complex traits: design and analysis issues. Mutat. Res. (2005) 573:54–69.

    Peltonen L, McKusick V. Genomics and medicine: dissecting human disease in the postgenomic era. Science (2001) 291:1224–1229.

    Perez-Iratxeta C, et al. Association of genes to genetically inherited diseases using data mining. Nat. Genet. (2002) 31:316–319.

    Peri S, et al. Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res. (2004) 32:D497–D501.

    Pimm J, et al. The epsin 4 gene of chromosome 5q, which encodes the clathrin-associated protein enthoprotin, is involved in the genetic susceptibility to schizophrenia. Am. J. Hum. Genet. (2005) 76:902–907.

    Rivera A, et al. Hypothetical LOC387715 is a second major susceptibility gene for age-related macular degeneration, contributing independently of complement factor H to disease risk. Hum. Mol. Genet. (2005) 14:3227–3236.

    Salton G, et al. A Vector Space Model for Automatic Indexing. Commun. ACM (1975) 18:613–620.

    Smith C, et al. The mammalian phenotype ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol. (2005) 6:R7.

    Thomas D. Are we ready for genome-wide association studies? Cancer Epidemiol. Biomarkers Prev. (2006) 15:595–598.

    Turner F, et al. POCUS: mining genomic sequence annotation to predict disease genes. Genome Biol. (2003) 4:R75.

    Ueda H, et al. Association of the T-cell regulatory gene CTLA4 with susceptibility to autoimmune disease. Nature (2003) 423:503–511.

    van Driel M, et al. A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur. J. Hum. Genet. (2003) 11:57–63.

    Vella A, et al. Localization of a type 1 diabetes locus in the IL2RA/CD25 region by use of tag single-nucleotide polymorphisms. Am. J. Hum. Genet. (2005) 75:773–779.

    Wheeler D, et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. (2006) 22:D173–D180.

