Bioinformatics Advance Access originally published online on January 19, 2006
Bioinformatics 2006 22(6):773-774; doi:10.1093/bioinformatics/btk031
SUSPECTS: enabling fast and effective prioritization of positional candidates
Medical Genetics Section, School of Molecular and Clinical Medicine, University of Edinburgh EH4 2XU UK
*To whom correspondence should be addressed at MMC, Western General Hospital, Crewe Road, Edinburgh EH4 2XU, UK
| ABSTRACT |
|---|
Summary: SUSPECTS is a web-based server which combines annotation and sequence-based approaches to prioritize disease candidate genes in large regions of interest. It uses multiple lines of evidence to rank genes quickly and effectively while limiting the effect of annotation bias to significantly improve performance.
Availability: SUSPECTS is freely available at http://www.genetics.med.ed.ac.uk/suspects/
Contact: euan.adie{at}ed.ac.uk
Supplementary information: A quick-start guide in Macromedia Flash format is available at http://www.genetics.med.ed.ac.uk/suspects/help.shtml and Excel spreadsheets detailing the comparative performance of the software are included as Supplementary material.
| INTRODUCTION |
|---|
When searching for the genetic basis of disease, the regions of interest identified through complex-trait linkage studies regularly exceed 30 cM in size and can contain hundreds of genes (McCarthy et al., 2003). Existing tools that help researchers prioritize candidates for further study fall into two distinct classes: those based on functional annotation (Perez-Iratxeta et al., 2002; Freudenberg and Propping, 2002; Van Driel et al., 2003; Turner et al., 2003; Tiffin et al., 2005) and those based on sequence features (Adie et al., 2005; Lopez-Bigas and Ouzounis, 2004).
Methods based on functional annotation can suffer from annotation bias, as they are unable to deal with genes lacking sufficiently detailed annotation. Sequence-based methods make use of intrinsic characteristics of genes such as length, homology to genes in other species and base composition. As these characteristics can be readily computed from sequence, they avoid the problem of annotation bias. However, sequence-based methods prioritize genes on the basis of their potential for involvement in disease in general rather than involvement in the specific disease of interest to the user.
SUSPECTS is a novel, consolidated approach that combines the increased precision of annotation-based methods with the better recall of sequence-based methods, avoiding the problems outlined above. Given a set of existing candidate genes for a particular complex or oligogenic disease, it effectively automates further candidate gene selection from large regions on the principle that genes involved in that disease will tend to share the same or similar annotation, reflecting common biological pathways.
| PRIORITIZING CANDIDATES WITH SUSPECTS |
|---|
Users of SUSPECTS can enter a region of interest by specifying flanking markers, chromosomal coordinates or bands. Alternatively, the software will examine a region of interest automatically centred on a single marker.
Users then enter the name of the disease to be considered; the software will automatically retrieve genes implicated in that disorder from OMIM (Hamosh et al., 2002), HGMD (Cooper et al., 1998) and GAD (Becker et al., 2004). Alternatively, users can manually enter a list of genes thought to be involved in the pathogenesis of the disease. These genes are known as the training set.
Each positional candidate gene is then scored automatically (see Methodology). Higher scores represent better candidates. The user is presented with a graphical overview of the region of interest (Fig. 1). The graphical overview is a hyperlinked image map that can be used to obtain more detailed information about each candidate gene and the reasoning behind its score. The list of candidate genes ranked by score is presented as a table underneath the graphical overview.
| METHODOLOGY |
|---|
Each gene in the region of interest is scored on its suitability as a candidate for further study based on four lines of evidence: first, by Prospectr (Adie et al., 2005) on the basis of its sequence features; second, by the extent of coexpression with the training set based on GNF expression data (Su et al., 2002); third, by the number of rare (found in <5% of all proteins) InterPro domains shared with the training set; and finally, by the level of semantic similarity (Lord et al., 2003) that the GO terms assigned to it share with the GO terms assigned to genes in the training set.
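As an illustration of the third line of evidence, the following minimal Python sketch counts rare domains shared between a candidate gene and a training set. It is not taken from the SUSPECTS implementation: the function names, the dictionary-based input format and the toy identifiers are assumptions made for this example, and only the <5% rarity threshold comes from the text.

```python
from collections import Counter


def rare_domains(protein_domains, max_fraction=0.05):
    """Return domains present in fewer than max_fraction of all proteins.

    protein_domains maps a (hypothetical) gene/protein id to a list of
    domain accessions. The 5% threshold follows the text; the rest of
    this function is an illustrative assumption.
    """
    n_proteins = len(protein_domains)
    # Count each domain once per protein, even if it repeats within it.
    counts = Counter(d for doms in protein_domains.values() for d in set(doms))
    return {d for d, c in counts.items() if c / n_proteins < max_fraction}


def shared_rare_domain_count(candidate, training_set, protein_domains, rare):
    """Count rare domains the candidate shares with any training-set gene."""
    candidate_rare = set(protein_domains.get(candidate, ())) & rare
    training_domains = set()
    for gene in training_set:
        training_domains |= set(protein_domains.get(gene, ()))
    return len(candidate_rare & training_domains)


# Toy data: 100 proteins share one common domain; two also carry a rare one.
toy = {f"g{i}": ["IPR_common"] for i in range(100)}
toy["g0"].append("IPR_rare")
toy["g1"].append("IPR_rare")
toy_rare = rare_domains(toy)
```

Restricting the comparison to rare domains means that a ubiquitous domain does not make every candidate look similar to the training set.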
The four scores are then combined. Each score is weighted depending on the amount of information available for each line of evidence. If little or no information is available then the importance of that score is decreased accordingly. This ensures that the scores of genes which lack sufficiently detailed GO terms or expression profiles do not suffer from annotation bias. The final score ranges from 0 to 100 where 100 represents a perfect match between the candidate gene and all genes in the training set.
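The exact weighting formula is not given here, so the following Python sketch shows only one plausible way to realise the idea: a weighted mean in which each weight reflects how much information is available for that line of evidence. The function name, the [0, 1] scaling of scores and weights, and the normalisation by total weight are all assumptions for illustration.

```python
def combine_scores(evidence):
    """Combine per-evidence scores into a single 0-100 score.

    evidence is a list of (score, weight) pairs, with score in [0, 1]
    and weight in [0, 1] expressing how much information backs the
    score (0 = no data available). A normalised weighted mean is an
    assumption of this sketch, not the published formula.
    """
    total_weight = sum(weight for _, weight in evidence)
    if total_weight == 0:
        # No line of evidence carries any information.
        return 0.0
    weighted_sum = sum(score * weight for score, weight in evidence)
    return 100.0 * weighted_sum / total_weight


# A gene with no expression or domain data is judged only on the
# evidence that exists, rather than being penalised for the gaps.
sparse_gene = [(0.8, 1.0), (0.0, 0.0), (0.0, 0.0), (0.8, 1.0)]
perfect_gene = [(1.0, 1.0)] * 4
```

Under this scheme a missing line of evidence drops out of the average instead of contributing a zero, which is the behaviour the paragraph above describes.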
| COMPARATIVE PERFORMANCE |
|---|
Approaches based on functional annotation rely on good quality information being available for each possible candidate gene. Conversely, SUSPECTS is able to prioritize all genes including those which lack detailed GO, domain or expression data, although when available those lines of evidence contribute favourably to overall performance.
The performance of SUSPECTS was tested with a set of oligogenic and complex disorders including Alzheimer's disease, hypertension, autism and systemic lupus erythematosus. The set is derived from that used by Turner et al. to test POCUS, an annotation-based classifier (Turner et al., 2003).
At least three implicated genes for each disease were available. For each implicated gene, a region of interest was created containing the implicated gene itself (the target gene) and every gene within 7.5 Mb on either side. On average, each region of interest contained 155 genes. An associated training set was then created containing the remaining implicated genes for each disorder.
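The test design described above can be sketched in Python as follows. The `Gene` record, the use of a single coordinate per gene and the function names are hypothetical simplifications for this example; only the 7.5 Mb window and the leave-one-out construction of the training set come from the text.

```python
from dataclasses import dataclass

WINDOW_BP = 7_500_000  # 7.5 Mb on either side of the target gene


@dataclass(frozen=True)
class Gene:
    name: str
    chrom: str
    position: int  # assumed single coordinate, e.g. the gene midpoint


def region_of_interest(target, genome, window=WINDOW_BP):
    """Return the target plus every gene within `window` bp on its chromosome."""
    return [
        g for g in genome
        if g.chrom == target.chrom and abs(g.position - target.position) <= window
    ]


def leave_one_out_cases(implicated, genome):
    """Yield (target, region, training_set) for each implicated gene.

    The training set holds the remaining implicated genes for the
    disorder, so the target is never used to score its own region.
    """
    for target in implicated:
        training = [g for g in implicated if g != target]
        yield target, region_of_interest(target, genome), training


# Toy genome with made-up gene names and coordinates.
toy_genome = [
    Gene("A", "1", 10_000_000),
    Gene("B", "1", 15_000_000),
    Gene("C", "1", 30_000_000),
    Gene("D", "2", 10_000_000),
]
toy_implicated = [toy_genome[0], toy_genome[2], toy_genome[3]]
toy_cases = list(leave_one_out_cases(toy_implicated, toy_genome))
```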
We first ranked each region of interest using a classifier based on sequence features alone (Prospectr). On average the target gene was in the top 31.23% of the resulting ranked lists of candidates and in the top 5% of those lists 20 times out of 155 (13%).
In comparison, on average the target gene was in the top 12.93% of the ranked list from SUSPECTS, which took both the region of interest and the training set as input in each case. The target gene was in the top 5% of the ranked list 87 times out of 155 (56%). The test results for both the sequence features classifier and SUSPECTS have been made available as Supplementary information.
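The two summary figures reported above, the average position of the target gene in the ranked list and the number of times it falls in the top 5%, can be computed along the following lines. This is a minimal sketch with hypothetical function names and toy scores; it assumes that higher scores rank first and that ties are broken by input order.

```python
def rank_percentile(scores, target):
    """Position of `target` in the ranked list, as a percentage.

    scores maps gene name to score; higher scores rank first. A value
    of 5.0 means the target lies exactly at the top-5% boundary.
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return 100.0 * (ranked.index(target) + 1) / len(ranked)


def summarize(percentiles, cutoff=5.0):
    """Return (mean percentile, number of targets within the top `cutoff`%)."""
    mean = sum(percentiles) / len(percentiles)
    hits = sum(1 for p in percentiles if p <= cutoff)
    return mean, hits


# Toy region of 100 genes with strictly decreasing, made-up scores.
toy_scores = {f"g{i}": 100 - i for i in range(100)}
toy_percentiles = [rank_percentile(toy_scores, g) for g in ("g0", "g4", "g9")]
toy_mean, toy_hits = summarize(toy_percentiles)
```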
In conclusion, SUSPECTS significantly improves on the performance of candidate prioritization methods which use annotation or sequence data alone and is of value to researchers faced with large regions of interest. It is fast, easy to use and freely available on the World Wide Web at http://www.genetics.med.ed.ac.uk/suspects/
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Satoru Miyano
Received on October 26, 2005; revised on December 12, 2005; accepted on December 28, 2005
| REFERENCES |
|---|
Adie, E.A., et al. (2005) Speeding disease gene discovery by sequence based candidate prioritization. BMC Bioinformatics, 6, 55.
Becker, K.G., et al. (2004) The Genetic Association Database. Nat. Genet., 36, 431–432.
Cooper, D.N., et al. (1998) The human gene mutation database. Nucleic Acids Res., 26, 285–287.
Freudenberg, J. and Propping, P. (2002) A similarity-based method for genome-wide prediction of disease-relevant human genes. Bioinformatics, 18, S110–S115.
Hamosh, A., et al. (2002) Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., 30, 52–55.
Lopez-Bigas, N. and Ouzounis, C.A. (2004) Genome-wide identification of genes likely to be involved in human genetic disease. Nucleic Acids Res., 32, 3108–3114.
Lord, P.W., et al. (2003) Investigating semantic similarity measures across the Gene Ontology: the relationship between sequence and annotation. Bioinformatics, 19, 1275–1283.
McCarthy, M., et al. (2003) New methods for finding disease-susceptibility genes: impact and potential. Genome Biol., 4, 119.
Perez-Iratxeta, C., et al. (2002) Association of genes to genetically inherited diseases using data mining. Nat. Genet., 31, 316–319.
Su, A., et al. (2002) Large-scale analysis of the human and mouse transcriptomes. Proc. Natl Acad. Sci. USA, 99, 4465–4470.
Tiffin, N., et al. (2005) Integration of text- and data-mining using ontologies successfully selects disease gene candidates. Nucleic Acids Res., 33, 1544–1552.
Turner, F., et al. (2003) POCUS: mining genomic sequence annotation to predict disease genes. Genome Biol., 4, R75.
Van Driel, M.A., et al. (2003) A new web-based data mining tool for the identification of candidate genes for human genetic disorders. Eur. J. Hum. Genet., 11, 57–63.