Bioinformatics Advance Access published online on April 5, 2009
Bioinformatics, doi:10.1093/bioinformatics/btp158
Evaluation of genome-wide association study results through development of ontology fingerprint
1 Bioinformatics Graduate Program, Department of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, Charleston, SC.
2 Department of Biostatistics and Center for Statistical Genetics, School of Public Health, University of Michigan, Ann Arbor, MI
3 Division of Endocrinology, Metabolism, and Medical Genetics, Department of Medicine, Medical University of South Carolina
4 Research Service, Ralph H. Johnson Department of Veterans Affairs Medical Center, Charleston, SC
5 Department of Biostatistics, Bioinformatics & Epidemiology, Medical University of South Carolina, Charleston, SC
*To whom correspondence should be addressed. W. Jim Zheng, E-mail: zhengw{at}musc.edu
Abstract
Motivation: Genome-wide association (GWA) studies may identify multiple variants that are associated with a disease or trait. To narrow down candidates for further validation, quantitatively assessing how identified genes relate to a phenotype of interest is important.
Results: We describe an approach to characterize genes or biological concepts (phenotypes, pathways, diseases, etc.) by their ontology fingerprint: the set of Gene Ontology terms that are overrepresented among the PubMed abstracts discussing the gene or biological concept, together with the enrichment p-value of each term from a hypergeometric enrichment test. We then quantify the relevance of genes to the trait from a GWA study by calculating similarity scores between their ontology fingerprints using the enrichment p-values. We validated this approach by correctly identifying the corresponding genes for biological pathways, with an average area under the ROC curve (AUC) of 90%. We applied the approach to rank genes identified through a GWA study as associated with plasma lipid concentrations, and to prioritize genes within a linkage disequilibrium (LD) block. The genes with the highest scores were ABCA1, LPL and CETP for HDL; LDLR, APOE and APOB for LDL; and LPL, APOA1 and APOB for triglycerides. In addition, we identified genes relevant to lipid metabolism from the literature even in cases where such knowledge was not reflected in the current annotation of these genes. These results demonstrate that ontology fingerprints can be used effectively to prioritize genes from GWA studies for experimental validation.
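The two computational steps named above, a hypergeometric enrichment test per ontology term and a similarity score between two fingerprints, can be illustrated with a minimal Python sketch. The abstract does not give the scoring formula, so the cosine similarity on -log10 p-values used here is an assumption chosen for illustration, not the authors' method. The function names, the `alpha` cutoff and the toy counts are likewise hypothetical.

```python
from math import comb, log10, sqrt


def hypergeom_enrichment_p(N, K, n, k):
    """Upper-tail hypergeometric p-value P(X >= k).

    N: total abstracts in the corpus
    K: abstracts in the corpus that mention the ontology term
    n: abstracts that discuss the gene or concept
    k: abstracts that discuss the gene or concept AND mention the term
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def fingerprint(term_counts, n_entity, term_totals, N, alpha=0.05):
    """Map each overrepresented term to its enrichment p-value.

    The alpha cutoff is an illustrative assumption, not taken from the paper.
    """
    fp = {}
    for term, k in term_counts.items():
        p = hypergeom_enrichment_p(N, term_totals[term], n_entity, k)
        if p < alpha:
            fp[term] = max(p, 1e-300)  # floor avoids log10(0) on underflow
    return fp


def similarity(fp_a, fp_b):
    """Cosine similarity of -log10(p) vectors (assumed scoring function)."""
    shared = set(fp_a) & set(fp_b)
    # (-log10 a) * (-log10 b) equals log10 a * log10 b
    dot = sum(log10(fp_a[t]) * log10(fp_b[t]) for t in shared)
    norm_a = sqrt(sum(log10(p) ** 2 for p in fp_a.values()))
    norm_b = sqrt(sum(log10(p) ** 2 for p in fp_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Toy corpus: 1000 abstracts; all counts below are made up for illustration.
N = 1000
term_totals = {"lipid transport": 50, "cell cycle": 200}
gene_fp = fingerprint({"lipid transport": 10, "cell cycle": 4}, 20, term_totals, N)
trait_fp = fingerprint({"lipid transport": 15, "cell cycle": 6}, 30, term_totals, N)
score = similarity(gene_fp, trait_fp)
```

In this toy setup only "lipid transport" is enriched for both the gene and the trait, so it is the only term that contributes to the score. Ranking candidate genes would then amount to sorting them by their score against the trait's fingerprint.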
Associate Editor: Prof. John Quackenbush
Received on January 8, 2009; revised on February 25, 2009; accepted on March 14, 2009