Bioinformatics Advance Access published online on February 17, 2009
Bioinformatics, doi:10.1093/bioinformatics/btp085
Identification of differential gene pathways with principal component analysis
1Department of Epidemiology and Public Health, Yale University, New Haven, CT, 06510
2Department of Biostatistics, University of North Carolina, Chapel Hill, NC, 27599
*To whom correspondence should be addressed. Shuangge Ma, E-mail: shuangge.ma{at}yale.edu
Abstract
Motivation: The development of high-throughput technology makes it possible to measure the expression of thousands of genes simultaneously. Genes have an inherent pathway structure, in which pathways are composed of multiple genes with coordinated biological functions. It is of great interest to identify differential gene pathways that are associated with variation in phenotypes.
Results: We propose the following approach for detecting differential gene pathways. First, we construct gene pathways using databases such as KEGG or GO. Second, for each pathway, we extract a small number of representative features, which are linear combinations of gene expressions and/or their transformations. Specifically, we propose using (a) principal components of gene expression sets, (b) principal components of expanded gene expression sets, and (c) expanded sets of principal components of gene expressions as the representative features. Third, we identify differential gene pathways as those whose representative features are significantly associated with variation in phenotypes, particularly disease clinical outcomes, in regression models. The false discovery rate approach is used to adjust for multiple comparisons. Analysis of three gene expression datasets suggests that (a) the proposed approach can effectively identify differential gene pathways; (b) principal components that explain only a small amount of the variation in gene expression may carry significant associations between gene pathways and phenotypes; (c) including second-order terms of gene expressions may lead to the identification of new differential gene pathways; (d) the proposed approach is relatively insensitive to additional noise; and (e) the proposed approach can identify gene pathways missed by alternative approaches.
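The feature-construction and multiple-comparison steps described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy data, and the p-values are hypothetical, the first principal component is obtained by simple power iteration, and the Benjamini-Hochberg procedure is assumed as one common false discovery rate adjustment (the abstract does not say which procedure is used). The regression step that would produce the per-pathway p-values is omitted.

```python
from itertools import combinations
from math import sqrt


def expand_second_order(rows):
    """Append squares and pairwise products of the genes to each sample (row)."""
    expanded = []
    for row in rows:
        squares = [x * x for x in row]
        products = [a * b for a, b in combinations(row, 2)]
        expanded.append(list(row) + squares + products)
    return expanded


def first_pc_scores(rows, n_iter=500):
    """Score of each sample on the first principal component.

    Uses power iteration on the sample covariance matrix. Assumption: the
    fixed start vector is not orthogonal to the leading eigenvector.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    start = [float(j + 1) for j in range(p)]
    start_norm = sqrt(sum(x * x for x in start))
    v = [x / start_norm for x in start]
    for _ in range(n_iter):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = sqrt(sum(x * x for x in w))
        if norm == 0.0:  # all columns constant: no direction to find
            break
        v = [x / norm for x in w]
    return [sum(c[j] * v[j] for j in range(p)) for c in centered]


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical pathway: 5 samples (rows) by 3 genes (columns).
    pathway = [[1.0, 2.1, 0.5], [1.5, 2.9, 0.7], [2.2, 4.1, 0.2],
               [2.8, 6.2, 0.9], [3.1, 6.0, 0.4]]
    feature_a = first_pc_scores(pathway)                       # option (a)
    feature_b = first_pc_scores(expand_second_order(pathway))  # option (b)
    # Hypothetical per-pathway p-values from the omitted regression step.
    print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Option (c) would instead apply `expand_second_order` to the principal component scores themselves. In practice, several components per pathway would be extracted and entered jointly into the regression model for the phenotype.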
Contact: shuangge.ma{at}yale.edu
Associate Editor: Prof. David Rocke
Received on July 22, 2008; revised on February 2, 2009; accepted on February 10, 2009