Bioinformatics Advance Access published online on May 15, 2009
Bioinformatics, doi:10.1093/bioinformatics/btp325
Integrative Analysis of Transcriptomic and Proteomic Data of Desulfovibrio vulgaris: a nonlinear model to predict abundance of undetected proteins
1 Department of Industrial, Systems and Operations Engineering, Arizona State University, Tempe AZ, 85287-5906
2 Center for Ecogenomics, The Biodesign Institute, Arizona State University, Tempe, AZ 85287-6501.
*To whom correspondence should be addressed. E-mail: Weiwen.Zhang{at}asu.edu or George.Runger{at}asu.edu
Abstract
Motivation: Gene expression profiling technologies can generally produce mRNA abundance data for all genes in a genome. A dearth of proteomic data persists because the identification range and sensitivity of proteomic measurements lag behind those of transcriptomic measurements. Integrative transcriptomic and proteomic analysis that relies on partial proteomic data is therefore likely to introduce significant bias. Developing methodologies to accurately estimate missing proteomic data will allow better integration of transcriptomic and proteomic datasets and provide deeper insight into the metabolic mechanisms underlying complex biological systems.
Results: In this study, we present a nonlinear data-driven model to predict abundance for undetected proteins using two independent datasets of cognate transcriptomic and proteomic data collected from Desulfovibrio vulgaris. We use stochastic Gradient Boosted Trees (GBT) to uncover possible nonlinear relationships between transcriptomic and proteomic data, and to predict abundance for proteins not experimentally detected based on relevant predictors such as mRNA abundance, cellular role, molecular weight, sequence length, protein length, GC content and triple codon counts. Initially, we constructed a GBT model using all possible variables to assess their relative importance and characterize the behavior of the predictive model. This model showed a strong plateau effect in regions of high mRNA values and sparse data. Hence, we removed genes in those regions based on thresholds estimated from the partial dependency plots where this behavior was captured. At this stage, only the strongest predictors of protein abundance were retained to reduce the complexity of the GBT model. After removing genes in the plateau region, mRNA abundance, main cellular functional categories and a few triple codon counts emerged as the top-ranked predictors of protein abundance. We then created a new tuned GBT model using the five most significant predictors. Our nonlinear model consists of a series of regression tree models with implicit strength in variable selection. The model provides relative variable importance measures using mean square error as the criterion. The coefficients of determination for our nonlinear models ranged from 0.393 to 0.582 across the two datasets, providing better results than the linear regression used in the past.
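To make the modelling idea concrete, the following is a minimal pure-Python sketch of stochastic gradient boosting with depth-one regression trees (stumps). It is not the implementation used in the study: the synthetic data, the function names (`fit_stump`, `fit_gbt`, `make_toy_data`) and the hyperparameter values are all assumptions chosen for illustration. The toy response saturates at high values of the first predictor, loosely mimicking a plateau at high mRNA values, and the second predictor is pure noise.

```python
import random


def fit_stump(X, residuals, rows):
    """Best single-feature threshold split (largest squared-error reduction) on `rows`."""
    n = len(rows)
    total = sum(residuals[i] for i in rows)
    best = None
    for j in range(len(X[0])):
        order = sorted(rows, key=lambda i: X[i][j])
        left_sum = 0.0
        for k in range(1, n):
            left_sum += residuals[order[k - 1]]
            lo, hi = X[order[k - 1]][j], X[order[k]][j]
            if lo == hi:
                continue  # cannot split between identical values
            right_sum = total - left_sum
            # Reduction in the sum of squared errors achieved by this split.
            gain = left_sum ** 2 / k + right_sum ** 2 / (n - k) - total ** 2 / n
            if best is None or gain > best[0]:
                best = (gain, j, (lo + hi) / 2, left_sum / k, right_sum / (n - k))
    return best


def fit_gbt(X, y, n_trees=200, learning_rate=0.1, subsample=0.5, seed=0):
    """Boost stumps on squared-error residuals; hyperparameters are illustrative."""
    rng = random.Random(seed)
    n = len(y)
    base = sum(y) / n
    pred = [base] * n
    stumps, importance = [], [0.0] * len(X[0])
    for _ in range(n_trees):
        residuals = [y[i] - pred[i] for i in range(n)]
        # The "stochastic" part: each tree sees a random subsample of the rows.
        rows = rng.sample(range(n), max(2, int(subsample * n)))
        best = fit_stump(X, residuals, rows)
        if best is None:
            continue
        gain, j, threshold, left_value, right_value = best
        importance[j] += gain  # squared-error reduction credited to feature j
        left, right = learning_rate * left_value, learning_rate * right_value
        stumps.append((j, threshold, left, right))
        for i in range(n):
            pred[i] += left if X[i][j] <= threshold else right
    total_gain = sum(importance) or 1.0
    return {"base": base, "stumps": stumps,
            "importance": [g / total_gain for g in importance]}


def predict(model, x):
    return model["base"] + sum(lv if x[j] <= t else rv
                               for j, t, lv, rv in model["stumps"])


def r_squared(y, y_hat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def make_toy_data(n=300, seed=1):
    """Synthetic data (assumption): response saturates once feature 0 exceeds 2."""
    rng = random.Random(seed)
    X = [[rng.uniform(0.0, 4.0), rng.uniform(0.0, 1.0)] for _ in range(n)]
    y = [min(x[0], 2.0) + rng.gauss(0.0, 0.1) for x in X]
    return X, y


X, y = make_toy_data()
model = fit_gbt(X, y)
# Training-set fit only; a real analysis would evaluate on held-out data.
print("training R^2:", round(r_squared(y, [predict(model, x) for x in X]), 3))
print("relative importance:", [round(v, 3) for v in model["importance"]])
```

In this sketch, relative importance is the share of total squared-error reduction attributed to each predictor, which is one common way to derive importance from a mean-square-error criterion. Evaluating `predict` over a grid of values for one predictor while holding the others fixed gives a rough, one-dimensional analogue of a partial dependence curve.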
We evaluated the validity of this nonlinear model using biological information on operons, regulons and pathways. The results demonstrated that the coefficients of variation of estimated protein abundance values within operons, regulons or pathways are indeed smaller than those for random groups of proteins.
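The validation idea can be sketched as a comparison between the mean within-group coefficient of variation (CV) and the same statistic for randomly drawn groups of matching sizes. The example below is a hedged illustration, not the procedure used in the study: the gene identifiers, abundance values, group definitions and the number of random draws are all invented, and the function names are hypothetical.

```python
import random
import statistics


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


def mean_group_cv(abundance, groups):
    """Average CV over groups that contain at least two proteins."""
    cvs = [coefficient_of_variation([abundance[g] for g in members])
           for members in groups if len(members) >= 2]
    return statistics.mean(cvs)


def random_group_cv(abundance, group_sizes, n_draws=1000, seed=0):
    """Mean CV for random groups with the same sizes as the biological groups."""
    rng = random.Random(seed)
    genes = sorted(abundance)
    draws = []
    for _ in range(n_draws):
        groups = [rng.sample(genes, size) for size in group_sizes]
        draws.append(mean_group_cv(abundance, groups))
    return draws


# Invented predicted abundances for nine hypothetical genes in three toy operons.
toy_abundance = {
    "gene_a1": 10.0, "gene_a2": 11.0, "gene_a3": 12.0,
    "gene_b1": 50.0, "gene_b2": 52.0, "gene_b3": 55.0,
    "gene_c1": 190.0, "gene_c2": 200.0, "gene_c3": 210.0,
}
toy_operons = [
    ["gene_a1", "gene_a2", "gene_a3"],
    ["gene_b1", "gene_b2", "gene_b3"],
    ["gene_c1", "gene_c2", "gene_c3"],
]

observed = mean_group_cv(toy_abundance, toy_operons)
null_draws = random_group_cv(toy_abundance, [len(g) for g in toy_operons])
# Fraction of random groupings that are at least as homogeneous as the operons.
fraction = sum(d <= observed for d in null_draws) / len(null_draws)
print("mean CV within operons:", round(observed, 3))
print("mean CV for random groups:", round(statistics.mean(null_draws), 3))
print("fraction of random draws <= observed:", fraction)
```

A small fraction in the last line indicates that predicted abundances are more homogeneous within the biological groups than would be expected by chance, which is the pattern the validation looks for.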
Contact: Weiwen.Zhang{at}asu.edu or George.Runger{at}asu.edu
Supplementary Information: Supplementary data are available at Bioinformatics online.
Associate Editor: Prof. John Quackenbush
Received on April 3, 2009; revised on May 11, 2009; accepted on May 12, 2009