Bioinformatics Advance Access originally published online on May 4, 2006
Bioinformatics 2006 22(13):1641-1647; doi:10.1093/bioinformatics/btl134

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Integrated analysis of transcriptomic and proteomic data of Desulfovibrio vulgaris: zero-inflated Poisson regression models to predict abundance of undetected proteins

Lei Nie 1, Gang Wu 2, Fred J. Brockman 3 and Weiwen Zhang 3,*

1 Department of Biostatistics, Bioinformatics and Biomathematics, Georgetown University Washington DC 20057, USA
2 Department of Biological Sciences, University of Maryland at Baltimore County Baltimore, MD 21250, USA
3 Microbiology Department, Pacific Northwest National Laboratory PO Box 999, Mail Stop P7-50, Richland, WA 99352, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 MATERIALS AND METHODS
 3 RESULTS AND DISCUSSION
 4 CONCLUSIONS
 REFERENCES
 

Motivation: Integrated analysis of global scale transcriptomic and proteomic data can provide important insights into the metabolic mechanisms underlying complex biological systems. However, because the relationship between protein abundance and mRNA expression level is complicated by many cellular and physical processes, sophisticated statistical models need to be developed to capture their relationship.

Results: In this study, we describe a novel data-driven statistical model to integrate whole-genome microarray and proteomic data collected from Desulfovibrio vulgaris grown under three different conditions. Based on the Poisson distribution pattern of proteomic data and the fact that a large number of proteins were undetected (excess zeros), zero-inflated Poisson (ZIP)-based models were proposed to define the correlation pattern between mRNA and protein abundance. In addition, by assuming that there is a probability mass at zero representing unexpressed genes and expressed proteins that were undetected owing to technical limitations, a Potential ZIP model was established. Two significant improvements introduced by this approach are (1) the predicted protein abundance level values for experimentally detected proteins are corrected by considering their mRNA levels and (2) protein abundance values can be predicted for undetected proteins (in the case of this study, ~83% of the proteins in the D.vulgaris genome) for better biological interpretation. We demonstrated the use of these statistical models by comparatively analyzing proteomic and microarray results from D.vulgaris grown on lactate-based versus formate-based media. These models correctly predicted increased expression of Ech hydrogenase and decreased expression of Coo hydrogenase for D.vulgaris grown on formate.

Contact: Weiwen.Zhang@pnl.gov

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
Recent advances in high-throughput technologies enable quantitative monitoring of the abundance of various biological molecules and their variation between various biological states on a genomic scale (Horak and Snyder, 2002; Smith et al., 2002). Integrative analyses of measurements of global mRNA and protein expression have been reported, and in several cases these analyses have helped researchers better understand the global regulatory processes or complex metabolic networks in living organisms (Hegde, 2003; Mootha et al., 2003a, b; Alter and Golub, 2004). However, most recent studies have either failed to find a correlation between protein and mRNA abundance (Gygi et al., 1999) or have only observed a weak correlation (Ideker et al., 2001; Greenbaum et al., 2003; Washburn et al., 2003). In Saccharomyces cerevisiae, it has been proposed that there are three potential reasons for the lack of a strong correlation between mRNA and protein expression levels: (1) translational regulation, (2) difference in protein half-lives in vivo and (3) significant levels of experimental error, including differences with respect to the experimental conditions being compared (Greenbaum et al., 2003; Beyer et al., 2004). We recently performed a quantitative analysis of the contributions of various biochemical and physical parameters to the correlation between mRNA and protein abundance in Desulfovibrio vulgaris. The results show that mRNA abundance alone can explain only 20–28% of the total variation of protein abundance, suggesting that the correlation between mRNA and protein levels cannot be determined by mRNA abundance alone (Nie et al., 2006).

In most previous comparisons of mRNA and protein abundance (Alter and Golub, 2004; Gygi et al., 1999; Greenbaum et al., 2003; Washburn et al., 2003), analyses were typically performed using simple correlation measures such as Pearson's or Spearman's coefficients, which are not suitable for data that follow Poisson or many other distribution patterns. In addition, in these studies the undetected proteins were simply assigned a ‘zero’ value (Mootha et al., 2003a, b; Alter and Golub, 2004; Ideker et al., 2001; Greenbaum et al., 2003; Washburn et al., 2003), the same as unexpressed proteins. This resulted in the exclusion of these proteins from relationship modeling, which could seriously bias any calculation of the correlation between mRNA and protein abundance. To capture the relationship between mRNA and protein abundance more accurately, more sophisticated statistical tools are needed to define correlation patterns between expression datasets and to generate models or mathematical frameworks that can be used to make more accurate biological predictions.

In this study, we propose a novel statistical model to integrate microarray and proteomic data. Our microarray data include mRNA abundance information for all 3507 genes in the D.vulgaris genome, whereas the semi-quantitative LC-MS/MS proteomic data include identification and abundance information for only 600–700 proteins from D.vulgaris grown under three growth conditions (Nie et al., 2006; Zhang et al., 2006a, b). Based on these differences in the abundance data obtained for transcripts and proteins, the proteomic abundance data can be considered as rare events, and we therefore modeled the proteomic abundance as a Poisson distribution with mean λ. We also assumed that there was a probability mass at zero that represented both unexpressed proteins and expressed proteins that were undetected owing to technical limitations. Therefore, a non-standard zero-inflated Poisson (ZIP) regression relationship was identified between proteomic abundances and mRNA levels. The key improvement introduced in this model is that the undetected proteins are also taken into consideration, so the model allows us to estimate the potential protein expression even when the proteins were experimentally undetectable owing to technical limitations. In other words, this data-driven model uses abundance measurements of mRNA transcripts and detected proteins as input to predict abundance levels for all proteins in the genome, including those undetected by experimental approaches. The model was constructed and verified independently with three sets of microarray and proteomics data obtained from D.vulgaris (Zhang et al., 2006a, b). We also illustrate an application of our model by performing a comparative analysis of D.vulgaris grown on lactate- versus formate-based media. This analysis suggested novel information on the differential expression of genes involved in the energy metabolic pathway of D.vulgaris.


    2 MATERIALS AND METHODS
 
2.1 Cultivation
Three microarray and proteomic datasets used for model construction and verification were collected from D.vulgaris subsp. vulgaris DSM 644 grown on lactate- or formate-based chemically defined media. To minimize variations between microarray and proteomic measurements, identical cell samples from each growth condition were split and used to isolate both the RNA and proteins for analyses. A complete description of the experimental design and microarray and proteomic data collection can be found in our previous studies (Nie et al., 2006; Zhang et al., 2006a, b). Briefly, cells were grown at 30°C under strictly anaerobic conditions in a chemically defined minimal medium (Zhang et al., 2006a). Growth experiments were performed in 140 ml serum bottles containing 70 ml of medium, with the headspace filled with a gas mixture of 10% (v/v) CO2 and 90% (v/v) N2. The time course of growth in both media was measured by optical density in a Shimadzu BioSpec 1601 Analyzer (Kyoto, Japan) to establish equivalent points for exponential and stationary phases for the two substrates. Three sets of cells were collected: from the exponential phase at an OD590 of 0.4 and 0.2 for lactate- and formate-based media, respectively, and from the stationary phase at an OD590 of 0.65 for lactate-based medium. Cells were collected by centrifugation at 6000 × g at room temperature and subsequently stored at –80°C (Zhang et al., 2006a, b).

2.2 Microarray analysis
Oligonucleotide microarrays containing 3507 ORFs of the D.vulgaris genome were designed by NimbleGen Systems, Inc. (Madison, WI) (Nuwaysir et al., 2002; Heidelberg et al., 2004). The raw intensity data were normalized using tools available through the Bioconductor project (http://www.bioconductor.org). For each experimental condition, mRNA abundances were determined from the average of four measurements for each gene: two replicates (each containing a pool of three biological replicates) that were each hybridized to duplicate microarrays (Zhang et al., 2006a).

2.3 LC-MS/MS proteomics analysis
The D.vulgaris samples were analyzed by LC-MS/MS on a Finnigan model LTQ ion trap mass spectrometer (ThermoQuest Corp., San Jose, CA) with electrospray ionization (ESI). Peptide identification was performed using SEQUEST Version 2.7 (ThermoFinnigan, San Jose, CA) (Eng et al., 1994; Yates et al., 1995) to search the D.vulgaris protein sequence database (Heidelberg et al., 2004). The peptides were filtered using Xcorr criteria of >1.8 for full or partial tryptic peptides with a charge state of 1+, >2.5 for those with a charge state of 2+, and >3.5 for those with a charge state of 3+. In addition, a DelCn (delta correlation value) cutoff of ≥0.1 was used to further increase the confidence level of protein identification (Washburn et al., 2001; Qian et al., 2005). The relative protein abundance was estimated based on the number of peptide hits (Gao et al., 2003; Qian et al., 2005). The number of peptide hits for a given protein was taken as the median of three LC-MS/MS measurements. The protein abundance values typically range from 1 to 300 (Zhang et al., 2006b).
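The filtering and summarizing rules above can be written down compactly. The following is a minimal sketch, not the authors' pipeline: the function names and the decision to reject charge states other than 1+, 2+ and 3+ are assumptions made for illustration.

```python
from statistics import median

# Minimum Xcorr per peptide charge state, as quoted in the text.
XCORR_MIN = {1: 1.8, 2: 2.5, 3: 3.5}
DELCN_MIN = 0.1  # DelCn cutoff (>= 0.1), as quoted in the text


def passes_filter(charge, xcorr, delcn):
    """Return True if a peptide identification meets both cutoffs."""
    threshold = XCORR_MIN.get(charge)
    if threshold is None:
        # Assumption: charge states not listed in the text are rejected.
        return False
    return xcorr > threshold and delcn >= DELCN_MIN


def protein_abundance(hits_per_run):
    """Median peptide-hit count over replicate LC-MS/MS measurements."""
    return median(hits_per_run)
```

For example, `protein_abundance([3, 7, 5])` summarizes three hypothetical replicate counts for one protein by their median.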

2.4 Statistical methods
The Poisson regression model, one of the so-called generalized linear models (McCullagh and Nelder, 1989), was used to model the proteomic abundance and its correlation with mRNA expression. In the Poisson regression model, for protein abundances (Y), we assume that the mean (λ) of the Poisson distribution depends on log-scaled mRNA abundance (X), and therefore λ = exp(α + βX), which ensures that the expected value is non-negative. This Poisson regression model provides a valid framework to integrate two types of expression data; however, it provides no explanation for the fact that ~83% of genes have zero proteomic abundance. We ascribed the high percentage of proteins with zero abundance to technical limitations in the proteomic analyses, such as detection sensitivity. Therefore, a non-standard mixture model, the ZIP regression model (Lambert, 1992), was proposed to analyze the data. In this model, we assumed that 100 × p% of the genes belong to a probability mass at zero, i.e. they have a proteomic abundance level of 0 because they are unexpressed or because their proteins were undetected owing to technical limitations. Thus, the proteomic abundance, y, was distributed as follows: y = 0 (the probability mass at zero) with probability p, and y follows a Poisson regression distribution with probability (1 − p). Therefore the observed protein abundance (y) follows a mixture model:

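\[
P(Y = y \mid x) = \left[ p + (1 - p)\, e^{-\lambda} \right]^{\delta} \left[ (1 - p)\, \frac{e^{-\lambda} \lambda^{y}}{y!} \right]^{1 - \delta} \qquad (1)
\]
where the indicator δ = 1 if y = 0; otherwise δ = 0. We also assume that p is dependent on the mRNA level (x) through a logit model:

\[
p = \frac{\exp(\alpha_0 + \beta_0 x)}{1 + \exp(\alpha_0 + \beta_0 x)} \qquad (2)
\]
where α0 and β0 are the intercept and slope in the logit model.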
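As an illustration of the mixture model and logit link described above, the probability of observing a given peptide count can be sketched as follows. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and no parameter values from the paper are assumed.

```python
import math


def logistic(z):
    """Inverse logit: exp(z) / (1 + exp(z)), written in an equivalent form."""
    return 1.0 / (1.0 + math.exp(-z))


def zip_pmf(y, x, alpha, beta, alpha0, beta0):
    """P(Y = y | x) under the zero-inflated Poisson regression model.

    x is the log-scale mRNA abundance and y a non-negative integer count.
    """
    p = logistic(alpha0 + beta0 * x)      # probability of the mass at zero
    lam = math.exp(alpha + beta * x)      # Poisson mean
    poisson = math.exp(-lam) * lam ** y / math.factorial(y)
    if y == 0:
        # A zero comes either from the zero mass or from the Poisson part.
        return p + (1.0 - p) * poisson
    return (1.0 - p) * poisson
```

Summing `zip_pmf` over y = 0, 1, 2, ... for fixed x and parameters should give 1, which is a quick sanity check on any implementation of the mixture.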

To describe the variation within a dataset, such as the ‘molar abundance’ of proteins within one operon, we computed the coefficient of variation (CV) for each set of proteins. The CV is defined as the ratio of the standard deviation to the mean of the ‘molar abundance’ for a set of proteins (Johnson, 2005); the calculation of the CV score is independent of the sample size.
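The CV defined above is straightforward to compute. The sketch below assumes the population standard deviation; the text does not say whether the population or the sample form was used, so this choice is an assumption.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean of a set of values."""
    m = mean(values)
    if m == 0:
        raise ValueError("CV is undefined when the mean is zero")
    # Assumption: population standard deviation (pstdev), not sample (stdev).
    return pstdev(values) / m
```

A set of identical values gives a CV of 0, and a smaller CV indicates that the proteins in the set have more similar ‘molar abundance’.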


    3 RESULTS AND DISCUSSION
 
3.1 Distribution patterns of microarray and proteomic data
Four measurements of mRNA abundance were obtained for all genes in the D.vulgaris genome under three different growth conditions, i.e. lactate-based medium at exponential phase (LE), formate-based medium at exponential phase (FE) and lactate-based medium at stationary phase (LS). Because the abundance value used for the statistical analyses of the microarray data is the average of the four measurements, the quality of these data was evaluated with the Pearson correlation coefficients among the four measurements. The resulting correlation coefficients were typically 0.96–0.99, indicating good data reproducibility (Nie et al., 2006). After log transformation, the mRNA abundance follows a distribution close to normal under all three growth conditions, which again indicates the quality of these datasets. LC-MS/MS analysis identified 600–700 proteins with an abundance level (spectral count of peptides) of at least 1. To identify the distribution pattern of the proteomic abundance data, we first focused on a relatively homogeneous dataset and used only genes with protein abundances <4. Consequently, 92.4, 92.7 and 92.8% of all 3507 genes were included for the LE, FE and LS conditions, respectively. The mean protein abundance of these genes is 0.13 for all three conditions. For a random variable X following a Poisson distribution with a mean of 0.13, the distribution pattern is P(X = 0) = 88%, P(X = 1) = 11%, P(X = 2) = 0.7% and P(X = 3) = 0.03%, which is in good agreement with the observed proteomic measurements (Table 1), suggesting that the proteomic abundance data followed the Poisson distribution. The Pearson correlation coefficients between mRNA levels and protein abundances were calculated for the three conditions and found to be ~0.50 (P-value <0.001) (Nie et al., 2006), indicating modest correlations between mRNA and protein abundances in D.vulgaris grown under all three conditions.
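The Poisson probabilities for a mean of 0.13 can be recomputed directly from the probability mass function P(X = k) = e^(−λ) λ^k / k!. The snippet below is a small check using only the standard library; the function name is illustrative.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


# Probabilities (in percent) for counts 0 to 3 at the reported mean of 0.13.
for k in range(4):
    print(k, f"{100 * poisson_pmf(k, 0.13):.2f}%")
```

Comparing these values with the observed frequencies of genes at each count is the informal goodness-of-fit check described in this section.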


Table 1 Probability distribution versus protein abundance frequencies

 
3.2 Construction of Poisson-based models
The Poisson distribution is a widely applied discrete distribution and can serve as a useful model for a number of different types of experiments (McCullagh and Nelder, 1989; Lambert, 1992). It approximates the binomial distribution very well when the number of trials in a binomial experiment is large while the probability that a trial is successful is small. In our study, examining whether the protein for a given gene is present in a particular sample can be regarded as a trial. We also noted that the mean of the Poisson distribution, i.e. the expected number of successful trials for a gene, was related to the mRNA expression level of this gene (as suggested by the Pearson correlation coefficients). We thus applied the Poisson-based model to describe the relationship between mRNA and protein expression levels. Model construction is exemplified below using microarray and proteomic data from D.vulgaris grown under the LE condition.

3.2.1 ZIP regression
Parameters in the ZIP regression model were estimated with maximum likelihood methods through SAS Proc NLMIXED (SAS code is available upon request). The deviance of the ZIP model fit is 1.59, indicating a good fit to the data. The estimated parameters are [four estimates; equation images not recovered] (the P-values of parameters α, α0, β and β0 are all <0.0001). The interpretation of the fitted logit component is as follows: for a gene with an mRNA expression level of 5 (on the log scale), the estimated probability that it belongs to the probability mass at zero is

\[
\hat{p}(5) = \frac{\exp(\hat{\alpha}_0 + 5\hat{\beta}_0)}{1 + \exp(\hat{\alpha}_0 + 5\hat{\beta}_0)} = 0.937,
\]
i.e. there is a 93.7% chance that the gene product was unexpressed or undetected owing to technical limitations. However, when the mRNA level is relatively high (e.g. a value of 7 on the log scale), there is a lower probability

\[
\hat{p}(7) = \frac{\exp(\hat{\alpha}_0 + 7\hat{\beta}_0)}{1 + \exp(\hat{\alpha}_0 + 7\hat{\beta}_0)}
\]
that the protein encoded by this gene will be unexpressed or undetected because of technical limitations. This model allows us to predict proteomic abundances for all genes, even under the current technical limitations. Predictions of protein expression based on this model are listed in Supplementary Table 1.
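For readers without access to SAS, the quantity being maximized can be sketched in Python. This is not the authors' NLMIXED code, which is not reproduced in the paper; it is a plain implementation of the ZIP log-likelihood under the model defined in Section 2.4, with a hypothetical function name and argument layout.

```python
import math


def zip_log_likelihood(params, xs, ys):
    """Log-likelihood of (alpha, beta, alpha0, beta0) for paired data.

    xs: log-scale mRNA abundances; ys: peptide-hit counts (0 if undetected).
    """
    alpha, beta, alpha0, beta0 = params
    total = 0.0
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + math.exp(-(alpha0 + beta0 * x)))  # zero-mass probability
        lam = math.exp(alpha + beta * x)                   # Poisson mean
        if y == 0:
            # Zero from the zero mass or from the Poisson component.
            total += math.log(p + (1.0 - p) * math.exp(-lam))
        else:
            log_poisson = -lam + y * math.log(lam) - math.lgamma(y + 1)
            total += math.log(1.0 - p) + log_poisson
    return total
```

Maximum likelihood estimates would be obtained by passing the negative of this function to a general-purpose numerical optimizer; the choice of optimizer and starting values is left open here.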

In this model, the shape of the Poisson distribution depends on the actual level of mRNA, α and β. If α + β × mRNA is small or moderate, the distribution looks like a left-truncated normal distribution. However, as α + β × mRNA increases, the shape looks more and more like a normal distribution.

3.2.2 Potential inference from ZIP (P-ZIP) regression
Based on the ZIP model, we are able to predict the abundance of proteins that were undetectable owing to current technical limitations. The prediction of the protein abundance for a gene is expressed by

\[
\hat{y} = (1 - p)\, \exp(\alpha + \beta x)
\]
where x is the mRNA abundance of that gene on a log scale and p = exp(α0 + β0x)/(1 + exp(α0 + β0x)) is the probability that the product of the gene was not expressed or not detected simply owing to technical limitations.

For those expressed genes, we can further develop the ZIP model into a Potential inference from ZIP regression (P-ZIP) model to predict potential proteomic abundance. In this case, 100% of the prediction should rely on the Poisson distribution of the protein abundance instead of assigning the probability p to the zero mass; i.e. the prediction is exp(α + βx).
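The difference between the two predictions can be made concrete with a short sketch. It assumes the ZIP prediction is the mixture mean, (1 − p) times the Poisson mean, which follows from the model definition; the function names are illustrative and no fitted values from the paper are used.

```python
import math


def zip_prediction(x, alpha, beta, alpha0, beta0):
    """Expected abundance under ZIP: (1 - p) * exp(alpha + beta * x)."""
    p = 1.0 / (1.0 + math.exp(-(alpha0 + beta0 * x)))  # zero-mass probability
    return (1.0 - p) * math.exp(alpha + beta * x)


def pzip_prediction(x, alpha, beta):
    """Potential abundance under P-ZIP: the Poisson mean alone."""
    return math.exp(alpha + beta * x)
```

Because 0 < p < 1, the P-ZIP prediction is always larger than the ZIP prediction for the same gene, and the gap is widest for genes with low mRNA levels, where p is close to 1.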

Prediction of protein expression by various Poisson-based models for growth on LE condition is listed comparatively in Supplementary Table 1.

The results in Table 2 show that the α and β parameters of the ZIP regression models calculated from the other two datasets (LS and FE conditions) were similar to those calculated from the LE condition, suggesting that the model is not specific to a single growth condition and is possibly a general model of the relationship between mRNA and protein expression patterns. Predictions of protein expression by the ZIP and P-ZIP models for growth on lactate at the stationary phase and on formate at the exponential phase are listed in Supplementary Tables 2 and 3.


Table 2 Model fitting of microarray and proteomic data

 
3.3 Integrative analysis of microarray and proteomics data using the P-ZIP model: D.vulgaris growth on lactate versus formate
There are two valuable advantages provided by this integrated approach. First, for the proteins that have been experimentally detected, the predicted protein abundance level values were corrected with their mRNA levels being taken into consideration. Second, for the proteins undetected by experimental methods (in the case of this study, >80% of the proteins in the D.vulgaris genome), the model predicts their protein abundance values. This allows, for the first time, the analysis of a protein expression pattern at a genomic level and minimizes possible biases in biological interpretations when using an incomplete set of proteomic data.

The potential protein abundances for all genes in the D.vulgaris genome were calculated by the P-ZIP model for all three growth conditions. Surprisingly, we found a large number of potentially highly expressed proteins that were undetected by our experimental approach. This may be because of differences in the stability of individual proteins and/or differences in the detection sensitivity for various types of proteins. From predictions based on the P-ZIP model, 347, 514 and 452 experimentally undetected proteins were predicted to have an abundance >1 (the spectral count of peptides) in the LE, FE and LS conditions, respectively.

To check the predictions of the P-ZIP model, we calculated the ‘molar abundance’ of all proteins as protein abundance divided by molecular weight, and hypothesized that the ‘molar abundance’ of the 30 ribosomal proteins (RPs) expressed from the rps operon and of the 7 subunit proteins of ATP synthase (expressed from the DVU0774-DVU0780 operon) should be at roughly the same level within each set. To evaluate the similarity of the ‘molar abundance’ within the ribosomal protein set and the ATP synthase set, we calculated the CV value for each set and compared it with that calculated for the whole genome excluding the genes in the set. Experimental proteomic data identified 18, 22 and 9 RPs out of a total of 30 RPs of the rps operon, and 2, 2 and 1 subunits out of 7 ATP synthase F1F0 subunits, under the LE, FE and LS conditions, respectively (Zhang et al., 2006b). Through the P-ZIP model, we obtained the ‘molar abundance’ of all 30 RPs and all 7 subunits of ATP synthase F1F0 under all three growth conditions, and the CV values were computed (Table 3). The CV values of the ribosomal protein set and the ATP synthase set were found to be 3.0- to 7.5-fold smaller than the control, which was calculated using all proteins in the D.vulgaris genome except the RPs and ATP synthase (Table 3). An additional control was obtained by analyzing multiple random sets (500 sets) of 10–20 proteins, and the same conclusion was reached (data not shown). These results support the accuracy of the model predictions and provide additional insights into protein expression.
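The random-set control described above can be sketched as follows. The sampling scheme (uniform set size between 10 and 20, sampling without replacement, fixed seed) and all function names are assumptions for illustration; the paper does not give these details.

```python
import random
from statistics import mean, pstdev


def molar_abundance(abundance, molecular_weight):
    """Protein abundance divided by molecular weight, as defined in the text."""
    return abundance / molecular_weight


def cv(values):
    """Coefficient of variation (population standard deviation over mean)."""
    return pstdev(values) / mean(values)


def random_set_cvs(values, n_sets=500, min_size=10, max_size=20, seed=0):
    """CVs of randomly drawn protein sets, used as a control distribution."""
    rng = random.Random(seed)  # fixed seed so the control is reproducible
    cvs = []
    for _ in range(n_sets):
        size = rng.randint(min_size, max_size)   # set size, inclusive bounds
        cvs.append(cv(rng.sample(values, size)))  # sample without replacement
    return cvs
```

The CV of a co-regulated set, such as the predicted ‘molar abundance’ of the ribosomal proteins, would then be compared with the distribution returned by `random_set_cvs` for the genome-wide predictions.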


Table 3 Validation of the model: correlated expression of proteins in ribosome, ATP synthase and other operons

 
To further check predictions based on the P-ZIP model, we also tested the protein expression pattern of genes belonging to a number of operons, on the assumption that, since these genes are co-regulated, the ‘molar abundance’ of proteins from the same operon should be at roughly the same level. From the Microbes Online website (www.microbesonline.org) (Alm et al., 2005), 18 D.vulgaris operons with at least six genes and with relatively high predicted protein abundances were identified. We chose only operons with at least six genes to increase the reliability of the statistical analysis (calculation of CV). In almost all of these operons, less than one-third of the proteins were detected experimentally; through the model, however, we could assign a protein abundance to every protein and use these values for the calculations. Analysis of the expression patterns showed that the CV values for these operons were all significantly smaller than that of the control set (Table 3). These results also support the reliability of protein abundance predictions based on the P-ZIP model.

The P-ZIP model was applied to a comparative analysis of genes known to be involved in the energy metabolism when D.vulgaris is grown on lactate- or formate-based media at exponential phase (Zhang et al., 2006a, b). In this analysis we focused on the proteins that were experimentally undetectable by LC-MS/MS analysis. The results of the P-ZIP analysis showed that when compared with growth on lactate at the exponential phase, proteins for the subunits of three formate dehydrogenases were predicted to be up-regulated during growth on formate (Table 4). In contrast, previous microarray analysis showed that only two of the formate dehydrogenase subunit genes were transcriptionally up-regulated under these same conditions (Zhang et al., 2006a). Two cytoplasmic hydrogenases (Ech and Coo) were previously assigned the putative function of generating hydrogen in the cytoplasm (Heidelberg et al., 2004). In agreement with this observation, the P-ZIP model also predicted that all subunits of Ech hydrogenase (EchACDE) were up-regulated, while all subunits of Coo hydrogenase (CooHLUX) were down-regulated on formate-based growth, suggesting their differential roles in H2 generation when grown with different carbon sources. These results demonstrate that the P-ZIP model predictions of protein abundance can provide better data interpretation than that given by microarray or proteomics data alone (Zhang et al., 2006a, b).


Table 4 Regulation of proteins involved in energy metabolism in formate-based growth

 
The advantage of applying the P-ZIP model was also demonstrated by its prediction of the expression of c-type cytochromes, which are involved in electron transport processes. The genome sequence shows that D.vulgaris has at least a dozen c-type cytochromes (Heidelberg et al., 2004). It has been reported that c-type cytochromes are highly expressed in sulfate-reducing bacteria, and their participation in sulfate reduction has been intensively investigated (Meyer et al., 1971; Elias et al., 2004). However, only low amounts of c-type cytochrome proteins were identified from D.vulgaris by our LC-MS/MS approach (Table 4). This discrepancy is probably due to the fact that c-type cytochromes undergo a complex post-translational maturation process involving covalent attachment of heme groups. This modification can change the charge state in the gas phase and cause atypical fragmentation of peptides, resulting in loss of detection (Yu et al., 1993; Aubert et al., 1998). By integrating microarray and proteomic data using the P-ZIP model, significant expression of various cytochrome c proteins was predicted in D.vulgaris grown under all three conditions. Among these, the cytochrome c3 encoded by DVU3171 was present in all conditions and was predicted to have significant expression at the exponential phase in both lactate- and formate-based media (predicted abundance of 20.5 and 29.3, respectively). This is consistent with the previous observation that the DVU3171 product was the primary electron acceptor for periplasmic hydrogen oxidation (Meyer et al., 1971; Heidelberg et al., 2004).


    4 CONCLUSIONS
 
High-throughput experiments that measure mRNA using DNA microarray technology are currently one of the richest sources of whole-genome-based information available (Greenbaum et al., 2002). Although the identification range and sensitivity of LC-MS/MS-based proteomics are still not fully comparable with those of DNA microarrays, rapid progress in LC-MS/MS-based peptide profiling in the past several years has made it possible to measure protein abundance globally. At present, a key issue in applying these technologies to biological questions is how to integrate and interpret information from these two sources (Greenbaum et al., 2002). In this study, we proposed a data-driven ZIP regression-based model for integrative analysis of these two different types of large-scale genomic data. This approach is a significant improvement over previous methods since it allows undetected proteins (those with an assigned protein abundance value of zero) to be assigned a predicted abundance based on their mRNA levels. This allows us to include in our investigations the abundance of proteins that were undetected owing to experimental or technical limitations. Although a thorough and conclusive evaluation of the proposed P-ZIP model would require a comparison of its predictions with more sensitive and precise measurements of protein abundance than are currently available, we evaluated the validity of this model using bioinformatics approaches. For example, in a comparison of the predicted protein abundance patterns of genes belonging to the same operons (representing groups of proteins that are expected to have similar abundance values), the results demonstrated that the coefficients of variation of estimated protein abundance values within operons are indeed smaller than those for random groups of proteins. In addition, a comparative analysis of D.vulgaris grown on lactate- versus formate-based media was performed using the protein abundance values predicted from this model.
These results demonstrated that, in comparison with using microarray or proteomic data alone, data interpretation can be improved by utilizing the predicted abundance values obtained by combining both methods. However, caution should be observed when interpreting experimental data based on predicted protein expression values, because the predicted abundance values are constrained by the experimental proteomic data used as input, and these data may have been significantly underestimated owing to experimental and technical limitations. In this case, the ratio of abundance values of the same protein across various conditions can be more meaningful for biological interpretations. With minor modifications, this method can also be applied to the integration of microarray data with other types of proteomic data, such as those from two-dimensional gel analysis or labeling-based quantitative proteomic analysis. Finally, our model also establishes a basis for developing more sophisticated models that will allow the inclusion of other types of data, such as RNA decay measurements (Selinger et al., 2003), when they become available.
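The across-condition ratio suggested above is often expressed as a log2 fold change. The sketch below is one hedged way to compute it; the function name is hypothetical, and the example values are the predicted DVU3171 abundances quoted in Section 3.3 (20.5 on lactate and 29.3 on formate, both at exponential phase).

```python
import math


def log2_ratio(abundance_a, abundance_b):
    """log2 of abundance_a / abundance_b; both values must be positive."""
    if abundance_a <= 0 or abundance_b <= 0:
        raise ValueError("abundances must be positive")
    return math.log2(abundance_a / abundance_b)


# Formate versus lactate for DVU3171, using the predicted values from the text.
print(round(log2_ratio(29.3, 20.5), 2))
```

A positive value indicates higher predicted abundance in the first condition, and a value of 1 corresponds to a two-fold difference.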


    Acknowledgments
 
The authors would like to thank the anonymous reviewers for their excellent comments and suggestions, which greatly improved the quality of the paper. The authors would like to thank Dr David E. Culley from Pacific Northwest National Laboratory for his critical reading of this manuscript. The research described in this paper was conducted under the Laboratory Directed Research and Development Program at the Pacific Northwest National Laboratory, a multi-program national laboratory operated by Battelle for the US Department of Energy under Contract DE-AC05-76RLO1830.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Golan Yona

Received on December 14, 2005; revised on March 31, 2006; accepted on April 1, 2006

    REFERENCES

    Alm, E.J., et al. (2005) The MicrobesOnline Web site for comparative genomics. Genome Res., 15, 1015–1022.

    Alter, O. and Golub, G.H. (2004) Integrative analysis of genome-scale data by using pseudoinverse projection predicts novel correlation between DNA replication and RNA transcription. Proc. Natl Acad. Sci. USA, 101, 16577–16582.

    Aubert, C., et al. (1998) Characterization of the cytochromes C from Desulfovibrio desulfuricans G201. Biochem. Biophys. Res. Commun., 242, 213–218.

    Beyer, A., et al. (2004) Post-transcriptional expression regulation in the yeast Saccharomyces cerevisiae on a genomic scale. Mol. Cell. Proteomics, 3, 1083–1092.

    Elias, D.A., et al. (2004) Periplasmic cytochrome C3 of Desulfovibrio vulgaris is directly involved in H2-mediated metal but not sulfate reduction. Appl. Environ. Microbiol., 70, 413–420.

    Eng, J.K., et al. (1994) An approach to correlate tandem mass spectral data with amino acid sequences in a protein database. J. Am. Soc. Mass Spectrom., 5, 976–979.

    Gao, J., et al. (2003) Changes in the protein expression of yeast as a function of carbon source. J. Proteome Res., 2, 643–649.

    Greenbaum, D., et al. (2002) Analysis of mRNA expression and protein abundance data: an approach for the comparison of the enrichment of the features in the cellular population of proteins and transcripts. Bioinformatics, 18, 585–596.

    Greenbaum, D., et al. (2003) Comparing protein abundance and mRNA expression levels on a genomic scale. Genome Biol., 4, 117.1–117.8.

    Gygi, S.P., et al. (1999) Correlation between protein and mRNA abundance in yeast. Mol. Cell. Biol., 19, 1720–1730.

    Hegde, P.S. (2003) Interplay of transcriptomics and proteomics. Curr. Opin. Biotechnol., 14, 647–651.

    Heidelberg, J.F., et al. (2004) The genome sequence of the anaerobic, sulfate-reducing bacterium Desulfovibrio vulgaris Hildenborough. Nat. Biotechnol., 22, 554–559.

    Horak, C.E. and Snyder, M. (2002) Global analysis of gene expression in yeast. Funct. Integr. Genomics, 2, 171–180.

    Ideker, T., et al. (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science, 292, 929–934.

    Johnson, R.A. (2005) Miller and Freund's Probability and Statistics for Engineers. Pearson Prentice Hall.

    Lambert, D. (1992) Zero-inflated Poisson regression, with an application to defects in manufacturing. Technometrics, 34, 1–14.

    McCullagh, P. and Nelder, J.A. (1989) Generalized Linear Models. Chapman and Hall.

    Meyer, T.E., et al. (1971) Cytochrome C3, a class of electron transfer heme proteins in both photosynthetic and sulfate-reducing bacteria. Biochim. Biophys. Acta, 245, 453–464.

    Mootha, V.K., et al. (2003a) Identification of a gene causing human cytochrome c oxidase deficiency by integrative genomics. Proc. Natl Acad. Sci. USA, 100, 605–610.

    Mootha, V.K., et al. (2003b) Integrated analysis of protein composition, tissue diversity, and gene regulation in mouse mitochondria. Cell, 115, 629–640.

    Nie, L., et al. (2006) Correlation between mRNA and protein abundance in Desulfovibrio vulgaris: a multiple regression to identify sources of variations. Biochem. Biophys. Res. Commun., 339, 603–610.

    Nuwaysir, E.F., et al. (2002) Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res., 12, 1749–1755.

    Qian, W.J., et al. (2005) Probability-based evaluation of peptide and protein identifications from tandem mass spectrometry and SEQUEST analysis: the human proteome. J. Proteome Res., 4, 53–62.

    Selinger, D.W., et al. (2003) Global RNA half-life analysis for Escherichia coli reveals positional patterns of transcriptional degradation. Genome Res., 13, 216–223.

    Smith, R.D., et al. (2002) The use of accurate mass tags for high-throughput microbial proteomics. OMICS, 6, 61–90.

    Washburn, M.P., et al. (2001) Large-scale analysis of the yeast proteome by multidimensional protein identification technology. Nat. Biotechnol., 19, 242–247.

    Washburn, M.P., et al. (2003) Protein pathway and complex clustering of correlated mRNA and protein expression analyses in Saccharomyces cerevisiae. Proc. Natl Acad. Sci. USA, 100, 3107–3112.

    Yates, J.R., III, et al. (1995) Method to correlate tandem mass spectra of modified peptides to amino acid sequences in the protein database. Anal. Chem., 67, 1426–1436.

    Yu, X.L., et al. (1993) Assessment of metals in reconstituted metallothioneins by electrospray mass spectrometry. Anal. Chem., 65, 1355–1359.

    Zhang, W., et al. (2006a) Global transcript analysis in Desulfovibrio vulgaris grown on different carbon sources. Antonie van Leeuwenhoek, (in press).

    Zhang, W., et al. (2006b) A proteomic view of the metabolism in Desulfovibrio vulgaris determined by liquid chromatography coupled with tandem mass spectrometry. Proteomics, (in press).

