Bioinformatics Advance Access originally published online on January 12, 2005
Bioinformatics 2005 21(9):1935-1942; doi:10.1093/bioinformatics/bti258
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Robust estimation of protein expression ratios with lysate microarray technology
1Department of Pathology, University of Texas M.D. Anderson Cancer Center Houston, TX, USA
2Institute of Signal Processing, Tampere University of Technology Tampere, Finland
*To whom correspondence should be addressed
| Abstract |
|---|
|
|
|---|
Motivation: The protein lysate microarray is a developing proteomic technology for measuring protein expression levels in a large number of biological samples simultaneously. A challenge for accurate quantification is the relatively narrow dynamic range associated with the commonly used chromogenic signal detection system. To facilitate accurate measurement of the relative expression levels, each sample is serially diluted and each diluted version is spotted on a nitrocellulose-coated slide in triplicate. Thus, each sample yields multiple measurements in different dynamic ranges of the detection system. This study aims to develop suitable algorithms that yield accurate representations of the relative expression levels in different samples from multiple data points.
Results: We evaluated two algorithms for estimating relative protein expression in different samples on the lysate microarray by means of a cross-validation procedure. For this purpose as well as for quality control we designed a 1440-spot lysate microarray containing 80 identical samples of purified bovine serum albumin, printed in triplicate with six 2-fold dilutions. Our analysis showed that the algorithm based on a robust least squares estimator provided the most accurate quantification of the protein lysate microarray data. We also demonstrated our methods by estimating relative expression levels of p53 and p21 in either p53+/+ or p53/ HCT116 colon cancer cells after two drug treatments and their combinations on another lysate microarray.
Availability: http://www.cs.tut.fi/~mirceanc/lysate_array_bioinformatics.htm
Contact: is{at}ieee.org
| INTRODUCTION |
|---|
|
|
|---|
Despite the enormous genomic complexity of most organisms, and in particular humans, the complexity is further increased at the protein level as a result of posttranslational modifications, such as phosphorylation, acetylation and ubiquitination, which can appreciably impact the functional state of proteins. Thus, it is not only the levels of proteins but also their modification status that will have to be studied in order to gain a deeper understanding of biological systems. A number of proteomic technologies have been developed that allow researchers to study proteins in a high-throughput fashion. Among these is the protein microarray, which may appear in different formats depending on what substrates are deposited on the solid matrix (MacBeath and Schreiber, 2000). For example, various antibodies can be spotted on the slides to produce the so-called antibody array (Ivanov et al., 2004).
Another format is called the reverse-phase protein lysate microarray (or simply, lysate array) where cellular lysates or purified proteins are arrayed on nitrocellulose-coated slides and probed with antibodies that recognize various proteins and their modified products (Paweletz et al., 2001). Several recent studies report the use of lysate arrays for investigating signal pathways using cancer specimens (Grubb et al., 2003; Wulfkuhle et al., 2003; Espina et al., 2003). A lysate array spotted from a library of purified proteins can also serve as a powerful tool for detecting proteinprotein interactions. The advantages of a lysate array include the requirement of minimal amount of protein extracts for generation of tens of arrays, potential for automation of hybridization and considerable saving of labor compared with western blotting experiments. Another important advantage of lysate arrays over the western blotting assay is that multiple replicates and dilutions can be incorporated into the experimental design, thus making the protein level quantification more accurate. This is not insignificant because the dynamic range of commonly used protein detection methods is narrow, whereas the range of protein expression in different samples can be very large. Therefore, it is important to develop robust algorithms for analyzing the multiple data points produced by lysate arrays.
Nishizuka et al. (2003) profiled 60 human cancer cell lines (NCI-60) with lysate arrays using serial dilutions. After removal of outliers, dilution curves were estimated from the ten dilution points using monotonic linear spline interpolation. The patterns of protein expression were compared with those obtained for the same genes using both cDNA and oligonucleotide arrays. It was discovered that structure-related proteins exhibited a high correlation between mRNA and protein levels across the NCI-60 cell lines.
In this study, our primary objective was to develop a method for accurately estimating the relative protein expression in two different samples from a protein lysate microarray. Toward that end, we proposed and evaluated two different algorithms. For this purpose as well as for quality control we designed a 1440-spot lysate microarray containing 80 identical samples of purified bovine serum albumin (BSA), printed in triplicate with six 2-fold dilutions. We further tested the selected algorithms to quantify the expression of selected proteins in colon cancer cells after drug treatment.
| MATERIALS AND METHODS |
|---|
|
|
|---|
Lysate mixtures and signal detection
Cells and protein extraction
Proteins were harvested from isogenic HCT116 cell lines provided by Dr Bert Vogelstein (Johns Hopkins University, Baltimore, MD). The p53+/+ and p53/ cell lines were cultured in Dulbecco's minimal essential medium (DMEM) with 10% Nu-serum (Collaborative Research Products, Bedford, MA). Lysates were collected in a buffer containing 20 mM Tris pH 7.6, 150 mM NaCl, 5 mM EDTA, 0.5% NP40 freshly supplemented with 0.02 mM leupeptin and 0.01 mM PMSF after 48 h of drug treatment at IC25. Lowry assay was used to normalize the protein concentration prior to liquid handling.
Lysate array printing
BSA demonstration chips containing six 2-fold dilutions with 80 sample replicates were made by a liquid-handling robot RSP (TECAN US, Research Triangle Park, NC). These dilutions were then spotted on the nitrocellulose-coated FAST (Scheicher and Schuell, Keene, NH) or Vivid (Pall, Ann Arbor, MI) slides with 500 µm center-to-center interspot distance using a G3 spotter (Genomics Solutions, Ann Arbor, MI) to produce triplicate spots of all dilution points with the specified number of transfers. HCT116 lysate arrays were created in a similar manner with 500, 250, 125, 62.5, 31.5 and 15.625 ng/µl of protein.
Detection and imaging
Detection of target protein was done by a heavily modified DAKO CSA kit (DakoCytomation, Carpinteria, CA) catalog # K1500. Briefly, the slides were blocked with reblot + Mild (Chemicon, Temecula, CA) catalog # 2502, followed by overnight blocking with i-block (Applied Biosystems, Foster City, CA) catalog # Tropix AI300. The next day, blocking continued with fresh 3% hydrogen peroxide, Avidin/Biotin and casein block. Primary antibodies p53 Ab-6 (Oncogene Research Products, San Diego, CA) and p21/WAF1 (gift from Wade Harper, Baylor College of Medicine, Houston, TX) were diluted in antibody dilution buffer and incubated at 25°C for 1 h in a humid chamber. Secondary biotinylated antibodies BA-9200 for rabbit primaries or BA-1000 for mouse primaries (Vector Laboratories, Burlingame, CA) were diluted 1:10000 and incubated as before. Final steps included streptavidinbiotin, amplification reagent and streptavidin/peroxidase incubations from the CSA kit followed by DAB+development. TBST washes preceded all steps for the removal for the previous reagent. Slides were scanned in color at 1200 DPI; the resulting image was converted to a 16-bit grayscale and inverted (negative) to allow quantification by ArrayVisionTM (Imaging Research Inc.).
Design on slide and spotting
In order to tune the preprocessing stages of the liquid handling, spotting and image analysis, we created slides containing purified BSA spotted in three replicates of each sample with six 2-fold dilutions for each of 80 identical samples resulting in 1440 spots per slide. Three replicates were included to provide better error resilience and outlier detection (Fig. 1 gives the layout of the lysate microarray).
|
Robust estimation of expression ratios
An important goal is the estimation of the ratio between expressions of specific proteins from two samples (e.g. tumor/normal or treated/untreated). We propose and evaluate two robust approaches in comparison with standard linear regression aimed at estimating the expression ratios from lysate arrays. In an ideal case, both robust methods should yield the same results as linear regression.
In developing our approaches for estimating the protein expression ratios, our overall aim was to produce results that are biologically accurate and robust to outliers and other artifacts. We consider an outlier to be a spot that is inconsistent either with the other replicate spots in the same dilution (e.g. in a certain dilution two replicates are close to each other while the third is significantly different) or with respect to all other spots, which may be caused, e.g. by extreme saturation or a crack in the membrane affecting all spots from the same dilution.
Non-linear approach
The use of triplicates rather than duplicates confers an obvious advantage for robustly estimating the protein expression value. In this approach, we apply the median to the triplicates because of its outlier-resilient properties, thus providing a robust estimate of protein expression for each dilution. We then perform a least squares linear fit to the median estimated values (one estimate per dilution and sample), since the dilution is expected to be linear on a loglog scale as will become apparent in the experimental section (Fig. 5). Figure 2a illustrates the approach for the sample labeled n. The log-expression value is denoted by ni(k) for the replicate i at dilution k, where i
1,2,3 and k = 1,...,6; k = 1 corresponds to the highest concentration. For the sample labeled t, where we have the log-expression values ti(k), the approach is similar. One potential drawback of the non-linear approach described here is that when all replicate spots from one dilution are erroneous, say, due to a scratch, the fitted line can be dramatically affected, even though the other dilutions may be perfectly reliable. The least squares method is optimal if the errors are normal, independent and identically distributed. However, if this is not the case, it can perform very poorly, especially if data are affected by outliers. Next, we propose a more robust method alleviating to a large extent the mentioned drawbacks.
|
|
Robust least squares
This method considers all 18 spots (3 replicates, 6 dilutions) for each sample in order to fit a linear model (in loglog space). Since lysate array data may be susceptible to several artifacts such as membrane irregularities, improper spot segmentation or outliers due to saturation, that can affect all replicates for a given dilution, a robust regression scheme should be used, since one of the main disadvantages of least squares is its sensitivity to outliers. We propose the use of Huber weights with an iteratively reweighted least squares algorithm that minimizes the weighted sum of squares, where the weight given to each value depends on its distance to the fitted line (Huber, 1981; Street et al., 1988). The algorithm, using all 18 spots from each sample, will produce a robust linear fit in loglog space. This method is shown in Figure 2b, where we use the same notation as in Figure 2a.
Estimation of protein expression ratio
The distance between the two fitted lines for the two samples n and t represents the logarithm of the protein expression ratio. However, it is clear that because of experimental variability, the two lines (tumor and normal) will not be perfectly parallel. Our approach is to estimate the distances at all the six dilution points and to weight them in accordance with the estimated quality of the slide at these points. This is based on the intuition that the higher the dispersion for a particular dilution (over the entire array) the less the weight this dilution should get when calculating the distance between the two fitted lines and consequently, the less the influence this dilution should have on the final estimate of the ratio of protein expressions between the two samples. In the case of the BSA test slide, we chose the weights to be inversely proportional to the dispersion of the values for each dilution, as measured by the interquartile range. For other lysate arrays, which typically contain many different samples, we used another approach to estimate the weights by means of the coefficient of variation (CoV), which is defined as the standard deviation divided by the mean for each dilution. Specifically, for each dilution, we calculated the mean (or median) of the CoVs, where the mean is taken of the over all samples on the array, and set the weights to be inversely proportional to these values. Finally, the weights were normalized such that their sum is equal to one. In summary, we proposed the following multistage method:
- Fit a (robust) regression line for each sample,
- Calculate the dilution weights,
- Calculate the weighted distance between the two samples.
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
Experiments using protein lysate arrays included multiple steps, including preparation of the biological samples, construction of lysate arrays, detection and imaging. An additional factor for lysate arrays is that when the protein concentrations from biological samples, such as microdissected cells, are very low, we want to repeat the spotting multiple times on one array in order to transfer more proteins to the lysate array. This is not normally done for cDNA microarrays and consistency of such repeated touches has to be evaluated in our experiments. Quality control is crucial for the success of the experiments and for evaluation of the data analysis methods as in cDNA microarray experiments. To avoid potentially complicating factors associated with biological materials from cells, we first used a purified protein, BSA, as printing material to evaluate our lysate spotting procedures and the performance of the estimation algorithms. For example, we compared the quality of the slides after 1, 5 and 10 touches (Fig. 3), referring to the number of times a protein is transferred to the same spot on the slide by the printing robot. As seen in Figure 3, single-touch printing results in more outliers, especially in the low concentration range, but 10-touch printing produces a dilution curve with an insufficiently steep slope. We found 5 touches to yield the best overall quality.
|
We also used the BSA array to compare the two proposed algorithms to estimate the protein expression ratios. Since all 80 samples on the BSA array are identical, under ideal circumstances, the protein expression ratios between any pair of samples should be equal to one; equivalently, the log-expression differences should be zero. Our strategy for comparing the two proposed methods with each other and with standard linear regression is based on a cross-validation approach whereby the true dilution curve is estimated, by averaging from 49 samples (training set), and an error is formed between that curve and a test dilution curve computed from one sample (test set). The remaining 30 samples are used for computing the weights corresponding to the quality of each dilution, as discussed earlier. The reason for reserving 30 samples for computing only the weights is to avoid computing the quality weights from the same samples to which these weights will be applied. Finally, the 49:1:30 random split is performed approximately 5700 times in order to construct the histograms of the errors (log-expression differences between the quality-weighted dilution curves). The results are shown in Figure 4. In addition to the error, we also computed the mean squared error (MSE), the histogram of which is also shown in Figure 4. Our results strongly indicate that the robust least squares approaches yield a dramatically lower error than the non-linear approach and the standard linear regression. In the iteratively reweighted least squares algorithm, there are a number of possibilities for the weight function, each protecting against outliers in slightly different ways. We used the Huber weight function, but also tested other variants. Figure 4 (upper inset) shows fairly comparable results for the Andrews, Cauchy, Logistic, Talwar, Turkey and Welsch weighting schemes. Furthermore, only the robust least squares method yields cross-validation errors that are normally distributed, as shown by the histogram and the normal probability plot in Figure 4 for Huber weights.
|
In order to demonstrate this method in a real experimental setting, we designed another lysate array containing samples of HCT116 colon cancer cell lines under two different drug treatments (we describe them as drug 1 and drug 2) as well as a treatment consisting of a combination of the two drugs. Additionally, the array contained p53-null or p53/ HCT116 cells with no treatment as well as p53+/+ HCT116 cells (parental cells) with no treatment as a control. Each sample was spotted with six dilutions and three replicates, as on the BSA array. Using these arrays, we measured the relative protein expressions for p53 as well as its downstream target gene product p21. All expression ratios were with respect to p53+/+ HCT116 cells with no treatment.
To estimate the protein expression ratios, we used the robust least squares method and estimated the quality-based weights for each dilution using the mean of the coefficients of variation over the samples of the slide, as described in the Methods section. Figures 5c and d show the fitted lines for all samples, for p53 and p21, respectively.
We estimated the errors of the model once again, this time directly applied to HCT116 cells, by means of a bootstrap procedure (Efron and Tibshirani, 1993). In applying bootstrap, we have at least two different possible ways of estimating the errors for the distance between the two regressions. The two fitted lines of drug 1 (B), drug 2 (C), combination of drug 1 and drug 2 (D) and p53/ no-drug (E) are compared with the expression of p53+/+ HCT116 (A). The principles of the two methods are different. One could bootstrap the pairs (ni(k),k) and then estimate the ratio between two protein expressions as the quality-weighted distance using the bootstrap-randomized selection. Another option is to estimate the errors by means of bootstrapping the residuals.
We preferred to use the first alternative, which involves recalculation of the parameters, as it makes no assumption about whether or not the regression model holds. Figure 5b illustrates the expression ratios, relative to p53+/+ HCT116 cells with no treatment, indicated as a baseline at zero, along with the bootstrap-estimated standard deviations. Because the bar graphs in Figure 5b represent distances, the standard deviations on these bars incorporate the combined variability owing to the treatment (or p53/) and the p53+/+ no-drug reference.
As expected, the p53/ cells (E) have a much lower p53 relative expression than all other conditions. The same behavior is seen for p21 expression, but not so dramatically. Further, the combination of drugs (D) does not increase the expression of p53 and p21 in HCT116 p53+/+ cells, relative to drug 1 (B) or drug 2 (C) alone. As a validation step, Figure 5a shows a western blot corresponding to all tested conditions.
The HCT116 cell data are fairly free of outliers. In an ideal case, the two robust methods should return the same results as a standard regression model. We considered it fit to report the case, where the three methods return significantly different results, a case not used in our further analyses. The slide in question is spotted with BSA, with 1-touch. We can observe several cracks of the membrane caused by erroneous handling combined with a faster drying on a Pall's Vivid slide. On the left of Figure 6, we show the entire slide. The upper detail is the processed image as described in the Methods section, before applying the segmentation step of ArrayVisionTM. For this sample, the values of the fifth dilution spots are strongly affected by the crack and, as we can see, in reality the spots themselves are not outliers. From the three models, we can observe the robust least squares to be the least influenced, followed by least squares of the medians method (non-linear approach), and finally by standard least squares. Further it may be useful, as an additional step, to conduct a test of linearity (on the log-scale) prior to applying any of the aforementioned methods (Neter et al., 2004).
|
In measuring the expression of samples, an important question is to determine if the sample is significantly different from another. The results of the three algorithms are compared on the cracked membrane sample and two normal samples. In Figure 6, we plot bootstrap histograms (10000 runs, resampling with replacement was performed separately for each dilution) of the distances between two neighboring normal samples, using each of the three estimation methods. The distance between the sample influenced by the cracked membrane and the neighboring normal sample is in the same range (P-value 0.61) using the robust least squares method (Fig. 6c), in contrast to the other two algorithms (P-value is zero) (Fig. 6d and e). The distances between the cracked and normal samples are shown by the bars in Figure 6c, d and e.
Our quality control experiments and evaluation of several proposed methods for estimating the relative protein expressions, using a specially designed BSA array, indicate that this technology, coupled with the computational methods described here, can be used to robustly determine protein differential expression in a high-throughput manner. The experiments with HCT116 cells further support this claim. We demonstrate that the robust least squares approach provides an accurate quantification for protein lysate arrays that contain multiple data points for each sample.
| Acknowledgments |
|---|
The work was partially supported by the Tobacco Settlement Fund to M.D. Anderson Cancer Center (MDACC) as appropriated by the Texas Legislature, a grant from Kadoorie Foundation to MDACC, a grant from the Goodwin Fund and the Cancer Center Supporting Grant from NIH/NCI, and a grant from the Academy of Finland.
Received on June 24, 2004; revised on December 21, 2004; accepted on December 29, 2004
| REFERENCES |
|---|
|
|
|---|
Efron, B. and Tibshirani, R.J. An Introduction to the Bootstrap, (1993) , New York Chapman and Hall.
Espina, V., Mehta, A.I., Winters, M.E., Calvert, V., Wulfkuhle, J., Petricoin, E.F., III, Liotta, L.A. (2003) Protein microarrays: molecular profiling technologies for clinical specimens. Proteomics, 11, 20912100.
Grubb, R.L., Calvert, V.S., Wulkuhle, J.D., Paweletz, C.P., Linehan, W.M., Phillips, J.L., Chuaqui, R., Valasco, A., Gillespie, J., Emmert-Buck, M., Liotta, L.A., Petricoin, E.F. (2003) Signal pathway profiling of prostate cancer using reverse phase protein arrays. Proteomics, 11, 21422146.
Huber, P.J. Robust Statistics, (1981) , New York Wiley.
Ivanov, S.S., Chung, A.S., Yuan, P.Z., Guan, A.Y., Sachs, K.V., Reichner, J.S., Chin, Y.E. (2004) Antibodies immobilized as arrays to profile protein post-translational modifications in mammalian cells. Mol. Cell Proteomics, 8, 788795.
MacBeath, G. and Schreiber, S.L. (2000) Printing proteins as microarrays for high-throughput function determination. Science, 289, 17601763
Neter, J., Kutner, M.H., Wasserman, W., Nachtsheim, C.J. Applied Linear Regression Models with Student CD-rom, (2004) 4th edn , Boston, New York WCB/McGraw-Hill.
Nishizuka, S., Charboneau, L., Young, L., Major, S., Reinhold, W.C., Waltham, M., Kouros-Mehr, H., Bussey, K.J., Lee, J.K., Espina, V., et al. (2003) Proteomic profiling of the NCI-60 cancer cell lines using new high-density reverse-phase lysate microarrays. Proc. Natl Acad. Sci. USA, 24, 1422914234.
Paweletz, C.P., Charboneau, L., Bichsel, V.E., Simone, N.L., Chen, T., Gillespie, J.W., Emmert-Buck, M.R., Roth, M.J., Petricoin, E.F., III, Liotta, L.A. (2001) Reverse phase protein microarrays which capture disease progression show activation of pro-survival pathways at the cancer invasion front. Oncogene, 20, 19811989[CrossRef][Web of Science][Medline].
Street, J.O., Carroll, R.J., Ruppert, D. (1988) A note on computing robust regression estimates via iteratively re-weighted least squares. Am. Stat., 42, 152154[CrossRef].
Wulfkuhle, J.D., Aquino, J.A., Calvert, V.S., Fishman, D.A., Coukos, G., Liotta, L.A., Petricoin, E.F., III. (2003) Signal pathway profiling of ovarian cancer from human tissue specimens using reverse-phase protein microarrays. Proteomics, 3, 20852090[CrossRef][Medline].
This article has been cited by other articles:
![]() |
E. S. Neeley, S. M. Kornblau, K. R. Coombes, and K. A. Baggerly Variable slope normalization of reverse phase protein arrays Bioinformatics, June 1, 2009; 25(11): 1384 - 1389. [Abstract] [Full Text] [PDF] |
||||
![]() |
L. Zhang, Q. Wei, L. Mao, W. Liu, G. B. Mills, and K. Coombes Serial dilution curve: a new method for analysis of reverse phase protein array data Bioinformatics, March 1, 2009; 25(5): 650 - 654. [Abstract] [Full Text] [PDF] |
||||
![]() |
O. Gul, H. Basaga, and O. Kutuk Apoptotic blocks and chemotherapy resistance: strategies to identify Bcl-2 protein signatures Brief Funct Genomic Proteomic, February 18, 2008; (2008) eln002v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Hu, X. He, K. A. Baggerly, K. R. Coombes, B. T.J. Hennessy, and G. B. Mills Non-parametric quantification of protein lysate arrays Bioinformatics, August 1, 2007; 23(15): 1986 - 1994. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







