Bioinformatics Advance Access originally published online on April 25, 2007
Bioinformatics 2007 23(12):1451-1458; doi:10.1093/bioinformatics/btm130
Simultaneous and exact interval estimates for the contrast of two groups based on an extremely high dimensional variable: application to mass spec data


1Department of Biostatistics, Dana-Farber Cancer Institute, 2Department of Biostatistics, Harvard School of Public Health, 3Lank Center for Genitourinary Oncology, Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA and 4Department of Statistics, Seoul National University, Seoul, Korea.
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Analysis of high-throughput proteomic/genomic data, in particular surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS) data and microarray data, has led to a multitude of techniques aimed at identifying potential biomarkers. Most statistical techniques for comparing two groups are based on qualitative measures such as the P-value. A quantitative approach, such as interval estimation for the contrast between two groups, is more appealing.
Results: We have devised a simultaneous confidence bands method, based on a permutation scheme, that detects potential biomarkers discriminating two treatment groups in high-dimensional datasets while controlling the overall confidence coverage level. For example, for the SELDI-TOF MS data, we deal with the entire spectrum simultaneously and construct $(1-\alpha)$ confidence bands for the mean differences between groups. Furthermore, peaks were identified based on the maximal differences between the groups as determined by the confidence bands. The analysis method described herein gives both qualitative (P-value) and quantitative (magnitude of difference) information. The Clinical Proteomics Programs Databank's ovarian cancer dataset and data from in-house samples containing known spiked-in proteins were analyzed. We were able to identify potential biomarkers similar to those described in previous analyses of the ovarian cancer data; however, while these markers are highly significant between cancer and normal groups, our analysis indicated that the absolute difference between the two groups was minimal. In addition, we found markers beyond those previously described, with greater differences in average intensities. The proposed confidence bands method successfully detected the spiked-in peaks, as well as secondary peaks generated by adducts and double-charged species. We also illustrate our method using paired gene expression data from a prostate cancer microarray experiment by constructing confidence bands for the fold changes between cancer and normal samples.
Availability: The R package seie.zip (license: GNU GPL) is publicly available at http://research2.dfci.harvard.edu/dfci/MS_spike-in_data/
Contact: parkyuhyun{at}gmail.com
Supplementary information: For supplementary data, please refer to Bioinformatics online.
| 1 INTRODUCTION |
|---|
Microarray and mass spectrometry technologies require effective statistical/computational methods for finding important biomarkers, defined as features (genes/proteins) that are differentially expressed between two groups of samples, among thousands of features. An inherent problem in analyzing such high-dimensional data is the occurrence of false positives when separate statistical tests are applied to each feature using traditional P-value cut-offs of 0.01 or 0.05, owing to the large number of features, which are potentially correlated with each other in an unknown fashion. This multiplicity problem has been widely studied in the statistical literature (Benjamini and Hochberg, 1995; Hochberg and Tamhane, 1987; Lehmann, 1986; Westfall and Young, 1993) and in recent papers analyzing microarray experiments (Dudoit et al., 2003; Efron et al., 2002; Golub et al., 1999; Pollard and van der Laan, 2003; Tusher et al., 2001).
However, most of the ideas introduced so far for biomarker detection in high-dimensional data originated from hypothesis testing rather than from interval estimation of the difference between two groups. An interval estimate provides a set of plausible values for a parameter of interest, which is more informative than a single P-value for quantifying its importance, both statistically and biologically. For example, in comparing the mean gene-expression levels between two experimental groups, suppose that for Gene X the 0.95 confidence interval for the fold change is (1.1, 1.2), whereas the 0.95 interval for the fold change for Gene Y is (4, 8). Suppose further that the first has a Z score of six and the second a Z score of five. In this case, the fold change for X may not be biologically interesting even though its P-value is more highly significant. The usefulness of such interval estimation is well acknowledged in the statistical literature; however, simultaneous interval estimation for high-dimensional data has not been widely considered, owing to complications arising from the unknown dependence structure among a large number of features.
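The Gene X versus Gene Y example can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes that each interval is a symmetric normal-theory interval on the fold-change scale and that the null fold change is 1, neither of which is stated in the text. The function name `z_from_interval` is hypothetical.

```python
from statistics import NormalDist

def z_from_interval(lower, upper, null_value=1.0, level=0.95):
    """Z score implied by a symmetric normal-theory interval (lower, upper).

    Assumption (not from the article): the interval is estimate +/- z_crit * SE.
    """
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for level = 0.95
    estimate = (lower + upper) / 2                   # interval midpoint
    std_err = (upper - lower) / (2 * z_crit)         # half-width / critical value
    return (estimate - null_value) / std_err

z_x = z_from_interval(1.1, 1.2)   # Gene X: narrow interval close to 1
z_y = z_from_interval(4.0, 8.0)   # Gene Y: wide interval far from 1
print(round(z_x, 1), round(z_y, 1))   # prints: 5.9 4.9
```

Under these assumptions the implied Z scores are about six for Gene X and about five for Gene Y, so the gene with the smaller P-value is the one whose plausible fold changes are all close to 1.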
We propose a simple algorithm, based on a permutation method, to construct $(1-\alpha)$ simultaneous and exact confidence bands for any contrast between groups in high-dimensional datasets. The proposed method allows visualization of the possible range of differences in protein/gene abundance between groups, with statistical significance, while simultaneously controlling the overall confidence coverage level. We consider comparisons not only between two independent samples but also between dependent paired samples. One of the most intuitive contrast measures is the mean difference between two groups, which is usually tested by t-statistics. Our method can be flexibly generalized to construct confidence bands for general two- or one-sample test statistics, such as Wilcoxon test statistics.
The confidence bands method can be applied to different kinds of datasets, including microarray, CGH/SNP or proteomics data. In particular, we found this method to be very useful for exploring proteome-wide biomarkers using surface-enhanced laser desorption/ionization time-of-flight mass spectrometry (SELDI-TOF MS) technology.
Conventional methods for analyzing SELDI-TOF MS data first detect the peaks in the spectrum generated for each sample after calibration and then align these peaks across the samples. Next, the peaks, each related to a mass-to-charge ratio (m/z), that discriminate the groups are determined by testing peak intensities. However, peak detection methods are controversial because they are ad hoc and their results can vary with user-defined parameters such as the signal-to-noise ratio. To tackle this problem, researchers have used fingerprints (Fung et al., 2001; Vlahou et al., 2001); however, this approach does not allow for individual peak labeling and subsequent protein identification. Morris et al. (2005) introduced a peak detection method based on the mean spectrum and demonstrated that using the spectrum average leads to greater sensitivity and specificity while eliminating the difficult and intrinsically error-laden step of matching detected peaks across individual spectra.
To bypass the difficulties associated with peak detection methods, we use all the intensity data along the spectrum without a peak calling procedure. After constructing $(1-\alpha)$ confidence bands for the mean difference between groups along the entire spectrum, we determine the peaks (biomarkers) that have potential maximal differences between groups based on the confidence bands. In this way, we obtain statistically meaningful peaks that discriminate the two groups with both qualitative and quantitative significance.
First, we used the well-known SELDI-TOF MS dataset from the latest ovarian cancer experiment in the Clinical Proteomics Programs Databank (http://clinicalproteomics.steem.com). This dataset has been discussed in several papers (Baggerly et al., 2004; Diamandis, 2004; Petricoin et al., 2002; Sorace and Zhan, 2003) and has been controversial. We re-analyzed the data and compared our results with those of these papers. We also performed a spike-in experiment with samples from 91 prostate cancer patients to assess the accuracy of our method.
To demonstrate our method for matched paired samples and other high-dimensional data, we also constructed confidence bands for the fold change between cancer and normal samples in a prostate cancer microarray experiment with 46 matching pairs that was used in Singh et al. (2002).
| 2 STATISTICAL METHODS |
|---|
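Suppose that we are interested in making inferences about the difference between two groups, A and B, based on a large set of measurements, or some specific transformations thereof, from each study subject. That is, the response random vector for A is $X = \{X_t, t \in I\}$ and that for B is $Y = \{Y_t, t \in I\}$, where $I$ is the index set of the features. We are interested in constructing a $(1-\alpha)$ simultaneous confidence band $\{(L_t, U_t), t \in I\}$ for $\{\Delta_t, t \in I\}$, where $\Delta_t = E(X_t) - E(Y_t)$, i.e.

$$\mathrm{pr}(L_t \le \Delta_t \le U_t \ \text{for all}\ t \in I) \ge 1 - \alpha. \qquad (1)$$

Let $\bar{X}_t$ and $\bar{Y}_t$ be the sample means of $X_t$ and $Y_t$, respectively. Then, for each fixed $t$, a pointwise, two-sided $(1-\alpha)$ confidence interval for $\Delta_t$ is

$$(\bar{X}_t - \bar{Y}_t - z_{\alpha/2} S_t,\ \bar{X}_t - \bar{Y}_t + z_{\alpha/2} S_t), \qquad (2)$$

where $S_t$ is a sample standard error estimate for $\bar{X}_t - \bar{Y}_t$ and $z_{\alpha/2}$ is the upper $100\alpha/2$ percentage point of the standard normal. Obviously, to obtain a $(1-\alpha)$ confidence band satisfying (1), one needs to replace $z_{\alpha/2}$ in (2) with larger cut-off values, say, $c_\alpha$ and $d_\alpha$, such that the interval

$$(L_t, U_t) = (\bar{X}_t - \bar{Y}_t - c_\alpha S_t,\ \bar{X}_t - \bar{Y}_t + d_\alpha S_t), \quad t \in I, \qquad (3)$$

satisfies (1). Unfortunately, since we do not know the dependence structure among the components of $\{X_t\}$ nor of $\{Y_t\}$, the cut-off points $c_\alpha$ and $d_\alpha$ are difficult, if not impossible, to obtain analytically, even for the case with large sample sizes m and n.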
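Here, we utilize a simple permutation idea to obtain the cut-off points $c_\alpha$ and $d_\alpha$. First, for a generic random quantity Q, let the lower case q be its observed value. Let c be a positive real value and let $l_t = \bar{x}_t - \bar{y}_t - c\,s_t$, $t \in I$. If $\{l_t\}$ is the true value of $\{\Delta_t\}$, then $\{x_{it} - l_t\}$ and $\{y_{jt}\}$ were generated from the same distribution. Let $\{X^*_t\}$ be a random sample with size n drawn from the finite population composed of $\{x_{it} - l_t\}$ and $\{y_{jt}\}$, and let $\{Y^*_t\}$ be the random sample which is the complement of $\{X^*_t\}$. Moreover, let $\bar{X}^*_t$ and $\bar{Y}^*_t$ be the sample means for $\{X^*_t\}$ and $\{Y^*_t\}$, respectively. Then, $c \ge c_\alpha$ if

$$\mathrm{pr}\Big(\max_{t \in I} \frac{\bar{X}^*_t - \bar{Y}^*_t}{s_t} \ge c\Big) \le \alpha/2. \qquad (4)$$

To obtain the upper bound $d_\alpha$, we let d be a positive number and $u_t = \bar{x}_t - \bar{y}_t + d\,s_t$, $t \in I$. We let $\{X^{**}_t\}$ be a random sample from a population consisting of $\{x_{it} - u_t\}$ and $\{y_{jt}\}$. Moreover, let $\{Y^{**}_t\}$ be the complement of the sample $\{X^{**}_t\}$. Then, $d \ge d_\alpha$ if

$$\mathrm{pr}\Big(\min_{t \in I} \frac{\bar{X}^{**}_t - \bar{Y}^{**}_t}{s_t} \le -d\Big) \le \alpha/2. \qquad (5)$$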
These confidence bands (3) can be graphically displayed as two curves along the feature index $t$, indicating the possible range of the group difference $\Delta_t$ with $(1-\alpha)$ confidence. Unlike the P-value curve across $t$, this confidence band is quite useful for identifying features whose differences between the two groups, for example in mean marker values, are both statistically and biologically significant.
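A minimal sketch of a permutation-based simultaneous band for two independent groups follows. It is not the authors' iterative procedure: as a simplifying assumption, it centers each group at its own mean, permutes the pooled residuals, and takes the two cut-offs from the permutation distribution of the largest and smallest studentized mean difference. The function name `two_sample_band` and its arguments are illustrative.

```python
import numpy as np

def two_sample_band(x, y, alpha=0.05, n_perm=1000, seed=0):
    """Simultaneous (1 - alpha) band for mean(x) - mean(y), feature by feature.

    x: (m, p) array for group A; y: (n, p) array for group B.
    Returns (lower, upper, c, d), where c and d are the two cut-offs.
    """
    rng = np.random.default_rng(seed)
    m, n = x.shape[0], y.shape[0]

    def std_err(a, b):
        # Unpooled standard error of the difference in means, per feature.
        return np.sqrt(a.var(axis=0, ddof=1) / a.shape[0]
                       + b.var(axis=0, ddof=1) / b.shape[0])

    diff = x.mean(axis=0) - y.mean(axis=0)
    se = std_err(x, y)

    # Assumption: centering each group makes the pooled rows exchangeable
    # under the hypothesis that the true difference has been removed.
    pooled = np.vstack([x - x.mean(axis=0), y - y.mean(axis=0)])

    max_stat = np.empty(n_perm)
    min_stat = np.empty(n_perm)
    for b in range(n_perm):
        idx = rng.permutation(m + n)
        xs, ys = pooled[idx[:m]], pooled[idx[m:]]
        z = (xs.mean(axis=0) - ys.mean(axis=0)) / std_err(xs, ys)
        max_stat[b], min_stat[b] = z.max(), z.min()

    # Each side gets alpha / 2, so the band has overall level 1 - alpha.
    c = np.quantile(max_stat, 1 - alpha / 2)    # cut-off for the lower curve
    d = np.quantile(-min_stat, 1 - alpha / 2)   # cut-off for the upper curve
    return diff - c * se, diff + d * se, c, d
```

Because the cut-offs come from the maximum over all features, they exceed the pointwise normal value of about 1.96 whenever many features are examined, which is what makes the band simultaneous rather than pointwise.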
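Now consider the case in which $\{X_t\}$ and $\{Y_t\}$ are from the same study subject but under different experimental conditions, say, A and B. Let $\{(X_{it}, Y_{it}), t \in I\}$, $i = 1, \ldots, n$, be independent copies of $\{(X_t, Y_t), t \in I\}$. The confidence bands for $\{\Delta_t\}$ can be obtained via a similar argument. However, the random vectors $\{X^*_t\}$ and $\{Y^*_t\}$ in (4) are generated by permuting the shifted observations $\{x_{it} - l_t\}$ and $\{y_{it}\}$ randomly within the ith subject, $i = 1, \ldots, n$, where $l_t$ is the candidate lower bound. For (5), $\{X^{**}_t\}$ and $\{Y^{**}_t\}$ are generated by permuting $\{x_{it} - u_t\}$ and $\{y_{it}\}$ randomly, where $u_t$ is the candidate upper bound.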
Note that the confidence band for
is constructed by inverting a test statistic based on
which can be replaced by any two sample or one sample (for paired observations) test statistic
where
and
are the vectors of the random samples
and
respectively, and
is a
vector whose components are
For example, W may be the standard two or one-sample Wilcoxon test statistic. Let
be the consistent estimator by solving the equation
and St is the standard error estimate of
Then, the cut-off points
and
for the confidence band
in (3) can be obtained via the above iterative procedure. Specifically, to check whether
we replace (4) by
|
|
|
|
| 3 APPLICATION TO MASS SPECTROMETRY DATA ANALYSIS |
|---|
3.1 Finding the potential biomarkers (peaks) for SELDI-TOF MS data
The mass axis of the SELDI-TOF MS output shifts from experiment to experiment by 0.2–0.5% of the m/z values. Let $\rho$ be the mass accuracy of a spectrum. We searched for peaks that may be potential biomarkers within a $\pm\rho$ window in the significant region T. Furthermore, we excluded the m/z values that were not apexes among the remaining T by examining the slopes of the curves of the observed difference in mean intensities between the two groups. We took the final T as the list of important biomarkers.
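One way to read this peak-picking step is sketched below. It is an interpretation, not the authors' code: a point is kept if its band excludes zero and its absolute observed difference is the largest among all points within a relative m/z window of half-width `rho`. The function name `candidate_peaks` and the use of a window maximum in place of an explicit slope check are assumptions.

```python
import numpy as np

def candidate_peaks(mz, diff, lower, upper, rho=0.005):
    """Indices of apexes of |diff| inside the region where the band excludes 0.

    mz, diff, lower, upper: 1-D arrays of equal length (m/z values, observed
    mean difference, and the lower and upper curves of the confidence band).
    rho: relative mass accuracy, e.g. 0.005 for 0.5%.
    """
    mz = np.asarray(mz, dtype=float)
    diff = np.asarray(diff, dtype=float)
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)

    significant = (lower > 0) | (upper < 0)   # band does not cover zero
    peaks = []
    for i in np.flatnonzero(significant):
        # All points within +/- rho (relative) of the current m/z value.
        window = (mz >= mz[i] * (1 - rho)) & (mz <= mz[i] * (1 + rho))
        if np.abs(diff[i]) >= np.abs(diff[window]).max():
            peaks.append(i)
    return np.array(peaks, dtype=int)
```

With a window of 0.5%, neighbouring points on the shoulder of the same peak are discarded and only the apex is reported.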
3.2 Application to Petricoin's ovarian SELDI-TOF MS dataset
We analyzed the latest SELDI-TOF MS data from the ovarian cancer study available in the Clinical Proteomics Programs Databank with our method. This dataset consists of serum profiles of 162 subjects with ovarian cancer and 91 non-cancer control subjects. For each subject, intensities at 15 154 distinct m/z values ranging from 0.0000786 to 19 995.513 were available for analysis. The dataset was generated using Ciphergen WCX2 ProteinChip Arrays. Preparation of chips for sample analysis was performed robotically, and the raw data, without baseline subtraction, were posted for download. We used the normalization method outlined in the Clinical Proteomics Databank, scaling the intensity values to lie between 0 and 1. Additional details of experimental data collection can be found at the Clinical Proteomics Programs Databank.
We analyzed the ovarian cancer dataset using all 11 003 m/z data points within the m/z range I = [1500 m/z, 20 000 m/z] for 91 normal and 162 tumor samples. The intensity measures below 1500 m/z were discarded because of matrix effects. Table 1 shows the cut-off points, $c_\alpha$ and $d_\alpha$, for the confidence bands at levels of 0.005, 0.01 and 0.05, with a total of M = 10 000 permutations. Using an Intel(R) Pentium(R) D 3.00 GHz CPU and 2.00 GB RAM, the algorithm took 8 min and 56 s for 1000 permutations, and 90 min and 12 s for 10 000 permutations. However, we found no practical difference between 1000 and 10 000 permutations: the estimated $c_\alpha$ and $d_\alpha$ obtained from 1000 permutations typically differ by less than 0.1 from those obtained from 10 000 permutations.
The cut-off values were somewhat smaller than the Bonferroni-adjusted cut-off values, which are the cut-off values obtained under the conservative assumption that intensities at individual m/z points are independent.
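For reference, the Bonferroni-adjusted cut-off mentioned here can be computed directly. The sketch below assumes a two-sided normal cut-off with the level split evenly over the 11 003 m/z points; the function name `bonferroni_cutoff` is illustrative, and the values in Table 1 are not reproduced.

```python
from statistics import NormalDist

def bonferroni_cutoff(alpha, n_features):
    """Two-sided normal cut-off after splitting alpha evenly over n_features."""
    return NormalDist().inv_cdf(1 - alpha / (2 * n_features))

# Unadjusted pointwise cut-off at the 0.05 level (about 1.96).
pointwise = NormalDist().inv_cdf(1 - 0.05 / 2)

# Bonferroni cut-off for 11 003 m/z points at the same overall level.
adjusted = bonferroni_cutoff(0.05, 11_003)
print(pointwise, adjusted)
```

The adjusted value lies between 4.5 and 4.6, more than twice the pointwise cut-off, which is the benchmark against which the permutation cut-offs are described as smaller.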
Figure 1 shows the 95% confidence bands for the differences in mean intensity between cancer and normal patients. We obtained the significant region T (the shaded regions in Fig. 1) by excluding the t values at which the 95% confidence band includes 0; this region was used for further analysis.
We next followed the procedure of Section 3.1 with a mass precision of 0.5% and found 48 significant peaks with 95% confidence in the region I as potential biomarkers. Figure 2 illustrates (a) the adjusted P-value curve obtained by the maxT method (Westfall and Young, 1993), in which a higher curve indicates greater significance of the difference between the two groups at that m/z value, and (b) the 95 and 99.5% confidence bands in the region [6700 m/z, 8100 m/z]. The outer dotted lines are the 99.5% and the inner dashed lines the 95% confidence bands, since a higher confidence level produces a wider range for the difference.
It is important to note that the confidence bands give more information than a global P-value curve. Whereas the conventional P-value curve conveys only qualitative evidence of how strongly the two groups differ from each other, the confidence bands actually show the magnitude of the differences with a confidence of $(1-\alpha)$. Moreover, P-value curves alone do not give information about the precise m/z position of potential biomarkers. For example, the peak at 6800 m/z was found to be as significant on the P-value curve as the region 7700–8000 m/z (P-value < 0.0001). However, the confidence bands indicate that the region 7700–8000 m/z contains four individual peaks, each with a greater magnitude of difference than the single peak at 6800 m/z. In this way, we were able to find potential biomarkers based on the actual contrast between the two groups.
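For comparison with the bands, a single-step maxT adjusted P-value curve of the kind attributed to Westfall and Young (1993) can be sketched as follows. This is a generic textbook version under the assumption of a two-sided Welch-type t statistic and label permutation; it is not taken from the authors' R package, and `maxt_adjusted_pvalues` is a hypothetical name.

```python
import numpy as np

def maxt_adjusted_pvalues(x, y, n_perm=1000, seed=0):
    """Single-step maxT adjusted P-values for each of p features.

    x: (m, p) array for one group; y: (n, p) array for the other.
    """
    rng = np.random.default_rng(seed)
    m = x.shape[0]
    pooled = np.vstack([x, y])

    def t_stat(a, b):
        se = np.sqrt(a.var(axis=0, ddof=1) / a.shape[0]
                     + b.var(axis=0, ddof=1) / b.shape[0])
        return (a.mean(axis=0) - b.mean(axis=0)) / se

    observed = np.abs(t_stat(x, y))
    exceed = np.zeros(observed.shape[0])
    for _ in range(n_perm):
        idx = rng.permutation(pooled.shape[0])          # shuffle group labels
        max_t = np.abs(t_stat(pooled[idx[:m]], pooled[idx[m:]])).max()
        exceed += max_t >= observed                     # compare with the maximum
    # Add-one correction keeps the estimated P-values strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

The adjusted P-value tells how unusual each observed statistic is relative to the largest statistic expected by chance, but, as the text notes, it carries no information about the size of the difference.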
We sorted the 48 detected markers by their minimum potential changes (the data can be found in the Supplementary Material at http://research2.dfci.harvard.edu/dfci/MS_spike-in_data/) and compared the top five significant biomarkers detected by our method with the significant peaks reported in the Clinical Proteomics Databank and those from Baggerly et al. We report their observed mean differences (cancer minus normal) and 95% confidence bands in Table 2.
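Sorting markers by minimum potential change can be sketched as below. The definition used here is an assumption, since the text does not spell it out: the minimum potential change is taken to be the distance from zero of the band edge nearest to zero, and it is zero when the band covers zero. The function names are illustrative.

```python
import numpy as np

def minimum_potential_change(lower, upper):
    """Assumed definition: smallest absolute difference compatible with the band."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    # Band entirely above zero -> lower edge; entirely below -> |upper edge|.
    return np.where(lower > 0, lower, np.where(upper < 0, -upper, 0.0))

def rank_by_mpc(mz, lower, upper):
    """Return (m/z, MPC) pairs for significant markers, largest MPC first."""
    mz = np.asarray(mz, dtype=float)
    mpc = minimum_potential_change(lower, upper)
    order = np.argsort(-mpc, kind="stable")
    return [(float(mz[i]), float(mpc[i])) for i in order if mpc[i] > 0]
```

Ranking by this quantity favours markers whose difference is large even in the least favourable case allowed by the band, rather than markers that are merely highly significant.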
Figure 3 shows the minimum potential change (MPC) curves (left y-axis, black solid curves) and P-value curves (right y-axis, gray curves) for the m/z values that were determined to be significant in these two papers. Although there was no direct match between the peaks identified by Petricoin et al. and those found by the confidence band (CB) method, Petricoin's peaks at 2761, 3498 and 6632 were within 1% m/z of CB-detected peaks that have significant P-values but relatively small MPC (Fig. 3). This may be due to incorrect peak calling. Only two m/z values, 3200 and 8033, detected by Baggerly et al. (2004) were found to be significant by our method; however, the magnitude of the difference between the groups at these two m/z values was again low compared with the top markers identified using the CB method.
3.3 Application to the spike-in study
Plasma samples from 91 prostate cancer patients were divided into seven age-matched groups of thirteen samples each. The groups were labeled A–G. Groups B–F were spiked with five proteins at 1X, 2X, 5X and 10X concentrations in a Latin square design (Table 3).
The minimal concentration of each of the spiked proteins allowing for a detectable peak in plasma was previously determined (data not shown). The minimal concentration for each protein was labeled as 1X and was found to be 1 fmol/µl for cytochrome c (from bovine heart, Sigma, St. Louis, MO), 10 fmol/µl for ubiquitin (from bovine red blood cells, Sigma), lysozyme (from chicken egg white, Sigma) and myoglobin (from horse heart, Sigma), and 100 fmol/µl for trypsinogen (from bovine pancreas, Sigma). The volume of spiked-in proteins was fixed at 10% of the plasma volume. Group A was not spiked with protein; however, an equal volume of diluent was added. Group G contained all five proteins at maximal (10X) concentrations.
Following the addition of proteins, 20 µl of each plasma sample was diluted with 30 µl 9 M urea and incubated at 4 °C for 30 min in order to denature proteins. The samples were further diluted with 150 µl 1 M urea and subsequently stored at –80 °C until analyzed by SELDI-TOF MS.
Using a Biomek 2000 (Beckman Coulter, Fullerton, CA, USA), CM10 ProteinChip Arrays (Ciphergen Biosystems, Fremont, CA, USA) were washed two times with 150 µl CM Low Stringency buffer (Ciphergen) with shaking for 5 min at room temperature. Following the wash step, 90 µl of buffer was aliquoted onto each spot of the array and 10 µl of sample was then added. The arrays were shaken for 30 min at room temperature to allow for protein binding to the surface chemistry. Subsequently, the diluted samples were removed and the arrays were washed three times with 150 µl buffer, with shaking for 5 min per wash at room temperature, and rinsed twice with 200 µl water. The arrays were air dried and 1 µl of sinapinic acid (Ciphergen) was added twice to the arrays. The samples were analyzed on a PBSIIc SELDI-TOF mass spectrometer (Ciphergen) per manufacturer's instructions at a laser setting of 190, detector setting of seven and a digitizer rate of 1000.
The data were baseline subtracted and normalized by total-ion current using Ciphergen Express (Ciphergen). From the five purity spectra, each containing a single spiked-in protein (http://research2.dfci.harvard.edu/dfci/MS_spike-in_data/), we found that SELDI-TOF MS experiments often produce secondary peaks, in addition to the expected peaks, generated by multiple-charged species or matrix adducts for each of the five proteins. Moreover, we observed several peaks generated from contaminants within the pure spiked-in proteins.
We also observed that the intensities for group G, which contained all five proteins at maximal concentrations, were generally lower than those for the other groups with maximal concentration, most likely due to ion suppression. (The mean intensity curves of groups A–G in the m/z regions of each spiked-in protein with ±0.5% precision can be found in the Supplementary Material.)
To assess the accuracy of detecting known proteins using the confidence band method, we performed all 21 possible pairwise comparisons among groups A–G without knowledge of the five spiked-in proteins. Only the researcher conducting the SELDI-TOF MS knew the protein concentrations in each group, and he did not take part in the analysis. We found 133 peaks to be significant with 95% confidence. Table 4 reports the top ten detected peaks, sorted primarily by the number of comparisons in which the peaks were detected as significant and secondarily by the largest MPC among the results from all comparisons between groups.
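The bookkeeping behind this ranking is simple enough to sketch. The snippet below enumerates the 21 unordered pairs of the seven groups and sorts peaks first by the number of comparisons in which they were significant and then by their largest MPC. The dictionary layout and the function name `rank_peaks` are illustrative assumptions, and any values passed in would be hypothetical rather than taken from Table 4.

```python
from itertools import combinations

groups = "ABCDEFG"
pairs = list(combinations(groups, 2))   # all unordered pairs: 7 * 6 / 2 = 21

def rank_peaks(detections):
    """Order peaks by detection count, then by largest MPC (both descending).

    detections: dict mapping an m/z value to the list of MPC values from the
    comparisons in which that peak was significant (hypothetical layout).
    """
    return sorted(
        detections,
        key=lambda mz: (-len(detections[mz]), -max(detections[mz])),
    )
```

A peak detected in many comparisons ranks above one detected in few, and ties are broken by the size of the largest guaranteed difference.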
Ciphergen's biomarker detection algorithm, Ciphergen Express, found 124 significant markers at a level of 0.05 (data not shown). Their top 10 detected peaks also contained four of the five spiked-in proteins. Curiously, ubiquitin was not among the top ten most significant peaks. This may have been due to the decrease in intensity observed at higher concentrations for ubiquitin (Supplementary Fig.). The resulting lower MPC led to ubiquitin being ranked as the 12th most significant peak in our analysis.
Baggerly et al. (2004) discussed the problems behind calibration, background subtraction and normalization of data. To address these potential problems, we further analyzed our spiked-in dataset to examine the effects of background subtraction and normalization. The total-ion current method of normalization assumes that the total amount of protein in each sample may vary due to sample handling or instrument sensitivity. In general, with the exception of known disease states, the protein concentration of blood samples falls within a narrow range; but since the amount of protein in the spike-in study may vary across the samples in different groups, overall intensities may have been over-normalized with the total-ion current method. We analyzed the raw data without background subtraction and normalization and obtained 137 significant markers, which included 94 of the 133 markers detected with 95% confidence bands in the analysis with background subtraction and normalization (the details of the analysis can be found at http://research2.dfci.harvard.edu/dfci/MS_spike-in_data). Preprocessing of the spectra made no difference regarding the detection of the spiked-in proteins. The raw data from our spike-in study are posted at http://research2.dfci.harvard.edu/dfci/MS_spike-in_data/
| 4 APPLICATION TO MICROARRAY DATA (PAIRED SAMPLES) |
|---|
We applied our confidence bands method to a previously published prostate cancer microarray experiment (Affymetrix H95Av2 containing 12 600 probesets) (Singh et al., 2002). Of the 52 prostate tumor and 50 normal samples, 46 were matching pairs. The article ignored this dependency within the same patients and obtained 456 differentially expressed genes at the level of 0.001 using the signal-to-noise method of Golub et al. (1999). We re-analyzed these data with our confidence bands method for paired data and found 71 genes to be significantly different between cancerous and normal tissue at the level of 0.001. Our smaller number of significant genes suggests that the previous analysis likely yielded many false positives due to inadequate multiple-comparison adjustment and neglect of the dependency between matching samples. Figure 4 shows 99% confidence bands of the 71 significant genes sorted by their minimum potential change.
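For paired samples, permuting the two measurements within a subject is equivalent to flipping the sign of that subject's difference. The sketch below uses this fact to build a simultaneous band for the mean paired difference, for example of log expression values, so that the band is for the log fold change. It is a simplified, non-iterative illustration rather than the authors' procedure, and the function name `paired_band` is hypothetical.

```python
import numpy as np

def paired_band(x, y, alpha=0.05, n_perm=1000, seed=0):
    """Simultaneous (1 - alpha) band for the mean of x - y with paired rows.

    x, y: (n, p) arrays; row i of x and row i of y come from the same subject.
    """
    rng = np.random.default_rng(seed)
    d = x - y                                  # within-subject differences
    n = d.shape[0]
    mean_d = d.mean(axis=0)
    se = d.std(axis=0, ddof=1) / np.sqrt(n)

    # Assumption: centered differences are sign-symmetric around zero.
    centered = d - mean_d
    max_stat = np.empty(n_perm)
    min_stat = np.empty(n_perm)
    for b in range(n_perm):
        # One sign per subject, shared by all features, keeps the
        # dependence among features within a subject intact.
        signs = rng.choice([-1.0, 1.0], size=(n, 1))
        flipped = signs * centered
        z = flipped.mean(axis=0) / (flipped.std(axis=0, ddof=1) / np.sqrt(n))
        max_stat[b], min_stat[b] = z.max(), z.min()

    c = np.quantile(max_stat, 1 - alpha / 2)
    d_cut = np.quantile(-min_stat, 1 - alpha / 2)
    return mean_d - c * se, mean_d + d_cut * se
```

Treating the 46 matched pairs as two independent groups would discard the pairing, which is the dependency the text says the earlier analysis ignored.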
| 5 DISCUSSION |
|---|
In this article, the approach we took for finding biomarkers in high-dimensional genomic/proteomic data is quite different from previously reported methods. Rather than determining only qualitatively significant markers (those with small P-values), we measure both the qualitative and quantitative importance of potential markers by exploiting simultaneous confidence bands that reflect the true random fluctuation in the difference between groups of samples. Our algorithm bypasses the difficulties associated with estimating high-dimensional parameters to obtain the interval estimates of the difference between groups by reducing the problem to estimating two parameters, the cut-off points $c_\alpha$ and $d_\alpha$.
Although our method is applicable to any high-dimensional data, we extensively studied SELDI-TOF MS data as our major application. While SELDI-TOF MS technology has been acclaimed as one of the most powerful new frontiers among disease diagnosis technologies utilizing blood-borne proteins, inconsistent results from previous publications indicated the necessity of new, refined approaches to derive optimal biomarkers that can be both interpreted and accepted by the scientific community.
The claim by Petricoin et al. (2002) that they can predict the presence of ovarian cancer using SELDI-TOF MS data with 100% accuracy has attracted great attention as well as controversy. Zhu et al. (2003) also studied the early dataset used by Petricoin et al. (2002) and published a non-overlapping set of markers. Baggerly et al. (2004) also pointed out that the markers detected by Petricoin et al. (2002) were not significant in terms of t-statistics, leading to suspicion of their importance as discriminating biomarkers between cancer and normal samples. Sorace and Zhan (2003) found markers using the latest dataset from Clinical Proteomics (the same set used for our analysis); however, the majority of their markers were <500 m/z and were likely to be artifacts or experimental bias, as these are within the noise signal of the energy absorbing matrix.
Thus, our confidence band method employs an alternative approach to detect potential biomarkers throughout the m/z region. Most of the previous papers (Morris et al., 2005; Yasui et al., 2003) have developed various peak detection methods and classification rules. Conventional approaches for MS data analysis usually take these two steps.
However, our method processes the data without an ad hoc peak detection step and directly identifies statistically meaningful markers that discriminate between two groups based on $(1-\alpha)$ simultaneous confidence bands. Our method is theoretically well justified in controlling overall false positives. It is simple and efficient, does not involve a complex classification rule, which often yields markers that are not biologically relevant, and provides a powerful visualization tool for potential biomarkers with both qualitative and quantitative importance. In a spike-in study, our method was successful in detecting robust potential biomarkers that are biologically as well as statistically relevant, regardless of background subtraction and normalization. Moreover, our method can also help to eliminate experimental artifacts or biological variations arising from individual patient effects. According to Diamandis (2004), SELDI-TOF technology is not capable of detecting any serum component at concentrations of <1 µg/ml, and statistically significant markers with such small differences are often detected because of artifacts related to the nature of the clinical samples used or the MS instruments. Our confidence bands can easily reveal that such differences are quantitatively small even though these artifacts may be found to be qualitatively significant markers. Therefore, researchers can remove potential artifacts from the list of significantly detected markers by sorting them by the values of the corresponding MPCs or upper confidence bounds and giving high priority to markers with larger values. Moreover, the use of mean spectra in confidence bands effectively normalizes possible biological variations from individual sample effects. The benefits of using average spectra over individual spectra were also well studied in Morris et al. (2005). Based on our findings, we claim that our method is useful for detecting biologically relevant biomarkers while being less prone to false positives.
Our confidence bands were originally designed for two-class problems. In the case of multi-class (>2 classes) problems, our method can be used via various combinations of pair-wise comparisons depending on the purpose of the analyses. For example, suppose that there are three classes of samples: ovarian, breast cancer and normal samples, and suppose that the interests of a researcher are in identifying proteins that distinguish ovarian cancer samples or breast cancer samples from normal samples. One can use the normal samples as a baseline to compare with each of the others with a multiple comparison adjustment.
While the spike-in experiment did not generate a clean dataset that contained only the spiked-in proteins as significant peaks in the spectrum, it was a valuable experiment for understanding the nature of SELDI-TOF MS and for evaluating the performance of our CB method. The five proteins were initially chosen because unfractionated human plasma does not have peaks at the corresponding molecular weights on the Ciphergen cationic chip surface and all have isoelectric points at least 2 units above 4, the pH at which the low stringency wash was performed. While only the five spiked-in proteins were expected, the CB method detected 133 significant peaks. Most of the peaks were artifacts typical of the SELDI-TOF MS method, such as EAM adducts, multiple-charged species and ion suppression (see Table 4 for examples of the former two). The albumin peak at approximately 66 500 Da was found to be significantly different between the groups (data not shown). This was due to ion suppression caused by an increase in cytochrome c levels, as evidenced by groups E–G having the lowest levels of albumin and the highest levels of cytochrome c. Furthermore, some of the additional, unexpected peaks were found to be contaminants in the stocks of pure proteins. The stock of trypsinogen contained several peaks that could not be attributed to any of the typical artifacts, and SDS-PAGE demonstrated that these excess peaks were in fact contaminating proteins. One of these impurities, at an m/z of 15 203, was found to be significantly different; in fact, it was one of the top ten peaks detected by both the CB and Ciphergen analysis methods. A second peak at 15 380 m/z, also identified as a top ten discriminating marker, may have been the EAM adduct of the 15 203 m/z peak, as it was within 0.2% of the expected mass-to-charge ratio. Curiously, these peaks did not have a maximal MPC in comparison to group A as expected, but rather with group C. Group C was the experimental group that did not contain trypsinogen but did contain all of the other proteins. The contaminants were detected at the highest levels in groups B and F, containing 10X and 5X trypsinogen, respectively. It should be noted that two of the proteins did not exhibit a linear relationship between intensity and concentration when placed into plasma (Supplementary Fig. f). Ubiquitin demonstrated a biphasic response, with intensity increasing to a maximum at 2X concentration and then decreasing at higher concentrations. Trypsinogen showed no difference in intensity between 1X and 5X concentration, with a marked increase in intensity at 10X.
Considering that different spiked-in proteins yielded different levels of increment in intensity, it would be worthwhile to consider constructing confidence bands for log-fold changes by taking the log-transformation of the spectral intensities. The described experiment also demonstrates some of the potential hazards of conducting spike-in studies.
The simultaneous confidence band method is a well-established inference scheme in the statistical literature; however, it has not been exploited in bioinformatics thus far. This type of interval estimate for the contrast between groups can be very attractive for genomic/proteomic datasets, since it allows investigators to visualize the potential differences in mean intensities between groups while guarding against false positives due to multiple comparisons. It also yields biologically meaningful peaks rather than arbitrary, user-defined peaks for SELDI-TOF MS data.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank the reviewers for their thorough reviews and helpful comments. The work is supported by the Claudia Adams Barr Program in Cancer Research (Y.P., C.L.) and partially supported by the US NIH grants for I2B2 and AIDS (L.J.W.).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Received on March 27, 2006; revised on March 12, 2007; accepted on March 28, 2007
| REFERENCES |
|---|
|
|
|---|
Baggerly KA, et al. Reproducibility of SELDI-TOF protein patterns in serum: comparing data sets from different experiments. Bioinformatics (2004a) 20: 777–785.
Baggerly KA, et al. High-resolution serum proteomic patterns for ovarian cancer detection. Endocr.-Relat. Cancer (2004b) 11: 583–584.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B, Methodological (1995) 57: 289–300.
Diamandis EP. Mass spectrometry as a diagnostic and a cancer biomarker discovery tool: opportunities and potential limitations. Mol. Cell. Proteomics (2004) 3: 367–378.
Dudoit S, et al. Multiple hypothesis testing in microarray experiments. Stat. Sci. (2003) 18: 71–103.
Efron B, Tibshirani R. Empirical Bayes methods and false discovery rates for microarrays. Genet. Epidemiol. (2002) 70–86.
Fung ET, et al. Protein biochips for differential profiling. Curr. Opin. Biotechnol. (2001) 12: 65–69.
Golub TR, et al. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science (1999) 286: 531–537.
Hochberg Y, Tamhane AC. Multiple Comparison Procedures. (1987) New York: John Wiley & Sons, Inc.
Lehmann EL. Testing Statistical Hypotheses. (1986) New York: John Wiley & Sons, Inc.
Morris JS, et al. Feature extraction and quantification for mass spectrometry in biomedical applications using the mean spectrum. Bioinformatics (2005) 21: 1764–1775.
Petricoin EF, et al. Use of proteomic patterns in serum to identify ovarian cancer. Lancet (2002) 359: 572–577.
Pollard KS, Van der Laan MJ. Resampling-based Multiple Testing: Asymptotic Control of Type I Error and Applications to Gene Expression Data. Division of Biostatistics Working Paper 121. Berkeley, CA: University of California Berkeley (2003) Available: http://www.bepress.com/ucbiostat/paper121.
Singh D, et al. Molecular determinants of prostate cancer behavior. Cancer Cell (2002) 1: 203–209.
Sorace J, Zhan M. A data review and re-assessment of ovarian cancer serum proteomic profiling. BMC Bioinformatics (2003) 4: 24. http://www.biomedcentral.com/1471-2105/4/24
Tusher VG, et al. Significance analysis of microarrays applied to the ionizing radiation response. Proc. Natl Acad. Sci. USA (2001) 98: 5116–5121.
Vlahou A, et al. Development of a novel proteomic approach for the detection of transitional carcinoma of the bladder in urine. Am. J. Pathol. (2001) 158: 1491–1502.
Westfall PH, Young SS. Resampling-based Multiple Testing: Examples and Methods for P-value Adjustment. (1993) New York: John Wiley & Sons, Inc.
Wu B, et al. Comparison of statistical methods for classification of ovarian cancer using a proteomics dataset. Bioinformatics (2003) 19: 1636–1643.
Yasui Y, et al. A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics (2003) 4: 449–463.
Zhu W, et al. Detection of cancer-specific markers amid massive mass spectral data. Proc. Natl Acad. Sci. USA (2003) 100: 14666–14671.