Bioinformatics Advance Access originally published online on March 27, 2006
Bioinformatics 2006 22(12):1515-1523; doi:10.1093/bioinformatics/btl106
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Finding regions of significance in SELDI measurements for identifying protein biomarkers
1 Department of Medical Epidemiology and Biostatistics, Karolinska Institutet Stockholm, Sweden
2 Cancer Centrum Karolinska, Karolinska Institutet Stockholm, Sweden
3 Center for Molecular Epidemiology, Yong Loo Lin School of Medicine, National University of Singapore and Genome Institute of Singapore Singapore
4 Clinical Proteomics, Karolinska Biomics Center, Karolinska University Hospital Stockholm, Sweden
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: There is a well-recognized potential of protein expression profiling using the surface-enhanced laser desorption and ionization technology for discovering biomarkers that can be applied in clinical diagnosis, prognosis and therapy prediction. The pre-processing of the raw data, however, is still problematic.
Methods: We focus on the peak detection step, where the standard method is marked by poor specificity. Currently, scientists need to inspect individual spectra visually and laboriously in order to verify that spectral peaks identified by the standard method are real. Motivated by this multi-spectral process, we investigate an analytical approachcalled RS for regions of significancethat reduces the data to a single spectrum of F-statistics capturing significant variability between spectra. To account for multiple testing, we use a false discovery rate criterion for identifying potentially interesting proteins.
Results: We show that RS has better operating characteristics than several existing methods and demonstrate routine applications on a number of large datasets.
Availability: RS is implemented in an R package called ProSpect which is available at http://www.meb.ki.se/~yudpaw
Contact: yudi.pawitan{at}ki.se
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
The growth of proteomics has depended on simultaneous measurements of a wide range of proteins (Tang et al., 2004). One important technology platform is the surface enhanced laser desorption and ionization (SELDI) technology, developed by Ciphergen Biosystems of Fremont, CA (Hutchens and Yip, 1993). SELDI protein profiles have been used to distinguish different pathologic states (e.g. normal versus cancer, Kozak et al., 2003, Zhang et al., 2004) and to evaluate different experimental conditions (e.g. treated versus untreated, Xiao et al., 2004; Boot et al., 2004). Although existing gene-expression microarray technologies have similar purposes, proteomics has the potential to provide additional biological understanding and new biomarkers. As proteins are closer to the biological processes under study, they are more natural targets than RNA for biochemical analysis and the development of new therapeutics.
There are, however, still problems with the early steps in the analysis of SELDI data, particularly in the detection of peaks in the spectra. The current methodology applies a peak detection algorithm to each spectrum separately, trying to identify peaks among the background noise (Fung and Enderwick, 2002). This is difficult, especially when the noise is of high frequency, and the peak detection step is known to have low specificity (Coombes et al., 2003; Jarman et al., 2003). To improve specificity, scientists visually inspect multiple spectra in parallel for differences in intensities, which is time consuming and error prone. Hence there is urgent need for an algorithm that improves automatization and specificity.
It seems advantageous to mimic the multi-spectral visual validation by analyzing all the spectra simultaneously and thereby avoiding or at least minimizing the need for visual inspection. Pooling the spectra is also likely to improve the characterization of the background noise. We have investigated such a multi-spectral approach for identifying spectral regions of interest.
Our basic idea is to use the one-way analysis of variance (ANOVA) to reduce a collection of spectra of intensities to a single spectrum of modified F-statistics and false discovery rate (FDR). Instead of trying to identify peaks, we aim to identify regions with significant variation in mean intensity between spectra. Our method, which will be called RS to capture the idea of regions of significance, is therefore a signal detection rather than a peak detection algorithm. Additional analysis is required in order to relate the variation between spectra to the clinical or experimental outcome of interest, but this issue will not be addressed here.
To summarize our contributions, using a real dataset where the peaks are manually annotated by a proteomics specialist, we show that RS has better operating characteristics (OC) than (1) the standard method used by Ciphergen and the peak-detection methods of (2) Coombes et al. (2003), (3) Yasui et al. (2003) and (4) Coombes et al. (2005). For a spike-in dataset, RS has clearly better specificity than the standard method. Finally we demonstrate the application of RS for a number of large datasets from a drug sensitivity trial and from a public source.
| 2 METHODS |
|---|
|
|
|---|
The SELDI technology selects subgroups of peptides and proteins from a sample through their affinity for chromatographic surfaces, where different classes of peptides and proteins bind to chips with different coatings, e.g. cationic, anionic or hydrophobic. A SELDI experiment on one spot of a chip results in
50 000150 000 pairs of (TOF, intensity) data. Standard preprocessing of the data is described in Fung and Enderwick (2002) and in the Supplementary Material. Assuming that a biological sample contains k proteins of distinct molecular weight, ideally its SELDI spectrum should show only k peaks. In reality, however, the spectra are very noisy. Figure 1 shows four sample spectra in the 310 kDa region, with a noisy high frequency pattern evident throughout the mass range. The numbered peaks have been identified by the Ciphergen software, using the default setting; obviously, this detects numerous peaks, but to determine which peaks are real is a challenge. Typically, a scientist will visually inspect the spectra and try to ascertain that certain peaks are replicated over several spectra. This approach requires highly trained experts while offering only limited reproducibility, and given the large number of spectra and peaks in clinical proteomics projects, the amount of work becomes fast impossible.
|
2.1 Existing methods for peak detection
There are currently two general approaches in peak detection: intensity threshold-based methods, like the one implemented in the Ciphergen software, and matching spectral intensities to a reference peak shape. In intensity threshold-based approaches, we first establish an instrument noise level, e.g. using the standard deviation of the intensity values in the absence of peaks. This noise level is then used to define a critical threshold that flags intensities exceeding the threshold as peaks and as non-peaks otherwise. This approach corresponds to a one-sided statistical hypothesis test for each measurement, with a null hypothesis of no peak.
Coombes et al. (2003) proposed to use the first differences of successive intensities to identify local maxima and local minima, and then to use the median of the absolute values of first differences as the noise level; maximum points whose distance to their nearest local minimum is greater than the noise level are marked as potential peaks. Recently, Coombes et al. (2005) estimated the noise from the residuals in a so-called wavelet denoising method that does baseline substraction on each spectrum. Yasui et al. (2003) identified points as peaks when their intensity is the maximum in a neighbourhood containing a prespecified number of points, and used a smoother to estimate the local noise. We will use the standard Ciphergen method, Coombes et al. (2003), Yasui et al. (2003) and Coombes et al. (2005) for comparisons with RS.
Spectral matching approaches compare the intensity values within a window with a reference peak for detection and subsequent characterization. In general, this requires the specification of a predefined reference peak and a suitable distance measure to determine the similarity of a window of intensity values with the reference peak (e.g. Reich, 1987; Danielsson et al., 2002). Recently, Jarman et al. (2003) developed an approach where the reference peak does not need to be specified explicitly. It uses a histogram-based model for spectral intensity and detects peaks by comparing the estimated variance of the observations with the expected variance when no peak is present in a window.
2.2 Detection of significant regions
The one-way ANOVA is used to study the variability of the spectra within a window W with a fixed number mF of observations from each spectrum. The F-statistic is used to identify regions where variability between spectra is significantly greater than the variability within spectra. In one window, the ANOVA model can be written as
![]() | (1) |
![]() | (2) |
2.3 Validating the null distribution using blanks
In order to obtain an informative F-statistic spectrum, we need to understand its behaviour under the null hypothesis. The permutation testing approach might be applicable, but because of the dependence, each spectrum has to be permuted as a whole unit. This means this approach would be useful only for relatively large datasets. For the datasets that we commonly encounter in practice, a more parametric approach is essential. To obtain an appropriate null distribution we need data that are as close as possible to real spectra, but without biological differences between samples. Given the complex and poorly known nature of the spectral data, simulating the null is bound to miss crucial properties of the real data. Therefore we decided to use spectra generated from chips that carry no biological tissue. The intensities on these chips (referred to as blanks in this paper) do not contain any biological signal, but only noise because of the chemical characteristics of the chip.
For this purpose, we obtained four blank spectra from SAX2 chips (strong anion exchange arrays with quaternary amine functionality). They were analyzed in the 310 kDa range that is of interest in low intensity runs. All four blanks exhibit a similar pattern with variability decreasing along the m/z-axis, creating a funnel shape after baseline correction; Figure 2.
|
The blank data are also useful for motivating the need for baseline correction. Here we know the ideal spectra should simply be flat lines at zero value. The first row of Figure 2 shows the raw blank data. Note the different levels (or offsets) of the different samples, showing the arbitrary level of each raw spectrum. The second row shows the baseline-corrected values using Ciphergen method and the corresponding robustly smoothed curves. If we compare the smoothed curves, there are still spurious differences between spectra, indicating inadequate baseline correction. Therefore, we perform an additional correction using a robust smoother, which produces the spectra in the third row.
Figure 3a and b show that the observed F-statistics do not follow the standard F-distribution. This provides evidence that SELDI spectra violate standard one-way ANOVA assumptions. Heteroscedasticity is already evident from the funnel shape in Figure 2 and is mainly responsible for the MSE strongly declining for increasing m/z-values. In an additional simulation study however (data not shown), we have found that the F-statistic is robust to heteroscedasticity if the number of points in the window is small (i.e. mF
20). It appears that within a sufficiently small window, the change in variability is small enough to be disregarded.
|
More importantly, we found that the spectral intensities showed significant correlation across neighbouring m/z-values at small lags. This dependency is not surprising, given that the raw spectra are in fact time series data, and account for most of the difference between observed and theoretical null distribution.
We propose two modifications of the standard F-statistic in order to deal with (1) the fluctuation in MSE and (2) the dependency between yij. As for (1), we observe a strong local fluctuation of the MSE, indicating that neighbouring windows might have hugely different MSEs. The first modification of the F-statistic aims to reduce the effect of these fluctuations by applying a simple smoothing procedure and replacing each MSE with the local mean or median of the neighbouring MSEs. This smoothed MSE is denoted by MSE', providing a better estimate of
2 than MSE, and therefore increasing the sensitivity of the F-statistic. The smoothing parameter is denoted by Mmse, and it is expressed as a percentage of the total data. Note that using the local median provides some robustness, whereas using the local mean allows us to apply the Satterthwaite's approximation (Satterthwaite, 1946) to estimate the degrees of freedom of the MSE', see Section 2.4. The F-statistic corresponding to MSE' is denoted by F'.
The second step of the modification aims to remove the effect of the dependency between the yij. Under the null hypothesis, the standard F-statistic with normal but dependent yij is distributed as cF, with the multiplicative factor c given by
![]() | (3) |
k is the autocorrelation for the k-th-lag (see the Supplement Material). As an illustration, in Figure 3a, the observed F deviates globally (i.e. across the range 310 kDa) from the line-of-identity by a factor of 5.3. A global autoregressive model of order 1 can capture
80% of this factor (see the Supplementary Material), and the rest might be due to varying dependency (i.e.
ks) over the m/z-range. We expect a local correction factor will remove the dependency effect from F'. This is achieved simply by dividing F' by the local mean or median of neighbouring F's (see Supplementary Material). The smoothing parameter is denoted by MF, and is again expressed as a percentage of the total data. The final modified F is denoted by F*. When the local mean is used, we expect F* to follow an F distribution with estimated degrees of freedom (see the first row of Table 1).
|
The same theory does not apply if a local running median is used to obtain MSE' or F*; in this case, the scale can be adjusted so that the median of the modified F matches the median of the expected null distribution. Since there is no simple way to compute the degrees of freedom of MSE' obtained from a local median, we use the fact that F with degrees of freedom (df1, df2) is approximately
for large df2; since MSE' is based on a large number of points, in practice df2 will be in the safe order of several hundreds. In the same manner, various combinations of mean or median smoothing can be used, as listed in Table 1. For example, using median smoothing for both the MSE and F', we obtain
![]() | (4) |
For the blank spectra, we have applied the two modification steps to obtain F* with local median smoothing applied to both MSE and F'. From Figure 3c and d, the observed distribution of F* now matches the expected distribution very closely, which in this case is
2 with three degrees of freedom. Similar results were also observed for the other variants of modified F* listed in Table 1, using mF = 5, 10 and the other tuning parameters Mmse and MF taking values 2.5, 5, 7.5 and 10%. Therefore the proposed corrections are adequate in removing the deviation in the observed distribution caused mainly by local dependence in the spectra.
2.4 Overall algorithm
For clarity, the RS algorithm is summarized in the following four steps.
Step 1: Data preparation
- The spectra are restricted to the m/z region of interest and aligned. The aligned data within the region of interest are given as pairs (xij, yij), where x denotes the m/z values, i = 1, ... , m and j = 1, ... , n.
- Baseline correction using robust smoothing is applied to the spectra individually.
Step 2: Data reduction
- Mutually exclusive windows W are formed, such that
where 1
(5)
w
[m/mF].
- One-way ANOVA analysis is performed on each window and the MSR and MSE are recorded.
Step 3: Smoothing of MSE
- MSE' is calculated by taking the mean or median of closest Mmse% of all MSE, where closeness between two windows is expressed as the difference between the mean m/z values within the windows. F' is calculated as MSR/MSE'. If the local running mean is selected, the degrees of freedom of MSE',
are approximated (Satterthwaite, 1946) by
where dfw is the degrees of freedom of MSEw.
(6)
- Calculate F' = MSR/MSE'.
Step 4: Smoothing of F'
- Local mean or median smoothing is applied to F', with MF as tuning parameter. Details as in Step 3.
- Obtain a local correction factor from F' according to the formula in Table 1.
The algorithm requires three tuning parameters: the number of points for each window (mF), the percentage of MSEs (Mmse) and the percentage of F's (MF) to be considered as neighbours of a point when performing a local mean or median smoothing. From our experience, mF should be small (around 5) to maintain resolution and sensitivity, but the results are very stable over a wide range of values for Mmse and MF.
2.5 False discovery rate
The identification of potentially interesting spectral regions involves the simultaneous testing of thousands of hypotheses. It is widely recognized that, at this level of multiplicity, the standard assessment of significance using p-values will lead to an unacceptable number of false positives. We will instead use direct control of the false discoveries using a FDR criterion. The FDR is defined as the expected proportion of false positives among the rejected hypotheses (Benjamini and Hochberg, 1995), i.e. FDR = E(V/R|R > 0), where V is the number of false positives and R is the number of rejected hypotheses. Therefore, an FDR cut-off has a meaningful interpretation, with better statistical properties than the (1) signal-to-noise ratio cut-off used by Ciphergen's method, Coombes et al. (2003) and Coombes et al. (2005) and (2) the intensity cut-off criterion by Yasui et al. (2003).
To compute the FDR, we first order the p-values obtained from F*, denoted by pr1
pr2
prT, where T is the total number of windows. The FDR (Benjamini and Hochberg, 1995) is given by
![]() | (7) |
2.6 Software
The RS algorithm is implemented in an R package called ProSpect which is available at http://www.meb.ki.se/~yudpaw. The package performs the proposed RS algorithm, displays the results graphically (see e.g. Figures 4 and 7), and allows the export of the detected peaks into the Ciphergen software for subsequent analysis.
|
| 3 RESULTS |
|---|
|
|
|---|
3.1 Validation datasets
We validate RS using two SELDI datasets, after applying the standard pre-processing steps in the Ciphergen ProteinChip software:
- Lung cell line data (H69). Two types of lung cell lines were studied, resistant versus sensitive to chemotherapy, with four spectra for each type of cell lines. A low intensity laser setting was used on SAX2 chips, and the analysis was restricted to the 310 kDa range, with 4295 points per spectrum. We used different settings for the tuning parameters: the window size mF was set to 5 and 10, and Mmse and MF were set to 2.5, 5, 7.5 and 10%. The scientist who was familiar with the data (JL) had manually identified 51 regions in the spectra with biologically plausible peaks. These regions were used as gold standard in determining true and false positives.
- Spike-in data. Bovine insulin at approximately 5733 Da was spiked into human blood serum, at seven levels of dilution: one control with only serum and six standard spike-ins with 10, 6, 4, 2, 1 and 0.4 µl of standard solution in serum. Each dilution was performed in independent duplicates and applied to WCX2 chips (weak cation exchange arrays with carboxylate functionality). The chips were scanned at low laser intensity, resulting in 14 spectra. The analysis was restricted to the 56.5 kDa range with 3765 points per spectrum. The tuning parameters were varied as in the lung cell line data: mF = 5, 10, and Mmse and MF taking values 2.5, 5, 7.5 and 10%.
3.2 Application datasets
The two datasets used in the validation study above are small compared with the usual SELDI experiments. Therefore we used larger datasets to illustrate the routine application of RS to more realistic clinical studies. The first collection of datasets came from a drug sensitivity study with eight patients sensitive to a drug, and seven patients resistant to it. Independent duplicates were prepared for each patient, yielding 30 samples applied to each of four different chip surfaces:
- CM10: weak cation exchange array with carboxylate functionality, with a hydrophobic barrier coating,
- H50: reversed phase or hydrophobic interaction chromatography array with a hydrophobic barrier coating,
- IMAC3-Cu: immobilized metal affinity capture array with a nitriloacetic acid (NTA) surface,
- SAX2: strong anion exchange array with quaternary amine functionality.
The second dataset came from an ovarian cancer study conducted by the Clinical Proteomics Program at the US National Cancer Institute, publicly available from http://home.ccr.cancer.gov/ncifdaproteomics/ppatterns.asp. The study population consists of 162 ovarian cancer cases and 91 controls, giving a total of 253 samples. No duplication was performed and the samples were applied onto the WCX2 chip surface (i.e. a weak cation exchange array with carboxylate functionality).
3.3 Lung cell line data
Figure 4 shows the eight spectra together with the FDR from the region 3.54 kDa. The F* was computed using mF = 5, Mmse = 2.5% and MF = 2.5% with local running median for smoothing both MSE and F'. In this spectral region, eight windows were found to be significant with FDR <5%, six of which correspond to three clear peaks.
To compare RS with the standard method we changed the following options in Ciphergen's Biomarker Wizard from their default values: (1) first pass set to 1.14 and (2) Min Peak Threshold set to 10%. These settings were chosen to match the number of clusters with peaks in the 310 kDa region found by Ciphergen to the number of significant windows found by RS at an FDR cut-off of 5% (n = 102).
Figure 4 compares the findings within the 3.54 kDa region: 12 clusters were identified by the standard method, shown as the bottom row of vertical bars in the bottom of Figure 4; in contrast, the eight windows found by RS are shown as grey bars in the main panel; five windows from RS overlapped with four of the clusters from the standard method. The scientist who generated the data (JL) noted that one of the overlapping regions (at 3.82 kDa with FDR
1%) was likely to be noise, that the three clear peaks were true peaks corresponding to proteins, and that the remaining eight clusters from the standard method are false positives, compared to only two windows from RS. It also appears from Figure 4 that RS captures the width of the peaks better than the standard method.
The same general conclusion was obtained for all variants of F* listed in Table 1 and for all settings of the tuning parameters (mF = 5, 10, and Mmse and MF taking values 2.5, 5, 7.5 and 10%).
For comparisons with RS, we also implemented and compared the methods proposed by Coombes et al. (2003), Yasui et al. (2003) and Coombes et al. (2005). For Coombes et al. (2003) and Yasui et al. (2003), peaks were identified from the robust-corrected intensity and we classify them into clusters by matching the peaks within 0.05% m/z [as suggested by Coombes et al. (2003)]; hence each cluster contains at least 1 spectrum with a peak. For Coombes et al. (2005), we used a publicly available MATLAB implementation, called Cromwell from http://bioinformatics.mdanderson.org/software.html, and analyzed the raw intensity, because in this method the baseline correction and peak detection were inter-connected. To distinguish these 2003 and 2005 papers, we refer them as the Coombes and Cromwell methods respectively.
In order to compare the performance of the methods across the full 310 kDa region, we study their operating characteristics (OC) curves. We modify the traditional OC curves slightly by replacing the false positive rate on the horizontal axis with the FDR (see also Choe et al., 2005). RS has two types of FDR: (1) obtained theoretically by converting the p-values from F* to FDR as given by (7), and (2) obtained empirically as described in the Supplement Material.
For the other methods, only the empirical FDR is available. However, sometimes there is a little complication in computing the empirical FDR, as it is not always obvious how to partition the spectra with reasonable m/z range, especially for non-peak regions. We have used the peak clusters of each method as the denominator for the FDR. For the Cromwell method, there are instances where its peak cluster, for example, has m/z ranging from 3 to 7 kDa. Although it picked up the 51 peaks that lies in the 37 kDa range, it obviously lacks specificity. To overcome this problem, we consider values of the signal-to-noise ratio for the first pass that give clusters with small m/z range that capture specific peaks.
Figure 5a d compare the OC curves of RS to the standard, Coombes, Yasui and Cromwell, respectively, using the empirical FDR for all methods. The scattered points for each method are obtained from different settings, and the OC curve is the loess smooth of the points. For RS, we used mF = 5 but varied Mmse and MF with values 2.5, 5, 7.5 and 10%; the local running median was used to smooth MSE and F'. It is clear that, at fixed FDR, different choices of the smoothing windows lead to similar results. The settings for the other methods are given in the Supplementary Material.
|
The plots show that RS (solid curve, which is a loess smooth of the scattered solid circles) has better OC than the other four methods (dashed curves in Fig. 5ad). At 80% sensitivity, the FDR of the standard, Coombes, Yasui and Cromwell methods are around 25 to 50%, compared with around 8% for RS. In fact these four other methods have similar OC curves, which could be attributed to their similar single-spectrum-based strategy in detecting the peaks.
To assess our distribution theory, Figure 6 shows that the theoretical FDR for RS (solid and dotted curves) matches the empirical FDR closely, providing support for the use of the theoretical FDR. The scattered points are the same as those in Figure 5a.
|
By comparing the OC curves, we also found that the local running median for smoothing MSE and F' had better OC than the local running mean (plots not shown), suggesting that robust smoothing is required. Using a larger window size mF = 10, RS with local running median applied to F' had a better OC curve than the standard method when FDR < 0.1. Generally it makes sense to use a window size that is smaller than an average peak width.
In summary, these results suggest that (1) median smoothing is better than mean smoothing, so this should be the default option, (2) the procedure is not very sensitive to the choice of the smoothing parameters Mmse and MF and (3) using median smoothing, RS performs better than the existing methods.
3.4 Spike-in data
Figure 7 shows the result of the spike-in data analysis. The local running median was used to smooth MSE and F' with mF = 5, Mmse = 2.5% and MF = 2.5%. An FDR cut-off of 5% identified nine significant windows in the 5.56 kDa range, eight of which corresponded to the insulin peak. The other window was near the end of the range at 5.5 kDa, with an FDR value of
3%.
|
The bottom row of Figure 7 indicates the peak locations detected by the standard method. We also restricted the analysis from 5 to 6.5 kDa and used the default setting from Ciphergen's Biomarker Wizard, except that first pass was set to 5 and Min Peak Threshold set to 0%; although Min Peak Threshold was 0%, the smallest percentage of the total spectra that contains a detectable peak among the clusters is 36% (i.e. five spectra). That is about two times bigger than the default value of 15%. The standard method also detected the insulin peak, but there were 30 other clusters detected across the whole mass range from 5.5 to 6 kDa. This shows that RS has a much better specificity than the standard method.
In the 55.5 kDa and 66.5 kDa regions, the standard method detected 48 and 19 clusters, respectively. In contrast, RS only detected 2 and 6 (contiguous) windows, respectively. Generally, the same conclusion was reached when the tuning parameters of RS were varied (mF = 5, 10 and Mmse, MF taking values from 2.5, 5, 7.5 and 10%). The local running mean was found to be less specific than the local running median.
3.5 Drug sensitivity and ovarian cancer data
Table 2 summarizes the raw data from the drug sensitivity and ovarian cancer data. After restricting and aligning the data in the 310 kDa region, 17 17517 185 points are left in the analysis for the drug sensitivity data and 4845 points left for the ovarian cancer data. The local running median was used to smooth MSE and F' with mF = 5, Mmse = 2.5% and MF = 2.5%. The FDR cut-off was set to 5%. Table 2 summarizes the total number of windows analysed and the number of windows flagged as significant. The proportion of significant windows ranges from 0.27 to 0.37, where many flagged windows are contiguous, corresponding to wide or overlapping peaks in the spectra. The number of contiguous regions of flagged windows are shown as clusters in Table 2. For these datasets, our software produces summary figures as shown in Figure 4.
|
| 4 DISCUSSION |
|---|
|
|
|---|
The current methods for detecting the relevant regions in SELDI protein expression spectra suffer from poor specificity. A typical solution is to visually inspect and compare spectra at the location of putative peaks. The RS algorithm was motivated by this idea of looking across multiple spectra. To our knowledge the application of the ANOVA-based F test here is novel. One key advantage of RS is that we are able to use an objective selection criterion based on the FDR. This allows for an easy interpretation of cut-off values and accounts for multiple hypothesis testing, which is in contrast to the arbitrary criterion used by other methods.
From our validation studies we have found that the local running median for smoothing the MSE and F' performed better than the other options in Table 1. This indicates that robust smoothing is necessary. The procedure was not sensitive to the choice of smoothing parameter, but for optimal power, the window size mF should be small compared with the average peak width. We have found that RS is a promising alternative to other competing methods. From the analysis of the lung cell line data in Section 3.3, at 80% sensitivity, RS has
8% FDR, compared with 2550% FDR for the other methods. Results from ongoing analyses of several clinical datasets also indicate that RS works more reliably than the standard algorithm.
We note that RS is a signal detection method and not a peak detection method: if a peak has the same intensity across all spectra, then it will not be identified as significant. This however should be more of an advantage than a drawback: if a protein does not vary between individuals in a population, then it has little or no potential to serve as a biomarker. Effectively, RS can function as a filter for common but uninformative proteins.
From protein chemistry and proteomics point of view, the question of peak width and morphology is highly relevant, since the isotopic mass distribution is not the only small deviations seen in proteomics experiment. When testing RS using real sample (lung cancer cell lysate), we could detect a great number of potentially interesting peak shoulders that could very well derive from post-translational modifications or from mutations. This information is biologically very relevant and is not detected very satisfactorily by current peak detection methods that rely only on single-spectrum peak finding.
Finally, we expect that RS approach will also work on other protein mass-spectra platform such as the matrix-assisted laser desorption and ionization (MALDI) data, although some validation studies, particularly on the theoretical behaviour of the F statistics, will be required.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on January 31, 2005; revised on March 7, 2006; accepted on March 18, 2006
| REFERENCES |
|---|
|
|
|---|
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Statist. Soc. B, 57, 289300.
Boot, R., et al. (2004) Marked elevation of the chemokine CCL18/PARC in Gaucher disease: a novel surrogate marker for assessing therapeutic intervention. Blood, 103, 3339
Choe, S., et al. (2005) Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset. Genome Biol, . 6, R16[CrossRef][Medline].
Coombes, K.R., et al. (2003) Quality control and peak finding for proteomics data collected from nipple aspirate fluid by surface-enhanced laser desorption and ionization. Clin. Chem, . 4, 16151623.
Coombes, K.R., et al. (2005) Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Proteomics, 5, 41074117[CrossRef][Web of Science][Medline].
Danielsson, R., et al. (2002) Matched filtering with background suppression for improved quality of base peak chromatograms and mass spectra in liquid chromatography-mass spectrometry. Anal. Chim. Acta, 452, 167184[CrossRef].
Fung, E.T. and Enderwick, C. (2002) ProteinChip clinical proteomics: computational challenges and solutions. Biotechniques, 32, S34S41.
Hutchens, T.W. and Yip, T.T. (1993) New desorption strategies for the mass spectrometric analysis of macromolecules. Rapid Commun. Mass Spectrom, . 7, 576580[CrossRef].
Jarman, K.H., et al. (2003) A new approach to automated peak detection. Chemometr. Intell. Lab. Syst, . 69, 6176[CrossRef].
Kozak, K.R., et al. (2003) Identification of biomarkers for ovarian cancer using strong anion-exchange ProteinChips: potential use in diagnosis and prognosis. Proc. Natl Acad. Sci. U.S.A, . 100, 1234312348
Reich, G. (1987) Recognizing chromatographic peaks with pattern recognition methods part 2: evaluation of different distance measures. Anal. Chim. Acta, 201, 171183[CrossRef].
Satterthwaite, F.E. (1946) An approximate distribution of estimates of variance components. Biometrics, 2, 110114[CrossRef].
Tang, N., et al. (2004) Current developments in SELDI affinity technology. Mass Spectrom. Rev, . 23, 3444[CrossRef][Web of Science][Medline].
Xiao, Z., et al. (2004) Serum proteomic profiles suggest celecoxib-modulated targets and response predictors. Cancer Res, . 64, 29042909
Yasui, Y., et al. (2003) A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics, 4, 449463[Abstract].
Zhang, Z., et al. (2004) Three biomarkers identified from serum proteomic analysis for the detection of early stage ovarian cancer. Cancer Res, . 16, 58825890.
This article has been cited by other articles:
![]() |
Y. Wang, X. Zhou, H. Wang, K. Li, L. Yao, and S. T.C. Wong Reversible jump MCMC approach for peak identification for stroke SELDI mass spectrometry using mixture model Bioinformatics, July 1, 2008; 24(13): i407 - i413. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Skold, T. Ryden, V. Samuelsson, C. Bratt, L. Ekblad, H. Olsson, and B. Baldetorp Regression analysis and modelling of data acquisition for SELDI-TOF mass spectrometry Bioinformatics, June 1, 2007; 23(11): 1401 - 1409. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||












