Skip Navigation


Bioinformatics Advance Access originally published online on March 25, 2007
Bioinformatics 2007 23(11):1401-1409; doi:10.1093/bioinformatics/btm104
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/11/1401    most recent
btm104v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Sköld, M.
Right arrow Articles by Baldetorp, B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sköld, M.
Right arrow Articles by Baldetorp, B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Regression analysis and modelling of data acquisition for SELDI-TOF mass spectrometry

Martin Sköld 1,*, Tobias Rydén 1, Viktoria Samuelsson 2, Charlotte Bratt 3, Lars Ekblad 3, Håkan Olsson 3 and Bo Baldetorp 3

1Centre for Mathematical Sciences, Lund University, Box 118, SE-221 00 Lund, Sweden, 2Oncological Centre, University Hospital, Klinikgatan 22, SE-221 85 Lund, Sweden and 3Department of Oncology, Lund University, Barngatan 2:1, SE-221 85 Lund, Sweden

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RESULTS AND DISCUSSION
 3 CONCLUSIONS
 AUTHORS' CONTRIBUTIONS
 APPENDIX
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Pre-processing of SELDI-TOF mass spectrometry data is currently performed on a largel y ad hoc basis. This makes comparison of results from independent analyses troublesome and does not provide a framework for distinguishing different sources of variation in data.

Results: In this article, we consider the task of pooling a large number of single-shot spectra, a task commonly performed automatically by the instrument software. By viewing the underlying statistical problem as one of heteroscedastic linear regression, we provide a framework for introducing robust methods and for dealing with missing data resulting from a limited span of recordable intensity values provided by the instrument. Our framework provides an interpretation of currently used methods as a maximum-likelihood estimator and allows theoretical derivation of its variance. We observe that this variance depends crucially on the total number of ionic species, which can vary considerably between different pooled spectra. This variation in variance can potentially invalidate the results from naive methods of discrimination/classification and we outline appropriate data transformations. Introducing methods from robust statistics did not improve the standard errors of the pooled samples. Imputing missing values however—using the EM algorithm—had a notable effect on the result; for our data, the pooled height of peaks which were frequently truncated increased by up to 30%.

Contact: martins{at}maths.lth.se

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RESULTS AND DISCUSSION
 3 CONCLUSIONS
 AUTHORS' CONTRIBUTIONS
 APPENDIX
 ACKNOWLEDGEMENTS
 REFERENCES
 
The use of mass spectrometry for profiling serum proteome has shown great potential, but there are still many unresolved problems related to the processing of data and quantification of results. In particular, results have shown poor reproducibility over time and across laboratories. For a review of the technology and previous work, see the recent commentary in Coombes et al., (2005a) and references therein.

Much research has been devoted to post-processing of data, such as clustering and classification of samples, while issues related to the pre-processing steps of normalization and base-line subtraction have received less attention and are commonly done on an ad hoc basis. It is often argued that, in order to increase reproducibility and decrease the amount of false discoveries, it is necessary to form a complete understanding of the statistical properties of a SELDI-TOF experiment and the pre-processing steps. This article aims to take a first step in this direction. We have analysed the building blocks of each spectra, the single-shot spectra obtained from each individual shot from the laser. In doing so, a number of more or less known anomalities in the spectra, including machine artefacts and those resulting from the irregular spatial structure of coated chips are revealed. Analysing single-shots also allows us to empirically verify statistical properties of their pooled versions, in particular we find a close to linear relation between mean-level and variance of the response.

Based on our empirical findings, we construct a full statistical model within the framework of heteroscedastic linear regression. This framework allows explicit expressions for a maximum-likelihood estimator and its variance. Our results indicate that the necessary assumptions behind naive methods for classification/discrimination may fail and we outline appropriate transformations of data. The framework also allows straightforward extension to handle censored data (due to limited recording span of the instrument) and introduce robust methods, both are discussed in detail in this article.

This study is important in the search of reliable methods for measuring biomarkers for early detection of e.g. cancer disease.


    2 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RESULTS AND DISCUSSION
 3 CONCLUSIONS
 AUTHORS' CONTRIBUTIONS
 APPENDIX
 ACKNOWLEDGEMENTS
 REFERENCES
 
A ProteinChip® array consists of 8 or 16 spots, to which sample and matrix are applied. In a typical configuration, each spot is irradiated Formula times at each of Formula positions, resulting in Formula ‘single-shot spectra’ which define our raw data. (Here, ‘s/p’ in Formula stands for ‘shots per position’.) Before peak detection, raw data is typically processed in the following steps.

  1. Pooling: The n single shots are combined, by averaging, into a ‘raw spectra’ (the default Ciphergen setting also performs a local smoothing). This step is performed by the instrument software and the individual single shots are by default not available for inspection.
  2. Base-line subtraction: The ‘base-line’—consisting of chemical noise (e.g. contribution from matrix molecules) and machine artefacts—is removed, usually through a local constrained smoothing algorithm. Sauve and Speed (2004) use a morphological filter, Coombes et al. (2005b) use wavelets and a monotone local minimum, Tan et al. (2006) use a non-Gaussian mixed-model-based robust smoothing, Malyarenko et al. (2005) use time-series methods.
  3. Normalization: The base-line subtracted spectra are normalized, usually through division with total ion current (area under curve), in order to allow pairwise comparisons.

Our interest in this article lies in the first step, where we aim to introduce a statistical framework that allows straightforward extensions in a number of important directions. It should be noted that we are not considering the problem of base-line subtraction, since we do not attempt to distinguish between chemical noise and proteins. Hence, the result of our analysis may still be in need of further pre-processing through steps 2 and 3 above.

In order to provide a complete understanding of the statistical properties of a SELDI-TOF experiment we have—through non-standard machine settings—extracted all 192 single-shot spectra in replicates and under a number of different conditions. We have in particular studied experiments on blank chips without coating, chips coated with matrix only and chips coated with serum proteins and matrix. For a detailed description of our experiment, see the Appendix. Two of these single-shot spectra from the same spot containing serum proteins and matrix are shown in Figure 1.


Figure 1
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Two single shots from Figure 3.

 
As in Malyarenko et al. (2005) we prefer to work on the time-scale rather than the mass/charge (m / z) scale. In a single-shot spectrum, intensity measurements are recorded at a range of time points (Fig. 1). We refer to the time parameter as ‘time-of-flight’ (TOF), and in this article we disregard the actual timescale and measure TOF as an integer corresponding to the sequential ordering of observations within the spectra. Hence, the unit of measurement of TOF is 4 ns, corresponding to the time between recordings of the particular instrument used.

The instrument provides integer intensity measurement in the range 0–255 (due to 8-bit analogue-to-digital conversion). In our experiments, the values varied between a constant offset around 80 and the maximum recordable value of 255, providing roughly 200 unique recordable intensity values (Fig. 2). Some of these values were ‘less favoured’ by the instrument though (Fig. 2).


Figure 2
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Stemplot of the number of recordings of each intensity value out of the total 192x13 470 recorded values from one spot (top frame). The bottom frame shows part of the top frame. The values 87 and 89 are recorded roughly 4 x 105 and 2 x 105 times, respectively, while there are only 76 recordings of 88.

 
A potential problem comes from misalignments of peaks on the timescale between shots. In a SELDI-TOF experiment, the effect of triggering errors is negligible (2 ns according to ProteinChip System Users Guide, 2000, p. D-2). Instead, the major source of misalignment is the varying thickness of material on the spot, see e.g. Malyarenko et al., (2005), Önnerfjord et al. (1999). In our study, peaks were well aligned for single shots within each position (as also observed in Malyarenko et al., 2005), though a few positions were slightly misaligned with the majority. To make a more quantitative statement, we studied a peak around TOF 7655 on a spot coated with matrix and serum. For each position in which this peak was pronounced (10 positions in total; other positions were almost empty in this region) we searched for the TOF in the range 7645–7665 at which the recorded value was maximal. We then performed a two-way ANOVA of these 10 x 12 values, with ‘position’ and ‘single-shot index’ as covariates. In this ANOVA, 73% of the total sum of squares in peak location was explained by differences between positions and only 7% by differences within position. Although this result should not be taken too literally (errors are certainly not normal, for instance), it clearly indicates that peak misalignment is primarily caused by differences between positions. When comparing the alignment across 128 positions (aligning positions using maximal cross-correlation), we found a maximum shift of five TOF between any two positions. Since this value is small in comparison to the observed width of peaks, we choose not to take the influence from misalignment into account in the present study. Note also that since variation in alignment mainly occurs between positions, using data at position-level seems crucial in order to be able to increase resolution of spectra by correcting for misalignment.

Blank chips The blank chips should ideally give a constant spectrum, and deviations from a constant can be attributed to electronic noise from the equipment and variations in lab environment. This noise showed to be relatively low, with a standard deviation around 0.60 and increasing to 1 with a period of roughly 800 TOF (3.2) ms; Fig. 1 in Supplementary Material). This could be an artefact related to the clock similar to that observed by Baggerly et al. (2003), however they reported a period of 4096 with a different instrument. A corresponding change in mean level was not observed. The noise was slightly coloured, with a first-lag auto-correlation of around 0.35 quickly vanishing for higher lags. It should be noted that Malyarenko et al. (2005), who also analysed blank single-shots from a PBS II instrument, seem to record a higher standard deviation than we did (cf. Fig. 2A in their article) possibly due to different instrument settings.

Coated chips The top two frames of Figure 3 show heat maps of the log single shots recorded at two spots coated with serum proteins and matrix. It is clear from the banded structure that we are not recording Formula independent and identically distributed single shots. There is a strong dependence both within each position (group of 12 shots) and within each spot, confirming a large spatial variation in the number of successfully bound/emitted ionic species on the surface of the chip. This variation is a well-known problem with techniques that use the so-called dried-droplet method of preparing chips, see e.g. Önnerfjord et al. (1999). At some positions (e.g. the third position in the left frame, corresponding to shots 25–36), the shots are practically empty and at others (e.g. the fourth position in the left frame, corresponding to shots 37–48) the intensities have been censored/truncated at the maximum recording value of 255 (cf. Fig. 1). The censoring is especially troublesome since it will bias the high peaks downwards and low-abundance peaks upwards in the pooled and normalized spectra. Moreover, based on the default instrument output we cannot tell whether censoring has taken place or not (unless all 192 single shots are censored at some specific TOF). We will later discuss how this could be corrected using the EM algorithm.


Figure 3
View larger version (46K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Heat-maps of log single-shots from two spots coated with serum proteins and matrix (top) and corresponding pooled spectra (bottom).

 
We did not observe the shelf-like increase in baseline after a detector overload event reported in Malyarenko et al. (2005).

In Figure 4, we have identified the area of the heat maps from Figure 3 where the recorded values are larger than 2 SD of a shot on an uncoated spot and hence constitute a ‘significant deviation’ from an empty shot. It is obvious from this plot that parts of the spots contain little information. It is then sensible to ask if the pooling of single shots can be improved by somehow disregarding or down-weighting these shots; this question is the starting point for our regression analysis below.


Figure 4
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Area (black) of the heat-maps in Figure 3 where the recorded value (minus constant baseline) is larger than 2 SD of an uncoated spot.

 
We also studied chips with spots coated with matrix only. Single shots from these spots showed similar characteristics, including the strong within-position dependence, though obviously without the sharp peaks corresponding high-abundance proteins. A closer look at spot-average single shots reveals a sinusoidal pattern around the main matrix-hump, with smoothly varying amplitude and a period of 25 TOF (100 ns) (Fig. 2 in Supplementary Material). A similar pattern is also faintly visible in spectra with serum proteins and matrix. Since shots from matrix-only spots can be assumed to consist of baseline plus noise, this suggests that the common assumption of a smooth (slowly varying) baseline is questionable and hence baseline subtraction an even more difficult problem than previously thought.

2.1 Statistical framework
We denote by Formula the measured intensity in single shot i at TOF t after subtraction of the constant baseline (see Appendix). This is an observation of a random variable Yit.

A natural assumption is that, given a total of Ni charged particles recorded by the detector, the expected value of Yit is proportional to Ni, i.e.


Formula 1

(1)
where st is the unnormalized proportion of particles that, on the average, hit the detector at TOF t (unnormalized since we do not know the unit of measurement of Formula ). Further, based on this assumption, the area under the curve (AUC) of the i th single-shot equals


Formula 2

(2)
where the final step follows by the central limit theorem; the second sum is of order Formula which is small in comparison with the first sum which is of order T. The above two equations now define a linear regression problem with zero intercept through


Formula 3

(3)
with Formula the unknown proportion of interest and Formula representing a zero-mean random variation. The relation (3) might seem naive in the sense that it assumes the same shape of baseline for all positions. However, since we are only concerned with the very first step of the pre-processing, we will allow it to form the basis for our further investigations.

There are good reasons to assume heteroscedasticity, i.e. that Formula depends on the mean level Formula . A plausible assumption is that Yit is proportional to a Poisson random variable, since we would expect the number of charged particles recorded by the detector in a small time interval to follow a Poisson distribution (which in turn could be viewed as an approximation of a multinomial likelihood). This implies that variance is proportional to the mean, i.e. Formula , where c can be interpreted as the absolute contribution of a single charge to the recording (cf. below). Note that this variance model ignores any additive instrument noise unrelated to charges hitting the detector, i.e. it assumes Formula when pt = 0. While this is not entirely correct, our analysis of blank shots suggests that additive noise is small and hence we choose to ignore it. Nevertheless, care should be taken when interpreting results for very small mean levels.

By assuming xi to be roughly constant over each position, an assumption supported by Figure 3, we can verify the relation between mean and variance by plotting log-empirical position means and variances. That is, plotting pairs Formula where


Formula 4

(4)
Ij is the set of indices that constitute the j th position (Formula in our experiments) and Formula the number of single shots at each position (12 in our experiments).

This is done for a few selected TOFs, corresponding to visible ‘peaks’, in Figure 5. Here, the observed linear relation suggests a model of the form Formula . To verify that {alpha} = 1, we have performed an individual linear regression analysis of log-empirical variance against log-empirical mean for each TOF. The resulting estimated slopes (i.e. {alpha}s) and intercepts (i.e. log(c)s) are shown in Figure 6. It seems as if {alpha} is fairly stable across the time axis, and only slightly larger than 1. However, the intercept log(c) clearly depends on TOF, suggesting that the contribution to the recording made by a single particle varies with time. This may be due to multiple charged matrix molecules hitting the detector, an explanation supported by the shape of log(c) as a function of TOF which resembles the spectrum baseline. Another plausible explanation is charge accumulation in the detector as discussed in Malyarenko et al. (2005). In order to check whether the varying intercept is due to the removal of a constant baseline only, we repeated the analysis after first (for each position) removing a time-varying baseline computed using a local median smoother. For this case, neither slopes nor intercepts were found to be constant as functions of TOF.


Figure 5
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Plots of log-empirical within-position means against log-empirical within-position variances for four different TOF’s (TOF 7335Figure 5 5903, TOF 7655Figure 5 6432, TOF 8406Figure 5 7762, TOF 9256Figure 5 9418). Note the effects of the discrete nature of measurements in the lower left ends of the plots. Groups that contain censored values are represented by crosses. Straight lines represent the line x = y. The points in the figure correspond to the measurements at each position of eight spots with the same serum.

 

Figure 6
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Estimates of {alpha} (top) and log(c) (bottom) based on ordinary linear regression on log-mean/log-variance plots. Measurements from eight spots with the same serum were used in the construction of the figure. Segments of high variability, in particular below 2000 and above 9000 TOF, correspond to areas where most spectra are practically empty and hence provide little information on {alpha} and log(c).

 
It is interesting to note that the commonly used approach to pooling of taking the average at each TOF divided by the total mass average is equivalent to maximizing a Poisson likelihood. That is, assuming Formula and solving


Formula 5

(5)
where n denotes the number of single shots to be pooled (192 in our experiment). In the bottom two frames of Figure 3, we have displayed the pooled versions of the heat maps in the same figure. We remark that the use of a Poisson likelihood only makes full sense for non-negative integer-valued observations. However, writing Formula and Formula , where Formula and Ni represent the total number of proteins recorded at TOF t and over the entire single-shot, respectively [(cf. Equation (1))] and c is a constant given by the instrument, and assuming that Formula is an observation from a Poisson distribution with mean Formula , leads to a similar likelihood with the same maximizer Formula (irrespective of c).

In the model formulation Formula etc., it holds that Formula and Formula . We also note that a normal approximation, Formula for some (unknown) constant Formula unrelated to pt leads to the MLE Formula as well. This approximation will later be used to take censoring of values that are truncated at the maximum recordable 255 into account.

2.1.1 Variance of regression estimators
Using the variance assumption Formula and independence of the Yit across single-shot indices i, we can derive the variances of the estimator Formula as


Formula 6

(6)
It is important to note that this variance depends on Formula , the sum of AUCs. Moreover, our initial data analysis suggested that this value can vary greatly from spot to spot. For example, spots A and B displayed in Figure 3 differ by a factor 7 in their sum of AUCs.

In order to empirically verify the relation between pooled spectra variance and Formula , we have also estimated the variance using a non-parametric bootstrap resampling approach. This approach is based on the assumption that single-shot spectra are statistically exchangeable within each position and that positions (groups of single-shots within a position) are exchangeable within a spot (see Appendix for a description of the algorithm). From the resampled spots, we can then empirically estimate the standard deviation of Formula and compare with the theoretical value suggested by (6). We do this by plotting pairs Formula , where Formula denotes the bootstrap standard deviation for a particular configuration of positions and Formula is computed from the same configuration. In Figure 7, we show the result for the previously analysed four TOFs. The observed linear relations confirm the relation between sum of AUC and pooled spectrum variance. However, the variation in slope of the lines suggests that c is not constant over the time scale (c.f. the plot of log(c) in Fig. 6). Note in particular, as predicted by theory, the large difference in variance between spots.


Figure 7
View larger version (27K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. Comparison of theoretical and bootstrap standard deviation of estimator Figure 7 for four different TOFs and 6 spots. Each point correspond to one resampled configuration of positions for a particular spot. Different symbols correspond to different spots.

 
2.1.2 Variance and baseline subtraction
The above section provides explicit formulae for the variance of Formula , the pooled spectra with only a constant baseline subtracted. Since the shape of the baseline tend to vary between spectra, it is usually necessary to remove also a time-variable baseline in order to facilitate pair wise comparison of protein content. Furthermore, baseline-subtracted spectra need to be renormalized. Denoting estimated baseline by Formula , it is thus of interest to also derive the variance of Formula . Here we argue that since most methods used for estimating the baseline Formula are based on averaging over a large range of TOF, the variance of Formula will be negligible in comparison to that of Formula . In conclusion, only the renormalization step will affect variance and


Formula 7

(7)
where pt includes time-varying baseline as in previous sections.

2.2 Pooling and comparing spots
The difference in variance between pooled spectra has immediate consequences when comparing and combining pooled spectra from different spots. In this section, we briefly outline some pitfalls and potential solutions.

Assume that we want to compare spectra from two groups of patients, e.g. healthy and diseased, at some fixed TOF t. We have available estimates Formula for Nh and Nd spots from healthy and diseased individuals, respectively. Let Ih and Id be the corresponding sets of indices, i.e. indices Formula correspond to healthy individuals. Under the null hypothesis that t is not the location of a biomarker, all the estimates share a common mean. In this situation, it is tempting to apply a standard two-sample t-test based on the difference in group means. However, such a test is based on the assumption that the estimates Formula share not only a common mean but also a common variance, which is not true. The difference in variance also suggests that simple averaging, i.e. computing Formula , is not the most efficient way of pooling estimates. Instead, if we aim to minimize squared loss, the weighted average Formula should be used in order to reduce the influence of near-empty spots. Here, Formula is the AUC for spot k. If the estimates are baseline-subtracted and renormalized, Formula for spot-based base-line estimates Formula .

Returning to the problem of testing for an equal mean, let Formula and Formula be the weighted averages of estimates from the healthy and diseased, respectively, Formula and Formula . According to the null model Formula for all k, and hence Formula . Furthermore, Formula has the distribution of Formula times a Formula -distribution with Nh – 1 degrees of freedom (d.f.) and is independent of Formula ; corresponding statements hold true for the diseased group. As a result, the pooled variance Formula , where Formula , is an unbiased estimator of Formula , and under the null hypothesis Formula has a t-distribution with f d.f. In conclusion, the null hypothesis is rejected if Formula , where Formula is the upper Formula quantile of the t distribution with f d.f. and {alpha} is the level of the test.

2.3 Taking censoring into account
Censored values are generally observed around the matrix-hump and at peaks of the most abundant proteins. While the problem could be alleviated by reducing the laser intensity, this seems difficult due to the extreme variations in intensity observed at different TOFs of the spectra. In our experiment, the total number of censored values in each spot (a spot consisting of 192 x 13 470 values) varied from 229 to 25 776 corresponding to Chip II Spots B and A displayed in Figure 3. The top frame of Figure 9 shows the proportion of censored values at each TOF for the latter spot.

By assuming the normal model Formula , the missing/censored values can be imputed and the maximum-likelihood estimate computed by the EM algorithm; for a full description, see the Appendix. In Figure 8, we have plotted pooled spectra from eight different spots over a range where, for some spots, censoring was frequent. For the spots with several censored values, the peak appears both higher and ‘sharper’ when censoring is properly taken into account. The bottom frame of Figure 9 examines the observed relative increase in the pooled spectra when the EM algorithm is applied. We see that the pooled spectrum can be as much as 30% larger when using the EM algorithm to take censoring into account.


Figure 8
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8. A peak after pooling with/without (left)/(right) the EM algorithm compared for 8 different spots.

 

Figure 9
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 9. Top frame: Proportion of censored values at each TOF for Chip II Spot A. Bottom frame: Relative increase in pooled spectrum (Chip II Spot A) when using the EM algorithm.

 
Note that censoring also affects the sum of AUCs. While ignoring censoring will underestimate the height of censored peaks in the spectra it will overestimate uncensored peaks after normalization. In our experiment, the largest observed increase in sum of AUCs taking censoring into account was 2%. Hence, ignoring censoring would overestimate the height of uncensored peaks by roughly the same amount for this spectra.

2.4 Robust regression
Based on the variance model Formula or Formula , one may compute normalized residuals Formula where Formula is an estimate of pt. Given either variance model, these residuals should roughly have the same variance. In a plot of such residuals, it is however typical that a few have magnitudes notably larger than the other ones and may therefore be thought of as outliers. Thus, it is natural to ask if a robust regression procedure can improve the performance of the estimates.

Departing from the normal model Formula , we attempted a ‘robustification’ of its MLE. For {alpha} = 1 the MLE is, as noted above, given by Formula , and in general it is Formula .

Our robust method, estimators denoted Formula , is based on reweighted least squares (see Appendix) and gives low weights to large residuals under the variance model Formula . We tried two fixed values for {alpha}; {alpha} = 1 and {alpha} = 1.2, the latter based on visual inspection of the top frame in Figure 6.

2.4.1 Comparison of regression methods
Here we compare a number of different regression methods based on the proposed model:

M1 Poisson likelihood regression, corresponding to Formula .
M2 Robust regression, {alpha} = 1, corresponding to Formula .
M3 Robust regression, {alpha} = 1.2, corresponding to Formula .
M4 Ordinary least squares linear regression, corresponding to Formula .

As a measure of performance, we estimated the variance Formula of the estimators using a bootstrap algorithm described in the Appendix. The results showed that all methods perform comparably and that there was no improvement in introducing robust methods. For a graphical display, see Figure 3 in the Supplementary Material.


    3 CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RESULTS AND DISCUSSION
 3 CONCLUSIONS
 AUTHORS' CONTRIBUTIONS
 APPENDIX
 ACKNOWLEDGEMENTS
 REFERENCES
 
By studying single-shot spectra, we have observed a large variation in the amount of successfully emitted ionic species in a SELDI-TOF experiment. This variation was found both between positions within the same spot and between spots. In order to accurately take this variation into account, we have constructed a statistical framework for pooling single-shot spectra over a spot. This framework involves heteroscedastic linear regression, and gives a simple expression for the variance of a pooled spectra as a function of the sum of AUCs. We also observe that this value can show considerable variation between spots. Hence, the sum of AUCs is an important covariate to take into account when comparing pooled spectra. Apart from providing statistically motivated estimators, using a full statistical model allow straightforward extensions in a number of directions.

For example, it allows a simple modification of the standard two-sample t-test that would otherwise be incorrect due to the differing variances of pooled spectra. It also allows taking instrument censoring of high peaks into account in a straightforward manner using the EM algorithm. When applying this to peaks corresponding to high-abundance proteins and peaks in the area of high baseline, we obtained estimates of peak heights that were up to 30% larger than what was obtained when censoring was ignored. We attempted at introducing robust methods in order to reduce variance of the estimators. However, in an empirical comparison of the variances of robust and non-robust estimators we found them to be similar. This gives some indication that the observed high variance is not due to a few outlying observations, since such values would be ignored by the robust methods.

Using a linear regression framework also has the advantage of being straightforwardly extended to a multivariate context. Here, extra covariates may involve e.g. instrument settings in order to combine/compare experiments under possibly different settings or done by different labs.

Finally, we would like to remark that this study is only a first step towards a full statistical understanding of a SELDI-TOF experiment. Further work is needed in order to accommodate the fact that relative height of baselines may vary between shots and take into account bias introduced in the baseline-subtraction step. We believe that this analysis need to be done at the resolution of a position on the spot and that much important information is lost in the current default instrument output of spot averages. We also believe that extensions of our framework to other platforms like MALDI-TOF would be straightforward, provided the spot-reading protocol is defined in a similar manner.


    AUTHORS’ CONTRIBUTIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RESULTS AND DISCUSSION
 3 CONCLUSIONS
 AUTHORS' CONTRIBUTIONS
 APPENDIX
 ACKNOWLEDGEMENTS
 REFERENCES
 
T.R., V.S. and M.S. developed the statistical methodology; T.R. and M.S. wrote the main part of the article; L.E. and C.B. performed the SELDI-TOF experiments and took part in the data collection; L.E., C.B. and B.B. contributed to the experimental design; B.B. and H.O. contributed reagents/materials/analysis tools and critically revised the manuscript. All authors read and approved the final manuscript.


    APPENDIX
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RESULTS AND DISCUSSION
 3 CONCLUSIONS
 AUTHORS' CONTRIBUTIONS
 APPENDIX
 ACKNOWLEDGEMENTS
 REFERENCES
 
Description of experiment
The spectra were recorded in a PBS-II from Ciphergen (Freemont, CA) with the following settings: high mass 20 000, optimized from 2000 to 20 000, laser intensity 180, detector sensitivity 7, focus mass 8000 and mass deflector 1500. In all experiments, the protein chip CM10 with a weak cationic coating was used. Every position was warmed by two shots with laser intensity 210 after which twelve single shots were recorded. In this way, 16 positions were analysed, producing 192 spectra from each spot. The blank chips were analysed without pre-treatment. To the matrix chips, 1 µl 100% sinapinic acid in 50% acetonitrile and 0.5% trifluoroacteic acid was added per spot and left to air-dry for 15 min before analysis. The serum proteins from healthy volunteers were solubilized in 9 parts of 9 M urea, 2% CHAPS and 50 mM Tris-HCl, pH 9, and then adsorbed to the protein chip spots in 0.1 M sodium acetate, pH 4.0, (CM low stringency buffer) using a Biomek2000 robot (Beckman Coulter. Inc., Fullerton, CA) equipped with a bioprocessor from Ciphergen according to the instructions from the manufacturer. In this way, proteins from 1 µl serum were applied to each spot. The protein chips were air dried for 20 min and then matrix was added as above.

Removal of constant baseline
Before analysis, the constant baseline was estimated as the most abundant value in the higher region of each spectrum and subtracted from data. This value turned out to be identical for all single shots in our experiment (i.e. 86).

Robust regression algorithm
The robust procedure is based on reweighted least squares, and proceeds as follows for any fixed t:

  1. Choose a starting value Formula , e.g. Formula .
  2. Repeat for Formula until convergence.
(k.1) Compute residuals Formula , compute an estimate Formula of their scale ct as their median absolute deviation (MAD) and compute weights Formula , where W is a weight function and {tau} is a tuning parameter. We used Tukey’s bisquare weight function Formula for Formula and {tau} = 4.685.
(k.2) Compute the reweighted estimate


Formula


We notice that the regression step combines the variance model Formula with the weights. The weight functions assign low weights to observations with large residuals, which hence may suspected to be outliers.

A hierarchical bootstrap analysis
In order to evaluate the performance of the regression methods, we followed a bootstrap resampling approach. It is important to note that direct sampling with replacement of the single-shots within each spot is not appropriate due to the strong dependence between AUCs within each position (cf. Fig. 3). Thus, we adopted a hierarchical approach based on the assumption that shots are statistically exchangeable within each position (group of single-shots) and that positions are exchangeable within each spot. Hence, resampling from a spot containing n single-shots grouped in Formula positions of Formula shots each proceeds as follows (as before Ik denotes the set of indices of spectra at the k th position):

  1. Draw bootstrapped position indices Formula independently from the uniform distribution on the set Formula .
  2. For each Formula and each Formula , draw the bootstrapped single-shot Formula uniformly among the original shots at position Formula , that is, uniformly from the set Formula .
This procedure gives a set Formula of n resampled single-shots, and a corresponding set Formula of resampled totals.

In Fig. 7, each point (there are 100 points for each TOF and spot) in the figure corresponds to one iteration of step (i). The theoretical variance Formula was computed from the drawn position indices, keeping the original single shots within each position. For each set of position indices (points in the figure), we repeated step (ii) 100 times, and these 100 resampled spots were used to empirically estimate Formula .

We repeated the procedure 500 times [i.e. one iteration of each of (i) and (ii)] for each of the two spots analysed and for each bootstrap sample computed the estimates corresponding to M1–M4 over the observed range of TOFs. These values were then used to empirically compute bootstrap variances.

EM algorithm
By assuming a normal likelihood, Formula , the model with censored data can be viewed as a missing data problem. As before, we denote by Formula the observed measurements after constant baseline subtraction. We further let mt be the number of censored values and re-order in such a way that the first mt values, Formula , are censored. Denoting by Formula the corresponding unobserved ‘true intensities’, we have that Zit is distributed as Formula , where Yit is distributed as Formula . The EM algorithm for computing the MLE of pt now proceeds as follows:

  1. Choose starting values Formula and Formula , e.g. Formula .
  2. Repeat for Formula until convergence. Compute


    Formula 8

    (8)
    and


    Formula

    where


    Formula

    and


    Formula


Here, Formula and {varphi}, {Phi} are the standard normal density and distribution functions, respectively. Note that we have ignored the effect of censoring on the xi since this seems negligible.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RESULTS AND DISCUSSION
 3 CONCLUSIONS
 AUTHORS' CONTRIBUTIONS
 APPENDIX
 ACKNOWLEDGEMENTS
 REFERENCES
 
This work has been supported by grants from the Swedish Research Council, The Royal Physiographic Society in Lund, IngaBritt and Arne Lundberg Foundation, Gunnar Nilsson Cancer Foundation, Hospital of Lund Foundations, Mrs. Berta Kamprad Foundation, King Gustav V Jubilee Foundation and Franke and Margaretha Bergqvist Foundation.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on August 31, 2006; revised on February 22, 2007; accepted on March 10, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 RESULTS AND DISCUSSION
 3 CONCLUSIONS
 AUTHORS' CONTRIBUTIONS
 APPENDIX
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Baggerly KA, et al. A comprehensive approach to the analysis of matrix-assisted laser desorption/ionization-time of flight proteomics spectra from serum samples. Proteomics (2003) 3:1667–1672.[CrossRef][Web of Science][Medline]

    Coombes KR, et al. Serum proteomics profiling—a young technology begins to mature. Nat. Biotech (2005a) 23:291–292.[CrossRef][Web of Science][Medline]

    Coombes KR. Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Proteomics (2005b) 5:4107–4117.[CrossRef][Web of Science][Medline]

    Malyarenko DI, et al. Enhancement of sensitivity and resolution of surface-enhanced laser desorption/ionization time-of-flight mass spectrometric records for serum peptides using time-series analysis techniques. Clin. Chem (2005) 51:65–74.[Abstract/Free Full Text]

    Önnerfjord P, et al. Homogeneous sample preparation for automated high throughput analysis with matrix-assisted laser desorption/ionisation time-of-flight mass spectrometryRapid Commun. Mass Spectrom (1999) 13:315–322.

    ProteinChip System Users Guide. Ciphergen BioSystems (2007) CA: Fremont.

    Sauve AC, Speed TP. Normalization, baseline correction and alignment of high-throughput mass spectrometry data. (2004) Paper presented at Workshop on Genomic Signal Processing and Statistics (GENSIPS), Baltimore, MD, May 26–27, 2004.

    Tan C, et al. Finding regions of significance in SELDI measurements for identifying protein biomarkers. Bioinformatics (2006) 22:1515–1523.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/11/1401    most recent
btm104v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Sköld, M.
Right arrow Articles by Baldetorp, B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sköld, M.
Right arrow Articles by Baldetorp, B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?