

Bioinformatics Advance Access originally published online on May 6, 2005
Bioinformatics 2005 21(14):3066-3073; doi:10.1093/bioinformatics/bti482

Published by Oxford University Press 2005

Algorithms for alignment of mass spectrometry proteomic data

Neal Jeffries

National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA


    Abstract
 

Motivation: The analysis of biological samples with high-throughput mass spectrometers has increased greatly in recent years. As larger datasets are processed, it is important that the spectra are aligned to ensure that the same protein intensities are correctly identified in each sample. Without such an alignment procedure it is possible to make errors in identifying the signals from peptides with similar molecular weight. Two algorithms are provided that can improve the alignment among samples. One algorithm is designed to work with SELDI data produced from a Ciphergen instrument, and the other can be used with data in a more general format.

Results: The two algorithms were applied to samples drawn from a common pool of reference serum. The results indicate substantial improvement in consistently identifying peptide signals in different samples.

Availability: The two algorithms are programmed using the R language and are available from http://krisa.ninds.nih.gov/alignment/

Contact: neal.jeffries{at}nih.gov


    1 INTRODUCTION
 
Recent advances in mass spectrometry proteomic analysis have generated considerable excitement and raised the prospect of using such techniques for widespread and high-throughput diagnostic purposes (Petricoin and Liotta, 2003). SELDI-TOF (surface enhanced laser desorption/ionization time-of-flight) and MALDI-TOF (matrix assisted laser desorption and ionization time-of-flight) mass spectrometers essentially analyze a biological sample to simultaneously ascertain the relative abundance of many protein/peptide sequences. The results are often displayed as a graph showing the relative abundance associated with protein mass/charge ratios over a particular Dalton range (Fig. 1).



 
Fig. 1 Example of a SELDI-TOF proteomic spectrum from serum.

 
As the technology has become more amenable to high-throughput analysis, investigators have examined spectra with the purpose of identifying differentially expressed proteins in samples of diseased and healthy individuals. Such biomarkers might then be used as a basis for diagnostic evaluation. Impressive results in categorizing unlabeled spectra as either healthy or diseased (Petricoin et al., 2002; Adam et al., 2002; Li et al., 2002) have spurred expansion in SELDI-TOF and MALDI-TOF mass spectrometry analysis. With increased use of these technologies has come increased scrutiny of associated data. A number of preprocessing steps have been recognized as necessary to control factors that can obscure true differences between disease classes. Normalization, baseline adjustment and peak selection are among the most common adjustments employed. In this work the focus is on spectral alignment, which is not performed as often because (1) this problem is not always present, (2) few algorithms are available to make the adjustment and (3) investigators may not recognize the need for adjustment. However, as mass spectrometry proteomics methods become more widespread and larger datasets become more common, such adjustments will probably be necessary as alignment problems can arise when different machines are used to generate spectra or spectra are generated over a long time period, e.g. for a large study. Recent work (Semmes et al., 2005) utilizing spectrometers located in different medical centers cites problems with alignment/calibration as important barriers that must be overcome to ensure that data from different centers are compatible. Baggerly et al. (2004) also identify alignment problems as a significant hindrance in achieving reproducibility from samples collected within the same lab.

In this paper two straightforward algorithms for alignment adjustments to SELDI-TOF data are presented. These algorithms may be useful in standardizing measurements taken from a single instrument over a long period of time, for measurements taken from many separate instruments and in any situation when the locations of corresponding peaks across spectra are not consistent. Sauve and Speed (2004) discuss alignment methods that address proteomic data yet provide few details and no software for their implementation.

1.1 Example of the problem
As part of a larger study examining proteomic spectra from healthy individuals and those with multiple sclerosis, reference samples from a large pool of serum were included as part of a quality control procedure. As patient and control samples were processed, a few spectra were consistently drawn from this common, fixed reference pool and analyzed to alert investigators to deviations related to sample processing. Ideally, all the spectra from the reference samples should look very similar. Samples were processed on six separate days using identical calibration procedures, personnel, equipment and sample handling techniques. Samples from the first four days were processed within a single week, whereas samples for the last two days were processed ~2 and 3 months later. The data were obtained from a PBS IIc instrument (Ciphergen Biosystems, Inc.) using version 3.2.1 of the Ciphergen ProteinChip software. Further details are available from http://krisa.ninds.nih.gov/alignment/.

The 44 reference spectra (eight for the first four days and six for the other two) are shown between 5280 and 5380 Da in Figure 2. The graphs indicate that the data produced on the third, fifth and sixth days are not well aligned with the other spectra. Though the data are not shown, this heterogeneity is largely prevalent throughout the m/z range of 2000–20 000 Da. These impressions extended to the patient and control samples as well.



 
Fig. 2 Ciphergen data from reference serum requiring alignment.

 
The percentage deviation in corresponding mass values near 5300 Da between the third day (peaks located near 5318 Da) and an average based upon the first, second and fourth days (peaks located near 5302 Da) is (5318 − 5302)/5302 or ~0.3%. For peaks of ~9000 Da this discrepancy declines to ~0.2%. For the fifth day the degree of deviation from the average of the first, second and fourth days' data is ~0.15% near 5300 Da and 0.2% near 9000 Da. For the sixth day the degree of deviation is much greater, ~1.1% near 5300 Da and 0.5% near 9000 Da. In this case the spectrometer's laser failed and required replacement between the fifth and sixth days. This day's data are retained (1) for illustrative purposes and (2) because these types of complications do arise in practice and become increasingly unavoidable for large projects that may be conducted over a long time period or involve multiple spectrometers. The data also illustrate the difficulty in processing samples over long time periods. In this study more patient and control samples became available at a later date, and hence more quality control samples were analyzed. In general it is understood that investigators should try to minimize the calendar time involved in processing samples to avoid problems such as machine drift, but problems like this can still occur.

Most (Baggerly et al., 2003; Adam et al., 2002; Li et al., 2002), but not all, investigators (Petricoin et al., 2002) use some type of peak-picking algorithm to choose m/z values that correspond to local peaks in the spectra. Such an approach focuses on a smaller number of m/z values for analysis of differential expression or diagnostic modeling.

In these cases an alignment window approach to promote compatibility across spectra is common. The idea is that given a window centered at 5300 Da with a width of 0.3%, the peak intensity at 5300 Da for a given spectrum will be the largest intensity observed in the range [0.997 · 5300, 1.003 · 5300] = [5284.1,5315.9]. The choice of where to center a window is often determined by the location of peaks in an averaged spectrum (average over all spectra in the sample). Common window widths for SELDI data are 0.2% (Yasui et al., 2003; Adam et al., 2002) and 0.3% (Li et al., 2002).
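The window rule described above can be sketched in a few lines of Python. This is a minimal illustration rather than code from any of the cited packages; the function name `window_peak_intensity` and the toy spectrum are hypothetical.

```python
def window_peak_intensity(mz, intensity, center, width=0.003):
    """Largest intensity whose m/z lies in [(1 - width) * center, (1 + width) * center]."""
    lo, hi = (1 - width) * center, (1 + width) * center
    inside = [y for x, y in zip(mz, intensity) if lo <= x <= hi]
    if not inside:
        raise ValueError("no data points fall inside the alignment window")
    return max(inside)


# Toy spectrum with made-up values: a peak displaced to 5312 Da still falls inside
# the 0.3% window around 5300 Da, whereas the larger peak at 5318 Da does not.
mz = [5290.0, 5300.0, 5312.0, 5318.0]
intensity = [2.0, 3.0, 30.0, 45.0]
peak = window_peak_intensity(mz, intensity, center=5300.0)
wider = window_peak_intensity(mz, intensity, center=5300.0, width=0.004)
```

Widening the window to 0.4% picks up the 5318 Da signal instead, which illustrates the trade-off between tolerating misalignment and confounding neighbouring peaks.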

In considering Figure 2, an alignment window >0.3% would be required to reliably conclude that these peaks reflect the same proteins/peptides. However, the cost of a large window is the possibility of spuriously associating peaks and confounding nearby markers. From the figure it should be clear that alignment algorithms would be valuable if they increased confidence that the same peak is identified in each spectrum and that the confounding of nearby markers was reduced.


    2 ALGORITHMS
 
Two alignment algorithms were developed. The first is specifically for investigators examining data using Ciphergen software and the second provides alignment for more general data forms.

2.1 Algorithm for Ciphergen data
The first algorithm might be preferable for those who process the data within the Ciphergen system and would like to use the Ciphergen software's baseline subtraction, normalization and peak-picking methods on data that have been realigned. The algorithm is based on Ciphergen's conversion of time-of-flight (t) data to mass (m/z) values via a quadratic equation,

$$\frac{m/z}{U} = a \cdot \mathrm{sign}(t - t_0)\,(t - t_0)^2 + b \qquad (1)$$
where U is the known ion voltage (20 000 for these data) and sign(t − t0) = 1 for t > t0 and −1 otherwise. Values of t begin with 0 and increase in increments of 2 × 10^−9 seconds. There is additionally a spot correction factor, but its effect is negligible and the term is omitted here. The a, b and t0 values are determined by a calibration procedure that is performed on spiked samples (i.e. calibrants) before the samples of interest are processed. A mixture consisting of known peptides/proteins is processed and a, b and t0 values are chosen so that mass values using Equation (1) are close to the known masses in the mixture. Typically, all the samples processed after a calibration procedure receive the same a, b and t0 values until the machine is again recalibrated. One hopes that the a, b and t0 parameters do not change too much among different calibration sessions. Guidelines for performing calibrations indicate that they should be performed ‘on a regular basis, preferably once per session’ (ProteinChip Software manual, 2002).
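As a minimal sketch of the conversion and its inverse, assuming Equation (1) has the form m/z = U · [a · sign(t − t0)(t − t0)^2 + b]: the function names and the calibration constants below are hypothetical and were chosen only so that the resulting masses fall in a plausible range; they are not values from the study.

```python
import math

U = 20000.0  # ion voltage quoted in the text


def tof_to_mz(t, a, b, t0, u=U):
    """Assumed form of Equation (1): m/z = U * (a * sign(t - t0) * (t - t0)**2 + b)."""
    d = t - t0
    sign = 1.0 if d > 0 else -1.0
    return u * (a * sign * d * d + b)


def mz_to_tof(mz, a, b, t0, u=U):
    """Invert the assumed Equation (1) to recover time-of-flight (requires a > 0)."""
    v = (mz / u - b) / a
    return t0 + math.copysign(math.sqrt(abs(v)), v)


# Hypothetical calibration constants (assumptions for illustration only).
a, b, t0 = 3.0e8, 0.0, 1.0e-7
t = 15000 * 2e-9                 # the 15000th sampling step of 2e-9 s
mz = tof_to_mz(t, a, b, t0)      # mass implied by this time-of-flight
t_back = mz_to_tof(mz, a, b, t0) # round trip back to time-of-flight
```

The inverse is what allows times-of-flight to be inferred from recorded m/z values when only the original a, b and t0 are stored with a spectrum.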

The algorithm for aligning the spectra is simple—for each spectrum consider the mass locations for a set of peaks (at least three) and associate them with target m/z values. Then choose a, b and t0 to minimize deviation from the targeted values. The target values may correspond to average values for these peaks across the samples, to the values for a single spectrum chosen as a reference for all spectra or perhaps to an investigator's prior notions about what true values should underlie these particular peaks. The set of peaks may vary from spectrum to spectrum. Given a set of N peaks, we find values of a, b and t0 that minimize

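$$\sum_{i=1}^{N} \left[ \frac{m_i - U\{a \cdot \mathrm{sign}(t_i - t_0)\,(t_i - t_0)^2 + b\}}{m_i} \right]^2 \qquad (2)$$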
where mi are the target m/z values and ti are the times-of-flight associated with the peaks in a given spectrum that requires alignment. The minimizing values of a, b and t0 are determined by a Nelder–Mead multivariate optimization procedure that is implemented in the software. In Equation (2) each summand is divided by the associated mass to scale the errors proportionally—a 10 Da error at 2000 Da is more severe than a 10 Da error at 50 000 Da. The algorithm indicates that the time-of-flight values need to be known. However, one can use the original a, b and t0 values for the given spectrum and invert Equation (1) to infer these time-of-flight measures from the original m/z values associated with the peaks. Consequently, to execute this algorithm one needs the target m/z values, mi, the actual m/z values in the single spectrum that need to be recalibrated, denoted pi, and the original a, b and t0 values for the single spectrum. The algorithm's implementation that we provide uses the pi to compute the ti and then finds the minimizing values of a, b and t0. A similar approach was applied as an ad hoc correction to Ciphergen data (Baggerly et al., 2004).
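The objective can be sketched as follows, assuming Equation (2) is a sum of squared mass errors, each divided by its target mass, and assuming the quadratic form of Equation (1) given in the docstring. All names and numbers are hypothetical. The sketch only evaluates the criterion; in practice a general-purpose Nelder–Mead routine (for example `scipy.optimize.minimize` with `method="Nelder-Mead"`) would search over a, b and t0, probably after rescaling the parameters because they differ by many orders of magnitude.

```python
import math

U = 20000.0  # ion voltage quoted in the text


def tof_to_mz(t, a, b, t0):
    """Assumed Equation (1): m/z = U * (a * sign(t - t0) * (t - t0)**2 + b)."""
    d = t - t0
    return U * (a * (1.0 if d > 0 else -1.0) * d * d + b)


def mz_to_tof(mz, a, b, t0):
    """Inverse of the assumed Equation (1); requires a > 0."""
    v = (mz / U - b) / a
    return t0 + math.copysign(math.sqrt(abs(v)), v)


def relative_sq_error(params, targets, tofs):
    """Assumed Equation (2): squared mass errors, each scaled by its target mass."""
    a, b, t0 = params
    return sum(((m - tof_to_mz(t, a, b, t0)) / m) ** 2 for m, t in zip(targets, tofs))


original = (3.0e8, 0.0, 1.0e-7)                 # hypothetical a, b, t0
observed = [3015.0, 5318.0, 9020.0, 15040.0]    # p_i, made-up peak locations
targets = [3008.0, 5302.0, 9000.0, 15010.0]     # m_i, made-up target locations

# Step 1: infer times-of-flight from the observed masses and the original parameters.
tofs = [mz_to_tof(p, *original) for p in observed]

# Step 2: evaluate the criterion at the original and at a candidate parameter set.
loss_before = relative_sq_error(original, targets, tofs)
candidate = (original[0] * 0.9977, original[1], original[2])
loss_after = relative_sq_error(candidate, targets, tofs)
```

Here shrinking a by about 0.23% already lowers the criterion, because every made-up observed mass sits roughly 0.2 to 0.3% above its target.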

It should be noted that the quality of the alignment depends in part upon what target values are chosen for the calibrating set of peaks. If the range of targets is too narrow the quadratic curve parameters may not provide a good realignment for m/z values outside the range of targets. This is an important point that will be addressed further below. Also, at least three points are required to fit the quadratic curve, but ideally more should be used.

In some cases alignment problems can be alleviated by assigning the same calibration equation to all spectra instead of using that corresponding to the most recent calibration or providing individually tailored parameters. This approach is provided as a competing method when evaluating the Ciphergen-based algorithm.

2.2 Algorithm for general data
It is assumed the data are represented in a two-column format: one column for m/z value and the other showing intensity (the algorithm could also be used if time-of-flight instead of m/z values were provided). This algorithm is nearly as simple as the previous one, though it is based on fitting cubic splines to the data rather than quadratic equations. We begin with a single spectrum, a set of N peaks with associated m/z values (denoted pi) and a set of target m/z values for these peaks (denoted mi).

Let yi denote the ratio of the original mass value to its target value, i.e. yi = pi/mi. Then, given {p1,m1,...,pN,mN} and any positive value λ, there exists a unique function fλ(m) that minimizes the error term

$$\sum_{i=1}^{N} \{y_i - f(p_i)\}^2 + \lambda \int_{\alpha}^{\beta} \{f''(p)\}^2 \, dp \qquad (3)$$
among all functions f(m) with two continuous derivatives, where m denotes any mass value in the range of interest. Here α and β indicate the limits of the m/z range under consideration, e.g. 2000–20 000 Da, and f''(m) denotes the second derivative. The unique function, fλ(m), is a natural cubic spline; see Green and Silverman (1994) and Hastie and Tibshirani (1990) for some properties. For a given spectrum and {p1,m1,...,pN,mN} values, a particular λ is chosen by cross-validation: successively leaving out one of the {pi,mi} pairs and determining what value of λ generally yields low estimated error, {yi − fλ(pi)}^2, for the omitted data. Intuitively, the function fλ(m) minimizes the deviations between yi and f(pi) subject to a penalty term, λ∫{f''(p)}^2 dp, that modulates the extent that f(m) deviates from a straight line. For large values of λ, the minimizing function fλ(m) looks more like a straight line fit by a regression equation to the points (p1,y1),...,(pN,yN); as λ becomes smaller, the fλ(m) curve will tend to become less straight and to yield smaller values of {yi − fλ(pi)}^2.
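One way to write the leave-one-out criterion described above is given below. The notation $f_{\lambda}^{(-i)}$, meaning the spline fitted with the $i$th pair omitted, is introduced here for illustration and does not appear in the original; averaging the squared errors is likewise an assumption about how "generally low" error is summarized.

```latex
\mathrm{CV}(\lambda) = \frac{1}{N} \sum_{i=1}^{N} \left\{ y_i - f_{\lambda}^{(-i)}(p_i) \right\}^2,
\qquad
\hat{\lambda} = \arg\min_{\lambda > 0} \mathrm{CV}(\lambda)
```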

The {p1,m1,...,pN,mN} may vary by spectrum, as will the associated λ values. Once a particular λ is chosen for a given spectrum, the data are recalibrated as follows. We let Morig denote one of the original m/z values associated with the two-column dataset and Iorig denote its associated intensity. The mass need not be one of those peaks used to derive the fλ(m) spline; it may be any mass in the original dataset. Then a recalibrated mass associated with the original intensity, Iorig, is calculated as

$$M_{\mathrm{recal}} = \frac{M_{\mathrm{orig}}}{f_{\lambda}(M_{\mathrm{orig}})} \qquad (4)$$
where fλ(Morig) denotes the value of the spline function at the mass value of Morig. These recalibrated masses are computed for every one of the original masses and are associated with the original intensities. Linear interpolation of the recalibrated masses that are closest to the original mass is used to obtain a new intensity for the original mass.
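The recalibration step can be sketched as below, assuming Equation (4) divides each original mass by the fitted ratio (which follows from yi = pi/mi). To keep the example short and dependency-free, the ratio function here is a piecewise-linear interpolant of the yi values; this is a deliberate stand-in and not the penalized cubic spline described in the text. Function names and peak values are hypothetical.

```python
from bisect import bisect_left


def ratio_function(peaks, targets):
    """Return f(m) that linearly interpolates y_i = p_i / m_i between the peaks.

    Stand-in for the fitted smoothing spline. `peaks` must be sorted in
    increasing order; the end ratios are held constant outside the peak range.
    """
    ratios = [p / m for p, m in zip(peaks, targets)]

    def f(m):
        if m <= peaks[0]:
            return ratios[0]
        if m >= peaks[-1]:
            return ratios[-1]
        j = bisect_left(peaks, m)
        w = (m - peaks[j - 1]) / (peaks[j] - peaks[j - 1])
        return ratios[j - 1] + w * (ratios[j] - ratios[j - 1])

    return f


def recalibrate(masses, f):
    """Assumed Equation (4): recalibrated mass = original mass / f(original mass)."""
    return [m / f(m) for m in masses]


peaks = [3015.0, 5318.0, 9020.0, 15040.0]     # p_i, made-up values
targets = [3008.0, 5302.0, 9000.0, 15010.0]   # m_i, made-up values
f = ratio_function(peaks, targets)
recal = recalibrate(peaks + [7000.0], f)      # the peaks plus one non-peak mass
```

By construction each calibrating peak is mapped onto its target, and masses between peaks are shifted by an intermediate amount.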

In Table 1 we present an example to illustrate the process. The spline transformation moves the peak seen at 5316.165 closer to 5306 Da, i.e. the fλ value associated with 5316.165 realigns the peak with intensity 35.1734 to a recalibrated value of 5306.465. The recalibrated values could be used, but the investigator may want to retain the original m/z values as these may be identical for all the samples whereas the recalibrated m/z will vary from spectrum to spectrum. Having identical m/z values can be helpful in subsequent processing of the data, e.g. choosing peaks. To obtain new intensities for the original m/z we interpolate nearby intensity values. For instance, to obtain the new intensity for the original m/z of 5306.182, one sees that the closest recalibrated m/z values are 5305.697 and 5306.465, with associated intensities of 34.7263 and 35.1734. Weighting these intensity values by their distance from the mass value of 5306.182 yields an interpolated intensity of 35.0090. If the original m/z value is at either extreme end of the range (e.g. near 0) then it may happen that the recalibrated m/z are all to one side of the original m/z, in which case it is not possible to interpolate. In these instances the new intensity is taken to be the intensity associated with the closest recalibrated m/z value.
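The interpolation in this example can be reproduced with a short sketch. The function name `intensity_at` is hypothetical; the four numbers are the rounded values quoted in the text, so the result agrees with the reported 35.0090 only to about three decimal places.

```python
from bisect import bisect_left


def intensity_at(original_mz, recal_mz, intensity):
    """Linearly interpolate an intensity at original_mz from recalibrated points.

    recal_mz must be sorted in increasing order. If original_mz lies outside
    the recalibrated range, the intensity of the nearest point is returned.
    """
    if original_mz <= recal_mz[0]:
        return intensity[0]
    if original_mz >= recal_mz[-1]:
        return intensity[-1]
    j = bisect_left(recal_mz, original_mz)
    w = (original_mz - recal_mz[j - 1]) / (recal_mz[j] - recal_mz[j - 1])
    return intensity[j - 1] + w * (intensity[j] - intensity[j - 1])


recal_mz = [5305.697, 5306.465]    # recalibrated m/z values from the text
intensity = [34.7263, 35.1734]     # their associated intensities
new_intensity = intensity_at(5306.182, recal_mz, intensity)  # approximately 35.009
edge_intensity = intensity_at(5300.0, recal_mz, intensity)   # falls back to nearest point
```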


 
Table 1 Recalibration example

 
The algorithm is quite similar to the algorithm for Ciphergen data, which is based upon the quadratic curve. Similar concerns regarding the range of the target values apply here; too narrow a range may result in curves that fit poorly outside the range. Also, it is useful to provide enough target points so there are no long ranges of the m/z range without reference points that tie down the fλ(m) function. Long regions without target points may lead to fλ(m) values that vary widely between targets. Graphical inspection of the fλ(m) curve should reveal when such inadequacies exist. At least four target points are required to fit cubic splines to the data.

2.3 Selection of peak masses (pi) and corresponding targets (mi)
For each spectrum requiring realignment, both algorithms need a relatively small set of peaks with m/z values p1,...,pN and associated target m/z values m1,...,mN. Here is one approach for generating these corresponding lists in an automated fashion. Software for implementing this approach is also available at http://krisa.ninds.nih.gov/alignment/.

This approach begins with a reference spectrum, a fixed spectrum with which all other spectra will be compared and aligned. This may be obtained by choosing one of the spectra or creating a spectrum that is the average of all the spectra. A spectrum that requires alignment to this reference will be referred to as a test spectrum. To begin, first define a subrange of the mass values, in this case perhaps 2000–3000 Da. For this range locate the largest peak in the test spectrum and note its location, denoted p1. Then consider all the peaks in the reference spectrum that are located within a fixed window around p1, say [0.98 · p1, 1.02 · p1]. If there are k peaks within this 2% window, they are denoted r1,...,rk. For each of these peaks consider a window (of 5% width on each side) centered at rj, i.e. [0.95 · rj, 1.05 · rj] for j = 1,...,k. For each one of these windows, compute the correlation coefficient of the intensities in the reference spectrum over this mass range with the intensities over a 5% window centered about p1, [0.95 · p1, 1.05 · p1], in the test spectrum. From these k correlation coefficients choose the corresponding target peak as that with the highest correlation. This procedure may then be repeated using different m/z ranges. For each subrange one obtains a mass corresponding to the highest peak in the test spectrum and a mass corresponding to the peak in the reference spectrum that shows best correlation. If a set of peaks are thought to be represented in nearly all spectra the ranges may be set to focus upon them. Alternatively, a set of subranges that partitions the range of interest may be chosen, e.g. 2000–4000, 4000–7000, 7000–10 000, 10 000–15 000 and 15 000–20 000 Da. Further details regarding implementation are available at http://krisa.ninds.nih.gov/alignment/.
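The matching procedure can be sketched as follows. This is an illustration under stated assumptions, not the distributed software: both spectra are assumed to share one m/z grid, reference peaks are taken to be simple local maxima, and each window is resampled at the same relative positions around its center so that the two intensity vectors have equal length before they are correlated. All function names and the synthetic spectra are hypothetical.

```python
import math
from bisect import bisect_left


def interp(x, xs, ys):
    """Linear interpolation on a sorted grid, clamped at both ends."""
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    j = bisect_left(xs, x)
    w = (x - xs[j - 1]) / (xs[j] - xs[j - 1])
    return ys[j - 1] + w * (ys[j] - ys[j - 1])


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)


def local_peaks(mz, inten):
    """m/z locations of simple local maxima (an assumed peak definition)."""
    return [mz[i] for i in range(1, len(mz) - 1) if inten[i - 1] < inten[i] >= inten[i + 1]]


def profile(center, mz, inten, half_width=0.05, n=201):
    """Intensities sampled at n relative positions across [0.95, 1.05] * center."""
    return [interp(center * (1 - half_width + 2 * half_width * k / (n - 1)), mz, inten)
            for k in range(n)]


def match_peak(mz, test, ref, lo, hi, search=0.02):
    """Return (p1, target): largest test peak in [lo, hi] and its best-correlated reference peak."""
    idx = [i for i, x in enumerate(mz) if lo <= x <= hi]
    p1 = mz[max(idx, key=lambda i: test[i])]
    candidates = [r for r in local_peaks(mz, ref) if (1 - search) * p1 <= r <= (1 + search) * p1]
    if not candidates:
        raise ValueError("no reference peaks inside the search window")
    test_profile = profile(p1, mz, test)
    return p1, max(candidates, key=lambda r: pearson(test_profile, profile(r, mz, ref)))


def spectrum(mz, peaks):
    """Synthetic spectrum: a sum of Gaussian peaks given as (center, height, sd)."""
    return [sum(h * math.exp(-0.5 * ((x - c) / s) ** 2) for c, h, s in peaks) for x in mz]


mz = [5000.0 + 0.5 * i for i in range(1401)]
ref = spectrum(mz, [(5300.0, 40.0, 4.0), (5380.0, 15.0, 4.0)])
test = spectrum(mz, [(5300.0 * 1.003, 40.0, 4.0), (5380.0 * 1.003, 15.0, 4.0)])  # shifted by 0.3%
p1, target = match_peak(mz, test, ref, lo=5250.0, hi=5450.0)
```

In this synthetic case two reference peaks fall inside the 2% search window, and the correlation step is what selects the 5300 Da peak rather than the one at 5380 Da as the target for the test peak.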


    3 SOFTWARE
 
Software programs are provided that implement both types of alignment. The R language was chosen as a basis for the code as it is free, easily available (see http://www.r-project.org) and widely supported, and it has optimization and spline fitting functions that ease the development of the algorithms.

The required information and formats are somewhat different depending upon whether the algorithm for Ciphergen or general data format is used. The advantage of using the algorithm for Ciphergen data is that after its application one may import the transformed spectra into the ProteinChip software and use that software to perform subsequent processing, e.g. baseline subtraction, normalization, peak selection. Investigators may find this preferable to using other software for performing these steps. To implement this approach the transformed data need to be in a form that Ciphergen software can import; the Ciphergen XML format for spectral files suits this purpose (at the time of writing, Ciphergen software cannot import text or ASCII files of the type described for the general format). Given an XML file representing a spectrum, a set of target m/z values and an associated set of actual m/z values for the spectrum, the algorithm reads in information from the XML file to convert the actual m/z values to time-of-flight values [the ti in expression (2)] and computes minimizing values of a, b and t0 in expression (2) to overwrite the original a, b and t0 values in the XML file. The altered XML file can then be imported back into the ProteinChip software for subsequent processing.

The implementation for data in a more general format is more straightforward. Users provide a comma-delimited (CSV), two-column file with m/z in the first column and associated intensity in the second column. Again a set of target and actual m/z values needs to be provided. From this input the algorithm produces a second CSV file with the original m/z column and altered intensities in the second column.


    4 RESULTS AND DISCUSSION
 
The dataset reflected in Figure 2 is used here to evaluate the algorithms. Of interest is how the number of peaks and the intensity change when the alignment algorithms are implemented. For a given set of spectra (which may or may not be realigned) baseline subtraction and normalization procedures are performed. Then a peak-picking tool is used to select a number of peak m/z values and to obtain the intensity values associated with these peaks. Good performance should be associated with low coefficients of variation for the peak intensities as the data are all derived from the same biological source of reference serum.
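The evaluation criterion can be sketched as below: one coefficient of variation per peak, computed across spectra, followed by summary statistics of those values. The use of the sample standard deviation is an assumption, and the intensity matrix is made up.

```python
from statistics import mean, median, stdev


def coefficient_of_variation(values):
    """Sample standard deviation divided by the mean."""
    return stdev(values) / mean(values)


# Rows are spectra and columns are peaks (made-up intensities).
intensities = [
    [10.0, 40.0, 5.0],
    [12.0, 44.0, 9.0],
    [8.0, 36.0, 1.0],
]

# One CV per peak, computed down each column.
cvs = [coefficient_of_variation(column) for column in zip(*intensities)]
summary = {"median": median(cvs), "mean": mean(cvs)}
```

Because every spectrum here is meant to come from the same reference serum, lower CVs indicate that the same peaks are being recovered with consistent intensities.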

Two sets of analyses are presented, one with the entire set of 44 spectra and another with attention restricted to just the first 4 days' worth of data (32 spectra). Two studies are provided because some may believe the length of time between the first 4 and last 2 days' worth of samples invalidates the inclusion of the latter.

4.1 Ciphergen data and quadratic algorithm
When realignment was performed it was done before any other data processing step. Data processing was restricted to m/z values between 2000 and 20 000 Da. Normalization (after baseline subtraction) was performed over the same Dalton range with an external constant of 0.2. Peak-picking parameters included the following: auto-detecting peaks to cluster, first pass S/N = 5, minimum peak threshold of 50% of all spectra, cluster mass window of 0.3%, second pass S/N = 2 and estimated peaks added to complete clusters.

Three competing approaches are evaluated. First, the Biomarker Wizard component of the ProteinChip software was used. As this method has no alignment procedure beyond the use of a 0.3% alignment window, the results using this approach are labeled as ‘None’ under the column designated ‘Alignment Method’. The second method, designated ‘Individual Equation’, is based on providing an individual mass to time-of-flight quadratic equation for each spectrum as described above. Finally, we include results obtained when a single calibration equation with fixed values of a, b and t0 was used. As data were gathered from 6 days for the 44 spectra (4 days for the 32) following 6 (4 for 32 spectra) different calibration procedures, we show the best results obtained after successively trying the 6 (4) different calibration equations. Most calibration equations had very similar coefficient of variation (CV) distributions for a given set of spectra. These results are designated by the ‘One Equation for All’ label. It should be pointed out that both the individual equation and the one equation approach also employed the 0.3% alignment window. The initial realignment step is the only difference in how the three sets of analyses were performed.

Table 2 shows that when data are restricted to the first 4 days (32 spectra) both the one equation and individual equation approaches perform identically. In both cases 185 peaks were identified and the median and mean CVs were 0.27. The 25th percentile of 0.19 indicates that 25% of the 185 CVs were ≤0.19. Without any alignment adjustment the mean and median CVs are approximately doubled and unacceptably high. These results suggest that in some circumstances it may be reasonable to assign a single calibration equation to all the spectra and avoid the time and effort associated with identifying target and actual m/z values (denoted as mi and pi above) for each spectrum.


 
Table 2 Coefficients of variation under different alignment schemes

 
However, results from the full set of 44 spectra indicate that using the same equation for all spectra may not always work well. Here we see that no adjustment beyond the 0.3% alignment window leads to CV values near 1 (average of 0.94, median of 0.82), clearly indicating that the data are not comparable. Using a single equation leads to some reduction but the CVs are still unacceptably high for data arising from a common biological source (average of 0.52, median of 0.43). The quadratic adjustment yielding individually tailored time-of-flight equations shows good results in that they are quite similar to those found when restricting analysis to the first 32 spectra (median CV based on 44 spectra of 0.27 is the same as that based on 32 spectra). The similarity in results gives some indication that one might be able to work with the entire dataset despite the obvious spectral changes brought about by the hardware alteration, as evidenced in Figure 2.

4.2 General data format and cubic spline algorithm
Here the same data are used, but in a different format. Each spectrum was expressed as a two-column text file with the first column designating m/z and the second column showing the associated intensity. The m/z values correspond to the original calibration equations, i.e. no adjustments to the data have been made. To determine peaks and their associated intensities some software tools are required. In the absence of Ciphergen software there is little consensus on what software may best be used to perform the preprocessing steps for mass spectrometry data. Although many investigators (Yasui et al., 2003; Adam et al., 2002; Coombes et al., 2003) have described their methods for performing various preprocessing steps (e.g. baseline subtraction, alignment, peak selection, normalization and smoothing), few have made supporting software widely available. Two exceptions are a set of tools written using the MATLAB programming environment (Coombes et al., 2004) and the PROcess package (version 1.3.4) in the Bioconductor software collection, which is written in the R programming language and available from http://bioconductor.org. Here we use the PROcess package because it is written in R and hence free, widely accessible and easy to implement.

The software performs baseline subtraction (based on loess fitting), normalization (equalizing area under the curve), peak-picking and alignment based on a 0.3% window (see the technical guide that accompanies the package's distribution for more information). When the cubic spline alignment procedure was employed, it was applied before any other data processing step. Here we present two alternatives—the alignment procedure that is part of the PROcess package and the cubic spline realigned data then subjected to the PROcess package's procedure. In this instance the data do not include time-of-flight as well as m/z information so it is not possible to mimic the earlier approach of applying the same calibration equation to all the samples.

Again the data are analyzed in two sets—one set based on 32 spectra and the other based on all 44. Initially the default parameters for the PROcess processing steps were employed, but these were modified to generate more peaks. In particular the lower bound for signal/noise ratio was made 1 (the default was 2), the minimum intensity was decreased to 0.05 units (the default required a peak height of at least 2 units) and the requirement for total area under the curve was reduced (i.e. ratio measure to 0.01 instead of a default of 0.2). As the intensity units are arbitrary and a broad range in the magnitude of peaks is expected, these changes seem reasonable. The results are shown in Table 3. The results are qualitatively similar to those obtained using the Ciphergen algorithm in that the effect of using the spline correction is most apparent when all spectra are used. The results for 32 spectra are relatively close; the median CV for the PROcess procedure alone is 0.26 and the median for the PROcess procedure with cubic spline adjustment is 0.23. When all 44 spectra are retained, the median CVs change to 0.45 and 0.25, respectively. As was the case with the Ciphergen quadratic-based correction there is no appreciable increase in the CVs when the additional 12 spectra are used with the cubic spline alignment. Thus it seems reasonable to combine the first 32 spectra with the 12 spectra from a later acquisition period as long as some additional alignment procedure is employed.


 
Table 3 General data format—coefficients of variation under different alignment schemes with revised PROcess parameters

 
It should be noted that the quadratic and cubic-spline based approaches yield very similar corrections for these data. The differences between the results for ‘Individual Equation’ in Table 2 and ‘Cubic spline and PROcess’ in Table 3 arise because of the different peak-picking algorithms employed (Ciphergen based in Table 2 and PROcess based in Table 3). When the quadratic-aligned data were exported from the Ciphergen software and then processed using the PROcess peak determination parameters, the distribution of the CVs was the same as that obtained using the cubic spline approach (data not shown). This should not be surprising as both the quadratic and cubic approaches are based on the same small set of mi and pi values for each spectrum.

4.3 Importance of choosing calibrants, pi and mi appropriately
These data may overstate the alignment problem in that the machine calibration process could probably have been improved. When the instrument was calibrated it was set up to be accurate over a range from ~10 000 to 100 000 Da based on the spiked set of calibrants. This wide range of calibrants was selected because a priori there was neither expectation nor experience to suggest where the signals would emerge. In retrospect, masses above 15 000 Da show less information and it would have been better to use peptide calibrants in a lower Dalton range. Looking for signals outside the calibrants' masses involves extrapolation (aligning outside the range of calibrants) rather than interpolation (aligning within the range of calibrants) and typically increases the degree of error. It may be that the higher CVs for unadjusted data in Tables 2 and 3 would have been moderated if peptides with smaller masses had been used for machine calibration.

A related issue concerns the choice of the peak mass locations pi and their target mass locations mi. It is important that the set of pi and mi cover the mass range of interest. Supplementary material on the website shows that CVs for peaks <10 000 Da increase (sometimes dramatically) when the pi and mi are chosen in the 10 000–100 000 Da range. Extrapolation errors may be particularly severe for the quadratic method, and one may be better served by fitting a model with a b = 0 constraint in Equation (1) in such circumstances (see the website for details). As an aside we note that although the cubic splines use cubic polynomials to interpolate data within the ‘knot’ points, p1, p2,...,pN, the procedure uses linear extrapolation outside of the knots on the boundary, p1 and pN, so extrapolation errors are not likely to be too severe (Hastie and Tibshirani, 1990). These results underscore the importance of choosing pi and mi appropriately. Should investigators wish to analyze different regions (e.g. <20 000 Da using a low laser setting and >20 000 Da using a higher setting) it would be prudent to use separate alignment procedures (i.e. different pi and mi) for the different ranges.
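
The behaviour described above, cubic interpolation between the knots and linear extrapolation beyond p1 and pN, can be sketched with a natural cubic spline in plain Python. This is a generic textbook construction offered under stated assumptions and is not the author's software: the function name natural_cubic_spline is hypothetical, the knots p are the observed peak locations, the values m are their target masses, and the numbers in the usage lines are invented for illustration.

```python
from bisect import bisect_right


def natural_cubic_spline(p, m):
    """Return f(x) interpolating (p[i], m[i]) with a natural cubic spline.

    p must be a strictly increasing list with at least two knots. Outside
    [p[0], p[-1]] the function continues linearly with the boundary slope.
    """
    n = len(p)
    h = [p[i + 1] - p[i] for i in range(n - 1)]
    slope = [(m[i + 1] - m[i]) / h[i] for i in range(n - 1)]
    z = [0.0] * n  # second derivatives at the knots; zero at both ends
    if n > 2:
        # Tridiagonal system for the interior second derivatives.
        diag = [2.0 * (h[i - 1] + h[i]) for i in range(1, n - 1)]
        rhs = [6.0 * (slope[i] - slope[i - 1]) for i in range(1, n - 1)]
        for k in range(1, n - 2):  # forward elimination
            w = h[k] / diag[k - 1]
            diag[k] -= w * h[k]
            rhs[k] -= w * rhs[k - 1]
        z[n - 2] = rhs[-1] / diag[-1]
        for k in range(n - 4, -1, -1):  # back substitution
            z[k + 1] = (rhs[k] - h[k + 1] * z[k + 2]) / diag[k]
    left_slope = slope[0] - z[1] * h[0] / 6.0
    right_slope = slope[-1] + z[-2] * h[-1] / 6.0

    def f(x):
        if x <= p[0]:  # linear extrapolation below the first knot
            return m[0] + left_slope * (x - p[0])
        if x >= p[-1]:  # linear extrapolation above the last knot
            return m[-1] + right_slope * (x - p[-1])
        i = bisect_right(p, x) - 1
        a, b = p[i + 1] - x, x - p[i]
        return ((z[i] * a ** 3 + z[i + 1] * b ** 3) / (6.0 * h[i])
                + (m[i + 1] / h[i] - z[i + 1] * h[i] / 6.0) * b
                + (m[i] / h[i] - z[i] * h[i] / 6.0) * a)

    return f


# Invented knots: observed locations p and target masses m for one spectrum.
align = natural_cubic_spline([1000.0, 2000.0, 4000.0, 8000.0],
                             [1003.0, 2001.0, 4006.0, 7990.0])
print(align(3000.0), align(10000.0))
```

Because the second derivative of a natural spline is zero at the boundary knots, the linear continuation joins the cubic pieces smoothly, which is why extrapolated corrections grow only linearly with distance from the knots rather than quadratically or cubically.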


    5 CONCLUSION
In some circumstances such alignments may not be necessary, particularly if the data are obtained within a short period of time and from one machine operating in a stable environment. However, the validation of SELDI-based exploratory searches for biomarkers does require that reproducibility of the process be demonstrated; part of this entails work in which data must be consistent across separate instruments/centers. Semmes et al. (2005) addressed the problems with alignment primarily through strict oversight of the calibration process and constant comparisons of data within and across centers. However, if alignment problems become apparent after the data are collected then recalibration of the instrument cannot help. As an example, data from two investigators may be entirely consistent within each lab because operators closely monitored calibration, but inconsistency arises when the data are combined because the investigators made no effort to be consistent across labs as their collaboration arose after data collection. The algorithms described here may allow the data to be compared in such instances.

With Ciphergen data it may be relatively easy to get an idea whether additional work beyond the use of an alignment window is necessary. The Biomarker Wizard software provides graphs that indicate the mass and intensity values associated with an identified peak. By examining these values over a narrow m/z range across all the spectra one can get a sense of whether the alignment window is correctly associating peaks or whether more alignment work needs to be done. For data in a general format it may be possible to obtain similar information, though this would entail the cost of writing code to perform the computations.
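
For data in a general format, a check of this kind might look like the following Python sketch. It assumes each spectrum has already been reduced to a list of (m/z, intensity) peaks; within a narrow m/z range it records the most intense peak of each spectrum and compares the relative spread of those locations with the alignment window. The function names, the 0.3% threshold used for comparison and the example values are all assumptions made for illustration.

```python
import statistics


def peak_locations_in_range(spectra_peaks, lo, hi):
    """For each spectrum, return the most intense (mz, intensity) peak
    whose m/z lies in [lo, hi]; spectra without such a peak are omitted."""
    located = {}
    for spec_id, peaks in spectra_peaks.items():
        in_range = [(mz, inten) for mz, inten in peaks if lo <= mz <= hi]
        if in_range:
            located[spec_id] = max(in_range, key=lambda pk: pk[1])
    return located


def relative_spread(located):
    """(max - min) of the located m/z values divided by their median."""
    mzs = [mz for mz, _ in located.values()]
    return (max(mzs) - min(mzs)) / statistics.median(mzs)


# Invented peak lists for three spectra.
spectra = {
    "a": [(6630.0, 12.0), (6700.0, 1.0)],
    "b": [(6645.0, 10.0)],
    "c": [(6660.0, 11.0), (9000.0, 50.0)],
}
located = peak_locations_in_range(spectra, 6600.0, 6720.0)
spread = relative_spread(located)
print(located)
print("spread exceeds a 0.3% window:", spread > 0.003)
```

A spread that is comparable to or larger than the alignment window suggests that the window alone may not associate the peaks correctly and that a separate alignment step is worth considering.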

The analysis of the reference serum indicates that, in some instances, a separate preprocessing step devoted to aligning the spectra can greatly improve the ability to detect signal by removing some potentially obscuring noise arising from misaligned peaks. As in the case of the Day 6 data obtained on June 25, 2003 in Figure 2, such a step may be necessary to allow two datasets to be combined. Should such radical transformations as that used for the June 25, 2003 data be necessary, investigators must still exercise diligence to ensure that other differences are not present when examining data obtained on different machines or under different circumstances. Simply because the data now appear to be realigned does not exclude the possibility that, for instance, the June 25, 2003 results may preferentially measure greater intensity in lower m/z regions. A simple way to look for such differences may be to examine the principal components associated with the samples and, for those components explaining a great deal of variation, to determine whether there are differences related to when the sample was obtained. Should such differences be apparent the investigators should strongly consider excluding the dissimilar data.
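
The principal component check suggested above can be sketched in plain Python. The sketch assumes a table with one row per sample and one column per aligned peak, extracts only the first principal component by power iteration, and then compares the mean component score between acquisition groups. The function names, the starting vector, the group labels and the intensities are hypothetical; a full analysis would normally use a numerical library and inspect several components.

```python
import math


def first_principal_component(rows, iterations=500):
    """Return (scores, explained_fraction) for the first principal
    component of column-centered rows, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    x = [[r[j] - means[j] for j in range(d)] for r in rows]
    total = sum(c * c for r in x for c in r)
    # Arbitrary non-uniform start; it must not be orthogonal to the answer.
    v = [float(j + 1) for j in range(d)]
    norm = math.sqrt(sum(c * c for c in v))
    v = [c / norm for c in v]
    for _ in range(iterations):
        scores = [sum(r[j] * v[j] for j in range(d)) for r in x]
        w = [sum(x[i][j] * scores[i] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(c * c for c in w))
        if norm == 0.0:
            break
        v = [c / norm for c in w]
    scores = [sum(r[j] * v[j] for j in range(d)) for r in x]
    explained = sum(s * s for s in scores) / total if total > 0 else 0.0
    return scores, explained


def mean_score_by_group(scores, groups):
    """Average component score within each acquisition group."""
    collected = {}
    for s, g in zip(scores, groups):
        collected.setdefault(g, []).append(s)
    return {g: sum(vals) / len(vals) for g, vals in collected.items()}


# Invented intensities: three samples from each of two acquisition periods.
rows = [[10.0, 5.0, 1.0], [11.0, 5.0, 1.0], [10.0, 6.0, 1.0],
        [20.0, 5.0, 1.0], [21.0, 6.0, 1.0], [20.0, 5.0, 1.0]]
groups = ["early", "early", "early", "late", "late", "late"]
scores, explained = first_principal_component(rows)
print(explained, mean_score_by_group(scores, groups))
```

If a component that explains much of the variation also separates the samples by acquisition period, as it does for these invented values, the difference between periods deserves scrutiny before the datasets are pooled.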


    Acknowledgments
 
The author wishes to thank Dr Xiang Wang (obtaining the data and reviewing the manuscript), Dr Catherine Campbell (interpreting the data and reviewing the manuscript) and Mr Jack Panossian (algorithm evaluation) for their assistance.

Received on March 13, 2005; revised on April 15, 2005; accepted on May 1, 2005

    REFERENCES

    Adam, B.-L., et al. (2002) Serum protein fingerprinting coupled with a pattern-matching algorithm distinguishes prostate cancer from benign prostate hyperplasia and healthy men. Cancer Res., 62, 3609–3614.

    Baggerly, K., et al. (2003) A comprehensive approach to the analysis of matrix-assisted laser desorption/ionization-time of flight proteomics spectra from serum samples. Proteomics, 3, 1667–1672.

    Baggerly, K., et al. (2004) Reproducibility of SELDI-TOF protein patterns in serum: comparing data sets from different experiments. Bioinformatics, 20, 777–785.

    Coombes, K., et al. (2003) Quality control and peak finding for proteomics data collected from nipple aspirate fluid by surface-enhanced laser desorption and ionization. Clin. Chem., 49, 1615–1623.

    Coombes, K., Tsavachidis, S., Morris, J., Baggerly, K., Hung, M.-C. and Kuerer, H. (2004) Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Technical Report, M.D. Anderson Biostatistics and Applied Mathematics Department.

    Green, P. and Silverman, B. (1994) Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach. Chapman and Hall, London.

    Hastie, T. and Tibshirani, R. (1990) Generalized Additive Models. Chapman and Hall, London.

    Li, J., et al. (2002) Proteomics and bioinformatics approaches for identification of serum biomarkers to detect breast cancer. Clin. Chem., 48, 1296–1304.

    Petricoin, E. and Liotta, L. (2003) Mass spectrometry-based diagnostics: the upcoming revolution in disease detection. Clin. Chem., 49, 533–534.

    Petricoin, E., et al. (2002) Use of proteomic patterns in serum to identify ovarian cancer. Lancet, 359, 572–577.

    ProteinChip Software 3.1 Operation Manual (2002) Ciphergen Biosystems, Inc., Fremont, CA.

    Sauve, A. and Speed, T. (2004) Normalization, baseline correction and alignment of high-throughput mass spectrometry data. Proceedings of the Genomic Signal Processing and Statistics, 2004.

    Semmes, O., et al. (2005) Evaluation of serum protein profiling by surface-enhanced laser desorption/ionization time-of-flight mass spectrometry for the detection of prostate cancer: I. Assessment of platform reproducibility. Clin. Chem., 51, 102–112.

    Yasui, Y., et al. (2003) A data-analytic strategy for protein biomarker discovery: profiling of high-dimensional proteomic data for cancer detection. Biostatistics, 4, 449–463.

