Bioinformatics Advance Access originally published online on November 14, 2007
Bioinformatics 2008 24(1):63-70; doi:10.1093/bioinformatics/btm533

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Independent component analysis for the extraction of reliable protein signal profiles from MALDI-TOF mass spectra

Dante Mantini 1,2, Francesca Petrucci 3,4, Piero Del Boccio 3,4,5, Damiana Pieragostino 3,4,5, Marta Di Nicola 3,4, Alessandra Lugaresi 3,6, Giorgio Federici 7,8, Paolo Sacchetta 3,4, Carmine Di Ilio 3,4 and Andrea Urbani 3,4,5,*

1Istituto Tecnologie Avanzate Biomediche (ITAB), Fondazione ‘G. d’Annunzio’, 2Dipartimento di Scienze Cliniche e Bioimmagini, Università ‘G. d’Annunzio’, 3Centro Studi sull’Invecchiamento (Ce.S.I.), Fondazione ‘G. d’Annunzio’, 4Dipartimento di Scienze Biomediche, Università ‘G. d’Annunzio’, Chieti-Pescara, 5Centro Europeo Ricerca sul Cervello (CERC), IRCCS-Fondazione S. Lucia, Roma, 6Dipartimento di Oncologia e Neuroscienze, Università ‘G. d’Annunzio’, Chieti-Pescara, 7Dipartimento di Medicina Interna, Università di Roma ‘Tor Vergata’ and 8Ospedale Pediatrico Bambino Gesù – IRCCS, Roma, Italy

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Independent component analysis (ICA) is a signal processing technique that can be used to recover independent signals from a set of their linear mixtures. We propose ICA for the analysis of signals obtained from large proteomics investigations such as clinical multi-subject studies based on MALDI-TOF MS profiling. The method is validated on simulated and experimental data to demonstrate its ability to correctly extract protein profiles from MALDI-TOF mass spectra.

Results: A comparison of peak detection with one open-source and two commercial methods shows that ICA is more reliable in reducing the false discovery rate of protein peak masses. Moreover, the integration of ICA with statistical tests for detecting differences in peak intensities between experimental groups allows the identification of protein peaks that could be indicators of a diseased state. This data-driven approach proves to be a promising tool for biomarker-discovery studies based on MALDI-TOF MS technology.

Availability: The MATLAB implementation of the method described in the article and both simulated and experimental data are freely available at http://www.unich.it/proteomica/bioinf/.

Contact: a.urbani{at}unich.it


    1 INTRODUCTION
Independent component analysis (ICA) is a signal processing technique that can be used to separate distinct underlying signals from mixed recorded signals, on the basis of their statistical properties (Hyvärinen et al., 2001; Stone, 2004). The most common example for illustrating ICA is the so-called cocktail party problem. Consider a cocktail party where a number of people are speaking in the same room. Assume further that there are several microphones in different positions, so that each records a mixture of the speech signals with slightly different weights. In this situation, blind source separation is required if we aim to retrieve the different sound sources from the recordings without any a priori information. ICA is able to do this, provided that there are at least as many microphones in the room as there are simultaneous sound sources. In recent years, several efficient algorithms have been developed to solve the ICA problem (Bell and Sejnowski, 1995; Cardoso and Souloumiac, 1996; Hyvärinen, 1999; Ziehe et al., 2000). Furthermore, ICA has recently received attention because of its potential applications in several fields, as demonstrated in studies on speech recognition systems, financial time series analysis and biomedical signal processing (Back and Weigend, 1997; James and Hesse, 2005; Jung et al., 2001; Mantini et al., 2005; Smith et al., 2006). ICA techniques have been applied to a number of non-conventional situations, because the basic assumption of the ICA problem, i.e. the independence of the variables, is realistic in many circumstances, hence permitting a completely blind source separation, or feature retrieval. The successful applications of ICA have also encouraged its use for biological systems, which are in general more complex to analyze than non-biological ones (Frigyesi et al., 2006; Liebermeister, 2002; Scholz et al., 2004).

1.1 ICA theory
The ICA model can be mathematically described as:


$$X = A\,S \qquad (1)$$

where $X = [x_1, x_2, \ldots, x_n]^T$ is the matrix of n observed signals, $S = [s_1, s_2, \ldots, s_m]^T$ is the matrix of m underlying signals and A denotes the [n x m] mixing matrix (Hyvärinen et al., 2001; Stone, 2004). It is a generative model, which means that it describes how the observed data are generated by a process of mixing the underlying signals $s_i$, whose estimates are named independent components (ICs). The minimal required a priori information in the ICA model is the independence of the ICs.

A solution for the ICA problem is possible if two additional conditions are met: the number of underlying signals is at most equal to the number of observed signals (m ≤ n), and the mixing matrix has full column rank (r(A) = m). In this case, the ICs can be retrieved by determining an [m x n] matrix W, named the unmixing matrix, such that

$$S = W X \qquad (2)$$

1.2 Pre-processing for ICA
In order to obtain W by estimating a minimum number of parameters, it is necessary to center and whiten the acquired data (Hyvärinen and Oja, 1997). Centering is performed by subtracting from each mixed signal $x_j$ its average value, so that the underlying signals $s_i$ become zero-mean; whitening is a linear transformation of matrix X into another matrix $\tilde{X}$ whose observations are uncorrelated and have unit variance. The most common method for whitening is the eigenvalue decomposition of the covariance matrix $E\{XX^T\} = E D E^T$, where E is the orthogonal matrix of eigenvectors of $E\{XX^T\}$ and $D = \mathrm{diag}(d_1, d_2, \ldots, d_n)$ is the diagonal matrix of its eigenvalues (Yang and Wang, 1999).

The whitened matrix $\tilde{X}$ can be calculated as

$$\tilde{X} = V X \qquad (3)$$

with $V = D^{-1/2} E^T$.

Whitening transforms the unmixing matrix W into a new matrix $\tilde{W}$, for which:

$$S = \tilde{W}\,\tilde{X} \qquad (4)$$

$\tilde{W}$ is orthogonal and allows minimizing the number of parameters to be estimated: instead of having to estimate all the coefficients of the original matrix W, we only need to estimate the orthogonal matrix $\tilde{W}$, which has fewer degrees of freedom.

After solving the ICA problem for (4), the unmixing matrix W can be computed as:

$$W = \tilde{W}\,V = \tilde{W}\,D^{-1/2} E^T \qquad (5)$$
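
As a short check of the whitening step, and assuming the standard choice of whitening matrix $V = D^{-1/2}E^T$ built from the eigenvalue decomposition described above, the covariance of the whitened data reduces to the identity:

$$
E\{\tilde{X}\tilde{X}^T\} = V\,E\{XX^T\}\,V^T = D^{-1/2}E^T\,(E D E^T)\,E\,D^{-1/2} = D^{-1/2}\,D\,D^{-1/2} = I
$$

If the ICs are additionally scaled to unit variance (the usual ICA convention, assumed here), then $E\{SS^T\} = \tilde{W}\,E\{\tilde{X}\tilde{X}^T\}\,\tilde{W}^T = \tilde{W}\tilde{W}^T = I$, which is why the whitened unmixing matrix is orthogonal. For m = n, an orthogonal [n x n] matrix has n(n-1)/2 free parameters, compared with n^2 for an unconstrained W.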

1.3 Post-processing of ICA output
At the end of the ICA decomposition, the matrix W is calculated according to (5) and the matrix S according to (2). Subsequently, the matrix A, being the pseudoinverse of the (generally non-square) matrix W, is obtained with the formula

$$A = W^T (W W^T)^{-1} \qquad (6)$$
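
The role of the pseudoinverse can be verified in one line. Taking the right pseudoinverse $A = W^T (W W^T)^{-1}$ (the standard form for a full-row-rank W, assumed here so that $W W^T$ is invertible), left-multiplication by W gives the [m x m] identity, and the generative model (1) is consistent with the unmixing relation (2):

$$
W A = W W^T (W W^T)^{-1} = I_m, \qquad W X = W A S = S
$$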

Typical post-processing of the ICA output consists of sorting the ICs in order of decreasing power (James and Hesse, 2005). For this purpose, the power p_i of the i-th component across the matrix X is directly calculated from the i-th column of the matrix A as


Formula 7

(7)

After sorting the components, the IC waveforms si and the related IC amplitudes Ai, corresponding to the i-th column of the matrix A, can be stored for further analysis.
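
The sorting step can be sketched in a few lines of Python. This is only an illustration: the exact expression of equation (7) is not reproduced in this sketch, so the power of each IC is taken here, as an assumption, to be the mean squared value of the corresponding column of A. The function name `sort_components_by_power` and the toy matrices are hypothetical.

```python
def sort_components_by_power(A, S):
    """Reorder ICs by decreasing power.

    A: list of n rows (one per spectrum), each holding m IC amplitudes.
    S: list of m IC waveforms.
    Assumption: the power of IC i is the mean squared value of column i of A.
    """
    n = len(A)
    m = len(S)
    power = [sum(A[j][i] ** 2 for j in range(n)) / n for i in range(m)]
    # Indices of the components, from the most to the least powerful.
    order = sorted(range(m), key=lambda i: power[i], reverse=True)
    A_sorted = [[row[i] for i in order] for row in A]
    S_sorted = [S[i] for i in order]
    return A_sorted, S_sorted, [power[i] for i in order]


# Toy example: 3 spectra, 2 components; the second column of A dominates.
A = [[0.1, 2.0], [0.2, 1.0], [0.1, 3.0]]
S = [[0, 1, 0], [1, 0, 0]]
A_sorted, S_sorted, power = sort_components_by_power(A, S)
```

After the call, the columns of `A_sorted` and the rows of `S_sorted` share the same ordering, so amplitudes and waveforms stay paired.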

1.4 Application to MALDI-TOF MS data
An intriguing application of ICA is the processing of proteomic signals, and in particular MALDI-TOF mass spectra. MALDI-TOF mass spectrometers produce signals that correspond to the different times of flight of the analyzed proteins, which are ionized by means of a high-energy laser beam and accelerated with an electric field (Karas, 1996). The acquired spectra always present complex features, because they are composed of a number of overlapping peaks with different amplitudes, related to the abundance of the proteins, and are contaminated by artifacts of biological/physical origin (Gras et al., 1999), i.e. the baseline trend and the background noise. Due to the presence of these disturbances, very sensitive and accurate peak-detection methods, able to correctly separate protein signals from noise, are required. Several processing strategies have been proposed in the literature for analyzing MALDI-TOF data (Coombes et al., 2005; Gras et al., 1999; Mantini et al., 2007; Satten et al., 2004; Yasui et al., 2003); however, the problem of noise peaks being detected as signals has still not been completely solved. This problem seriously limits the development of reliable proteomics tools for biomarker discovery and early disease diagnosis (Diamandis, 2004).

In this perspective, an ICA approach for extracting protein profiles from multi-subject MALDI-TOF MS data is proposed. According to ICA theory, the observed signals are the mass spectra, and the protein profiles, assumed to be independent of each other, correspond to the ICs. Each IC is expected to contain single peaks, or multiple peaks that are up- and down-regulated in the same manner across mass spectra. With regard to the basic assumptions for the solution of the ICA problem, it is worth noting that the mixing matrix is always full-rank, since no mass spectrum can be obtained as a linear combination of the other spectra. In turn, a number of mass spectra at least equal to the number of expected protein profiles is required for the ICA decomposition.

To the best of our knowledge, this work is the first to use ICA for high-dimensional proteomic data analysis, for the separation of artifacts and for the direct resolution of protein signals. The ICA method has been validated on simulated data to assess its ability to separate protein peaks without noticeable signal distortion. Our findings show that, in terms of peak detection, the proposed method is more reliable than classical methods in reducing the false discovery rate of protein peak masses. Moreover, it is demonstrated, using real serum and plasma samples obtained from a group of 30 patients, that the information extracted using ICA from MALDI-TOF data is valuable for biomarker-discovery studies.


    2 METHODS
2.1 Simulated data
Synthetic data were prepared with the aim of reproducing signals with the same characteristics as real spectra obtained from MALDI-TOF MS devices. A set of 40 protein profiles was created; for each of them, 1–3 peaks with fixed relative abundance were combined in the same signal. The following equation (Foley, 1987) was adopted to simulate the MS peaks:


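$$f(z) = \frac{A_0}{\tau}\,\exp\!\left[\frac{\sigma_p^2}{2\tau^2} - \frac{z - z_p}{\tau}\right]\int_{-\infty}^{h}\frac{1}{\sqrt{2\pi}}\,e^{-x^2/2}\,dx \qquad (8)$$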
where z is the mass/charge (m/z) value, A0 is the area of the peak, τ is the time constant of the exponential decay, σp controls the tailing of the peak, zp determines the position of the peak on the m/z axis, the ratio τ/σp is a measure of its asymmetry, and h = [(z − zp)/σp] − (σp/τ). The parameters used in the simulation ranged between 180 and 700 for A0 and between 6000 and 18 000 for zp, whereas τ was set to 0.0172 and σp to 0.0189. Among the 40 protein profiles, 30 were generated using a single peak, whereas 6 and 4 were obtained by summing two and three peaks, respectively.
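
A minimal Python sketch of such a peak is given below. It assumes the standard exponentially modified Gaussian form, f(z) = (A0/τ) exp[σp²/(2τ²) − (z − zp)/τ] Φ(h), where Φ is the standard normal cumulative distribution and h is defined as in the text. The function name `emg_peak`, the grid, and the choice of expressing all m/z quantities in kDa are assumptions made for illustration only.

```python
import math


def emg_peak(z, area, z_p, tau, sigma_p):
    """Exponentially modified Gaussian peak evaluated at the m/z value z."""
    h = (z - z_p) / sigma_p - sigma_p / tau
    if h < -30.0:
        # The Gaussian integral is negligible here; returning early also
        # avoids overflow of the exponential term far below the peak.
        return 0.0
    gauss_cdf = 0.5 * math.erfc(-h / math.sqrt(2.0))
    expo = math.exp(sigma_p ** 2 / (2.0 * tau ** 2) - (z - z_p) / tau)
    return (area / tau) * expo * gauss_cdf


# Illustrative values: one peak at 10 kDa with the tau and sigma_p quoted above.
step = 0.0005
grid = [9.5 + k * step for k in range(3001)]
profile = [emg_peak(z, 400.0, 10.0, 0.0172, 0.0189) for z in grid]

# Numerical check: the integral of the peak should be close to its area A0.
approx_area = sum(profile) * step
peak_position = grid[profile.index(max(profile))]
```

Because of the exponential tail, the maximum of the simulated peak lies slightly above zp, and the peak decays more slowly on the high-mass side.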

A synthetic MALDI-TOF MS dataset Xsyn was prepared by means of a linear mixing of the 40 protein profiles, using realistic weights. A total number of 60 mass spectra with at most 40 peaks were produced; a specific signal (mass spectrum #33) contained an outlier peak, which was absent in the other mass spectra. In order to simulate realistic conditions, a decreasing baseline and a non-uniform background noise were added to each spectrum. Simulated MALDI-TOF mass spectra are shown in Figure 1.


Fig. 1. Plots of five of the 60 simulated MALDI-TOF mass spectra, generated by combining all 40 synthetic ICs plus background noise and baseline trend. Mass spectrum #33, shown in the bottom plot, contains a single outlier peak. The m/z values are expressed in kDa and the amplitudes are in a normalized average intensity (a.i.) scale.

 
2.2 Experimental data
2.2.1 Sample preparation
The experimental data, referring to both serum and plasma samples, were collected from 30 patients (age 28–40 years) affected by inflammatory auto-immune disease, who gave written informed consent.

Equine myoglobin, dissolved in 0.1% trifluoroacetic acid (TFA) in deionized water, was used as calibrator. Sample preparation was performed by ZipTip (Millipore) C4 tips with sinapinic acid. The samples (20 µl) were first acidified by addition of 5 µl 1% TFA before loading and preparation with a sandwich layer method on MTP ground steel 384 (Bruker Daltonics). First, a sinapinic acid matrix seed layer was created by depositing a droplet (0.5 µl) of a saturated solution of sinapinic acid in 100% ethanol on the target. The C4 resin was first activated by multiple washing with 10 µl of ACN/water (1:1) and then equilibrated by 0.1% TFA. Thereafter, the sample was trapped on the ZipTip resin and washed with TFA 0.1%; finally, the sample was eluted from the resin using 2 µl of a saturated solution of sinapinic acid in 30/70 ACN/0.1% TFA and spotted directly on MTP ground steel 384 (Bruker Daltonics) and subjected to MALDI-TOF MS acquisition (Biroccio et al., 2006).

2.2.2 Data acquisition
Each matrix droplet was individually analyzed using a Bruker Reflex IV MALDI time-of-flight mass spectrometer, equipped with a nitrogen laser (337 nm) and used in linear mode under delayed extraction conditions (400 ns). The ion source and flight tube were evacuated by turbo pumps to a pressure lower than 6 × 10^-7 mbar. The laser spot was 57 µm high and 32 µm wide. Data were collected using an accelerating voltage of 25 kV, Ion Source1 20 kV, Ion Source2 17 kV, Lens 9.60 kV. Spectra were collected for a mass range of 5–20 kDa at 1 GHz. The laser power was modulated between 20% and 40% in order to obtain fewer than 1000 ion counts per single acquisition run. Every single acquisition run was composed of 100 laser pulses at 5 Hz; multiple single-position acquisition runs were added to obtain a minimal spectrum intensity scale of 5 × 10^3 ion counts. The resulting mass spectra showed a mass accuracy of 40 p.p.m. on average with peaks at 800/1200 FWHM. MALDI-TOF mass spectra from serum and plasma samples are illustrated in Figure 2.


Fig. 2. Mass spectra of serum and plasma samples, respectively, acquired from the same three patients. The m/z values are expressed in kDa and the amplitudes are in a normalized average intensity (a.i.) scale.

 
2.2.3 Data pre-processing
In order to be suitable for the data processing phase, each linear-mode MALDI-MS spectrum $x_i$ was resampled at 0.25 Da resolution, and converted to a text file listing k intensity versus m/z data points (k = 60 000). To circumvent the presence of slight m/z variations in different spectra, signals were calibrated (Jeffries, 2005) using as internal calibrants the peaks derived from mono- and bi-charged myoglobin ions and mono- and bi-charged hemoglobin ions; the spectra were also normalized using the peak intensity of the mono-charged myoglobin ion. The processed spectra $x_i$ were arranged in an [n x k]-dimensional matrix $X = [x_1, x_2, \ldots, x_n]^T$, where n is the number of mass spectra. Three matrices were generated and used in the study on experimental data: the first one, Xser, containing the spectra from serum; the second one, Xplas, containing the spectra from plasma; and the third one, Xtot, obtained from the vertical concatenation of Xser and Xplas.

2.3 ICA decomposition
The generic data matrix X, arranged with the spectrum IDs in rows and the intensities corresponding to the m/z values in columns, is the only necessary input for ICA. In turn, the results of the ICA decomposition are two matrices: the first is the matrix S of IC waveforms with the component IDs in rows and the intensities corresponding to the m/z values in columns; the second is the matrix A of IC amplitudes with the component IDs in columns and the spectrum weights in rows.

We used the FastICA algorithm (Hyvärinen, 1999) for the data decomposition into n ICs. We downloaded a free copy of the FastICA package, written in MATLAB code, from the page http://www.cis.hut.fi/projects/ica/fastica/, that also contained the implementations in R and C++ programming languages. We ran FastICA on a PC with Pentium IV processor at 2.5 GHz and 1.5 GB RAM. Due to memory limitations, the maximum number of MALDI-TOF mass spectra that could be jointly processed was about 400.

With FastICA, the estimation of ICs is based on the assumption that components that are mutually statistically independent are characterized by probability distributions that are not Gaussian. The algorithm uses the kurtosis (fourth-order cumulant) of the signals, which is defined for a zero-mean random variable v as

$$\mathrm{kurt}(v) = E\{v^4\} - 3\,(E\{v^2\})^2 \qquad (9)$$

Kurtosis is zero for a Gaussian distribution of v, positive for densities peaked at zero and negative for flat densities. This means that kurtosis is suitable for assessing the statistical independence of variables.
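
The sign behaviour of the kurtosis can be illustrated with a small Python sketch. It assumes the usual definition for a zero-mean variable, kurt(v) = E{v^4} − 3(E{v^2})^2, estimated with sample averages; the function name `kurtosis` and the two toy samples are hypothetical.

```python
def kurtosis(values):
    """Sample estimate of kurt(v) = E{v^4} - 3 (E{v^2})^2 after centering."""
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]
    m2 = sum(c ** 2 for c in centered) / n  # second moment
    m4 = sum(c ** 4 for c in centered) / n  # fourth moment
    return m4 - 3.0 * m2 ** 2


# A flat (two-valued) sample and a sample sharply peaked at zero.
flat = [-1.0, 1.0] * 50
peaked = [0.0] * 80 + [-2.0] * 10 + [2.0] * 10

kurt_flat = kurtosis(flat)      # negative: flat density
kurt_peaked = kurtosis(peaked)  # positive: density peaked at zero
```

A mass-spectrum component that is zero almost everywhere and contains a few sharp peaks behaves like the second sample, which is why strongly non-Gaussian, peaked waveforms are natural candidates for this contrast.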

In order to maximize and/or minimize the kurtosis, a number of neural algorithms can be chosen (Hyvärinen, 1999). With the natural gradient method, the fixed-point learning rule is used (Hyvärinen and Oja, 1997). During the estimate of the n-dimensional vector w, the learning rule will stop at a fixed point, for which $|w_k^T w_{k-1}|$ is sufficiently close to unity, and the linear combination $w^T \tilde{X}$ will be one of the required ICs.
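
The following Python sketch illustrates a one-unit, kurtosis-based fixed-point iteration of this kind on a toy two-dimensional problem. It is not the FastICA package itself: it assumes the update w ← E{x (w^T x)^3} − 3w followed by normalization, a stopping rule based on the absolute inner product of successive estimates, and data that are already (approximately) white. The function name `fixed_point_unit` and all numerical values are hypothetical.

```python
import math
import random


def fixed_point_unit(X, w, tol=1e-9, max_iter=100):
    """One-unit kurtosis-based fixed-point iteration on whitened samples.

    X: list of samples, each a list of d whitened values.
    w: initial unit-norm weight vector (list of d floats).
    """
    d = len(w)
    n = len(X)
    for _ in range(max_iter):
        # Update: w_new = E{x (w^T x)^3} - 3 w
        w_new = [-3.0 * wi for wi in w]
        for x in X:
            y3 = sum(wi * xi for wi, xi in zip(w, x)) ** 3
            for i in range(d):
                w_new[i] += x[i] * y3 / n
        norm = math.sqrt(sum(v * v for v in w_new))
        w_new = [v / norm for v in w_new]
        # Stop when |w_new^T w| is sufficiently close to unity
        # (the sign of w may flip between iterations).
        converged = abs(sum(a * b for a, b in zip(w_new, w))) > 1.0 - tol
        w = w_new
        if converged:
            break
    return w


# Two independent, unit-variance uniform sources mixed by a rotation,
# so that the mixtures remain approximately white.
rng = random.Random(0)
half_width = math.sqrt(3.0)
sources = [[rng.uniform(-half_width, half_width) for _ in range(2)]
           for _ in range(5000)]
theta = 0.6
cos_t, sin_t = math.cos(theta), math.sin(theta)
mixed = [[cos_t * s1 - sin_t * s2, sin_t * s1 + cos_t * s2]
         for s1, s2 in sources]

w = fixed_point_unit(mixed, [1.0, 0.0])
```

Up to sign, the returned vector should be close to one column of the rotation matrix, so that projecting the mixtures onto it recovers one of the sources.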

After one ICA basis vector is obtained, the other ICA basis vectors are estimated sequentially by searching for each new basis vector in the subspace orthogonal to the one spanned by the previous ones (Hyvärinen and Oja, 1997). The vectors obtained in this way are used to create the matrix $\tilde{W}$. This iterative procedure is repeated until convergence, or until the total number of required ICs is reached.

Since no exact assumption could be made regarding either the expected power or the morphology of signals and noise, the largest number of ICs compatible with the ICA model was chosen, i.e. the number n of mass spectra.

2.4 Peak detection
The presence or absence of peaks made it possible to differentiate the signal components from the artifactual ones. As a consequence, a peak-picking algorithm was necessary for the identification of the protein signal profiles: for this purpose, the LIMPIC algorithm (Mantini et al., 2007) was run for each IC using a variable noise threshold set to 10σ. The LIMPIC method consists of signal smoothing, baseline subtraction, and peak-picking.

2.4.1 Smoothing, baseline subtraction, measurement of noise
The smoothing was performed using a Kaiser filter, with a smoothing factor p properly set in order to cover a range of 5 Da. The baseline drift c was locally estimated from signal blocks having a width of 150 Da; for each of them, the average intensity was calculated, so that a vector w = {w1,w2,...,wN} of amplitude values was generated. Then, w was associated with the vector b = {b1,b2,...,bN} of the m/z values corresponding to the central point of each interval. The values wk, with k = 1,...,N, characterized by rapid intensity variations were considered to be out of the baseline, hence they were disregarded. The baseline drift c was estimated from the remaining points (bk,wk), with k = 1,...,L, by means of a linear interpolation, and then it was removed from the spectrum. The processed spectrum was used for the estimate of the residual noise level σ. This was calculated using the SD gk of the values included in the same blocks considered for the baseline reconstruction; a polynomial interpolation of the points (bk,gk), with k = 1,...,L, was used to obtain σ.

2.4.2 Peak-picking
A peak list was created after the peak-picking phase by finding the local maxima: if the point intensity was the highest among its nearest ± f points, a peak was detected in that position (Yasui et al., 2003); for our data, the parameter f was chosen equal to 2, in order to cover a range of 0.5 Da. The peaks with intensity lower than 10σ were then eliminated from the peak list.
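
The local-maximum rule and the noise threshold can be sketched as follows. This is a simplified illustration and not the LIMPIC implementation: it assumes that the signal has already been smoothed and baseline-corrected, that the local noise level is supplied point by point, and that a strict maximum is required within the ± f window. The function name `pick_peaks` and the toy signal are hypothetical.

```python
def pick_peaks(intensity, sigma, f=2, k=10.0):
    """Return indices of strict local maxima within +/- f points
    whose intensity is at least k times the local noise level."""
    peaks = []
    n = len(intensity)
    for i in range(n):
        lo = max(0, i - f)
        hi = min(n, i + f + 1)
        # The point must be strictly higher than all its neighbours.
        is_local_max = all(
            intensity[i] > intensity[j] for j in range(lo, hi) if j != i
        )
        if is_local_max and intensity[i] >= k * sigma[i]:
            peaks.append(i)
    return peaks


signal = [0, 1, 0, 2, 30, 3, 0, 1, 12, 1, 0, 0]
noise = [1.0] * len(signal)

peaks_default = pick_peaks(signal, noise)                   # threshold 10
peaks_strict = pick_peaks(signal, [1.5] * len(signal))      # threshold 15
```

With a unit noise level, the local maxima at indices 4 and 8 both exceed the 10σ threshold; raising the noise level to 1.5 leaves only the larger one.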

2.5 Biomarker identification
In order to obtain protein profiles from the experimental MALDI-TOF MS data acquired from serum and plasma, the FastICA algorithm was run using Xser and Xplas, respectively. Conversely, the use of Xtot allowed the simultaneous estimation of protein signal profiles with similar or different abundance in the two groups. After the separation of the ICs, these were statistically analyzed in order to find the components that showed a significant difference in protein concentration between serum and plasma. For this purpose, the Mann–Whitney U-test (Mann and Whitney, 1947) was performed for each IC, using the values contained in the columns of the matrix A. The Mann–Whitney U-test was chosen because it does not assume that the two distributions are Gaussian, while still being able to assess the alternative hypothesis that they are statistically different. The Benjamini–Hochberg method was used for multiple testing correction (Benjamini and Hochberg, 1995). The test was considered significant at the chosen significance level (α = 0.05). This approach was intended for biomarker discovery: when analyzing clinical samples from healthy and diseased populations, the protein signal profiles with imbalanced intensity across the two groups of mass spectra could be considered to be protein biomarkers.
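
The multiple-testing step can be sketched in Python as follows. The sketch assumes that one raw P-value per IC is already available (for instance from a Mann–Whitney U-test on the serum and plasma amplitudes in the corresponding column of A) and applies the standard Benjamini–Hochberg step-up adjustment. The function name `benjamini_hochberg` and the P-values are hypothetical and are not taken from the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P-values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P-values for four ICs.
raw_p = [0.01, 0.04, 0.03, 0.005]
adjusted_p = benjamini_hochberg(raw_p)

alpha = 0.05
significant_ics = [i for i, q in enumerate(adjusted_p) if q < alpha]
```

ICs whose adjusted P-value falls below α would then be retained as candidates showing a different abundance between the two groups.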


    3 RESULTS
Four IC waveforms separated with FastICA from the synthetic dataset are shown in Figure 3, along with a portion of a mass spectrum included for comparison. The correspondence of the peaks in the IC waveforms with those in the mass spectrum can thus be assessed; it is also evident that even the outlier peak at 18 kDa is separated as a component. The component amplitudes across the mass spectra can be observed in Figure 4, where this information is provided in the form of bar plots for five ICs: ICs #4 and #25 are single-peak components, and IC #13 is a double-peak component; they have variable and randomly distributed amplitudes across the mass spectra. Conversely, IC #28 can be directly identified from the associated bar plot as the signal corresponding to the outlier peak, also shown in Figure 3. IC #43 is an example of a noise component, which is separated by FastICA because it is statistically independent of the protein signal profiles. The noise components can be discriminated from the signal components by the LIMPIC algorithm for peak detection, since the noise components do not contain any peak. Therefore, the peak-picking method applied to the separated ICs made it possible to perfectly detect the peaks that were present in the simulated dataset.


Fig. 3. The peak positions in the ICs correspond to those in the m/z window 14.6–18.4 kDa of mass spectrum #33. The complete spectrum #33 is shown in Figure 1. The IC amplitudes are in arbitrary units (a.u.) because the signals are normalized.

 

Fig. 4. Samples of 5 out of the 60 ICs separated from the simulated MALDI-TOF mass spectra. For each IC, the plot of the waveform is shown, as well as the bar plot of the corresponding amplitudes in the mass spectra.

 
The ICA method was also tested on experimental data: FastICA was run with the serum mass spectra, with the plasma mass spectra, and then with all mass spectra. The outcomes in terms of detected peaks and hit-rate, defined as the ratio between the number of peaks using multi-subject data and the average number of peaks detected in the single spectra (Mantini et al., 2007), were compared with those of the in-house LIMPIC algorithm, and of the commercial algorithms APEX and CENTROID (Table 1). The hit-rate of the ICA method is always equal to unity, because no false positives are detected with this approach; on the other hand, the peak-picking performance is quite poor when using the separate serum and plasma groups, and is similar to that of LIMPIC only for the joint dataset. APEX and CENTROID detect a larger number of peaks, but their hit-rate is lower than that of LIMPIC.


 
Table 1. Performance comparison of peak identification algorithms

 
The ICs separated from the mixed dataset were qualitatively and quantitatively analyzed: six of them are illustrated in Figure 5.


Fig. 5. Samples of 6 out of the 60 ICs separated from the serum and plasma MALDI-TOF mass spectra. For each IC, the plot of the waveform is shown; in addition, two bar plots containing the corresponding amplitudes for the serum and plasma mass spectra, respectively, are provided.

 
IC #1 is the component with the largest power, and has similar intensities in the serum and plasma groups; IC #4 has the characteristics of a signal with an outlier peak; IC #19 has all amplitudes close to unity, because it corresponds to the myoglobin protein (H+ = 16 952.25 Da) that is used for the signal calibration. ICs #35 and #45 are double-peak components, for which the Mann–Whitney U-test is significant at the chosen significance level (P < 0.05, corrected). IC #53 seems to be a biological artifact: it has larger amplitudes for the serum group than for the plasma group (P < 0.001), but it is classified as a disturbance, because no peak above the noise threshold is detected in the associated waveform. The P-values for the protein ICs that were significantly different between the two sample groups are provided in Table 2. Specifically, the peaks that are associated in the literature with apolipoproteins CI, CII, CIII (Bondarenko et al., 1999) were observed to be differentially expressed in plasma and sera, in accordance with previous findings on human body fluids in the presence of inflammatory auto-immune diseases (Hortin, 2006).


 
Table 2. List of the protein ICs characterized by a statistically significant difference between the serum and plasma groups

 

    4 DISCUSSION
4.1 ICA decomposition
ICA has been extensively used for signal extraction tasks in the field of biomedical signal processing (James and Hesse, 2005). The separation of underlying signals that can be performed with this technique provides a tool for high-dimensional spectrometric data processing, where the signals of interest are generally contaminated by artifacts of both biological and physical origin: ICA proved able to extract reliable protein profiles using multi-subject MALDI-TOF mass spectra acquired in the same m/z range, which had been previously calibrated and normalized. It successfully separated the underlying signals contained in the mass spectra; in particular, background noise and outlier peaks could be identified, and the real protein signals globally showed the same peaks contained in the mass spectra, with an increased signal-to-noise ratio (SNR). For this reason, it can be integrated with existing computational methods for peak detection, and can enhance their effectiveness.

ICA also has an intrinsic limitation: the optimal number of independent signals mixed in the mass spectra is unknown. Unless a dimension reduction procedure is previously performed (Yang and Wang, 1999), the typical ICA model assumes that the number of underlying signals is at most equal to the number of mass spectra (Hyvärinen et al., 2001; Stone, 2004): in this case, the ICA decomposition could generally produce a residual number of non-relevant ICs, depending on the correct number of protein profiles. As a consequence, the identification of the ICs becomes a difficult task.

4.2 Peak detection
The presence of peaks in the ICs can be considered as the main indicator for the identification of protein signal profiles. As a result, the integration of a peak-picking technique is required, not only for the characterization of the peaks contained in the signals produced by ICA, but also for the classification of the ICs into two different groups: the protein signals and the artifacts. By setting a conservative noise threshold equal to 10σ, we are able to automatically discriminate the disturbances. Conversely, 1–3 peaks are present in the ICs corresponding to real protein signals. With regard to the IC waveform, it is worth noting that the proteins associated with the peaks in a single component can be assumed to be dependent, because the increase or the decrease of their intensity is proportional across the analyzed mass spectra.

ICA provides signals that are present with different abundance in all spectra, and the peaks are detected in protein signals, which are largely uncontaminated by noise. As a result, we can assume that there are no false positives in the peak-picking procedure, and the hit-rate is always equal to unity. By contrast, the number of detected peaks is generally lower than that of other methods when using a limited number of mass spectra from serum or plasma (Table 1); the number of peaks is similar to that of LIMPIC only for the joint dataset corresponding to serum and plasma samples. We can infer that the main limitation of the proposed system is the requirement of a large number of mass spectra for achieving a sufficient number of detected peaks. This finding is consistent with the results obtained from the application of the ICA algorithms in other research fields (Hyvärinen et al., 2001; James and Hesse, 2005).

4.3 Biomarker discovery
The opportunity of isolating signal components, combined with the availability of amplitude values associated with single mass spectra, permits the use of ICA for biomarker discovery. The non-parametric Mann–Whitney U-test has been used for assessing whether two distributions come from the same population. It does not require that the distributions are Gaussian. When the test performed using the IC amplitudes for the two groups is significant, it is possible to affirm that the relative abundance of the proteins associated with the specific IC is different. This approach has been validated with the groups related to serum and plasma samples. We have found a considerable number of protein peaks whose relative abundances are statistically different between the two groups. The specific information on the m/z associated with the detected peaks is particularly valuable for the identification and characterization of the corresponding proteins. When the analyzed MALDI-TOF MS dataset is from clinical samples, these proteins can be considered potential biomarkers. Since ICA does not need any parameter tuning for separating reliable protein peaks from noise, this approach can be considered more robust than other methods for biomarker discovery (Coombes et al., 2005; Mantini et al., 2007).


    5 CONCLUSION
This is the first time that ICA has been used for the decomposition of protein signal profiles from MALDI-TOF mass spectra. Data analysis is particularly difficult in this setting, because weak protein signals, represented by the true peaks in the acquired spectra, are generally contaminated by noise of biological and physical origin. A system for the reliable separation of protein signals has been developed and used to improve classical peak detection methods. Although further verification on a larger clinical population is required, our findings suggest that the proposed system could be more effective than other open-source and commercial algorithms when a sufficiently large number of mass spectra is available for analysis. In addition, the quantitative information on peak intensity extracted with ICA could be used for the recognition of significant protein profiles by means of advanced statistical tests. With a view to the routine clinical use of MALDI-TOF mass spectrometry, the proposed system might represent a further step toward the optimization of a standardized procedure for the automatic recognition of disease-state biomarkers.


    ACKNOWLEDGEMENTS
We are particularly grateful to Enzo Ballone and Gian Luca Romani for continuous support and scientific discussion.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Anna Tramontano

Received on July 20, 2007; revised on October 16, 2007; accepted on October 16, 2007

    REFERENCES

    Back AD, Weigend AS. A first application of independent component analysis to extracting structure from stock returns. Int. J. Neural Syst (1997) 8:473–484.

    Bell AJ, Sejnowski TJ. An information-maximization approach to blind separation and blind deconvolution. Neural Comput (1995) 7:1129–1159.

    Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B (1995) 57:289–300.

    Biroccio A, et al. Differential post translational modifications of transthyretin in Alzheimer's disease: a study of the cerebral spinal fluid. Proteomics (2006) 6:2305–2313.

    Bondarenko PV, et al. Mass spectral study of polymorphism of the apolipoproteins of very low density lipoprotein. J. Lipid Res (1999) 40:543–555.

    Cardoso JF, Souloumiac A. Jacobi angles for simultaneous diagonalization. SIAM J. Matrix Anal. Appl (1996) 17:161–164.

    Coombes KR, et al. Improved peak detection and quantification of mass spectrometry data acquired from surface-enhanced laser desorption and ionization by denoising spectra with the undecimated discrete wavelet transform. Proteomics (2005) 5:4107–4117.

    Diamandis EP. Mass spectrometry as a diagnostic and a cancer biomarker discovery tool: opportunities and potential limitations. Mol. Cell Proteomics (2004) 3:367–378.

    Foley JP. Equations for chromatographic peak modeling and calculation of peak area. Anal. Chem (1987) 59:1984–1987.

    Frigyesi A, et al. Independent component analysis reveals new and biologically significant structures in micro array data. BMC Bioinformatics (2006) 7:290.

    Gras R, et al. Improving protein identification from peptide mass fingerprinting through a parameterized multi-level scoring algorithm and an optimized peak detection. Electrophoresis (1999) 20:3535–3550.

    Hortin GL. The MALDI-TOF mass spectrometric view of the plasma proteome and peptidome. Clin. Chem (2006) 52:1223–1237.

    Hyvärinen A. Fast and robust fixed-point algorithms for independent component analysis. IEEE Trans. Neural Netw (1999) 10:626–634.

    Hyvärinen A, Oja E. A fast fixed point algorithm for independent component analysis. Neural Comput (1997) 9:283–292.

    Hyvärinen A, et al. Independent Component Analysis. (2001) New York, USA: John Wiley & Sons.

    James CJ, Hesse CW. Independent component analysis for biomedical signals. Physiol. Meas (2005) 26:R15–R39.

    Jeffries N. Algorithms for alignment of mass spectrometry proteomic data. Bioinformatics (2005) 21:3066–3073.

    Jung TP, et al. Analysis and visualization of single-trial event-related potentials. Hum. Brain Mapp (2001) 14:166–185.

    Karas M. Matrix-assisted laser desorption ionization MS: a progress report. Biochem. Soc. Trans (1996) 24:897–900.

    Liebermeister W. Linear modes of gene expression determined by independent component analysis. Bioinformatics (2002) 18:51–60.

    Mann HB, Whitney DR. On a test of whether one of 2 random variables is stochastically larger than the other. Ann. Math. Stat (1947) 18:50–60.

    Mantini D, et al. A method for the automatic reconstruction of fetal cardiac signals from magnetocardiographic recordings. Phys. Med. Biol (2005) 50:4763–4781.

    Mantini D, et al. LIMPIC: a computational method for the separation of protein signals from noise. BMC Bioinformatics (2007) 8:101.

    Satten GA, et al. Standardization and denoising algorithms for mass spectra to classify whole-organism bacterial specimens. Bioinformatics (2004) 20:3128–3136.

    Scholz M, et al. Metabolite fingerprinting: detecting biological features by independent component analysis. Bioinformatics (2004) 20:2447–2454.

    Smith D, et al. An analysis of the limitations of blind signal separation application with speech. Signal Process (2006) 86:353–359.

    Stone JV. Independent Component Analysis: A Tutorial Introduction. Bradford Books Series (2004) London, England: MIT Press.

    Yang TN, Wang SD. Robust algorithms for principal component analysis. Pattern Recognit. Lett (1999) 20:927–933.

    Yasui Y, et al. An automated peak identification/calibration procedure for high-dimensional protein measures from mass spectrometers. J. Biomed. Biotechnol (2003) 4:242–248.

    Ziehe A, et al. Artifact reduction in magnetoneurography based on time-delayed second order correlations. IEEE Trans. Biomed. Eng (2000) 47:75–87.

