Bioinformatics Advance Access originally published online on December 7, 2004
Bioinformatics 2005 21(8):1325-1331; doi:10.1093/bioinformatics/bti160
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Autoregressive modeling of analytical sensor data can yield classifiers in the predictor coefficient parameter space


1Mechanical and Instruments Division, Bioengineering Group Cambridge, MA 02139, USA
2Electronics Division, RF and Communication Systems Group Cambridge, MA 02139, USA
3Electronics Division, Analog Systems Group, The Charles Stark Draper Laboratory Cambridge, MA 02139, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Summary: The analysis of chromatographic data resulting from complex chemical mixtures is challenging. Components may co-elute, causing their signals to overlap. An algorithm that will increase the signal-to-noise ratio so compounds present in low abundance can be better distinguished from noise is useful in this type of analysis. The autoregressive (AR) filter offers the advantage of smoothing chromatograms to increase this ratio, while also offering data compression and increased resolution. Furthermore, this filter can be useful for classification, as the roots of the predictor coefficient vectors represent features present in the data and can therefore be used for pattern recognition. In this paper, we present a novel method for applying AR filtering to chromatogram data. We show that the AR filter outperforms the SavitzkyGolay filter for smoothing noise while retaining important information within chromatograms, and also that AR correlation coefficients have the potential to be used to classify chromatogram data into groups.
Contact: cdavis{at}draper.com
| 1 INTRODUCTION |
|---|
|
|
|---|
The gas chromatographic (GC) data of complex chemical mixtures can be difficult to analyze. With many chemical components present, some will likely elute off of the GC column at the same time, leading to overlapping signals in the data. Furthermore, although some components may be present in high abundance, others may be present only in trace amounts, and the signal may be difficult to distinguish from the instrument background and electronic noise.
To classify data sets from complex chemical mixtures, it is often necessary to use a pattern recognition algorithm to extract important features from the data that correspond to a given state. This algorithm will look at the data and assign it to a class by deciding which class it resembles most closely. Ideally, the probability of the data being assigned to the correct class should approach 1, and the probability of assignment to the incorrect class should approach 0.
If any of the features that help define a given state are components that co-elute or that are present in low abundance in a chromatogram, classification may prove difficult, if the signals are not readily distinguished either from each other or from the surrounding noise. Smoothing of the noise can assist with component detection and pattern recognition by allowing signals to become clearer as compared to the smoothed background. However, it is extremely important that the method used does not diminish information present in the data as this may confound pattern recognition.
Several methods for smoothing have been proposed in the literature, and many books have been written that discuss this with respect to the signal processing field (Bellanger, 1989; Brereton, 2003; Coates, 1975; Willsky, 1979). One of the simplest smoothing methods is the moving average filter. In this filter, each point is replaced by an average of itself with a set number of neighboring points. The noise reduction is greater with more points used for averaging, but the chance that low-energy signals may be obscured is also higher. Another difficulty is that the approximation is linear, and peaks are better fit with polynomial functions. It is possible to use polynomial approximations instead; however, the calculations can be computationally intense, and the end result may not be worth the time spent achieving it (Brereton, 2003). Another group has modified the simple moving average filter by taking into account the mean and the median (Bhargava and Levin, 2002); this method provides a more accurate estimate of the signal compared to a simple moving average, but there is still the possibility of smoothing out low-energy signals that would be important for pattern recognition.
Another popular method for smoothing data is the SavitzkyGolay filter (Savitzky and Golay, 1964). This filter uses least squares prediction to minimize the error between the fitted curve and the actual data. Each data point is re-calculated and expressed as the sum of coefficients. The calculations are simpler, there is no distortion of peak amplitudes and the noise is reduced. However, there is still a concern of decreased resolution, which can be problematic when analyzing a sample that has hundreds of features that may overlap; in this case, the potential loss of resolution is likely not worth the reduction in noise.
One type of filter that has been used to increase resolution is the derivative filter (Brereton, 2003); this filter finds inflection points of the signal and considers them to be the overlapping point of two closely spaced peaks. It has the effect of narrowing peaks, but also tends to amplify noise, and it is a computationally intensive technique.
The National Institute of Standards and Technology created an algorithm for detection of components present in low concentrations in a complex mixture (Stein, 1999). This deconvolution algorithm is part of their freely available software package called AMDIS (Automatic Mass spectral Deconvolution and Identification Software) and is based on a method described previously (Dromey et al., 1976). This method can be successful but relies on a steady level of background ions for correct peak identification.
Another deconvolution algorithm available in commercial software packages is called CODA (component detection algorithm) (Windig et al., 1996). This algorithm can be set to retain only those ions that are not characterized by consistent low-level abundances (considered to be noise). The disadvantage is that the noise is determined by a threshold chosen by the user, and is therefore potentially subjective. Also, since background chromatograms are considered to be smooth (i.e. the abundance level does not vary dramatically over time), any ion chromatograms that may have one true peak but with its remaining signal at a constant, low level will be scored lower and may be discarded. A similar method to this is the calculation of a morphological score (Shen et al., 2001).
Another filter that can be used for smoothing is the autoregressive (AR) filter (Bellanger, 1989; Coates, 1975; Makhoul, 1975; Pardey et al., 1996; Willsky, 1979). This relates the noise to the observed response at a given time and develops a model for the data based on this comparison (Fig. 1). We will demonstrate AR filtering provides enhanced data resolution along with noise reduction and data compression, as applied to GC-MS data. This method is discussed extensively in signal processing literature, but has not been applied to analytical chromatographic data sets. The AR filter provides several benefits, including data compression (Pianykh et al., 1999) and enhanced resolution of the data (Ubeyli and Guler, 2004); this is extremely useful for the deconvolution of overlapping peaks that have co-eluted. Furthermore, AR provides a reduction in signal noise (Ubeyli and Guler, 2004), allowing the peaks to appear more clearly. In addition, AR modeling has the further advantage of providing the opportunity for pattern recognition in parameter space, as has been shown previously in electrocardiograms (Ge et al., 2002). In this paper we apply a novel AR filter approach to analyze GC-MS spectra.
|
| 2 MATERIAL AND METHODS |
|---|
|
|
|---|
The data analysis was performed by codes written in MATLAB (The Mathworks, Inc., Natick, MA) software version 6.5.1.199709 [EC] Release 13. Fourier transforms were accomplished using the fast Fourier transform (FFT) algorithms included with the software. The SavitzkyGolay filter (sgolayfilt) included with the software was also used.
2.1 Sample preparation
2.1.1 Human plasma samples
Whole blood was collected from a 42-year-old male subject with informed consent, and the sample was processed to obtain plasma (CBR Institute for Biomedical Research; Boston, MA). Aliquots of 1 ml were placed into 10 ml borosilicate vials, capped with PTFE/silicone septa and aluminum crimp caps (Agilent, Palo Alto, CA), creating an airtight seal to trap the volatiles produced from the plasma. The vials were frozen at 20°C until chemical analysis of the headspace was performed.
2.1.2 Bacteria headspace samples
GC-MS traces were recorded from concentrated headspace above vegetative bacteria cultures: Escherichia coli, Mycobacterium smegmatis and Bacillus subtilis. The cultures were grown at 37°C in 10 ml of liquid LB media contained in 20 ml borosilicate as described above. The cultures were analyzed after growing in septum-capped vials overnight.
2.2 Chemical analysis of samples
2.2.1 Human plasma sample analysis
One milliliter of headspace above the plasma was analyzed without concentration, using an automated headspace sampler (7694, Agilent) connected to the GC-MS. The vials were removed from 20°C and placed into the headspace sampler: oven temperature of 45°C, loop temperature of 55°C and transfer line temperature of 70°C; event times were set as follows: vial equilibration, 25 min; pressurization, 0.1 min; loop fill, 0.5 min; loop equilibration, 0.1 min; injection, 0.2 min. Vial agitation was low. The GC column used was an HP5-MS (0.25 mm i.d. x 30 m x 0.25 µm film thickness, Agilent). The GC oven profile was 40°C held for 4 min, ramped to 145°C at 15.0°C/min, no final hold. The GC inlet was run in 1:1 split mode at 200°C. Helium was used as the carrier gas with a flow rate of 1.5 ml/min. The MS scanned from 50 to 550 m/z with a threshold of 0. The MS quad temperature was set to 150°C and the MS detector temperature 230°C.
2.2.2 Bacteria headspace sample analysis
Solid phase micro extraction (SPME) concentration of the culture headspaces was performed with a polydimethylsiloxane/ divinylbenzene (PDMS/DVB) 65 µm Bonded Blue fiber (Supelco, Bellefonte, PA). The cultures were removed from the incubator and placed in a 60°C water bath. The fiber was extended into the headspace of the culture and exposed for 1 h. The SPME was retracted, removed from the vial and placed in the inlet of the GC. The GC column was as above. The GC oven profile was 50°C held for 5 min, ramped to 100°C at 25°C/min and held for 4 min, ramped to 150°C at 10°C/min and held for 6 min, then ramped to the final temperature of 205°C at 5°C/min and held for 10 min. The inlet was operated in splitless mode at 250°C and the SPME fiber remained in the inlet for 5 min. The MS scanned from 50 to 550 m/z with a threshold of 30. The carrier gas and MS detector temperatures were as above.
| 3 RESULTS |
|---|
|
|
|---|
An AR filter was applied to each of the data sets, according to the general process illustrated in Figure 2A. The chromatogram is considered to be in the frequency domain, as discussed in more depth in the discussion. The full Hermitian symmetric signal is calculated from this frequency response. This is then used to calculate the power spectrum, which is the Fourier transform of the correlation function. An inverse Fourier transform is taken of the power spectrum to calculate the correlation function (Fig. 2B), which represents the time waveform. The functions X = FFT(x) and x = FFT1(X) implement the discrete Fourier transform and inverse transform pair for vectors of length N by
![]() | (1) |
![]() | (2) |
![]() | (3) |
|
Predictor coefficients are calculated and used as input for the YuleWalker equations to determine the impulse response. The predictor coefficients (an) are calculated from the product of the inverse of the covariance matrix (Rn) with the covariance vector (rn). The covariance matrix is a Toeplitz matrix derived from the correlation function, and the covariance vector is taken from the correlation function based on the first model order:
![]() | (4) |
|
|
The similarity of the model to the recorded data can be calculated by the sum of squared differences at each scan between the chromatograms. The error is calculated using this method for varying model orders to determine the lowest model order that should be used, using the equation: error =
[model(tn) recorded(tn)]2 from n = 1 to the final time of the scan. In a plot of error versus model order (Fig. 5), the error will decrease as the model order increases. This plot will have the same trend for any data set, and can be used to determine an optimal model order: as the model order increases, the error approaches a minimum asymptote. The lowest model order at which this error falls within an acceptable error range can be used, as this will provide the greatest level of data compression with minimal error. Generally, the model order at which this curve approaches the minimum asymptote is ideal to use for smoothing.
|
Similarly, the signal-to-noise ratio (SNR) can be calculated using the following equation, where V is the recorded signal, V-bar is the modeled signal and k is a scaling constant, calculated such that it minimizes the error between the modeled and recorded signals (Ge et al., 2002):
![]() | (5) |
We compared our AR filter to the SavitzkyGolay filter. Figure 6 shows the correlation coefficients from the comparison of filtered data with original data. The model order was allowed to vary for the AR filter; in the SavitzkyGolay filter, the polynomial order was set to 3 and the window size was allowed to vary. As seen in the graph, the trends are different; this is due to the effect the model parameters have on the filters. For the AR filter, a higher model order indicates less data compression due to more points being modeled, and thus a tighter fit of the model to the data. As seen in the figure, as the model order increases, the correlation of the model with the experimental data also increases. Note that the correlation coefficient exceeds 0.9 for model orders that offer data compression in the range of 10:1 to 1:1. On the other hand, the SavitzkyGolay filter uses a window size that represents how many points will be averaged together to create the new smoothed points; thus, the larger the window size, the smoother the data and the farther from the experimental data. For the smallest window size, the correlation of the model to the original chromatogram is almost 1.0. However, with a small window size, little smoothing of the data is occurring and thus the output is almost identical to the input, yielding little advantage of using the filter. As the window size increases, the correlation rapidly drops below 0.9 and remains very low through the remaining windows. The AR filter used thus provides higher correlation of the model to the data with minimal loss of information.
|
The roots of the predictor coefficients that are calculated and used to build the model of the data contain information about features of the raw data. For this reason, they offer a possible basis for pattern recognition and classification. Using a vector of the roots of the predictor coefficients for classification rather than using the raw data offers the benefit of decreased computation time and also the possibility for increased performance of a classification algorithm, since each point contains information about a feature of the raw data, whereas the raw data itself contains both features and noise. When comparing two different chromatograms, the closer the roots of the predictor parameters are to each other the more likely that they are from the same sample. Thus, if there are roots that do not overlap from file to file, these could be used for classification.
The roots of the predictor coefficients for the bacteria headspace files were calculated for seven experiments for each of the three species. The roots from each file were concatenated to create one vector for each species; thus, each vector is comprised of seven replicates from one species and all points are plotted (Fig. 7). Each panel shows a binary comparison between two species. As seen in the expanded version of the plot of E.coli and M.smegmatis in Figure 8, some of the roots do overlap each other, but there are others that can be found clustering by species.
|
|
| 4 DISCUSSION |
|---|
|
|
|---|
The AR filter has been extensively discussed in literature for general signal analysis (Bellanger, 1989; Coates, 1975; Makhoul, 1975; Pardey et al., 1996; Willsky, 1979). The AR filter has the effect of smoothing the noise, and it offers the benefit of providing greater peak resolution, rather than diminishing it. Linear prediction is used in time series analysis; the signal is predicted from linear combinations of previous inputs and outputs. The model built by these predictions is called the AR model, and is also known as the all-pole model. For this model, the predictor coefficients and gain must be determined, and the input is known. If we assume that the input is totally unknown, then we use the method of least squares, which predicts the signal based on a weighted linear summation of past samples. The error between the actual and predicted values is calculated, and prediction coefficients are obtained by minimizing the error.
Linear prediction can be approached from either the time or the frequency domain. Additionally, it is possible to model discrete spectra, those that are recorded at a finite number of frequencies. The discrete-time data can be used to construct continuous-time models with the application of the least squares method (Larsson and Soderstrom, 2002).
The chromatogram is considered at the onset to be in the frequency domain, as the frequency response of an AR filter can accurately represent narrow band peaks using relatively low-order models. In the usual procedure for estimating the predictor coefficients, one solves the linear system in the time domain using the correlation samples. To get the correlation samples, the inverse Fourier transform of the magnitude-squared chromatogram is calculated.
The input signal is proportional to the error, so output signal energy equals that of the original signal, and total energy in the input signal can be specified. White noise is one type of input, assumed to be a sequence of uncorrelated samples with a mean of 0 and a variance of 1. The output forms a stationary random process for a fixed all-pole filter. YuleWalker equations completely specify an all-pole random process. There are several ways to calculate the predictor parameters (Besson, 1995; Choi, 1997; Salami and Sidek, 2003). After the predictor parameters have been calculated, it is important to examine the stability of the filter. If the predictor coefficients are positive definite, the filter will be stable. Also, it is important to check rounding, as any errors will be compounded and greatly affect the correlation matrix integrity. Additionally, as the number of samples increases, the filter will become more stable.
The optimal model order can be determined by an examination of the error at various model orders (Ing and Wei, 2003). The error is a measure of fit but not an absolute measure (Swanepoel and Doku, 2003). A higher model order offers a closer fit but also increases computation time and lowers data compression. When looking at error versus model order (Fig. 5), the optimal model order can be determined where this curve reaches an asymptote of the lowest achievable error (Chen and Mechefske, 2001). If the model order is equal to the total scans acquired, the error will be at a minimum because every scan has been modeled individually and will closely fit, but there is no data compression advantage in this scenario. However, upon examining the graph of error at varying model orders, it is clear that a minimum error can be achieved long before the model order is equal to the number of scans. Choosing an order that approaches this minimum asymptote will offer the best data compression with minimal loss of information.
The predictor parameters are used to build the AR model of the data. The roots of these parameter vectors contain information about the features of the chromatogram. For this reason, they can be used for pattern recognition. Much research has been done to determine if the headspace above bacteria contains volatile organic compounds that can be used to identify the species (Carlier and Sellier, 1989; Labows et al., 1980; Zechman et al., 1986). As the ability to detect microorganisms based on the volatiles they produce is well-documented, we used chromatograms of bacteria headspace for AR modeling. We looked at the roots of the predictor coefficients and compared them among species. Figure 7A shows the roots of the predictor coefficients of the data sets from E.coli and M.smegmatis. The roots of the two species clearly do not have the same pattern. The same can be said for the comparison of E.coli with B.subtilis and M.smegmatis with B.subtilis (Figs 7B and C). This is shown more clearly in Figure 8, where portions of the graph comparing E.coli and M.smegmatis have been expanded for easier viewing. In the upper expanded portion, the roots that closely overlap have been circled. These points would not be useful for pattern recognition because the two species cannot be distinguished at these points. However, in the lower expanded portion, areas where the roots of only one of the species are clustering together have been circled. These areas may prove useful for pattern recognition and classification.
The AR filter offers the advantage of noise reduction and data compression without significant loss of information. Additionally, overlapping peaks can be resolved. Furthermore, it provides the benefit of allowing pattern recognition in the predictor parameter space.
| 5 CONCLUSIONS |
|---|
|
|
|---|
We have presented the use of an autoregressive filter as a method for smoothing chromatographic data with the additional advantage of increasing the resolution of signal-to-noise. The signal-to-noise ratio is increased without broadening of the peaks. This filter is useful for distinguishing features in a chromatogram. Predictor parameters have proved useful in other signal processing areas for pattern recognition and classification, and we show the potential for this method to be used in chromatographic classification as well. AR filtering of GC-MS and other chemical or biological analytical sensor data offer a more accurate mechanism for smoothing data and increasing signal resolution.
| Acknowledgments |
|---|
We would like to thank Roger Wilmarth and Priya Agrawal (Draper) for fruitful discussions on the implications of the signal processing. We would like to thank Dr Charles Larsen, Dr Devendra Dubey, Dr Chester Alper, Alissa Baker and Michelle Ploutz from The CBR Institute for Biomedical Research (Boston, MA) for the human plasma samples. We would also like to acknowledge input from our infectious disease clinician collaborators on bacteria culture conditions and sensing applications in the field of breath analysis.
This work is sponsored in part by DARPA under ARO Contract DAAD1903-C-0069, and also in part by the Department of the Army, Cooperative Agreement DAAD17-02-2-0006. Opinions, interpretations, conclusions and recommendations are those of the authors and are not necessarily endorsed by the United States Government.
| Footnotes |
|---|
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Received on August 30, 2004; revised on November 3, 2004; accepted on November 16, 2004
| REFERENCES |
|---|
|
|
|---|
Bellanger, M. Digital Processing of Signals: Theory and Practice, (1989) 2 edn. , Chichester John Wiley & Sons.
Besson, O. (1995) Improved detection of a random amplitude sinusoid by constrained least-squares technique. Signal Process., 45, 347356[CrossRef].
Bhargava, R. and Levin, I.W. (2002) Effective time averaging of multiplexed measurements: a critical analysis. Anal. Chem., 74, 14291435[Medline].
Brereton, R.G. Chemometrics: Data Analysis for the Laboratory and Chemical Plant, (2003) , Chichester John Wiley & Sons Ltd.
Carlier, J.P. and Sellier, N. (1989) Gas chromatographic-mass spectral studies after methylation of metabolites produced by some anaerobic bacteria in spent media. J. Chromatogr., 493, 257273[ISI][Medline].
Chen, Z. and Mechefske, C.K. (2001) Model order selection: a practical approach. Mech. Syst. Signal Process., 15, 265273.
Choi, B.S. (1997) A recursive algorithm for solving the spatial Yule-Walker equations of causal spatial AR models. Statist. Probab. Lett., 33, 241251.
Coates, R. (1975) Fourier transform methods. In Bogner, R.E. and Constantinides, A.G. (Eds.). Introduction to Digital Filtering, , London John Wiley & Sons, pp. 91137.
Dromey, R.G., Stefik, M.J., Rindfleisch, T.C., Duffield, A.M. (1976) Extraction of mass spectra free of background and neighboring component contributions from gas chromatography/mass spectrometry data. Anal. Chem., 48, 13681375[CrossRef].
Ge, D., Srinivasan, N., Krishnan, S.M. (2002) Cardiac arrhythmia classification using autoregressive modeling. Biomed. Eng. Online, 1, 5[CrossRef][Medline].
Ing, C.K. and Wei, C.Z. (2003) On same-realization prediction in an infinite-order autoregressive process. J. Multivari. Anal., 85, 130155[CrossRef].
Labows, J.N., McGinley, K.J., Webster, G.F., Leyden, J.J. (1980) Headspace analysis of volatile metabolites of Pseudomonas aeruginosa and related species by gas chromatography-mass spectrometry. J. Clin. Microbiol., 12, 521526
Larsson, E.K. and Soderstrom, T. (2002) Identification of continuous-time AR processes from unevenly sampled data. Automatica, 38, 709718[CrossRef].
Makhoul, J. (1975) Linear prediction: a tutorial review. Proc. IEEE, 63, 561580.
McGillem, C.D. and Cooper, G.R. Continuous and Discrete Signal and System Analysis, (1991) 3 edn , Philadelphia Saunders College Publishing.
Pardey, J., Roberts, S., Tarassenko, L. (1996) A review of parametric modelling techniques for EEG analysis. Med. Eng. Phys., 18, 211[CrossRef][ISI][Medline].
Pianykh, O.S., Tyler, J.M., Sharman, R. (1999) Nearly-lossless autoregressive image compression. Pattern Recog. Lett., 20, 221228[CrossRef].
Salami, M.J.E. and Sidek, S.N. (2003) Parameter estimation of multicomponent transient signals using deconvolution and ARMA modelling techniques. Mech. Syst. Signal Process., 17, 12011218[CrossRef].
Savitzky, A. and Golay, M.J.E. (1964) Smoothing and differentiation of data by simplified least squares procedures. Anal. Chem., 36, 16271639[CrossRef].
Shen, H., Grung, B., Kvalheim, O.M., Eide, I. (2001) Automated curve resolution applied to data from multi-detection instruments. Anal. Chim. Acta, 446, 313328[CrossRef].
Stein, S.E. (1999) An integrated method for spectrum extraction and compound identification from gas chromatography/mass spectrometry data. J. Am. Soc. Mass. Spectrom., 10, 770781[CrossRef].
Swanepoel, C.J. and Doku, W.O. (2003) New goodness-of-fit tests for the error distribution of autoregressive time-series models. Comput. Statist. Data Anal., 43, 333340[CrossRef].
Ubeyli, E.D. and Guler, I. (2004) Spectral analysis of internal carotid arterial Doppler signals using FFT, AR, MA, and ARMA methods. Comput. Biol. Med., 34, 293306[CrossRef][ISI][Medline].
Willsky, A.S. Digital Signal Processing and Control and Estimation Theory: Points of Tangency, Areas of Intersection, and Parallel Directions, (1979) , Cambridge The MIT Press.
Windig, W., Phalp, J.M., Payne, A.W. (1996) A noise and background reduction method for component detection in liquid chromatography/mass spectrometry. Anal. Chem., 68, 36023606[CrossRef].
Zechman, J.M., Aldinger, S., Labows, J.N. (1986) Characterization of pathogenic bacteria by automated headspace concentrationgas chromatography. J. Chrom., 377, 4957.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||












= E.coli (n = 7), stars, M.smegmatis (n = 7),
= B.subtilis (n = 7). (A) Comparison of E.coli and M.smegmatis. (B) Comparison of E.coli and B.subtilis. (C) Comparison of M.smegmatis and B.subtilis.