Exploring the lipoprotein composition using Bayesian regression on serum lipidomic profiles
1 VTT Technical Research Centre of Finland, Espoo, FI-02044 VTT
2 Helsinki University of Technology, Espoo, FI-02015 TKK
3 Helsinki University Hospital, Biomedicum, P.O. Box 700, 000290 Helsinki
4 Department of Medicine, University of Helsinki, Biomedicum, 00290 Helsinki
*To whom correspondence should be addressed.
ABSTRACT
Motivation: Serum lipids have been traditionally studied in the context of lipoprotein particles. Today's emerging lipidomics technologies afford sensitive detection of individual lipid molecular species, i.e. in much greater detail than the scale of lipoproteins. However, such global serum lipidomic profiles do not inherently contain any information on where the detected lipid species are coming from. Since it is too laborious and time-consuming to routinely perform serum fractionation and lipidomics analysis on each lipoprotein fraction separately, this presents a challenge for the interpretation of lipidomic profile data. An exciting and medically important new bioinformatics challenge is therefore how to build on the extensive knowledge of lipid metabolism at the lipoprotein level in order to develop better models and bioinformatics tools based on the high-dimensional lipidomic data now becoming available.
Results: We developed a hierarchical Bayesian regression model to study lipidomic profiles in serum and in different lipoprotein classes. As background data for the model building, we utilized lipidomic data for each of the lipoprotein fractions from 5 subjects with metabolic syndrome and 12 healthy controls. We clustered the lipid profiles and applied a regression model within each cluster separately. We found that the amount of a lipid in serum can be adequately described by the amounts of lipids in the lipoprotein classes. In addition to an improved ability to interpret lipidomic data, we expect that our approach will also facilitate dynamic modelling of lipid metabolism at the individual molecular species level.
Contact: marko.sysi-aho{at}vtt.fi
1 INTRODUCTION
Systematic analysis of gene and molecular function in the context of biological processes such as complex diseases or maintenance of system homeostasis remains a challenge due to the complexity of multi-cellular organisms. Genetic, molecular or environmental interventions aiming to elucidate functional networks generally lead to multifactorial responses and complex phenotypes with no direct correlation with a specific genotype. While biomedical studies would ideally focus on the human as a system, it is generally difficult to obtain sufficient data from humans to unravel pathophysiological phenomena such as human disease.
Linus Pauling and colleagues recognized already in the early 1970s the potential of profiling small molecules in human biofluids for the characterization of disease phenomena (Pauling et al., 1971). Several decades later, metabolomics, the global study of small molecules in biological fluids, tissues and cells, is again gaining recognition as a very sensitive and amplified readout of physiology and therefore a platform of choice for systems biology studies in complex organisms (Kell, 2006; van der Greef et al., 2003). Serum patterns of metabolites reflect to some extent the homeostasis of the organism, so that abnormalities in specific metabolite groups may mirror anomalous responses to environmental alterations or interventions (Oresic et al., 2006). The metabolic phenotype is affected by factors such as lifestyle, nutrition and gut microbiota (Lenz et al., 2004; Nicholson et al., 2005), which is of particular relevance to complex diseases, believed to be due to interactions between genetic factors and the environment.
Although serum metabolic profiles may lead to sensitive biomarkers for different disorders, tracing the metabolic patterns to specific pathophysiological mechanisms remains a challenge. Specifically, we have recently demonstrated how serum lipidomics (i.e. global profiling of lipid molecular species) can be utilized to predict expression of inflammatory genes (Laaksonen et al., 2006), and that obesity, already in its early stages and independent of genetic influences, is associated with deleterious alterations in lipid metabolism known to facilitate atherogenesis, inflammation and insulin resistance (Pietiläinen et al., 2007). Also, bioinformatics methods have been developed by us and others to study the cellular lipid pathways based on lipidomics data (Serhan et al., 2006; Yetukuri et al., 2007). However, in order to establish a functional link between the serum lipid patterns and the tissue- or organ-specific pathophysiological phenomena, one would need to have a thorough understanding of the systemic metabolism of lipid molecular species.
Serum lipids have been traditionally studied in the context of lipoprotein particles, i.e. carriers of lipid molecular species, such as sterols, phospholipids and triglycerides (Vance and Vance, 2004). Complex mathematical models have been developed over the past decades to study their metabolism in the context of human physiology (Adiels et al., 2005; Zech et al., 1979). However, today's emerging lipidomics technologies afford sensitive detection of individual lipid molecular species (Watson, 2006; Wenk, 2005), i.e. in much greater detail than the scale of lipoproteins. Structures of some representative individual molecular lipid species that are usually measured by lipidomic profiling methods are shown in Figure 1. However, global serum lipidomic profiles such as those based on liquid chromatography mass spectrometry (LC/MS) do not inherently contain any information on where the detected lipid species are coming from. Since it is too laborious and time-consuming to routinely perform serum fractionation and lipidomics analysis on each lipoprotein fraction separately, this presents a challenge for the interpretation of profile data in the context of lipid metabolism and human physiology.
A deeper understanding of lipid profiles at the level of individual lipid molecular species, combined with knowledge of the systemic metabolism of these species in the context of their carrier lipoproteins, is of immense medical importance. Western countries face high and increasing rates of cardiovascular disease. It is the number one cause of death and disability in the US and most European countries. Some of the major risk factors, such as diabetes mellitus, are on the rise globally. The recent failure of Torcetrapib (CP-529414, Pfizer), a new and highly anticipated drug developed to treat hypercholesterolemia, suggests that our understanding of lipid metabolism is far from complete, and that conventional clinical lipid measures such as good cholesterol (i.e. HDL-cholesterol) and bad cholesterol (i.e. LDL-cholesterol) are unlikely to capture all subtle and potentially pathogenic changes in serum lipid profiles.
An exciting and medically important new bioinformatics challenge is thus how to build on the extensive knowledge of lipid metabolism at the lipoprotein level in order to develop better models and bioinformatics tools based on the high-dimensional lipidomic data now becoming available. Such models would greatly facilitate the interpretation of the growing number of lipidomics datasets and contribute significantly to domains such as metabolic disorders and cardiovascular diseases, where lipid metabolism plays a central role.
In short, models are needed that can predict lipoprotein lipid profiles from the total serum profiles (Fig. 2). As a step towards this goal, in this article we introduce a hierarchical Bayesian regression model to study the lipid abundances in serum and in different lipoprotein fractions. We also discuss the prediction problem of estimating the amounts of lipids in each of the lipoprotein classes given the amounts of lipids in serum. As background data for the model construction, we utilize data from a comprehensive lipidomics analysis of each of the lipoprotein fractions from subjects with metabolic syndrome and healthy controls. To our knowledge, this is the first bioinformatics effort aiming to bridge lipidomics profiling with the level of lipoprotein metabolism.
2 METHODS
2.1 Overview of data and its processing
The data consist of lipidomic measurements from 17 subjects, 5 of whom progressed to metabolic syndrome (Moller and Kaufman, 2005) and 12 of whom remained healthy. Abundances of lipids were separately measured from the blood serum of each subject and from each of the isolated lipoprotein fractions. The lipoproteins were separated into high density (HDL), intermediate density (IDL), low density (LDL) and very low density (VLDL) fractions. Lipidomic profiling was performed on each of the fractions separately. In addition, the lipid content of the residual mixture (Resid) that remained after lipoprotein extraction was profiled. Clinical aspects of the study and details of the analytical methodology will be reported elsewhere.
The UPLC/MS platform was applied to profile the lipids. In such a setting, the molecules are first separated based on their hydrophobicity using liquid chromatography. The molecules from the biological extract first pass through the analytical column, with the time of passage measured as retention time, and then enter the ion source of the mass spectrometer, where they are ionized and analysed. Each ion is characterized by the number of counts detected by the instrument and by its mass to charge ratio (m/z). After the experiment, the raw spectra are processed through multiple data processing stages, such as peak detection, alignment and normalization (Katajamaa and Oresic, 2005; Katajamaa et al., 2006). Lipids are then identified by their retention time and m/z using a database that has been built in-house based on a large number of UPLC/MS/MS spectra as described previously (Yetukuri et al., 2007). The concentration of each lipid is estimated using internal standard lipid species whose abundances and exact values of m/z and retention time are known. The number of identified lipids varies across the lipoprotein fractions and serum. Since lipoprotein fractions are obtained from serum, a serum sample contains all lipids from the different lipoprotein fractions. However, this fact is not always reflected in the actual measurements. If the amount of a specific lipid molecular species in serum is too small, it may not be detected in the serum sample due to the limit of detection of the analytical platform. The same lipid may be within the limits of detection in one or more of the samples prepared from the extracted lipoprotein fractions, because these fractions are generally less complex than serum and thus the concentration of the particular lipid relative to the other compounds in the sample is higher than in serum.
In such cases, the results of data processing would show zero abundances of some lipids in serum but non-zero abundances of the same lipids in one or more of the lipoprotein fractions. If the pattern of missing values of a specific lipid molecular species in serum and its presence in the lipoprotein fractions is consistent, it may be possible to use such zero abundances in serum for inference about the abundances of lipids in the lipoprotein fractions. In our case the patterns were not sufficiently regular, therefore we decided to exclude lipids with zero abundances in serum from further analysis.
In order to avoid ill-posed inference, we required that a lipid be present in at least one lipoprotein fraction. In total, 210 lipid species satisfied this criterion and were included in the data analysis.
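The two inclusion rules described above can be sketched as a small filter. This is only an illustration under stated assumptions: the function name `select_lipids`, the dictionary layout, the toy abundances, and the choice to count only the four lipoprotein fractions (not the residual mixture) towards the second rule are hypothetical and are not taken from the study.

```python
# Assumption: only the four lipoprotein fractions count towards rule 2.
LIPOPROTEIN_FRACTIONS = ("HDL", "IDL", "LDL", "VLDL")

def select_lipids(serum, fractions):
    """Return the lipid names that pass both inclusion rules.

    serum:     {lipid name: abundance in serum}
    fractions: {lipid name: {fraction name: abundance}}
    """
    kept = []
    for lipid, serum_amount in serum.items():
        if serum_amount <= 0:
            continue  # rule 1: drop lipids with zero abundance in serum
        amounts = fractions.get(lipid, {})
        if any(amounts.get(k, 0.0) > 0 for k in LIPOPROTEIN_FRACTIONS):
            kept.append(lipid)  # rule 2: present in at least one fraction
    return kept

# Toy values, invented for illustration only.
serum = {"GPCho(36:4)": 12.0, "TG(52:2)": 0.0, "GPEtn(O-40:7)": 3.5}
fractions = {
    "GPCho(36:4)": {"HDL": 6.0, "LDL": 4.0},
    "TG(52:2)": {"VLDL": 9.0},
    "GPEtn(O-40:7)": {},
}
selected = select_lipids(serum, fractions)
print(selected)
```

In this toy input the second lipid fails the serum rule and the third fails the fraction rule, so only the first is retained.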
2.2 Bayesian framework
We apply a hierarchical Bayesian regression model, which we call the forward regression model, to explain the abundances of lipid molecular species in serum by lipid abundances in different lipoprotein fractions. A model for the reverse task, i.e. a model that could predict the amounts of a lipid in each of the lipoprotein classes given the amount of the lipid in serum, would also be of high practical importance. Although in this article we introduce and discuss such a model as well, we could only study the forward regression in a quantitative manner due to the scarcity of our data.
We chose to apply a Bayesian regression model for this setting (see, e.g. Chapter 14 of Gelman et al., 2004, or for early works Tiao and Zellner, 1964; Zellner and Chetty, 1965). In general terms, everything in Bayesian inference is expressed by probability distributions. The principle of inference is conceptually simple: first the prior distribution $p(\theta)$ of the model parameters $\theta$ is set; then the likelihood of the observed data $y$, which are assumed to arise from a sampling distribution $p(y \mid \theta)$, is used to update the prior beliefs to form the posterior distribution of the parameters, $p(\theta \mid y) \propto p(y \mid \theta)\, p(\theta)$. The posterior distribution thus expresses beliefs about the values of the model parameters after the data have been observed. All subsequent analysis and predictions are then based on the posterior distribution. For example, if the value of a new datum $y^*$ is of interest, one forms the joint distribution $p(y^*, \theta \mid y) = p(y^* \mid \theta)\, p(\theta \mid y)$ and integrates over the parameters to obtain the predictive distribution of the datum, $p(y^* \mid y) = \int p(y^* \mid \theta)\, p(\theta \mid y)\, d\theta$. Thus, in forming predictions, uncertainties of the model parameters are explicitly and properly taken into account.
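The idea of integrating over the parameters to obtain a predictive distribution can be illustrated with a deliberately simple conjugate example, which is not the model of this article. The sketch assumes normally distributed data with known standard deviation `sigma` and a normal prior on the mean (called `theta` in the code); the function names and all numbers are hypothetical.

```python
import random
import statistics

def posterior_normal_mean(y, sigma, m0, s0):
    """Posterior mean and sd of the mean of normal data with known sd sigma,
    under a N(m0, s0^2) prior (standard conjugate result)."""
    n = len(y)
    precision = 1.0 / s0**2 + n / sigma**2
    m_n = (m0 / s0**2 + sum(y) / sigma**2) / precision
    return m_n, (1.0 / precision) ** 0.5

def predictive_draws(y, sigma, m0, s0, n_draws, rng):
    """Monte Carlo draws of a new datum with the parameter integrated out."""
    m_n, s_n = posterior_normal_mean(y, sigma, m0, s0)
    draws = []
    for _ in range(n_draws):
        theta = rng.gauss(m_n, s_n)            # parameter drawn from the posterior
        draws.append(rng.gauss(theta, sigma))  # new datum drawn given that parameter
    return draws

rng = random.Random(0)
y = [1.2, 0.8, 1.1, 0.9]  # toy observations
m_n, s_n = posterior_normal_mean(y, sigma=0.5, m0=0.0, s0=10.0)
draws = predictive_draws(y, sigma=0.5, m0=0.0, s0=10.0, n_draws=20000, rng=rng)
print(round(m_n, 3), round(statistics.mean(draws), 3))
```

Because each draw of the new datum uses a different posterior draw of the parameter, the spread of `draws` combines the sampling noise with the remaining parameter uncertainty, which is the point made in the text.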
Furthermore, it is conceptually straightforward to invert relationships between different variables of a Bayesian model. For example, if we had a joint posterior distribution for $y_1$ and the parameters $\theta$ given $y_2$, $p(y_1, \theta \mid y_2)$, and if we knew the prior distribution $p(y_2)$ of $y_2$, we could perform the inversion simply by forming $p(y_2 \mid y_1) \propto p(y_2) \int p(y_1, \theta \mid y_2)\, d\theta$. In the context of our current application, $y_1$ corresponds to lipid measurements from serum and $y_2$ to measurements from different lipoprotein fractions. On a physical basis, we know how $y_1$ should depend on $y_2$, and we are interested in future predictions of $y_2$ given $y_1$.
2.3 Regression model
We express the abundances of lipids in concentration units. Concentration is a relative measure of abundance, being dependent on the abundances of other compounds that are in the same medium as the lipid. In the present study, each of the lipoprotein classes and serum form different media and thus the concentration estimates of lipids are not on comparable scales across the fractions. In order to set the concentration numbers across serum and the different lipoprotein fractions on comparable scales, the lipoprotein extracts are diluted. In principle, if dilution rates are properly taken into account, the estimated concentration of a lipid in serum would equal the sum of the estimated concentrations of the lipid in the lipoprotein fractions. In practice, various factors such as sample preparation introduce errors that prevent perfect recovery.
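Assuming that all lipids in serum are bound to lipoproteins or the residual mixture and that these lipoproteins are exclusively captured into the extracted lipoprotein fractions, one can express the amount of a lipid in serum as a linear combination of the amounts of the lipid in the lipoprotein classes and residual mixture

$$y = \sum_{k} \beta^{k} x^{k} \qquad (1)$$

with $k \in \{\mathrm{HDL}, \mathrm{IDL}, \mathrm{LDL}, \mathrm{VLDL}, \mathrm{Resid}\}$, where $y$ is the amount of the lipid in serum, $x^{k}$ its amount in fraction $k$ and $\beta^{k}$ the corresponding weight.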
Due to common structural features of different lipid classes, often regulated by the same enzymes in a class-specific manner, there is a large degree of coregulation in biofluid lipid profiles. As an initial step in model building, one must therefore first cluster lipids based on their correlations as determined from the available data. We then assume that data from different lipid clusters arise from different distributions. Furthermore, previous studies have shown that the variance of lipid measurement errors increases with the abundance of the lipid, i.e. the errors tend to be multiplicative (Sysi-Aho et al., 2007; van den Berg et al., 2006).
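We thus assume the following form for the regression model

$$\log y_{ij} = \log\left( \sum_{k} \exp\left(a_{g}^{k} + h_{j}^{k}\right) x_{ij}^{k} \right) + \epsilon_{ij} \qquad (2)$$

where $y_{ij}$ and $x_{ij}^{k}$ are the measured amounts of lipid $i$ ($i = 1, \ldots, 210$) in sample $j$ ($j = 1, \ldots, 17$) in the serum and in the different lipoprotein classes or the residual mixture $k$, respectively, $a_{g}^{k}$ are the regression coefficients for all lipids $i$ that belong to a lipid cluster $g$ ($g = 1, \ldots, 5$), $h_{j}^{k}$ are subject-wise deviations and $\epsilon_{ij}$ are the errors. We do not pose any restrictions on the possible values of $a$, but use exponents to incorporate the assumption that the amount of a lipid in serum can only increase if the amount of the lipid increases in the lipoproteins. A graphical presentation of the model is shown in Figure 3.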
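One plausible reading of the forward regression model can be sketched numerically. The sketch assumes that the serum abundance is modelled on the log scale as the logarithm of a weighted sum of the fraction abundances, with each weight being the exponential of a cluster-wise coefficient plus a subject-wise deviation. The function name, the dictionary layout and all numbers are hypothetical and serve only to show the effect of the exponentiated coefficients.

```python
import math

def log_serum_mean(x, a, h):
    """Log of the modelled serum abundance for one lipid in one subject.

    x: {fraction: measured abundance of the lipid in that fraction}
    a: {fraction: cluster-wise coefficient (any real number)}
    h: {fraction: subject-wise deviation (any real number)}
    """
    # exp(a + h) is always positive, so every fraction contributes
    # non-negatively whatever the sign of the coefficients.
    total = sum(math.exp(a[k] + h[k]) * x[k] for k in x)
    return math.log(total)

# Toy abundances, invented for illustration only.
x = {"HDL": 4.0, "IDL": 1.0, "LDL": 3.0, "VLDL": 2.0, "Resid": 0.5}
zero = {k: 0.0 for k in x}

# With all coefficients at zero every weight is exp(0) = 1, so the model
# reduces to the plain sum of the fractions.
baseline = math.exp(log_serum_mean(x, zero, zero))

# A negative coefficient down-weights a fraction but never makes its
# contribution negative.
a_down = dict(zero, HDL=-2.0)
lowered = math.exp(log_serum_mean(x, a_down, zero))
print(baseline, lowered)
```

The unrestricted coefficients therefore act as positive recovery factors: increasing the abundance in any fraction can only increase the modelled serum abundance.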
For the errors we assign a t-distribution, which is more robust to outlying values than a normal distribution,

$$\epsilon_{ij} \sim t_{\nu}\left(0, s_{g}^{2}\right) \qquad (3)$$

We assign a half-Cauchy distribution for the scale parameter $s_g$ and a power-law distribution, with $\nu > 2$, for the degrees of freedom parameter $\nu$.
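The robustness argument for the t-distribution can be illustrated by comparing log densities of a small and a large residual under the two error models. This is a sketch using the standard density formulas; the degrees of freedom (4), the unit scale and the residual values are arbitrary choices for illustration, not values from the study.

```python
import math

def normal_logpdf(x, scale):
    """Log density of a zero-mean normal distribution."""
    return -0.5 * math.log(2 * math.pi) - math.log(scale) - 0.5 * (x / scale) ** 2

def student_t_logpdf(x, df, scale):
    """Log density of a zero-centred, scaled Student t distribution."""
    z = x / scale
    return (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
            - 0.5 * math.log(df * math.pi) - math.log(scale)
            - (df + 1) / 2 * math.log1p(z * z / df))

for residual in (0.5, 2.0, 6.0):
    print(residual,
          round(normal_logpdf(residual, 1.0), 2),
          round(student_t_logpdf(residual, 4.0, 1.0), 2))
```

For the largest residual the normal log density is far lower than the t log density, so a single outlying measurement pulls a normal-error fit much harder than a t-error fit.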
We assume that all lipids $i$ that belong to a cluster $g$ have common cluster-wise terms $a_{g}^{k}$, one for each fraction $k$, relating their abundances in lipoproteins and serum. An individual subject $j$, in turn, may deviate from the group by $h_{j}^{k}$. We also investigated whether lipids $i$ within cluster $g$ deviated from the group terms $a_{g}^{k}$ by lipid-specific terms, but the amount of data was insufficient to judge the appropriateness of this addition. Since we also found no marked differences between healthy and non-healthy individuals with the current amount of data, we decided to keep the model as simple as possible and to exclude terms aiming at explaining these differences.
The prior distributions for the group-wise and individual-wise parameters were set to

$$a_{g}^{k} \sim t_{\nu_a}\left(0, s_{a}^{2}\right) \qquad (4)$$

$$h_{j}^{k} \sim t_{\nu_h}\left(0, s_{h}^{2}\right) \qquad (5)$$

Again, the scale parameters $s_a$ and $s_h$ were given half-Cauchy priors, as recommended by Gelman (2006) for scale parameters of the population prior in a hierarchical model. The degrees of freedom parameters $\nu_a$ and $\nu_h$ were given power-law priors. We tested our model with various prior distributions that differed in their scale and found that the results are not sensitive to the width of the prior.
For the cluster membership $g_i$ of each lipid $i$ we used a uniform prior

$$p(g_i = g) = \frac{1}{G}, \qquad g = 1, \ldots, G \qquad (6)$$
The number of clusters, G, could be considered unknown and sampled using reversible jump MCMC (Markov chain Monte Carlo) (Green, 1995). However, with the current amount of data (only 17 individuals) systematic evaluation of the performance of models with different numbers of clusters was not meaningful. Thus, based on exploratory analyses, we fixed G = 5, which provided a sensible balance between the number of parameters and the predictive power of the model. In addition, the five clusters seemed to be meaningful in the sense that lipids belonging to a particular functional group were often assigned to a common cluster.
2.4 Parameter sampling
Sampling from the joint posterior was performed using an MCMC approach, specifically the slice sampling algorithm for each parameter conditionally on the other parameters (see, e.g. Neal, 2003) and the Metropolis algorithm (Robert and Casella, 1999) for updating group indices. For MCMC convergence diagnostics, we used visual inspection of trends, the potential scale reduction method (Gelman, 1996) and the Kolmogorov–Smirnov test (Robert and Casella, 1999). Drawing a thousand samples took approximately 3 h on a personal computer with a 2.4 GHz processor and 2 GB of RAM.
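For readers unfamiliar with slice sampling, a single univariate update in the style described by Neal (2003), with stepping out and shrinkage, can be sketched as follows. This is a generic illustration and not the authors' Matlab implementation; the function names, the interval width and the standard normal target are assumptions made for the example. The stepping-out loops are unbounded, so the sketch is only suitable for proper densities whose slices are bounded.

```python
import math
import random

def slice_sample_step(x0, log_density, width, rng):
    """One univariate slice-sampling update (stepping out, then shrinkage)."""
    # Draw the slice level uniformly under the density at the current point.
    log_level = log_density(x0) + math.log(1.0 - rng.random())
    # Place an interval of size `width` at random around x0, then step out
    # until both ends lie outside the slice.
    left = x0 - width * rng.random()
    right = left + width
    while log_density(left) > log_level:
        left -= width
    while log_density(right) > log_level:
        right += width
    # Sample uniformly from the interval, shrinking it towards x0 after
    # every rejected proposal.
    while True:
        x1 = left + (right - left) * rng.random()
        if log_density(x1) > log_level:
            return x1
        if x1 < x0:
            left = x1
        else:
            right = x1

def run_chain(log_density, x0, n, width, seed):
    """Run n successive slice-sampling updates from the start value x0."""
    rng = random.Random(seed)
    xs, x = [], x0
    for _ in range(n):
        x = slice_sample_step(x, log_density, width, rng)
        xs.append(x)
    return xs

# Toy target: standard normal, known only up to a constant.
samples = run_chain(lambda x: -0.5 * x * x, x0=3.0, n=5000, width=2.0, seed=1)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
print(round(mean, 2), round(var, 2))
```

In a multi-parameter model such an update would be applied to each parameter in turn, with `log_density` being the log of the joint posterior viewed as a function of that parameter while the others are held fixed.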
2.5 Lipidomic analysis
The lipidomic analysis was performed as previously described (Laaksonen et al., 2006). In brief, 10 µl of internal standard mixture and 10 µl of NaCl (0.9%) were added to 10 µl of the sample. Lipids were extracted from the samples with 100 µl of chloroform : methanol (2 : 1) solvent and the sample was homogenized with a glass rod. After vortexing for 2 min and incubating for 1 h at room temperature, the lower layer (approximately 60 µl) was separated by centrifugation at 10 000 rpm for 3 min at room temperature. Ten microliters of labelled standard mixture was added to the lipid extract.
Lipid extracts were analysed on a QToF Premier mass spectrometer (Waters, Inc.) combined with an Acquity Ultra Performance Liquid Chromatography system (UPLC/MS). The column was an Acquity UPLC BEH C18 10 x 50 mm with 1.7 µm particles and was kept at 50°C. The binary solvent system consisted of solvent A, water (1% 1M NH4Ac, 0.1% HCOOH), and solvent B, LC/MS grade (Rathburn) acetonitrile/isopropanol (5 : 2, 1% 1M NH4Ac, 0.1% HCOOH). The gradient started from 65% A/35% B, reached 100% B in 6 min and remained there for the next 7 min. The total run time including a 5 min re-equilibration step was 18 min. The flow rate was 0.2 ml/min and the injection volume 1 µl. The temperature of the sample organizer was set at 10°C. The lipid profiling was carried out in positive ion mode. The data were collected over the mass range m/z 300-2000 with a scan duration of 0.2 sec. The source temperature was set at 120°C and nitrogen was used as the desolvation gas (800 l/h) at 250°C. The voltages of the sampling cone and capillary were 39 V and 3.2 kV, respectively. Reserpine (50 µg/l) was used as the lock spray reference compound (5 µl/min; 10 sec scan frequency).
The raw data was converted into netCDF file format using Dbridge software from MassLynx (Waters, Inc.). The converted data was processed using MZmine software version 0.60 (Katajamaa et al., 2006).
2.6 Lipid nomenclature
Lipids were named according to the LIPID MAPS nomenclature (Fahy et al., 2005). For example, lysophosphatidylcholine with a 16:0 fatty acid chain was named as monoacyl-glycerophosphocholine GPCho(16:0/0:0). In case the fatty acid composition was not determined, the total number of carbons and double bonds was marked. For example, a phosphatidylcholine species GPCho(16:0/20:4) is represented as GPCho(36:4). However, GPCho(36:4) could also represent other molecular species, e.g. GPCho(20:4/16:0) or GPCho(18:2/18:2). An ether-bonded fatty acid is marked by O-, e.g. phosphatidylethanolamine GPEtn(O-40:7) contains an ether bond on one of the fatty acid chains. Examples of a few lipids using this notation are shown in Figure 1.
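The relation between the chain-resolved and the summed notation can be made concrete with a small helper that collapses a name such as GPCho(16:0/20:4) to its total carbon and double-bond count. The function name `sum_composition` and the regular expression are hypothetical, and the sketch only handles the simple `Class(c:d/c:d)` and `Class(O-c:d)` patterns used in this section; names with sphingoid-base prefixes are not covered.

```python
import re

# Lipid class, optional ether marker "O-", then chains written as carbons:bonds.
NAME = re.compile(r"^(?P<cls>[A-Za-z]+)\((?P<ether>O-)?(?P<chains>[0-9:/]+)\)$")

def sum_composition(name):
    """Collapse a chain-resolved lipid name to total carbons:double bonds."""
    match = NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognised lipid name: {name!r}")
    carbons = bonds = 0
    for chain in match.group("chains").split("/"):
        c, d = chain.split(":")
        carbons += int(c)
        bonds += int(d)
    ether = match.group("ether") or ""
    return f"{match.group('cls')}({ether}{carbons}:{bonds})"

print(sum_composition("GPCho(16:0/20:4)"))  # GPCho(36:4)
print(sum_composition("GPCho(18:2/18:2)"))  # GPCho(36:4)
print(sum_composition("GPCho(16:0/0:0)"))   # GPCho(16:0)
```

The first two calls return the same summed name, which is exactly the ambiguity noted in the text: a summed composition does not identify the individual fatty acid chains.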
Common short hand notations: GPCho, glycerophosphocholine; GPEtn, glycerophosphoethanolamine; GPGro, glycerophosphoglycerol; GPA, glycerophosphatidic acid; TG, triacylglycerol; ChoE, cholesteryl ester; SM, sphingomyelin; Cer, ceramide.
3 RESULTS
We studied lipidomic profiles from 17 subjects, 5 of which progressed to metabolic syndrome and 12 that remained healthy. The results of this section were obtained from a simulation run of 9000 MCMC iterations, from which the first 1000 burn-in iterations were removed, and from the remaining iterations a sample from every 10th iteration was saved for further analysis.
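The bookkeeping of burn-in removal and thinning can be checked with a few lines. The function name `thin_chain` is hypothetical, and the choice of which iteration within each block of ten is saved (here the first) is an assumption, since the text does not specify it.

```python
def thin_chain(n_iterations, burn_in, step):
    """Zero-based indices of the iterations kept after burn-in and thinning."""
    return list(range(burn_in, n_iterations, step))

kept = thin_chain(n_iterations=9000, burn_in=1000, step=10)
print(len(kept))  # 800
```

Keeping every 10th of the 8000 post-burn-in iterations leaves 800 saved draws for further analysis.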
3.1 Grouping of serum lipid profiles
It is necessary to group the lipids because it is known that there is a large degree of coregulation in biofluid lipid profiles. If these correlations are strong enough, one can assume that the data of each lipid species that belongs to a particular group are drawn from a common, group-dependent distribution. This assumption reduces the number of parameters in the regression model and thus improves the predictive power of the model. We sampled the group membership of each lipid within the MCMC iterations together with the other parameters of the model. Convergence of the grouping was visually checked from the post-burn-in samples. For most lipids the sampler converged to a uniquely dominant group, but there were some lipids that frequently changed their group. The group memberships, corresponding to the most common group assignment of each lipid in the post-burn-in MCMC iterations, are shown in Table 1.
It is evident from the grouping that the assignment is based on lipid functional class for most lipids, but there are some lipids that do not seem to form coherent functional classes. Incidentally, the lipids that do not belong to a clear functional class within a group were the ones that most frequently changed their group within the post-burn-in samples. Increasing the number of groups improves the purity of the groups, but at the cost of additional parameters. On the other hand, decreasing the number of groups saves parameters but leads to more mixed cluster identities. Our choice of five groups is a compromise between interpretability and the number of parameters.
3.2 Predictions of serum lipid profiles
In order to assess the performance of our model, we drew 800 samples from the posterior predictive distribution of a new datum, obtained by integrating out the model parameters $\theta$, $p(y^* \mid y) = \int p(y^* \mid \theta)\, p(\theta \mid y)\, d\theta$, and plotted their median against the actually observed values (Fig. 4a–e). In each panel, the abundances of lipid species belonging to the particular group are predicted for each person, with the person-wise predictions marked with different colors. We marked the persons instead of different lipids, because it was one of our basic assumptions that lipids within a group are highly correlated and that their values arise from a common distribution. If we excluded the individual level of hierarchy by deleting the subject-specific deviation terms, the point clouds in Figure 4 would look spiky, the spikes corresponding to predictions of different individuals.
Overall, the level of a lipid in serum is well explained by our model using the levels of the lipid in different lipoprotein classes. As a reference model, we used a simple predictor that sums the abundances of a lipid from different lipoprotein fractions to predict the amount of the lipid in serum. The use of this predictor is justifiable if one believes that dilution rate corrected data reliably represents the true concentrations of lipids in serum and the different lipoprotein fractions. Figure 4f shows the percentage reduction in the median of the absolute value of the errors when our linear model is used instead of the reference model.
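The comparison against the reference model can be sketched as follows. The function names and toy numbers are hypothetical, and the sketch assumes that errors are taken as plain differences between observed and predicted abundances; the scale on which the errors were computed in the study (raw or logarithmic) is not stated here.

```python
import statistics

def sum_predictor(fractions):
    """Reference model: serum abundance predicted as the sum over fractions."""
    return sum(fractions.values())

def median_abs_error(observed, predicted):
    """Median of the absolute prediction errors."""
    return statistics.median(abs(o - p) for o, p in zip(observed, predicted))

def percent_reduction(observed, reference_pred, model_pred):
    """Percentage reduction in median absolute error relative to the reference."""
    ref = median_abs_error(observed, reference_pred)
    mod = median_abs_error(observed, model_pred)
    return 100.0 * (ref - mod) / ref

# Toy values, invented for illustration only.
observed = [10.0, 20.0, 30.0]
reference_pred = [
    sum_predictor({"HDL": 7.0, "LDL": 5.0}),
    sum_predictor({"HDL": 16.0, "LDL": 10.0}),
    sum_predictor({"HDL": 20.0, "LDL": 13.0}),
]
model_pred = [11.0, 21.0, 32.0]
reduction = percent_reduction(observed, reference_pred, model_pred)
print(round(reduction, 1))
```

A positive value means that the regression model has a smaller median absolute error than the plain sum of the fractions, while a negative value would favour the reference model.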
3.3 Predictions of individual lipid species within the lipid clusters
Predicted amounts for a randomly selected subset of lipids are shown in Figure 5. The red stars indicate points from persons who developed the metabolic syndrome and the black stars indicate healthy controls. There seems to be no detectable systematic difference between these two groups of subjects. It is possible that with a larger amount of data such differences would emerge. One can also try to find lipid species-specific patterns from the shapes of the observed-predicted point pairs. It is obvious that these shapes differ, but due to the small number of subjects from which to infer, it is less obvious whether these differences arise as a consequence of random sampling or whether they are systematic. Because of this indeterminacy, we did not include a lipid-wise hierarchical structure in our regression model.
3.4 Lipid-wise residuals for each person
In order to check whether all lipid species within the clusters are adequately explained by our regression model, we studied lipid-wise residuals for different persons. Figure 6 shows the prediction errors for 16 subjects. Results for the one remaining subject are similar to those presented. In the figure, different lipid species correspond to different indices on the x-axis and the strongly fluctuating continuous line represents the errors. The dots indicate the cluster memberships of the lipids. It is evident from the figure that the lipid profiles are better explained for some persons than for others. For example, the residuals of subject 4 are low for each lipid, whereas the residuals of subject 10 are high around lipid index 100 and from 150 to 210. However, there does not seem to be any lipid species that is consistently poorly explained by the model.
3.5 Importance of different lipoprotein fractions in explaining the lipid content of serum
In order to trace the source of a serum lipid, i.e. to find out which lipoprotein fractions best explain the abundance of the lipid in serum, we explored the lipid-group-wise regression coefficients of Equation (2) together with the mean abundances of lipids belonging to each lipid group. Figure 7a–e shows the shares by which each of the lipoprotein fractions is predicted to explain the serum lipid content (Pred). This prediction is the average of the product of the measured lipid abundance in fraction k and the corresponding regression coefficient of Equation (2).
Figure 7f shows how the lipids are distributed among the different lipid groups and lipoprotein fractions. Each bar corresponds to the average abundance of lipids that are bound to a particular lipoprotein fraction and that belong to the indicated lipid group. The bar heights are normalized such that the sum over all lipid groups equals one. For example, it can be seen that group one contains lipids that are abundant in all lipoprotein fractions.
4 DISCUSSION
We studied the dependencies between lipid abundances in serum and in the different lipoprotein fractions using a Bayesian hierarchical regression model. The Bayesian approach is flexible, since it is easy to incorporate prior knowledge into the model and it is also conceptually simple to invert a regression model.
The drawback of a Bayesian approach, however, is that the implementation of a model tends to be more complex than in alternative approaches, where point estimates of parameters and their variances are used for inference. As with any other approach, inferences are only as good as the model and the experiment that produced the data, and the risk of mis-specifying or over-fitting a model is the higher the fewer data points are available, the more variables the data includes and the higher the unexplained variation in the data is. In the current study, these risks were high as the number of samples was only 17 (one per subject), whereas the number of variables (lipid species) was 210 with high levels of variation related to each of them. For this reason, we tried to utilize available information efficiently and keep the model as simple as possible but still such that it could provide good results for larger datasets available in the future. We utilized the strong correlations between lipid species belonging to a particular functional class by dividing the lipids into five groups, and assumed that within a group all lipids are drawn from a common distribution. We used a generalized linear regression framework within the lipid groups and set up a hierarchical structure for subject-specific regression coefficients in order to decrease model complexity. We also carefully chose the prior distributions and explored the data to guarantee our assumptions about the forms of the priors were reasonable. Our results indicate that these choices were adequate.
We also built and analysed a reverse regression model, i.e. a model that predicts the abundance of a lipid in the lipoproteins given the amount of the lipid in serum. We concluded that with our current data it is not possible to assess whether our model performs better than a naive reference model that predicts the outcome of a new observation to equal the average of the old measurements. Thus, we decided not to show results from our reverse regression model. However, below we discuss the prediction task and problems related to it, reflecting on our experience from the model we built.
In order to reverse the regression and predict the lipid content in the lipoproteins given the lipid content in serum, one needs to explore the lipoprotein-fraction data and assume a sensible distribution from which these data can arise. For example, assuming normality on the logarithmic scale, one could set

$$\log x_{ij} \sim N\left(\mu_{g}, \Sigma_{g}\right)$$

where $x_{ij}$ is the vector of abundances of lipid $i$ in the lipoprotein fractions of sample $j$, and $\mu_{g}$ and $\Sigma_{g}$ are the lipid-group-wise mean vector and covariance matrix. Our exploratory studies indicated that the normal assumption for the logarithmic data was reasonable for the non-zero values, although high variation in the data made it difficult to assess the validity of this assumption. In addition, the fact that approximately 70% of the measured values were zeros made the assumption of log-normality weaker. If one wants to adhere to the assumption of log-normality, one would need to understand whether the zero values arise due to erroneous peak detections or due to the sensitivity of the platform. Alternatively, one could abandon the log-normal assumption and propose alternative ways to represent the distribution of the data. Regardless of the assumed form for the sampling distribution of the data, it is not possible to validate the performance of a predictor if variation in the data is high and there are only a few samples. In our study, the between-individual variations in the lipid profiles were higher than the systematic variations due to changes in the amount of the lipid in serum. Thus, the systematic trends necessary for prediction were masked by error-like variation. We estimated that for obtaining reasonable results from the reverse regression, the unexplained variation in the data should be decreased to 1/100 of its current magnitude and the number of samples should be increased to hundreds. This is still a technically feasible task for the metabolomics platforms.
In conclusion, even without a predictive model, our proposed Bayesian modelling framework is, to our knowledge, the first effort to bridge information at the level of individual molecular species, as obtained from serum lipidomic profiles, to the level of lipoprotein metabolism. Already at this stage, the model is a valuable tool to assist the interpretation of complex serum lipidomics data in a clinical setting.
Clearly, more clinical lipid profile data, including lipoprotein fractions, will need to become available in the future to build reliable and validated models. However, one can already speculate about the usefulness of such models. For example, a large body of work exists on dynamical models of lipid metabolism at the lipoprotein level (Adiels et al., 2005; Zech et al., 1979). One could thus anticipate applying our approach to model lipid metabolism dynamically at the level of individual molecular species. Such possibilities would make a significant contribution to our understanding of human physiology, cardiovascular diseases and the effects of therapeutic interventions aimed at altering lipid metabolism.
| AUTHOR'S CONTRIBUTIONS |
|---|
M.S.A. participated in drafting the manuscript and developed the statistical model for the analysis in collaboration with A.V., who also implemented the model in Matlab and provided the results. V.R.V. performed the lipid analysis and participated in drafting the manuscript. J.W., R.B., M.R.T., and H.Y.J. initiated the clinical study, provided the samples, and performed lipoprotein fractionations. M.O. initiated the research and participated in drafting the manuscript.
| ACKNOWLEDGEMENTS |
|---|
We thank Janne Nikkilä and Samuel Kaski for discussions and ideas about potential statistical models applicable for the setting of this study. This work is in part supported by the Tekes MASI Programme and by the project Hepatic and adipose tissue and functions in the metabolic syndrome (HEPADIP, see http://www.hepadip.org/), which is supported by the European Commission as an Integrated Project under the 6th Framework Programme (Contract LSHM-CT-2005-018734). The funding sources have not been involved in the design, analysis or interpretation of the results.
Conflict of Interest: none declared.
| REFERENCES |
|---|
Adiels M, et al. (2005) A new combined multicompartmental model for apolipoprotein B-100 and triglyceride metabolism in VLDL subfractions. J. Lipid Res., 46, 58–67.
Betteridge D, et al. (2000) Lipoproteins in Health and Disease, 1st edn. Arnold.
Fahy E, et al. (2005) A comprehensive classification system for lipids. J. Lipid Res., 46, 839–862.
Gelman A. (1996) Inference and monitoring convergence. In: Gilks WR, et al. (eds) Markov Chain Monte Carlo in Practice. Chapman & Hall, pp. 131–144.
Gelman A. (2006) Prior distributions for variance parameters in hierarchical models (comment on article by Browne and Draper). Bayesian Anal., 1, 515–534.
Gelman A, et al. (2004) Bayesian Data Analysis, 2nd edn. Chapman & Hall.
Green P. (1995) Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika, 82, 711–732.
Katajamaa M, Oresic M. (2005) Processing methods for differential analysis of LC/MS profile data. BMC Bioinformatics, 6, e179.
Katajamaa M, et al. (2006) MZmine: toolbox for processing and visualization of mass spectrometry based molecular profile data. Bioinformatics, 22, 634–636.
Kell DB. (2006) Metabolomics, modelling and machine learning in systems biology – towards an understanding of the languages of cells: delivered on 3 July 2005 at the 30th FEBS Congress and 9th IUBMB conference in Budapest. FEBS J., 273, 873–894.
Laaksonen R, et al. (2006) A systems biology strategy reveals biological pathways and plasma biomarker candidates for potentially toxic statin induced changes in muscle. PLoS ONE, 1, e97.
Lenz E, et al. (2004) Metabonomics, dietary influences and cultural differences: a 1H NMR-based study of urine samples obtained from healthy British and Swedish subjects. J. Pharm. Biomed. Anal., 36, 841–849.
Moller D, Kaufman K. (2005) Metabolic syndrome: a clinical and molecular perspective. Annu. Rev. Med., 56, 45–62.
Neal RM. (2003) Slice sampling. Ann. Stat., 31, 705–767.
Nicholson J, et al. (2005) Gut microorganisms, mammalian metabolism and personalized health care. Nat. Rev. Microbiol., 3, 431–438.
Oresic M, et al. (2006) Metabolomic approaches to phenotype characterization and applications to complex diseases. Expert Rev. Mol. Diagn., 6, 575–585.
Pauling L, et al. (1971) Quantitative analysis of urine vapor and breath by gas-liquid partition chromatography. Proc. Natl Acad. Sci. USA, 68, 2374–2376.
Pietiläinen K, et al. (2007) Acquired obesity is associated with changes in the serum lipidomic profile independent of genetic effects – a monozygotic twin study. PLoS ONE, 1, e218.
Robert CP, Casella G. (1999) Monte Carlo Statistical Methods. Springer-Verlag.
Serhan C, et al. (2006) Lipid mediator informatics-lipidomics: novel pathways in mapping resolution. AAPS J., 8, E284–E297.
Sysi-Aho M, et al. (2007) Normalization method for metabolomics data using optimal selection of multiple internal standards. BMC Bioinformatics, 8, e93.
Tiao GC, Zellner A. (1964) On the Bayesian estimation of multivariate regression. J. R. Stat. Soc. B, 26, 277–285.
van den Berg RA, et al. (2006) Centering, scaling, and transformations: improving the biological information content of metabolomics data. BMC Genomics, 7, 142.
van der Greef J, et al. (2003) The role of metabolomics in systems biology: a new vision for drug discovery and development. In: Harrigan GG, Goodacre R (eds) Metabolic Profiling: Its Role in Biomarker Discovery and Gene Function Analysis. Boston, MA: Kluwer Academic Publishers, pp. 171–198.
Vance D, Vance JE. (2004) Biochemistry of Lipids, Lipoproteins and Membranes. Amsterdam: Elsevier B.V.
Watson AD. (2006) Thematic review series: systems biology approaches to metabolic and cardiovascular disorders. Lipidomics: a global approach to lipid analysis in biological systems. J. Lipid Res., 47, 2101–2111.
Wenk MR. (2005) The emerging field of lipidomics. Nat. Rev. Drug Discov., 4, 594–610.
Yetukuri L, et al. (2007) Bioinformatics strategies for lipidomics analysis: characterization of obesity related hepatic steatosis. BMC Syst. Biol., 3, e12.
Zech L, et al. (1979) Kinetic model for production and metabolism of very low density lipoprotein triglycerides. Evidence for a slow production pathway and results for normolipidemic subjects. J. Clin. Invest., 63, 1262–1273.
Zellner A, Chetty VK. (1965) Prediction and decision problems in regression models from the Bayesian point of view. J. Am. Stat. Assoc., 60, 608–616.