Bioinformatics Advance Access originally published online on January 18, 2008
Bioinformatics 2008 24(4):561-568; doi:10.1093/bioinformatics/btm640
© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Reconstruction of genetic association networks from microarray data: a partial least squares approach

Vasyl Pihur, Somnath Datta and Susmita Datta*

Department of Bioinformatics and Biostatistics, University of Louisville, Louisville, KY 40292, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 SYSTEM AND METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Gene association/interaction networks provide vast amounts of information about essential processes inside the cell. A complete picture of gene–gene associations/interactions would open new horizons for biologists, ranging from pure appreciation to successful manipulation of biological pathways for therapeutic purposes. Therefore, identification of important biological complexes whose members (genes and their protein products) interact with each other is of prime importance. Numerous experimental methods exist but, for the most part, they are costly and labor intensive. Computational techniques, such as the one proposed in this work, provide a quick ‘budget’ solution that can be used as a screening tool before more expensive techniques are attempted. Here, we introduce a novel computational method based on the partial least squares (PLS) regression technique for reconstruction of genetic networks from microarray data.

Results: The proposed PLS method is shown to be an effective screening procedure for the detection of gene–gene interactions from microarray data. Both simulated and real microarray experiments show that the PLS-based approach is superior to its competitors both in terms of performance and applicability.

Availability: R code is available from the supplementary web-site whose URL is given below.

Contact: susmita.datta{at}louisville.edu

Supplementary information: Supplementary information is available at http://www.susmitadatta.org/Supp/GeneNet/supp.htm.


    1 INTRODUCTION
1.1 Motivation
Experimental detection of biological association/interaction networks is an expensive and labor-intensive process. The two-hybrid system is the most common method used for detecting physical interactions. Unfortunately, it has been criticized in the literature for its high false-positive discovery rates (Futschik et al., 2007). Nevertheless, using two-hybrid systems, researchers have reconstructed many important biological association/interaction networks in evolutionarily diverse organisms, ranging from as simple as Saccharomyces cerevisiae (Uetz et al., 2000) to as complex as humans (Rual et al., 2005). Mass spectrometry is an alternative experimental approach that has been adapted to the identification of gene and protein complexes on a large scale (Ho et al., 2002). Berggard et al. (2007) discuss these and some other techniques for detecting protein interactions in more detail, contrasting their relative strengths and weaknesses. Although existing knowledge of the interactions between genes and their protein products is limited and incomplete, biologists have received a rare opportunity to take a closer look at the inner mechanisms of multiple processes inside a cell.

Computational approaches to reverse engineering of gene and protein association networks are of great interest to biologists as they allow for an indirect elucidation of relationships between genes and proteins through existing post-genomic high-throughput data. In particular, reconstruction of gene and protein association networks from microarray data has been a dynamic area of research in the last few years. Understanding the properties of naturally occurring biological pathways is the key to successful mathematical modeling of such systems and to developing efficient inference techniques. Numerous approaches originating from different disciplines have been proposed for this purpose in the literature, such as Bayesian networks, auto-regressive models, correlation-based models and clustering techniques, among others. A previous study by Datta (2001) suggests that partial least squares (PLS) regression may in fact be a powerful tool for exploring relationships between genes’ expression profiles, which may translate into biologically meaningful interactions/associations. In this work, we propose a more systematic approach to network construction to uncover the relationships between genes based on PLS regression modeling of their simultaneous expression profiles obtained from microarray studies.

Low correspondence among different high-throughput interaction studies is an issue that speaks for further development of both computational and experimental methods for the reconstruction of biological networks. It has been shown that multiple interaction experiments in budding yeast, for the most part, reveal different interactions. Bader (2003) compared two-hybrid and mass spectrometry experiments for yeast and discovered a relatively small overlap of 387 interactions between them (6% of the two-hybrid data and 1% of the mass spectrometry data). From the practical point of view, this emphasizes some of the shortcomings of these high-throughput techniques and, to some degree, questions the validity of their results. Eight differently collected protein association networks in humans (three based on literature search, three based on known interactions in other organisms, and two based on two-hybrid systems) have been compared by Futschik et al. (2007). They also found relatively little correspondence between the studies, especially the ones coming from different methods, despite the large number of shared proteins. Only 3474 out of about 50 000 interactions were common to any two experiments, 374 interactions were found in three studies, 60 in four, and only 8 in five. These results once again point out that the quality of today's interaction studies requires critical assessment. As more work needs to be done in this area, computational methods may provide the benefit of quick and inexpensive initial screening for gene or protein interactions that could then be verified by more reliable and elaborate techniques.

1.2 Related work
It has been previously demonstrated by Datta (2001) that the PLS regression scores serve as good indicators of biological relationships. In this work, we are building upon this approach by formulating a more systematic framework for inferring biological networks. Numerous network reconstruction techniques from microarray data have been proposed in recent literature, some of which are considered in this article (Basso et al., 2005; Schäfer and Strimmer, 2005a; Yu et al., 2004) for comparison purposes. Among them, the work of Schäfer and Strimmer (2005a) which introduces a novel genetic reconstruction technique based on partial correlation (PC) coefficients is most similar to the proposed PLS-based method. In particular, both methods make use of Efron's (2004) empirical Bayes approach and local false discovery rate (fdr) calculation. Direct comparisons between the PC method and the PLS method are made in Section 3.

1.3 Outline and summary
In Section 2, a brief introduction to the PLS regression is given with the details on computing the scores representing the strength of the association/interaction between any two genes. A short description of the multiple testing procedure based on empirical Bayes and fdr completes this section. Section 3 presents a summary on the performance of the algorithm for both simulated and real datasets and compares it with a number of existing network inference methods. We end the manuscript with Section 4 where some of the general issues facing computational approaches to genetic network reconstruction are considered.


    2 SYSTEM AND METHODS
2.1 Partial least squares regression
Microarray data are often characterized by a large number of genes p for which a relatively small number of measurements N are recorded. Whereas common statistical techniques cannot be directly used on such data, dimension reduction or latent variable methods, such as the PCR (Principal Component Regression) and the PLS, are applicable when N < p.

PLS regression (Brown, 1993; Stone and Brooks, 1990; Wold et al., 1983) can be used to measure the relationship between a given gene (or ORF) i and all other genes using their expression profiles (Datta, 2001). Let $x_i$ be a vector of centered and scaled expression values for gene i (sometimes referred to as the expression profile for gene i), 1 ≤ i ≤ p. For each gene i, a separate regression model is considered

$$x_i = \sum_{j \neq i} b_{ij}\, x_j + \epsilon_i. \tag{1}$$

This is accomplished in PLS by first computing a number v < N of (orthogonal) latent factors $t_i^{(\ell)}$, called components, and then by fitting a linear model

$$x_i = \sum_{\ell=1}^{v} \beta_i^{(\ell)}\, t_i^{(\ell)} + \epsilon_i \tag{2}$$

by the method of ordinary least squares leading to

$$\hat\beta_i^{(\ell)} = \frac{x_i^{T}\, t_i^{(\ell)}}{t_i^{(\ell)T}\, t_i^{(\ell)}}.$$

The components $t_i^{(\ell)}$ are linear combinations of $x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_p$ and are sequentially constructed as follows: for ℓ ≥ 1,

$$t_i^{(\ell)} = \sum_{k \neq i} c_{ik}^{(\ell)}\, x_k^{(\ell)},$$

where

$$c_{ik}^{(\ell)} = \frac{x_i^{T}\, x_k^{(\ell)}}{\left\| X^{(\ell)T} x_i \right\|}$$

and

$$x_k^{(\ell+1)} = x_k^{(\ell)} - \frac{t_i^{(\ell)T}\, x_k^{(\ell)}}{t_i^{(\ell)T}\, t_i^{(\ell)}}\, t_i^{(\ell)}.$$

In the above notation, $X^{(\ell)}$ is a deflated design matrix with columns $x_k^{(\ell)}$ and the starting design matrix $X^{(1)}$ has columns $x_k$, 1 ≤ k ≠ i ≤ p.

Note that the coefficient $c_{ik}^{(1)}$ is the contribution of gene k to the first factor $t_i^{(1)}$ (which has maximum covariance with the expression of gene i amongst all normalized linear combinations of expressions of other genes) and hence a measure of association (or a first order interaction) between genes i and k in the presence of other genes. Similarly, the coefficient $c_{ik}^{(2)}$ is the contribution of gene k to the second factor $t_i^{(2)}$ after accounting for the relationship explained by the first factor and, therefore, it measures a second order interaction between genes i and k in the presence of other genes. The coefficients from the subsequent components can also be interpreted as higher order interactions in the same way. Owing to Equation (2), the overall association/interaction score between genes i and k, in the presence of other genes, is then given by $\tilde S_{ik} = \sum_{\ell=1}^{v} \hat\beta_i^{(\ell)} c_{ik}^{(\ell)}$. We use its symmetrized form (obtained by reversing the roles of genes i and k) to measure association/interaction between genes i and k in the presence of other genes:

$$S_{ik} = \frac{\tilde S_{ik} + \tilde S_{ki}}{2}.$$

We compute these measures of association/interaction for each gene pair.
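The recursion described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' R code: it assumes the standard single-response PLS construction (weights proportional to the covariance with the response, normalized to unit length, followed by deflation of the predictors), and the names `pls_scores` and `n_components` are hypothetical.

```python
import numpy as np

def pls_scores(X, n_components=3):
    """Symmetrized PLS association scores between the columns (genes) of X.

    X has shape (N samples, p genes); columns are centered and scaled here.
    Assumes standard PLS1 weights: normalized covariances with the response.
    """
    X = np.asarray(X, dtype=float)
    X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    p = X.shape[1]
    raw = np.zeros((p, p))   # raw[i, k]: contribution of gene k in the model for gene i
    for i in range(p):
        y = X[:, i]
        others = [k for k in range(p) if k != i]
        Xd = X[:, others].copy()          # design matrix, deflated in place
        for _ in range(n_components):
            c = Xd.T @ y                  # covariances with the response
            norm = np.linalg.norm(c)
            if norm < 1e-12:              # nothing left to explain
                break
            c /= norm                     # unit-length weights
            t = Xd @ c                    # latent component
            tt = t @ t
            beta = (y @ t) / tt           # OLS coefficient on this component
            raw[i, others] += beta * c    # accumulate beta * weight over components
            Xd -= np.outer(t, t @ Xd) / tt   # remove the component from every predictor
    return (raw + raw.T) / 2.0            # symmetrize over the roles of i and k
```

Calling `pls_scores(X, n_components=3)` on a samples-by-genes matrix returns a symmetric p × p matrix with zeros on the diagonal; its off-diagonal entries play the role of the pairwise scores that are later passed to the significance analysis.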

The number of latent components v used in computing the above score is a user selectable tuning parameter for our procedure. In Section 3.1, we provide empirical results and a brief discussion about the effects of different number of components on the resulting genetic network.

2.2 Local false discovery rate
After obtaining numerical scores Sik measuring the strength of the relationship between any two genes i and k, we need to determine which of these are statistically significant. More formally, we are faced with testing multiple hypotheses given by


$$H_0: s_{ik} = 0 \quad \text{versus} \quad H_1: s_{ik} \neq 0, \qquad 1 \le i < k \le p,$$

where sik = E(Sik) are the population scores. As a result, a statistical analysis should use a multiplicity adjustment while determining significance. Here, we employ an empirical Bayes technique due to Efron (2004) which uses a local version of the fdr to assess significance.

In this formulation, we assume that a typical Sik comes from a mixture distribution


$$f(s) = p_0 f_0(s) + p_1 f_1(s),$$

where p0 and p1 are the mixing proportions, and f0(s) and f1(s) are the component densities corresponding to the null (no interaction present) and the alternative (interaction present) densities, respectively. The ratio f0(Sik) / f(Sik) then is an upper bound on the posterior probability of Sik coming from the null density f0(s) (i.e. the interaction between genes i and k is absent) given the value of Sik.
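The upper bound can be checked in one line from Bayes' rule, assuming the two-component mixture f(s) = p0 f0(s) + p1 f1(s) with mixing proportions between 0 and 1:

```latex
\Pr(\text{null} \mid S_{ik} = s)
  = \frac{p_0 f_0(s)}{p_0 f_0(s) + p_1 f_1(s)}
  = \frac{p_0 f_0(s)}{f(s)}
  \le \frac{f_0(s)}{f(s)},
  \qquad \text{since } 0 \le p_0 \le 1 .
```

The bound is tight when p0 is close to one, which is the regime assumed for gene networks where most pairs do not interact.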

In his paper, Efron further addresses the problem of choosing the null density f0(s) and points out that the very presence of multiple hypotheses presents an opportunity to empirically estimate the correct null hypothesis. This means that both f(s) and f0(s) are empirically estimated from the data and the ratio of the two estimates, $\hat f_0(S_{ik}) / \hat f(S_{ik})$, quantifies the ‘likelihood’ of Sik coming from the null distribution, provided the mixing proportion p0 is close to one. If the value is small, we can conclude that the interaction between genes i and k is present. In other words, we borrow evidence (or strength) of an interaction being present or absent from data on all remaining interactions as well.

To this end, we perform the following routine suggested by Efron:

  1. Nonparametrically estimate the mixture density f(s) from the estimated values Sik obtained using PLS calculations as described earlier. It is achieved by a smooth curve fit to the normalized histogram of the Sik.
  2. Parametrically estimate the null density f0(s) by assuming a normal distribution. Estimate the mean and the variance parameters of the normal distribution by the method of maximum likelihood (ML).
  3. Compute fdr(Sik) = $\hat f_0(S_{ik}) / \hat f(S_{ik})$.
  4. Identify significant interactions, which are the cases whose fdr(Sik) is smaller than a chosen threshold value q, for example fdr(Sik) < q = 0.1.
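The four steps above can be sketched as follows. This is a simplified stand-in rather than the locfdr implementation: it assumes a Gaussian kernel density estimate for f(s) in place of a smooth fit to the histogram, and it estimates the normal null from the median and interquartile range in place of maximum likelihood. The function name `local_fdr` and its arguments are hypothetical.

```python
import numpy as np

def local_fdr(scores, grid_size=512):
    """Rough local fdr estimate f0(s) / f(s) for a vector of scores."""
    s = np.asarray(scores, dtype=float)
    n = s.size
    q25, q50, q75 = np.percentile(s, [25, 50, 75])
    sigma0 = (q75 - q25) / 1.34          # robust scale for the null (assumption)

    # Step 1: mixture density f by a Gaussian KDE evaluated on a grid
    h = 0.9 * min(s.std(ddof=1), sigma0) * n ** (-0.2)   # Silverman-type bandwidth
    grid = np.linspace(s.min(), s.max(), grid_size)
    z = (grid[:, None] - s[None, :]) / h
    f_grid = np.exp(-0.5 * z ** 2).mean(axis=1) / (h * np.sqrt(2.0 * np.pi))
    f = np.interp(s, grid, f_grid)

    # Step 2: normal null density centered at the median
    f0 = np.exp(-0.5 * ((s - q50) / sigma0) ** 2) / (sigma0 * np.sqrt(2.0 * np.pi))

    # Step 3: local fdr, capped at one
    return np.minimum(f0 / f, 1.0)

# Step 4 (usage): flag pairs whose local fdr falls below the threshold q
# significant = local_fdr(scores) < 0.1
```

The cap at one reflects that the ratio is used as a probability bound; with a threshold such as q = 0.1 only scores far out in the tails of the null are flagged.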

The assumption made in this procedure is that the proportion of Sik's coming from the null distribution p0 is relatively large (very close to one). This assumption is valid in our context where we assume that most genes do not interact with each other and only a small fraction of all possible interactions is present. Another assumption is that the null distribution is approximately normal, which holds true for the PLS-based scores by the Central Limit Theorem as long as N is moderate to large. We have verified this from the histograms for the simulated as well as real data (see the Supplementary Data). The R package locfdr available at CRAN (http://cran.r-project.org/) provides a convenient interface to the above procedure. Throughout, we used the default ML estimation for the null distribution f0(s) (nulltype = 1) varying only the number of degrees of freedom for fitting the mixture density f(s).

Note that it is possible to obtain estimates of the linear coefficients bij in Equation (1) from the estimated coefficients of the latent factors in Equation (2) by expressing the factors in terms of the x's; see, e.g., Rosipal and Krämer (2006). It is also natural to use the symmetrized estimated coefficient $(\hat b_{ik} + \hat b_{ki})/2$ as a measure of association between genes i and k (Datta and Datta, 2003). However, we noted in a simulation study that a parallel network construction procedure based on these scores performs worse than our procedure, especially for larger numbers of components. One possible reason could be that these scores do not capture the higher order interactions. On the other hand, for smaller numbers of components these scores appear to be very non-normally distributed, resulting in a breakdown of the local fdr procedure. For the real dataset, too, the network constructed from these scores discovered fewer true interactions. Interested readers may see the Supplementary Data for additional details.

We end this section with a brief rationale for using PLS regression to measure the strength of association between two genes. In the case of a single dependent variable (say, gene i) and a single regressor (say, gene j), a widely accepted association measure, the Pearson correlation coefficient between the expression profiles of genes i and j, can be interpreted as the coefficient of a simple linear regression model of one on the other, provided both expression profiles are standardized (i.e. centered and scaled). The PLS-based scores proposed here are also weighted sums of the coefficients of regression models of one normalized gene expression profile on the other genes’ normalized profiles, after taking out the effect of the previously constructed latent variables, when multiple genes are present in the study. One could perhaps use other latent variable regression techniques in this regard as well, notably the more widely used PCR. The two methods differ in the way the latent components are constructed. PCR picks the directions of its principal components along the axes of the largest variability among the predictors with no consideration as to how those components are correlated with the dependent variable. PLS, on the other hand, maximizes the covariance between the dependent and independent variables when choosing its components, trying to explain as much variability as possible in both dependent and independent variables.

Thus, the PLS appears to be more appealing for the purpose of assessing the relationship between genes in the present context.
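A small synthetic illustration of this difference, under assumptions chosen purely for the example: two predictors are nearly identical (so they dominate the variability among predictors), while only a third predictor is related to the response. The first PCR direction then loads on the correlated pair, whereas the first PLS direction loads on the relevant predictor.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
z0, z2 = rng.normal(size=n), rng.normal(size=n)

# Columns 0 and 1 are nearly collinear; only column 2 drives the response.
X = np.column_stack([z0, z0 + 0.1 * rng.normal(size=n), z2])
y = z2 + 0.1 * rng.normal(size=n)

# Standardize, as is done for the expression profiles
X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
y = (y - y.mean()) / y.std(ddof=1)

# PCR: leading eigenvector of X'X, chosen without looking at y
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigenvalues in ascending order
w_pcr = eigvecs[:, -1]

# PLS: unit vector proportional to the covariances between predictors and y
w_pls = X.T @ y
w_pls /= np.linalg.norm(w_pls)

print("first PCR direction (abs):", np.round(np.abs(w_pcr), 2))
print("first PLS direction (abs):", np.round(np.abs(w_pls), 2))
```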


    3 RESULTS
3.1 Simulated data
Since no complete reference set of physical interactions on a genomic or proteomic level exists at this time, we had to resort to a simulation study to obtain estimates of sensitivity and specificity for our procedure and to compare its performance with some commonly used competing techniques mentioned earlier (see Section 1.2). To remain ‘unbiased’ in our comparison with existing computational methods of network inference (reconstruction), we chose a model that is different from the statistical models underlying any of these network inference methods, including ours. For this reason, we simulate a biological network with a known topological structure, along with the corresponding microarray data, using the SynTReN software developed by Van den Bulcke et al. (2006). SynTReN was designed to simulate benchmark microarray datasets for which the underlying biological networks are known, for the purpose of developing and testing new network inference algorithms. Association networks are generated based on existing biological subnetworks and are modeled with Michaelis–Menten and Hill kinetics equations. Numerous tuning parameters are available in the software that allow generating datasets of different sizes and complexity. For our purposes, we kept the default tuning parameters controlling the complexity and noise aspects and only changed the ones pertaining to the size of the dataset being generated. The first microarray dataset we simulated using SynTReN consists of 50 genes with 64 interactions between them and 50 sample points. Results from applying the PLS method to this dataset with q = 0.1 are graphically summarized in Figure 1. The figure was generated using the Cytoscape software available at http://www.cytoscape.org/ (Shannon et al., 2003). At q = 0.1, the PLS method identified 227 interactions as significant, 50 of which were true interactions. The full discovered (i.e., reconstructed) network displaying all 227 interactions can be found on the Supplementary Data web-site.
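As a worked illustration of what these counts imply, the snippet below computes sensitivity, specificity and the proportion of false discoveries for this 50-gene example. It assumes, for the sake of the arithmetic only, that interactions are counted as unordered gene pairs without self-loops, giving 1225 candidate pairs; the variable names are illustrative. Under that assumption sensitivity is 50/64 (about 0.78), specificity is 984/1161 (about 0.85), and 177 of the 227 discoveries (about 78%) are false.

```python
from math import comb

genes = 50
true_edges = 64          # interactions in the simulated network
discovered = 227         # interactions declared significant at q = 0.1
true_positives = 50      # discovered interactions that are true

candidate_pairs = comb(genes, 2)                       # unordered pairs (assumption)
false_positives = discovered - true_positives
true_negatives = (candidate_pairs - true_edges) - false_positives

sensitivity = true_positives / true_edges
specificity = true_negatives / (candidate_pairs - true_edges)
empirical_fdr = false_positives / discovered

print(f"sensitivity = {sensitivity:.4f}")
print(f"specificity = {specificity:.4f}")
print(f"empirical fdr = {empirical_fdr:.4f}")
```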


Fig. 1. Genetic association network simulated by SynTReN consisting of 50 genes. Edges between the nodes represent interactions among the genes, with solid edges indicating the interactions discovered by the PLS method. The network exhibits the hub topology characteristic of many biological complexes. This plot was generated with the Cytoscape application.

 
A number of different techniques aimed at inferring genetic association networks from microarray data have been proposed in the past few years. We have looked at three different algorithms, which are described in Table 1, and compared them with our method. This list is certainly not exhaustive, but it represents different conceptual approaches to the problem of reconstructing genetic networks from microarray data.


Table 1. Alternative genetic network inference methods whose software implementations are freely available in the public domain

 
The PC method was originally proposed by Schäfer and Strimmer (2005a). It is based on graphical Gaussian models (GGMs) characterized by good small-sample inference properties and an exact test of edge inclusion. The strength of genetic interactions is captured by the PC matrix $\Pi = (\pi_{ik})$ whose entries are the correlations between any two genes i and k after accounting for the effects of all other genes. Formally, the PCs are related to the inverse of the standard correlation matrix P and can be computed using the following relationships

$$\Omega = (\omega_{ik}) = P^{-1}$$

and

$$\pi_{ik} = -\frac{\omega_{ik}}{\sqrt{\omega_{ii}\,\omega_{kk}}}.$$

In the same article, the authors address the issue of the covariance matrix not being positive definite (and thus not invertible), which is the case when N < p, by using the Moore-Penrose pseudoinverse to compute the PC matrix $\Pi$ and then implementing bootstrap aggregation (Breiman, 1996), commonly known as bagging, to stabilize it. However, further studies have shown the proposed estimator to be inferior to the covariance shrinkage estimator proposed in Schäfer and Strimmer (2005b). The general form of the estimator in this context is given by

$$\hat P^{*} = \lambda T + (1 - \lambda)\,\hat P,$$

where $\hat P$ denotes the estimate of the covariance matrix P, T denotes the constrained shrinkage target covariance matrix of a lower complexity (usually assuming some form of structural regularity; for example, the identity matrix, equal variances or constant correlations), and $\lambda$ is the shrinkage coefficient which balances the bias-variance tradeoff of the two estimates $\hat P$ (characterized by a relatively large variance) and T (biased due to imposed constraints). After estimating the matrix $\Pi$, an empirical Bayes multiple testing procedure and local fdr calculations are used to determine which PC coefficients are significantly different from zero.
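The two ingredients of the PC method described above can be sketched as follows: partial correlations obtained from the inverse of a correlation matrix, and a linear shrinkage step that makes a singular estimate invertible. This is a minimal sketch with hypothetical function names; in particular, the shrinkage coefficient is simply passed in as a fixed number here, whereas Schäfer and Strimmer (2005b) estimate it from the data.

```python
import numpy as np

def partial_correlations(R):
    """Partial correlation matrix from an invertible correlation matrix R."""
    omega = np.linalg.inv(R)
    d = np.sqrt(np.diag(omega))
    pc = -omega / np.outer(d, d)     # -omega_ik / sqrt(omega_ii * omega_kk)
    np.fill_diagonal(pc, 1.0)
    return pc

def shrink(R_hat, lam, target=None):
    """Linear shrinkage of R_hat towards a target (identity by default)."""
    if target is None:
        target = np.eye(R_hat.shape[0])
    return lam * target + (1.0 - lam) * R_hat

# Chain-like correlation structure: genes 0 and 2 are related only through gene 1
r = 0.5
R = np.array([[1.0, r, r ** 2],
              [r, 1.0, r],
              [r ** 2, r, 1.0]])
print(np.round(partial_correlations(R), 3))
```

In this toy example the ordinary correlation between the first and third variables is 0.25, yet their partial correlation vanishes once the middle variable is accounted for, which is the sense in which PCs target direct rather than indirect associations.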

BANJO is one of the popular Bayesian network methods for inferring both static and dynamic networks (Yu et al., 2004). It searches the ‘network space’ for the best network, which is identified as the one having the best overall network score computed using either the BDE (Bayesian Dirichlet equivalence) or the BIC (Bayesian information criterion) metric. The search methods commonly used to explore the space for optimal solutions are greedy search with multiple random restarts, simulated annealing and genetic algorithms. All of them have their strong points, be it theoretical (guaranteed convergence), practical or both. The original article provides additional results regarding the performance of these different search schemes.

ARACNe (Algorithm for the Reconstruction of Accurate Cellular Networks) can be regarded as an information-theoretic approach (Basso et al., 2005). It computes the mutual information (MI), an information-theoretic measure of how related two genes are, for all pairs of gene profiles. The MI values are estimated using a Gaussian kernel estimator and are then filtered to reduce the number of false-positive interactions.

We refer the readers to the original papers for more detailed descriptions of these algorithms.

To compare the performance of our PLS-based method with the other three methods, we generate 100 datasets consisting of 150 genes and 100 sample points. Again, the corresponding 100 networks are automatically generated by the SynTReN software.

A direct comparison of the PLS and PC methods is possible as both use the empirical Bayes approach to calculate fdr. Table 2 shows the sensitivity and specificity for the two approaches at different levels of nominal fdr q. For this reporting we only present the result of a 3-component PLS (i.e. v = 3). At each nominal fdr level, the PLS method appears to be more sensitive albeit a little less specific than the PC method. Results for the other two methods, ARACNe and BANJO, are shown in Table 3.


Table 2. 100 simulated datasets: 150 genes and 100 samples

 

Table 3. 100 simulated datasets: 150 genes and 100 samples

 
From these tables, one can see that the sensitivity and/or the specificity of these procedures do not cover the entire theoretical range of the unit interval for various choices of the operational parameters. As a result, the ROC curves were drawn covering only the practical range of (1 − specificity, sensitivity) values for each procedure. The plots in Figure 2 provide a convenient graphical summary of the algorithms’ performances.


Fig. 2. 100 simulated datasets: 150 genes and 100 samples. ROC Curves for the four algorithms are compared. The PLS method using 2, 3, 4 and 8-component models, outperforms the other three methods as judged by the RAUC scores.

 
In Figure 2, where 2, 3, 4 and 8-component PLS models are compared, differences between the performances of the first three are not substantial. Increasing the number of components, however, results in inferior performance. We have also calculated the restricted area under the curve (RAUC) for all the curves between the vertical lines at 0 and 0.13. The RAUC scores reported in the legend are the areas under the curves divided by the range over which they were computed, i.e., 0.13. Clearly, the PLS performs better than both PC and ARACNe methods.
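The RAUC normalization described above can be sketched with a trapezoidal rule. The sketch assumes that the horizontal axis of the ROC plot is the false positive rate (1 − specificity), that the supplied curve starts at a false positive rate of 0, and that linear interpolation is acceptable at the cutoff; the function name `restricted_auc` is hypothetical.

```python
import numpy as np

def restricted_auc(fpr, tpr, upper=0.13):
    """Area under the ROC curve on [0, upper], divided by the width `upper`."""
    fpr = np.asarray(fpr, dtype=float)
    tpr = np.asarray(tpr, dtype=float)
    order = np.argsort(fpr, kind="stable")
    fpr, tpr = fpr[order], tpr[order]
    inside = fpr < upper
    # Close the restricted range with an interpolated point at the cutoff
    x = np.append(fpr[inside], upper)
    y = np.append(tpr[inside], np.interp(upper, fpr, tpr))
    area = np.sum(np.diff(x) * (y[1:] + y[:-1]) / 2.0)   # trapezoidal rule
    return area / upper
```

With this normalization a procedure that reaches full sensitivity at a false positive rate of zero scores 1, while a procedure no better than chance scores about half the cutoff, so scores from different methods are comparable on the same restricted range.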

The PLS method requires setting a user-selectable parameter v that controls the number of components (latent variables). Depending on the sample size, it is generally advisable not to take it too large since attempting to estimate a large number of components may result in overfitting and degradation of predictive performance. The Supplementary web-site provides a performance summary for the PLS method with components ranging from 1 to 90 for this simulation example. The general trend observed indicates that the best performance is expected for a relatively small number of components (2 to 5). Increasing the number of components further, however, decreases the number of discovered interactions and, thus, the sensitivity of the procedure, till its performance stabilizes (around v = 30 in this example). Note that unlike regression and classification problems, our use of PLS is unsupervised; it is not possible to have a data based ‘optimal’ selector of v for this problem in the absence of some ‘ground truth’. In the rest of this article we have used v = 3 for our calculations.

In Tables 2 and 3, we also include a column called the ‘empirical fdr’, which reports the proportion of discovered interactions that were not true interactions according to SynTReN, which was used to simulate the network. The discrepancy between the nominal fdr and the empirical fdr presumably arises (amongst other things) from the difference in underlying modeling approaches. In the case of both PLS and PC, the correctness of a discovery refers to a parameter in the respective model which has no direct mathematical relationship with the ‘true’ interactions in the kinetics equations used by SynTReN.

3.2 Extensions of the algorithm
Application of the PLS algorithm to the simulated data clearly demonstrates its ability to identify potential gene–gene interactions. The basic method as described in Section 2, however, can easily be extended to accommodate more complex microarray experiments. These include data collected at selected time points capturing the trend of each gene's expression over time and data collected on several groups representing different experimental conditions.

3.2.1 Dynamic microarray data
Very often researchers are interested in changes of gene expression levels over time as a result of external stimuli or natural biological processes inside the cell. Interacting genes will then likely exhibit similar (or opposite) trends over time while not necessarily having comparable initial expression levels. Intuitively, changes in expression levels between adjacent time points j and j + 1 capture these trend patterns:


$$y_{ij} = x_{i,j+1} - x_{ij}.$$

Therefore, when applying the PLS-based procedure to temporal microarray data, it is reasonable to use expression level differentials yij instead of the original values xij as the input.
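A minimal sketch of this preprocessing step, assuming a genes-by-time-points array whose columns are ordered in time (the function name is illustrative). The two example genes have very different baseline levels but identical trends, so their differentials coincide.

```python
import numpy as np

def time_differentials(X):
    """Differences between adjacent time points; X is genes x time points."""
    X = np.asarray(X, dtype=float)
    return np.diff(X, axis=1)        # column j holds x[:, j + 1] - x[:, j]

X = np.array([[1.0, 2.0, 4.0, 7.0],       # gene with a low baseline
              [10.0, 11.0, 13.0, 16.0]])  # gene with a high baseline, same trend
Y = time_differentials(X)
print(Y)
```

Note that differencing reduces the number of columns by one, so a series with T time points yields T − 1 differentials per gene.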

3.2.2 Use of the group indicator
The second type of microarray experiment records data collected on two or more groups $\mathcal{G} = \{G_1, \ldots, G_M\}$ representing different experimental conditions (tumor versus normal, radiation versus no radiation and so on). Many microarray studies are designed around multiple conditions, attempting to pinpoint essential underlying differences between them. It is potentially beneficial to incorporate this additional grouping variable in the reconstruction of the genetic association network from these data. To the best of our knowledge, none of the other three network inference methods discussed earlier makes use of this additional information.

The potential benefit of introducing the group variable is 2-fold. First, it brings additional available information into the inference process, making it more powerful (enhanced ability to detect interactions). Second, it helps to reduce the number of false positive interactions. We achieve this by first computing the PLS symmetrized coefficients $S_{ik}^{(g)}$ for each group g separately. These are then standardized within each group g


$$Z_{ik}^{(g)} = \frac{S_{ik}^{(g)}}{\hat\sigma^{(g)}},$$

where $\hat\sigma^{(g)} = \mathrm{IQR}^{(g)} / 1.34$ is a robust estimate of the standard deviation, IQR being the interquartile range and 1.34 being the interquartile range for the standard normal distribution. We then compute the overall score for each gene pair by


$$s_{ik} = s^{*} \min_{1 \le g \le M} \left| Z_{ik}^{(g)} \right|,$$

where $s^{*}$ is the sign of $Z_{ik}^{(g^{*})}$ given $g^{*} = \arg\min_{g} |Z_{ik}^{(g)}|$.

The intuitive idea behind this modification of the algorithm is that, having M sets of scores from M groups, we choose the scores which are the closest to 0 (the null). If the interaction is to be declared present, the association has to be observed in all M groups. In that sense, this imposes a stricter restriction on categorizing the interactions as significant. Some variations of this modification are also possible; for example, instead of considering the overall absolute minimum, we may consider some other order statistic, thereby making the requirement less stringent.
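The idea of keeping, for each gene pair, the group score closest to zero can be sketched as below. The sketch assumes that standardization means dividing each group's scores by IQR/1.34 without recentering, which is one reading of the description; the function name and the array layout (one row of pairwise scores per group) are hypothetical.

```python
import numpy as np

def combine_group_scores(scores_by_group):
    """Signed minimum-absolute-value score across groups.

    scores_by_group has shape (M groups, number of gene pairs).
    """
    S = np.asarray(scores_by_group, dtype=float)
    q75, q25 = np.percentile(S, [75, 25], axis=1)
    sigma = (q75 - q25) / 1.34                 # robust scale per group
    Z = S / sigma[:, None]                     # standardized scores
    closest = np.argmin(np.abs(Z), axis=0)     # group whose score is nearest to 0
    return Z[closest, np.arange(Z.shape[1])]   # keeps the sign of the chosen score
```

A pair therefore receives a large overall score only if its standardized score is large in every group; replacing the minimum by another order statistic would relax that requirement.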

The overall scores $s_{ik}$ will not be approximately normally distributed and need to be transformed to satisfy the normality assumption of the fdr procedure. For two groups, the transformation has the following form, with more details in the Supplementary Data


$$z_{ik} = \Phi^{-1}\left(F_{s}(s_{ik})\right)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal N(0,1) and $F_{s}(x)$ is the cumulative distribution function of the $s_{ik}$ scores defined as


$$F_{s}(x) = \begin{cases} 2\,\Phi(x)^{2}, & x < 0 \\ 1 - 2\,(1 - \Phi(x))^{2}, & x \ge 0 \end{cases}$$
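
A minimal sketch of the two-group transformation follows. It assumes that, under the null, the two standardized group scores are independent N(0,1), in which case the signed score closest to zero has cumulative distribution 2Φ(x)^2 for x < 0 and 1 − 2(1 − Φ(x))^2 for x ≥ 0. This closed form is what the independence assumption implies; the function names are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal N(0, 1)


def score_cdf(x):
    """CDF of the signed minimum-magnitude score of two independent N(0,1) draws.

    Assumption: both standardized group scores are independent N(0,1) under the null.
    """
    phi = _STD_NORMAL.cdf(x)
    if x < 0:
        return 2.0 * phi ** 2
    return 1.0 - 2.0 * (1.0 - phi) ** 2


def to_normal_scale(s):
    """Map an overall score to the normal scale via Phi^{-1}(F_s(s)).

    The distribution is symmetric about 0, so positive scores are handled by
    reflection; this avoids F_s(s) rounding to exactly 1 for large s.
    Very extreme scores (far beyond |s| = 30) would still underflow.
    """
    if s > 0:
        return -to_normal_scale(-s)
    return _STD_NORMAL.inv_cdf(score_cdf(s))


for s in (-2.0, -1.0, 0.0, 1.0, 2.0):
    print(s, round(to_normal_scale(s), 3))
```

Because taking the score closest to zero concentrates the null distribution around 0, a given overall score maps to a transformed value of larger magnitude; an overall score of 1, for example, maps to a value between 1.6 and 1.7.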

Often both temporal data and multiple groups are present in a single dataset such as the one considered in the next section. In such cases, both of the above extensions will be useful to consider.

3.3 Real data
We apply the proposed extensions of the algorithm to a subset of microarray data on genomic expression responses to DNA-damaging agents collected and analyzed by Gasch et al. (2001). Both wild-type and mec1 cells (mutants defective in Mec1, which plays the central role in conducting the damage signal) were subjected to two distinct DNA-damaging agents: MMS (the methylating agent methylmethane sulfonate) and ionizing radiation. In the original work, a comparative analysis of the responses to the agents in terms of gene expression was carried out to identify the dependencies within the Mec1 pathway. Out of the original 6167 yeast genes, a subset consisting of 768 genes was selected for our analysis. These were the genes that contained no missing values and whose mean squared errors across profiles (the mean squared deviation of the expression values from $\bar{x}_i$, the mean expression value for gene i) were greater than six. The total number of samples in the data was 30, with equal numbers for wild-type and mec1 cells.
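
The gene selection rule can be sketched as follows. This is an illustration under stated assumptions: the mean squared error is taken to be the average squared deviation from the gene's own mean (dividing by the number of samples), missing values are encoded as None or NaN, and the gene names and values are made up.

```python
import math


def mean_squared_deviation(profile):
    """Average squared deviation of one gene's expression values from their mean."""
    mean = sum(profile) / len(profile)
    return sum((x - mean) ** 2 for x in profile) / len(profile)


def select_genes(expression, threshold=6.0):
    """Keep genes with complete profiles whose variability exceeds the threshold.

    expression: dict mapping gene name -> list of expression values, with
    None or NaN marking a missing measurement (an assumed encoding).
    """
    selected = []
    for gene, profile in expression.items():
        if any(x is None or math.isnan(x) for x in profile):
            continue  # drop genes with any missing value
        if mean_squared_deviation(profile) > threshold:
            selected.append(gene)
    return selected


# Hypothetical toy profiles, not taken from the Gasch et al. data.
toy = {
    "geneA": [-3.0, 3.0, -3.0, 3.0],      # variable and complete: kept
    "geneB": [0.1, -0.1, 0.2, 0.0],       # nearly flat: dropped
    "geneC": [5.0, None, -5.0, 5.0],      # missing value: dropped
    "geneD": [4.0, float("nan"), -4.0, 4.0],  # missing value: dropped
}
print(select_genes(toy))
```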

The genetic association network resulting from applying the extended PLS-based approach to the DNA-damage data is shown in Figure 3. The basic PLS method was first applied to the time point differentials, once for each group (wild-type and mec1 mutant cells), producing two sets of scores. We then computed the overall scores, transformed them and applied the local fdr testing scheme to the transformed scores. The discovered network consists of 111 nodes with 118 interactions between them (identified as significant at q = 0.1).


Fig. 3. Reconstructed association network using the extended PLS-based approach for DNA-damage data. The network is visualized with the Cytoscape software. It consists of 111 nodes with 118 interactions between them. Solid edges represent ‘true’ interactions matched against the BioGrid online repository.

 
Seventeen of the 118 interactions matched existing interactions in the BioGrid online repository, a true discovery rate of about 14.4% (Fig. 4). Osprey (http://biodata.mshri.on.ca/osprey/servlet/Index), the front-end application for accessing the BioGrid database, provides a convenient environment for analyzing and visualizing biological networks (Breitkreutz et al., 2003). The BioGrid repository is available at www.thebiogrid.com/ and is a general repository for interaction datasets. At the present time (accessed on August 29, 2007) it contains 1810 interactions among the 768 genes in our microarray data (Stark et al., 2006). Both physical and genetic interactions are continuously added to the repository, and 13 different organisms are currently represented.


Fig. 4. Seventeen discovered interactions with the matching existing interactions in the BioGrid database. The figure is produced with the Cytoscape application. Different node shapes indicate different GO categories.

 
For the PC method applied to the same DNA-damage data, the 288 most significant (discovered) interactions had to be considered to identify two true-positive interactions in BioGrid.
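
The true discovery rates quoted above amount to counting how many discovered gene pairs also appear in a reference set of known interactions. The sketch below shows one way to do this, treating interactions as undirected; the helper names are hypothetical, and the final two lines only reproduce the arithmetic for the counts reported in the text.

```python
def undirected(edges):
    """Normalize gene pairs so that (a, b) and (b, a) count as the same edge."""
    return {frozenset(pair) for pair in edges}


def true_discovery_rate(discovered, reference):
    """Fraction of distinct discovered edges that also appear in the reference set."""
    found = undirected(discovered)
    known = undirected(reference)
    return len(found & known) / len(found)


# Counts reported in the text: 17 of 118 (PLS) and 2 of 288 (PC).
print(f"PLS: {17 / 118:.1%}")  # 14.4%
print(f"PC:  {2 / 288:.1%}")   # 0.7%
```

Using unordered pairs prevents the same interaction from being counted twice when it is listed in both directions.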

To combat a high rate of false-positive discoveries, one can adapt a number of techniques that associate confidence scores with protein interactions and use them in the context of genetic interactions. Suthram et al. (2006) provide an excellent comparative summary of such probability assignment schemes and conclude that imposing a probabilistic framework on the interactions being ‘true’ is superior to assuming that all interactions identified as present are equally likely to be ‘true’.


    4 DISCUSSION
 
The PLS-based method is shown to be a powerful screening tool for discovering interactions among genes. It clearly outperforms the other three algorithms that we consider in this work; in particular, it seems to be more sensitive than the other three methods on the simulated data. The potential candidates it identifies can be analyzed more critically, either through high-throughput experimental methods such as two-hybrid systems and mass spectrometry or through more reliable small-scale targeted experiments.

When using the PLS method, one has to decide on the number of components to be used. SAS (2004) has a data-based selector for the number of PLS components that minimizes the prediction error sum of squares while fitting a PLS model, although there is no reason to believe that this will translate into optimal performance of the corresponding network inference procedure. Furthermore, it could be computationally demanding (even prohibitive) to do so in our context, since a large number of PLS models have to be fit. In general, due to the unsupervised nature of the problem, it is not possible to have a data-based optimal selector of the number of components for our network inference procedure. We have used three components throughout this work, except in the simulation studies, where various numbers of components were examined.

Another issue that deserves special attention is the fdr level, which determines the cutoff point at which significant interactions are separated from the insignificant ones. We believe that this decision should be based upon the user's judgement, taking into consideration the nature of the experiment, the quality of the available microarray data and the magnitude of the discovered interactions. We have used a nominal value of q = 0.1 in our procedure throughout this work.

The PLS method performs better as the number of available samples increases, since a larger sample size provides more power for detecting significant interactions. Therefore, for better quality results, a relatively large sample size is advised. Nevertheless, the proposed algorithm can be applied to relatively small datasets if necessary. From a computational point of view, the PLS-based algorithm can handle large datasets, as the computing times are not prohibitive.

The main drawback of the proposed PLS-based approach is its high false-positive rate. A substantial number of interactions identified as ‘important’ (or significant) by the algorithm are not ‘true’ interactions. The tendency of the algorithm to form ‘stars’ (as seen in the simulated data example) or completely (or almost completely) interconnected subnetworks prevents it from capturing the ‘true’ topology of biological networks, which exhibits highly connected hubs (central controlling nodes) linked to many peripheral nodes that usually have only a few interactions among themselves. It is thought that many biological networks possess this hierarchical, scale-free topology (Han et al., 2004; Jeong et al., 2000). The proposed extensions to the algorithm seem to alleviate the problem to some extent, as seen in the real data illustration. It should be noted, however, that these drawbacks plague most, if not all, network inference techniques, including the PC method.

It is important to remember that, at present, no complete genetic interaction database exists against which the results obtained for the real data could be evaluated objectively. As was mentioned in the Introduction, the high-throughput techniques that are extensively used to populate such databases have their own shortcomings. Performance results reported here are based on the current knowledge of yeast biological networks as collected in the BioGrid online repository.


    ACKNOWLEDGEMENTS
 
We thank two reviewers for many useful suggestions. This research was partially supported by grants from the United States National Science Foundation to S.D. and S.D. We thank Sarat Daas for bringing the Schäfer and Strimmer (2005a) article to our attention.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Limsoon Wong

Received on September 11, 2007; revised on December 28, 2007; accepted on December 28, 2007

    REFERENCES
 

    Bader JS. Greedily building protein networks with confidence. Bioinformatics (2003) 19:1869–1874.

    Basso K, et al. Reverse engineering of regulatory networks in human B cells. Nat Genet (2005) 37:382–390.

    Berggard T, et al. Methods for the detection and analysis of protein-protein interactions. Proteomics (2007) 7:2833–2842.

    Breiman L. Bagging predictors. Mach. Learn (1996) 24:123–140.

    Breitkreutz BJ, et al. Osprey: a network visualization system. Genome Biol (2003) 4:R22.

    Brown P. Measurement, Regression, and Calibration. (1993) New York: Oxford University Press.

    Datta S. Exploring relationships in gene expressions: a partial least squares approach. Gene Expr (2001) 9:249–255.

    Datta S, Datta S. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics (2003) 19:459–466.

    Efron B. Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J. Am. Stat. Assoc (2004) 99:96–104.

    Futschik ME, et al. Comparison of human protein-protein interaction maps. Bioinformatics (2007) 23:605–611.

    Gasch AP, et al. Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol. Biol. Cell (2001) 12:2987–3003.

    Han J, et al. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature (2004) 430:88–93.

    Ho Y, et al. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature (2002) 415:180–183.

    Jeong H, et al. The large-scale organization of metabolic networks. Nature (2000) 407:651–654.

    Rosipal R, Krämer N. Overview and recent advances in partial least squares. In: Subspace, Latent Structure and Feature Selection. Saunders C, et al., eds. (2006) Heidelberg: Springer-Verlag. 34–51.

    Rual JF, et al. Towards a proteome-scale map of the human protein-protein interaction network. Nature (2005) 437:1173–1178.

    Schäfer J, Strimmer K. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics (2005a) 21:754–764.

    Schäfer J, Strimmer K. A shrinkage approach to large-scale covariance matrix estimation and implications for functional genomics. Stat. Appl. Genet. Mol. Biol (2005b) 4:32.

    Shannon P, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res (2003) 13:2498–2504.

    Stark C, et al. BioGRID: a general repository for interaction datasets. Nucleic Acids Res (2006) 34:D535–D539.

    Stone B, Brooks RJ. Continuum regression: Cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal component regression. J. R. Stat. Soc. B (1990) 52:237–269.

    Suthram S, et al. A direct comparison of protein interaction confidence assignment schemes. BMC Bioinformatics (2006) 7:360.

    Uetz P, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature (2000) 403:623–627.

    Van den Bulcke T, et al. SynTReN: a generator of synthetic gene expression data for design and analysis of structure learning algorithms. BMC Bioinformatics (2006) 7:43.

    Wold S, et al. The multivariate calibration problem in chemistry solved by the PLS method. In: Lecture Notes in Mathematics: Matrix Pencils. Ruhe A, Kågström B, eds. (1983) Heidelberg: Springer-Verlag. 286–293.

    Yu J, et al. Advances to Bayesian network inference for generating causal networks from observational biological data. Bioinformatics (2004) 20:3594–3603.

