Skip Navigation


Bioinformatics Advance Access originally published online on August 20, 2008
Bioinformatics 2008 24(20):2376-2383; doi:10.1093/bioinformatics/btn440
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/20/2376    most recent
btn440v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Pu, S.
Right arrow Articles by Wodak, S. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Pu, S.
Right arrow Articles by Wodak, S. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Local coherence in genetic interaction patterns reveals prevalent functional versatility

Shuye Pu 1,*, Karen Ronen 1, James Vlasblom 1, Jack Greenblatt 2,3 and Shoshana J. Wodak 1,3,4

1Molecular Structure and Function Program, Hospital for Sick Children, 555 University Avenue, Toronto, ON, Canada M5G 1X8, 2Terrence Donnelly Centre for Cellular & Biomolecular Research, 160 College Street, Toronto, ON, Canada M5S 3E1, 3Department of Molecular Genetics and 4Department of Biochemistry, University of Toronto, 1 King's College Circle, Toronto, ON, Canada M5S 1A8

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Epistatic or genetic interactions, representing the effects of mutating one gene on the phenotypes caused by mutations in one or moredistinct genes, can be very helpful for uncovering functional relationships between genes. Recently, the epistatic miniarray profiles (E-MAP) method has emerged as a powerful approach for identifying such interactions systematically. For E-MAP data analysis, hierarchical clustering is used to partition genes into groups on the basis of the similarity between their global interaction profiles, and the resulting descriptions assign each gene to only one group, thereby ignoring the multifunctional roles played by most genes.

Results: Here, we present the original local coherence detection (LCD) algorithm for identifying groups of functionally related genes from E-MAP data in a manner that allows individual genes to be assigned to more than one functional group. This enables investigation of the pleiotropic nature of gene function. The performance of our algorithm is illustrated by applying it to two E-MAP datasets and an E-MAP-like in silico dataset for the yeast Saccharomyces cerevisiae. In addition to recapitulating the majority of the functional modules and many protein complexes reported previously, our algorithm uncovers many recently documented and novel multifunctional relationships between genes and gene groups. Our algorithm hence represents a valuable tool for uncovering new roles for genes with annotated functions and for mapping groups of genes and proteins into pathways.

Availability: A Java implementation of the LCD algorithm is available at URL http://genepro.ccb.sickkids.ca/biclustering.html

Contact: shuyepu{at}sickkids.ca

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Genetic interactions are non-multiplicative epistatic effects of double mutations on phenotypes. Synthetic enhancement and synthetic suppression have been used widely by geneticists for functional analysis in the yeast, worm and fly (Avery and Wasserman, 1992; Baker and Ridge, 1980; Guarente, 1993). Development of high-throughput techniques, such as synthetic genetic array (SGA) (Tong et al., 2004) and heterozygote diploid-based synthetic lethality analysis with microarrays (dSLAM) (Pan et al., 2007), has enabled genome-wide assessment of epistatic relationships between genes in yeast (Boone et al., 2007; Davierwala et al., 2005; Ye et al., 2005). A powerful variant of the SGA method, the epistatic miniarray profiles (E-MAP) (Collins et al., 2007; Schuldiner et al., 2005) has emerged recently. Unlike conventional SGA and dSLAM, which have mainly been used to detect aggravating (synthetic sick or lethal) interactions, the E-MAP procedure measures both negative (aggravating) and positive (alleviating) interactions quantitatively on a continuous scale and comprehensively among both essential and non-essential genes.

The fact that many genes show either alleviating or aggravating interactions with genes involved in diverse cellular processes strongly suggests that multifunctional genes are prevalent in the cell, and the information-rich E-MAP data are ideally suited for uncovering the multifaceted roles of genes. Currently, E-MAP data are analyzed mainly using hierarchical clustering and related methods, to partition genes into pathways (Schuldiner et al., 2005), divide large protein complexes into functional sub-modules and identify epistasis groups that are otherwise not involved in physical interactions (Collins et al., 2007). Elaborating on an earlier study (Kelley and Ideker, 2005), Ideker and colleagues (Bandyopadhyay et al., 2008), have recently combined E-MAP data with protein–protein interaction data to improve identification of protein complexes and to delineate functional relationships among them. However, none of the above methods is designed to extract information on multiple functions of genes, because they partition genes into disjoint groups.

Biclustering, a relatively new unsupervised learning technique, allows the assignment of individual genes to multiple clusters and, therefore, does not suffer from these limitations. By clustering query genes and library genes simultaneously, this technique defines query gene clusters with respect to subsets of library genes, and vice versa. Biclustering has been used recently for the analysis of gene expression data from DNA microarray experiments (Madeira and Oliveira, 2004; Prelic et al., 2006) and phenotypic data of yeast mutants (Dudley et al., 2005), aiming among other things at finding genes involved in multiple cellular processes. Depending on the bicluster definitions and the applications, different biclustering algorithms produce biclusters with strikingly divergent properties. A survey of available biclustering algorithms indicates that none is appropriate for the analysis of E-MAP data (see Supplementary Material for a brief discussion of limitations of existing biclustering algorithms regarding E-MAP analysis). Therefore, we have developed a novel biclustering algorithm, termed local coherence detection (LCD), which exploits the fact that a group of multifunctional genes may display a tight coherence over a fraction of their interaction profiles if they cooperate in a common process under specific conditions, but behave distinctly otherwise. Unlike hierarchical clustering methods, which rely on correlating global interaction profiles, the LCD algorithm is capable of utilizing both global and local profile similarities and uses a fitness score to gauge the extent of profile coherence among genes in a functional module. Stochastic optimization techniques are employed to efficiently identify biologically meaningful gene clusters with the best fitness score. The statistical significance of the biclusters is established by computing P-values based on comparisons with appropriate random controls (see Section 2 for details).


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 The LCD algorithm
We consider that a group of query genes interact with a library gene in a highly coherent fashion if all these query genes not only display the same type of interaction (either alleviating or aggravating) with the given library gene, but also have similar interaction scores. Let the query and library genes represent respectively, the row and column entries in the E-MAP data matrix, the LCD algorithm aims at finding submatrices such that all values in a given column not only display the same sign (all positive or all negative) but also feature a small variation that is well within the range of stochastic noise. To that end each entry aij of the sub matrix is subjected to the constraint: Formula with Formula representing the average value of the current entries in column j and {varepsilon}j is the stochastic noise associated with that column. Another key requirement is that the considered submatrix must deviate significantly from background noise. Lastly, the size of the submatrix should be non-trivial. A submatrix satisfying these requirements defines a bicluster. Computed biclusters are evaluated using a fitness score. For a bicluster B with p rows and q columns, the fitness score F(B) is defined as follows:


Formula 1

(1)
where j| and {sigma}j are the absolute value of the mean and the SD of the values in column j of bicluster B, respectively. The parameter r > 0 models the background noise, and is taken as that estimated in the experimental analyses. The stochastic noise, albeit unknown, contributes to {sigma}j. We reasoned that the smaller the {sigma}j, the more likely it reflects the stochastic noise; conversely, functionally divergent genes will produce discordant interaction scores, resulting in large {sigma}j; collectively, genes in columns with large noises ({sigma}j+r) are deemed functionally incoherent and such columns should be penalized. The argument of the logarithmic function is essentially a signal/noise ratio; a column is considered coherent if this ratio is greater than 1. The fitness score is proportional to the number of rows p, which is raised to the power w. The parameter w can be adjusted to facilitate detection of non-trivial biclusters. As indicated by Equation (1) and illustrated in Figure 1, the contribution of each column to F(B) is evaluated individually, allowing preferential selection of columns with high signal/noise ratios in the greedy searching process, which selects highly coherent columns as the foundation for growing biclusters. This individualized treatment of columns sets the LCD algorithm apart from any mean squared residue (MSR)-based scoring scheme, as detailed in the Supplementary Material. LCD treats missing values as background, which has deleterious effects on the fitness of biclusters, and consequently columns with many missing values are eliminated during the search process.


Figure 1
View larger version (31K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Definition of locally coherent biclusters. (a) Token genetic interaction scores between seven query genes and 22 library genes are plotted to illustrate the difference between the LCD algorithm and existing biclustering algorithms. Most mean squared residue-based algorithms (Cheng and Church, 2000) will cluster all seven query genes (G1–G7) and 22 library genes into one bicluster given the high level of coherence among them (mean squared residue=1.9), while our algorithm only identifies query genes G1, G2 and G3 and library genes 1–9 and 15–22 as a bicluster, which lead to the highest fitness score (see b and c). (b) Contribution to the fitness score of each library gene of the bicluster (G1–G3). The contribution from genes 10–14 is negative (center) as the genetic interaction scores deviates little from the background level (around 0). These genes are consequently rejected by the algorithm. (c) The bicluster containing query genes G1, G2 and G3 has the highest fitness score.

 
The search for non-trivial biclusters is performed by combining simulated annealing (Kirkpatrick et al., 1983) with a Taboo search procedure (Cvijovic and Klinowski, 1995).

Specifically, k initial biclusters are generated by randomly picking p rows and q columns from the full input n x m data matrix. The size of the initial biclusters is not crucial as long as it is at least 2 x 2 (Supplementary Fig. S1). The value of k can significantly impact the calculation time, but provided k is large enough, its actual value marginally affects the final number of statistically significant biclusters. At each iteration, the best move for each bicluster is determined on the basis of its current fitness score. There are n+m possible moves associated with each bicluster B. If a row or column x is already contained in B, the move will be the removal of x from B; otherwise it will be addition of x to B. Each move M is associated with a gain, defined as follows:


Formula 2

(2)
where Fi and Fi+1 are the fitness of B before and after each move, respectively. The move with the maximum gain is selected as the best move for B. Best moves are applied stochastically with a probability of performing a given move M, determined using a simulated annealing protocol:


Formula 3

(3)
here, the current temperature ti is a function of the previous temperature ti–1 and the cooling rate {alpha}. At each iteration, ti is updated using a linear cooling model with t0=1:


Formula 4

(4)

Simulated annealing improves the performance of the LCD algorithm by accepting a temporary decline in fitness, which in turn enables a more extensive exploration of the search space and facilitates convergence.

The Taboo search technique is applied to further improve the search efficiency. It does so by excluding parts of the search space unlikely to contain solutions that are better than the current best solution (Cvijovic and Klinowski, 1995). The details are as follows: for each bicluster B the procedure starts (1st iteration) by scanning all n+m row and column moves and ranking them according to the gain they produced. The top 10% moves are flagged as belonging to the so-called ‘Elite list’. The move at the top of that list is then applied to yield a new state of B, and removed from the Elite list. In the next iteration, remaining moves in the Elite list are re-ranked based on the new state of B, and the top move from the Elite list is applied and then removed from the list, and so on until half the Elite list has been used up. At that point, also denoted as the ‘aspiration condition’ the ranking of all possible n+m moves is updated by evaluating their gain values on the current state of B, and the cycle is repeated. The total time complexity of the LCD algorithm is O(lk(n+m)max (m, n), where l and k are the number of iterations and biclusters, respectively.

2.2 Data sources
For the analysis on metabolic pathways (Segre et al., 2005), single and double deletion fitness data under nominal condition were downloaded from ftp://ftp.ebi.ac.uk/pub/databases/FBA2KO. Scaled interaction score were calculated using the authors' formula. E-MAP datasets for chromosome biology (Collins et al., 2007) and early secretory pathways (Schuldiner et al., 2005) were downloaded from http://interactome-cmp.ucsf.edu/.

2.3 Biclustering analysis
The LCD algorithm was run 1000 times (multiple random starts) on each of the three datasets examined in this study, with each run beginning with 100 randomly selected 3 x 3 biclusters. Values for the parameter r were taken as those reported in the original E-MAP studies. The parameter w was adjusted to optimize the fit of identified biclusters to a compendium of yeast protein complexes curated in house (Pu et al., unpublished data), although the actual sensitivity of the results to this parameters is found to be marginal (Supplementary Fig. S2). The parameters used for each dataset are summarized in the Supplementary Table S4.

2.4 Defining modules and functional links
Random controls were used to assess the statistical significance of the biclusters identified by the LCD algorithm. To generate these controls, entries of the data matrix with numerical values were randomly shuffled such that the entries with missing values are unchanged and the matrix remains symmetrical with respect to the main diagonal. This is equivalent to randomly shuffling edge weights of the genetic interaction network without changing the node labels and network topology, thereby preserving the node degrees and edge weights distribution. For each of the matrices from the three real datasets, a total of 150 such randomization runs were performed, with this number being limited by computational cost considerations. The LCD algorithm was applied to each randomized matrix in exactly the same way as to the matrices of the real data.

In addition to the random control obtained by shuffling entries of the data matrices, we also derived biclusters with 2–20 rows by randomly drawing rows from the data matrices and retaining only columns that contribute positively to the fitness score of each cluster. For each row size, 1000 biclusters were created, and the average fitness score and average number of coherent columns were computed. The results comparing biclusters from real data and random models are summarized in the Supplementary Figure S3.

The fitness score cutoff for each row size was determined on the basis of the maximum fitness scores obtained from the 150 randomized matrices, which displayed an approximately normal distribution. Let Mrandom be the average maximum fitness score obtained from randomized matrices, then the fitness score cutoff, Fcutoff, is determined as follows:


Formula 5

(5)
This conservative level was chosen to minimize the probability that biclusters identified from the real data have arisen by chance, while minimizing the number of discarded biclusters. The cutoff values used for each dataset are summarized in the Supplementary Table S5.

The query genes in an identified bicluster are considered here as representing the predicted functional module. Identical modules identified in different runs are considered as redundant and represented only once. But the frequency with which they have been identified in all runs is recorded and used as a measure of the module robustness. Modules differing by one or more genes are considered as distinct, as they are smaller modules that are entirely contained in bigger ones. Modules corresponding to individual known functional groups (physical complex, annotated pathway or epistasis group) are identified by intersecting the lists of predicted modules with the list of known functional groups. The statistical significance of the intersection is evaluated using the hypergeometric distribution (see Supplementary Tables S1, S2 and S3). A predicted module is considered as corresponding to a known functional group when it is the largest module, such that all the components of that module belong to that functional group. Modules that contain more than 50% components of each of two or more different functional groups are considered as indicating the existence of functional links between them. Functional links are established pairwise and quantified by a weight proportional to the frequency that a pair of known functional groups appears together in different modules, with a higher weight indicating that the linked complexes (pathways, epistasis groups) are more likely to be functionally related.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 LCD and hierarchical clustering identify different functional modules
With very useful interpretations of the E-MAP data afforded to date by hierarchical clustering techniques, it is important to evaluate to what extent the modules identified in this study differ from those detected by classical clustering techniques. To that end, a systematic mapping was performed between the modules identified by LCD and the agglomerative clustering tree built by grouping query genes on the basis of the pairwise correlation coefficient between their interaction profiles with all other genes (Collins et al., 2007), as illustrated in Figure 2a. This mapping was performed independently for each of the two E-MAP and the E-MAP-like datasets considered here, with the results summarized in Figure 2b. Figure 2a clearly illustrates that genes belonging to an LCD module and hence displaying highly coherent interaction profiles with a subset of the library genes, may map into a cluster whose members display poor correlation between their genetic interaction profiles computed over all the genes (module M1, mapping into cluster C6). Such cluster will not be considered as statistically significant and discarded from further analysis, limiting the latter to clusters positioned lower in the dendrogram, whose members display higher global coherence in their interaction profiles. Figure 2b shows that in addition to these globally coherent modules, the LCD algorithm detects a large number of new locally coherent modules. These new modules display poor global coherence and therefore cannot be identified by the hierarchical clustering methods. In the following, we present a more detailed analysis of the modules identified by LCD in each of the three datasets considered here.


Figure 2
View larger version (37K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Comparing modules (biclusters) with clusters derived by hierarchical clustering. (a) Each module is mapped onto the smallest cluster that subsume the module. For example, module M1 is mapped to cluster C6, because C6 is the smallest cluster that contains genes A, B and E. (b) Fraction of modules identified in each dataset that overlap with clusters characterized by the average pairwise uncentered pearson correlation coefficient (UPC) of their genetic interaction profiles. The figures in parentheses denote the Jaccard index measuring the extent of overlap between the module and cluster components. those in the pies represent the number of modules that map into clusters having the indicated range of UPC values.

 
3.2 Functional relationships among yeast metabolic pathways
The first dataset represents genetic interactions among 890 yeast metabolic genes determined computationally using flux balance analysis (Segre et al., 2005). The main interest of this dataset, which resembles an E-MAP albeit not a true one, lies in the fact that the metabolic pathways and the roles of the genes in these pathways are well defined and the mechanism producing the phenotypic consequence of gene deletions is known. Application of LCD to this dataset recovered 17 ‘monochromatically interacting’ modules derived previously using the Prism algorithm (Segre et al., 2005), and the genes in these modules display pathway-specific functional annotations (Forster et al., 2003). Moreover, our procedure clustered these pathway-specific modules into 21 higher level modules containing genes with related functions. The remaining 23 modules are either sub-complexes/sub-pathways or combinations thereof (See Supplementary Table S1 and Supplementary Fig. S5 for an overview of the modules).

The combinatorial relationships among pathways uncovered by our procedure are summarized in Supplementary Figure S6, where two pathway-specific modules are linked by an edge if they were found together in a larger module. We see, for instance, that the ‘ATP synthase’ module (14 genes) and the ‘Electron transport system, complex IV’ module (13 genes) are regrouped into a 27-gene module (Supplementary Fig. S7). This is clearly consistent with the fact that electron transport is tightly coupled with ATP synthesis in the mitochondrial inner membrane. Such groupings occur frequently, with the resulting network highlighting some fundamental aspects of Saccharomyces cerevisiae metabolism. It is well known that the glycolysis pathway serves as a hub in the metabolic network, reflecting its central role in yeast metabolism. Our analysis captures this central role by showing that genes in the glycolysis pathway co-cluster with genes involved in the TCA cycle, in fermentation (acetaldehyde and acetate metabolism) and pantothenate and coenzyme A biosynthesis, in addition to genes in the cellular respiration pathways and ATP synthase. Lastly, glycolysis genes are also associated with genes in amino acid (proline, glycine, serine and threonine) biosynthesis and sterol biosynthesis pathways.

3.3 Function pleiotropy in yeast chromosome biology
In the second analysis, the LCD algorithm was applied to the experimentally derived E-MAP data for 754 alleles of 743 S.cerevisiae genes involved in various aspects of chromosome biology (Collins et al., 2007). This dataset represents a much denser interaction network than the network of metabolic genes considered above, as all the genes are associated with chromosome functions. Although many of these genes belong to known protein complexes, the physiological pathways in which these complexes operate are not always well known.

The LCD algorithm identifies 298 modules in this dataset (Supplementary Table S2). Among them, there are 28 known protein complexes, 2 known epistatic groups and 4 sub-complexes. In addition, four complexes and one sub-complex are also found with a slightly lower fitness score cutoff. The fact that many known protein complexes are recapitulated as genetic modules is a good indication that these modules are biologically meaningful. Several of these complexes or components thereof are further grouped into 151 higher level epistatic modules. The remaining 108 modules encompass subcomplexes and their combinations (see Supplementary Fig. S5 for an overview). The novel view afforded by the LCD algorithm reveals that this grouping is again combinatorial (Fig. 3), and indicates that complexes may participate in multiple cellular processes, as illustrated by an example given below.


Figure 3
View larger version (59K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Organization of functional modules involved in chromosome biology. Depicted is the network of functional modules identified by the LCD procedure from the E-MAP data of Collins et al. (2007). Nodes represent complexes or known epistasis groups identified here as genetic modules (See Supplementary Table S2 for their detailed composition). The node size is proportional to the number of genes and edge thickness is proportional to the number of modules that contain the linked nodes (see Section 2 for details). The dotted oval highlights portions of the network that are discussed in detail in the text.

 
The example pertains to links of the Lge1/Bre1 complex to chromatin modification, DNA damage repair and chromatid cohesion. These functional links are discussed in detail below and illustrated in Figure 4. The Lge1/Bre1 complex plays a role in monoubiquitination of histone H2B on lysine 123. In addition to strong links between this complex with the Compass, Rpd3L, Mediator and Cdc73 complexes (Figure 4a) reflecting its functional interplay with these complexes in RNA polymerase II transcription, we observed that it co-clusters, though less frequently, with the DNA damage epistasis group Asf1/Rtt101/Rtt109/Mms1/Mms22 (Collins et al., 2007) and the sister chromatid cohesion-related genes Ctf4/Ctf8/Ctf18 (Fig. 4b). Moreover, the Compass complex, but not the other complexes in Figure 4a, also occurs in these clusters, suggesting that both Lge1/Bre1 complex and the Compass complex may be involved in sister chromatid cohesion and DNA damage response as well as transcription. These findings are consistent with the reports that H2B ubiquitination and H3 methylation are implicated in DNA damage checkpoint response (Giannattasio et al., 2005) and resistance to ionizing radiation (Game et al., 2006). Interestingly, more than 80% of the library genes that define the clusters shown in Figure 4a and b are different, suggesting that the Lge/Bre1 complex participates in distinct processes in each case (so does the Compass complex) (Fig. 4c).


Figure 4
View larger version (61K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Distinct clusters reveal multiple roles of the Lge1/Bre1 complex. (a) and (b) are heat maps of biclusters, red squares represent positive values (alleviating interactions), green negative values (aggravating interactions) and white, missing values; the shade of color is proportional to the magnitude of the effect. (a) The Lge1/Bre1 complex co-clusters with genes from transcription-related complexes in a module defined by 48 library genes (displayed horizontally). (b) In another module, it is joined by genes involved in DNA damage and chromatid cohesion. Only 7 out of 35 library genes are shared with (a). (c) Results of our biclustering analysis confirm existing experimental evidence implicating the Lge1/Bre1 complex in both transcription and the DNA damage response. The role of the Compass complex in the latter, suggested by our analysis, has not been reported so far.

 
To our knowledge, experimental evidence is currently not available for many prominent links identified in this analysis (Fig. 3). In addition to the links discussed in the above example, many others—such as those between the Swr1 complex and the HIR complex, between Swr1 complex and Rpd3L complexes (Supplementary Fig. S8)—may warrant further investigation as well.

3.4 Higher level functional links among the early secretory and maturation pathways
The third analyzed E-MAP comprises the genetic interactions among 424 S.cerevisiae genes involved in the early secretory and protein maturation pathways (Schuldiner et al., 2005). The resulting network is particularly dense as all the genes belong to intimately linked pathways. The 511 epistatic modules detected in this dataset by our algorithm are enriched in genes sharing identical functional categories and frequently recover known complexes or linear pathways (Supplementary Table S3). The derived links between complexes (as shown in Supplementary Fig. S9) indicate that complexes functioning in the Golgi apparatus are densely interconnected in one part, while those residing in the endoplasmic reticulum (ER) are grouped in a separate part. Intriguingly, some genes, such as Spf1 and Pmr1, are found to co-cluster with genes that reside in both ER and Golgi apparatus and display diverse functional annotations.

Supplementary Figure S10 illustrates known and novel functional associations of Spf1 identified in our analysis. Spf1, a P-type ATPase located in the ER membrane, plays important roles in Ca2+ homeostasis (Cronin et al., 2002). We found that it co-clusters not only with Sec66, a subunit of the Sec62/Sec63 complex involved in protein import into ER, as shown by Collins et al. (2007), but also with 19 other genes. Among the genes that co-cluster with Spf1, six show functional relatedness with Spf1 in numerous experimental studies. Pmr1, another P-type ATPase located in the Golgi complex, was shown to collaborate with Spf1 to maintain ion homeostasis in the ER (Vashist et al., 2002). Get1, a subunit of the GET complex involved in retrieval of HDEL signal-containing proteins, is believed to functionally overlap with Spf1, probably by facilitating ion balance between the Golgi and ER (Ando and Suzuki, 2005). Mutational studies identified Spf1 and Ste24, a metalloprotease anchored in the ER membrane, as having protein insertion orientation function (Tipper and Harley, 2002). Experimental evidence also exists for the Spf1-Ire1 association. Ire1, which is synthetically lethal with Spf1, serves as an unfolded protein response (UPR) sensor in the ER. In spf1{Delta} mutants, UPR is constitutively activated (Cronin et al., 2002); meanwhile, Spf1 gene expression is augmented after treatment with the glycosylation inhibitor tunicamycin, a potent inducer of the UPR signaling pathway (Vashist et al., 2002). The physiological role of Spf1 in the UPR may involve its function in N-linked glycosylation (Vashist et al., 2002), an ER process that involves Ost3, the gamma subunit of the oligosaccharyltransferase complex.

Intriguingly, our analysis of the chromosome biology E-MAP has revealed Spf1 to frequently co-cluster with members of the Elongator complex (Elp2, Elp3, Elp4 and Elp6). This complex is required for normal transcription of Hac1 (Krogan and Greenblatt, 2001), itself a downstream effector of Ire1 and a transcription factor for UPR genes. This finding may well explain the enigmatic connection between Spf1 and the Elongator complex.


    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The interpretation of genetic interactions is admittedly difficult. A number of recent studies provided compelling evidence, however, that epistatic interaction profiles—the interaction profiles of individual genes with many other genes—are much more informative than individual direct epistatic interactions in revealing functional associations (Collins et al., 2007; Qi et al., 2005; Tong et al., 2004). We group genes based on their interaction profiles using a novel biclustering algorithm. Our LCD algorithm differs from the biclustering algorithm of Dudley et al. (2005) and the Genetic Interaction Motif Finding algorithm by Qi et al. (2005). The latter are only applicable to binary-valued data (values of 0 or 1), whereas LCD is capable of handling all numeric data types, and thus is able to take full advantage of the rich information afforded by the continuous scale, real-valued, high density E-MAP data. Since it uses a dynamic noise model rather than fixed spread, it is tolerant to noise in the data. Its limitation of being non-exhaustive can be effectively circumvented using a multiple random starts strategy, as numerous (in hundreds) biclusters can be detected at once with each start.

Exploiting the similarities between portions of the epistatic profiles requires using stringent fitness scores derived on the basis of solid statistical criteria. These were established in our study using minimally randomized background models, where only edge weights were swapped while edge weight distribution, node degree distribution, node labels and network topology were preserved. Supplementary Figure S3 plots the fitness scores as a function of the number of query genes in the bicluster, for biclusters detected with the LCD algorithm from the data matrix of Collins et al. (2007), and those obtained from both, random draws and data matrix shuffling (results from the other two datasets are not shown). Inspection of this figure shows that the LCD algorithm consistently identifies biclusters with larger fitness scores and more coherent columns than biclusters derived using either of the random models. Figure S3 also shows that the random model derived through reshuffling imposes more stringent conditions on the identification of statistically significant biclusters, than the random draws model, as it systematically yields biclusters with higher fitness scores. The probability for randomly selected modules (in 150 randomized datasets) to reach the fitness score cutoff used here, is less than 0.0002 for all the three datasets; hence the probability for modules detected from the real data to be false positives is negligible.

To our knowledge, this is the first attempt in applying biclustering approaches to the analysis of quantitative genetic interaction data. As a result of requiring tight coherence among query genes, the modules detected here are functionally homogenous (Supplementary Fig. S3). Although some parameter adjustment is performed to optimize the overlap between biclusters and known protein complexes, the sensitivity of the results to the parameter values is low. The good correspondence of the identified modules with known complexes is significant. Compared to direct genetic interactions used in other studies (Dudley et al., 2005; Qi et al., 2005), co-complex (and co-pathway) membership represents much stronger functional associations between genes. The fact that a sizeable portion of the modules identified using LCD, are combinations of protein complexes (Supplementary Fig. S4a), suggests that these modules represent groupings of functionally related genes. However, we do not expect that the functional associations among genes in the multi-complex modules are as tight as those in single-complex modules, as genes in the former type of modules most likely only share local profile similarity, while genes in the later type modules tend to display more global similarity. Nevertheless, semantic similarity analysis of Gene Ontology annotations indicates that genes within same modules consistently share higher functional similarity than genes belonging to different modules, when evaluated in the Biological process and Cellular component categories (Supplementary Fig. 4b and c) of Gene Ontology (Ashburner et al., 2000).

One of the major advantages of our LCD method is that it allows for the grouping of one gene into multiple clusters, making it particularly well suited for uncovering the pleiotropic functions of genes. For example, hierarchical clustering uniquely groups Spf1 with Sec66 in a cluster containing translocation-related genes, suggesting its involvement in protein import into the ER (Tipper and Harley, 2002). In agreement with experimental evidence, however, our analysis associates Spf1 with more than a dozen functionally diverse genes, including those involved in protein translocation, calcium homeostasis, glycosylation, ER quality control and lipid biosynthesis. Another example is the previously reported multiple links found by our analysis between the Lge1/Bre1 ubiquitination complex and complexes involved in various cellular processes, including transcription, chromatid cohesion and DNA damage repair. The second advantage of the LCD approach is its ability to reveal functional links between complexes or pathways. By exploiting partial profile similarity, multiple complexes or genes can be grouped into a single cluster, suggesting that these complexes or genes might operate in compensatory processes activated in response to deletion of a common subset of genes. The common function of these complexes may be known in some cases. For example, the cluster in Figure 4a contains proteins from five complexes that are involved in transcription and are found in a relatively small subtree with 153 genes in the hierarchical clustering dendrogram of Collins et al. (2007). However, in many cases, complexes that perform diverse roles are found in a cluster, suggesting new roles for known complexes. The cluster in Figure 4b provides a good example. In the hierarchical clustering analysis (Collins et al., 2007), these complexes belong to different subtrees scattered over the entire dendrogram, reflecting their diverse roles (transcription for Compass, chromatid cohesion for Ctf4/Ctf8/Ctf18 and DNA repair for the DNA damage epistasis group). Intriguingly, our analysis is able to group them into a single cluster that is congruent with experimental evidence. Co-clustering of Spf1 with Elongator complex provides another interesting example. On the other hand, many such clusters lack experimental support at the present time, and our findings hence predict many new functional links that would be worthwhile to investigate further.

Living organisms invariably face a myriad of challenges posed by the ever-changing environment. In addition to genome expansion and diversification, increasing the functional versatility of existing genes provides an effective and economical strategy for meeting these challenges. Local coherence-based techniques such as the LCD represent an important new direction for analyzing the high-throughput epistatic interaction data, and may potentially lead to the identification of new roles for many functionally characterized genes.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
S.P. acknowledges the McLaughlin Centre for Molecular Medicine (MCMM) for support. S.J.W. is Tier 1 Canada Research Chair in Computational Biology and Bioinformatics and acknowledges support from the Canada Institute for Health Research, the Hospital for Sick Children and the Sickkids Foundation, Toronto, Canada. J.G. acknowledges support from Ontario Genomics Institute and Genome Canada.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Trey Ideker

Received on May 13, 2008; revised on August 1, 2008; accepted on August 15, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Ando A, Suzuki C. Cooperative function of the CHD5-like protein Mdm39p with a P-type ATPase Spf1p in the maintenance of ER homeostasis in Saccharomyces cerevisiae. Mol. Genet. Genomics (2005) 273:497–506.[CrossRef][Web of Science][Medline]

    Ashburner M, et al. Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat. Genet. (2000) 25:25–29.[CrossRef][Web of Science][Medline]

    Avery L, Wasserman S. Ordering gene function: the interpretation of epistasis in regulatory hierarchies. Trends Genet. (1992) 8:312–316.[Web of Science][Medline]

    Baker BS, Ridge KA. Sex and the single cell. I. On the action of major loci affecting sex determination in Drosophila melanogaster. Genetics (1980) 94:383–423.[Abstract/Free Full Text]

    Bandyopadhyay S, et al. Functional maps of protein complexes from quantitative genetic interaction data. PLoS Comput. Biol (2008) 4:e1000065.[CrossRef][Medline]

    Boone C, et al. Exploring genetic interactions and networks with yeast. Nat. Rev. Genet. (2007) 8:437–449.[CrossRef][Medline]

    Cheng Y, Church GM. Biclustering of expression data. Proc. Int. Conf. Intell. Syst. Mol. Biol. (2000) 8:93–103.[Medline]

    Collins SR, et al. Functional dissection of protein complexes involved in yeast chromosome biology using a genetic interaction map. Nature (2007) 446:806–810.[CrossRef][Web of Science][Medline]

    Cronin SR, et al. Cod1p/Spf1p is a P-type ATPase involved in ER function and Ca2+ homeostasis. J. Cell Biol. (2002) 157:1017–1028.[Abstract/Free Full Text]

    Cvijovic D, Klinowski J. Taboo search: an approach to the multiple minima problem. Science (1995) 267:664–666.[Abstract/Free Full Text]

    Davierwala AP, et al. The synthetic genetic interaction spectrum of essential genes. Nat. Genet. (2005) 37:1147–1152.[CrossRef][Web of Science][Medline]

    Dudley AM, et al. A global view of pleiotropy and phenotypically derived gene function in yeast. Mol. Syst. Biol (2005) 1:2005.0001.

    Forster J, et al. Genome-scale reconstruction of the Saccharomyces cerevisiae metabolic network. Genome Res. (2003) 13:244–253.[Abstract/Free Full Text]

    Game JC, et al. The RAD6/BRE1 histone modification pathway in Saccharomyces confers radiation resistance through a RAD51-dependent process that is independent of RAD18. Genetics (2006) 173:1951–1968.[Abstract/Free Full Text]

    Giannattasio M, et al. The DNA damage checkpoint response requires histone H2B ubiquitination by Rad6-Bre1 and H3 methylation by Dot1. J. Biol. Chem. (2005) 280:9879–9886.[Abstract/Free Full Text]

    Guarente L. Synthetic enhancement in gene interaction: a genetic tool come of age. Trends Genet. (1993) 9:362–366.[CrossRef][Web of Science][Medline]

    Kelley R, Ideker T. Systematic interpretation of genetic interactions using protein networks. Nat. Biotechnol. (2005) 23:561–566.[CrossRef][Web of Science][Medline]

    Kirkpatrick S, et al. Optimization by simulated annealing. Science (1983) 220:671–680.[Abstract/Free Full Text]

    Krogan NJ, Greenblatt JF. Characterization of a six-subunit holo-elongator complex required for the regulated expression of a group of genes in Saccharomyces cerevisiae. Mol. Cell Biol. (2001) 21:8203–8212.[Abstract/Free Full Text]

    Madeira SC, Oliveira AL. Biclustering algorithms for biological data analysis: a survey. IEEE/ACM Trans. Comput. Biol. Bioinform. (2004) 1:24–45.[CrossRef]

    Pan X, et al. dSLAM analysis of genome-wide genetic interactions in Saccharomyces cerevisiae. Methods (2007) 41:206–221.[CrossRef][Web of Science][Medline]

    Prelic A, et al. A systematic comparison and evaluation of biclustering methods for gene expression data. Bioinformatics (2006) 22:1122–1129.[Abstract/Free Full Text]

    Qi Y, et al. Genetic interaction motif finding by expectation maximization – a novel statistical model for inferring gene modules from synthetic lethality. BMC Bioinformatics (2005) 6:288.[CrossRef][Medline]

    Schuldiner M, et al. Exploration of the function and organization of the yeast early secretory pathway through an epistatic miniarray profile. Cell (2005) 123:507–519.[CrossRef][Web of Science][Medline]

    Segre D, et al. Modular epistasis in yeast metabolism. Nat. Genet (2005) 37:77–83.[CrossRef][Web of Science][Medline]

    Tipper DJ, Harley CA. Yeast genes controlling responses to topogenic signals in a model transmembrane protein. Mol. Biol. Cell (2002) 13:1158–1174.[Abstract/Free Full Text]

    Tong AH, et al. Global mapping of the yeast genetic interaction network. Science (2004) 303:808–813.[Abstract/Free Full Text]

    Vashist S, et al. Two distinctly localized p-type ATPases collaborate to maintain organelle homeostasis required for glycoprotein processing and quality control. Mol. Biol. Cell (2002) 13:3955–3966.[Abstract/Free Full Text]

    Ye P, et al. Gene function prediction from congruent synthetic lethal interactions in yeast. Mol. Syst. Biol (2005) 1:2005.0026.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/20/2376    most recent
btn440v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Pu, S.
Right arrow Articles by Wodak, S. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Pu, S.
Right arrow Articles by Wodak, S. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?