Skip Navigation


Bioinformatics Advance Access originally published online on October 10, 2006
Bioinformatics 2007 23(1):77-83; doi:10.1093/bioinformatics/btl511
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/1/77    most recent
btl511v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Hollunder, J.
Right arrow Articles by Wilhelm, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Hollunder, J.
Right arrow Articles by Wilhelm, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

DASS: efficient discovery and p-value calculation of substructures in unordered data

Jens Hollunder 1, Maik Friedel 1, Andreas Beyer 1,2, Christopher T. Workman 2 and Thomas Wilhelm 1,*

1 Department of Theoretical Systems Biology, Leibniz Institute for Age Research—Fritz-Lipmann-Institute e. V. (former IMB Jena) Beutenbergstrasse 11, D-07745 Jena, Germany
2 Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 APPLICATIONS
 4 CONCLUSION
 REFERENCES
 

Motivation: Pattern identification in biological sequence data is one of the main objectives of bioinformatics research. However, few methods are available for detecting patterns (substructures) in unordered datasets. Data mining algorithms mainly developed outside the realm of bioinformatics have been adapted for that purpose, but typically do not determine the statistical significance of the identified patterns. Moreover, these algorithms do not exploit the often modular structure of biological data.

Results: We present the algorithm DASS (Discovery of All Significant Substructures) that first identifies all substructures in unordered data (DASSSub) in a manner that is especially efficient for modular data. In addition, DASS calculates the statistical significance of the identified substructures, for sets with at most one element of each type (DASSPset), or for sets with multiple occurrence of elements (DASSPmset). The power and versatility of DASS is demonstrated by four examples: combinations of protein domains in multi-domain proteins, combinations of proteins in protein complexes (protein subcomplexes), combinations of transcription factor target sites in promoter regions and evolutionarily conserved protein interaction subnetworks.

Availability: The program code and additional data are available at http://www.fli-leibniz.de/tsb/DASS

Contact: wilhelm{at}fli-leibniz.de

Supplementary information: Supplementary information is available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 APPLICATIONS
 4 CONCLUSION
 REFERENCES
 
Pattern recognition is one of the most fundamental themes in bioinformatics (Ouzounis and Valencia, 2003). It comprises different problem classes, such as binding site discovery (Stormo, 2000; Tompa et al., 2005), multiple sequence alignment [e.g. hidden Markov models (Eddy, 1998)], classification of proteins [e.g. SCOP (Andreeva and Srikant, 1994), SUPERFAMILY (Madera et al., 2004)], characterization of protein complexes (Aloy et al., 2004; Hollunder et al., 2005a) and functional annotation of proteins [e.g. FunSpec (Robinson et al., 2002)]. Numerous algorithms exist to analyze biological sequences [e.g. FASTA (Pearson and Lipman, 1988), BLAST (Altschul et al., 1990) and PSI-BLAST (Altschul et al., 1997)]. However, only few methods deal with non-sequential (unordered) data.

Although lacking sequential order, such data nevertheless may contain hidden structure and hierarchically organized features. Hierarchical organization and modularity are common in biology and can be observed at many levels of cellular organization (Gavin and Superti-Furga, 2003). For instance, protein complexes are frequently composed of different protein subcomplexes, also called protein modules or molecular machines (Wilhelm et al., 2003; Hollunder et al., 2005a). By combining proteins and protein modules into specific complexes the cell achieves a higher degree of versatility, flexibility and robustness. In addition, the proteins used in these complexes are themselves composed of different protein domains (Apic et al., 2001; Vogel et al., 2004) that are specifically combined to define a protein's structure and function. Other examples are the combinatorial transcriptional gene regulation (Ihmels et al., 2004; Beyer et al., 2006) and modules of interacting proteins conserved across species that can be used as building blocks of biological networks and pathways (Li et al., 2004; Sharan et al., 2005). Therefore, it is important to develop efficient methods for the investigation of such modular and hierarchical structures.

Here we present algorithms for the detection and scoring of structures in data that can be represented as sets, e.g. sets of proteins or sets of protein domains. Most existing pattern recognition algorithms for unordered data are variants of the APRIORI algorithm (Agrawal and Srikant, 1994). Here a breadth-first search is used to iterate through the items (elements) of each set, considering a threshold for the item frequencies in the result sets. In the first iteration, all items below this frequency threshold are removed and all combinations of the remaining items are tested (two-itemset candidates). Frequent itemsets at the given level are extended by more items to generate candidate sets for the next iteration and infrequent subsets are discarded. Recent approaches use similar search strategies with item-based data structures to reduce the search space. Three classes of algorithms can be distinguished: those finding all frequent patterns (sets), e.g. APRIORI (Agrawal and Srikant, 1994), those determining only maximal frequent sets (a set with no frequent superset containing this set), e.g. MAFIA (Burdick et al., 2001), and algorithms finding frequent closed sets [a set is ‘closed’ if there exists no superset with the same frequency (Pei et al., 2000)], e.g. CHARM (Zaki and Hsiao, 2002), Close (Pasquier et al., 1999), and CLOSET (Pei et al., 2000). All of these approaches require a user-specified threshold for the set frequency. We are not aware of any algorithm quantifying the statistical significance of the observed patterns. Depending on the frequency of the constitutive elements, abundant patterns might not be statistically significant, whereas patterns with a comparably low abundance may be significant.

In Section 2, we present an approach that detects patterns in unordered hierarchical data and estimates the statistical significance of these patterns without requiring a pattern frequency threshold. DASS (Discovery of All Significant Substructures) comprises two independent algorithms: the first (DASSSub, Section 2.2.1) performs a fast search to find all closed sets. Closed sets provide a concise representation of the data, because the number of closed sets is typically much smaller than the number of all contained patterns. In contrast to existing algorithms, which iterate through the set elements, DASSSub iterates through the sets (structures) themselves. This approach allows for efficient pruning when analyzing hierarchically structured data and provides a superior performance compared with other algorithms. In the second algorithm (Section 2.2.2) the statistical significance of all closed sets is calculated. Two cases are distinguished: (1) sets with unique elements and (2) sets with multiple occurrence of the same elements (multi-sets). In each case DASS calculates the probability (p-value) that the considered closed set is found in a randomized dataset with the same or higher frequency.

The DASS algorithm has numerous potential applications. Section 3 outlines applications to protein domain combinations, protein complexes, and transcriptional modules. As an additional application, a new method to find conserved subnetworks is presented in the Supplementary information 2.4.


    2 ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 APPLICATIONS
 4 CONCLUSION
 REFERENCES
 
2.1 Problem formulation and preliminaries
Let {Sigma} = {e1, e2, ... , e|{Sigma}|} be a finite alphabet of elements ek. We consider sets Si = {(E, m), E {subseteq} {Sigma}} (e.g. measured protein complexes), where E is the (unordered) list of elements ek contained in Si and m : E -> N+ is a function mapping the ek onto their frequencies in Si. Thus, m(ek) is the frequency of an element ek in Si and Formula is the size of Si. If ek may occur more than once, Si is a multi-set with m(ek) ≥ 1. The dataset Formula = {S1, S2, ... , S|Formula|} comprises a finite set of structures Si (e.g. Fig. 1a). For instance, Formula could be a collection of measured protein complexes, where each structure Si comprises a set of proteins and contains all different proteins. The frequency of an element ek in Formula is defined as Formula. Thus, we can compute the relative frequency of ek as fk = Fk /N, where N is the total number of all elements in Formula. See Figure 1b for an example.


Figure 1
View larger version (21K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Example. (a) the dataset Formula = {S1, S2, S3}; (b) the used alphabet {Sigma}, the size of each structure and the relative frequencies of each element; (c) host structures for the closed set {a, a, b, d}; (d) all closed sets from Formula; and (e) the p-values p(Mj) of all closed sets Mj which occur more than once in Formula.

 
Let the substructure Mj {subseteq} Si be a non-empty set or multi-set of Formula elements and {Lambda}(Mj) {subseteq} Formula be the subset of all structures from Formula containing the specific substructure Mj (‘host structures’). The frequency Fj = |{Lambda}(Mj)| of a substructure Mj is the number of structures in the dataset Formula containing Mj (e.g. Fig. 1c). If each Si corresponds to an independent measurement, a high frequency Fj indicates a large number of measurements confirming the existence of the substructure Mj.

DEFINITION 1
(closed set). A substructure MX is called a closed set if there exists no substructure MY such that MX sub MY and FX = FY. In other words, MX is called closed if there exists no superset MY sup MX with the same frequency.

Obviously, each structure Si itself is a closed set.

Formula = {M1, M2, ... , M|Formula|} is the set of all closed sets of the dataset Formula (e.g. Fig. 1d). The first part of the DASS algorithm finds Formula for a given dataset Formula and the second part computes the statistical significance of each Mj based on its frequency (e.g. Fig. 1e).

2.2 The DASS Algorithm
2.2.1 Finding all substructures
The algorithm DASSSub finds Formula in a given dataset Formula by systematically intersecting every structure with all other structures. Figure 2 shows the pseudocode for the algorithm (without frequency calculations for clarity). The algorithm starts with an empty set Formula (Fig. 2, line 1) and works by using two loops (i-loop and j-loop), and the helper-set H. The i-loop iterates through the structures Si (lines 2–12), the j-loop iterates through the current Formula (lines 5–9). H is reinitialized in every i-loop with the current structure Si (line 4) and then collects all intersections of Si with all substructures Mj in the current Formula (line 7). After finishing the j-loop, Formula is updated with H (line 10). Finally, Formula contains all closed sets from Formula (line 13). This can be proven by taking into account that all structures Si are closed sets, all intersections of two closed sets are closed again, and the complete intersection tree (all combinations of intersections of structures) is traversed by the algorithm. Figure 3 outlines the algorithm for the example of Figure 1.


Figure 2
View larger version (11K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 The DASSSub algorithm—a set-wise enumeration algorithm for finding all closed sets Mj in a given dataset Formula.

 


Figure 3
View larger version (10K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 Illustration of the DASSSub algorithm considering the example of Figure 1.

 
As mentioned above, this algorithm iterates through the structures Si and not through the elements of {Sigma}. Compared with other data mining algorithms (Zaki and Hsiao, 2002; Pasquier et al., 1999; Pei et al., 2000), DASSSub has the advantage of finding all closed sets without needing to test that the sets are closed. Generally, given Formula = {S1, S2, ... , S|Formula|} with |Formula| structures, Formula contains 2|Formula| – 1 sets in the worst case.

However, DASSSub is particularly fast if newly added substructures Mj in Formula are often identical to already existing sets, because this reduces the number of intersections in subsequent j-loops. Fortunately, hierarchically organized biological data have exactly this property: the frequency of substructures (i.e. modules) is quite large compared with the number of different substructures. This vastly reduces the computation time of DASSSub. For instance, Formula in Figure 1d contains only five different sets, because Formula. Algorithms iterating through elements from {Sigma} do not exploit the hierarchical organization of biological data.

Finally, we compute the frequency Fj of each substructure Mj isin Formula.

2.2.2 Statistical significance of all substructures
The calculation of the statistical significance of substructures can be divided into two different problems: (1) structures with unique elements (simple sets) Formula and (2) structures with possible multiple occurrence of the same elements Formula. The appropriateness of the respective approach depends on the dataset and on the biological question.

In each case (simple set or multi-set) our underlying null model assumes that all elements ek from {Sigma} are randomly assigned to the structures Si, taking into account their relative frequencies fk. The exact way of determining the statistical significance is to generate all possible permutations of a given dataset Formula [Supplementary information 1.1.1 (a)]. However, this is only practicable for very small datasets. A good approximation for the exact model is provided by shuffling the data. This procedure maintains all structure sizes and re-assigns all elements from the ‘element pool’ (containing all ek with their frequencies fk) [Supplementary information 1.1.1 (b)]. After each randomization the frequencies of all substructures are counted. The p-value of a given structure is the number of randomizations containing the substructure at least as often as its frequency in original data Formula divided by the total number of randomizations [Supplementary information 1.1.1 (b)—Model IIA]. For very improbable substructures this procedure is very costly, because many randomizations are needed. The corresponding p-value calculation can be improved by fitting for a given substructure its frequency distribution [Supplementary information 1.1.1 (b)—Model IIB]. However, because this procedure is still computationally demanding, we developed another shuffling model suited for fast analytical calculations, which is described in the following. It is based on the average probability Formula to find a substructure in the randomized data. The p-value is calculated with a binomial distribution using this average probability and the observed frequency of the substructure. The binomial distribution (Formula) is an exact test to calculate the probability of at least k successes in n Bernoulli experiments with the probability Formula of success. A detailed discussion of all these models and a validation of our final analytical model is given in the Supplementary information 1.1.

Both Formula and Formula estimate the probability, pj (ni), of finding Mj in a random set of size ni for all structure sizes ni ≥ nj contained in the dataset Formula. The average probability Formula of finding Mj in such a random set is

Formula 1(1)
where ki is the number of structures of size ni and Formula 1 is the total number of structures with size ≥nj. The product k i· pj (ni) is the expectation value to find Mj in the structures of size ni. The average probability Formula 1 and the observed frequency Fj of Mj are used to compute the statistical significance based on a binomial distribution. The probability of observing Mj at least Fj times in a randomized dataset Drandom thus reads:

Formula 2(2)
Next, we show how the probability pj(ni) is calculated.

(i) Formula 2. If each element ek can occur at most once in a set [Formula 2 is m(ek) = 1], the number of different permutations in a random set Sr of nj elements is nj! and the probability of finding a substructure Mj in a random set Sr of nj elements is:

Formula 3(3)
Note that this equation corresponds to a sampling with replacement. Of course, this assumption is violated for a small number of elements. In such a case one should use a more exact model as indicated above. However, biological datasets typically have a large number of elements, justifying the use of Equation (3) [see Supplementary information 1.1.1 (d)].

Next, the probability pj of finding a substructure Mj in a random set Sr with nr ≥ nj can be calculated as 1 minus the probability of not finding Mj in Formula 3 random experiments:

Formula 4(4)
The pj (nr) are calculated for all structure sizes ≥nj contained in the given dataset.

(ii) Formula 4. We now consider the case where the Mj are multi-sets. The probability pj (nr) of finding Mj in a random set Sr of size nr ≥ nj can be calculated by a summation of the probabilities of all possible outcomes containing Mj at least once. A brute force algorithm would generate all |{Sigma}|nr possible outcomes to calculate this probability. Of course, this is applicable only for a small number of structures and a small alphabet of elements. We present an efficient, iterative algorithm to estimate the probability of finding Mj in Sr for successively increasing set sizes. Intermediate results of previous iteration steps are efficiently re-used to avoid unnecessary recalculations.

The number of different permutations of a given substructure Mj is:

Formula 5(5)
The probability that nj positions are randomly filled with elements from Mj is

Formula 6(6)
The probability pj(nj) of finding a given substructure Mj in a random set of nj elements is:

Formula 7(7)
Note that (Equation 7) is a generalization of (Equation 3).

If nr > nj Equation (7) has to be expanded by an additional factor Formula 7 > 1, accounting for the higher probabilities for the occurrence of Mj in larger sets. Formula 7 monotonically increases with the number of free positions. For the general case nr ≥ nj the probability pj (nr) is calculated by

Formula 8(8)
with Formula 8.

All random sets Sr containing Mj have nj fixed elements. The remaining nd = nrnj positions can be filled with any element from {Sigma}. The purpose of Formula 8 in Equation (8) is to account for the different ways in which these free positions can be filled. For instance, if nd = 1 then pj (nr) = pj (nj + 1) can be calculated by summing up all possible outcomes with one additional element, which may be an element of Mj or not. A brute force algorithm for the generation of all outcomes is very costly. To make the calculation more efficient, we recursively calculate Formula 8 as a function of its predecessors, i.e. Formula 8 (Mj, nr) = f [Formula 8(Mj, nr – 1)]. Additionally, we use an efficient ‘factor out technique’, where the subtotals are re-used extensively throughout the procedure (intermediate values are stored in a matrix SubT). Figure 4 shows the pseudo-code for the complete algorithm DASSPmset. The code also demonstrates the efficient calculation of the factor Formula 8.


Figure 4
View larger version (23K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 The algorithm DASSPmset determines the probability pj (nr) to find Mj in a random structure Sr with possible multiple occurrence of the same elements in the general case nr ≥ nj. SubT[x][0] = 0 for 1 ≤ x ≤ nd = nrnj.

 
At first DASSPmset computes the probability pj(nj) (Fig. 4, lines 1–5). Next, DASSPmset defines a composed set Formula 8 containing all elements of a given substructure Mj and the pseudo-element Formula 8 where Formula 8 is a placeholder for all elements ek isin {Sigma}\Mj. Thus, the relative frequency of Formula 8 in Formula 8 is Formula (line 6).

In an outer loop (lines 7–15) the algorithm iterates through the nd free positions, which can be occupied by any element from {Sigma}. In the inner k-loop (lines 8–14) the algorithm iterates through the different elements of Formula 8 always in a predefined fixed order and stores all subtotals in a matrix SubT (line 13). The matrix SubT contains the corresponding subtotal for each possible free position and element ek isin Formula 8 and is used in later steps for determining the factor Formula 8. The most inner loop (lines 10–12) re-uses the subtotals in SubT to compute the subtotal for the given element ek. This loop is only used if nd > 1. Finally, Formula 8 (line 16) and the corresponding probability pj (nr) (line 17) are computed.

Due to the pre-calculation of partial results, the computation time is polynomial Formula 8, instead of the exponential time required for the brute force algorithm Formula 8: O(nd) (loop of line 6), O(|{Sigma}|) (loop of line 7) and Formula 8 (loop of line 9)].

As in the simple set case above, also in the multi-set case the pj (nr) calculation is only an approximation (sampling with replacement). However, in most cases it leads to a conservative estimate of the exact p-value [see Supplementary information 1.1.1 (d)], the accuracy increases with an increasing number of elements in {Sigma} and would be exact for an infinite number of elements. In most biological applications the number of elements in {Sigma} is much larger than the largest structure Si. Thus, Formula 8 is generally a very good approximation even if {Sigma} is finite.

We illustrate the algorithm (Fig. 4) with help of the following example: we calculate the probability of finding the substructure M4 = {a, a, b, d} (cf. Fig. 1) in random structures of size four, five and six (Fig. 5). The values of SubT denote the cumulative subtotals. The ‘free positions’ are filled with the indicated elements (Fig. 5b and c). For instance, using the fixed order a, b, d, Formula 8: SubT[1][1] refers to the case of an additional ‘a’ at the first free position, SubT[1][2] to the case of finding either ‘a’ or ‘b’ at the first free position, and so on. SubT[2][1] in Figure 5c refers to the case that both free positions are filled with ‘a’, and SubT[2][2] that both free positions are either filled with only ‘a’ or ‘b’ or are filled with ‘a’ and ‘b’ and so on. In the Supplementary information 1.2 it is shown that Formula 8 generates the same results than a brute force algorithm.


Figure 5
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 Example for the algorithm DASSPmset. Calculation of the probability to find the substructure M4 (Fig. 1) in a random structure. (a) of size four, (b) five and (c) six. Note, that re-using SubT[1][1], SubT[1][2], SubT[1][3], and SubT[1][4] from (b) in (c) reduces the time complexity.

 

    3 APPLICATIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 APPLICATIONS
 4 CONCLUSION
 REFERENCES
 
The following examples demonstrate the power and versatility of DASS for a broad range of applications. Of course, many more applications are conceivable, e.g. a new method for the identification of conserved protein interaction networks is outlined in the Supplementary information 2.4.

3.1. Protein domain combinations
Multi-domain proteins are composed of different protein domains that determine the function of the whole protein. Vogel et al. (2004) discussed combinations of two and three domains that recur in different proteins together with different partner domains in a particular functional and spatial relationship. Using the Formula 8 algorithm we analyze the corresponding general problem of protein domain combinations of any size. Domain assignments of the proteins were taken from the SUPERFAMILY database (Madera et al., 2004). In a future work, we will elaborate on the functions of significant domain combinations together with their geometrical arrangement and their distribution in different species. Here we identify different protein domain combinations containing the SH2 domain (Src-homology-2) and/or the PDZ domain (PSD-95, discs large and ZO-1, see Supplementary information 2.1). SH2 domains have previously been identified in a wide range of signaling proteins, often together with other signaling domains (e.g. SH3 domain, Pleckstrin homology domain, protein tyrosin phosphatase domain) or together with scaffold domains (Yaffe, 2002). PDZ domains play a key role in organizing diverse cell signaling assemblies and occur often as multiple copies and together with other signaling domains (Fan and Zhang, 2002). We identified more than 100 significant PDZ modules (P < 10–5) containing PDZ and other domains, which can be classified into different subclasses (Fan and Zhang, 2002). Proteins directly associated with the plasma membrane (ion channels, receptors and cytoskeleton proteins) are among the major cellular targets of PDZ domains. In contrast to normal tissue, in some human tumors (e.g. colon and breast cancer), proteins containing PDZ and LIM domains are overexpressed (Kang et al., 2000). PDZ domains are abundant in animal proteins, yet scarce in yeast, bacteria and plants (Ponting, 1997). We identified similar significant PDZ modules in different multicellular species, suggesting conserved signaling mechanisms in these organisms (see Supplementary information 2.1).

3.2. Protein subcomplexes in S.cerevisiae and E.coli
One level above combinations of protein domains in single proteins are combinations of whole proteins in protein complexes which act as highly specialized cellular molecular machines.

We separately analyzed datasets of the two model organisms S.cerevisiae (Hollunder et al., 2005a) and E.coli (Hollunder et al., 2005b) and identified the underlying modular composition of protein complexes. We argue that subcomplexes represent more reliable protein assemblies than the originally measured complexes. Using the DASS algorithm we identified well-characterized protein assemblies with known functions, for instance, Casein kinase II, 19/22S (regulator of the proteasome), eIF2B and Arp2/3 in S.cerevisiae and the DNA-dependent RNA polymerase complexes and the 2-oxoglutarate dehydrogenase complex in E.coli.

On the other hand, we also identified previously unknown protein assemblies that may fulfill special cellular functions. We characterized the subcomplexes and subcomplex proteins in detail and found that subcomplexes have specific properties that underline their distinct role. For instance, subcomplexes are enriched for essential proteins and the expression of subcomplex proteins is more correlated than the expression of complex proteins in general. Subcomplexes of S.cerevisiae and E.coli are also characterized by a more homogeneous spatial and functional composition, compared with that of the measured host complexes. This property is exploited to propose functions and sub-cellular localizations of unannotated proteins. Proteins of unknown function belonging to known subcomplexes or to complexes with homogeneous functional or spatial composition get assigned the function or localization of this subcomplex.

The result of the protein subcomplex identification analyzing the recently published MALDI-TOF MS data (Krogan et al., 2006) is shown in the Supplementary information 2.2.

3.3. Transcription factor modules in S.cerevisiae
In an accompanying study we analyzed the interaction of transcription factors (TF) for the combinatorial regulation of target genes (Beyer et al., 2006). In that study we developed a scoring system for robustly assigning TFs to their target genes and we subsequently used Formula 8 to determine which TFs cooperatively regulate a significant number of target genes. The analysis of these TF modules provides novel insights into combinatorial transcriptional regulation. However, many TFs bind more than once in the upstream regions of their regulated target genes and it is assumed that repeated occurrence of the same TF increases the repressing or activating signal (Harbison et al., 2004). Thus, in addition to simply identifying sets of TFs acting together, here we also analyze the number of repeated occurrences of TFs in these modules using Formula 8. For this analysis we selected all highly significant TF—target interactions in the yeast S.cerevisiae from Beyer et al. (2006). Interactions with a log-likelihood score larger than six were selected for this study. These 3820 interaction pairs were used to assign TF-binding sites to their target genes. The number of repeats was determined based on the number of binding sites within the first 1000 bp upstream of the start codon. Binding sites were determined using ANN-Spec (Workman and Stormo, 2000) and assuming a significance threshold of P < 10– 4.

Formula 8 found 702 TF-modules, 438 of which were significant with P < 10–4 (see Supplementary information 2.3). Formula 8 finds only 200 significant TF-modules using the same interactions and requiring the same level of significance. This result shows that the number of binding sites carries significant additional information.

Figure 6 shows frequency distributions for selected TFs. It can be seen that some TFs have a strong preference for low-redundancy (e.g. Gat1p), whereas others have a broad distribution of binding site frequencies (e.g. Skn7p). Some TFs always have repeated binding elements. Examples are the drug resistance regulators Pdr1p and Pdr3p, which almost always have at least two binding sites in the upstream regions of their targets. Although the two TFs are assumed to be homologous they do not always regulate the same genes (DeRisi et al., 2000; Zhu and Xiao, 2004). Hence, the distributions of their binding sites in Figure 6 are not identical. However, both TFs primarily require either two or four binding sites, while triple binding sites appear less frequently. This does not come as a surprise, because Pdr1p and Pdr3p bind the palindromic cis-element 5'-TCCGCGGA-3' as dimers (Mamnun et al., 2002). Accordingly, the TF module Pdr1p/Pdr1p/Pdr3p/Pdr3p is highly significant (20 targets, P = 2 x 10–55), whereas a TF module with single copies of the two TFs (i.e. Pdr1p/Pdr3p) does not exist at all. Thus, the TF-modules correctly reveal the stoichiometry of regulatory complexes.


Figure 6
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6 Histogram of binding site frequencies for selected yeast transcription factors.

 
Another interesting TF module is Cbf1p/Cbf1p/Tye7p/Tye7p. To our knowledge, it has not been reported that Cbf1p and Tye7p form heterodimers and no interaction between these TFs was found [iHOP (Hoffman and Valencia, 2004); SGD (Christie et al., 2004); MIPS (Mewes et al., 2004)]; however, the two TFs bind to identical chromosomal locations. Cbf1p's binding motif (8 bases long) is completely contained in Tye7's binding motif (10 bases long). Also, Cbf1p and Tye7p have a significant number of common targets (31 targets, P = 10–43). Hence, in this case it is more likely that Cbf1p and Tye7p regulate their targets in a mutually exclusive way. Further studies would be required to validate the presence of a competitive interaction. It is known that Cbf1p physically interacts with Met4p (Kuras et al., 1996) and that Met31p and Met32p may replace the role of Cbf1p in the regulatory complex (Blaiseau and Thomas, 1998). In support of this, we find a number of significant TF-modules (P < 10–5) involving Cbf1p together with Met4p, Met31p and Met32p. These examples demonstrate that DASS is able to reveal known combinatorial interactions and to discover unknown regulatory modules.


    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 APPLICATIONS
 4 CONCLUSION
 REFERENCES
 
The DASS algorithm represents an efficient method to calculate the statistical significance of substructures in unordered data. It is composed of two independent parts. The first sub-algorithm DASSSub finds all patterns in unordered data. It is especially efficient for hierarchically organized data, a typical property of large datasets in molecular biology. In the second part, the statistical significance of all identified patterns is calculated (DASSP). The algorithms Formula 8 and Formula 8 are developed to handle the two different cases: simple sets or multi-sets. These algorithms perform approximated estimations of statistical significances, valid for large datasets. The Supplementary information 1.1 contains additional detailed discussions of four other models better suited for smaller datasets.

The DASS algorithm has a very broad range of applications. Here we demonstrated the application of DASS to four different types of biological datasets. Modules with low p-values are overrepresented, whereas high p-values indicate underrepresentation. Interestingly, all closed sets in protein complex data (protein subcomplexes) have low p-values (see Supplementary information 2.2). In contrast, in the analyzed TF binding site data are many closed sets (TF modules), which are significantly underrepresented (see Supplementary information 2.3). All resulting modules with significant p-values represent new hypotheses requiring further experimental verification. These analyses will help to better understand the hierarchical and combinatorial nature of cellular processes.


    Acknowledgments
 
The authors thank S. Nikolajewa and M. Hoffmann for discussing statistical aspects and the anonymous referees for important comments. This work has been funded by the Federal Ministry of Education and Research, Germany (grant 0312704E).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Martin Bishop

Received on May 31, 2006; revised on September 28, 2006; accepted on October 1, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 ALGORITHM
 3 APPLICATIONS
 4 CONCLUSION
 REFERENCES
 

    Altschul, S.F. (1990) Basic local alignment search tool. J. Mol. Biol, . 215, 403–410[CrossRef][Web of Science][Medline].

    Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, . 25, 3389–3402[Abstract/Free Full Text].

    Agrawal, R. and Srikant, R. (1994) Fast algorithms for mining association rules. Proceedings of 20th International Conference on Very Large Data Bases, pp. 487–499.

    Aloy, P., et al. (2004) Structure-based assembly of protein complexes in yeast. Science, 303, 2026–2029[Abstract/Free Full Text].

    Andreeva, A., et al. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res, . 32, D226–D229[Abstract/Free Full Text].

    Apic, G., et al. (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol, . 310, 311–325[CrossRef][Web of Science][Medline].

    Beyer, A., et al. (2006) Integrated assessment and prediction of transcription factor binding. PLOS Comput. Biol, . 2, e70[CrossRef][Medline].

    Blaiseau, P.L. and Thomas, D. (1998) Multiple transcriptional activation complexes tether the yeast activator Met4 to DNA. EMBO J, . 17, 6327–6336[CrossRef][Web of Science][Medline].

    Burdick, D., et al. (2001) MAFIA: a maximal frequent itemset algorithm for transactional databases. Proceedings of the 17th International Conference on Data Engineering (ICDE 2001), pp. 443–452.

    Christie, K.R., et al. (2004) Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res, . 32, D311–D314[Abstract/Free Full Text].

    DeRisi, J., et al. (2000) Genome microarray analysis of transcriptional activation in multidrug resistance yeast mutants. FEBS Lett, . 470, 156–160[CrossRef][Web of Science][Medline].

    Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763[Abstract/Free Full Text].

    Fan, J.S. and Zhang, M. (2002) Signaling complex organization by PDZ domain proteins. Neurosignals, 11, 315–321[CrossRef][Web of Science][Medline].

    Gavin, A.-C. and Superti-Furga, G. (2003) Protein complexes and proteome organization from yeast to man. Curr. Opin. Chem. Biol, . 7, 21–27[CrossRef][Web of Science][Medline].

    Harbison, C.T., et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99–104[CrossRef][Medline].

    Hoffmann, R. and Valencia, A. (2004) A gene network for navigating the literature. Nat. Genet, . 36, 664[CrossRef][Web of Science][Medline].

    Hollunder, J., et al. (2005a) Identification and characterization of protein subcomplexes in yeast. Proteomics, 5, 2082–2089[CrossRef][Web of Science][Medline].

    Hollunder, J., et al. (2005b) Exploiting combinatorial complexity—searching for new functional entities in the cell. Proceedings of Foundation of System Biology in Engineering (FOSBE 2005), pp. 363–366.

    Ihmels, J., et al. (2004) Defining transcription modules using large-scale gene expression data. Bioinformatics, 20, 1993–2003[Abstract/Free Full Text].

    Kang, S., et al. (2000) PCD1, a novel gene containing PDZ and LIM domains, is overexpressed in several human cancers. Cancer Res, . 60, 5296–5302[Abstract/Free Full Text].

    Krogan, N.J., et al. (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 440, 637–643[CrossRef][Medline].

    Kuras, L., et al. (1996) A heteromeric complex containing the centromere binding factor 1 and two basic leucine zipper factors, Met4 and Met28, mediates the transcription activation of yeast sulfur metabolism. EMBO J, . 15, 2519–2529[Web of Science][Medline].

    Li, S., et al. (2004) A map of the interactome network of the metazoan C. elegans. Science, 303, 540–543[Abstract/Free Full Text].

    Madera, M., et al. (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res, . 32, D235–D239[Abstract/Free Full Text].

    Mamnun, Y.M., et al. (2002) The yeast zinc finger regulators Pdr1p and Pdr3p control pleiotropic drug resistance (PDR) as homo- and heterodimers in vivo. Mol. Microbiol, . 46, 1429–1440[CrossRef][Web of Science][Medline].

    Mewes, H.W., et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res, . 32, D41–D44[Abstract/Free Full Text].

    Ouzounis, C.A. and Valencia, A. (2003) Early bioinformatics: the birth of a discipline—a personal view. Bioinformatics, 19, 2176–2190[Abstract/Free Full Text].

    Pasquier, N., et al. (1999) Discovering frequent closed itemsets for association rules. Proceedings of the 7th International Conference on Database Theory (ICDT 1999), Lecture Notes in Computer Science Springer Vol. 1540, , pp. 398–416.

    Pei, J., et al. (2000) CLOSET: an efficient algorithm for mining frequent closed itemsets. Proceedings of ACM SIGMOD International Workshop on Data Mining and Knowledge Discovery, (DMKD 2000), pp. 21–30.

    Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448[Abstract/Free Full Text].

    Ponting, C.P. (1997) Evidence for PDZ domains in bacteria, yeast, and plants. Protein Sci, . 6, 464–468[Web of Science][Medline].

    Robinson, M.D., et al. (2002) FunSpec: a web-based cluster interpreter for yeast. BMC Bioinformatics, 3, 35[CrossRef][Medline].

    Sharan, R., et al. (2005) Conserved patterns of protein interaction in multiple species. Proc. Natl Acad. Sci. USA, 102, 1974–1979[Abstract/Free Full Text].

    Stormo, G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 16–23[Abstract/Free Full Text].

    Tompa, M. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol, . 23, 137–144[CrossRef][Web of Science][Medline].

    Vogel, C., et al. (2004) Supra-domains. Evolutionary units larger than protein domains. J. Mol. Biol, . 336, 809–823[CrossRef][Web of Science][Medline].

    Wilhelm, T., et al. (2003) Physical and functional modularity of the protein network in yeast. Mol. Cell. Prot, . 2.5, 292–298.

    Workman, C.T. and Stormo, G.D. (2000) ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac. Symp. Biocomput, . 5, 467–478.

    Yaffe, M.B. (2002) Phosphotyrosine-binding domains in signal transduction. Nat. Rev. Mol. Cell Biol, . 3, 177–186[CrossRef][Web of Science][Medline].

    Zaki, M.J. and Hsiao, C.-J. (2002) CHARM: an efficient algorithm for closed itemset mining. Proceedings of the 2nd SIAM International Conference on Data Mining (SDM 2002), pp. 457–473.

    Zhu, Y. and Xiao, W. (2004) Pdr3 is required for DNA damage induction of MAG1 and DDI1 via a bi-directional promoter element. Nucleic Acids Res, . 32, 5066–5075[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
C. C. Friedel and R. Zimmer
Identifying the topology of protein complexes from affinity purification assays
Bioinformatics, August 15, 2009; 25(16): 2140 - 2146.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/1/77    most recent
btl511v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Hollunder, J.
Right arrow Articles by Wilhelm, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Hollunder, J.
Right arrow Articles by Wilhelm, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?