Bioinformatics Advance Access originally published online on October 10, 2006
Bioinformatics 2007 23(1):77-83; doi:10.1093/bioinformatics/btl511
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
DASS: efficient discovery and p-value calculation of substructures in unordered data
1 Department of Theoretical Systems Biology, Leibniz Institute for Age ResearchFritz-Lipmann-Institute e. V. (former IMB Jena) Beutenbergstrasse 11, D-07745 Jena, Germany
2 Department of Bioengineering, University of California San Diego, La Jolla, CA 92093, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Pattern identification in biological sequence data is one of the main objectives of bioinformatics research. However, few methods are available for detecting patterns (substructures) in unordered datasets. Data mining algorithms mainly developed outside the realm of bioinformatics have been adapted for that purpose, but typically do not determine the statistical significance of the identified patterns. Moreover, these algorithms do not exploit the often modular structure of biological data.
Results: We present the algorithm DASS (Discovery of All Significant Substructures) that first identifies all substructures in unordered data (DASSSub) in a manner that is especially efficient for modular data. In addition, DASS calculates the statistical significance of the identified substructures, for sets with at most one element of each type (DASSPset), or for sets with multiple occurrence of elements (DASSPmset). The power and versatility of DASS is demonstrated by four examples: combinations of protein domains in multi-domain proteins, combinations of proteins in protein complexes (protein subcomplexes), combinations of transcription factor target sites in promoter regions and evolutionarily conserved protein interaction subnetworks.
Availability: The program code and additional data are available at http://www.fli-leibniz.de/tsb/DASS
Contact: wilhelm{at}fli-leibniz.de
Supplementary information: Supplementary information is available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Pattern recognition is one of the most fundamental themes in bioinformatics (Ouzounis and Valencia, 2003). It comprises different problem classes, such as binding site discovery (Stormo, 2000; Tompa et al., 2005), multiple sequence alignment [e.g. hidden Markov models (Eddy, 1998)], classification of proteins [e.g. SCOP (Andreeva and Srikant, 1994), SUPERFAMILY (Madera et al., 2004)], characterization of protein complexes (Aloy et al., 2004; Hollunder et al., 2005a) and functional annotation of proteins [e.g. FunSpec (Robinson et al., 2002)]. Numerous algorithms exist to analyze biological sequences [e.g. FASTA (Pearson and Lipman, 1988), BLAST (Altschul et al., 1990) and PSI-BLAST (Altschul et al., 1997)]. However, only few methods deal with non-sequential (unordered) data.
Although lacking sequential order, such data nevertheless may contain hidden structure and hierarchically organized features. Hierarchical organization and modularity are common in biology and can be observed at many levels of cellular organization (Gavin and Superti-Furga, 2003). For instance, protein complexes are frequently composed of different protein subcomplexes, also called protein modules or molecular machines (Wilhelm et al., 2003; Hollunder et al., 2005a). By combining proteins and protein modules into specific complexes the cell achieves a higher degree of versatility, flexibility and robustness. In addition, the proteins used in these complexes are themselves composed of different protein domains (Apic et al., 2001; Vogel et al., 2004) that are specifically combined to define a protein's structure and function. Other examples are the combinatorial transcriptional gene regulation (Ihmels et al., 2004; Beyer et al., 2006) and modules of interacting proteins conserved across species that can be used as building blocks of biological networks and pathways (Li et al., 2004; Sharan et al., 2005). Therefore, it is important to develop efficient methods for the investigation of such modular and hierarchical structures.
Here we present algorithms for the detection and scoring of structures in data that can be represented as sets, e.g. sets of proteins or sets of protein domains. Most existing pattern recognition algorithms for unordered data are variants of the APRIORI algorithm (Agrawal and Srikant, 1994). Here a breadth-first search is used to iterate through the items (elements) of each set, considering a threshold for the item frequencies in the result sets. In the first iteration, all items below this frequency threshold are removed and all combinations of the remaining items are tested (two-itemset candidates). Frequent itemsets at the given level are extended by more items to generate candidate sets for the next iteration and infrequent subsets are discarded. Recent approaches use similar search strategies with item-based data structures to reduce the search space. Three classes of algorithms can be distinguished: those finding all frequent patterns (sets), e.g. APRIORI (Agrawal and Srikant, 1994), those determining only maximal frequent sets (a set with no frequent superset containing this set), e.g. MAFIA (Burdick et al., 2001), and algorithms finding frequent closed sets [a set is closed if there exists no superset with the same frequency (Pei et al., 2000)], e.g. CHARM (Zaki and Hsiao, 2002), Close (Pasquier et al., 1999), and CLOSET (Pei et al., 2000). All of these approaches require a user-specified threshold for the set frequency. We are not aware of any algorithm quantifying the statistical significance of the observed patterns. Depending on the frequency of the constitutive elements, abundant patterns might not be statistically significant, whereas patterns with a comparably low abundance may be significant.
In Section 2, we present an approach that detects patterns in unordered hierarchical data and estimates the statistical significance of these patterns without requiring a pattern frequency threshold. DASS (Discovery of All Significant Substructures) comprises two independent algorithms: the first (DASSSub, Section 2.2.1) performs a fast search to find all closed sets. Closed sets provide a concise representation of the data, because the number of closed sets is typically much smaller than the number of all contained patterns. In contrast to existing algorithms, which iterate through the set elements, DASSSub iterates through the sets (structures) themselves. This approach allows for efficient pruning when analyzing hierarchically structured data and provides a superior performance compared with other algorithms. In the second algorithm (Section 2.2.2) the statistical significance of all closed sets is calculated. Two cases are distinguished: (1) sets with unique elements and (2) sets with multiple occurrence of the same elements (multi-sets). In each case DASS calculates the probability (p-value) that the considered closed set is found in a randomized dataset with the same or higher frequency.
The DASS algorithm has numerous potential applications. Section 3 outlines applications to protein domain combinations, protein complexes, and transcriptional modules. As an additional application, a new method to find conserved subnetworks is presented in the Supplementary information 2.4.
| 2 ALGORITHM |
|---|
|
|
|---|
2.1 Problem formulation and preliminaries
Let
= {e1, e2, ... , e|
|} be a finite alphabet of elements ek. We consider sets Si = {(E, m), E
} (e.g. measured protein complexes), where E is the (unordered) list of elements ek contained in Si and m : E
+ is a function mapping the ek onto their frequencies in Si. Thus, m(ek) is the frequency of an element ek in Si and
is the size of Si. If ek may occur more than once, Si is a multi-set with m(ek)
1. The dataset
= {S1, S2, ... , S|
|} comprises a finite set of structures Si (e.g. Fig. 1a). For instance,
could be a collection of measured protein complexes, where each structure Si comprises a set of proteins and contains all different proteins. The frequency of an element ek in
is defined as
. Thus, we can compute the relative frequency of ek as fk = Fk /N, where N is the total number of all elements in
. See Figure 1b for an example.
|
Let the substructure Mj
Si be a non-empty set or multi-set of
elements and
(Mj)
be the subset of all structures from
containing the specific substructure Mj (host structures). The frequency Fj = |
(Mj)| of a substructure Mj is the number of structures in the dataset
containing Mj (e.g. Fig. 1c). If each Si corresponds to an independent measurement, a high frequency Fj indicates a large number of measurements confirming the existence of the substructure Mj.
DEFINITION 1
(closed set). A substructure MX is called a closed set if there exists no substructure MY such that MXMY and FX = FY. In other words, MX is called closed if there exists no superset MY
MX with the same frequency.
Obviously, each structure Si itself is a closed set.
= {M1, M2, ... , M|
|} is the set of all closed sets of the dataset
(e.g. Fig. 1d). The first part of the DASS algorithm finds
for a given dataset
and the second part computes the statistical significance of each Mj based on its frequency (e.g. Fig. 1e).
2.2 The DASS Algorithm
2.2.1 Finding all substructures
The algorithm DASSSub finds
in a given dataset
by systematically intersecting every structure with all other structures. Figure 2 shows the pseudocode for the algorithm (without frequency calculations for clarity). The algorithm starts with an empty set
(Fig. 2, line 1) and works by using two loops (i-loop and j-loop), and the helper-set H. The i-loop iterates through the structures Si (lines 212), the j-loop iterates through the current
(lines 59). H is reinitialized in every i-loop with the current structure Si (line 4) and then collects all intersections of Si with all substructures Mj in the current
(line 7). After finishing the j-loop,
is updated with H (line 10). Finally,
contains all closed sets from
(line 13). This can be proven by taking into account that all structures Si are closed sets, all intersections of two closed sets are closed again, and the complete intersection tree (all combinations of intersections of structures) is traversed by the algorithm. Figure 3 outlines the algorithm for the example of Figure 1.
|
|
As mentioned above, this algorithm iterates through the structures Si and not through the elements of
. Compared with other data mining algorithms (Zaki and Hsiao, 2002; Pasquier et al., 1999; Pei et al., 2000), DASSSub has the advantage of finding all closed sets without needing to test that the sets are closed. Generally, given
= {S1, S2, ... , S|
|} with |
| structures,
contains 2|
| 1 sets in the worst case.
However, DASSSub is particularly fast if newly added substructures Mj in
are often identical to already existing sets, because this reduces the number of intersections in subsequent j-loops. Fortunately, hierarchically organized biological data have exactly this property: the frequency of substructures (i.e. modules) is quite large compared with the number of different substructures. This vastly reduces the computation time of DASSSub. For instance,
in Figure 1d contains only five different sets, because
. Algorithms iterating through elements from
do not exploit the hierarchical organization of biological data.
Finally, we compute the frequency Fj of each substructure Mj
.
2.2.2 Statistical significance of all substructures
The calculation of the statistical significance of substructures can be divided into two different problems: (1) structures with unique elements (simple sets)
and (2) structures with possible multiple occurrence of the same elements
. The appropriateness of the respective approach depends on the dataset and on the biological question.
In each case (simple set or multi-set) our underlying null model assumes that all elements ek from
are randomly assigned to the structures Si, taking into account their relative frequencies fk. The exact way of determining the statistical significance is to generate all possible permutations of a given dataset
[Supplementary information 1.1.1 (a)]. However, this is only practicable for very small datasets. A good approximation for the exact model is provided by shuffling the data. This procedure maintains all structure sizes and re-assigns all elements from the element pool (containing all ek with their frequencies fk) [Supplementary information 1.1.1 (b)]. After each randomization the frequencies of all substructures are counted. The p-value of a given structure is the number of randomizations containing the substructure at least as often as its frequency in original data
divided by the total number of randomizations [Supplementary information 1.1.1 (b)Model IIA]. For very improbable substructures this procedure is very costly, because many randomizations are needed. The corresponding p-value calculation can be improved by fitting for a given substructure its frequency distribution [Supplementary information 1.1.1 (b)Model IIB]. However, because this procedure is still computationally demanding, we developed another shuffling model suited for fast analytical calculations, which is described in the following. It is based on the average probability
to find a substructure in the randomized data. The p-value is calculated with a binomial distribution using this average probability and the observed frequency of the substructure. The binomial distribution (
) is an exact test to calculate the probability of at least k successes in n Bernoulli experiments with the probability
of success. A detailed discussion of all these models and a validation of our final analytical model is given in the Supplementary information 1.1.
Both
and
estimate the probability, pj (ni), of finding Mj in a random set of size ni for all structure sizes ni
nj contained in the dataset
. The average probability
of finding Mj in such a random set is
![]() | (1) |
is the total number of structures with size
nj. The product k i· pj (ni) is the expectation value to find Mj in the structures of size ni. The average probability
and the observed frequency Fj of Mj are used to compute the statistical significance based on a binomial distribution. The probability of observing Mj at least Fj times in a randomized dataset Drandom thus reads:
![]() | (2) |
(i)
. If each element ek can occur at most once in a set [
is m(ek) = 1], the number of different permutations in a random set Sr of nj elements is nj! and the probability of finding a substructure Mj in a random set Sr of nj elements is:
![]() | (3) |
Next, the probability pj of finding a substructure Mj in a random set Sr with nr
nj can be calculated as 1 minus the probability of not finding Mj in
random experiments:
![]() | (4) |
nj contained in the given dataset.
(ii)
. We now consider the case where the Mj are multi-sets. The probability pj (nr) of finding Mj in a random set Sr of size nr
nj can be calculated by a summation of the probabilities of all possible outcomes containing Mj at least once. A brute force algorithm would generate all |
|nr possible outcomes to calculate this probability. Of course, this is applicable only for a small number of structures and a small alphabet of elements. We present an efficient, iterative algorithm to estimate the probability of finding Mj in Sr for successively increasing set sizes. Intermediate results of previous iteration steps are efficiently re-used to avoid unnecessary recalculations.
The number of different permutations of a given substructure Mj is:
![]() | (5) |
![]() | (6) |
![]() | (7) |
If nr > nj Equation (7) has to be expanded by an additional factor
> 1, accounting for the higher probabilities for the occurrence of Mj in larger sets.
monotonically increases with the number of free positions. For the general case nr
nj the probability pj (nr) is calculated by
![]() | (8) |
.
All random sets Sr containing Mj have nj fixed elements. The remaining nd = nr nj positions can be filled with any element from
. The purpose of
in Equation (8) is to account for the different ways in which these free positions can be filled. For instance, if nd = 1 then pj (nr) = pj (nj + 1) can be calculated by summing up all possible outcomes with one additional element, which may be an element of Mj or not. A brute force algorithm for the generation of all outcomes is very costly. To make the calculation more efficient, we recursively calculate
as a function of its predecessors, i.e.
(Mj, nr) = f [
(Mj, nr 1)]. Additionally, we use an efficient factor out technique, where the subtotals are re-used extensively throughout the procedure (intermediate values are stored in a matrix SubT). Figure 4 shows the pseudo-code for the complete algorithm DASSPmset. The code also demonstrates the efficient calculation of the factor
.
|
At first DASSPmset computes the probability pj(nj) (Fig. 4, lines 15). Next, DASSPmset defines a composed set
containing all elements of a given substructure Mj and the pseudo-element
where
is a placeholder for all elements ek
\Mj. Thus, the relative frequency of
in
is
In an outer loop (lines 715) the algorithm iterates through the nd free positions, which can be occupied by any element from
. In the inner k-loop (lines 814) the algorithm iterates through the different elements of
always in a predefined fixed order and stores all subtotals in a matrix SubT (line 13). The matrix SubT contains the corresponding subtotal for each possible free position and element ek
and is used in later steps for determining the factor
. The most inner loop (lines 1012) re-uses the subtotals in SubT to compute the subtotal for the given element ek. This loop is only used if nd > 1. Finally,
(line 16) and the corresponding probability pj (nr) (line 17) are computed.
Due to the pre-calculation of partial results, the computation time is polynomial
, instead of the exponential time required for the brute force algorithm
: O(nd) (loop of line 6), O(|
|) (loop of line 7) and
(loop of line 9)].
As in the simple set case above, also in the multi-set case the pj (nr) calculation is only an approximation (sampling with replacement). However, in most cases it leads to a conservative estimate of the exact p-value [see Supplementary information 1.1.1 (d)], the accuracy increases with an increasing number of elements in
and would be exact for an infinite number of elements. In most biological applications the number of elements in
is much larger than the largest structure Si. Thus,
is generally a very good approximation even if
is finite.
We illustrate the algorithm (Fig. 4) with help of the following example: we calculate the probability of finding the substructure M4 = {a, a, b, d} (cf. Fig. 1) in random structures of size four, five and six (Fig. 5). The values of SubT denote the cumulative subtotals. The free positions are filled with the indicated elements (Fig. 5b and c). For instance, using the fixed order a, b, d,
: SubT[1][1] refers to the case of an additional a at the first free position, SubT[1][2] to the case of finding either a or b at the first free position, and so on. SubT[2][1] in Figure 5c refers to the case that both free positions are filled with a, and SubT[2][2] that both free positions are either filled with only a or b or are filled with a and b and so on. In the Supplementary information 1.2 it is shown that
generates the same results than a brute force algorithm.
|
| 3 APPLICATIONS |
|---|
|
|
|---|
The following examples demonstrate the power and versatility of DASS for a broad range of applications. Of course, many more applications are conceivable, e.g. a new method for the identification of conserved protein interaction networks is outlined in the Supplementary information 2.4.
3.1. Protein domain combinations
Multi-domain proteins are composed of different protein domains that determine the function of the whole protein. Vogel et al. (2004) discussed combinations of two and three domains that recur in different proteins together with different partner domains in a particular functional and spatial relationship. Using the
algorithm we analyze the corresponding general problem of protein domain combinations of any size. Domain assignments of the proteins were taken from the SUPERFAMILY database (Madera et al., 2004). In a future work, we will elaborate on the functions of significant domain combinations together with their geometrical arrangement and their distribution in different species. Here we identify different protein domain combinations containing the SH2 domain (Src-homology-2) and/or the PDZ domain (PSD-95, discs large and ZO-1, see Supplementary information 2.1). SH2 domains have previously been identified in a wide range of signaling proteins, often together with other signaling domains (e.g. SH3 domain, Pleckstrin homology domain, protein tyrosin phosphatase domain) or together with scaffold domains (Yaffe, 2002). PDZ domains play a key role in organizing diverse cell signaling assemblies and occur often as multiple copies and together with other signaling domains (Fan and Zhang, 2002). We identified more than 100 significant PDZ modules (P < 105) containing PDZ and other domains, which can be classified into different subclasses (Fan and Zhang, 2002). Proteins directly associated with the plasma membrane (ion channels, receptors and cytoskeleton proteins) are among the major cellular targets of PDZ domains. In contrast to normal tissue, in some human tumors (e.g. colon and breast cancer), proteins containing PDZ and LIM domains are overexpressed (Kang et al., 2000). PDZ domains are abundant in animal proteins, yet scarce in yeast, bacteria and plants (Ponting, 1997). We identified similar significant PDZ modules in different multicellular species, suggesting conserved signaling mechanisms in these organisms (see Supplementary information 2.1).
3.2. Protein subcomplexes in S.cerevisiae and E.coli
One level above combinations of protein domains in single proteins are combinations of whole proteins in protein complexes which act as highly specialized cellular molecular machines.
We separately analyzed datasets of the two model organisms S.cerevisiae (Hollunder et al., 2005a) and E.coli (Hollunder et al., 2005b) and identified the underlying modular composition of protein complexes. We argue that subcomplexes represent more reliable protein assemblies than the originally measured complexes. Using the DASS algorithm we identified well-characterized protein assemblies with known functions, for instance, Casein kinase II, 19/22S (regulator of the proteasome), eIF2B and Arp2/3 in S.cerevisiae and the DNA-dependent RNA polymerase complexes and the 2-oxoglutarate dehydrogenase complex in E.coli.
On the other hand, we also identified previously unknown protein assemblies that may fulfill special cellular functions. We characterized the subcomplexes and subcomplex proteins in detail and found that subcomplexes have specific properties that underline their distinct role. For instance, subcomplexes are enriched for essential proteins and the expression of subcomplex proteins is more correlated than the expression of complex proteins in general. Subcomplexes of S.cerevisiae and E.coli are also characterized by a more homogeneous spatial and functional composition, compared with that of the measured host complexes. This property is exploited to propose functions and sub-cellular localizations of unannotated proteins. Proteins of unknown function belonging to known subcomplexes or to complexes with homogeneous functional or spatial composition get assigned the function or localization of this subcomplex.
The result of the protein subcomplex identification analyzing the recently published MALDI-TOF MS data (Krogan et al., 2006) is shown in the Supplementary information 2.2.
3.3. Transcription factor modules in S.cerevisiae
In an accompanying study we analyzed the interaction of transcription factors (TF) for the combinatorial regulation of target genes (Beyer et al., 2006). In that study we developed a scoring system for robustly assigning TFs to their target genes and we subsequently used
to determine which TFs cooperatively regulate a significant number of target genes. The analysis of these TF modules provides novel insights into combinatorial transcriptional regulation. However, many TFs bind more than once in the upstream regions of their regulated target genes and it is assumed that repeated occurrence of the same TF increases the repressing or activating signal (Harbison et al., 2004). Thus, in addition to simply identifying sets of TFs acting together, here we also analyze the number of repeated occurrences of TFs in these modules using
. For this analysis we selected all highly significant TFtarget interactions in the yeast S.cerevisiae from Beyer et al. (2006). Interactions with a log-likelihood score larger than six were selected for this study. These 3820 interaction pairs were used to assign TF-binding sites to their target genes. The number of repeats was determined based on the number of binding sites within the first 1000 bp upstream of the start codon. Binding sites were determined using ANN-Spec (Workman and Stormo, 2000) and assuming a significance threshold of P < 10 4.
found 702 TF-modules, 438 of which were significant with P < 104 (see Supplementary information 2.3).
finds only 200 significant TF-modules using the same interactions and requiring the same level of significance. This result shows that the number of binding sites carries significant additional information.
Figure 6 shows frequency distributions for selected TFs. It can be seen that some TFs have a strong preference for low-redundancy (e.g. Gat1p), whereas others have a broad distribution of binding site frequencies (e.g. Skn7p). Some TFs always have repeated binding elements. Examples are the drug resistance regulators Pdr1p and Pdr3p, which almost always have at least two binding sites in the upstream regions of their targets. Although the two TFs are assumed to be homologous they do not always regulate the same genes (DeRisi et al., 2000; Zhu and Xiao, 2004). Hence, the distributions of their binding sites in Figure 6 are not identical. However, both TFs primarily require either two or four binding sites, while triple binding sites appear less frequently. This does not come as a surprise, because Pdr1p and Pdr3p bind the palindromic cis-element 5'-TCCGCGGA-3' as dimers (Mamnun et al., 2002). Accordingly, the TF module Pdr1p/Pdr1p/Pdr3p/Pdr3p is highly significant (20 targets, P = 2 x 1055), whereas a TF module with single copies of the two TFs (i.e. Pdr1p/Pdr3p) does not exist at all. Thus, the TF-modules correctly reveal the stoichiometry of regulatory complexes.
|
Another interesting TF module is Cbf1p/Cbf1p/Tye7p/Tye7p. To our knowledge, it has not been reported that Cbf1p and Tye7p form heterodimers and no interaction between these TFs was found [iHOP (Hoffman and Valencia, 2004); SGD (Christie et al., 2004); MIPS (Mewes et al., 2004)]; however, the two TFs bind to identical chromosomal locations. Cbf1p's binding motif (8 bases long) is completely contained in Tye7's binding motif (10 bases long). Also, Cbf1p and Tye7p have a significant number of common targets (31 targets, P = 1043). Hence, in this case it is more likely that Cbf1p and Tye7p regulate their targets in a mutually exclusive way. Further studies would be required to validate the presence of a competitive interaction. It is known that Cbf1p physically interacts with Met4p (Kuras et al., 1996) and that Met31p and Met32p may replace the role of Cbf1p in the regulatory complex (Blaiseau and Thomas, 1998). In support of this, we find a number of significant TF-modules (P < 105) involving Cbf1p together with Met4p, Met31p and Met32p. These examples demonstrate that DASS is able to reveal known combinatorial interactions and to discover unknown regulatory modules.
| 4 CONCLUSION |
|---|
|
|
|---|
The DASS algorithm represents an efficient method to calculate the statistical significance of substructures in unordered data. It is composed of two independent parts. The first sub-algorithm DASSSub finds all patterns in unordered data. It is especially efficient for hierarchically organized data, a typical property of large datasets in molecular biology. In the second part, the statistical significance of all identified patterns is calculated (DASSP). The algorithms
and
are developed to handle the two different cases: simple sets or multi-sets. These algorithms perform approximated estimations of statistical significances, valid for large datasets. The Supplementary information 1.1 contains additional detailed discussions of four other models better suited for smaller datasets. The DASS algorithm has a very broad range of applications. Here we demonstrated the application of DASS to four different types of biological datasets. Modules with low p-values are overrepresented, whereas high p-values indicate underrepresentation. Interestingly, all closed sets in protein complex data (protein subcomplexes) have low p-values (see Supplementary information 2.2). In contrast, in the analyzed TF binding site data are many closed sets (TF modules), which are significantly underrepresented (see Supplementary information 2.3). All resulting modules with significant p-values represent new hypotheses requiring further experimental verification. These analyses will help to better understand the hierarchical and combinatorial nature of cellular processes.
| Acknowledgments |
|---|
The authors thank S. Nikolajewa and M. Hoffmann for discussing statistical aspects and the anonymous referees for important comments. This work has been funded by the Federal Ministry of Education and Research, Germany (grant 0312704E).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on May 31, 2006; revised on September 28, 2006; accepted on October 1, 2006
| REFERENCES |
|---|
|
|
|---|
Altschul, S.F. (1990) Basic local alignment search tool. J. Mol. Biol, . 215, 403410[CrossRef][ISI][Medline].
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, . 25, 33893402
Agrawal, R. and Srikant, R. (1994) Fast algorithms for mining association rules. Proceedings of 20th International Conference on Very Large Data Bases, pp. 487499.
Aloy, P., et al. (2004) Structure-based assembly of protein complexes in yeast. Science, 303, 20262029
Andreeva, A., et al. (2004) SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res, . 32, D226D229
Apic, G., et al. (2001) Domain combinations in archaeal, eubacterial and eukaryotic proteomes. J. Mol. Biol, . 310, 311325[CrossRef][ISI][Medline].
Beyer, A., et al. (2006) Integrated assessment and prediction of transcription factor binding. PLOS Comput. Biol, . 2, e70[CrossRef][Medline].
Blaiseau, P.L. and Thomas, D. (1998) Multiple transcriptional activation complexes tether the yeast activator Met4 to DNA. EMBO J, . 17, 63276336[CrossRef][ISI][Medline].
Burdick, D., et al. (2001) MAFIA: a maximal frequent itemset algorithm for transactional databases. Proceedings of the 17th International Conference on Data Engineering (ICDE 2001), pp. 443452.
Christie, K.R., et al. (2004) Saccharomyces Genome Database (SGD) provides tools to identify and analyze sequences from Saccharomyces cerevisiae and related sequences from other organisms. Nucleic Acids Res, . 32, D311D314
DeRisi, J., et al. (2000) Genome microarray analysis of transcriptional activation in multidrug resistance yeast mutants. FEBS Lett, . 470, 156160[CrossRef][ISI][Medline].
Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755763
Fan, J.S. and Zhang, M. (2002) Signaling complex organization by PDZ domain proteins. Neurosignals, 11, 315321[CrossRef][ISI][Medline].
Gavin, A.-C. and Superti-Furga, G. (2003) Protein complexes and proteome organization from yeast to man. Curr. Opin. Chem. Biol, . 7, 2127[CrossRef][ISI][Medline].
Harbison, C.T., et al. (2004) Transcriptional regulatory code of a eukaryotic genome. Nature, 431, 99104[CrossRef][Medline].
Hoffmann, R. and Valencia, A. (2004) A gene network for navigating the literature. Nat. Genet, . 36, 664[CrossRef][ISI][Medline].
Hollunder, J., et al. (2005a) Identification and characterization of protein subcomplexes in yeast. Proteomics, 5, 20822089[CrossRef][ISI][Medline].
Hollunder, J., et al. (2005b) Exploiting combinatorial complexitysearching for new functional entities in the cell. Proceedings of Foundation of System Biology in Engineering (FOSBE 2005), pp. 363366.
Ihmels, J., et al. (2004) Defining transcription modules using large-scale gene expression data. Bioinformatics, 20, 19932003
Kang, S., et al. (2000) PCD1, a novel gene containing PDZ and LIM domains, is overexpressed in several human cancers. Cancer Res, . 60, 52965302
Krogan, N.J., et al. (2006) Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 440, 637643[CrossRef][Medline].
Kuras, L., et al. (1996) A heteromeric complex containing the centromere binding factor 1 and two basic leucine zipper factors, Met4 and Met28, mediates the transcription activation of yeast sulfur metabolism. EMBO J, . 15, 25192529[ISI][Medline].
Li, S., et al. (2004) A map of the interactome network of the metazoan C. elegans. Science, 303, 540543
Madera, M., et al. (2004) The SUPERFAMILY database in 2004: additions and improvements. Nucleic Acids Res, . 32, D235D239
Mamnun, Y.M., et al. (2002) The yeast zinc finger regulators Pdr1p and Pdr3p control pleiotropic drug resistance (PDR) as homo- and heterodimers in vivo. Mol. Microbiol, . 46, 14291440[CrossRef][ISI][Medline].
Mewes, H.W., et al. (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res, . 32, D41D44
Ouzounis, C.A. and Valencia, A. (2003) Early bioinformatics: the birth of a disciplinea personal view. Bioinformatics, 19, 21762190
Pasquier, N., et al. (1999) Discovering frequent closed itemsets for association rules. Proceedings of the 7th International Conference on Database Theory (ICDT 1999), Lecture Notes in Computer Science Springer Vol. 1540, , pp. 398416.
Pei, J., et al. (2000) CLOSET: an efficient algorithm for mining frequent closed itemsets. Proceedings of ACM SIGMOD International Workshop on Data Mining and Knowledge Discovery, (DMKD 2000), pp. 2130.
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 24442448
Ponting, C.P. (1997) Evidence for PDZ domains in bacteria, yeast, and plants. Protein Sci, . 6, 464468[Abstract].
Robinson, M.D., et al. (2002) FunSpec: a web-based cluster interpreter for yeast. BMC Bioinformatics, 3, 35[CrossRef][Medline].
Sharan, R., et al. (2005) Conserved patterns of protein interaction in multiple species. Proc. Natl Acad. Sci. USA, 102, 19741979
Stormo, G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 1623
Tompa, M. (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat. Biotechnol, . 23, 137144[CrossRef][ISI][Medline].
Vogel, C., et al. (2004) Supra-domains. Evolutionary units larger than protein domains. J. Mol. Biol, . 336, 809823[CrossRef][ISI][Medline].
Wilhelm, T., et al. (2003) Physical and functional modularity of the protein network in yeast. Mol. Cell. Prot, . 2.5, 292298.
Workman, C.T. and Stormo, G.D. (2000) ANN-Spec: a method for discovering transcription factor binding sites with improved specificity. Pac. Symp. Biocomput, . 5, 467478.
Yaffe, M.B. (2002) Phosphotyrosine-binding domains in signal transduction. Nat. Rev. Mol. Cell Biol, . 3, 177186[CrossRef][ISI][Medline].
Zaki, M.J. and Hsiao, C.-J. (2002) CHARM: an efficient algorithm for closed itemset mining. Proceedings of the 2nd SIAM International Conference on Data Mining (SDM 2002), pp. 457473.
Zhu, Y. and Xiao, W. (2004) Pdr3 is required for DNA damage induction of MAG1 and DDI1 via a bi-directional promoter element. Nucleic Acids Res, . 32, 50665075
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

MY and FX = FY. In other words, MX is called closed if there exists no superset MY
MX with the same frequency.










x 
