Computational Genomics
Family relationships: should consensus reign?consensus clustering for protein families
CNRS/LaBRI, Université Bordeaux 1 351 cours de la Libération, 33405 Talence Cedex, France
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Reliable identification of protein families is key to phylogenetic analysis, functional annotation and the exploration of protein function diversity in a given phylogenetic branch. As more and more complete genomes are sequenced, there is a need for powerful and reliable algorithms facilitating protein families construction.
Results: We have formulated the problem of protein families construction as an instance of consensus clustering, for which we designed a novel algorithm that is computationally efficient in practice and produces high quality results. Our algorithm uses an election method to construct consensus families from competing clustering computations.
Our consensus clustering algorithm is tailored to serve the specific needs of comparative genomics projects. First, it provides a robust means to incorporate results from different and complementary clustering methods, thus avoiding the need for an a priori choice that may introduce computational bias in the results. Second, it is suited to large-scale projects due to the practical efficiency. And third, it produces high quality results where families tend to represent groupings by biological function.
Availability: This method has been used for Génolevures project to compute protein families of Hemiascomycetous yeasts. The data are available online at http://cbi.labri.fr/Genolevures/fam/
Supplementary information: Supplementary data are available at http://cbi.labri.fr/Genolevures/fam/
Contact: macha{at}labri.fr
| 1 BACKGROUND |
|---|
|
|
|---|
1.1 Overview
When confronted with computation of protein families, one must decide which algorithm to use, evaluate the quality of the result and decide which computed familiesif anycorrespond to real protein families. We argue that it is the comparison of different results that produces the set of the most plausible families.
Algorithmic means for establishing protein families on a large scale are based on a notion of similarity. Often the only available similarity is sequence similarity; some authors go so far as to define the notion of a protein family itself via the similarity of sequences (Doolittle, 1981). Use of structural similarities (Holm and Sander, 1996) is currently limited by the relatively small number of structures available in PDB (Berman et al., 2000), but by studying sequence variation among members of such families (Koehl and Levitt, 2002) one can guide the use of sequence similarity information.
To perform similarity-based detection of families one first selects the pairwise similarities to be used and then applies an algorithm that uses these similarities to group proteins into families. Algorithms for detection of protein families are unsupervised [see for example van Dongen (2000a)] or supervised [see for example Grundy et al., (1997) and Bejerano and Yona (2001)] learning procedures. Pairwise similarities can be provided by sequence alignments [BLAST (Altschul et al., 1990), Smith-Waterman (Smith and Waterman, 1981)] or by more sophisticated approaches using domain architecture databases (Bateman et al., 2002; Servant et al., 2002; Mulder et al., 2003). Other approaches use domain order as a fingerprint for the protein (Geer et al., 2002).
Methods that quantify similarity by using some attribute of the best BLAST hit and use single-linkage clustering are not always successful. Such straightforward approaches may group together dissimilar multidomain proteins that share a common domain (Smith and Zhang 1997), and can be fooled by promiscuous domains (Doolittle, 1995; Marcotte et al., 1999). Several graph-based methods have been proposed to overcome some of these limitations of single-linkage clustering (Matsuda et al., 1999; Enright and Ouzounis, 2000; Enright et al., 2002).
Reliability of computed clusters has been assessed by Zhang and Zhao, who studied the reliability of hierarchical clusters from gene expression data (Zhang and Zhao, 2000) and proposed a resampling algorithm for building a consensus tree, and van Dongen (van Dongen, 2000b), who introduced criteria for evaluating clustering results using a novel inter-cluster distance. It must be noted that the quality of a protein family computation cannot be truly assessed in the absence of an objective external criterion based on biological knowledge.
Here we describe a method that integrates results of multiple classifications in a single scheme, using an election algorithm. First, we formulate the problem as an instance of consensus clustering. We propose a heuristic that efficiently realizes the computation in practice. Finally, we compare our method with others and illustrate the quality of results in the case of protein families computed for Génolevures Hemiascomycetous yeasts.
1.2 Agreement between clusterings
DEFINITION 1.
A clusteringis a partition of a dataset D = {di} into disjoint subsets
1, ... ,
k called clusters, such that
.
In what follows we will use the terms clustering and partition, and cluster and subset, interchangeably.
Let the number of elements in D and in cluster
i be n and ni respectively. Let
' be a second partition of D,
, with cluster sizes
.
Most criteria for comparing clusterings are based on the confusion matrix (Kohavi and Provost, 1998), which measures the size of the intersection between the clusterings.
![]() |
Traditional measures of the proportion of agreements between two clusterings are the Rand index (Rand, 1971) and the Jaccard index (Jaccard, 1908). As the expected value of the Rand index does not take a constant value1, the adjusted Rand index (Hubert and Arabie, 1985) is preferred; it assumes the generalized hypergeometric distribution as the model of randomness and provides an index bounded above by 1 with a constant expected value 0.
A well-known measure on the space of partitions is the equivalence mismatch coefficient emc [(Mirkin, 1996) p. 241], which is precisely the Hamming distance for binary vectors if the set of all pairs
di,dj
is enumerated, and a partition is represented by a characteristic vector defined on this enumeration. One may interpret emc in terms of edge numbers of complete graphs, where it represents the number of node moves needed to convert one partition into another.
Another category of criteria for comparing clusterings is based on set cardinality. The projection number
of
onto
', defined as
, is an asymetric index of the degree of refinement of one partition versus another. The same index normalized by n is introduced in Ben-Hur and Guyon (2003) and a symmetric version is defined in Meila (2003). Laresen and Aone (Laresen and Aone, 1999) use the following asymmetric criterion:
![]() |
and
':
![]() | (1) |
and
' in a certain undirected weighted graph constructed on the set of all partitions of D, where two partitions are connected by an edge if one can be obtained from the other by joining two of its sets. The weight of the edge is the size of the smallest of two sets. In this construction d(
,
') is the length of the shortest (weighted) path between
and
'.
1.3 Consensus clustering
Combining the strengths of different clustering algorithms is the focus of research on clustering ensembles. This problem can be seen as finding the median partition with respect to given partitions.
Consensus clustering (CC): Given k partitions,
1, ... ,
k, find a consensus partition
that minimizes
, where d is a distance.
The first mathematical treatment goes back to Régnier (Régnier, 1965), where the problem was transformed into an integer problem and a branch and bound solution was proposed. The problem is proven to be NP-complete (Barthélemy and Leclerc, 1995). Wakabayashi gives an in-depth analysis of the median partition problem in Wakabayashi (1998) and concludes that approximating relations with a transitive one is NP-complete in general.
CC is NP-complete in general, yet it is not known whether it is NP-complete for any particular k. The case of k = 1 is trivial. The case of k = 2 is also simple since any of the partitions solves the problem optimally. Nothing is known for k
3.
If the partition
of the dataset D, |D| = n to discover is not necessarily one of the original partitions
1, ... ,
k, then the size of the potential search space corresponds to the Bell numbers (Bell, 1934) or equivalently to the sum of Stirling numbers of second kind for all m, 1
m
n [see Knuth (1992) on Stirling (1749)].
In the light of its hardness, exactly solving large instances of CC is intractable. Even so, exact methods have been implemented. Grötschel and Wakabayashi (Grötschel and Wakabayashi, 1989) give a cutting planes solution, which works well for n in the hundreds.
A variety of approximations have been applied to this problem. For example Filkov (Filkov and Skiena, 2003) proposes a simple approximation that works by dramatically reducing the search-space.
Factor-2 approximation: Given an instance of CC, select a partition
among partitions
1, ... ,
k that minimizes
.
The time complexity is O(k2n) since it takes O(n) to compute the distance between any two partitions, and there are O(k2) pairs. This algorithm is a factor-2 approximation to CC.
Many heuristics based on local search have been studied in application to CC, such as simulated annealing that explores the search space by one-element moves, and a greedy algorithm (Filkov and Skiena, 2003).
The problem of finding a consensus clustering can be approached from a graph-based, combinatorial or statistical perspective. Several methods exist for discovering a consensus clustering solution from many multiple partitions: co-association methods (Fred and Jain., 2002), voting approach (Topchy et al., 2004b), information-theoretic approach (Meila, 2003), hypergraph partitioning methods (Strehl and Ghosh, 2003) and use of mixture models (Topchy et al., 2004a).
| 2 RESULTS |
|---|
|
|
|---|
2.1 Comparing protein families
Given a set of m proteomes P1, ... , Pm, we are interested in computing the protein families. A proteome P = {pi} is defined as a set of proteins. The universe of proteins
is defined over the proteomes under study as
. In what follows we suppose that a protein p can appear only in one proteome and that it can appear only once2.
As described earlier any method for protein family computation first selects the support data, then computes the families. We suppose that the filtering algorithm3,
, defines a subset
, the set of proteins over which a family computation will be performed.
2.1.1 Distance and similarity
DEFINITION 2
A family computation F overdefines a partitioning of
into disjoint subsets
such that
. The subsets fi are called families.
Given two family computations
over
and
over
, the confusion matrix can be used to measure the similarity:
![]() |
Note that the fact that relative complements
and
may not be empty is not a problem since the only implication is that certain Mij values will be 0.
The confusion matrix can be equivalently represented by a bipartite graph in a straightforward manner (Figure 1). Let
, where the vertex set
is labeled by family names in F1 and F2, respectively, and the E is the set of edges weighted by the values in Mij where Mij > 0. That is, for two families
and
with an element in common, two vertices with labels
and
are created as well as an edge
such that w(e) = Mij.
|
Given a graph FRel, its strongly connected components
represent relationships between families in F1 and F2.
DEFINITION 3
The similarity score of a strongly connected component, where
,
and Ec
E is
The similarity score for FRel is defined as
(2)
(3)
This measure is symmetric and provides information on what is preserved between two family computations.
2.1.2 Making a choice
The basic case we have just considered consists of two family computations F1 and F2. Connected components {c1, ... , cm} define all the local distances between families in two original partitions. Let us suppose that the comparison FRel1,2 has exactly k non-singletons (indices [1.k]):
.
Having made the comparison
how do we choose the families that will form the solution? We propose to consider solutions locally, connected component by connected component, using exclusively families in F1 or in F2. That is for a non-singleton component
we are only allowed to pick either
or
. The rationale behind this choice strategy is that we are not allowed to invent classifications ab initio, but only to choose between the existing.
DEFINITION 4
A family choice Fc is defined by![]()
, where c is a vector of size k with elements taking values in [1,2].
A family choice Fc defines a partition of a certain subset of the protein set
, namely the partition containing proteins of the families in Fc. The number of family choices is equal to the number of different vectors c of size k, that is 2k.
LEMMA 1
For any two family choicestheir cumulative distances to the original partitions F1 and F2 are the same, that is
.
PROOF: For any family choice Fc the cumulative distance is
, where m is the number of connected components in the family relationship graph
.
DEFINITION 5
A reference set is the set of connected components C used as the choice basis for the final partition.
Lemma 1 indicates that all the consensus choice partitions are equivalent w.r.t. the reference set. This implies that the concrete choice is dependent on the needs of the application (see for a discussion Section 3).
2.2 Consensus for families
Let us consider n family computations
over their respective support sets of proteins
. We denote by
the set of all families.
2.2.1 Rules of the game
Having made all the n(n 1)/2 possible comparisons FReli,j, how do we exploit this result for families computation?
As CC is NP-complete, the key to the approximation algorithm is to reduce the search-space. We propose to do so by only considering the families belonging to
in the manner analogous to the discussion in Section 2.1.2.
Then any acceptable solution F is composed only of the families from
that are disjoint, that is
. In this manner an acceptable solution defines a partition of some subset of
. The set of all acceptable solutions
is a subset of the powerset of
,
. The inclusion is strict since only empty intersections are accepted. Thus we have
.
While the size of
is dramatically smaller than the whole search-space (whose size is equal to the n-th Bell number), this reduction of search-space is not sufficient and the corresponding exact algorithm remains NP-complete.
2.2.2 Definitions
DEFINITION 6
A proteinbelongs to at most n distinct families in
defined by function
:
Let us now consider all pairwise comparisons FReli,j. A comparison graph FReli,j can be decomposed into its mi,j connected components as defined in Section 2.1,
. The set of connected components over all of the comparison graphs is defined as
![]() |
DEFINITION 7
A family f can appear in at most n(n1)/2 distinct connected components indefined by function
.
We call the set Cp the set of p-components.
Then, for a protein p we can define by composition the set of p-components as the set of all connected components where p is a member. We will denote this function
=
.
![]() |
induces a set of proteins
such that all pi
P belong to some family f that itself belongs to a component c
Cp. We will call this set of proteins a support set for a given p-component.
We say that two sets of p-components Cp and Cq have an intersection, noted by
, if their corresponding support sets have an intersection, that is
.
DEFINITION 8
For a given FReli,j, we say that a setis a conflict region if and only if
ck, cl
R, ck
cl
![]()
. We say that a conflict region R is maximal if and only if
such that cm
R and
ck
R, cm
ck =
.
2.2.3 Consensus
Let
be the set of all maximal conflict regions. A conflict region R has a support set of proteins
,
. PR has n subsets (n being the number of family computations) defined by
and
, each corresponding to the proteins belonging to families of a particular family computation Fi.
Consensus clustering over the reduced search space defined in Section 2.2.1 can be reformulated, using the definitions of Section 2.2.2, in terms of minimum set cover. The idea is to cover exactly one of the sets among PR and
with the support sets of the connected components constituting a given conflict region. A cost function w on
can be given by the van Dongen distance [Equation (1)] or our similarity score [Equation (2)], for example.
Minimum exact cover (MDC): Let R be a maximal conflict region. Let PR and
be the support set of proteins of R and the family computation-specific sets of proteins, respectively. We denote by Pc =
1(c) the support set of proteins for a connected component that is included in the region R. Find a subset of connected components C
R such that (1) every protein from
Pc, c
C appears in one and only one of Pc4, (2) there exists P among PR and
entirely covered by the support sets in C, that is
, and (3) C minimizes
.
The consensus family computation can then be defined as a MDC solution for all
. This problem is a variant of minimum cover problem with an additional condition for the sets to be disjoint and is known to be NP-complete (Garey and Johnson, 1979).
2.2.4 Efficient heuristic
Two further relaxations make the problem tractable. First, we do not require to have all the n(n 1)/2comparisons. Second, we do not require to have an exact cover. It is the second criterion that allows our heuristic to run in polynomial time. The obvious consequence of this relaxation is that not all of the proteins in
will be found in the result. This is not an issue in the context of protein families since there is no reason to suppose that all proteins from
belong to families of cardinality higher than 1.
Let us consider n family computations and m
n(n1)/2 family comparisons FReli,j. Each element
has a distance (say, the van Dongen distance) value di and a similarity value si. Starting from the maximal conflict regions in
we build a conflict graph (Figure 1).
DEFINITION 9
For family comparisons FReli,j we define a conflict graphas follows.
- The vertex set V is labeled by elements from
, and we have |V| = |
|.
The edge set E is defined by all non-empty intersections between connected components, that is e =
ci,cj
![]()
E if and only if ci
cj
![]()
.
We distinguish two disjoint classes of edges E = Ep
Ef, those that are only based on proteins Ep and those that correspond to families Ef. The latter means that there is a non-empty intersection on the level of families, that is
1(ci)
kappa1(cj)
![]()
, while the former is the complement E\Ef.
By construction the set of connected components of G is exactly the set of conflict regions
.
The consensus computation works by resolution of conflicts for all connected components of a conflict graph G and is given in the algorithm 1. The conflict resolution is done by a voting procedure vote.
Algorithm 1 Family computation algorithm by weak consensus
Complexity The complexity for computing strongly connected components is known to be O(V + E) for a graph G =
V,E
(Tarjan, 1972). The Condorcet election complexity is O(p2) for p voters (that is the number of proteins in a given conflict region). Let us denote by m the largest set out of n family computations: m = maxi |Fi|. The largest possible conflict region will involve all proteins in
. The worst-case complexity of algorithm 1 is then
.
| 3 APPLICATION AND DISCUSSION |
|---|
|
|
|---|
The algorithm 1 has been used to calculate protein families for Saccharomyces cerevisiae and the four fully sequenced species of Hemiascomycetous yeasts in the context of the Génolevures project (Dujon et al., 2004; Sherman et al., 2006). The following is a summary of the key steps in this application. For a detailed presentation see the online supplement at http://cbi.labri.fr/Genolevures/fam/method.html
Data, alignments and filtering The data consist of five proteomes of S.cerevisiae, Candida glabrata, Kluyveromyces lactis, Debaryomyces hansenii and Yarrowia lipolytica. These proteomes taken together comprise the total set of proteins
. We produced complementary alignments using Blast and Smith-Waterman filtered according to scoring criteria5 and an approximation to homeomorphy [see (Wu et al., 2004)]. These criteria produce four datasets of pairwise similarity:
corresponding to Smith-Waterman homeomorphic alignments,
to Smith-Waterman alignments without the restriction of homeomorphy,
corresponding to Blast homeomorphic alignments and
to Blast non-homeomorphic alignments.
Clustering TribeMCL software (Enright et al., 2002) was used for clustering. Three different inflation coefficients (1.2, 2.4 and 4.0) have been applied to each dataset, thus resulting in 12 different clustering results.
Parameter tuning Four parameters were added to the algorithm 1 in order to make it efficient in practice. The first three parameters help to break too large conflict regions. The last one realizes the choice policy.
- Smax is the maximal allowed size of the support protein set for conflict regions.
- fmax is the maximal number of families allowed in a conflict region.
- p is the percentile above which families are removed from
; it is determined based on the distribution of family sizes in a given conflict region
.
- A choice strategy (c.f. section 2.1.2). Given a connected component
, either systematically choose
such that
or systematically choose
such that
.
Consensus families Various parameter ranges were tested resulting in 256 family computations. The final result contains 4389 protein families. Evaluation of the result was performed by comparison with manually curated literature standards, to reciprocal best hit partitions (RBH) (Rivera et al., 1998), and to Biofacet (Glemet and Codani, 1997) single-linkage clustering; see the online supplement for details. Results using our consensus method were qualitatively and quantitatively more satifying and have been adopted by the Consortium.
The first example is provided by families GLR.2644 and GLS.60, respectively, homologs to RPL24 ribosomal-like protein 24 and homologs to ribosomal protein L30 of the large (60S) ribosomal subunit. Instead of producing two groups of homologs, RBH clustering produces three groups corresponding to three species groups.
A second example shown in the supplement illustrates the too-large families that are typically produced by blind MCL or RBH clustering, in this case an impossible family with 406 members. Our consensus algorithm splits this group into 30 families which are in fact different sets of protein kinases, annotated as SNF1 protein serine/threonine kinase (GLC.1371, 5 members), BCK1 ser/thr protein kinase (GLC.1809, 6 members), PKC1 protein Kinase C (GLC.1902, 5 members), cytoplasmic serine/threonine protein kinase (GLC.2023, 7 members), protein kinase involved in morphogenesis and septin checkpoints (GLC.2551, 10 members), PSK1 and PSK2 proteins, PAS domain containing S/T protein kinases (GLC.298, 7 members), etc.
These families are currently used in a variety of specific studies by the Génolevures Consortium and will be expanded as new yeast genomes are sequenced.
| Acknowledgments |
|---|
The present work was done as part of the the Génolevures Consortium. The authors thank all the members of the Consortium for numerous, friendly and creative discussions, and in particular Philippe Baret and André Goffeau for the discussions on the transporter families, and Hélène Ferry-Dumazet for the work on standard protein families. Support for this work was provided by the Région Aquitaine program Génotypage et Génomique comparée, the French Ministry of Research and Education through ATIP Jeunes Chercheurs, the ACI IMPBIO Génolevures En Ligne and the French ANR "GENARISE" project. Génolevures is supported by CNRS (GDR 2354).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
1When the cluster size is small, the Rand index tends toward 1 as the number of clusters increases.
2Since we are only interested in families, we do not introduce any information about the genes coding for these proteins (chromosomes, relative positions, etc.); in this way, a simple set-based approach is sufficient. ![]()
3Even if the filtering criteria are not the subject of this paper, it must be noted that they have a great impact in practical applications since they affect the distance/similarity values. ![]()
4That is for any ci,cj
C, their intersection is empty. ![]()
5Based on statistical analyses not detailed here. ![]()
| REFERENCES |
|---|
|
|
|---|
Altschul, S., et al. (1990) Basic local alignment search tool. J. Mol. Biol, . 215, 403410[CrossRef][ISI][Medline].
Barthélemy, J.-P. and Leclerc, B. (1995) The median procedure for partitions. In Cox, I.J., Hansen, P., Julesz, B. (Eds.). Partitioning Data Sets: DIMACS Series in Discrete Mathematics and Theoretical Computer Science, , pp. 334 19, American Mathematical Society, Providence, RI.
Bateman, A., et al. (2002) The Pfam protein families database. Nucleic Acids Res, . 30, 276280
Bejerano, G. and Yona, G. (2001) Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17, 2343
Bell, E. (1934) Exponential numbers. Am. Math. Monthly, 41, 411419[CrossRef].
Ben-Hur, A. and Guyon, I. (2003) Detecting stable clusters using principal component analysis. Methods Mol. Biol, . 224, 159182[Medline].
Berman, H., et al. (2000) The Protein Data Bank. Nucleic Acids Res, . 28, 235242
The French Revolution research collection de Caritat marquis de Condorcet, J.-A.-N. (1995) Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. Reprod. of édition de l'Impr. royale, 1785.
Doolittle, R. (1981) Similar amino acid sequences: chance or common ancestry? Science, 214, 14959
Doolittle, R. (1995) The multiplicity of domains in proteins. Annu. Rev. Biochem, . 64, 287314[CrossRef][ISI][Medline].
Dujon, B., et al. (2004) Genome evolution in yeasts. Nature, 430, 3544[CrossRef][Medline].
Enright, A., et al. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res, . 30, 15751584
Enright, A. and Ouzounis, C. (2000) Generage: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451457
Filkov, V. and Skiena, S. (2003) Integrating microarray data by consensus clustering. Proceedings 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'03)November 3-5Sacramento, CA, pp. 418425.
Fred, A. and Jain, A. (2002) Data clustering using evidence accumulation. Proceedings of the 16th International. Conference on Pattern Recognition (ICPR 2002)August 11-15Québec City, Canada, pp. 276280.
Garey, M. and Johnson, D. Computers and Intractability; A Guide to the Theory of NP-Completeness, (1979) , New York, NY W.H. Freeman & Co.
Geer, L., et al. (2002) CDART: protein homology by domain architecture. Genome Res, . 12, 16191623
Glemet, E. and Codani, J.J. (1997) LASSAP, a large scale sequence comparison package (note: LASSAP is now biofacet). Comput. Appl. Biosc, . 13, 137143.
Grötschel, M. and Wakabayashi, Y. (1989) A cutting plane algorithm for a clustering problem. Math. Program. B, 5996.
Grundy, W., et al. (1997) Meta-MEME: motif-based hidden markov models of protein families. Comp. Appl. Biosci, . 3, 397406.
Holm, H. and Sander, S. (1996) Mapping the protein universe. Science, 273, 595602
Hubert, L. and Arabie, P. (1985) Comparing partitions. J. of Classifi, . 2, 193218[CrossRef].
Jaccard, P. (1908) Nouvelles recherches sur la distribution florale. Bull. Soc. Vaud. Sci. Nat. Proc. Soc. Vaud. Sci. Nat, . 44, 223270.
Knuth, D. (1992) Two notes on notation. Am. Math. Monthly, 99, 403422[CrossRef].
Koehl, P. and Levitt, M. (2002) Sequence variations within protein families are linearly related to structural variations. J. Mol. Biol, . 323, 55162[CrossRef][ISI][Medline].
Kohavi, R. and Provost, F. (1998) Glossary. Mach. Learn. J, . 30, 271274[CrossRef].
Larsen, B. and Aone, H. (1999) Fast and effective text mining using linear-time document clustering. Proceedings of the 5th ACM SIGKDD International ConferenceAugust 1519San Diego, CA, pp. 1622.
Marcotte, E., et al. (1999) Detecting protein function and protein. Science, 285, 751753
Matsuda, H., et al. (1999) Classifying molecular sequences using a linkage graph with their pairwise similarities. Theor. Comput. Sci, . 210, 305325[CrossRef].
Meila, M. (2003) Comparing clusterings by the variation of information. Proceedings of COLT'2003, pp. 173187.
Mirkin, B. Mathematical Classification and Clustering, (1996) , Dordrecht, The Netherlands Kluwer Academic Publishers.
Mulder, N., et al. (2003) The InterPro database, 2003 brings increased coverage and new features. Nucleic Acids Res, . 31, 315318
Rand, W. (1971) Objective criteria for evaluation of clustering methods. J. Am. Stat. Assoc, . 66, 846850[CrossRef][ISI].
Régnier, S. (1965) Sur quelques aspects mathématiques des problèmes de classification automatique. ICC Bull, . 4, 175191.
Rivera, M., et al. (1998) Genomic evidence for two functionally distinct gene classes. Genetics, 95, 62396244.
Servant, F., et al. (2002) Prodom: automated clustering of homologous domains. Brief. Bioinform, . 3, 246251
Sherman, D., et al. (2006) Génolevures complete genomes provide data and tools for comparative genomics of hemiascomycetous yeasts. Nucleic Acids Res, . 34, D432D435
Smith, T. and Waterman, M. (1981) Identification of common molecular subsequences. J. Mol. Biol, . 147, 195197[CrossRef][ISI][Medline].
Smith, T.F. and Zhang, X. (1997) The challenges of genome sequence annotation or the devil is in the details. Nat. Biotechnol, . 15, 12221223[CrossRef][Medline].
Stirling, J. (1749) Methodus differentialis, sive tractatus de summation et interpolation serierum infinitarium. English translation by J. Holliday, The differential method: A treatise of the summation and interpolation of infinite series. London, 1730.
J. Mach. Learning Res. Strehl, A. and Ghosh, J. (2002) Cluster ensemblesa knowledge reuse framework for combining multiple partitions. 3, 583617.
Tarjan, R.E. (1972) Depth-first search and linear graph algorithms. SIAM J. Comput, . 1, 146160.
Topchy, A., et al. (2004a) A mixture model for clustering ensembles. Proc. SIAM Conference on Data Mining, pp. 379390.
Topchy, A., et al. (2004b) Analysis of consensus partition in cluster ensemble. Proc. IEEE International Conference on Data Mining (ICDM'04)November 1-4Brighton, UK, pp. 225232.
van Dongen, S. Graph Clustering by Flow Simulation (2000a) PhD thesis, University of Utrecht.
van Dongen, S. (2000b) Performance criteria for graph clustering and markov cluster experiments. Technical Report INS-R0012, National Research Institute for Mathematics and Computer Science, The Netherlands.
Wakabayashi, Y. (1998) The complexity of computing medians of relations. Resenhas IME-USP, 3, 323349.
Wu, C., et al. (2004) PIRSF: family classification system at the protein information resource. Nucleic Acids Res, . 32, 112114.
Zhang, K. and Zhao, H. (2000) Assessing reliability of gene clusters from gene expression data. Funct. Integr. Genomics, 1, 156173[Medline].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
.


defines a partitioning of
such that
. The subsets fi are called families.

,
and Ec
E is

, where c is a vector of size k with elements taking values in [1,2].
their cumulative distances to the original partitions F1 and F2 are the same, that is
.
belongs to at most n distinct families in 



is a conflict region if and only if
ck, cl
cl
. We say that a conflict region R is maximal if and only if
such that cm
R and
as follows.
, that is e =