Skip Navigation

Bioinformatics 2007 23(2):e71-e76; doi:10.1093/bioinformatics/btl314
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Nikolski, M.
Right arrow Articles by Sherman, D. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Nikolski, M.
Right arrow Articles by Sherman, D. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Computational Genomics

Family relationships: should consensus reign?—consensus clustering for protein families

Macha Nikolski * and David J. Sherman

CNRS/LaBRI, Université Bordeaux 1 351 cours de la Libération, 33405 Talence Cedex, France

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 BACKGROUND
 2 RESULTS
 3 APPLICATION AND DISCUSSION
 REFERENCES
 

Motivation: Reliable identification of protein families is key to phylogenetic analysis, functional annotation and the exploration of protein function diversity in a given phylogenetic branch. As more and more complete genomes are sequenced, there is a need for powerful and reliable algorithms facilitating protein families construction.

Results: We have formulated the problem of protein families construction as an instance of consensus clustering, for which we designed a novel algorithm that is computationally efficient in practice and produces high quality results. Our algorithm uses an election method to construct consensus families from competing clustering computations.

Our consensus clustering algorithm is tailored to serve the specific needs of comparative genomics projects. First, it provides a robust means to incorporate results from different and complementary clustering methods, thus avoiding the need for an a priori choice that may introduce computational bias in the results. Second, it is suited to large-scale projects due to the practical efficiency. And third, it produces high quality results where families tend to represent groupings by biological function.

Availability: This method has been used for Génolevures project to compute protein families of Hemiascomycetous yeasts. The data are available online at http://cbi.labri.fr/Genolevures/fam/

Supplementary information: Supplementary data are available at http://cbi.labri.fr/Genolevures/fam/

Contact: macha{at}labri.fr


    1 BACKGROUND
 TOP
 ABSTRACT
 1 BACKGROUND
 2 RESULTS
 3 APPLICATION AND DISCUSSION
 REFERENCES
 
1.1 Overview
When confronted with computation of protein families, one must decide which algorithm to use, evaluate the quality of the result and decide which computed families—if any—correspond to real protein families. We argue that it is the comparison of different results that produces the set of the most plausible families.

Algorithmic means for establishing protein families on a large scale are based on a notion of similarity. Often the only available similarity is sequence similarity; some authors go so far as to define the notion of a protein family itself via the similarity of sequences (Doolittle, 1981). Use of structural similarities (Holm and Sander, 1996) is currently limited by the relatively small number of structures available in PDB (Berman et al., 2000), but by studying sequence variation among members of such families (Koehl and Levitt, 2002) one can guide the use of sequence similarity information.

To perform similarity-based detection of families one first selects the pairwise similarities to be used and then applies an algorithm that uses these similarities to group proteins into families. Algorithms for detection of protein families are unsupervised [see for example van Dongen (2000a)] or supervised [see for example Grundy et al., (1997) and Bejerano and Yona (2001)] learning procedures. Pairwise similarities can be provided by sequence alignments [BLAST (Altschul et al., 1990), Smith-Waterman (Smith and Waterman, 1981)] or by more sophisticated approaches using domain architecture databases (Bateman et al., 2002; Servant et al., 2002; Mulder et al., 2003). Other approaches use domain order as a fingerprint for the protein (Geer et al., 2002).

Methods that quantify similarity by using some attribute of the best BLAST hit and use single-linkage clustering are not always successful. Such straightforward approaches may group together dissimilar multidomain proteins that share a common domain (Smith and Zhang 1997), and can be fooled by promiscuous domains (Doolittle, 1995; Marcotte et al., 1999). Several graph-based methods have been proposed to overcome some of these limitations of single-linkage clustering (Matsuda et al., 1999; Enright and Ouzounis, 2000; Enright et al., 2002).

Reliability of computed clusters has been assessed by Zhang and Zhao, who studied the reliability of hierarchical clusters from gene expression data (Zhang and Zhao, 2000) and proposed a resampling algorithm for building a consensus tree, and van Dongen (van Dongen, 2000b), who introduced criteria for evaluating clustering results using a novel inter-cluster distance. It must be noted that the quality of a protein family computation cannot be truly assessed in the absence of an objective external criterion based on biological knowledge.

Here we describe a method that integrates results of multiple classifications in a single scheme, using an election algorithm. First, we formulate the problem as an instance of consensus clustering. We propose a heuristic that efficiently realizes the computation in practice. Finally, we compare our method with others and illustrate the quality of results in the case of protein families computed for Génolevures Hemiascomycetous yeasts.

1.2 Agreement between clusterings

DEFINITION 1.
A clustering {Pi} is a partition of a dataset D = {di} into disjoint subsets {Pi}1, ... , {Pi}k called clusters, such that Formula.

In what follows we will use the terms clustering and partition, and cluster and subset, interchangeably.

Let the number of elements in D and in cluster {Pi}i be n and ni respectively. Let {Pi}' be a second partition of D, Formula, with cluster sizes Formula.

Most criteria for comparing clusterings are based on the confusion matrix (Kohavi and Provost, 1998), which measures the size of the intersection between the clusterings.

Formula

Traditional measures of the proportion of agreements between two clusterings are the Rand index (Rand, 1971) and the Jaccard index (Jaccard, 1908). As the expected value of the Rand index does not take a constant value1, the adjusted Rand index (Hubert and Arabie, 1985) is preferred; it assumes the generalized hypergeometric distribution as the model of randomness and provides an index bounded above by 1 with a constant expected value 0.

A well-known measure on the space of partitions is the equivalence mismatch coefficient emc [(Mirkin, 1996) p. 241], which is precisely the Hamming distance for binary vectors if the set of all pairs <di,dj> is enumerated, and a partition is represented by a characteristic vector defined on this enumeration. One may interpret emc in terms of edge numbers of complete graphs, where it represents the number of node moves needed to convert one partition into another.

Another category of criteria for comparing clusterings is based on set cardinality. The projection number {pi} of {Pi} onto {Pi}', defined as Formula, is an asymetric index of the degree of refinement of one partition versus another. The same index normalized by n is introduced in Ben-Hur and Guyon (2003) and a symmetric version is defined in Meila (2003). Laresen and Aone (Laresen and Aone, 1999) use the following asymmetric criterion:

Formula
Based on the projection number van Dongen introduces a distance (van Dongen, 2000a) between two partitions {Pi} and {Pi}':

Formula 1(1)
This gives the shortest distance between {Pi} and {Pi}' in a certain undirected weighted graph constructed on the set of all partitions of D, where two partitions are connected by an edge if one can be obtained from the other by joining two of its sets. The weight of the edge is the size of the smallest of two sets. In this construction d({Pi}, {Pi}') is the length of the shortest (weighted) path between {Pi} and {Pi}'.

1.3 Consensus clustering
Combining the strengths of different clustering algorithms is the focus of research on clustering ensembles. This problem can be seen as finding the median partition with respect to given partitions.

Consensus clustering (CC): Given k partitions, {Pi}1, ... , {Pi}k, find a consensus partition {Pi} that minimizes Formula 1, where d is a distance.

The first mathematical treatment goes back to Régnier (Régnier, 1965), where the problem was transformed into an integer problem and a branch and bound solution was proposed. The problem is proven to be NP-complete (Barthélemy and Leclerc, 1995). Wakabayashi gives an in-depth analysis of the median partition problem in Wakabayashi (1998) and concludes that approximating relations with a transitive one is NP-complete in general.

CC is NP-complete in general, yet it is not known whether it is NP-complete for any particular k. The case of k = 1 is trivial. The case of k = 2 is also simple since any of the partitions solves the problem optimally. Nothing is known for k ≥ 3.

If the partition {Pi} of the dataset D, |D| = n to discover is not necessarily one of the original partitions {Pi}1, ... , {Pi}k, then the size of the potential search space corresponds to the Bell numbers (Bell, 1934) or equivalently to the sum of Stirling numbers of second kind for all m, 1 ≤ m ≤ n [see Knuth (1992) on Stirling (1749)].

In the light of its hardness, exactly solving large instances of CC is intractable. Even so, exact methods have been implemented. Grötschel and Wakabayashi (Grötschel and Wakabayashi, 1989) give a cutting planes solution, which works well for n in the hundreds.

A variety of approximations have been applied to this problem. For example Filkov (Filkov and Skiena, 2003) proposes a simple approximation that works by dramatically reducing the search-space.

Factor-2 approximation: Given an instance of CC, select a partition {Pi} among partitions {Pi}1, ... , {Pi}k that minimizes Formula 1.

The time complexity is O(k2n) since it takes O(n) to compute the distance between any two partitions, and there are O(k2) pairs. This algorithm is a factor-2 approximation to CC.

Many heuristics based on local search have been studied in application to CC, such as simulated annealing that explores the search space by one-element moves, and a greedy algorithm (Filkov and Skiena, 2003).

The problem of finding a consensus clustering can be approached from a graph-based, combinatorial or statistical perspective. Several methods exist for discovering a consensus clustering solution from many multiple partitions: co-association methods (Fred and Jain., 2002), voting approach (Topchy et al., 2004b), information-theoretic approach (Meila, 2003), hypergraph partitioning methods (Strehl and Ghosh, 2003) and use of mixture models (Topchy et al., 2004a).


    2 RESULTS
 TOP
 ABSTRACT
 1 BACKGROUND
 2 RESULTS
 3 APPLICATION AND DISCUSSION
 REFERENCES
 
2.1 Comparing protein families
Given a set of m proteomes P1, ... , Pm, we are interested in computing the protein families. A proteome P = {pi} is defined as a set of proteins. The universe of proteins Formula 1 is defined over the proteomes under study as Formula 1. In what follows we suppose that a protein p can appear only in one proteome and that it can appear only once2.

As described earlier any method for protein family computation first selects the support data, then computes the families. We suppose that the filtering algorithm3, {alpha}, defines a subset Formula 1, the set of proteins over which a family computation will be performed.

2.1.1 Distance and similarity

DEFINITION 2
A family computation F over Formula 1 defines a partitioning of Formula 1 into disjoint subsets Formula 1 such that Formula 1. The subsets fi are called families.

Given two family computations Formula 1 over Formula 1 and Formula 1 over Formula 1, the confusion matrix can be used to measure the similarity:

Formula 1

Note that the fact that relative complements Formula 1 and Formula 1 may not be empty is not a problem since the only implication is that certain Mij values will be 0.

The confusion matrix can be equivalently represented by a bipartite graph in a straightforward manner (Figure 1). Let Formula 1, where the vertex set Formula 1 is labeled by family names in F1 and F2, respectively, and the E is the set of edges weighted by the values in Mij where Mij > 0. That is, for two families Formula 1 and Formula 1 with an element in common, two vertices with labels Formula 1 and Formula 1 are created as well as an edge Formula 1 such that w(e) = Mij.


Figure 1
View larger version (42K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 (a) Family computations F1, F2 and F3 are first compared pairwise and (b) comparison graphs FRel1,2, FRel2,3 and FRel2,3 are built. In the second step, (c) maximal conflict regions R1 and R2 involving all the connected components ci of graphs FRelij are constructed.

 
Given a graph FRel, its strongly connected components Formula 1 represent relationships between families in F1 and F2.

DEFINITION 3
The similarity score of a strongly connected component Formula 1, where Formula 1, Formula 1 and Ec sub E is

Formula 2(2)
The similarity score for FRel is defined as

Formula 3(3)

This measure is symmetric and provides information on what is preserved between two family computations.

2.1.2 Making a choice
The basic case we have just considered consists of two family computations F1 and F2. Connected components {c1, ... , cm} define all the local distances between families in two original partitions. Let us suppose that the comparison FRel1,2 has exactly k non-singletons (indices [1.k]):Formula 3.

Having made the comparison Formula 3 how do we choose the families that will form the solution? We propose to consider solutions locally, connected component by connected component, using exclusively families in F1 or in F2. That is for a non-singleton component Formula 3 we are only allowed to pick either Formula 3 or Formula 3. The rationale behind this choice strategy is that we are not allowed to invent classifications ab initio, but only to choose between the existing.

DEFINITION 4
A family choice Fc is defined by Formula 3 Formula 3, where c is a vector of size k with elements taking values in [1,2].

A family choice Fc defines a partition of a certain subset of the protein set Formula 3, namely the partition containing proteins of the families in Fc. The number of family choices is equal to the number of different vectors c of size k, that is 2k.

LEMMA 1
For any two family choices Formula 3 their cumulative distances to the original partitions F1 and F2 are the same, that is Formula 3.

PROOF: For any family choice Fc the cumulative distance is Formula 3, where m is the number of connected components in the family relationship graph Formula 3.

DEFINITION 5
A reference set is the set of connected components C used as the choice basis for the final partition.

Lemma 1 indicates that all the consensus choice partitions are equivalent w.r.t. the reference set. This implies that the concrete choice is dependent on the needs of the application (see for a discussion Section 3).

2.2 Consensus for families
Let us consider n family computations Formula 3 over their respective support sets of proteins Formula 3. We denote by Formula 3 the set of all families.

2.2.1 Rules of the game
Having made all the n(n – 1)/2 possible comparisons FReli,j, how do we exploit this result for families computation?

As CC is NP-complete, the key to the approximation algorithm is to reduce the search-space. We propose to do so by only considering the families belonging to Formula 3 in the manner analogous to the discussion in Section 2.1.2.

Then any acceptable solution F is composed only of the families from Formula 3 that are disjoint, that is Formula 3. In this manner an acceptable solution defines a partition of some subset of Formula 3. The set of all acceptable solutions Formula 3 is a subset of the powerset of Formula 3, Formula 3. The inclusion is strict since only empty intersections are accepted. Thus we have Formula 3.

While the size of Formula 3 is dramatically smaller than the whole search-space (whose size is equal to the n-th Bell number), this reduction of search-space is not sufficient and the corresponding exact algorithm remains NP-complete.

2.2.2 Definitions

DEFINITION 6
A protein Formula 3 belongs to at most n distinct families in Formula 3 defined by function {phi}:

Formula 3

Let us now consider all pairwise comparisons FReli,j. A comparison graph FReli,j can be decomposed into its mi,j connected components as defined in Section 2.1, Formula 3. The set of connected components over all of the comparison graphs is defined as

Formula 3

DEFINITION 7
A family f can appear in at most n(n–1)/2 distinct connected components in Formula 3 defined by function {kappa}.

Formula 3
We call the set Cp the set of p-components.

Then, for a protein p we can define by composition the set of p-components as the set of all connected components where p is a member. We will denote this function {sigma} = {phi} {circ} {kappa}.

Formula 3
Note that for Cp, the inverse image given by {sigma} induces a set of proteins Formula 3 such that all pi isin P belong to some family f that itself belongs to a component c isin Cp. We will call this set of proteins a support set for a given p-component.

We say that two sets of p-components Cp and Cq have an intersection, noted by Formula 3, if their corresponding support sets have an intersection, that is Formula 3.

DEFINITION 8
For a given FReli,j, we say that a set Formula 3 is a conflict region if and only if {forall} ck, cl isin R, ck {cap} cl != {emptyset}. We say that a conflict region R is maximal if and only if Formula 3 such that cm {notin} R and {forall} ck isin R, cm {cap} ck = {emptyset}.

2.2.3 Consensus
Let Formula 3 be the set of all maximal conflict regions. A conflict region R has a support set of proteins Formula 3, Formula 3. PR has n subsets (n being the number of family computations) defined by Formula 3 and Formula 3, each corresponding to the proteins belonging to families of a particular family computation Fi.

Consensus clustering over the reduced search space defined in Section 2.2.1 can be reformulated, using the definitions of Section 2.2.2, in terms of minimum set cover. The idea is to cover exactly one of the sets among PR and Formula 3 with the support sets of the connected components constituting a given conflict region. A cost function w on Formula 3 can be given by the van Dongen distance [Equation (1)] or our similarity score [Equation (2)], for example.

Minimum exact cover (MDC): Let R be a maximal conflict region. Let PR and Formula 3 be the support set of proteins of R and the family computation-specific sets of proteins, respectively. We denote by Pc = {sigma}–1(c) the support set of proteins for a connected component that is included in the region R. Find a subset of connected components C {subseteq} R such that (1) every protein from {cup}Pc, c isin C appears in one and only one of Pc4, (2) there exists P among PR and Formula 3 entirely covered by the support sets in C, that is Formula 3, and (3) C minimizes Formula 3.

The consensus family computation can then be defined as a MDC solution for all Formula 3. This problem is a variant of minimum cover problem with an additional condition for the sets to be disjoint and is known to be NP-complete (Garey and Johnson, 1979).

2.2.4 Efficient heuristic
Two further relaxations make the problem tractable. First, we do not require to have all the n(n – 1)/2comparisons. Second, we do not require to have an exact cover. It is the second criterion that allows our heuristic to run in polynomial time. The obvious consequence of this relaxation is that not all of the proteins in Formula 3 will be found in the result. This is not an issue in the context of protein families since there is no reason to suppose that all proteins from Formula 3 belong to families of cardinality higher than 1.

Let us consider n family computations and m ≤ n(n–1)/2 family comparisons FReli,j. Each element Formula 3 has a distance (say, the van Dongen distance) value di and a similarity value si. Starting from the maximal conflict regions in Formula 3 we build a conflict graph (Figure 1).

DEFINITION 9
For family comparisons FReli,j we define a conflict graph Formula 3 as follows.
  • The vertex set V is labeled by elements from Formula 3, and we have |V| = |Formula 3|.
    The edge set E is defined by all non-empty intersections between connected components Formula 3, that is e = <ci,cj> isin E if and only if ci {cap} cj != {emptyset}.

We distinguish two disjoint classes of edges E = Ep {cup} Ef, those that are only based on proteins Ep and those that correspond to families Ef. The latter means that there is a non-empty intersection on the level of families, that is {kappa}–1(ci) {cap} kappa–1(cj) != {emptyset}, while the former is the complement E\Ef.

By construction the set of connected components of G is exactly the set of conflict regions Formula 3.

The consensus computation works by resolution of conflicts for all connected components of a conflict graph G and is given in the algorithm 1. The conflict resolution is done by a voting procedure vote.


Algorithm 1 Family computation algorithm by weak consensus

Table 1

Complexity The complexity for computing strongly connected components is known to be O(V + E) for a graph G = <V,E> (Tarjan, 1972). The Condorcet election complexity is O(p2) for p voters (that is the number of proteins in a given conflict region). Let us denote by m the largest set out of n family computations: m = maxi |Fi|. The largest possible conflict region will involve all proteins in Formula 3. The worst-case complexity of algorithm 1 is then Formula 3.


    3 APPLICATION AND DISCUSSION
 TOP
 ABSTRACT
 1 BACKGROUND
 2 RESULTS
 3 APPLICATION AND DISCUSSION
 REFERENCES
 
The algorithm 1 has been used to calculate protein families for Saccharomyces cerevisiae and the four fully sequenced species of Hemiascomycetous yeasts in the context of the Génolevures project (Dujon et al., 2004; Sherman et al., 2006). The following is a summary of the key steps in this application. For a detailed presentation see the online supplement at http://cbi.labri.fr/Genolevures/fam/method.html

Data, alignments and filtering The data consist of five proteomes of S.cerevisiae, Candida glabrata, Kluyveromyces lactis, Debaryomyces hansenii and Yarrowia lipolytica. These proteomes taken together comprise the total set of proteins Formula 3. We produced complementary alignments using Blast and Smith-Waterman filtered according to scoring criteria5 and an approximation to homeomorphy [see (Wu et al., 2004)]. These criteria produce four datasets of pairwise similarity: Formula 3 corresponding to Smith-Waterman homeomorphic alignments, Formula 3 to Smith-Waterman alignments without the restriction of homeomorphy, Formula 3 corresponding to Blast homeomorphic alignments and Formula 3 to Blast non-homeomorphic alignments.

Clustering TribeMCL software (Enright et al., 2002) was used for clustering. Three different inflation coefficients (1.2, 2.4 and 4.0) have been applied to each dataset, thus resulting in 12 different clustering results.

Parameter tuning Four parameters were added to the algorithm 1 in order to make it efficient in practice. The first three parameters help to break too large conflict regions. The last one realizes the choice policy.

  1. Smax is the maximal allowed size of the support protein set for conflict regions.
  2. fmax is the maximal number of families allowed in a conflict region.
  3. p is the percentile above which families are removed from R; it is determined based on the distribution of family sizes in a given conflict region R.
  4. A choice strategy (c.f. section 2.1.2). Given a connected component Formula 3, either systematically choose Formula 3 such that Formula 3 or systematically choose Formula 3 such that Formula 3.
The voting procedure vote has been implemented as Condorcet election (de Caritat marquis de Condorcet, 1995).

Consensus families Various parameter ranges were tested resulting in 256 family computations. The final result contains 4389 protein families. Evaluation of the result was performed by comparison with manually curated literature standards, to reciprocal best hit partitions (RBH) (Rivera et al., 1998), and to Biofacet (Glemet and Codani, 1997) single-linkage clustering; see the online supplement for details. Results using our consensus method were qualitatively and quantitatively more satifying and have been adopted by the Consortium.

The first example is provided by families GLR.2644 and GLS.60, respectively, homologs to RPL24 ribosomal-like protein 24 and homologs to ribosomal protein L30 of the large (60S) ribosomal subunit. Instead of producing two groups of homologs, RBH clustering produces three groups corresponding to three species groups.

A second example shown in the supplement illustrates the too-large families that are typically produced by blind MCL or RBH clustering, in this case an impossible family with 406 members. Our consensus algorithm splits this group into 30 families which are in fact different sets of protein kinases, annotated as ‘SNF1 protein serine/threonine kinase’ (GLC.1371, 5 members), ‘BCK1 ser/thr protein kinase’ (GLC.1809, 6 members), ‘PKC1 protein Kinase C’ (GLC.1902, 5 members), ‘cytoplasmic serine/threonine protein kinase’ (GLC.2023, 7 members), ‘protein kinase involved in morphogenesis and septin checkpoints’ (GLC.2551, 10 members), ‘PSK1 and PSK2 proteins, PAS domain containing S/T protein kinases’ (GLC.298, 7 members), etc.

These families are currently used in a variety of specific studies by the Génolevures Consortium and will be expanded as new yeast genomes are sequenced.


    Acknowledgments
 
The present work was done as part of the the Génolevures Consortium. The authors thank all the members of the Consortium for numerous, friendly and creative discussions, and in particular Philippe Baret and André Goffeau for the discussions on the transporter families, and Hélène Ferry-Dumazet for the work on standard protein families. Support for this work was provided by the Région Aquitaine program Génotypage et Génomique comparée, the French Ministry of Research and Education through ATIP Jeunes Chercheurs, the ACI IMPBIO Génolevures En Ligne and the French ANR "GENARISE" project. Génolevures is supported by CNRS (GDR 2354).

Conflict of Interest: none declared.


    FOOTNOTES
 
1When the cluster size is small, the Rand index tends toward 1 as the number of clusters increases. Back

2Since we are only interested in families, we do not introduce any information about the genes coding for these proteins (chromosomes, relative positions, etc.); in this way, a simple set-based approach is sufficient. Back

3Even if the filtering criteria are not the subject of this paper, it must be noted that they have a great impact in practical applications since they affect the distance/similarity values. Back

4That is for any ci,cj isin C, their intersection is empty. Back

5Based on statistical analyses not detailed here. Back


    REFERENCES
 TOP
 ABSTRACT
 1 BACKGROUND
 2 RESULTS
 3 APPLICATION AND DISCUSSION
 REFERENCES
 

    Altschul, S., et al. (1990) Basic local alignment search tool. J. Mol. Biol, . 215, 403–410[CrossRef][Web of Science][Medline].

    Barthélemy, J.-P. and Leclerc, B. (1995) The median procedure for partitions. In Cox, I.J., Hansen, P., Julesz, B. (Eds.). Partitioning Data Sets: DIMACS Series in Discrete Mathematics and Theoretical Computer Science, , pp. 3–34 19, American Mathematical Society, Providence, RI.

    Bateman, A., et al. (2002) The Pfam protein families database. Nucleic Acids Res, . 30, 276–280[Abstract/Free Full Text].

    Bejerano, G. and Yona, G. (2001) Variations on probabilistic suffix trees: statistical modeling and prediction of protein families. Bioinformatics, 17, 23–43[Abstract/Free Full Text].

    Bell, E. (1934) Exponential numbers. Am. Math. Monthly, 41, 411–419[CrossRef].

    Ben-Hur, A. and Guyon, I. (2003) Detecting stable clusters using principal component analysis. Methods Mol. Biol, . 224, 159–182[Medline].

    Berman, H., et al. (2000) The Protein Data Bank. Nucleic Acids Res, . 28, 235–242[Abstract/Free Full Text].

    The French Revolution research collection de Caritat marquis de Condorcet, J.-A.-N. (1995) Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix. Reprod. of édition de l'Impr. royale, 1785.

    Doolittle, R. (1981) Similar amino acid sequences: chance or common ancestry? Science, 214, 149–59[Abstract/Free Full Text].

    Doolittle, R. (1995) The multiplicity of domains in proteins. Annu. Rev. Biochem, . 64, 287–314[CrossRef][Web of Science][Medline].

    Dujon, B., et al. (2004) Genome evolution in yeasts. Nature, 430, 35–44[CrossRef][Medline].

    Enright, A., et al. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res, . 30, 1575–1584[Abstract/Free Full Text].

    Enright, A. and Ouzounis, C. (2000) Generage: a robust algorithm for sequence clustering and domain detection. Bioinformatics, 16, 451–457[Abstract/Free Full Text].

    Filkov, V. and Skiena, S. (2003) Integrating microarray data by consensus clustering. Proceedings 15th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'03)November 3-5Sacramento, CA, pp. 418–425.

    Fred, A. and Jain, A. (2002) Data clustering using evidence accumulation. Proceedings of the 16th International. Conference on Pattern Recognition (ICPR 2002)August 11-15Québec City, Canada, pp. 276–280.

    Garey, M. and Johnson, D. Computers and Intractability; A Guide to the Theory of NP-Completeness, (1979) , New York, NY W.H. Freeman & Co.

    Geer, L., et al. (2002) CDART: protein homology by domain architecture. Genome Res, . 12, 1619–1623[Abstract/Free Full Text].

    Glemet, E. and Codani, J.J. (1997) LASSAP, a large scale sequence comparison package (note: LASSAP is now biofacet). Comput. Appl. Biosc, . 13, 137–143.

    Grötschel, M. and Wakabayashi, Y. (1989) A cutting plane algorithm for a clustering problem. Math. Program. B, 59–96.

    Grundy, W., et al. (1997) Meta-MEME: motif-based hidden markov models of protein families. Comp. Appl. Biosci, . 3, 397–406.

    Holm, H. and Sander, S. (1996) Mapping the protein universe. Science, 273, 595–602[Abstract/Free Full Text].

    Hubert, L. and Arabie, P. (1985) Comparing partitions. J. of Classifi, . 2, 193–218[CrossRef].

    Jaccard, P. (1908) Nouvelles recherches sur la distribution florale. Bull. Soc. Vaud. Sci. Nat. Proc. Soc. Vaud. Sci. Nat, . 44, 223–270.

    Knuth, D. (1992) Two notes on notation. Am. Math. Monthly, 99, 403–422[CrossRef].

    Koehl, P. and Levitt, M. (2002) Sequence variations within protein families are linearly related to structural variations. J. Mol. Biol, . 323, 551–62[CrossRef][Web of Science][Medline].

    Kohavi, R. and Provost, F. (1998) Glossary. Mach. Learn. J, . 30, 271–274[CrossRef].

    Larsen, B. and Aone, H. (1999) Fast and effective text mining using linear-time document clustering. Proceedings of the 5th ACM SIGKDD International ConferenceAugust 15–19San Diego, CA, pp. 16–22.

    Marcotte, E., et al. (1999) Detecting protein function and protein. Science, 285, 751–753[Abstract/Free Full Text].

    Matsuda, H., et al. (1999) Classifying molecular sequences using a linkage graph with their pairwise similarities. Theor. Comput. Sci, . 210, 305–325[CrossRef].

    Meila, M. (2003) Comparing clusterings by the variation of information. Proceedings of COLT'2003, pp. 173–187.

    Mirkin, B. Mathematical Classification and Clustering, (1996) , Dordrecht, The Netherlands Kluwer Academic Publishers.

    Mulder, N., et al. (2003) The InterPro database, 2003 brings increased coverage and new features. Nucleic Acids Res, . 31, 315–318[Abstract/Free Full Text].

    Rand, W. (1971) Objective criteria for evaluation of clustering methods. J. Am. Stat. Assoc, . 66, 846–850[CrossRef][Web of Science].

    Régnier, S. (1965) Sur quelques aspects mathématiques des problèmes de classification automatique. ICC Bull, . 4, 175–191.

    Rivera, M., et al. (1998) Genomic evidence for two functionally distinct gene classes. Genetics, 95, 6239–6244.

    Servant, F., et al. (2002) Prodom: automated clustering of homologous domains. Brief. Bioinform, . 3, 246–251[Abstract/Free Full Text].

    Sherman, D., et al. (2006) Génolevures complete genomes provide data and tools for comparative genomics of hemiascomycetous yeasts. Nucleic Acids Res, . 34, D432–D435[Abstract/Free Full Text].

    Smith, T. and Waterman, M. (1981) Identification of common molecular subsequences. J. Mol. Biol, . 147, 195–197[CrossRef][Web of Science][Medline].

    Smith, T.F. and Zhang, X. (1997) The challenges of genome sequence annotation or ‘the devil is in the details’. Nat. Biotechnol, . 15, 1222–1223[CrossRef][Medline].

    Stirling, J. (1749) Methodus differentialis, sive tractatus de summation et interpolation serierum infinitarium. English translation by J. Holliday, The differential method: A treatise of the summation and interpolation of infinite series. London, 1730.

    J. Mach. Learning Res. Strehl, A. and Ghosh, J. (2002) Cluster ensembles—a knowledge reuse framework for combining multiple partitions. 3, 583–617.

    Tarjan, R.E. (1972) Depth-first search and linear graph algorithms. SIAM J. Comput, . 1, 146–160.

    Topchy, A., et al. (2004a) A mixture model for clustering ensembles. Proc. SIAM Conference on Data Mining, pp. 379–390.

    Topchy, A., et al. (2004b) Analysis of consensus partition in cluster ensemble. Proc. IEEE International Conference on Data Mining (ICDM'04)November 1-4Brighton, UK, pp. 225–232.

    van Dongen, S. Graph Clustering by Flow Simulation (2000a) PhD thesis, University of Utrecht.

    van Dongen, S. (2000b) Performance criteria for graph clustering and markov cluster experiments. Technical Report INS-R0012, National Research Institute for Mathematics and Computer Science, The Netherlands.

    Wakabayashi, Y. (1998) The complexity of computing medians of relations. Resenhas IME-USP, 3, 323–349.

    Wu, C., et al. (2004) PIRSF: family classification system at the protein information resource. Nucleic Acids Res, . 32, 112–114.

    Zhang, K. and Zhao, H. (2000) Assessing reliability of gene clusters from gene expression data. Funct. Integr. Genomics, 1, 156–173[Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
D. J. Sherman, T. Martin, M. Nikolski, C. Cayla, J.-L. Souciet, P. Durrens, and for the Genolevures Consortium
Genolevures: protein families and synteny among complete hemiascomycetous yeast proteomes and genomes
Nucleic Acids Res., January 1, 2009; 37(suppl_1): D550 - D554.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Nikolski, M.
Right arrow Articles by Sherman, D. J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Nikolski, M.
Right arrow Articles by Sherman, D. J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?