Bioinformatics Advance Access originally published online on September 6, 2007
Bioinformatics 2007 23(18):2361-2367; doi:10.1093/bioinformatics/btm358
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The global trace graph, a novel paradigm for searching protein sequence databases



1Institute of Biotechnology and 2Department of Biological and Environmental Sciences, Division of Genetics, P.O. Box 56 (Viikinkaari 5), FI-00014 University of Helsinki, Finland
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Propagating functional annotations to sequence-similar, presumably homologous proteins lies at the heart of the bioinformatics industry. Correct propagation is crucially dependent on the accurate identification of subtle sequence motifs that are conserved in evolution. The evolutionary signal can be difficult to detect because functional sites may consist of non-contiguous residues while segments in-between may be mutated without affecting fold or function.
Results: Here, we report a novel graph clustering algorithm in which all known protein sequences simultaneously self-organize into hypothetical multiple sequence alignments. This eliminates noise so that non-contiguous sequence motifs can be tracked down between extremely distant homologues. The novel data structure enables fast sequence database searching methods which are superior to profile-profile comparison at recognizing distant homologues. This study will boost the leverage of structural and functional genomics and opens up new avenues for data mining a complete set of functional signature motifs.
Availability: http://www.bioinfo.biocenter.helsinki.fi/gtg
Contact: liisa.holm{at}helsinki.fi
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Reliable detection of distant relationships is important for obtaining clues about the function and structure of new proteins. Ancient speciation events typically lead to a picture like Figure 1, where similarity is very strong within sub-families but of only marginal significance between sub-families. One imagines that new functions evolve by bursts of rapid change of a founder gene, followed by the accumulation of random mutations at constant rate once the new function has become fixed.
|
Protein families are modelled using profiles (Eddy, 1998). Profile models are designed to describe a set of sequences that fluctuate around a static distribution of amino acids at the centre of a family. Sub-populations which spread out over an elongated region in sequence space are best captured by multiple profiles. This strategy is adopted, e.g. in the PFAM collection of protein families (Bateman et al., 2004). In PFAM, distant relationships are annotated through clan membership by human experts. A match to a family profile then implies membership also in the clan. Unfortunately, most clan-level relationships remain unknown and unannotated. Sequencing more and more genomes has so far failed to saturate the coverage of protein families and, the coverage by PFAM families has stagnated around 70 % of known proteins. Automatically derived hierarchical clusterings of proteins (Heger and Holm, 2000; Heger et al., 2005; Kaplan et al., 2005 and references therein) have the advantage of completeness, but the determination of the boundary between homologous and unrelated sequences is unreliable or simply left to the user. To maximize information transfer, there is a need for methods which are able to traverse large evolutionary distances while discriminating against unrelated sequences. For example, a series of at least four PSI-Blast (Schaffer et al., 2001) searches is required to unify the PUA-like domain clan (Fig. 1).
The theoretical reasons for a limited horizon of detection of homologues by profiles were laid out by Altschul (1991) and subsequently empirically verified by numerous groups (Griffith-Jones and Bateman, 2002; Sadreyev and Grishin, 2004; Sander and Schneider, 1991). The basic argument is that to be able to discriminate related from unrelated sequences in a database, a profile must have a sufficient number of informative positions, in which amino acid frequencies differ from database background. Favouring some amino acids necessarily disfavours the others. Many families have diverged so far in different directions, that a common profile becomes too diluted to discriminate random sequences from related sequences. Currently, the most sensitive methods of remote homologue detections are based on the comparison of two profile models (Soding, 2005). The next generation of methods combines information from multiple profile models.
Here, we demonstrate that sequence neighbourhoods defined in terms of consistency pull distantly related proteins together over high orders of transitive similarity as detected by profile models. Unlike profile models, consistency is not dependent on amino acid types, but integrates statistics from a large library of pairwise alignments. Two residues (from proteins A and B) are consistently aligned if their pairing is supported by a large number of intermediate sequences X aligned to both A and B. Note that residues may be consistently aligned even if their amino acid types differ, if the variable segment is firmly anchored between conserved regions giving a strong signal in profile alignment. Going beyond triplet relationships (Do et al., 2005; Notredame et al., 1998), we introduced a consistency score that is evaluated—as the strength of the weakest link—over paths involving an arbitrary number of intermediates (Heger et al., 2004). Consistency-based scoring has been previously shown to improve the quality of alignments in a pre-defined set of homologous proteins (Do et al., 2005; Heger and Holm 2003a; Heger et al., 2004; Notredame et al., 2000). The contribution of this article is to show that consistency is applicable to fish out sets of homologous proteins from the database, in the first place, using a novel data structure—the Global Trace Graph.
| 2 METHODS |
|---|
|
|
|---|
2.1 Data structures
The Global Trace Graph is generated by a clustering process outlined in Figure 2. Our implementation computes a core global trace graph which considers only representative sequences in nrdb40 (Heger and Holm, 2003b). This core is implicitly extended to include every known protein sequence. This is because sequences with more than 40 % sequence identity to nrdb40 representatives are linked to the representatives by unambiguous alignment. Pairwise alignment data were obtained from our PairsDb database (Heger and Holm, 2003b). PSI-Blast (Schaffer et al., 2001) was used to compare the nrdb40 database against itself using parameters –j 10 –e 1.0 –v 500, i.e. at most 10 iterations and reporting at most 500 hits with an expectation value less than 1.0.
|
The global trace graph is generated in three steps briefly outlined here. A detailed description is in Supplementary Materials. First, an unweighted alignment trace graph is constructed where the nodes (vertices) are residues, and there is an edge between all residue pairs that are aligned in the input library of pairwise alignments. Nodes are labelled by the protein identifier, sequential residue number in the protein, and amino acid type.
In the second step, each edge is weighted by consistency. The consistency reflects the overlap between the neighbourhoods of two aligned residues and ranges from 100 for perfect consistency (the two residues are aligned to the same residues in all intermediate sequences) to 1. For computational convenience, we use discrete edge weights rounded to the nearest integer.
In the third step, the global trace graph is processed to remove redundant edges. Redundancy is a consequence of the way in which we evaluate the similarity of two residues that are connected by a path in the global trace graph. The path score is defined as the minimum edge weight (weakest link) along the path (Heger et al., 2004). Only the maximal path will be of interest, i.e. that path whose weakest link has the highest edge weight. The maximal path scores are retained by single linkage clustering at the coalescence level of two residues.
A crucial element during clustering is the additional requirement of single site occupancy, i.e. a cluster may contain at most one residue per protein domain. Since exact domain boundaries are unknowable, we imposed the constraint that residues from the same protein with a sequential separation less than 50 residues may not occur in the same cluster. Without this constraint, the meaning of clusters would be blurred as two nearby residues from the same protein could be occupying the same structurally equivalent position. A buffer of ±50 residues approximates the size of typical globular domains.
For remote homology detection, it is useful to include inter-cluster edges. These are edges between any pairs of residues which are not part of the same cluster and thereby link two clusters, but do not join them because of the rule of single site occupancy and stronger signals elsewhere. The elimination of redundant inter-cluster edges resulted in data compression by two orders of magnitude (details in Supplementary Materials).
2.2 Database searching
We call the clustered data structure the global trace graph (GTG). The elimination of redundant edges compared to the original alignment trace graph allows efficient database searching at high orders of transitivity. The tree structure retains a unique path connecting any two residues within the same cluster (Fig. 2). A GTG-LOCAL search considers only intra-cluster paths. To increase sensitivity, it is sometimes useful to search for connections between residues in different clusters; this is called a GTG-DEEP search (Fig. 3).
|
Database searching generates pairwise alignments between the query protein and proteins from a given target list. For example, fold assignment uses the set of structurally known proteins (PDB) as target set. Searching the whole nrdb40 database, a very large number of target proteins can potentially be reached from a given query protein. Motif tracking filters the set of reachable targets so that only target proteins that share at least one conserved residue type with the query protein are retained (Fig. 4). A conserved residue type has a frequency above 50 % in a subtree and all proteins with identical amino acids to the query become part of the target set. Motif tracking threads are started from each residue in the query sequence. Therefore, our approach frees the user from defining any search patterns beforehand as is required by PHI-Blast (Zhang et al., 1998).
|
Pairwise sequence alignments are generated using dynamic programming over a (sparse) dot matrix where paired residues are scored by the path score (Heger et al., 2004). Briefly, the score of an alignment is the sum of path scores between aligned pairs (s, t) of residues where residue s belongs to the query protein and residue t belongs to the target protein. If the residues reside in the same cluster, the path score w(s, t) is given directly by the consistency score at the coalescence level of s and t. The deep search is new to this work. Jumping from one cluster to another, the path score becomes the minimum of the scores of three sub-paths s ... i, i – j, j ... t where i and j are intermediate residues, ... indicates an intra-cluster path and i – j is an inter-cluster edge. The deep search may connect many alternative target residues to a given source residue and vice versa. We limited the number of out-going connections from any query residue to the ten highest-scoring paths s.i – j, among the 1000 inter-cluster edges nearest to the query residue (only 0.16 % of clusters have more than 1000 inter-cluster edges).
The pairwise alignments are optimized with respect to the consistency score, which does not take amino acid type into account. The motif score is a complementary measure defined as a weighted sum of identically aligned amino acids. If the amino acid type of the query residue is x, the motif score m is defined as
|
| (1) |
) and motif score (weight 1 –
). Ranking remains quite robust at different values of
(Supplementary Fig. 2). Unless specified otherwise, we used the motif score (
= 0) for ranking.
2.3 Assessment of coverage and reliability
Coverage-reliability plots were generated by sorting the query-target pairs in order of decreasing similarity score. A reference classification is used to label each pair as true (related) or false (unrelated). If the classification is unavailable, the pair is ignored. Reliability is TP/P, coverage is TP/T and false discovery rate is FP/P, where P is the number of predictions above a score threshold, T is the number of true pairs in the test set, TP is the number of correct predictions (true pairs) above the score threshold and FP is the number of wrong predictions (false pairs) above the threshold.
2.4 Implementation and availability
All programs were written in C/C++ with a Python interface. All computations were done on a Linux farm using six Intel® XeonTM 2.0 GHz CPU nodes. The pre-processing steps of edge weighting, clustering and removing redundant inter-cluster edges took about 6 weeks in total. The interactive GTG server (http://www.bioinfo.biocenter.helsinki.fi/gtg) can be used to search the PDB (January 2007), SCOP v.1.69, PFAM-A v.19 or nrdb40 v.2 databases, and to view alignments between a query sequence and its neighbours. The output of the GTG server is limited to 50 hits (some query sequences which belong to large superfamilies, e.g. methyltransferases, retrieve many thousand homologues from nrdb40). A stand-alone version of the database search script and databases (requiring 20 GB disk space) are available from the authors on request. With unlimited output, median database search times using GTG-DEEP with motif filtering are 2 min against PFAM-A domains and 6 min against nrdb40 representatives.
| 3 Results |
|---|
|
|
|---|
3.1 Super-alignment of all known protein sequences
In constructing the Global Trace Graph (GTG), we used as input a representative set of all known protein sequences, which are <40 % identical to each other. The resulting alignment trace graph comprises 137 732 290 vertices (residues) and 37 831 356 970 edges representing data from 150 640 613 pairwise alignments between 395 612 representative sequences. In the GTG, the residues are grouped into 6 140 576 clusters, which are linked by 101 161 217 inter-cluster edges. The Global Trace Graph is a huge data structure, but once computed, it enables fast, sensitive and selective database searches for high-order transitive similarities with a minimum of effort. For example, our method specifically identifies a set of distantly related domains from the PUA/ASCH superfamily clan in PFAM which, according to PSI-Blast, are second, third and fourth neighbours of the query (Table 1).
|
3.2 Large-scale benchmarks for the detection of remote homologues
Benchmarks based on the SCOP structural classification (Andreeva et al., 2004) are commonly used for rigorous assessment of the ability of database search methods to detect remote homologues. SCOP classifies known protein structures hierarchically at four levels: class, fold, superfamily, and family The Lindahl test set (Lindahl and Elofsson, 2000) comprises 976 SCOP domains. A total of 434 query domains have a remote homologue in the test set that belongs to the same superfamily but different family. To avoid bias due to possible misclassification, hits that belong to a different superfamily but to the same fold are ignored. GTG-DEEP outperforms all but one sequence-structure threading method at detecting a correct hit at rank 1 (Table 2). The differences are rather small, SPARKS-0 (Zhou and Zhou, 2004) detecting one more correct top-1 hit than GTG-DEEP (
= 0.4). The point to note here is that GTG operates on a different principle from the well performing sequence-structure threading methods in Table 2. GTG uses only sequence information, without any requirement for a structurally known template or sophisticated machine learning (Cheng and Baldi, 2006).
|
Sequence-based methods for remote homologue detection are often compared using coverage-reliability plots. Programs, which compare a pair of HMM profiles, are the top performers in published comparisons (Soding, 2005). GTG-DEEP outperformed HMM-HMM comparison programs, detecting 15–20 % more remote homologues at a fixed error rate of 5 % (Fig. 5 and Supplementary Fig. 3).
|
3.3 Automatic detection of familial relationships
PFAM-A v. 19 comprises 7430 families, of which 1346 families have been classified by experts into 205 clans. GTG implies the assignment of 112 previously unclassified PFAM families to known clans with a motif score above 1000, and a further 256 previously unclassified PFAM families are directly linked to a known clan with motif scores between 500 and 1000 (Supplementary Table 1). Coverage-reliability plots yield consistent false discovery rates of 0.3–0.5 % and 5 % at motif scores of 1000 and 500, respectively using either SCOP-folds or PFAM-clans as reference (Supplementary Fig. 3–4). While assignments especially at lower scores are bound to contain many false predictions, non-trivial and biologically interesting discoveries can be made and validated based on conserved sequence motifs and 3D mapping. For example, the amidohydrolase superfamily clan (Holm and Sander, 1997) (clan CL0034 in PFAM) is extended by glucuronate isomerase (PF02614) and RNase P subunit p30 (PF01876, most similar to the PHP domain PF02811). Both of these links are validated by the similarity of known structures for these families. Moreover, several unclassified families share a motif in the SAM-binding site of methyltransferases of known structure, including four domain families of unknown function (DUF752/PF05430, DUF633/PF04816, DUF43/PF01816, DUF1613/PF07757) which are therefore predicted to be SAM-dependent methyltransferases (Table 3).
|
| 4 DISCUSSION |
|---|
|
|
|---|
We have presented a methodology that integrates information globally across multiple profile comparisons. Common wisdom says that a transitive similarity search is doomed to fail at high orders of transitivity. This is because of domain chaining and the small-world property of the graph of protein similarities. Due to domain chaining and spurious matches, almost any two proteins become linked by similarity through third or fourth neighbours. The key to overcoming these pitfalls is a clustering of all residues in the sequence database. First, the possibility of domain chaining is eliminated, since any relationship is traced only between individual residues (so that an explicit alignment exists between the residues). Second, we perform a hierarchical clustering of residues, which is driven by consistency. This clustering eliminates many weaker, mainly false, links from consideration. In contrast to classical domain hunting (Holm and Sander, 1997; McEntyre and Gibson, 2004), our approach allows fully automatic database searches to retrieve similarities at high orders of transitivity, rather than being limited to a single intermediate (Park et al., 1998), and at better selectivity than transitive closure (Neuwald et al., 1997).
Each cluster of the global trace graph (GTG) contains at most one residue from any protein (domain) and can therefore be interpreted as a column of a hypothetical multiple alignment. Conserved residues can be detected—and, excitingly, enumerated—as subtrees within GTG clusters which are enriched in particular amino acid types. Further studies will address the correlation of conserved sequence patterns with specific biological functions (Dietmann and Holm, 2001; Heger and Holm, 2003c; Marttinen et al., 2006; Sivakumar et al., 2006; Watson et al., 2005).
The rapid growth of sequence databases presents a challenge to keep the Global Trace Graph updated. Much of sequence database growth is due to sequences that are similar to previously known ones and are thus already represented in the graph. However, at infrequent intervals, the Global Trace Graph will need to be re-computed. Most construction steps can be performed in parallel, which, with still increasing computational processing power, will help to keep the construction time practical.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank the Structural Genomics Group for discussion. This work was supported by grants from the Academy of Finland (#1102273 and #1103428) and Tekes (40672/01).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Burkhard Rost
Present address: MRC Functional Genetics Unit, Department of Physiology, Anatomy & Genetics, University of Oxford, South Parks Road, OX1 3QX Oxford, UK. ![]()
Present address: Department of Gernetics, Harvard Medical School, Boston, MA, USA. ![]()
Present address: Babraham Institute, Cambridge, UK. ![]()
Received on March 8, 2007; revised on June 11, 2007; accepted on July 5, 2007
| REFERENCES |
|---|
|
|
|---|
Altschul SF. Amino acid matrices from an information theoretic perspective. J. Mol. Biol, ( (1991) ) 219, : 555–565.[CrossRef][ISI][Medline].
Andreeva A, et al. SCOP database in 2004: refinements integrate structure and sequence family data. Nucleic Acids Res, ( (2004) ) 32, : D226–D229.
Bateman A, et al. The Pfam protein families database. Nucleic Acids Res, ( (2004) ) 32, : D138–D141.
Cheng J, Baldi P. A machine learning information retrieval approach to protein fold recognition. Bioinformatics, ( (2006) ) 522, : 1456–1463..
Dietmann S, Holm L. Identification of homology in protein structure classifiction. Nat. Struct Biol, ( (2001) ) 8, : 953–957.[CrossRef][ISI][Medline].
Do CB, et al. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res, ( (2005) ) 15, : 330–340.
Eddy SR. Profile hidden Markov models. Bioinformatics, ( (1998) ) 14, : 755–763.
Griffith-Jones S, Bateman A. The use of structure information to increase alignment accuracy does not aid homologue detection with profile HMMs. Bioinformatics, ( (2002) ) 18, : 1243–1249.
Heger A, Holm L. Towards a covering set of protein family profiles. Prog. Biophys, ( (2000) ) 73, : 321–337.[CrossRef].
Heger A, Holm L. More for less in structural genomics. J. Struct. Funct. Genomics, ( (2003a) ) 4, : 57–66.[CrossRef][Medline].
Heger A, Holm L. Exhaustive enumeration of protein domain families. J. Mol. Biol, ( (2003b) ) 328, : 749–767.[CrossRef][ISI][Medline].
Heger A, Holm L. Sensitive pattern discovery with fuzzy alignments of distantly related proteins. Bioinformatics, ( (2003c) ) 19, (Suppl. 1): i130–i137.[Abstract].
Heger A, et al. Accurate detection of very sparse sequence motifs. J. Comput. Biol, ( (2004) ) 11, : 843–857.[CrossRef][ISI][Medline].
Heger A, et al. ADDA: a domain database with global coverage of the protein universe. Nucl. Acids Res, ( (2005) ) 33, : D188–D191.
Holm L, Park J. DaliLite workbench for protein structure comparison. Bioinformatics, ( (2000) ) 16, : 566–567.
Holm L, Sander C. An evolutionary treasure: unification of a broad set of amidohydrolases related to urease. Proteins, ( (1997) ) 28, : 72–82.[CrossRef][ISI][Medline].
Kaplan N, et al. ProtoNet 4.0: a hierarchical classification of one million protein sequences. Nucleic Acids Res, ( (2005) ) 33, : D216–D218.
Kim D, et al. PROSPECT II: protein structure prediction program for the genome-scale. Protein Eng, ( (2003) ) 16, : 641–650.
Lindahl E, Elofsson A. Identification of related proteins on family, superfamily and fold level. J. Mol. Biol, ( (2000) ) 295, : 613–625.[CrossRef][ISI][Medline].
Marttinen P, et al. Bayesian search of functionally divergent protein subgroups and their function specific residues. Bioinformatics, ( (2006) ) 22, : 2466–2474.
McEntyre JR, Gibson TJ. Patterns and clusters within the PSM column in TiBS, 1992–2004. Trends Biochem. Sci, ( (2004) ) 29, : 627–633.[CrossRef][ISI][Medline].
Neuwald AF, et al. Extracting protein alignment models from the sequence database. Nucleic Acids Res, ( (1997) ) 25, : 1665–1677.
Notredame C, et al. COFFEE: an objective function for multiple sequence alignments. Bioinformatics, ( (1998) ) 14, : 407–422.
Notredame C, et al. T-Coffee: a novel method for fast and accurate multiple sequence alignment. J. Mol. Biol, ( (2000) ) 302, : 205–217.[CrossRef][ISI][Medline].
Park J, et al. Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J. Mol. Biol, ( (1998) ) 284, : 1201–1210.[CrossRef][ISI][Medline].
Sadreyev RI, Grishin NV. Quality of alignment comparison by COMPASS improves with inclusion of diverse confident homologs. Bioinformatics, ( (2004) ) 20, : 818–828.
Sander C, Schneider R. Database of homology-derived protein structures and the structural meaning of sequence alignment. Proteins, ( (1991) ) 9, : 56–68.[CrossRef][ISI][Medline].
Schaffer AA, et al. Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res, ( (2001) ) 29, : 2994–3005.
Shi J, et al. FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol, ( (2001) ) 310, : 243–257.[CrossRef][ISI][Medline].
Sivakumar A, et al. From sequences to a functional unit. Physiol. Genomics, ( (2006) ) 25, : 1–8.
Soding J. Protein homology detection by HMM-HMM comparison. Bioinformatics, ( (2005) ) 21, : 951–960.
Watson JD, et al. Predicting protein function from sequence and structural data. Curr. Opin. Struct. Biol, ( (2005) ) 15, : 275–284.[CrossRef][ISI][Medline].
Zhang Z, et al. Protein sequence similarity searches using patterns as seeds. Nucleic Acids Res, ( (1998) ) 26, : 3986–3990.
Zhou H, Zhou Y. Single-body residue-level knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins, ( (2004) ) 35, : 1005–1013..
Zhou H, Zhou Y. Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins, ( (2005) ) 58, : 321–328.[CrossRef][ISI][Medline].
This article has been cited by other articles:
![]() |
A. Heger, E. Korpelainen, T. Hupponen, K. Mattila, V. Ollikainen, and L. Holm PairsDB atlas of protein sequence space Nucleic Acids Res., January 11, 2008; 36(suppl_1): D276 - D280. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





