Bioinformatics Advance Access originally published online on December 14, 2004
Bioinformatics 2005 21(8):1437-1442; doi:10.1093/bioinformatics/bti218
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

DbW: automatic update of a functional family-specific multiple alignment

V. Prigent*, J. C. Thierry, O. Poch and F. Plewniak

Laboratoire de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, (CNRS/INSERM/ULP) BP 10142, 67404 Illkirch Cedex, France

*To whom correspondence should be addressed.


    Abstract

Motivation: Recent advances in gene sequencing have provided complete sequence information for a number of genomes and as a result the amount of data in the sequence databases is growing at an exponential rate. We introduce here a new program, DbW, to automate the update of a functional family-specific multiple alignment that tries to include relevant sequences. The program is based on the use of different sources of information: sequences and annotations in databases.

Results: The advantages of DbW are demonstrated using the 20 families of aminoacyl-tRNA synthetases, where DbW detects a maximum of homologous sequences in the Swiss-Prot and SPTREMBL databases. The global specificity of DbW in this test is 98.4% (1.6% of the sequences included in the alignment did not belong to the family according to their function), and the global sensitivity of DbW is estimated to be 95.2%. Thus, DbW provides a reliable basis for the many applications that rely on accurate multiple alignments, e.g. functional residue identification, 2D/3D structure prediction or homology modeling.

Availability: The DbW software is available for download at ftp://ftp-igbmc.u-strasbg.fr/pub/DbW/DbW.tar and online at http://titus.u-strasbg.fr/DbW

Contact: prigent@igbmc.u-strasbg.fr


    INTRODUCTION
For nearly 15 years, the amount of data in the sequence databases has grown at an exponential rate. In this context, since sequence databases are constantly being updated to include new sequences, a protein family studied by a biologist also requires regular updates to include new members. Systematic methods are required to cope with the steadily increasing volume of database updates. Automatic alerting services are available to the internet community which allow researchers to search for homologues of a sequence of interest and to limit the search to the most recent updates of the databases. For example, DBWatcher (Plewniak, 1996, http://www-igbmc.u-strasbg.fr/BioInfo/LocalDoc/DBWatcher/) is a program handling periodic BLAST searches (Altschul et al., 1990) to find similarities to a user specified sequence. It keeps track of the previous searches and performs new ones only when necessary (i.e. when the database has been updated, the sequence has been modified or when settings have been changed). Similar free protein sequence alerting services are available: SwissShop (Peitsch, 1995, http://ca.expasy.org/swiss-shop/) and FastAlert (Eggenberger et al., 1996). With these methods, searches are made regularly and a list of the newly detected sequences is sent to the user by e-mail. This list will include (1) sequences of the same structural family (same fold) but having different functions and also (2) some fold unrelated sequences (false positives). So the user needs to manually filter the sequences proposed to select only those sequences that belong to the protein functional family of interest (same fold and same function). To this end, the relationship between sequence homology and function is evaluated but this work is difficult because no clear measure of functional similarity exists between any two proteins (Bork et al., 1998; Wilson et al., 2000; Lan et al., 2003). 
The user can also use text-mining if annotations of the newly detected sequences are available in databases (Swiss-Prot, Pfam, Interpro). However, this filtering is time consuming and, in order to relieve the user of this task, an automatic system is required.

In this paper we present a new program, DbW, that takes advantage of the multiple alignment of a protein functional family in order to first characterize the specificity of the sequences and then filter the new sequences detected regularly by searching the Swiss-Prot and SPTREMBL databases to exclude functionally unrelated sequences. By default, the program takes as input a multiple alignment of the protein functional family. However, DbW has been extended to be able to generate the multiple alignment from a single input sequence. As a result, only the functional family related sequences will be regularly selected and integrated in a complete multiple alignment proposed to the user. In order to perform a comprehensive evaluation of the reliability of DbW, we used the 20 families of aminoacyl-tRNA synthetases (aaRS). The aaRS differ by the number of members in a family [from 27 (LysRS of class I) to 202 (AspRS) members in the non-redundant Swiss-Prot database], the fold between aaRS of class I (SCOP classification: ‘contains a conserved all-alpha subdomain at the C-terminal extension’) and class II (SCOP classification: ‘contains large mixed beta-sheet’), the length of the sequences (from 500 to 1500 residues) and the degree of similarity between the sequences [from 22% (TyrRS) to 65% (α-subunit of GlyRS)]. Each of the 20 aaRS families, without exception, has a complex, modular multidomain architecture. Furthermore, the domains form a network that connects aaRS of different families (Wolf et al., 1999).


    METHODS
Requirements
The DbW program is written in the Tcl language and can be downloaded at ftp://ftp-igbmc.u-strasbg.fr/pub/DbW/DbW.tar. DbW requires several other programs, all written in the C language: BLAST, TribeMCL (Enright et al., 2002), DbClustal (Thompson et al., 2000), Ballast (Plewniak et al., 2000), LEON (Thompson et al., 2004), HMMER (Eddy, 1998), RASCAL (Thompson et al., 2003) and CLUSTALW (Thompson et al., 1994). Moreover, local or internet access to the Swiss-Prot and SPTREMBL databases and to the Sequence Retrieval System (SRS) (Etzold et al., 1996) is needed.

Makefiles for Solaris and Digital Unix installation are available but porting to other Unix-based operating systems should be straightforward.

Input/output
By default, the program takes as input a high-quality reference multiple alignment of a protein functional family. But DbW has been extended to take as input, if needed, a single protein sequence.

A functional family-specific alignment is automatically and regularly updated. In addition, the user is provided with (1) an annotation of the protein family with the keywords from sequence descriptions (see below Determination of keywords) and (2) a description of the integrated sequences (Access, ID, organism, classification, description).

Overview
The DbW procedure consists of several steps (Fig. 1). First of all, a high-quality reference alignment is created if needed from the single sequence provided by the user (Step 0), the DbW profile for this alignment is determined (Step 1) and a filter specific to the sequences is created (Step 2). These steps are performed only when the request is initiated. Then, during the last step, new sequences are detected in the Swiss-Prot and SPTREMBL databases using the DbW profile, fold and functionally unrelated sequences are filtered out and the complete functional family-specific alignment is automatically updated (Step 3). Step 3 is repeated on a regular basis. All the steps of the DbW method are explained in more detail below.



Fig. 1 Overview of the method developed in DbW.

 
Creation of the reference alignment
If a user query sequence is given as input to the program, a reference alignment is created. DbW searches the databases for homologues using the BlastP program. All the sequences detected by BlastP with E-value <0.001, including sequences both related and unrelated to the query, are then compared with one another using BlastP. A graph is constructed from the resulting pairwise sequence similarities, where the nodes of the graph represent proteins and the connecting edges represent sequence similarities. The proteins in this graph are then clustered by the TribeMCL program. DbW uses a text analysis to retain the TribeMCL clusters similar to that containing the query (i.e. clusters with sequences related to the query) and excludes the others. The text annotation analysis, i.e. the analysis of the description field referenced in databases and extracted by SRS, is explained in the following subsection. The sequences of the retained clusters are then aligned with the DbClustal program and fragments are eliminated. Fragments are detected (1) by text analysis and (2) using the following criterion: if another sequence exists that shares more than 60% identity and if the lengths of the two sequences differ by more than 20%, then the shorter sequence is a fragment. Then, the RASCAL program corrects any local alignment errors and the LEON program detects and removes unrelated sequences in the reference alignment. Finally, identical sequences are removed to avoid redundancy and their accession numbers are kept in a file proposed to the user. Once created, this reference alignment will not be modified any further.
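The sequence-based fragment criterion can be illustrated with a minimal Python sketch. It is not the DbW code (which is written in Tcl). The function name, the input dictionaries and the choice to measure the length difference relative to the longer sequence are assumptions made for illustration only.

```python
def find_fragments(lengths, identity, min_identity=0.60, max_len_diff=0.20):
    """Return the names of sequences flagged as fragments.

    lengths:  dict mapping sequence name -> sequence length (hypothetical input)
    identity: dict mapping frozenset({a, b}) -> fractional identity in [0, 1]
    Assumption: the length difference is measured relative to the longer sequence.
    """
    fragments = set()
    names = sorted(lengths)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            # The pair must share more than 60% identity.
            if identity.get(frozenset((a, b)), 0.0) <= min_identity:
                continue
            longer, shorter = (a, b) if lengths[a] >= lengths[b] else (b, a)
            diff = (lengths[longer] - lengths[shorter]) / lengths[longer]
            # The lengths must differ by more than 20%: the shorter one is a fragment.
            if diff > max_len_diff:
                fragments.add(shorter)
    return fragments
```

With hypothetical lengths of 1000 and 700 residues and 95% identity, the 700-residue sequence would be flagged, whereas a 950-residue sequence sharing only 40% identity with the first would not.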

Similarity between TribeMCL clusters by text analysis
All the TribeMCL clusters are compared against the cluster containing the query. The vector-cosine model of text retrieval is used (MacCallum et al., 2000; Wilbur and Yang, 1996). For each TribeMCL cluster, a text is created with words of the Swiss-Prot and SPTREMBL description field from all sequences in this cluster. All analyses are case insensitive and all punctuation marks are removed. Some words not carrying any functional information (fragment, protein, putative, long, hypothetical, probable, potential, like, related) are removed, because they may be a source of noise in processing. In summary, the similarity between two texts, sim(d, e), is calculated as the cosine of the angle between the two vectors representing the texts ($\vec{d}$ and $\vec{e}$):

$$\mathrm{sim}(d, e) = \frac{\vec{d} \cdot \vec{e}}{\lVert \vec{d} \rVert \, \lVert \vec{e} \rVert}$$

High values for sim(d, e) are the result of matching rare words which occur frequently within each text. We have experimentally determined a similarity threshold of 0.15, which is also confirmed by the literature (Wilbur and Yang, 1996).
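As an illustration of the vector-cosine comparison, the following Python sketch builds a word-count vector for each text and returns the cosine of the angle between them. It is only a sketch under stated assumptions: raw word counts are used as vector components, whereas the term weighting actually used by DbW is not specified here, so the 0.15 threshold does not necessarily transfer to this simplified version. The function names are hypothetical.

```python
import math
import re
from collections import Counter

# Words listed in the text as carrying no functional information.
STOP_WORDS = {"fragment", "protein", "putative", "long", "hypothetical",
              "probable", "potential", "like", "related"}


def tokenize(text):
    """Lower-case the text, remove punctuation marks and drop the stop words."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return [w for w in words if w not in STOP_WORDS]


def cosine_similarity(text_d, text_e):
    """Cosine of the angle between the word-count vectors of two texts.

    Assumption: raw counts are used as weights (no rarity-based weighting).
    """
    d, e = Counter(tokenize(text_d)), Counter(tokenize(text_e))
    dot = sum(d[w] * e[w] for w in d.keys() & e.keys())
    norm = (math.sqrt(sum(v * v for v in d.values()))
            * math.sqrt(sum(v * v for v in e.values())))
    return dot / norm if norm else 0.0
```

A cluster would then be retained when `cosine_similarity(query_cluster_text, cluster_text)` reaches the chosen threshold.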

Creation of the DbW profile
HMMER is an implementation of profile HMM methods for sensitive database searches using multiple sequence alignment profiles as queries (Eddy, 1998). The reference alignment is used to build the DbW profile using the ‘hmmbuild’ program with the -f option (that permits fragments to be detected) in the HMMER 2.3.2 package. Finally, the ‘hmmcalibrate’ program is used to calibrate the HMM search statistics.

Functional family-specific filter creation
A functional family-specific filter is created to test the newly detected sequences. This filter consists of two rules based on the characterization of the specificity of the sequences in the reference alignment.

First rule (Fig. 2): DbW first detects locally conserved segments (local maximum segments or LMSs) in the sequences included in the reference alignment using the Ballast program. Then ‘characteristic LMSs’ are determined, i.e. LMSs found in at least 80% of the sequences. A cutoff for the number of ‘characteristic LMSs’ is determined, corresponding to the minimum number found for any sequence in the reference alignment. Newly detected sequences are considered to be a family member if they contain at least this minimum cutoff of ‘characteristic LMSs’. However, this rule is not sufficient. In some cases, if we use this single rule, a sequence of a close but different functional family can be integrated in the user family.



Fig. 2 Filter creation (first rule) for the protein family: (1) Detection of locally conserved segments (LMSs) in the sequences of reference alignment. (2) Determination of the ‘characteristic LMSs’ (LMSs found in at least 80% of sequences of reference alignment, e.g. LMSs 2, 3, 4 and 7). (3) Determination of the first rule: among all the ‘characteristic LMSs’ of the family, a family member must have a minimum number of matches along its sequence (cutoff corresponding to the minimum number found for any sequence in the reference alignment, e.g. LMS 2).

 
Second rule: DbW also analyses all the LMSs detected by Ballast and conserves those that are never found in the reference alignment—the ‘unrelated sequence specific LMSs’. A family member contains no match to any of these ‘unrelated sequence specific LMSs’.
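The two rules can be sketched in Python as follows. This is a simplified illustration, not the DbW implementation: LMSs are represented by arbitrary identifiers, each sequence by the set of LMS identifiers it matches, and the function names and data layout are assumptions.

```python
from collections import Counter


def build_filter(reference_lms, all_lms, min_fraction=0.80):
    """Derive the two filter rules from the reference alignment.

    reference_lms: dict mapping reference sequence name -> set of LMS ids it contains
    all_lms:       iterable of every LMS id detected by the LMS search (hypothetical input)
    Returns (characteristic LMSs, cutoff, unrelated-sequence-specific LMSs).
    """
    if not reference_lms:
        raise ValueError("the reference alignment must contain at least one sequence")
    n_sequences = len(reference_lms)
    counts = Counter(lms for found in reference_lms.values() for lms in found)
    # First rule: LMSs present in at least 80% of the reference sequences.
    characteristic = {lms for lms, c in counts.items() if c / n_sequences >= min_fraction}
    # Cutoff: the minimum number of characteristic LMSs found in any reference sequence.
    cutoff = min(len(found & characteristic) for found in reference_lms.values())
    # Second rule: LMSs never found in the reference alignment.
    unrelated_specific = set(all_lms) - set(counts)
    return characteristic, cutoff, unrelated_specific


def is_family_member(candidate_lms, characteristic, cutoff, unrelated_specific):
    """Apply both rules to the set of LMS ids matched by a newly detected sequence."""
    enough_characteristic = len(candidate_lms & characteristic) >= cutoff
    no_unrelated_match = not (candidate_lms & unrelated_specific)
    return enough_characteristic and no_unrelated_match
```

In this sketch, a candidate is accepted only if it passes both tests, which mirrors the observation above that the first rule alone can admit sequences from a close but different functional family.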

Database searching
Regular database searches are performed using the DbW profile in the Swiss-Prot and SPTREMBL databases. The newly detected sequences are filtered and family members are integrated in the complete alignment, which is proposed to the user. New sequences are aligned to the reference alignment using the ‘sequence-to-profile’ option of the ClustalW program. The reference alignment is unmodified.

Determination of keywords in the reference alignment of a family
Description fields referenced in the Swiss-Prot and SPTREMBL databases for each of the sequences of the reference alignment are automatically extracted by SRS. Keywords are then determined using an automatic method to scan words in these description fields, to select and to rank only discriminating words. This method is based on a frequency analysis of individual words (Marcotte et al., 2001). All analyses used here are case insensitive and all punctuation marks are substituted by spaces or removed.

First, the descriptions of all sequences in the reference alignment are used as a training set and a dictionary is constructed containing the frequencies of each of these words in the Swiss-Prot database. Next, the word frequencies calculated in the training set are compared to the calculated Swiss-Prot frequencies. Words with unexpectedly higher or lower frequencies might be useful for discriminating the reference sequences. For each word in the training set, the number of occurrences n is counted, and the probability p(n|N, f) of finding the word the observed number of times, given the known dictionary frequency f and the total number of words N in the training set, is calculated from the Poisson distribution as

$$p(n \mid N, f) \approx \frac{e^{-Nf}\,(Nf)^{n}}{n!}$$

This approximation is valid when the total number of words used to generate the dictionary is much greater than N and when f is small. In practice, to avoid floating point errors, the log of the probability is calculated as $\ln p(n \mid N, f) \approx -Nf + n \ln(Nf) - \ln(n!)$, where n! is estimated using Stirling's approximation for large n. Statistically significant words found with $\ln p \le -13$ ($p \le 2 \times 10^{-6}$) are considered to be discriminating and are proposed to the user.
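The keyword scoring can be sketched in a few lines of Python. This is an illustrative sketch rather than the DbW code: ln(n!) is computed with `math.lgamma` instead of Stirling's approximation, words without a dictionary frequency are skipped (an assumption), and the function names are hypothetical.

```python
import math
from collections import Counter


def log_poisson(n, n_total, f):
    """ln p(n | N, f) = -N f + n ln(N f) - ln(n!) for the Poisson approximation."""
    if f <= 0:
        raise ValueError("the dictionary frequency f must be positive")
    expected = n_total * f
    # math.lgamma(n + 1) equals ln(n!) and stays accurate for large n.
    return -expected + n * math.log(expected) - math.lgamma(n + 1)


def discriminating_words(training_words, dictionary_freq, threshold=-13.0):
    """Return training-set words with ln p <= threshold, most significant first.

    training_words:  list of words from the descriptions of the reference sequences
    dictionary_freq: dict mapping word -> frequency f in the background dictionary
    Assumption: words absent from the dictionary are ignored.
    """
    n_total = len(training_words)
    scored = []
    for word, n in Counter(training_words).items():
        f = dictionary_freq.get(word)
        if not f:
            continue
        log_p = log_poisson(n, n_total, f)
        if log_p <= threshold:
            scored.append((log_p, word))
    return [word for _, word in sorted(scored)]
```

A word that occurs about as often as its dictionary frequency predicts yields a log-probability close to zero and is not reported, whereas a rare word that recurs many times in the training set yields a strongly negative value.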


    RESULTS AND DISCUSSION
The DbW program has been tested for over a year on the 20 families of aaRS. For several reasons, aaRS appear to be excellent test cases:

  1. Because each of the aaRS is indispensable in the context of the modern translation system, this collection provides numerous sequences of functionally equivalent proteins from diverse organisms (bacterial, archaeal and eukaryotic organisms).
  2. Because aaRS have been among the most popular objects of molecular biology studies, the sequences available in the databases are well annotated.
  3. Each of the 20 aaRS families, without exception, has a complex, modular multi-domain architecture. Furthermore, the domains form a network that connects aaRS of different families (Wolf et al., 1999). There is a close relationship between some aaRS families of the same class (e.g. IleRS and ValRS). The test set also includes an example of protein fusion (GluRS–ProRS).
  4. The aaRS families differ by the length and the degree of similarity of the sequences.

We describe below a comprehensive analysis of the results obtained by DbW for all 20 synthetase families, corresponding to 23 complete multiple alignments. Indeed, for three of the families (LysRS, GlyRS and PheRS) several different functional alignments were required. There are two types of LysRS that belong to class I and class II, respectively, and appear to be functionally unrelated to each other (Ibba et al., 1997), therefore two alignments were created for LysRS. Similarly, three alignments were created for each subunit of GlyRS (α, α', β) and two for each subunit of PheRS (α, β). Only one alignment was used for GluRS and GlnRS that appear to be related due to specific evolutionary events [duplication of GluRS in eukaryotes, followed by a switch of specificity to glutamine in one of the copies; horizontal transfer of GlnRS from eukaryotes to Proteobacteria; horizontal transfer of mitochondrial GluRS from eukaryotes to spirochaetes and Chlamydiae (Wolf et al., 1999)].

The 23 alignments were created in January 2003 and since then have been maintained by regular update. The reliability of each DbW step has been evaluated.

Creation of the reference alignments
DbW takes advantage of the reference multiple alignment given as input by the user. However, if needed, this reference alignment is created by DbW from a single sequence. The quality of all the subsequent steps depends on the quality of this reference alignment. To create a high-quality reference alignment, DbW needs to detect a maximum number of family members and reject non-related sequences.

To detect a maximum of homologues, the selection of the user sequence for each functional family is critical. The user query sequence should be representative of the whole family in order to detect as many sequences as possible using BlastP. So, distantly related members of the family or fragments are excluded as queries. In this study, searches were performed using the Escherichia coli, Homo sapiens, Aeropyrum pernix or Saccharomyces cerevisiae aaRS sequences, in order to explore the three kingdoms of the tree of life and to select the queries that detect the most sequences. For example, in the case of ValRS, these four queries detect 120, 118, 98 and 114 ValRS, respectively, in the Swiss-Prot (Release 41) and SPTREMBL (Release 22) databases with an E-value <0.001. We therefore chose the E. coli sequence as a query to run DbW.

Detected sequences are then clustered into homologous families (i.e. proteins sharing a common ancestor) with the TribeMCL program, a method for rapid and accurate clustering of protein sequences into families. The quality of the clusters is very high, although some of the largest families (with more than 1000 members) may contain a number of unrelated members (Enright et al., 2002). Detailed examination of the TribeMCL clusters of all 20 aaRS families shows that proteins are well partitioned, according to the InterPro annotations, except for some particular families: TrpRS, AsnRS, GluRS/GlnRS and ProRS. These results highlight two problems that can affect the DbW process: a close relationship between two functional families and multidomain proteins. (1) A small number of TyrRS (2%) are present in the clusters of TrpRS, due to a strong conservation between TrpRS and TyrRS. Indeed, sequence relatedness of structurally superimposable residues throughout TrpRS and TyrRS implies that they diverged more recently than most aminoacyl-tRNA synthetases (Doublie et al., 1995). The AsnRS family clusters contain a certain number of AspRS sequences (~25%). This is explained by evolutionary events: duplication of AspRS in eukaryotes, followed by the switch of specificity to asparagine in one of the copies; ancient horizontal transfer of AsnRS from eukaryotes to bacteria (Wolf et al., 1999). (2) Some bifunctional Glu–ProRS are present in the clusters of the GluRS/GlnRS (3%), and some in the clusters of the ProRS (16%). Indeed, the genes of GluRS and ProRS are organized differently in the three kingdoms of the tree of life. In bacteria and archaea, distinct genes encode the two proteins while in several organisms from the eukaryotic phylum of coelomate metazoans, the two polypeptides are carried by a single polypeptide chain to form a bifunctional protein (Berthonneau and Mirande, 2000).

After clustering by TribeMCL, the TribeMCL cluster containing the query and other TribeMCL clusters that are similar according to the annotation description are retained. In the 23 aaRS alignments, all similar clusters have been merged together. [aaRS are indeed well annotated and the description lines contain very specific nomenclature unlikely to be shared between remote homologues, so close but different functional protein families do not merge together.] However, some difficulties can be encountered with this method. One difficulty arises from the interpretation of free-style text (human-written text) by computer programs. For example, an IleRS can be annotated in different ways: ‘Isoleucyl-tRNA synthetase, cytoplasmic (EC 6.1.1.5) (Isoleucine–tRNA ligase) (IleRS) (IRS)’, ‘IleS protein (Fragment)’, and some typing errors can also be introduced: ‘Isoleucyl-tRNA syntehtase-related’. But the TribeMCL aaRS clusters contain many sequences (on average 60 sequences), so most of the definitions are present in each cluster, which allows similar clusters to be retained despite this handicap. A second limitation is that some proteins of unknown function are not yet well annotated in databases (‘hypothetical protein’). With this method, only those uncharacterized proteins that are present in retained clusters containing annotated proteins can be integrated in the reference alignment.

Finally, the agreement between two methods (TribeMCL, text-analysis) using different sources of information (sequences and annotations) permits the reference alignments to contain a representative sample of sequences of a given protein family. This approach appears to be both sensitive and specific to the family of interest: 3325 related sequences are included in the 23 reference alignments produced, from 34 (LysRS of class I) to 195 (AlaRS) per alignment (138 on average) and with only a few distantly related sequences. In our test set, the reference alignments were then manually verified to remove unrelated sequences and fragments, if necessary, in order to ensure a high-quality protein functional alignment and to maximize the efficacy of the subsequent steps of the program.

Automatic and regular update of the complete alignments
The multiple alignments created are regularly updated by searching the databases for new members of the protein functional family. The search method chosen in DbW is HMMER and, starting with the high-quality reference alignments, 2196 aaRS annotated in Swiss-Prot (Release 44, based on Interpro annotation) are detected. Careful examination of the HMMER results confirmed that more distant relationships were detected than with BlastP (data not shown), although a time penalty is incurred. Search methods using evolutionary information in multiple alignments are indeed known to be more sensitive than single sequence methods (Holmes, 2000). The best illustration has been obtained with the TyrRS family: regardless of the query, BlastP detected at most 17 of the 33 sequences annotated in Swiss-Prot (Release 44) whereas HMMER, starting with an alignment containing a representative sample of TyrRS, detected all 33 sequences. This can be explained by a subdivision of the TyrRS family due to evolutionary events: eukaryotic and archaeal TyrRS genes have evolved more rapidly than the bacterial TyrRS genes. Moreover, in eukaryotes, cytoplasmic and mitochondrial isoforms of TyrRS are encoded by separate genes and the gene encoding the mitochondrial protein appears to be of bacterial origin while the ancestry of the cytoplasmic version can be traced back to the archaea (Brown et al., 1997).

HMMER hits with an E-value below the statistical significance cutoff (0.001) are then filtered by DbW. Detailed examination of the 23 multiple alignments of aaRS shows that most of the sequences are integrated in the alignment of their functional family. However, because of the very similar domain architecture shared by these 23 functional families of aaRS (Wolf et al., 1999), certain newly detected sequences may not be integrated into the correct family. For example, 21 AspRS have been integrated into the complete alignment of AsnRS, instead of AspRS.

Since January 2003, 4298 non-redundant sequences have been integrated in these alignments. On average, from 1 (class I LysRS) to 13 (GluRS/GlnRS) sequences are integrated per month in the complete alignments (on average 6 aaRS sequences/month in a complete alignment), and 95 sequences (of which 69 aaRS) detected with an E-value <0.001 have been excluded from the complete alignments. Excluded aaRS are either distantly related aaRS or fragments.

Complete alignment quality
The sensitivity and specificity of the DbW filter estimated using the complete alignments of the 23 aaRS (Table 1) follows:

where TP (True Positive) are the homologous sequences integrated into their functional protein family, FN (False Negative) are the homologous sequences excluded from their functional protein family, TN (True Negative) are the unrelated sequences excluded from the functional protein family, and FP (False Positive) are the unrelated sequences integrated into their functional protein family.


Table 1 Sensitivity and specificity of the DbW filter estimated using the complete alignments of the 23 aaRS

 
The global filter specificity of DbW in this test is 98.4% (1.6% of the sequences included in the alignment did not belong to the family according to their function). The global filter sensitivity of DbW is estimated to be 95.2%. However, 13% of the sequences in the test multiple alignments have not yet been classified in the Interpro database, so the filter sensitivity or specificity of the method could not be measured accurately.
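For readers who wish to recompute such figures from the four counts, a minimal Python sketch is given below. The exact formulas used by the authors are not reproduced in this text, so the sketch shows the standard definitions of sensitivity and specificity, together with the fraction of integrated sequences that belong to the family, which is the quantity that the parenthetical remark above describes. The function names and any counts passed to them are hypothetical.

```python
def _ratio(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero."""
    return numerator / denominator if denominator else None


def sensitivity(tp, fn):
    """Standard sensitivity: fraction of true family members that were integrated."""
    return _ratio(tp, tp + fn)


def specificity(tn, fp):
    """Standard specificity: fraction of unrelated sequences that were excluded."""
    return _ratio(tn, tn + fp)


def fraction_correct_inclusions(tp, fp):
    """Fraction of integrated sequences that do belong to the family.

    Assumption: this corresponds to the remark that 1.6% of the included
    sequences did not belong to the family; it is not the standard
    definition of specificity.
    """
    return _ratio(tp, tp + fp)
```

Keeping the three quantities separate makes it explicit which denominator a reported percentage refers to.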

DbW has been optimized to maximize the specificity of the complete alignments. Inclusion of proteins that are not members of the functional family will bias subsequent applications. The construction of a reference multiple alignment and the corresponding filter are designed to select only protein functional family members, although some family members may be excluded as a consequence. Nevertheless, the sensitivity achieved is still 95.2%. This is in contrast to database search methods, such as PSI-BLAST (Altschul et al., 1997), that are designed to detect as many sequences as possible, although some false positive hits may be detected. For example, for the ‘SYI_HUMAN’ IleRS query sequence, DbW selects only IleRS family members and excludes all closely related ValRS sequences and also some potential IleRS. After four iterations of PSI-BLAST (E-value <0.001), sequences belonging to the IleRS, ValRS, LeuRS, MetRS, CysRS and ArgRS (aaRS of class Ia) families are detected.

Automatic annotation of the families
The calculation of keywords is useful for automatically generating annotation about a specific protein family. DbW captures information from Swiss-Prot descriptions of all the sequences integrated in the complete alignment of a family and discriminating words are determined. Each of the 23 alignments has been well described by this method. The most significant words obtained are found in most cases to be good indicators of different aspects of protein function. For example, the MetRS family is described as ‘EC 6.1.1.10’, ‘Methionyl-tRNA’, ‘MetG’ and ‘Methionine-tRNA’.


    CONCLUSIONS
A new program to automate the update of functional family-specific multiple alignment was presented in this paper. This program, DbW, takes advantage of a multiple alignment to create a functional family-specific filter based on the characterization of the sequences of the family. By making use of DbW, researchers can integrate newly discovered sequences relevant to their research with no need to constantly repeat the same database searches. DbW provides a reliable basis for the many applications that rely on accurate multiple alignments, e.g. functional residue identification, 2D/3D structure prediction or homology modeling. However, some errors can be introduced during the update of the complete alignment. In particular, a few distantly related sequences can be integrated, so a rapid manual validation may be useful.

Work is now in progress to automatically explore protein functional family relationships. Future developments will highlight links existing between close families (e.g. AspRS and AsnRS) and will process the information within these families to improve the filter and thus the reliability of the update.


    Acknowledgments
 
We thank Raymond Ripp, Patrice Koehl, Julie Thompson and Luc Moulinier for the stimulating discussions. This work was funded by the Institut National de la Santé et de la Recherche Médicale, the Centre National de la Recherche Scientifique, the Université Louis Pasteur, the Fond National de la Science (GENOPOLE), the SPINE project, contract-no. QLG2-CT-2002-00988 under the RTD programme ‘Quality of Life and Management of Living Resources’ and the Ministère de la Recherche et des Nouvelles Technologies under the ACI ‘IMPBio’ project number 2004-78.

Received on September 21, 2004; revised on December 10, 2004; accepted on December 10, 2004

    REFERENCES

    Altschul, S.F., Gish, W., Miller, W., Myers, E.W., Lipman, D.J. (1990) Basic Local Alignment Search Tool. J. Mol. Biol., 215, 403–410.

    Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    Berthonneau, E. and Mirande, M. (2000) A gene fusion event in the evolution of aminoacyl-tRNA synthetases. FEBS Lett., 470, 300–304.

    Bork, P., Dandekar, T., Diaz-Lazcoz, Y., Eisenhaber, F., Huynen, M., Yuan, Y. (1998) Predicting function: from genes to genomes and back. J. Mol. Biol., 283, 707–725.

    Brown, J.R., Robb, F.T., Weiss, R., Doolittle, W.F. (1997) Evidence for the early divergence of tryptophanyl- and tyrosyl-tRNA synthetases. J. Mol. Evol., 45, 9–16.

    Doublie, S., Bricogne, G., Gilmore, C., Carter, C.W. (1995) Tryptophanyl-tRNA synthetase crystal structure reveals an unexpected homology to tyrosyl-tRNA synthetase. Structure, 3, 17–31.

    Eddy, S.R. (1998) Profile hidden Markov models. Bioinformatics, 14, 755–763.

    Eggenberger, F., Redaschi, N., Doelz, R. (1996) Fast Alert—an automatic search system to alert about new entries in biological sequence databanks. CABIOS, 12, 129–133.

    Enright, A.J., Van Dongen, S., Ouzounis, C.A. (2002) An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res., 30, 1575–1584.

    Etzold, T., Ulyanov, A., Argos, P. (1996) SRS: information retrieval system for molecular biology data banks. Methods Enzymol., 266, 114–128.

    Holmes, I. (2000) Review of sequence homology search techniques on the World-Wide Web. HIV Sequence Compendium, Group T-10, Los Alamos National Laboratory.

    Ibba, M., Morgan, S., Curnow, A.W., Pridmore, D.R., Vothknecht, U.C., Gardner, W., Lin, W., Woese, C.R., Soll, D. (1997) A euryarchaeal lysyl-tRNA synthetase: resemblance to class I synthetases. Science, 278, 1119–1122.

    Lan, N., Montelione, G.T., Gerstein, M. (2003) Ontologies for proteomics: towards a systematic definition of structure and function that scales to the genome level. Curr. Opin. Chem. Biol., 7, 44–54.

    MacCallum, R., Kelley, L.A., Sternberg, M.J.E. (2000) SAWTED: Structure Assignment With Text Description-Enhanced detection of remote homologues with automated SWISS-PROT annotation comparisons. Bioinformatics, 16, 125–129.

    Marcotte, E.M., Xenarios, I., Eisenberg, D. (2001) Mining literature for protein–protein interactions. Bioinformatics, 17, 359–363.

    Peitsch, M. (1995) The Swiss-Shop BioComputing Server.

    Plewniak, F. (1996) DBWatcher.

    Plewniak, F., Thompson, J.D., Poch, O. (2000) Ballast: BLAST post-processing based on locally conserved segments. Bioinformatics, 16, 750–759[Abstract/Free Full Text].

    Thompson, J.D., Higgins, D.G., Gibson, T.J. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673–4680[Abstract/Free Full Text].

    Thompson, J.D., Plewniak, F., Thierry, J., Poch, O. (2000) DbClustal: rapid and reliable global multiple alignments of protein sequences detected by database searches. Nucleic Acids Res., 28, 2919–2926[Abstract/Free Full Text].

    Thompson, J.D., Prigent, V., Poch, O. (2004) LEON—multiple aLignment Evaluation Of Neighbours. Nucleic Acids Res., 32, 1298–1307[Abstract/Free Full Text].

    Thompson, J.D., Thierry, J.C., Poch, O. (2003) RASCAL: rapid scanning and correction of multiple sequence alignments. Bioinformatics, 19, 1155–1561[Abstract/Free Full Text].

    Wilbur, W.J. and Yang, Y. (1996) An analysis of statistical term strength and it use in the indexing and retrieval of molecular biology texts. Comput. Biol. Med., 3, 209–222.

    Wilson, C.A., Kreychman, J., Gerstein, M. (2000) Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J. Mol. Biol., 297, 233–249[CrossRef][Web of Science][Medline].

    Wolf, Y.I., Aravind, L., Grishin, N.V., Koonin, E.V. (1999) Evolution of aminoacyl-tRNA synthetases-analysis of unique domain architectures and phylogenetic trees reveals a complex history of horizontal gene transfer events. Genome Res., 8, 689–710.

