Bioinformatics Advance Access originally published online on May 8, 2006
Bioinformatics 2006 22(14):1786-1787; doi:10.1093/bioinformatics/btl179
HoSeqI: automated homologous sequence identification in gene family databases
Laboratoire de Biométrie et Biologie Évolutive, UMR CNRS 5558, Université Claude-Bernard Lyon 1, 43 boulevard du 11 Novembre 1918 69622 Villeurbanne Cedex, France
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: We present a web service that automatically assigns sequences to homologous gene families from a set of databases. After the gene family most similar to the query sequence has been identified, the sequence is added to the full alignment of that family and the phylogenetic tree of the family is rebuilt. The phylogenetic position of the query sequence within its gene family can thus be easily identified.
Availability: http://pbil.univ-lyon1.fr/software/HoSeqI/
Contact: arigon{at}biomserv.univ-lyon1.fr
Supplementary Information: Supplementary Data are available at Bioinformatics online.
In several contexts, such as (1) species or taxon identification from molecular markers of environmental organisms, (2) comparison of a new sequence against a database, or (3) update of homologous gene family sequence databases, a new sequence has to be classified into an existing collection. This classification identifies the family to which the sequence belongs and contributes to the assessment of its evolutionary relationships. Massive sequencing techniques are now routinely used, and the number of newly available sequences is growing quickly. Identification requires chaining different programs (for similarity search, alignment and tree computation) that are sometimes complex to handle. Moreover, some results have to be checked manually. Performing these tasks one after another makes sequence identification tedious and time-consuming. Automated bioinformatic tools are thus necessary to carry out these operations accurately and quickly. Some tools for sequence identification already exist. For instance, BIBI (Bioinformatic Bacterial Identification) (Devulder et al., 2003) was specifically developed for bacterial identification. The European ribosomal RNA database (Wuyts et al., 2004) compiles complete or nearly complete ribosomal RNA (rRNA) sequences and uses the BIBI algorithm to let users run quick rRNA phylogenetic analyses. The Ribosomal Database Project (Cole et al., 2005) offers a database of aligned and annotated rRNA gene sequences and provides analysis services such as the RDP classifier, which places sequences in the RDP hierarchy to give them an initial taxonomic placement.
We developed comprehensive sequence family databases (i.e. HOVERGEN and HOGENOM) (Duret et al., 1999) in which homologous protein gene sequences are clustered into families and aligned. These databases can be used for different purposes, including phylogenetic analyses. Adding a single sequence to a family from these databases can have many repercussions on the topology of the associated phylogenetic tree; these changes may be located near the introduced sequence, but they may also affect deep nodes. In such cases, the phylogenetic information carried by the whole family should be taken into account. Also, as HOVERGEN and HOGENOM contain large families with several thousand sequences, powerful algorithms are required in order to add a sequence to a large alignment quickly. Currently available sequence identification tools, such as those presented above, were developed to treat specific data such as rRNA sequences. The BIBI algorithm limits comparisons to the most similar sequences, and the RDP classifier is a naïve Bayesian classifier designed for rRNA. They therefore cannot be used effectively with large family databases such as HOVERGEN and HOGENOM.
We built a software environment called HoSeqI (Homologous Sequence Identification) that automatically identifies homologous sequences and classifies them into our sequence family databases. It integrates different programs for similarity search, multiple alignment and phylogenetic tree building, as well as specific tools we developed. This environment can be accessed through a web service implemented in HTML-PHP. It is divided into three parts. First, the identification procedure uses BLASTP (Altschul et al., 1997) to compare the query sequence with the entries of the family database chosen by the user. BLASTP outputs are parsed in order to identify the families to which the submitted sequence belongs. All distinct families with non-overlapping matches are selected, which makes it possible to process sequences that contain non-overlapping regions from distinct homologous gene families. If several families are identified, they are all proposed to the user, who can then choose which one to select. The interface provides links to the BLASTP output and to information about the proposed families in order to assist the user's choice.
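The family selection step described above can be sketched as follows. This is only an illustration under stated assumptions, not the HoSeqI implementation: the `Hit` record, the greedy rule (keep the best-scoring hit per family, then accept families in decreasing score order when their query region does not overlap an already accepted one) and the function name `candidate_families` are all hypothetical. The text does not specify how HoSeqI parses BLASTP output or resolves overlaps.

```python
from typing import Dict, List, NamedTuple


class Hit(NamedTuple):
    """One parsed BLASTP match (hypothetical record layout)."""
    family: str       # family identifier of the matched database entry
    q_start: int      # 1-based start of the match on the query
    q_end: int        # inclusive end of the match on the query
    bit_score: float  # BLASTP bit score of the match


def overlaps(a: Hit, b: Hit) -> bool:
    """True if the two matches cover a common position of the query."""
    return a.q_start <= b.q_end and b.q_start <= a.q_end


def candidate_families(hits: List[Hit]) -> List[Hit]:
    """Return one hit per proposed family, best-scoring first.

    Assumed rule: keep the best hit of each family, then accept families
    greedily as long as their query region is disjoint from every region
    accepted so far.
    """
    best: Dict[str, Hit] = {}
    for hit in hits:
        current = best.get(hit.family)
        if current is None or hit.bit_score > current.bit_score:
            best[hit.family] = hit

    selected: List[Hit] = []
    for hit in sorted(best.values(), key=lambda h: h.bit_score, reverse=True):
        if all(not overlaps(hit, kept) for kept in selected):
            selected.append(hit)
    return selected
```

With this rule, a query whose N-terminal and C-terminal regions match two different families yields two proposals, whereas a weaker family whose match overlaps a stronger one is dropped.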
Second, for each identified family, a set of multiple alignment programs is proposed to the user: CLUSTAL W (Thompson et al., 1994), MULTALIN (Corpet, 1988), MABIOS (Abdeddaïm, 1997), MENTALIGN (Dufayard, 2004) and MUSCLE (Edgar, 2004). MENTALIGN is an incremental algorithm that was developed specifically by our group in order to manage very large alignments and trees containing thousands of sequences. MUSCLE is offered in two specific modes, MUSCLE-prog and MUSCLE-fast, which align a large number of sequences much more quickly than other programs. All these programs also make it possible to add a sequence to a pre-existing alignment very quickly. The HOVERGEN and HOGENOM databases contain all multiple alignments and phylogenetic trees for families of <500 sequences, so the query sequence can easily be added to the alignments of these families. For other families (>500 sequences), the whole sequence alignment has to be computed. The proposed list of alignment programs varies according to the identified family. Indeed, problems may occur when some programs, such as CLUSTAL W and MABIOS, are used to compute a multiple alignment containing >500 sequences (e.g. execution too slow for a web application, memory allocation failures).
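The way the program list depends on family size can be sketched as below. This is a minimal illustration, not the actual HoSeqI logic: the text only names CLUSTAL W and MABIOS as examples of programs that are problematic above 500 sequences, so the exact set excluded for large families, the treatment of a family of exactly 500 sequences, and all names in the code are assumptions.

```python
from typing import List, Tuple

# Threshold taken from the text; behaviour at exactly 500 is an assumption.
LARGE_FAMILY_THRESHOLD = 500

ALL_ALIGNERS = ["CLUSTAL W", "MULTALIN", "MABIOS", "MENTALIGN", "MUSCLE"]

# Only the two programs named in the text are excluded here (assumption:
# the real list of programs withheld for large families may differ).
UNSUITABLE_FOR_LARGE = {"CLUSTAL W", "MABIOS"}


def alignment_options(family_size: int) -> Tuple[bool, List[str]]:
    """Return (stored_alignment_available, proposed_programs).

    Small families have a stored alignment to which the query can be
    added; large families must be realigned from scratch with a program
    that scales to thousands of sequences.
    """
    if family_size > LARGE_FAMILY_THRESHOLD:
        programs = [p for p in ALL_ALIGNERS if p not in UNSUITABLE_FOR_LARGE]
        return False, programs
    return True, list(ALL_ALIGNERS)
```

For example, `alignment_options(1132)` would propose only MULTALIN, MENTALIGN and MUSCLE under these assumptions, while `alignment_options(143)` would propose all five programs.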
Lastly, the resulting alignment is used to build the phylogenetic tree. The user can choose among the following tree building programs: QuickTree (Howe et al., 2002), FastME (Desper and Gascuel, 2002), BIONJ (Gascuel, 1997) and PhyML (Guindon and Gascuel, 2003). QuickTree is a fast implementation of the neighbor-joining (NJ) algorithm (Saitou and Nei, 1987) that allows trees to be rebuilt rapidly for large sequence families. FastME is based on the minimum evolution method. BIONJ is an improved version of the NJ algorithm. PhyML is able to compute large phylogenies by maximum likelihood. When the tree building program requires a distance matrix as input, we compute it with PROTDIST (Felsenstein, 1989) using Kimura's formula (Kimura, 1983). For each program, the user can apply the bootstrap option. The tree is then automatically rooted at its mid-point.
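Mid-point rooting, mentioned at the end of the paragraph above, places the root halfway along the longest leaf-to-leaf path of the tree. The following is a minimal sketch of that idea, assuming an unrooted tree stored as an adjacency dictionary with non-negative branch lengths; the data structure and function names are hypothetical and are not taken from HoSeqI.

```python
from typing import Dict, Tuple

# Unrooted tree: node -> {neighbour: branch length} (hypothetical layout).
Tree = Dict[str, Dict[str, float]]


def farthest(tree: Tree, start: str) -> Tuple[str, float, Dict[str, str]]:
    """Return the node farthest from start, its distance, and parent links."""
    dist = {start: 0.0}
    parent: Dict[str, str] = {}
    stack = [start]
    while stack:
        node = stack.pop()
        for neighbour, length in tree[node].items():
            if neighbour not in dist:
                dist[neighbour] = dist[node] + length
                parent[neighbour] = node
                stack.append(neighbour)
    far = max(dist, key=lambda n: dist[n])
    return far, dist[far], parent


def midpoint(tree: Tree) -> Tuple[str, str, float]:
    """Locate the mid-point of the longest path in the tree.

    Returns (u, v, d): the root lies on branch u-v, at distance d from u.
    """
    # Two traversals find the two most distant nodes of a tree.
    end_a, _, _ = farthest(tree, next(iter(tree)))
    end_b, diameter, parent = farthest(tree, end_a)

    # Walk from end_b back toward end_a until half the path is covered.
    remaining = diameter / 2.0
    node = end_b
    while True:
        above = parent[node]
        length = tree[node][above]
        if remaining <= length:
            return node, above, remaining
        remaining -= length
        node = above
```

As a usage example, for a star tree with leaves A, B and C attached to an internal node X by branches of length 1, 3 and 2, the longest path runs from B to C (length 5), so the root falls on branch X-B, 0.5 away from X.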
For all programs used in HoSeqI (BLASTP, multiple alignment programs and phylogenetic tree building programs), the interface allows the user to choose non-default parameter values. All results are presented through web pages and can be downloaded. Resulting alignments and phylogenetic trees can also be displayed by two Java applets: Jalview (http://www2.ebi.ac.uk/~michele/jalview/) and ATV (Zmasek and Eddy, 2001). Some selected options can result in time-consuming alignment and phylogenetic tree building (e.g. if the user chooses PhyML with the bootstrap option). In these cases, computations are performed offline and the user receives an e-mail with links to the various results, which are kept on the server for one month.
HoSeqI automates the identification process on large family databases and contributes to the study of the evolutionary background of new sequences. It offers a user-friendly interface that allows a user to easily identify a query sequence and to visualize the resulting alignment and tree. The user can thus locate the sequence in the tree of its gene family and study the evolution of this new sequence. Computation times range between 30 s (for 143 sequences in the associated family) and 2 min 30 s (for 1132 sequences in the associated family).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Joaquin Dopazo
Received on February 17, 2006; revised on April 19, 2006; accepted on May 3, 2006
| REFERENCES |
|---|
Abdeddaïm, S. (1997) Fast and sound two-step algorithms for multiple alignment of nucleic sequences. Int. J. Artif. Intell. Tools, 6, 179-192.
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Cole, J.R., et al. (2005) The Ribosomal Database Project (RDP-II): sequences and tools for high-throughput rRNA analysis. Nucleic Acids Res., 33, D294-D296.
Corpet, F. (1988) Multiple sequence alignment with hierarchical clustering. Nucleic Acids Res., 16, 10881-10890.
Desper, R. and Gascuel, O. (2002) Fast and accurate phylogeny reconstruction algorithms based on the minimum-evolution principle. J. Comput. Biol., 19, 687-705.
Devulder, G., et al. (2003) BIBI, a bioinformatic bacterial identification tool. J. Clin. Microbiol., 41, 1785-1787.
Dufayard, J.F. (2004) Incremental algorithms for the alignment and the phylogeny of large homologous sequence families. Ph.D. Thesis, Joseph Fourier University, Grenoble, France.
Duret, L., Perrière, G., Gouy, M. (1999) HOVERGEN: database and software for comparative analysis of homologous vertebrate genes. In Letovsky, S. (ed.), Bioinformatics Databases and Systems. Kluwer Academic Publishers, Boston, MA, pp. 13-29.
Edgar, R.C. (2004) MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics, 5, 113.
Gascuel, O. (1997) BIONJ: an improved version of the NJ algorithm based on a simple model of sequence data. Mol. Biol. Evol., 14, 685-695.
Guindon, S. and Gascuel, O. (2003) A simple, fast, and accurate algorithm to estimate large phylogenies by maximum likelihood. Syst. Biol., 52, 696-704.
Howe, K., et al. (2002) QuickTree: building huge Neighbour-Joining trees of protein sequences. Bioinformatics, 18, 1546-1547.
Felsenstein, J. (1989) PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics, 5, 164-166.
Kimura, M. (1983) The Neutral Theory of Molecular Evolution. Cambridge University Press, Cambridge.
Saitou, N. and Nei, M. (1987) The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol., 4, 406-425.
Thompson, J.D., et al. (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position specific gap penalties and weight matrix choice. Nucleic Acids Res., 22, 4673-4680.
Wuyts, J., et al. (2004) The European ribosomal RNA database. Nucleic Acids Res., 32, D101-D103.
Zmasek, C.M. and Eddy, S.R. (2001) ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics, 17, 383-384.