Bioinformatics Advance Access originally published online on January 12, 2005
Bioinformatics 2005 21(9):2095-2096; doi:10.1093/bioinformatics/bti252
GOAnno: GO annotation based on multiple alignment
1Laboratoire de Biologie et Génomique Structurales, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS/INSERM/ULP BP 163, 67404 Illkirch cedex, France
2Laboratoire de Physiopathologie Cellulaire et Moléculaire de la Rétine, Inserm U592, Université Pierre et Marie Curie 75571 Paris, France
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: GOAnno is a web tool that automatically annotates proteins according to the Gene Ontology (GO) using evolutionary information available in hierarchized multiple alignments. GO terms present in the aligned functional subfamily can be cross-validated and propagated to obtain highly reliable predicted GO annotation based on the GOAnno algorithm.
Availability: The web tool and a reduced version for local installation are freely available at http://igbmc.u-strasbg.fr/GOAnno/GOAnno.html
Contact: chalmel{at}igbmc.u-strasbg.fr
Supplementary information: The website supplies a detailed explanation and illustration of the algorithm at http://igbmc.u-strasbg.fr/GOAnno/GOAnnoHelp.html
| INTRODUCTION |
|---|
Recent efforts in high-throughput sequencing have rapidly increased the number of sequences available in public databases. Since GeneQuiz (Andrade et al., 1999), which annotated protein function automatically, systematic annotation of these data has typically been based on the Gene Ontology (GO) (Gene Ontology Consortium, 2000), a hierarchical and standardized vocabulary developed by the GO Consortium (www.geneontology.org). Several tools assign GO terms from sequence similarity by selecting the best BLAST (Altschul et al., 1997) hits (Hennig et al., 2003; Khan et al., 2003; Zehetner, 2003) or restrict prediction to a predefined subset of GO terms (Jensen et al., 2003).
GOAnno is a web tool for automated protein GO annotation. In contrast to the above methods, GOAnno takes advantage of the evolutionary information available in Multiple Alignments of Complete Sequences (MACS) (Lecompte et al., 2001) organized hierarchically into functional subfamilies. Members of a subfamily are sufficiently conserved for GO terms to be filtered, enriched and propagated among them using the GOAnno algorithm. A further distinctive feature is the absence of any predefined parameters such as GO level or subsets of GO terms. The tool takes a query protein sequence as input and returns detailed GO annotations in an interactive HTML file as output.
| PROGRAM OVERVIEW |
|---|
The GOAnno algorithm is explained and illustrated in detail in Supplementary information. GOAnno incorporates a five-step process:
- The functional subfamily of the query protein is determined using the strategy of PipeAlign (Plewniak et al., 2003), a toolkit for protein family analysis that takes a query sequence, searches the protein databases, and produces a hierarchized MACS of protein homologs clustered into potential functional subfamilies (http://igbmc.u-strasbg.fr/PipeAlign).
The next four steps are independently applied for each of the three GO categories: cellular component, molecular function and biological process. At the end of each step, the redundant and parent GO terms are systematically removed.
- An Initial Protein gene Ontology (IPO) is constructed for each member of the query subfamily from the GO annotation associated with the protein in the sequence databases, when available, and from GO terms derived through the conversion tables provided by the GO Consortium (InterPro, Pfam, Prints, PRODOM, Prosite, SMART protein motifs, Enzyme Commission numbers and Swiss-Prot keywords to GO terms).
- The construction of the MACS permits the identification of the Proximal Proteins (proteins sharing at least 98% identity with the input protein). All the IPO of these proximal proteins are concatenated to form the Proximal Protein gene Ontology (PPO).
- The quality of the query subfamily alignment is assessed using the objective scoring function norMD (Thompson et al., 2001). A norMD score above 0.3 indicates a high-quality alignment and allows the propagation of GO terms within the subfamily according to the following criteria. Briefly, the IPO of all proteins in the subfamily are collected to build the corresponding GO tree. For each IPO, all paths to the root are decomposed into linear branches. A score based on the number of proteins is then calculated for each node and each branch. Highly specialized nodes and branches associated with rare nodes are then eliminated based on two cut-off values, p and f, respectively. The GO terms that pass these selections define the Mean Subfamily gene Ontology (MSO).
- The previously determined IPO of the query, the PPO and the MSO are combined to define the Global Protein gene Ontology (GPO), which is assigned to the query.
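The steps above can be illustrated with a minimal Python sketch for a single GO category. It is not the GOAnno implementation: the GO fragment, the term identifiers, the protein names and the function names are all hypothetical, and the MSO step is replaced by a simplified stand-in that keeps terms shared by at least a fraction `min_fraction` of subfamily members. The actual node and branch scores and the cut-offs p and f are described in the Supplementary information and are not reproduced here.

```python
from typing import Dict, Iterable, Mapping, Sequence, Set

# Toy GO fragment with hypothetical identifiers: term -> direct parents.
PARENTS: Dict[str, Set[str]] = {
    "GO:root": set(),
    "GO:binding": {"GO:root"},
    "GO:nucleotide_binding": {"GO:binding"},
    "GO:ATP_binding": {"GO:nucleotide_binding"},
    "GO:catalytic": {"GO:root"},
    "GO:kinase": {"GO:catalytic"},
}


def ancestors(term: str) -> Set[str]:
    """Return every term on any path from `term` to the root (term excluded)."""
    seen: Set[str] = set()
    stack = list(PARENTS[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS[current])
    return seen


def remove_parents(terms: Iterable[str]) -> Set[str]:
    """Drop duplicates and any term that is an ancestor of another term."""
    unique = set(terms)
    implied = set().union(*(ancestors(t) for t in unique))
    return unique - implied


def build_ppo(ipos: Mapping[str, Set[str]], proximal: Sequence[str]) -> Set[str]:
    """Concatenate the IPO of the proximal proteins, then clean up."""
    return remove_parents(t for p in proximal for t in ipos.get(p, ()))


def build_mso(ipos: Mapping[str, Set[str]], members: Sequence[str],
              normd: float, min_fraction: float = 0.5,
              normd_min: float = 0.3) -> Set[str]:
    """Simplified stand-in for the MSO step (assumption, not the paper's score).

    Terms are propagated only if the alignment passes the norMD threshold;
    a term is kept if it, or one of its descendants, annotates at least
    `min_fraction` of the subfamily members.
    """
    if normd <= normd_min or not members:
        return set()
    counts: Dict[str, int] = {}
    for protein in members:
        covered: Set[str] = set()
        for term in ipos.get(protein, ()):
            covered |= {term} | ancestors(term)
        for term in covered:
            counts[term] = counts.get(term, 0) + 1
    frequent = {t for t, n in counts.items() if n / len(members) >= min_fraction}
    return remove_parents(frequent)


def build_gpo(query_ipo: Set[str], ppo: Set[str], mso: Set[str]) -> Set[str]:
    """Merge the query IPO, the PPO and the MSO into the final annotation."""
    return remove_parents(set(query_ipo) | ppo | mso)


# Hypothetical subfamily: the query plus four homologs, P1 being proximal.
ipos = {
    "query": {"GO:binding"},
    "P1": {"GO:ATP_binding"},
    "P2": {"GO:ATP_binding", "GO:kinase"},
    "P3": {"GO:nucleotide_binding", "GO:kinase"},
    "P4": {"GO:kinase"},
}
members = ["query", "P1", "P2", "P3", "P4"]

ppo = build_ppo(ipos, proximal=["P1"])
mso = build_mso(ipos, members, normd=0.45)
gpo = build_gpo(ipos["query"], ppo, mso)
print(sorted(gpo))  # → ['GO:ATP_binding', 'GO:kinase']
```

In this toy case the query starts with the general term `GO:binding`; the proximal protein contributes the more specific `GO:ATP_binding`, the subfamily contributes `GO:kinase`, and the parent terms are removed at each step.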
In a study of the mechanisms leading to retinal degeneration, GOAnno was applied to microarray experiments to analyze 1046 UniProt (Apweiler et al., 2004) proteins (Chalmel,F., Poch,O., Lavedan,C., Ripp,R., Wicker,N., Dolomeyer,A., Clérin,E., Mohand-Saïd,S., Lambrou,G., Sahel,J.-A. and Léveillard,T., in preparation). Of these 1046 proteins, 698 had an IPO, corresponding to a total of 2285 GO terms. Using the GOAnno algorithm, GPO were assigned to 191 additional proteins (27.4%), corresponding to 1520 newly associated GO terms (66.5%).
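The reported percentages can be checked with a few lines of arithmetic. The sketch below assumes that each gain is expressed relative to the initial annotation (the 698 proteins with an IPO and their 2285 GO terms). The text does not state these denominators, but they reproduce the published figures; the variable names are illustrative.

```python
# Counts taken from the text above.
proteins_with_ipo = 698
initial_go_terms = 2285
proteins_gained = 191
go_terms_gained = 1520

# Assumption: gains are relative to the initial annotation.
protein_gain_pct = 100 * proteins_gained / proteins_with_ipo
term_gain_pct = 100 * go_terms_gained / initial_go_terms

print(f"{protein_gain_pct:.1f}% more annotated proteins")
print(f"{term_gain_pct:.1f}% more GO terms")
```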
The interface of GOAnno is designed to accept a single protein sequence as input. The user has the opportunity to modify the GOAnno parameters (e.g. f and p). The program proposes as output a downloadable XML file and an interactive HTML page containing a detailed table describing the IPO, PPO, MSO and GPO steps, where each GO term is linked to the AmiGO entry (http://www.godatabase.org/cgi-bin/amigo/go.cgi) and each protein accession number to the corresponding UniProt entry.
A light version of GOAnno that omits the first step is also available for local use. In this case, the subfamily and the proximal proteins of the query entry must be determined beforehand. The program allows automatic batch processing of a gene list, which is of particular interest for interpreting high-throughput experiments such as microarray transcription profiling.
GOAnno provides an efficient way to assign a potential GO annotation to an uncharacterized sequence and to enrich an existing GO annotation. It can also be used for in-depth comparisons of functionality within a subfamily. GOAnno is designed to help biologists by automatically providing reliable protein functional information through an intuitive user interface that requires no previous experience in judging the quality of predicted GO annotation.
| Acknowledgments |
|---|
The authors thank G. Berthommier, L. Bianchetti, F. Plewniak, W. Raffelsberger and R. Ripp for stimulating discussions. This work was funded by the INSERM, the CNRS, the ULP de Strasbourg, the FNS (GENOPOLE), the SPINE (E.C. contract number QLG2-CT-2002-00988) and the RETNET (E.C. contract number MRTN-CT-2003-504003) projects.
Received on October 25, 2004; revised on December 22, 2004; accepted on December 23, 2004
| REFERENCES |
|---|
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Andrade, M.A., Brown, N.P., Leroy, C., Hoersch, S., de Daruvar, A., Reich, C., Franchini, A., Tamames, J., Valencia, A., Ouzounis, C., Sander, C. (1999) Automated genome sequence analysis and annotation. Bioinformatics, 15, 391-412.
Apweiler, R., Bairoch, A., Wu, C.H., Barker, W.C., Boeckmann, B., Ferro, S., Gasteiger, E., Huang, H., Lopez, R., Magrane, M., et al. (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res., 32, D115-D119.
The Gene Ontology Consortium. (2000) Gene Ontology: tool for the unification of biology. Nat. Genet., 25, 25-29.
Hennig, S., Groth, D., Lehrach, H. (2003) Automated Gene Ontology annotation for anonymous sequence data. Nucleic Acids Res., 31, 3712-3715.
Jensen, L.J., Gupta, R., Staerfeldt, H.H., Brunak, S. (2003) Prediction of human protein function according to Gene Ontology categories. Bioinformatics, 19, 635-642.
Khan, S., Situ, G., Decker, K., Schmidt, C.J. (2003) GoFigure: automated Gene Ontology annotation. Bioinformatics, 19, 2484-2485.
Lecompte, O., Thompson, J.D., Plewniak, F., Thierry, J., Poch, O. (2001) Multiple alignment of complete sequences (MACS) in the post-genomic era. Gene, 270, 17-30.
Plewniak, F., Bianchetti, L., Brelivet, Y., Carles, A., Chalmel, F., Lecompte, O., Mochel, T., Moulinier, L., Muller, A., Muller, J., et al. (2003) PipeAlign: A new toolkit for protein family analysis. Nucleic Acids Res., 31, 3829-3832.
Thompson, J.D., Plewniak, F., Ripp, R., Thierry, J.C., Poch, O. (2001) Towards a reliable objective function for multiple sequence alignments. J. Mol. Biol., 314, 937-951.
Zehetner, G. (2003) OntoBlast function: from sequence similarities directly to potential functional annotations by ontology terms. Nucleic Acids Res., 31, 3799-3803.