Bioinformatics Vol. 17 no. 10 2001
Pages 920-926
© 2001 Oxford University Press

Automatic rule generation for protein annotation with the C4.5 data mining algorithm applied on SWISS-PROT

Ernst Kretschmann, Wolfgang Fleischmann and Rolf Apweiler

The EMBL Outstation, The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received on April 20, 2001; revised on July 8, 2001; accepted on July 8, 2001

Motivation: The gap between the amount of newly submitted protein data and reliable functional annotation in public databases is growing. Traditional manual annotation, based on literature curation and sequence analysis tools and unaided by automated annotation systems, cannot keep up with the ever-increasing quantity of submitted data. Automated supplements to manually curated databases, such as TrEMBL or GenPept, cover the raw data but provide only limited annotation. To improve this situation, automatic tools are needed that support manual annotation, automatically increase the amount of reliable information, and help to detect inconsistencies in manually generated annotations.

Results: A standard data mining algorithm was successfully applied to gain knowledge about the keyword annotation in SWISS-PROT. In total, 11 306 rules were generated; they are provided in a database, can be viewed using a web browser, and can be applied to as yet unannotated protein sequences. The rules rely on the taxonomy of the organism in which the protein was found and on signature matches of its sequence. Statistical evaluation of the generated rules by cross-validation suggests that applying them to arbitrary proteins can generate 33% of their keyword annotation with an error rate of 1.5%. Coverage of the keyword annotation can be increased to 60% by tolerating a higher error rate of 5%.
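
As an illustration of how rules of this kind might be applied, the following minimal Python sketch matches hypothetical rules against a protein's taxonomic lineage and signature matches, then measures coverage and error against curated keywords. It does not reproduce the C4.5 rule induction or the authors' software. The `Rule` and `Protein` structures, the per-rule `confidence` value, the signature identifiers, and the definitions of coverage and error rate used here are all assumptions made for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    """Hypothetical rule: taxon + required signatures -> keyword."""
    taxon: Optional[str]        # required taxonomic group, None means any
    signatures: frozenset       # signature IDs that must all be matched
    keyword: str                # keyword the rule assigns
    confidence: float           # assumed per-rule reliability estimate


@dataclass(frozen=True)
class Protein:
    lineage: frozenset          # taxonomic groups of the source organism
    signatures: frozenset       # signature IDs matched by the sequence


def predict_keywords(protein, rules, min_confidence):
    """Return keywords from every rule that fires above the threshold."""
    return {
        r.keyword
        for r in rules
        if r.confidence >= min_confidence
        and (r.taxon is None or r.taxon in protein.lineage)
        and r.signatures <= protein.signatures
    }


def coverage_and_error(predicted, curated):
    """Assumed definitions: coverage is the share of curated keywords
    recovered; error is the share of predictions not in the curated set."""
    coverage = len(predicted & curated) / len(curated) if curated else 0.0
    error = len(predicted - curated) / len(predicted) if predicted else 0.0
    return coverage, error


# Hypothetical rules and protein; the identifiers are placeholders.
rules = [
    Rule("Eukaryota", frozenset({"SIG_A"}), "Transferase", 0.98),
    Rule(None, frozenset({"SIG_B"}), "Zinc-finger", 0.90),
]
protein = Protein(
    lineage=frozenset({"Eukaryota", "Metazoa"}),
    signatures=frozenset({"SIG_A", "SIG_B"}),
)
curated = {"Transferase", "Zinc-finger", "ATP-binding"}

strict = predict_keywords(protein, rules, min_confidence=0.95)
lenient = predict_keywords(protein, rules, min_confidence=0.85)
strict_scores = coverage_and_error(strict, curated)
lenient_scores = coverage_and_error(lenient, curated)
```

Lowering the assumed confidence threshold lets more rules fire, which is one plausible way a trade-off between coverage and error rate like the one reported above could arise.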

Availability: The results of the automatic data mining process can be browsed at http://golgi.ebi.ac.uk:8080/Spearmint/. Source code is available upon request.

Contact: kretsch@ebi.ac.uk

