Bioinformatics Advance Access originally published online on April 4, 2008
Bioinformatics 2008 24(10):1321-1322; doi:10.1093/bioinformatics/btn122
UniProtJAPI: a remote API for accessing UniProt data
The European Bioinformatics Institute, Hinxton, Cambridge CB10 1SD, UK
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Programmatic access to the UniProt Knowledgebase (UniProtKB) is essential for many bioinformatics applications dealing with protein data. We have created a Java library named UniProtJAPI, which facilitates the integration of UniProt data into Java-based software applications. The library supports queries and similarity searches that return UniProtKB entries in the form of Java objects. These objects contain functional annotations or sequence information associated with a UniProt entry. Here, we briefly describe the UniProtJAPI and demonstrate its usage.
Availability: http://www.ebi.ac.uk/uniprot/remotingAPI
Contact: spatient{at}ebi.ac.uk
| 1 INTRODUCTION |
|---|
The Universal Protein Resource (UniProt; The UniProt Consortium, 2008) provides a comprehensive and freely accessible central resource of protein sequences and functional annotation. UniProt data can be browsed online through http://www.uniprot.org. In addition, a number of services exist (Labarga et al., 2007) to retrieve the data in various formats including XML, RDF, FASTA and flat file. To process the information contained in these formats, a parser needs to be written that transforms the input into suitable data structures. Sometimes this can be done without further knowledge of the UniProt entry structure; the protein sequence, for example, can simply be parsed from the corresponding sequence lines. Difficulties arise if splice variants of a protein sequence are required, since both the comment and feature sections of the entry need to be parsed at the same time and a deeper understanding of the UniProt structures is needed. Furthermore, data format changes within UniProt can lead to significant maintenance overhead.
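To illustrate the "simple" case mentioned above, the following is a minimal sketch of a hand-written parser that pulls the sequence out of a UniProtKB flat-file entry. It relies only on the flat-file convention that residues follow the `SQ` header line and end at the `//` terminator. The class name, method name and toy entry are made up for illustration and are not part of any UniProt library.

```java
import java.util.Arrays;
import java.util.List;

/** Illustrative only: extracts the sequence from a UniProtKB flat-file entry. */
final class FlatFileSequenceParser {

    /**
     * Returns the residues found between the "SQ" header line and the
     * terminating "//" line, with the spacing between residue blocks removed.
     */
    static String parseSequence(List<String> lines) {
        StringBuilder sequence = new StringBuilder();
        boolean inSequence = false;
        for (String line : lines) {
            if (line.startsWith("//")) {
                break;                        // end of the entry
            }
            if (inSequence) {
                // Sequence lines hold blocks of residues separated by spaces.
                sequence.append(line.replaceAll("\\s+", ""));
            } else if (line.startsWith("SQ")) {
                inSequence = true;            // residues start on the next line
            }
        }
        return sequence.toString();
    }

    public static void main(String[] args) {
        // Made-up toy entry, not a real UniProtKB record.
        List<String> entry = Arrays.asList(
            "ID   TOY_EXAMPLE             Reviewed;          15 AA.",
            "AC   X00000;",
            "SQ   SEQUENCE   15 AA;  0 MW;  0000000000000000 CRC64;",
            "     MSGRGKGGKG LGKGG",
            "//");
        System.out.println(parseSequence(entry));
    }
}
```

A parser of this kind stops being adequate as soon as splice variants are needed, because the comment and feature sections must then be interpreted together; that is the maintenance burden the paragraph above describes.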
Therefore, a Java application programming interface (UniProtJAPI) has been developed to provide remote access for Java applications processing UniProt and related data, granting users access to all four major components in UniProt. These are the UniProt Knowledgebase (UniProtKB), the UniProt Reference Clusters (UniRef), the UniProt Archive (UniParc) and the UniProt Metagenomic and Environmental Sequences (UniMES) database. Additionally, the UniProtJAPI has been extended to take into account information referenced in UniProtKB entries, for instance InterPro (Mulder et al., 2007). InterPro is a resource of protein families, domains and sites, containing a number of member databases that derive protein signatures. For each UniProtKB entry, the UniProtJAPI provides InterPro-derived information such as the scores and the start and end positions of the signatures. The UniProtJAPI also supports text and sequence similarity searches across all of these data, allowing users to access a single database entry with a given accession number, or whole sets of entries matching defined criteria.
| 2 CENTRAL DATA STRUCTURES |
|---|
The UniProtJAPI represents protein data from the UniProtKB as Java objects, which enables programmatic retrieval of functional annotation and sequence information. The object structures resemble the flat file and XML format structures to facilitate access for those users already familiar with the UniProtKB formats.
The five central data structures of the UniProtJAPI are: UniProtEntry, UniParcEntry, UniRefEntry, ProteinData and UniMesEntry. A UniProtEntry object represents a UniProtKB entry and provides methods to access all of its information. For example, getDescription().getProteinName() returns the protein name associated with this entry. A UniParcEntry object models a UniParc non-redundant sequence, while a UniRefEntry object represents a UniRef sequence cluster at 100%, 90% or 50% sequence identity. The relationships between these objects are represented in a ProteinData object, whose getUniProtEntry() method returns the associated UniProtKB entry. The UniParc entry and the three levels of UniRef sequence clusters relating to the UniProtKB entry are accessible using the getUniParcEntry() and getUniRefEntry() methods. Additional information is available from the getInterProMatches() method, which returns a list of InterProMatch objects. These represent the sequence patterns matched in the UniProtKB entry. Lastly, a UniMesEntry object provides access to a UniMES sequence and its related sequence patterns. A graphical representation of the UniProtJAPI object model along with the library documentation is available online.
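The containment relationships described above can be pictured with a small, self-contained sketch. The classes below are simplified stand-ins, not the UniProtJAPI classes: only the getter names quoted in the text (getUniProtEntry(), getDescription(), getProteinName(), getInterProMatches()) are reused, while every constructor, field, the getAccession() getter and the sample values are assumptions made for illustration.

```java
import java.util.Collections;
import java.util.List;

/** Simplified stand-ins for the object model; NOT the real library classes. */
final class ObjectModelSketch {

    static final class Description {
        private final String proteinName;
        Description(String proteinName) { this.proteinName = proteinName; }
        String getProteinName() { return proteinName; }
    }

    static final class UniProtEntry {
        private final String accession;
        private final Description description;
        UniProtEntry(String accession, Description description) {
            this.accession = accession;
            this.description = description;
        }
        String getAccession() { return accession; }          // hypothetical getter
        Description getDescription() { return description; }
    }

    /** One signature match with its position on the sequence (assumed fields). */
    static final class InterProMatch {
        final String signatureId;
        final int start;
        final int end;
        InterProMatch(String signatureId, int start, int end) {
            this.signatureId = signatureId;
            this.start = start;
            this.end = end;
        }
    }

    /** Aggregates an entry with its related data, as ProteinData does in the text. */
    static final class ProteinData {
        private final UniProtEntry uniProtEntry;
        private final List<InterProMatch> interProMatches;
        ProteinData(UniProtEntry uniProtEntry, List<InterProMatch> interProMatches) {
            this.uniProtEntry = uniProtEntry;
            this.interProMatches = Collections.unmodifiableList(interProMatches);
        }
        UniProtEntry getUniProtEntry() { return uniProtEntry; }
        List<InterProMatch> getInterProMatches() { return interProMatches; }
    }

    /** Navigates from the aggregate to the protein name and the match count. */
    static String summarize(ProteinData data) {
        UniProtEntry entry = data.getUniProtEntry();
        return entry.getAccession() + ": " + entry.getDescription().getProteinName()
            + ", signature matches: " + data.getInterProMatches().size();
    }

    public static void main(String[] args) {
        // All values below are placeholders, not real UniProtKB data.
        ProteinData data = new ProteinData(
            new UniProtEntry("X00000", new Description("Example protein")),
            Collections.singletonList(new InterProMatch("SIG0001", 10, 80)));
        System.out.println(summarize(data));
    }
}
```

The point of the sketch is the navigation path: client code starts from the aggregate object and reaches annotation through typed getters, rather than parsing text fields.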
| 3 LICENSE AND GETTING STARTED |
|---|
UniProtJAPI is based on open-source technologies and the software is released under the Apache License, version 2. UniProt data is distributed under the Creative Commons Attribution-NoDerivs license. To use the UniProtJAPI, a zip file containing all the Java classes has to be downloaded. The library requires a Java 5 runtime environment or above and an HTTP connection.
| 4 EXAMPLES |
|---|
The UniProtJAPI may be used in a broad range of applications. It is used in the generation of IntAct (Kerrien et al., 2007) to retrieve a list of all proteins that have been updated between two dates. The IntEnz (Fleischmann et al., 2004) project uses it to retrieve all database cross-references for a specific EC number. Additional usages could include a user wanting to investigate characterized proteins that are similar to an unknown protein. In order to assist the user with this task, the similarity search tool Blast is integrated in the API.
The use of Blast is demonstrated in Figure 1, where an uncharacterized sequence from the UniMES database is submitted to retrieve a set of similar ProteinData objects (line 13). Common Blast options are available, including the choice of similarity matrix, restriction of the search to a specific organism and the E-value threshold. For brevity, default options are used in the example. The Blast service returns a collection of ProteinData objects and alignment information (line 22). Each ProteinData object contains a UniProtEntry object, which can be used to extract functional annotation, database cross-references or other entry information such as the sequence or the organism name (lines 26–36). The best hit returned by the Blast search corresponds to the UniProtKB/Swiss-Prot entry P84239 (line 26). For further analysis of this object, the database cross-references are accessed, showing hits to the EMBL, SMR, InterPro, Gene3D, Pfam, PRINTS, SMART and PROSITE databases (lines 39–45). In addition, the supplementary InterProMatch objects are accessed to provide scores and positions of the sequence patterns from the Gene3D, Pfam, PRINTS, SMART and PROSITE databases (lines 47–53). A closer look at the InterPro cross-references shows that the best hit belongs to three InterPro families, all of which describe histone proteins. It would be interesting to compare the proteins belonging to these families against the full Blast result, not only against the best hit. To do this, further queries against InterPro are made, and the result sets are combined using binary operations.
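As a rough illustration of how an E-value threshold and the notion of a "best hit" interact, the sketch below filters and ranks a list of hits. It does not call the Blast service or any UniProtJAPI class: the Hit type, its fields, the method name and the sample values are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Illustrative only: ranks similarity-search hits by E-value. */
final class BlastHitSketch {

    /** Hypothetical hit record: an accession plus the E-value of its alignment. */
    static final class Hit {
        final String accession;
        final double eValue;
        Hit(String accession, double eValue) {
            this.accession = accession;
            this.eValue = eValue;
        }
    }

    /** Keeps hits with E-value <= maxEValue, sorted so the best hit comes first. */
    static List<Hit> filterAndRank(List<Hit> hits, double maxEValue) {
        List<Hit> kept = new ArrayList<>();
        for (Hit hit : hits) {
            if (hit.eValue <= maxEValue) {
                kept.add(hit);
            }
        }
        // A lower E-value means a more significant alignment.
        kept.sort(Comparator.comparingDouble((Hit h) -> h.eValue));
        return kept;
    }

    public static void main(String[] args) {
        // Placeholder accessions and E-values, not real search results.
        List<Hit> hits = Arrays.asList(
            new Hit("ACC_A", 1e-5),
            new Hit("ACC_B", 1e-30),
            new Hit("ACC_C", 5.0));
        for (Hit hit : filterAndRank(hits, 1e-3)) {
            System.out.println(hit.accession + "\t" + hit.eValue);
        }
    }
}
```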
The UniProtJAPI supports binary operations that allow the combination or intersection of sets of Java objects (line 64). Intersecting the sets retrieved by the InterPro and Blast queries yields a further set of ProteinData objects (lines 59–67). Each object in the final result set belongs to the InterPro groups IPR009072, IPR007125 and IPR000164, and the set can then be used for further analysis (line 66).
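The effect of such set operations can be sketched with the standard Java collections. This is not the UniProtJAPI mechanism, whose classes are not reproduced here; it only shows the underlying idea on sets of accession strings. The class name, method names and accessions are placeholders.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

/** Illustrative only: union and intersection of query result sets. */
final class ResultSetOperations {

    /** Returns the elements present in both sets, in the order of the first set. */
    static <T> Set<T> intersection(Set<T> first, Set<T> second) {
        Set<T> result = new LinkedHashSet<>(first);
        result.retainAll(second);
        return result;
    }

    /** Returns the elements present in either set, without duplicates. */
    static <T> Set<T> union(Set<T> first, Set<T> second) {
        Set<T> result = new LinkedHashSet<>(first);
        result.addAll(second);
        return result;
    }

    public static void main(String[] args) {
        // Placeholder accessions standing in for the two query results.
        Set<String> blastHits = new LinkedHashSet<>(Arrays.asList("ACC1", "ACC2", "ACC3"));
        Set<String> familyMembers = new LinkedHashSet<>(Arrays.asList("ACC2", "ACC3", "ACC4"));

        System.out.println(intersection(blastHits, familyMembers)); // [ACC2, ACC3]
        System.out.println(union(blastHits, familyMembers));        // [ACC1, ACC2, ACC3, ACC4]
    }
}
```

Restricting a Blast result to proteins that belong to several InterPro groups at once amounts to applying the intersection repeatedly, once per group.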
| ACKNOWLEDGEMENTS |
|---|
We thank all Consortium members, in particular, N. Sklyar and E. Salazar.
Funding: UniProt is supported by the NIH grant 2 U01 HG02712-04, the EC FELICS grant (021902RII3), the NIH grant 1R01HGO2273-01, by the Swiss Federal Government through the Federal Office of Education and Science and by the NIH grants and contracts HHSN266200400061C, NCIcaBIG, and 1R01GM080646-01, and the NSF grant IIS-0430743.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Martin Bishop
Received on December 6, 2007; revised on April 1, 2008; accepted on April 3, 2008
| REFERENCES |
|---|
Fleischmann A, et al. IntEnz, the integrated relational enzyme database (Database issue). Nucleic Acids Res (2004) 32:D434–D437.
Kerrien S, et al. IntAct–open source resource for molecular interaction data (Database issue). Nucleic Acids Res (2007) 35:D561–D565.
Labarga A, et al. Web services at the European Bioinformatics Institute (Web Server issue). Nucleic Acids Res (2007) 35:W6–W11.
Mulder N, et al. New developments in the InterPro database. Nucleic Acids Res (2007) 35:D224–D228.
The UniProt Consortium. The Universal Protein Resource (UniProt). Nucleic Acids Res (2008) 36:D190–D195.