Bioinformatics Advance Access originally published online on May 30, 2007
Bioinformatics 2007 23(15):2018-2020; doi:10.1093/bioinformatics/btm269
StrBioLib: a Java library for development of custom computational structural biology applications
Physical Biosciences Division, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA, and Department of Plant and Microbial Biology, University of California, Berkeley, CA 94720, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: StrBioLib is a library of Java classes useful for developing software for computational structural biology research. StrBioLib contains classes to represent and manipulate protein structures, biopolymer sequences, sets of biopolymer sequences, and alignments between biopolymers based on either sequence or structure. Interfaces are provided to interact with commonly used bioinformatics applications, including (PSI)-BLAST, MODELLER, MUSCLE and Primer3, and tools are provided to read and write many file formats used to represent bioinformatic data. The library includes a general-purpose neural network object with multiple training algorithms, the Hooke and Jeeves non-linear optimization algorithm, and tools for efficient C-style string parsing and formatting. StrBioLib is the basis for the Pred2ary secondary structure prediction program, is used to build the ASTRAL compendium for sequence and structure analysis, and has been extensively tested through use in many smaller projects. Examples and documentation are available at the site below.
Availability: StrBioLib may be obtained under the terms of the GNU LGPL from http://strbio.sourceforge.net/
Contact: JMChandonia@lbl.gov
Computational structural biology research often requires time-consuming development of custom software to analyze data. Development of such software is facilitated by publicly available libraries that read and write the multitude of file formats in which bioinformatic data are stored, implement commonly used algorithms, and otherwise efficiently perform common tasks (Mangalam, 2002). Object-oriented languages such as Java or C++ are particularly well suited for such libraries, as judicious choice of an object representation allows methods to be described and implemented in high-level terms, thus facilitating rapid development and testing of alternative algorithms. In addition, object-oriented programming languages facilitate efficient reuse of code through extension and inheritance of existing classes. StrBioLib is a library of Java classes that represent objects, concepts and tools useful for the development of algorithms for computational structural biology research. StrBioLib is complementary to existing libraries such as BioJava (Pocock et al., 2000) that focus on tools for analysis of biological sequences. StrBioLib is mature, having been used by several research groups over more than 10 years; some classes predate the Java programming language and were ported from earlier C and C++ versions. A new public release of the library, version 1.1, was made available through SourceForge in January 2007. Details and applications of StrBioLib are given below.
| 1 MOLECULAR BIOLOGY CLASSES |
|---|
The core of StrBioLib is the org.strbio.mol package, which contains classes that represent objects from the field of structural molecular biology: Atom, Molecule (composed of Atoms), Monomer (also containing Atoms), Polymer (an ordered set of Monomers and associated metadata), Residue (a type of Monomer representing an amino acid residue), Nucleotide (another subclass of Monomer) and Protein (a type of Polymer composed of Residues). The mol package also contains objects that represent groups of Polymers (PolymerSet) and Proteins (ProteinSet), including specialized groups such as a Profile (representing a set of sequences aligned to a Protein). In addition to providing methods to efficiently manipulate the objects in memory, each class also contains methods to read the objects from, and write them to, a variety of file formats, including the widely used FASTA, PDB, MSF, DSSP, HSSP and BLAST formats. The mol package also contains an Alignment object, an efficient representation of a sequence-based or structure-based alignment between two Polymers.
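The containment and inheritance relationships described above can be illustrated with a small Java sketch. The class names follow the text, but every field, constructor and method shown here is an assumption made for illustration and should not be read as the actual StrBioLib API; Molecule, the set classes and all file I/O are omitted.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: class names follow the text, but every field,
// constructor and method below is an assumption, not StrBioLib's API.
class Atom {
    final String name;
    final double x, y, z;

    Atom(String name, double x, double y, double z) {
        this.name = name;
        this.x = x;
        this.y = y;
        this.z = z;
    }
}

/** A single unit of a polymer, holding its own atoms. */
class Monomer {
    final char code; // one-letter code
    final List<Atom> atoms = new ArrayList<>();

    Monomer(char code) {
        this.code = code;
    }
}

/** A Monomer representing an amino acid residue. */
class Residue extends Monomer {
    Residue(char code) {
        super(code);
    }
}

/** A Monomer representing a nucleotide. */
class Nucleotide extends Monomer {
    Nucleotide(char code) {
        super(code);
    }
}

/** An ordered set of monomers plus metadata (here, only a name). */
class Polymer<M extends Monomer> {
    final String name;
    final List<M> monomers = new ArrayList<>();

    Polymer(String name) {
        this.name = name;
    }

    /** Concatenate the one-letter codes in order. */
    String sequence() {
        StringBuilder sb = new StringBuilder();
        for (M m : monomers) {
            sb.append(m.code);
        }
        return sb.toString();
    }
}

/** A Polymer restricted to amino acid residues. */
class Protein extends Polymer<Residue> {
    Protein(String name, String sequence) {
        super(name);
        for (char c : sequence.toCharArray()) {
            monomers.add(new Residue(c));
        }
    }
}
```

The type parameter on Polymer is a design choice of this sketch: it lets the compiler enforce that a Protein contains only Residues. The library itself, parts of which predate Java generics, may well express this constraint differently.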
The org.strbio.mol package is supported by classes in the org.strbio.mol.lib package, which contain objects that represent and implement more abstract concepts in structural biology, such as algorithms (e.g. secondary structure prediction, threading, sequence searching and sequence alignment), scoring matrices, and parameter sets. Tweaking an algorithm is often a simple matter of extending a class to change functionality; this greatly simplified development of the MakeRAF software, described below.
| 2 INTERFACES TO BIOINFORMATIC TOOLS AND DATABASES |
|---|
StrBioLib also contains classes to interact and exchange data with many commonly used bioinformatic tools and databases. Tools that must be installed locally have their corresponding classes in the org.strbio.local package, and tools that must be accessed over the Internet correspond to classes in the org.strbio.net package. A partial listing of tools and databases that StrBioLib can manipulate or interact with is given in Table 1.
| 3 GENERAL PURPOSE TOOLS |
|---|
StrBioLib also contains packages of tools that are useful in a wide range of applications beyond the field of structural biology. While some of the objects are now provided in current releases of the JDK, StrBioLib contains implementations that are also compatible with earlier versions of Java. The org.strbio.util package contains a neural network object that implements both traditional Steepest Descent and Scaled Conjugate Gradient (Møller, 1993) training algorithms. It also contains an algorithm for non-linear optimization using the direct search method of Hooke and Jeeves (1961), and a doubly linked list that allows efficient random access to any element. The org.strbio.io package contains an extensive library of string functions for implementing C-style formatted I/O without the overhead of creating and destroying objects; these methods are essential for supporting the multitude of file formats used by bioinformatic programs with speed comparable to that of C code. The org.strbio.math package contains classes to support matrix algebra as well as statistical objects that calculate quantities such as the Pearson and Matthews correlation coefficients (Matthews, 1975). The org.strbio.util.ui package contains classes useful for developing graphical user interfaces, and the org.strbio.util.graph package contains classes useful for graphing data.
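For readers unfamiliar with direct search, the following self-contained Java sketch shows a simplified variant of the Hooke and Jeeves method: exploratory moves along each coordinate, followed by pattern moves that extrapolate along the direction of improvement, with the step size halved whenever exploration fails. It is not the StrBioLib implementation; the class name, method signature and halving factor are assumptions chosen for illustration.

```java
import java.util.function.ToDoubleFunction;

/** Simplified Hooke-Jeeves pattern search; names and defaults are illustrative assumptions. */
final class HookeJeevesSketch {

    /** Probe +/- step along each coordinate of x, keeping any move that lowers f. */
    private static double explore(ToDoubleFunction<double[]> f, double[] x, double fx, double step) {
        for (int i = 0; i < x.length; i++) {
            double original = x[i];
            x[i] = original + step;
            double trial = f.applyAsDouble(x);
            if (trial < fx) {
                fx = trial;
                continue;
            }
            x[i] = original - step;
            trial = f.applyAsDouble(x);
            if (trial < fx) {
                fx = trial;
                continue;
            }
            x[i] = original; // neither direction helped
        }
        return fx;
    }

    /** Minimize f starting from start; stops when step <= tolerance or the iteration budget is spent. */
    static double[] minimize(ToDoubleFunction<double[]> f, double[] start,
                             double step, double tolerance, int maxIterations) {
        double[] base = start.clone();
        double fBase = f.applyAsDouble(base);
        int iter = 0;
        while (step > tolerance && iter < maxIterations) {
            iter++;
            double[] trial = base.clone();
            double fTrial = explore(f, trial, fBase, step);
            if (fTrial >= fBase) {
                step *= 0.5; // exploratory move failed: refine the step size
                continue;
            }
            // Pattern moves: keep extrapolating along (new base - old base) while it pays off.
            while (fTrial < fBase) {
                double[] previous = base;
                base = trial;
                fBase = fTrial;
                if (++iter >= maxIterations) {
                    return base;
                }
                trial = new double[base.length];
                for (int i = 0; i < base.length; i++) {
                    trial[i] = 2 * base[i] - previous[i];
                }
                fTrial = explore(f, trial, f.applyAsDouble(trial), step);
            }
        }
        return base;
    }
}
```

As a usage example, `HookeJeevesSketch.minimize(p -> (p[0] - 1) * (p[0] - 1) + (p[1] + 2) * (p[1] + 2), new double[] {0, 0}, 1.0, 1e-6, 10000)` searches for the minimum of a simple quadratic bowl located at (1, -2). Because the method needs only function values and no gradients, it suits objective functions that are noisy or not differentiable.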
| 4 APPLICATIONS |
|---|
StrBioLib has been used to develop a number of published applications, including secondary structure prediction software (Chandonia and Karplus, 1995, 1996; Pred2ary, Chandonia and Karplus, 1999), threading methods (JThread, Chandonia and Cohen, 2003), and the MakeRAF software that creates mappings between the sequence and experimentally observed residues from PDB files in the ASTRAL database (Chandonia et al., 2002). MakeRAF provides an example of how to create a customized function for scoring gaps in sequence alignments by implementing the org.strbio.mol.lib.GapModel interface. Both MakeRAF and Pred2ary are included with StrBioLib, along with instructions for stand-alone installation and testing, and sample output useful for validation.
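The general pattern of customizing gap scoring through an interface can be sketched as follows. The GapPenalty interface below is a hypothetical stand-in; the actual signature of org.strbio.mol.lib.GapModel is not reproduced here, and the capped model is only one example of the kind of behavior such a customization might introduce.

```java
// Hypothetical stand-in for a gap-scoring interface; the real
// org.strbio.mol.lib.GapModel signature is not reproduced here.
interface GapPenalty {
    /** Cost (non-negative) of a gap spanning the given number of positions. */
    double penalty(int length);
}

/** Standard affine model: an opening cost plus a per-position extension cost. */
class AffineGap implements GapPenalty {
    private final double open;
    private final double extend;

    AffineGap(double open, double extend) {
        this.open = open;
        this.extend = extend;
    }

    @Override
    public double penalty(int length) {
        return length <= 0 ? 0.0 : open + extend * (length - 1);
    }
}

/** Customization by extension: gaps stop getting more expensive once a cap is reached. */
class CappedAffineGap extends AffineGap {
    private final double cap;

    CappedAffineGap(double open, double extend, double cap) {
        super(open, extend);
        this.cap = cap;
    }

    @Override
    public double penalty(int length) {
        return Math.min(cap, super.penalty(length));
    }
}
```

An alignment routine written against the interface can accept either model unchanged. A capped penalty of this kind might be useful, for instance, when long stretches of one sequence are expected to be absent from the other.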
Several additional programs are included with StrBioLib in the org.strbio.app package. These programs may be run as stand-alone utilities or used as models for further application development. ConvertProtein is an application for converting between various protein file formats and performing basic data manipulation (e.g. rotation and translation of protein structure, or elimination of particular atoms or residues). Align is a front end to the sequence alignment algorithms included in StrBioLib. FindProteins and SplitProteins are utilities to manipulate large sets of proteins; they work, respectively, by separating particular proteins from a group by name, and splitting a group into multiple subsets of equal size.
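The two set operations just described, selecting proteins by name and splitting a group into subsets of (nearly) equal size, can be sketched in plain Java on FASTA-formatted text. This is a minimal illustration and not the FindProteins or SplitProteins source; the class and method names are assumptions, and a protein's name is taken to be the first word of its FASTA header line.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative helpers; not part of StrBioLib. */
final class ProteinSetSketch {

    /** Parse FASTA text into an ordered name -> sequence map. */
    static Map<String, String> parseFasta(String text) {
        Map<String, String> records = new LinkedHashMap<>();
        String name = null;
        StringBuilder seq = new StringBuilder();
        for (String line : text.split("\\R")) {
            line = line.trim();
            if (line.startsWith(">")) {
                if (name != null) {
                    records.put(name, seq.toString());
                }
                name = line.substring(1).trim().split("\\s+")[0];
                seq.setLength(0);
            } else if (name != null) {
                seq.append(line);
            }
        }
        if (name != null) {
            records.put(name, seq.toString());
        }
        return records;
    }

    /** Keep only the records whose names appear in the given collection. */
    static Map<String, String> select(Map<String, String> records, Collection<String> names) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : records.entrySet()) {
            if (names.contains(e.getKey())) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    /** Split items into the given number of contiguous subsets whose sizes differ by at most one. */
    static <T> List<List<T>> split(List<T> items, int parts) {
        if (parts <= 0) {
            throw new IllegalArgumentException("parts must be positive");
        }
        List<List<T>> subsets = new ArrayList<>();
        int start = 0;
        for (int i = 0; i < parts; i++) {
            // The first (size % parts) subsets take one extra item.
            int size = items.size() / parts + (i < items.size() % parts ? 1 : 0);
            subsets.add(new ArrayList<>(items.subList(start, start + size)));
            start += size;
        }
        return subsets;
    }
}
```

When the number of proteins is not divisible by the number of subsets, exactly equal sizes are impossible; this sketch gives the leading subsets one extra member each.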
| ACKNOWLEDGEMENTS |
|---|
Thanks to L. Howard Holley, Jonathan D. Blake and Marcin P. Joachimiak for discussions leading to implementation of some of the classes. This work is supported by grants from the NIH (R01-GM39900, R01-GM073109, and 1-P50-GM62412) and by the U.S. Department of Energy under contract DE-AC02-05CH11231. Funding to pay the Open Access publication charges was provided by the NIH (R01-GM073109).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on March 21, 2007; revised on May 3, 2007; accepted on May 10, 2007
| REFERENCES |
|---|
Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol. (1990) 215:403–410.
Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.
Berman HM, et al. The Protein Data Bank. Nucleic Acids Res. (2000) 28:235–242.
Chandonia JM, Cohen FE. New local potential useful for genome annotation and 3D modeling. J. Mol. Biol. (2003) 332:835–850.
Chandonia JM, Karplus M. Neural networks for secondary structure and structural class predictions. Protein Sci. (1995) 4:275–285.
Chandonia JM, Karplus M. The importance of larger data sets for protein secondary structure prediction with neural networks. Protein Sci. (1996) 5:768–774.
Chandonia JM, Karplus M. New methods for accurate prediction of protein secondary structure. Proteins (1999) 35:293–306.
Chandonia JM, et al. ASTRAL compendium enhancements. Nucleic Acids Res. (2002) 30:260–263.
Chen L, et al. TargetDB: a target registration database for structural genomics projects. Bioinformatics (2004) 20:2860–2862.
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. (2004) 32:1792–1797.
Falicov A, Cohen FE. A surface of minimum area metric for the structural comparison of proteins. J. Mol. Biol. (1996) 258:871–892.
Gasteiger E, Hoogland C, Gattiker A, Duvaud S, Wilkins MR, Appel RD, Bairoch A. Protein Identification and Analysis Tools on the ExPASy Server. In: Walker JM, ed. The Proteomics Protocols Handbook. Humana Press (2005) 571–607.
Hooke R, Jeeves TA. Direct search solution of numerical and statistical problems. J. ACM (1961) 8:212–229.
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers (1983) 22:2577–2637.
Le Novere N. MELTING, computing the melting temperature of nucleic acid duplex. Bioinformatics (2001) 17:1226–1227.
Mangalam H. The Bio* toolkits – a brief overview. Brief. Bioinform. (2002) 3:296–302.
Mathews DH, et al. Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J. Mol. Biol. (1999) 288:911–940.
Matthews BW. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta (1975) 405:442–451.
Møller MF. A scaled conjugate gradient algorithm for fast supervised learning. Neural Netw. (1993) 6:525–533.
Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995) 247:536–540.
Orengo CA, et al. CATH – a hierarchic classification of protein domain structures. Structure (1997) 5:1093–1108.
Pocock MR, et al. BioJava: open source components for bioinformatics. ACM SIGBIO Newsl. (2000) 20:10–12.
Rozen S, Skaletsky H. Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol. (2000) 132:365–386.
Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J. Mol. Biol. (1993) 234:779–815.
Siew N, et al. MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics (2000) 16:776–785.