Bioinformatics Advance Access originally published online on October 10, 2005
Bioinformatics 2005 21(24):4425-4426; doi:10.1093/bioinformatics/bti712
QUASAR: scoring and ranking of sequence–structure alignments


Practical Informatics and Bioinformatics Group, Department of Informatics, Ludwig-Maximilians-University Amalienstrasse 17, D-80333 Munich, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Sequence–structure alignments are a common means for protein structure prediction in the fields of fold recognition and homology modeling, and there is a broad variety of programs that provide such alignments based on sequence similarity, secondary structure or contact potentials. Nevertheless, finding the best sequence–structure alignment in a pool of alignments remains a difficult problem. QUASAR (quality of sequence–structure alignments ranking) provides a unifying framework for scoring sequence–structure alignments that helps users find well-performing combinations of well-known and custom-made scoring schemes. These scoring functions can be benchmarked against widely accepted quality scores like MaxSub, TMScore, Touch and APDB, thus enabling users to test their own alignment scores against standard-of-truth structure-based scores. Furthermore, individual score combinations can be optimized with respect to benchmark sets based on known structural relationships using QUASAR's built-in optimization routines.
Availability: The software, examples, the Java documentation and a tutorial are available at http://www.bio.ifi.lmu.de/QUASAR
Contact: fabian.birzele{at}ifi.lmu.de
| 1 INTRODUCTION |
|---|
With the growing gap between the number of known protein sequences in databases like Swiss-Prot/TrEMBL (Boeckmann et al., 2003) and the number of experimentally determined protein structures in the PDB (Berman et al., 2000), automated structure prediction methods have become valuable tools for assigning potential coordinate models to new protein sequences. The first step to building a complete all-atom model is often to align a sequence of unknown structure (the so-called target) to a database of sequences with known structures (so-called templates). On the basis of these alignments and the underlying known template structures, models are built and refined. Since alignment quality determines the model quality, it is desirable to identify good models at the alignment stage in order to avoid the overhead of producing obviously unsuitable coordinate models. This mainly restricts efforts to sequence and secondary-structure-based measures (i.e. alignment scores) instead of using structural properties.
The QUASAR (quality of sequence–structure alignments ranking) system has been designed to fit two needs. First, it is a platform-independent and easily extendable software package for scoring and ranking sequence–structure alignments coming from different sources. Second, it aids the process of developing, benchmarking and optimizing new alignment quality measurements. The graphical user interface (GUI) of QUASAR provides quick access to each of the possible use cases and allows for visualization and comparison of the results as well as for configuration of all essential parts. Once configured, QUASAR can also be used directly from the command line.
| 2 METHODS |
|---|
2.1 Scoring alignments
The so-called scoring schemes represent alignment quality scores that require only information that is available from the sequence (e.g. predicted secondary structure) or that can be directly inferred from the template structure. Scoring schemes provided by the system include several amino acid and secondary-structure-based exchange matrices [such as PAM (Dayhoff et al., 1978) and those of Luthy et al. (1991)], the two standard secondary-structure fit measures Q3 and SOV (Zemla et al., 1999) as well as two contact-capacity-based scores (Berrera et al., 2003; Singer et al., 2002). The number of available scoring schemes can be easily extended by implementing a Java interface or, in the case of (amino acid exchange) scoring matrices, by adding a text file in a QUASAR-specific format that contains the matrix information. This provides a fast connection to matrix collections such as the AAindex database (Kawashima et al., 1999).
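As an illustration of how a scoring scheme might be added through a Java interface, the following minimal sketch defines a hypothetical `ScoringScheme` interface and a Q3-style implementation. The interface name, its method signature and the gap character `-` are assumptions made for this example and are not taken from the QUASAR API; the score is simply the fraction of aligned, non-gap columns in which the predicted secondary-structure state of the target equals the observed state of the template.

```java
/** Hypothetical interface for illustration; QUASAR's actual interface may differ. */
interface ScoringScheme {
    /** Scores two aligned strings of equal length; '-' is assumed to mark a gap. */
    double score(String alignedTarget, String alignedTemplate);
}

/** Q3-style agreement of three-state secondary-structure strings (H, E, C). */
final class Q3Scheme implements ScoringScheme {
    @Override
    public double score(String targetSs, String templateSs) {
        if (targetSs.length() != templateSs.length()) {
            throw new IllegalArgumentException("aligned strings must have equal length");
        }
        int aligned = 0;
        int matches = 0;
        for (int i = 0; i < targetSs.length(); i++) {
            char a = targetSs.charAt(i);
            char b = templateSs.charAt(i);
            if (a == '-' || b == '-') {
                continue; // gap columns carry no secondary-structure pairing
            }
            aligned++;
            if (a == b) {
                matches++;
            }
        }
        // Fraction of aligned columns with identical states; 0 if nothing is aligned.
        return aligned == 0 ? 0.0 : (double) matches / aligned;
    }
}
```

For the aligned strings `HHHEE-CC` and `HHCEEHC-`, six columns are gap-free and five of them agree, so this sketch returns 5/6.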
2.2 Combining scores
With the so-called score conductor, the user can integrate several scoring schemes into one scoring function by combining the scores in a weighted sum (assigning user-specified weights for the single scores), i.e. as a linear combination of the individual scores. In addition, by editing the configuration file, experienced users can build more complex, tree-like formulas using further operators like multiplication and division. Therefore, a user can test different combinations of scoring schemes with a minimal amount of extra time and thus improve the ranking quality over the performance of the single scores. The final quality score of every alignment is calculated by combining the single alignment scores according to the formula given in the configuration. Single scores can also be normalized to range between zero and one to combine scores with different magnitudes.
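The weighted-sum combination can be sketched as follows. This is not QUASAR's score conductor; the class and method names are hypothetical, and min-max scaling over the pool of alignments is assumed as the normalization, since the text states only that scores can be normalized to the range between zero and one.

```java
final class ScoreConductorSketch {

    /** Min-max scales one scheme's scores over all alignments to [0, 1] (assumed normalization). */
    static double[] normalize(double[] scores) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double range = max - min;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // A constant score column carries no ranking information; map it to 0.
            out[i] = range == 0.0 ? 0.0 : (scores[i] - min) / range;
        }
        return out;
    }

    /**
     * Combines normalized scores as a weighted sum.
     * scores[k][i] is the score of scheme k for alignment i; weights[k] is the weight of scheme k.
     */
    static double[] combine(double[][] scores, double[] weights) {
        if (scores.length != weights.length) {
            throw new IllegalArgumentException("one weight per scoring scheme is required");
        }
        int alignments = scores[0].length;
        double[] combined = new double[alignments];
        for (int k = 0; k < scores.length; k++) {
            double[] norm = normalize(scores[k]);
            for (int i = 0; i < alignments; i++) {
                combined[i] += weights[k] * norm[i];
            }
        }
        return combined;
    }
}
```

With two schemes scoring three alignments as {10, 20, 30} and {0.2, 0.8, 0.5} and weights 0.5 and 2.0, the combined scores are approximately 0, 2.25 and 1.5, so the second alignment ranks first even though its raw score in the first scheme is not the highest.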
2.3 Benchmarking scores
To help the user find a scoring function that gives the best possible results, QUASAR contains a number of structure-based quality scores like Touch, APDB (O'Sullivan et al., 2003), as well as reimplementations of MaxSub (Siew et al., 2000) and TMScore (Zhang and Skolnick, 2004), both based on a different superimposition routine (Fortran QRT fit). For a given alignment benchmark set for which the structures of query and template proteins are known, QUASAR measures the correlation coefficient of the ranking resulting from the specified alignment score with a structure-based benchmark measure (e.g. RMSD). It is also possible to use a user-defined quality score as a reference by annotating it to the alignments (Fig. 1). This makes it easy to compare the performance of an alignment score or a combination of scores with a given standard-of-truth without the need to implement the score in Java.
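The benchmarking idea can be illustrated with a rank correlation between an alignment score and a structure-based reference score. The text does not state which correlation coefficient QUASAR computes, so the sketch below assumes Spearman's rank correlation (Pearson correlation of the ranks, with tied values receiving their average rank); all names are hypothetical.

```java
import java.util.Arrays;

final class RankCorrelationSketch {

    /** Returns 1-based ranks (ascending); tied values receive their average rank. */
    static double[] ranks(double[] values) {
        Integer[] order = new Integer[values.length];
        for (int i = 0; i < order.length; i++) {
            order[i] = i;
        }
        Arrays.sort(order, (a, b) -> Double.compare(values[a], values[b]));
        double[] ranks = new double[values.length];
        int start = 0;
        while (start < order.length) {
            int end = start;
            while (end + 1 < order.length && values[order[end + 1]] == values[order[start]]) {
                end++;
            }
            double averageRank = (start + end) / 2.0 + 1.0;
            for (int k = start; k <= end; k++) {
                ranks[order[k]] = averageRank;
            }
            start = end + 1;
        }
        return ranks;
    }

    /** Pearson correlation; yields NaN if either input has zero variance. */
    static double pearson(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("inputs must have equal length");
        }
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < x.length; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= x.length;
        meanY /= y.length;
        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < x.length; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    /** Spearman rank correlation of an alignment score with a reference score. */
    static double spearman(double[] alignmentScores, double[] referenceScores) {
        return pearson(ranks(alignmentScores), ranks(referenceScores));
    }
}
```

For a reference measure where lower values are better, such as RMSD, a well-performing alignment score would be expected to show a strongly negative correlation rather than a positive one.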
2.4 Optimizing scores
The performance of a scoring function depends heavily on the weights which are assigned to the individual scoring schemes. Thus, QUASAR allows optimizing these weights with respect to a benchmark set of alignments with assigned or computed standard-of-truth scores (see above). So far, two optimization routines are available. One may invoke least-squares optimization or use a genetic algorithm to explore the space of possible score combinations. The fitness of a combination of scoring scheme weights is evaluated with respect to a benchmark set as described in the previous subsection. Such an optimization may also uncover the main ingredients of an already well-performing score combination by ruling out unnecessary scores.
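One way to read the least-squares option is as a linear regression of the reference score on the individual scheme scores. The sketch below makes that assumption explicit: it chooses weights that minimize the squared difference between the weighted sum and the standard-of-truth score by solving the normal equations with Gaussian elimination. It is not QUASAR's routine, and the class and method names are hypothetical.

```java
final class LeastSquaresWeightsSketch {

    /**
     * features[i][k] is the score of scheme k for alignment i; target[i] is the reference score.
     * Returns the weights w that minimize sum_i (sum_k w[k] * features[i][k] - target[i])^2.
     */
    static double[] fit(double[][] features, double[] target) {
        int n = features.length;
        int m = features[0].length;

        // Augmented normal equations [X^T X | X^T y].
        double[][] a = new double[m][m + 1];
        for (int i = 0; i < n; i++) {
            for (int r = 0; r < m; r++) {
                for (int c = 0; c < m; c++) {
                    a[r][c] += features[i][r] * features[i][c];
                }
                a[r][m] += features[i][r] * target[i];
            }
        }

        // Forward elimination with partial pivoting.
        for (int col = 0; col < m; col++) {
            int pivot = col;
            for (int r = col + 1; r < m; r++) {
                if (Math.abs(a[r][col]) > Math.abs(a[pivot][col])) {
                    pivot = r;
                }
            }
            if (Math.abs(a[pivot][col]) < 1e-12) {
                throw new IllegalArgumentException("scoring schemes are linearly dependent");
            }
            double[] tmp = a[col];
            a[col] = a[pivot];
            a[pivot] = tmp;
            for (int r = col + 1; r < m; r++) {
                double factor = a[r][col] / a[col][col];
                for (int c = col; c <= m; c++) {
                    a[r][c] -= factor * a[col][c];
                }
            }
        }

        // Back substitution.
        double[] w = new double[m];
        for (int r = m - 1; r >= 0; r--) {
            double sum = a[r][m];
            for (int c = r + 1; c < m; c++) {
                sum -= a[r][c] * w[c];
            }
            w[r] = sum / a[r][r];
        }
        return w;
    }
}
```

A weight that comes out close to zero in such a fit suggests that the corresponding scheme contributes little to the combination, which matches the remark above about ruling out unnecessary scores.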
2.5 Implementation
QUASAR is implemented entirely in Java (Version 1.4+). It is freely available for academic users as a standalone application and as a Java Web Start application. All scoring schemes, scoring functions, benchmark scores and optimization routines can be configured in an XML-like configuration file that can be generated using the GUI.
| 3 USE CASES |
|---|
3.1 Benchmarking and optimization
A first, interactive use case might be as follows: given a new scoring scheme, e.g. a new scoring matrix,
- One builds a benchmark set of alignments and loads the data into QUASAR.
- In QUASAR, one explores the performance of the new scoring matrix in comparison with and in combination with built-in scores. The evaluation is done with respect to the standard-of-truth benchmark scores available in QUASAR and with the help of the visualization panel.
- One further improves the ranking performance by combining well-performing schemes and optimizing their weights using QUASAR's optimization routines.
- Finally, one saves the configuration for later use of QUASAR from the command line.
3.2 Automated alignment ranking
A second, non-interactive use case is the ranking of sequence–structure alignments. Here, one already has an optimized combination of scores together with the corresponding QUASAR configuration at hand. Given a set of different sequence–structure alignments for a target (e.g. to different template structures), one includes a call to QUASAR with the configuration file in the structure prediction process and can thus, for example, automatically discard alignments on the basis of the previously optimized alignment score.
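The filtering step in such a pipeline might look like the following sketch. It does not call QUASAR; it assumes the combined scores have already been computed and only shows ranking and thresholding. The class names, the template identifiers used in the usage note and the cutoff value are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

final class AlignmentRankingSketch {

    /** Minimal stand-in for a scored sequence-structure alignment. */
    static final class ScoredAlignment {
        final String templateId;
        final double score;

        ScoredAlignment(String templateId, double score) {
            this.templateId = templateId;
            this.score = score;
        }
    }

    /** Keeps alignments with score >= minScore and returns them best first. */
    static List<ScoredAlignment> rankAndFilter(List<ScoredAlignment> pool, double minScore) {
        List<ScoredAlignment> kept = new ArrayList<>();
        for (ScoredAlignment alignment : pool) {
            if (alignment.score >= minScore) {
                kept.add(alignment);
            }
        }
        // Higher combined scores are assumed to indicate better alignments.
        kept.sort(Comparator.comparingDouble((ScoredAlignment a) -> a.score).reversed());
        return kept;
    }
}
```

Given alignments to the made-up templates 1abc (0.42), 2xyz (0.91), 3def (0.10) and 4ghi (0.65) and a cutoff of 0.4, the sketch keeps 2xyz, 4ghi and 1abc in that order, so only three coordinate models would need to be built.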
| 4 CONCLUSION |
|---|
Sequence–structure alignments play an important role in protein structure prediction and analysis. With QUASAR we provide software that facilitates alignment scoring, comparison of known with user-defined scoring schemes and optimization of score combinations. It is platform-independent and can be used interactively or from the command line. Future extensions will include new scoring schemes, improved analysis of results and more optimization options. We encourage users to send their own scoring schemes in order to have them included in future releases.
| Acknowledgments |
|---|
This work was funded by the German Research Foundation (DFG) under project grant PROSEQO II (Zi616/2).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors.
Received on August 19, 2005; revised on September 27, 2005; accepted on October 6, 2005
| REFERENCES |
|---|
Berman, H., et al. (2000) The Protein Data Bank. Nucleic Acids Res., 28, 235–242.
Berrera, M., et al. (2003) Amino acid empirical contact energy definitions for fold recognition in the space of contact maps. BMC Bioinformatics, 4, 8.
Boeckmann, B., et al. (2003) The Swiss-Prot protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.
Dayhoff, M.O., et al. (1978) A model of evolutionary change in proteins. Atlas Protein Sequence Struct., 5, 345–352.
Kawashima, S., et al. (1999) AAindex: amino acid index database. Nucleic Acids Res., 27, 368–369.
Luthy, R., et al. (1991) Secondary structure-based profiles: use of structure-conserving scoring tables in searching protein sequence databases for structural similarities. Proteins, 10, 229–239.
O'Sullivan, O., et al. (2003) APDB: a novel measure for benchmarking sequence alignment methods without reference alignments. Bioinformatics, 19, 215i–221i.
Siew, N., et al. (2000) MaxSub: an automated measure for the assessment of protein structure prediction quality. Bioinformatics, 16, 776–785.
Singer, M.S., et al. (2002) Prediction of protein residue contacts with a PDB-derived likelihood matrix. Protein Eng., 15, 721–725.
Zemla, A., et al. (1999) A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins, 34, 220–223.
Zhang, Y. and Skolnick, J. (2004) Scoring function for automated assessment of protein structure template quality. Proteins, 57, 702–710.

