Bioinformatics Advance Access originally published online on August 27, 2004
Bioinformatics 2005 21(7):1267-1268; doi:10.1093/bioinformatics/bth493
SABmark: a benchmark for sequence alignment that covers the entire known fold space
1 Department of Ultrastructure, Vrije Universiteit Brussel, Pleinlaan 2, 1050 Brussel, Belgium
2 Algonomics NV, Technologiepark 4, 9052 Gent-Zwijnaarde, Belgium
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: The Sequence Alignment Benchmark (SABmark) provides sets of multiple alignment problems derived from the SCOP classification. These sets, Twilight Zone and Superfamilies, both cover the entire known fold space using sequences with very low to low, and low to intermediate similarity, respectively. In addition, each set has an alternate version in which unalignable but apparently similar sequences are added to each problem.
Availability: SABmark is available from http://bioinformatics.vub.ac.be
Contact: ivwalle{at}vub.ac.be
Protein sequence alignment has long been one of the most fundamental tools in bioinformatics, and is still an active field of research today. Plenty of room for improvement remains, however, because structurally related sequences are more often than not quite dissimilar and contain little information from which to produce a good alignment. Normally, each newly developed technique is compared with other algorithms in terms of both accuracy and speed. Initially this was done using a test set of alignments selected by the author, and there are still few independent databases designed specifically for this purpose (Bahr et al., 2001; Raghava et al., 2003). The often used BAliBASE, for example, contains 167 alignments that address different problems such as equidistant sequences, orphan sequences and internal insertions.
Our Sequence Alignment Benchmark (SABmark) focuses solely on sequences with very low to intermediate similarity (0-50% identity), since it has been shown that above this region the performance of most alignment programs is already excellent and can be improved little (e.g. see Sauder et al., 2000). In addition, it allows benchmarking of cases where not all sequences are related to each other. SABmark systematically covers the entire known fold space using only high-quality structures taken from the SCOP database (Murzin et al., 1995), and is updated following new releases (currently 1.65). For each alignment problem, pairwise reference structure alignments are provided, derived as a consensus from SOFI and CE (Boutonnet et al., 1995; Shindyalov and Bourne, 1998). The pairwise references within a problem are not necessarily entirely consistent with each other. In contrast to a single reference multiple alignment, this reflects the uncertainty about parts of the reference alignment. As a result, multiple alignment algorithms cannot obtain a perfect score for each problem, though nearly so for the more similar structures.
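To illustrate why inconsistent pairwise references rule out a perfect score, the following minimal sketch checks a necessary condition for a set of pairwise references to fit into one multiple alignment. It is not part of SABmark: the residue representation `(sequence_name, position)` and the function names are assumptions made for this example. Aligned residues are merged transitively, and any merged class that holds two different residues of the same sequence would need those residues in a single column, which no multiple alignment can provide.

```python
def find(parent, x):
    """Return the representative of x, compressing the path on the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def conflicting_classes(reference_pairs):
    """Return the merged residue classes that contain a sequence more than once.

    reference_pairs: iterable of ((seq, pos), (seq, pos)) tuples, each stating
    that two residues are aligned in some pairwise reference alignment
    (hypothetical representation chosen for this sketch).
    """
    parent = {}
    for a, b in reference_pairs:
        parent.setdefault(a, a)
        parent.setdefault(b, b)
        root_a, root_b = find(parent, a), find(parent, b)
        if root_a != root_b:
            parent[root_a] = root_b

    # Collect residues by their final representative.
    classes = {}
    for residue in parent:
        classes.setdefault(find(parent, residue), []).append(residue)

    # A class is conflicting if one sequence contributes two or more residues.
    conflicts = []
    for members in classes.values():
        sequences = [seq for seq, _ in members]
        if len(sequences) != len(set(sequences)):
            conflicts.append(sorted(members))
    return conflicts


# A-B and B-C agree on one column, but A-C places A1 opposite C2 instead of C1.
references = [
    (("A", 1), ("B", 1)),
    (("B", 1), ("C", 1)),
    (("A", 1), ("C", 2)),
]
print(conflicting_classes(references))
```

In this toy case any multiple alignment can reproduce at most two of the three reference pairs, so a score based on reproduced pairs stays below its maximum.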
SABmark consists of two alignment sets, Twilight Zone and Superfamilies, which contain single-domain sequences with very low to low, and low to intermediate similarity, respectively. The sequences of the Twilight Zone set are taken from a SCOP subset provided by the ASTRAL compendium, in which domains have a pairwise Blast E-value of at least 1, for a theoretical database size of 10^8 residues (Altschul et al., 1997; Chandonia et al., 2004). To ensure the quality of the structures, this subset is reduced further by retaining only those that have all four backbone atoms present in all residues, and for which the PDB ATOM records exactly match the SEQRES sequence. The remaining structures are subsequently divided into groups corresponding to SCOP folds, the fold being the lowest level of structural similarity. To avoid bias towards very well-represented folds, the maximum size of each group is limited to 25 sequences simply by keeping only the largest structures. The Superfamilies set is created analogously from sequences that have at most 50% identity to each other, but here groups are formed by division into SCOP superfamilies, which correspond to sequences with a putative common evolutionary origin.
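The quality filter, grouping and size cap described above can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually used to build SABmark: the `Domain` record and its field names are hypothetical, "largest" is assumed to mean the highest residue count, and the E-value or identity filtering is taken as already done upstream.

```python
from collections import defaultdict
from typing import NamedTuple

MAX_GROUP_SIZE = 25  # cap stated in the text


class Domain(NamedTuple):
    """Hypothetical record for one SCOP domain."""
    domain_id: str
    group_label: str          # SCOP fold (Twilight Zone) or superfamily
    length: int               # residue count, assumed measure of "size"
    complete_backbone: bool   # all four backbone atoms present in all residues
    atom_matches_seqres: bool # PDB ATOM records exactly match SEQRES


def build_groups(domains, max_size=MAX_GROUP_SIZE):
    """Filter domains on structure quality, group them, and cap group size."""
    groups = defaultdict(list)
    for domain in domains:
        if domain.complete_backbone and domain.atom_matches_seqres:
            groups[domain.group_label].append(domain)

    capped = {}
    for label, members in groups.items():
        # Keep only the largest structures when a group exceeds the cap.
        members.sort(key=lambda d: d.length, reverse=True)
        capped[label] = members[:max_size]
    return capped
```

Passing fold labels as `group_label` mimics the Twilight Zone grouping, while passing superfamily labels mimics the Superfamilies grouping.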
One of the major applications of sequence alignment is determining which sequences are related to each other and which are not. SABmark contains a second version of the Twilight Zone and the Superfamilies sets, which allows procedures for this purpose to be benchmarked. In these sets, each group of originally N sequences is expanded with at most N other, structurally unrelated yet apparently similar sequences. These false positives were selected from Blast searches of the original sequences (true positives) against a 70% identity subset of SCOP. Each false positive belongs to a different fold than the true positives and has at least one E-value to a true positive that is lower than at least one E-value of that true positive with another true positive. In practice, N or more false positives were found for almost all alignments.
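The selection rule for false positives can be written as a small predicate. The sketch below is an interpretation of the rule as stated, not the authors' code. Several details are assumptions: E-values are held in a dictionary keyed by `(query, hit)`, a pair with no reported hit is treated as having an infinitely high E-value, and when more than N candidates qualify the first N in input order are kept (the text does not say how the choice is made).

```python
import math


def is_false_positive(candidate, candidate_fold, group_fold, true_positives, evalue):
    """Check the selection rule for one candidate sequence.

    evalue: dict mapping (query, hit) to a Blast E-value; missing pairs are
    assumed to mean "no hit reported" and count as infinity.
    """
    if candidate_fold == group_fold:
        return False  # must belong to a different fold than the true positives
    for tp in true_positives:
        e_candidate = evalue.get((tp, candidate), math.inf)
        for other in true_positives:
            if other == tp:
                continue
            e_related = evalue.get((tp, other), math.inf)
            # The candidate looks more similar to tp than a true relative does.
            if e_candidate < e_related:
                return True
    return False


def select_false_positives(candidates, group_fold, true_positives, evalue):
    """Return at most N qualifying candidates, N = number of true positives.

    candidates: list of (name, fold) tuples in an arbitrary, assumed order.
    """
    chosen = [
        name
        for name, fold in candidates
        if is_false_positive(name, fold, group_fold, true_positives, evalue)
    ]
    return chosen[: len(true_positives)]
```

The predicate returns as soon as one qualifying pair of E-values is found, matching the "at least one" wording of the rule.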
An overview of the general characteristics of each SABmark set is given in Figure 1 and Table 1. The database is provided as a well-organized set of standard format files (Fasta, PDB), for which new alignments can be created, scored and archived by using a small number of Perl scripts. For example:
- run_alignm.pl twi
- score.pl twi alignm
- archive.pl twi alignm
- score.pl twi alignm
| Acknowledgments |
|---|
This work was supported by a grant from the Instituut voor de bevordering van Wetenschap en Techniek (IWT), Belgium.
Received on July 13, 2004; revised on August 19, 2004; accepted on August 19, 2004
| REFERENCES |
|---|
Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Bahr, A., Thompson, J., Thierry, J., Poch, O. (2001) BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res., 29, 323-326.
Boutonnet, N., Rooman, M., Ochagavia, M., Richelle, J., Wodak, S. (1995) Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins. Protein Eng., 8, 647-662.
Chandonia, J.-M., Hon, G., Walker, N.S., Conte, L.L., Koehl, P., Levitt, M., Brenner, S.E. (2004) The ASTRAL Compendium in 2004. Nucleic Acids Res., 32, D189-D192.
Murzin, A., Brenner, S., Hubbard, T., Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536-540.
Raghava, G.P.S., Searle, S.M.J., Audley, P.C., Barber, J.D., Barton, G.J. (2003) OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4, 47.
Sauder, J., Arthur, J., Dunbrack, R. (2000) Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins, 40, 6-22.
Shindyalov, I. and Bourne, P. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739-747.
Van Walle, I., Lasters, I., Wyns, L. (2004) Align-m: a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics, 20, 1428-1435.