Bioinformatics Advance Access originally published online on August 27, 2004
Bioinformatics 2005 21(7):1267-1268; doi:10.1093/bioinformatics/bth493
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oupjournals.org

SABmark—a benchmark for sequence alignment that covers the entire known fold space

Ivo Van Walle 1,*, Ignace Lasters 2 and Lode Wyns 1

1Department of Ultrastructure, Vrije Universiteit Brussel Pleinlaan 2, 1050 Brussel, Belgium
2Algonomics NV Technologiepark 4, 9052 Gent-Zwijnaarde, Belgium

*To whom correspondence should be addressed.


    Abstract

Summary: The Sequence Alignment Benchmark (SABmark) provides sets of multiple alignment problems derived from the SCOP classification. These sets, Twilight Zone and Superfamilies, both cover the entire known fold space using sequences with very low to low, and low to intermediate similarity, respectively. In addition, each set has an alternate version in which unalignable but apparently similar sequences are added to each problem.

Availability: SABmark is available from http://bioinformatics.vub.ac.be

Contact: ivwalle@vub.ac.be

Protein sequence alignment has long been one of the most fundamental tools in bioinformatics and remains an active field of research. Plenty of room for improvement remains, however, because structurally related sequences are more often than not quite dissimilar and contain little information from which to produce a good alignment. Normally, each newly developed technique is compared with other algorithms in terms of both accuracy and speed. Initially this was done using a test set of alignments selected by the author, and there are still few independent databases designed specifically for this purpose (Bahr et al., 2001; Raghava et al., 2003). The often-used BAliBASE, for example, contains 167 alignments that address different problems such as equidistant sequences, orphan sequences and internal insertions.

Our Sequence Alignment Benchmark (SABmark) focuses solely on sequences with very low to intermediate similarity (0–50% identity), since it has been shown that above this region the performance of most alignment programs is already excellent and can be improved little (e.g. see Sauder et al., 2000). In addition, it allows benchmarking of cases where not all sequences are related to each other. SABmark systematically covers the entire known fold space using only high-quality structures taken from the SCOP database (Murzin et al., 1995), and is updated following new SCOP releases (currently 1.65). For each alignment problem, pairwise reference structure alignments are provided, derived as a consensus from SOFI and CE (Boutonnet et al., 1995; Shindyalov and Bourne, 1998). These pairwise references are not necessarily entirely consistent with each other across pairs, which, in contrast to a single reference multiple alignment, reflects the uncertainty about parts of the reference alignment. As a result, multiple alignment algorithms cannot obtain a perfect score for every problem, although they can come close for the more similar structures.
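The inconsistency between pairwise references can be made concrete with a small sketch. It assumes, purely for illustration, that each pairwise reference is stored as a set of (i, j) residue index pairs; the function name and data layout are hypothetical and are not taken from the SABmark distribution.

```python
def inconsistent_triples(ab, bc, ac):
    """Find residues where three pairwise references disagree.

    ab, bc, ac are sets of (i, j) index pairs for the reference
    alignments A-B, B-C and A-C (hypothetical layout). Each residue is
    assumed to be aligned to at most one partner per alignment.
    Returns (i, j, k) triples where A[i]~B[j] and B[j]~C[k], but the
    A-C reference aligns A[i] to a residue of C other than k.
    """
    bc_map = dict(bc)
    ac_map = dict(ac)
    conflicts = []
    for i, j in sorted(ab):
        k = bc_map.get(j)
        if k is not None and i in ac_map and ac_map[i] != k:
            conflicts.append((i, j, k))
    return conflicts


# Toy example: residue 1 of A reaches residue 2 of C via B,
# but the direct A-C reference pairs it with residue 1 of C.
ab = {(0, 0), (1, 1)}
bc = {(0, 0), (1, 2)}
ac = {(0, 0), (1, 1)}
conflicts = inconsistent_triples(ab, bc, ac)
```

Whenever such a conflict exists, no single multiple alignment can satisfy all three pairwise references at once, which is why a perfect score is not always attainable.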

SABmark consists of two alignment sets, Twilight Zone and Superfamilies, which contain single-domain sequences with very low to low, and low to intermediate similarity, respectively. The sequences of the Twilight Zone set are taken from a SCOP subset provided by the ASTRAL compendium, in which domains have a pairwise Blast E-value of at least 1, for a theoretical database size of 10^8 residues (Altschul et al., 1997; Chandonia et al., 2004). To ensure the quality of the structures, this subset is reduced further by retaining only those that have all four backbone atoms present in all residues, and for which the PDB ATOM records exactly match the SEQRES sequence. The remaining structures are subsequently divided into groups corresponding to SCOP folds, which is the lowest level of structure similarity. To avoid bias towards very well-represented folds, the maximum size of each group is limited to 25 sequences by keeping only the largest structures. The Superfamilies set is created analogously from sequences that have at most 50% identity to each other, but here groups are formed by division into SCOP superfamilies, which correspond to sequences with a putative common evolutionary origin.
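The grouping and size-capping step described above can be sketched as follows. This is a minimal illustration, not the procedure actually used to build SABmark: the `Domain` record and `build_groups` function are hypothetical, and "largest" is assumed here to mean the greatest number of residues.

```python
from collections import defaultdict
from typing import NamedTuple


class Domain(NamedTuple):
    domain_id: str  # hypothetical identifier
    fold: str       # SCOP fold label (or superfamily label for the Superfamilies set)
    length: int     # number of residues


def build_groups(domains, max_size=25):
    """Group domains by fold and keep at most max_size of the largest per group."""
    by_fold = defaultdict(list)
    for d in domains:
        by_fold[d.fold].append(d)
    groups = {}
    for fold, members in by_fold.items():
        # Longest first; ties are broken by identifier so the result is deterministic.
        members.sort(key=lambda d: (-d.length, d.domain_id))
        groups[fold] = members[:max_size]
    return groups


# Toy input: 30 domains in one fold, 1 domain in another.
domains = [Domain(f"d{i}", "a.1", 50 + i) for i in range(30)]
domains.append(Domain("x1", "b.1", 80))
groups = build_groups(domains)
```

The structure-quality filters (complete backbone atoms, ATOM records matching SEQRES) would be applied before this step and are not shown.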

One of the major applications of sequence alignment is determining which sequences are related to each other and which are not. SABmark contains a second version of the Twilight Zone and the Superfamilies sets, which allows benchmarking of procedures for this purpose. In these sets, each group of originally N sequences is expanded with at most N other, structurally unrelated yet apparently similar sequences. These false positives were selected from Blast searches of the original sequences (true positives) against a 70% identity subset of SCOP. Each false positive belongs to a different fold than the true positives and has at least one E-value to a true positive that is lower than at least one E-value of that true positive with another true positive. In practice, N or more false positives were found for almost all alignments.
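The selection rule for false positives can be expressed as a short sketch. The data layout is an assumption made for illustration: E-values are stored in a dictionary keyed by unordered pairs of identifiers, pairs without a reported Blast hit are simply absent, and when more than N candidates qualify the N with the lowest qualifying E-value are kept. The source does not state how ties or surplus candidates were handled.

```python
def select_false_positives(true_pos, group_fold, candidates, evalues):
    """Pick at most len(true_pos) false positives for one group.

    true_pos:   list of true-positive identifiers
    group_fold: SCOP fold of the true positives
    candidates: dict mapping candidate identifier -> SCOP fold
    evalues:    dict mapping frozenset({a, b}) -> Blast E-value
                (hypothetical layout; missing key means no hit)
    """
    selected = []
    for cand, fold in candidates.items():
        if fold == group_fold:
            continue  # must belong to a different fold
        best = None
        for tp in true_pos:
            e_cand = evalues.get(frozenset((cand, tp)))
            if e_cand is None:
                continue
            # E-values of this true positive with the other true positives.
            tp_es = [
                evalues[frozenset((tp, other))]
                for other in true_pos
                if other != tp and frozenset((tp, other)) in evalues
            ]
            # Candidate must look more similar than at least one true relative.
            if any(e_cand < e for e in tp_es):
                best = e_cand if best is None else min(best, e_cand)
        if best is not None:
            selected.append((best, cand))
    selected.sort()
    return [cand for _, cand in selected[: len(true_pos)]]


true_pos = ["t1", "t2"]
candidates = {"c1": "b.1", "c2": "a.1", "c3": "c.1", "c4": "d.1"}
evalues = {
    frozenset(("t1", "t2")): 5.0,
    frozenset(("c1", "t1")): 2.0,   # different fold, beats 5.0: qualifies
    frozenset(("c2", "t1")): 0.1,   # same fold: excluded
    frozenset(("c3", "t1")): 9.0,   # different fold, but not below 5.0
}
false_pos = select_false_positives(true_pos, "a.1", candidates, evalues)
```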

An overview of the general characteristics of each SABmark set is given in Figure 1 and Table 1. The database is provided as a well-organized set of standard format files (Fasta, PDB), for which new alignments can be created, scored and archived by using a small number of Perl scripts. For example:

run_alignm.pl twi
score.pl twi alignm
archive.pl twi alignm
This will calculate Align-m multiple alignments for each problem of the Twilight Zone set (Van Walle et al., 2004). A flat text file report is subsequently generated with, for each sequence pair, the number of residues aligned correctly and incorrectly with respect to the reference, and some other information such as percentage identity and structure similarity according to SCOP. Finally, all data are archived into a single zipped tarball that can be restored later. To benchmark a new algorithm or different parameter settings, all that needs to be done is to copy and adjust the run_alignm.pl script. Further details can be found in the accompanying manual.
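The per-pair counts in the report can be illustrated with a small sketch. It is not the logic of score.pl, whose exact definitions are given in the manual. It assumes that alignments are given as two gapped rows of equal length, and that an aligned residue pair counts as correct if it also occurs in the reference and as incorrect otherwise. Both function names are hypothetical.

```python
def aligned_pairs(row_a, row_b, gap="-"):
    """Return the set of (i, j) residue index pairs aligned in two gapped rows."""
    pairs = set()
    i = j = 0
    for a, b in zip(row_a, row_b):
        if a != gap and b != gap:
            pairs.add((i, j))
        if a != gap:
            i += 1
        if b != gap:
            j += 1
    return pairs


def score_pair(test_pairs, ref_pairs):
    """Count aligned pairs that agree and disagree with the reference."""
    correct = len(test_pairs & ref_pairs)
    incorrect = len(test_pairs - ref_pairs)
    return correct, incorrect


# Reference aligns residues 0-0, 1-1 and 3-3; the test alignment adds 2-2.
ref = aligned_pairs("ACD-E", "AC-FE")
test = aligned_pairs("ACDE", "ACFE")
correct, incorrect = score_pair(test, ref)
```

For a multiple alignment, the same counts would be obtained by extracting each pair of rows and comparing it with the corresponding pairwise reference.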



Fig. 1 Distribution of sequence pairs as a function of percentage identity for the Twilight Zone (Twi) and Superfamilies (Sup) set, and for each level of structure similarity (fold, superfamily and family). Intervals are 2.5% identity wide.

 

Table 1 Specifications of each set

 


    Acknowledgments
 
This work was supported by a grant from the Instituut voor de bevordering van Wetenschap en Techniek (IWT), Belgium.

Received on July 13, 2004; revised on August 19, 2004; accepted on August 19, 2004

    REFERENCES

    Altschul, S., Madden, T., Schäffer, A., Zhang, J., Zhang, Z., Miller, W., Lipman, D. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389–3402.

    Bahr, A., Thompson, J., Thierry, J., Poch, O. (2001) BAliBASE (Benchmark Alignment dataBASE): enhancements for repeats, transmembrane sequences and circular permutations. Nucleic Acids Res., 29, 323–326.

    Boutonnet, N., Rooman, M., Ochagavia, M., Richelle, J., Wodak, S. (1995) Optimal protein structure alignments by multiple linkage clustering: application to distantly related proteins. Protein Eng., 8, 647–662.

    Chandonia, J.-M., Hon, G., Walker, N.S., Conte, L.L., Koehl, P., Levitt, M., Brenner, S.E. (2004) The ASTRAL Compendium in 2004. Nucleic Acids Res., 32, D189–D192.

    Murzin, A., Brenner, S., Hubbard, T., Chothia, C. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol., 247, 536–540.

    Raghava, G.P.S., Searle, S.M.J., Audley, P.C., Barber, J.D., Barton, G.J. (2003) OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics, 4, 47.

    Sauder, J., Arthur, J., Dunbrack, R. (2000) Large-scale comparison of protein sequence alignment algorithms with structure alignments. Proteins, 40, 6–22.

    Shindyalov, I. and Bourne, P. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng., 11, 739–747.

    Van Walle, I., Lasters, I., Wyns, L. (2004) Align-m—a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics, 20, 1428–1435.

