Bioinformatics Advance Access originally published online on August 8, 2007
Bioinformatics 2007 23(19):2648-2649; doi:10.1093/bioinformatics/btm389
DNA reference alignment benchmarks based on tertiary structure of encoded proteins
1Computer Science Department and 2Biology Department, Brigham Young University, Provo, Utah 84602, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Multiple sequence alignments (MSAs) are at the heart of bioinformatics analysis. Recently, a number of multiple protein sequence alignment benchmarks (i.e. BAliBASE, OXBench, PREFAB and SMART) have been released to evaluate new and existing MSA applications. These databases have been well received by researchers and help to quantitatively evaluate MSA programs on protein sequences. Unfortunately, analogous DNA benchmarks are not available, making evaluation of MSA programs difficult for DNA sequences.
Results: This work presents the first known multiple DNA sequence alignment benchmarks that are (1) composed of protein-coding portions of DNA and (2) based on biological features such as the tertiary structure of encoded proteins. These reference DNA databases contain a total of 3545 alignments comprising 68 581 sequences. Two versions of the database are available: mdsa_100s and mdsa_all. The mdsa_100s version contains the alignments of only those data sets for which TBLASTN found a hit with 100% sequence identity for every sequence. The mdsa_all version includes all hits that pass the E-value threshold of 0.001. A primary use of these databases is to benchmark the performance of MSA applications on DNA data sets. The first such case study is included in the Supplementary Material.
Availability: The databases, further details and the Supplementary Material are publicly available at http://csl.cs.byu.edu/mdsas/
Contact: hyrumdc{at}gmail.com
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
Multiple sequence alignments (MSAs) provide the foundation for much of the analysis in bioinformatics. They are the first step for everything from annotation of genomes to evolutionary studies. Because of this, it is crucial for automated alignment programs to generate highly accurate and biologically meaningful MSAs to ensure accuracy in subsequent steps in the research process.
Recently, a number of protein sequence databases have been presented to provide a benchmark for alignment algorithms: BAliBASE (Thompson et al., 2005), OXBench (Raghava et al., 2003), PREFAB (Edgar, 2004b) and SMART (Ponting et al., 1999). These databases leverage structural alignments to provide a suite of gold standard alignments. They are assumed to be the true alignments, and calculated alignments are evaluated by comparing against them. They have been well accepted by the scientific community and used in numerous studies to compare the quality of protein alignments generated by MSA programs (Do et al., 2005; Edgar, 2004a, b; Karplus and Hu, 2001; Lassmann and Sonnhammer, 2002, 2005; Thompson et al., 1999; Van Walle, 2004). These multiple protein sequence alignment (MPSA) benchmarks are, however, limited to the evaluation of protein alignment applications.
Rarely is a novel alignment technique assessed for its ability to align nucleotide data accurately. The shortage of assessments of MSAs with DNA data may be due to the lack of DNA reference alignments. Applications that work well on amino acid sequences may not be as accurate on DNA data sets. One solution to this problem would be to compare calculated nucleotide alignments against reference nucleotide alignments that are based on the biological features used in protein benchmarks.
Work has been done to address this lack of reference DNA alignments. Pollard et al. (2004) created a benchmarking tool for the alignment of non-protein-coding DNA using simulated data. While this benchmark gives researchers a starting point to evaluate DNA alignments, the degree to which the simulated sequences reflect those in nature is uncertain.
A gold standard benchmark of DNA alignments that is (1) composed of protein-coding portions of DNA and (2) based on biological features such as the tertiary structure of encoded proteins can help researchers assess the quality of DNA alignment algorithms. This article presents the first known collection of protein-coding DNA benchmark alignments that meet these criteria. A computational tool, MPSA2MDSA, was developed and used to convert the following MPSAs into multiple DNA sequence alignments (MDSAs): BAliBASE, OXBench, PREFAB and SMART.
| 2 MATERIALS AND METHODS |
|---|
Estimating an MDSA from an MPSA is a straightforward procedure that requires three steps. The first step is to find the best analogous DNA sequence (hit) for a protein sequence (query). We queried the September 2006 version of GenBank's nt database (Benson et al., 2005) with each of the protein sequences using the TBLASTN algorithm (Altschul et al., 1990). TBLASTN provides the accession number of the best hit. The DNA sequences are then retrieved from the nt database with fastacmd, an NCBI tool. The second step is to account for the occasional gaps introduced by the similarity search. The final step is to apply the alignment from the MPSA to the MDSA. This is done by inserting gaps that correspond with the gaps in the protein alignment. This step is important to preserve the alignment features obtained by higher-order methods (e.g. secondary and tertiary structure or chemical properties), that is, to preserve the higher-order benchmark alignment. By preserving the biological information, the DNA alignment can be considered a reference alignment. Each step is covered in more detail in the Supplementary Material.
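The final step, applying the protein alignment to the DNA, can be illustrated with a minimal sketch. This is not the MPSA2MDSA implementation; the function name `project_gaps` is hypothetical, and the sketch assumes that the retrieved DNA is already trimmed to the coding region, is in frame, and contains exactly three nucleotides per residue of the query.

```python
def project_gaps(aligned_protein: str, coding_dna: str, gap: str = "-") -> str:
    """Thread ungapped coding DNA onto one gapped row of a protein alignment.

    Assumption (not from the paper): coding_dna is in frame and holds
    exactly one codon per non-gap residue in aligned_protein.
    """
    residue_count = sum(1 for ch in aligned_protein if ch != gap)
    if len(coding_dna) != 3 * residue_count:
        raise ValueError("DNA length must be three times the residue count")

    columns = []
    position = 0  # index of the next unused nucleotide
    for ch in aligned_protein:
        if ch == gap:
            # One protein gap column becomes three nucleotide gap columns.
            columns.append(gap * 3)
        else:
            # One residue column becomes the codon that encodes it.
            columns.append(coding_dna[position:position + 3])
            position += 3
    return "".join(columns)


if __name__ == "__main__":
    # Two rows of a toy protein alignment and their coding DNA.
    print(project_gaps("M-K", "ATGAAA"))
    print(project_gaps("MFK", "ATGTTTAAA"))
```

Because every protein column maps to exactly three DNA columns, all rows of the resulting DNA alignment have the same length and the column structure of the protein benchmark is preserved.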
Two versions of each database are publicly available. The first version, mdsa_100s, includes only those data sets in which every sequence has a perfect match (100% sequence identity). This version ensures the highest level of integrity in the conversion. The second version, mdsa_all, includes all hits that pass the E-value threshold of 0.001. This version retains more of the MPSAs and aids in comparison with the original MPSAs.
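The distinction between the two versions can be sketched as a pair of filters. The `Hit` record and both function names are hypothetical, introduced only for illustration. The sketch also assumes the usual BLAST convention that a hit passes the cutoff when its E-value is at most 0.001, and that sequences without a passing hit are simply left out of mdsa_all; neither detail is confirmed by the text.

```python
from dataclasses import dataclass
from typing import List, Optional

E_VALUE_CUTOFF = 0.001  # threshold quoted in the text


@dataclass
class Hit:
    """Hypothetical record of the best TBLASTN hit for one protein query."""
    accession: str
    percent_identity: float
    e_value: float


def in_mdsa_100s(hits: List[Optional[Hit]]) -> bool:
    """A data set qualifies only if every sequence has a 100% identity hit."""
    return bool(hits) and all(
        hit is not None and hit.percent_identity == 100.0 for hit in hits
    )


def mdsa_all_hits(hits: List[Optional[Hit]]) -> List[Hit]:
    """Keep every hit that passes the cutoff (assumed: E-value <= cutoff)."""
    return [
        hit for hit in hits
        if hit is not None and hit.e_value <= E_VALUE_CUTOFF
    ]


if __name__ == "__main__":
    perfect = Hit("ACC0001", 100.0, 1e-50)
    partial = Hit("ACC0002", 87.5, 1e-20)
    data_set = [perfect, partial, None]
    print(in_mdsa_100s(data_set))        # one imperfect hit disqualifies the set
    print(len(mdsa_all_hits(data_set)))  # hits retained for mdsa_all
```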
For any heuristic, it is important to quantify the accuracy. Here, the accuracy can be measured by the sequence identity of the hit sequence. In general, as the sequence identity increases, so does the likelihood that the two sequences share the same tertiary structure. For this work, sequences that share 100% sequence identity are assumed to have the same tertiary structure. Sequences with the same tertiary structure will have the same alignment.
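As a rough illustration of the identity measure, the sketch below translates a DNA hit with the standard genetic code and compares it position by position with the protein query. This is a simplification: TBLASTN reports identity over its own local alignment, whereas this sketch assumes an in-frame, ungapped hit whose translation has the same length as the query. The function names are hypothetical.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def translate(dna: str) -> str:
    """Translate in-frame DNA; codons with ambiguous bases become 'X'."""
    dna = dna.upper()
    protein = []
    usable = len(dna) - len(dna) % 3  # ignore a trailing partial codon
    for i in range(0, usable, 3):
        codon = dna[i:i + 3]
        if all(base in BASES for base in codon):
            index = (16 * BASES.index(codon[0])
                     + 4 * BASES.index(codon[1])
                     + BASES.index(codon[2]))
            protein.append(AMINO_ACIDS[index])
        else:
            protein.append("X")
    return "".join(protein)


def percent_identity(query: str, translated_hit: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if not query or len(query) != len(translated_hit):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(query, translated_hit))
    return 100.0 * matches / len(query)


if __name__ == "__main__":
    hit = translate("ATGAAATTT")
    print(hit, percent_identity("MKF", hit))
```

Under the criterion stated above, only a hit whose identity is exactly 100% would be treated as sharing the tertiary structure, and hence the alignment, of its query.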
Using the nt database, 97.4% of the protein queries found a match that passed the E-value threshold of 0.001. Furthermore, 69.0% of these hits have 100% sequence identity with the query. Although the tool already finds a high percentage of exact matches in the current database, sequence databases are growing at an exponential rate, which should increase the number of analogous hits for protein queries.
In total, 3545 DNA reference alignments, comprising 68 581 sequences and 35 600 958 bases, are publicly available at http://csl.cs.byu.edu/mdsas/.
To illustrate the usefulness of the reference DNA databases, a case study of the performance and ranks of alignment programs on DNA data sets is included in the Supplementary Material (see also Table 1). Alignments and their respective scores were calculated for seven different multiple sequence alignment applications for each of the 3545 alignments.
| 3 CONCLUSION |
|---|
In this work, the first known databases of reference protein-coding DNA alignments are presented. These databases are constructed by leveraging the popular BLAST program to find DNA sequences corresponding to those found in multiple protein sequence alignments. The alignments of the protein sequences (which reflect higher-order information) are applied to the DNA sequences to qualify them as reference alignments. High-quality hits were obtained from public databases: over two-thirds of the queries found a perfect match in the nt database. Two versions of the converted databases are available. The first contains only hits that perfectly matched the query, and the comprehensive second version includes all hits that pass the cutoff threshold. These DNA reference alignment databases are publicly available. This benchmark should be useful in evaluating the quality of DNA alignments generated by existing and forthcoming MSA techniques. Finally, the first case study of DNA alignments evaluated by these reference alignments is included in the Supplementary Material.
| ACKNOWLEDGEMENTS |
|---|
We would like to thank Keith Crandall for his review and critiques of this work. This material is based upon work supported by the National Science Foundation under Grant No. 0120718.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: John Quackenbush
Received on April 7, 2007; revised on July 19, 2007; accepted on July 22, 2007
| REFERENCES |
|---|
Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol. (1990) 215:403–410.
Benson DA, et al. GenBank. Nucleic Acids Res. (2005) 33:D34–D38.
Do CB, et al. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. (2005) 15:330–340.
Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics (2004a) 5:113–131.
Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. (2004b) 32:1792–1797.
Karplus K, Hu B. Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics (2001) 17:713–720.
Lassmann T, Sonnhammer ELL. Quality assessment of multiple alignment programs. FEBS Lett. (2002) 529:126–130.
Lassmann T, Sonnhammer ELL. Automatic assessment of alignment quality. Nucleic Acids Res. (2005) 33:7120–7128.
Pollard DA, et al. Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics (2004) 5.
Ponting C, et al. SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res. (1999) 27:229–232.
Raghava GPS, et al. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics (2003) 4:47.
Thompson JD, et al. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. (1999) 27:2682–2690.
Thompson JD, et al. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins: Struct. Funct. Bioinformatics (2005) 61:127–136.
Van Walle I. Align-m–a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics (2004) 20:1428–1435.