Bioinformatics Advance Access originally published online on August 8, 2007
Bioinformatics 2007 23(19):2648-2649; doi:10.1093/bioinformatics/btm389

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

DNA reference alignment benchmarks based on tertiary structure of encoded proteins

Hyrum Carroll 1,*, Wesley Beckstead 2, Timothy O'Connor 2, Mark Ebbert 1,2, Mark Clement 1, Quinn Snell 1 and David McClellan 1

1Computer Science Department and 2Biology Department, Brigham Young University, Provo, Utah 84602, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Multiple sequence alignments (MSAs) are at the heart of bioinformatics analysis. Recently, a number of multiple protein sequence alignment benchmarks (i.e. BAliBASE, OXBench, PREFAB and SMART) have been released to evaluate new and existing MSA applications. These databases have been well received by researchers and help to quantitatively evaluate MSA programs on protein sequences. Unfortunately, analogous DNA benchmarks are not available, making evaluation of MSA programs difficult for DNA sequences.

Results: This work presents the first known multiple DNA sequence alignment benchmarks that are (1) composed of protein-coding portions of DNA and (2) based on biological features such as the tertiary structure of encoded proteins. These reference DNA databases contain a total of 3545 alignments, comprising 68 581 sequences. Two versions of the database are available: mdsa_100s and mdsa_all. The mdsa_100s version contains the alignments of the data sets for which TBLASTN found a hit with 100% sequence identity for every sequence. The mdsa_all version includes all hits with an E-value below the threshold of 0.001. A primary use of these databases is to benchmark the performance of MSA applications on DNA data sets. The first such case study is included in the Supplementary Material.

Availability: The databases, further details and the Supplementary Material are publicly available at http://csl.cs.byu.edu/mdsas/

Contact: hyrumdc{at}gmail.com

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
Multiple sequence alignments (MSAs) provide the foundation for much of the analysis in bioinformatics. They are the first step for everything from annotation of genomes to evolutionary studies. Because of this, it is crucial for automated alignment programs to generate highly accurate and biologically meaningful MSAs to ensure accuracy in subsequent steps in the research process.

Recently, a number of protein sequence databases have been presented to provide a benchmark for alignment algorithms: BAliBASE (Thompson et al., 2005), OXBench (Raghava et al., 2003), PREFAB (Edgar, 2004b) and SMART (Ponting et al., 1999). These databases leverage structural alignments to provide a suite of ‘gold standard’ alignments. They are assumed to be the ‘true’ alignments, and calculated alignments are evaluated by comparing against them. They have been well accepted by the scientific community and used in numerous studies to compare the quality of protein alignments generated by MSA programs (Do et al., 2005; Edgar, 2004a, b; Karplus and Hu, 2001; Lassmann and Sonnhammer, 2002, 2005; Thompson et al., 1999; Van Walle, 2004). These multiple protein sequence alignment (MPSA) benchmarks, however, are limited to the evaluation of protein alignment applications.

Rarely is a novel alignment technique assessed for its ability to align nucleotide data accurately. The shortage of assessments of MSAs with DNA data may be due to the lack of DNA reference alignments. Applications that work well on amino acid sequences may not be as accurate on DNA data sets. One solution to this problem would be to compare calculated nucleotide alignments against reference nucleotide alignments that are based on the biological features used in protein benchmarks.

Work has been done to address this lack of reference DNA alignments. Pollard et al. (2004) created a benchmarking tool for the alignment of non-protein coding DNA using simulated data. While this benchmark gives researchers a starting point to evaluate DNA alignments, the degree to which the simulated sequences reflect those in nature is uncertain.

A ‘gold standard’ benchmark of DNA alignments that is (1) composed of protein-coding portions of DNA and (2) based on biological features such as the tertiary structure of encoded proteins can help researchers assess the quality of DNA alignment algorithms. This article presents the first known collection of protein-coding DNA benchmark alignments that meet these criteria. A computational tool, MPSA2MDSA, was developed and used to convert the following MPSAs into multiple DNA sequence alignments (MDSAs): BAliBASE, OXBench, PREFAB and SMART.


    2 MATERIALS AND METHODS
Estimating an MDSA from an MPSA is a straightforward procedure that requires three steps. The first step is to find the best analogous DNA sequence (hit) for each protein sequence (query). We queried the September 2006 version of GenBank's nt database (Benson et al., 2005) with each of the protein sequences using the TBLASTN algorithm (Altschul et al., 1990). TBLASTN provides the accession number of the best hit. The DNA sequences are then retrieved from the nt database with fastacmd, an NCBI tool. The second step is to account for the occasional gaps introduced by the similarity search. The final step is to apply the alignment from the MPSA to the MDSA. This is done by inserting gaps in the DNA sequences that correspond to the gaps in the protein alignment. This step preserves the alignment features obtained by higher-order methods (e.g. secondary and tertiary structure or chemical properties); in other words, it preserves the higher-order benchmark alignment. Because the biological information is preserved, the DNA alignment can be considered a reference alignment. Each step is covered in more detail in the Supplementary Material.
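The final step can be illustrated with a minimal Python sketch. It is not the MPSA2MDSA implementation, whose details are given in the Supplementary Material. It assumes that the DNA hit has already been trimmed to the in-frame coding region, so that it holds exactly one codon per residue, and that each gap in the protein alignment becomes three gap characters in the DNA alignment. The function name project_alignment and the example sequences are hypothetical.

```python
def project_alignment(aligned_protein: str, coding_dna: str, gap: str = "-") -> str:
    """Thread an in-frame coding DNA sequence through a gapped protein row.

    Assumption (not from the paper): coding_dna contains exactly one codon
    per non-gap residue of aligned_protein, in the same order.
    """
    residue_count = sum(1 for symbol in aligned_protein if symbol != gap)
    if len(coding_dna) != 3 * residue_count:
        raise ValueError(
            f"expected {3 * residue_count} bases for {residue_count} residues, "
            f"got {len(coding_dna)}"
        )

    pieces = []
    position = 0  # index of the next unused base in coding_dna
    for symbol in aligned_protein:
        if symbol == gap:
            # One protein gap column corresponds to three DNA gap columns.
            pieces.append(gap * 3)
        else:
            # Copy the codon that encodes this residue.
            pieces.append(coding_dna[position:position + 3])
            position += 3
    return "".join(pieces)


if __name__ == "__main__":
    # Two rows of a toy protein alignment and their (hypothetical) DNA hits.
    print(project_alignment("M-KV", "ATGAAAGTT"))  # ATG---AAAGTT
    print(project_alignment("MAK-", "ATGGCTAAA"))  # ATGGCTAAA---
```

Because every protein column expands to exactly three DNA columns, all rows of the resulting DNA alignment keep the same length, and the column structure of the protein benchmark is carried over unchanged.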

Two versions of each database are publicly available. The first version, mdsa_100s, includes only those data sets in which every sequence has a perfect match (100% sequence identity). This version ensures the highest level of integrity in the conversion. The second version, mdsa_all, includes all hits with an E-value below the threshold of 0.001. This version retains more of the MPSAs and aids in comparison with the original MPSAs.
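The rule separating the two versions can be sketched as follows. This is an illustration only, under two assumptions that the text does not spell out: a hit passes the threshold when its E-value is smaller than 0.001 (the usual BLAST convention), and a query without an acceptable hit is simply left out of mdsa_all. The Hit record, its field names and both function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional

E_VALUE_THRESHOLD = 0.001  # threshold quoted in the text


@dataclass(frozen=True)
class Hit:
    """Best TBLASTN hit for one protein query (hypothetical record layout)."""
    accession: str
    percent_identity: float
    e_value: float


def keep_for_mdsa_all(hits: Dict[str, Optional[Hit]]) -> Dict[str, Hit]:
    """Keep every query whose best hit passes the E-value threshold."""
    return {
        query: hit
        for query, hit in hits.items()
        if hit is not None and hit.e_value < E_VALUE_THRESHOLD
    }


def qualifies_for_mdsa_100s(hits: Dict[str, Optional[Hit]]) -> bool:
    """A data set qualifies only if every query has a 100% identity hit."""
    return bool(hits) and all(
        hit is not None and hit.percent_identity == 100.0
        for hit in hits.values()
    )


if __name__ == "__main__":
    # Toy data set: accessions and scores are made up for illustration.
    data_set = {
        "seq1": Hit("ACC0001", 100.0, 1e-50),
        "seq2": Hit("ACC0002", 98.5, 1e-30),
        "seq3": None,  # no hit found
    }
    print(sorted(keep_for_mdsa_all(data_set)))   # ['seq1', 'seq2']
    print(qualifies_for_mdsa_100s(data_set))     # False
```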

For any heuristic, it is important to quantify the accuracy. Here, the accuracy can be measured by the sequence identity of the hit sequence. In general, as the sequence identity increases, so does the likelihood that the two sequences share the same tertiary structure. For this work, sequences that share 100% sequence identity are assumed to have the same tertiary structure. Sequences with the same tertiary structure will have the same alignment.

Using the nt database, 97.4% of the protein queries found a match with an E-value below the threshold of 0.001. Furthermore, 69.0% of these hits have 100% sequence identity with the query. The tool already finds a high percentage of exact matches in the current database; because sequence databases are growing at an exponential rate, the number of analogous hits for protein queries should increase further.

In total, 3545 DNA reference alignments, comprising 68 581 sequences and 35 600 958 bases, are publicly available at http://csl.cs.byu.edu/mdsas/.

To illustrate the usefulness of the reference DNA databases, a case study of the performance and ranks of alignment programs on DNA data sets is included in the Supplementary Material (see also Table 1). Alignments and their respective scores were calculated for seven different multiple sequence alignment applications for each of the 3545 alignments.
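The text does not define the Q and TC scores reported in Table 1. The sketch below uses one common formulation, which is an assumption here: Q is the fraction of residue pairs aligned in the reference that are also aligned in the calculated alignment, and TC is the fraction of reference columns that are reproduced intact. Reference columns with fewer than two residues are skipped, and both alignments are assumed to contain the same sequences in the same order. The function names are hypothetical, and the case study may have used different software or conventions.

```python
from itertools import combinations
from typing import List, Sequence, Tuple

GAP = "-"


def residue_columns(alignment: Sequence[str]) -> List[List[int]]:
    """For each row, list the column index of every residue, in order."""
    return [[col for col, ch in enumerate(row) if ch != GAP] for row in alignment]


def q_and_tc(reference: Sequence[str], test: Sequence[str]) -> Tuple[float, float]:
    """Return (Q, TC) of a test alignment measured against a reference."""
    test_cols = residue_columns(test)
    consumed = [0] * len(reference)  # residues already visited in each row
    pairs_total = pairs_correct = 0
    cols_total = cols_correct = 0

    for col in range(len(reference[0])):
        # Test-alignment column of each residue found in this reference column.
        placed = []
        for seq, row in enumerate(reference):
            if row[col] != GAP:
                placed.append(test_cols[seq][consumed[seq]])
                consumed[seq] += 1
        if len(placed) < 2:
            continue  # no residue pair to score in this column

        cols_total += 1
        cols_correct += int(len(set(placed)) == 1)
        for first, second in combinations(placed, 2):
            pairs_total += 1
            pairs_correct += int(first == second)

    q = pairs_correct / pairs_total if pairs_total else 0.0
    tc = cols_correct / cols_total if cols_total else 0.0
    return q, tc


if __name__ == "__main__":
    reference = ["ACGT", "A-GT", "ACG-"]
    calculated = ["ACGT", "AG-T", "ACG-"]
    print(q_and_tc(reference, calculated))  # (0.75, 0.75)
```

In the toy example the calculated alignment misplaces one G, which breaks two of the eight reference residue pairs and one of the four scored columns.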


Table 1. Q score, TC score and CPU time ranks

 

    7 CONCLUSION
In this work, the first known databases of reference protein-coding DNA alignments are presented. These databases are constructed by leveraging the popular BLAST program to find DNA sequences corresponding to those found in multiple protein sequence alignments. The alignments of the protein sequences (which reflect higher-order information) are applied to the DNA sequences to qualify them as reference alignments. High-quality hits were obtained from public databases: over two-thirds of the queries found a perfect match in the nt database. Two versions of the converted databases are available: the first contains only hits that perfectly matched the query, and the comprehensive second version includes all hits that pass the cutoff threshold. These DNA reference alignment databases are publicly available. This benchmark will be useful in evaluating the quality of DNA alignments generated by existing and forthcoming MSA techniques. Finally, the first case study of DNA alignments evaluated by these reference alignments is included in the Supplementary Material.


    ACKNOWLEDGEMENTS
We would like to thank Keith Crandall for his review and critiques of this work. This material is based upon work supported by the National Science Foundation under Grant No. 0120718.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on April 7, 2007; revised on July 19, 2007; accepted on July 22, 2007

    REFERENCES

    Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol. (1990) 215:403–410.

    Benson DA, et al. GenBank. Nucleic Acids Res. (2005) 33:D34–D38.

    Do CB, et al. ProbCons: probabilistic consistency-based multiple sequence alignment. Genome Res. (2005) 15:330–340.

    Edgar RC. MUSCLE: a multiple sequence alignment method with reduced time and space complexity. BMC Bioinformatics (2004a) 5:113–131.

    Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. (2004b) 32:1792–1797.

    Karplus K, Hu B. Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics (2001) 17:713–720.

    Lassmann T, Sonnhammer ELL. Quality assessment of multiple alignment programs. FEBS Lett. (2002) 529:126–130.

    Lassmann T, Sonnhammer ELL. Automatic assessment of alignment quality. Nucleic Acids Res. (2005) 33:7120–7128.

    Pollard DA, et al. Benchmarking tools for the alignment of functional noncoding DNA. BMC Bioinformatics (2004) 5.

    Ponting C, et al. SMART: identification and annotation of domains from signalling and extracellular protein sequences. Nucleic Acids Res. (1999) 27:229–232.

    Raghava GPS, et al. OXBench: a benchmark for evaluation of protein multiple sequence alignment accuracy. BMC Bioinformatics (2003) 4:47.

    Thompson JD, et al. A comprehensive comparison of multiple sequence alignment programs. Nucleic Acids Res. (1999) 27:2682–2690.

    Thompson JD, et al. BAliBASE 3.0: latest developments of the multiple sequence alignment benchmark. Proteins: Struct. Funct. Bioinformatics (2005) 61:127–136.

    Van Walle I. Align-m–a new algorithm for multiple alignment of highly divergent sequences. Bioinformatics (2004) 20:1428–1435.

