Skip Navigation


Bioinformatics Advance Access originally published online on March 25, 2007
Bioinformatics 2007 23(11):1427-1428; doi:10.1093/bioinformatics/btm095
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/11/1427    most recent
btm095v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Cartwright, R. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Cartwright, R. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Ngila: global pairwise alignments with logarithmic and affine gap costs

Reed A. Cartwright *

Department of Genetics, University of Georgia, Athens, GA 30602-7223, USA
Present address: Bioinformatics Research Center, North Carolina State University, Campus Box 7566, Raleigh, NC 27695-7566, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 STATISTICAL ALIGNMENT
 4 PERFORMANCE
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: Ngila is an application that will find the best alignment of a pair of sequences using log-affine gap costs, which are the most biologically realistic gap costs.

Availability: Portable source code for Ngila can be downloaded from its development website, http://scit.us/projects/ngila/. It compiles on most operating systems.

Contact: racartwr{at}ncsu.edu or reed{at}scit.us

Supplementary information: Appendices


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 STATISTICAL ALIGNMENT
 4 PERFORMANCE
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Over the last two decades, several molecular studies have demonstrated that indel lengths obey a power law, i.e. the logarithms of indel lengths are linearly related to the logarithms of their frequency (Benner et al., 1993; Chang and Benner, 2004; Gonnet et al., 1992; Gu and Li, 1995; Zhang and Gerstein, 2003). Taking this observation into account and using biologically accurate models of indel lengths can improve alignment accuracy. Under a Zipfian power law distribution, the probability that an observation is Formula is Formula , where Formula is a parameter and Formula is Riemann’s Zeta Function. The mean and variance of this distribution are undefined (i.e. infinite) for Formula and Formula , respectively.

Based on this power law, some researchers have argued that logarithmic gap costs, Formula , may be appropriate for sequence alignments (Gonnet et al., 1992; Gu and Li, 1995; Waterman, 1984). However, Cartwright (2006) has demonstrated that log-affine gap costs, Formula , are more accurate than both affine and logarithmic gap costs when indels follow a power law. Even though it is straight forward to implement monotonic gap costs using Miller and Myers (1988) and while Mott (1999) provided the application MONOTONE for finding local alignments using logarithmic costs, there are no applications available that can align sequences globally using logarithmic or log-affine gap costs. Here we present Ngila, a portable implementation of Miller and Myers (1988) that can globally align pairs of sequences using logarithmic and affine gap costs.


    2 IMPLEMENTATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 STATISTICAL ALIGNMENT
 4 PERFORMANCE
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Needleman and Wunsch (1970) provided a method of sequence alignment where the maximum similarity of a pair of sequences was calculated using dynamic programming. Sellers (1974) provided an alternative approach where the minimum distance of the sequence pair was calculated instead. These algorithms took O(mn) time and O(mn) space where m and n are the lengths of the two sequences being aligned. Waterman et al. (1976) generalized Sellers’s metric to arbitrary gap weights, producing an Formula algorithm, referred to as WSB. Gotoh (1982) demonstrated that with affine gap penalties the WSB algorithm can be run in O(mn) time.

Waterman (1984) subsequently modified the WSB algorithm into the candidate-list method for concave gap weights, which could simplify the search process. Waterman’s candidate list method was further refined by Miller and Myers (1988) into an O(mn) algorithm. In addition, Miller and Myers (1988) pointed out how their algorithm could be modified using Hirschberg’s (1975) divide-and-conquer method to produce alignments in Formula time but with O(m) space.

Ngila implements the Miller and Myers (1988) algorithm in order to find a least costly global alignment of two sequences given homology costs and a gap cost of Formula . Two versions of the algorithm are included: holistic and divide-and-conquer. The former is faster but the latter utilizes less memory. Ngila starts with the divide-and-conquer method but switches to the holistic method for subsequences smaller than a user-established threshold. This improves its speed without substantially increasing memory requirements. Ngila also allows users to assign costs to end gaps that are smaller than costs for internal gaps. This is important for aligning using the free-end-gap method.


    3 STATISTICAL ALIGNMENT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 STATISTICAL ALIGNMENT
 4 PERFORMANCE
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The algorithm in Ngila finds the minimum cost of aligning two sequences (Sellers, 1974) instead of the maximum similarity of aligning two sequences (Needleman and Wunch, 1970). An extension of the method of Smith et al. (1981) can be used to convert a maximum search to a minimum search (Holmes and Durbin, 1998). This is important when using Ngila to find maximum likelihood alignments from a statistical model. One advantage of statistical models is the ability to estimate parameters without circularity (Thorne et al., 1991).

Based on a statistical model, the scores of ‘matches’ of type i, {alpha}i, and the penalties of gaps of length k, wk, can be used to calculate the alignment with maximum log-likelihood: Formula , where {eta}i is the number of residue matches of type i and {Delta}k is the number of gaps of length k. A minimum cost analog of this equation is Formula , where Formula is the cost of a match of type i, Formula is the cost of a gap of length k, and x and y are arbitrary constants (Cartwright, 2006). The above result allows minimum cost algorithms like Ngila to be used to calculate maximum likelihood alignments. Note that such conversion will also change the cost of free end gaps of length k to xk/2y.

Applying the above conversion to a simple statistical model developed in Cartwright (2006) suggests that the following costs are biologically appropriate: ßmatch = 0 , ßmismatch = 1, and


Formula 1

(1)
Here, {theta} is the expected number of substitutions per residue between a pair of sequences, {lambda} is the expected number of indels per substitution and z is the slope parameter of the power law of indel lengths. We are currently developing an expectation-maximization (EM) algorithm to robustly estimate these parameters from comparative genomic sequences.


    4 PERFORMANCE
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 STATISTICAL ALIGNMENT
 4 PERFORMANCE
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Using human ADH1C intron 7 (NC_000004 [GenBank] .10: 100488859–100489717 complement) and mouse ADH1 intron 7 (NC_000069.4: 1138226273–138227189), an EM algorithm summed across all possible alignments and estimated Formula {lambda} = 0.213542 and z = 1.80675 (Cartwright, unpublished data). Using these values, the sequence simulation program, Dawg (Cartwright, 2005), created a dataset of 1000 pairwise alignments. The sequences in these alignments were truncated to the lengths shown in Table 1. These were then aligned using a log-affine gap cost calculated via Equation (1) and an affine cost derived from the log-affine gap cost by least squares. Speed and alignment quality were calculated and are summarized in Table 1. As expected, the time complexity of Ngila is roughly O(mn), and log-affine costs are slower but produce better alignments.


View this table:
[in this window]
[in a new window]

 
Table 1. Comparison of log-affine and affine gap costs

 

    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 STATISTICAL ALIGNMENT
 4 PERFORMANCE
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Ngila is an application that will calculate the best alignment between a pair of sequences using log-affine gap costs, which are very accurate and biologically realistic. It has the potential to advance research into alignment algorithms and provide more biologically accurate alignments.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 STATISTICAL ALIGNMENT
 4 PERFORMANCE
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This research was funded by an NSF Predoctoral Fellowship and N.I.H. grant GM070806. The author would like to thank Ben Redelings, Jeff Thorne and two anonymous reviewers for their helpful suggestions.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on October 31, 2006; revised on February 12, 2007; accepted on March 7, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 IMPLEMENTATION
 3 STATISTICAL ALIGNMENT
 4 PERFORMANCE
 5 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Benner SA, et al. Empirical and structural models for insertions and deletions in the divergent evolution of proteins. J. Mol. Biol (1993) 229:1065–1082.[CrossRef][Web of Science][Medline]

    Cartwright RA. DNA assembly with gaps (Dawg): Simulating sequence evolution. Bioinformatics (2005) 22(Suppl. 3):iii31–iii38.

    Cartwright RA. Logarithmic gap costs decrease alignment accuracy. BMC Bioinformatics (2006) 7:527.[CrossRef][Medline]

    Chang MSS, Benner SA. Empirical analysis of protein insertions and deletions determining parameters for the correct placement of gaps in protein sequence alignments. J. Mol. Biol (2004) 341:617–631.[CrossRef][Web of Science][Medline]

    Gonnet GH, et al. Exhaustive matching of the entire protein sequence database. Science (1992) 256:1443–1445.[Abstract/Free Full Text]

    Gotoh O. An improved algorithm for matching biological sequences. J. Mol. Biol (1982) 162:705–708.[CrossRef][Web of Science][Medline]

    Gu X, Li WH. The size distribution of insertions and deletions in human and rodent pseudogenes suggests the logarithmic gap penalty for sequence alignment. J. Mol. Evol (1995) 40:464–473.[CrossRef][Web of Science][Medline]

    Hirschberg DS. A linear space algorithm for computing maximal common subsequences. Commun. ACM (1975) 18:341–343.[CrossRef]

    Holmes I, Durbin R. Dynamic programming alignment accuracy. J. Comput. Biol (1998) 5:493–504.[Web of Science][Medline]

    Miller W, Myers EW. Sequence comparison with concave weighting functions. Bull. Math. Biol (1988) 50:97–120.[Web of Science][Medline]

    Mott R. Local sequence alignments with monotonic gap penalties. Bioinformatics (1999) 15:455–462.[Abstract/Free Full Text]

    Needleman SB, Wunsch CD. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol (1970) 48:443–453.[CrossRef][Web of Science][Medline]

    Sellers PH. On the theory and computation of evolutionary distances. SIAM J. Appl. Math (1974) 26:787–793.[CrossRef]

    Smith TF, et al. Comparative biosequence metrics. J. Mol. Evol (1981) 18:38–46.[CrossRef][Web of Science][Medline]

    Thorne JL, et al. An evolutionary model for maximum-likelihood alignment of DNA-sequences. J. Mol. Evol (1991) 33:114–124.[CrossRef][Web of Science][Medline]

    Waterman MS. Efficient sequence alignment algorithms. J. Theor. Biol (1984) 108:333–337.[Web of Science][Medline]

    Waterman MS, et al. Some biological sequence metrics. Adv. Math (1976) 20:367–387.[CrossRef]

    Zhang Z, Gerstein M. Patterns of nucleotide substitution, insertion and deletion in the human genome inferred from pseudogenes. Nucleic Acids Res (2003) 31:5338–5348.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Mol Biol EvolHome page
R. A. Cartwright
Problems and Solutions for Estimating Indel Rates and Length Distributions
Mol. Biol. Evol., February 1, 2009; 26(2): 473 - 480.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/11/1427    most recent
btm095v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Cartwright, R. A.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Cartwright, R. A.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?