Skip Navigation


Bioinformatics Advance Access originally published online on February 23, 2008
Bioinformatics 2008 24(7):897-900; doi:10.1093/bioinformatics/btn052
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
24/7/897    most recent
btn052v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Malde, K.
PubMed
Right arrow PubMed Citation
Right arrow Articles by Malde, K.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

The effect of sequence quality on sequence alignment

Ketil Malde *

Institute of Marine Research, Bergen, Norway

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 

Motivation: The nucleotide sequencing process produces not only the sequence of nucleotides, but also associated quality values. Quality values provide valuable information, but are primarily used only for trimming sequences and generally ignored in subsequent analyses.

Results: This article describes how the scoring schemes of standard alignment algorithms can be modified to take into account quality values to produce improved alignments and statistically more accurate scores. A prototype implementation is also provided, and used to post-process a set of BLAST results. Quality-adjusted alignment is a natural extension of standard alignment methods, and can be implemented with only a small constant factor performance penalty. The method can also be applied to related methods including heuristic search algorithms like BLAST and FASTA.

Availability: Software is available at http://malde.org/~ketil/qaa.

Contact: ketil.malde{at}imr.no

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 
Nucleotide sequencing is a process that is subject to errors, and when base calling software analyses chromatograms to determine the sequence, they commonly output quality values that estimates the error probability of each position. Although errors in the sequences are frequent, the quality values have been shown to accurately reflect the probability of incorrect base calls (Ewing and Green, 1998; Walther et al., 2001). Quality values thus provide important information to subsequent stages in the analysis process.

Quality values are commonly used in sequence assembly (Green, 1999; Huang and Madan, 1999) to assist in determining the correct base for the consensus sequence. A similar use is in SNP analysis (e.g. Marth et al. 1999), where using quality information can help to differentiate between read errors and true polymorphism.

Aside from the sequence assembly and SNP analysis, quality information is rarely utilized directly, and in particular, the ubiquitous database search tools like BLAST (Altschul et al., 1990) and FASTA (Lipman and Pearson, 1988) make no particular provision for sequence quality. Consequently, it is sometimes suggested that low-quality fragments of sequences be removed before searching, a process often referred to as trimming.

There exists a multitude of tools to perform quality-based trimming—e.g. Pregap4 from the Staden package (Staden et al., 1998) and Lucy and Lucy2 (Chou and Holmes, 2001; Li and Chou, 2004), and Phred also incorporates sequence trimming information in its output (Ewing et al., 1998)—trimming puts the user in the unfortunate situation to choose between retaining dubious sequence parts or discard a potentially large fraction of the data set. The former will reduce specificity of searches, while the latter will reduce sensitivity, and striking on optimal balance can be difficult.

This article describes how standard sequence alignment algorithms can be extended to incorporate quality information in the alignments. The current implementation supports the standard dynamic programming algorithms for local and global alignment (Gotoh, 1982; Needleman and Wunsch, 1970; Smith and Waterman, 1981), but the method extends naturally to more commonly used heuristic methods like BLAST.

1.1 Substitution matrices
A substitution matrix defines the score for the substitution of each possible pair of letters, including the substitution of a letter with itself. For nucleotide sequences, often a uniform background distribution is assumed and the substitution matrix is simplified to two values: a positive score (or reward) in the case of identical letters and a negative score (penalty) for a mismatch.

Substitution matrices are normally calculated with log-odds scores. If the substitution x -> y occurs with frequency qxy in alignments, and x and y occur with respective frequencies fx and fy in the data set, the substitution is given the score (Altschul, 1991)


Formula 1

(1)

The scaling factor {lambda} can be chosen freely, but a common choice is {lambda} = ln 2, which scales the alignment scores to bit units. All scores reported here will use this choice for {lambda}. When the background distribution is uniform with each letter occurring with frequency f, and noticing that qxy = fx P(x -> y|x), the scores can be simplified to sxy = log2(P(x -> y|x)/f).

For nucleotides, we assume uniform background distribution of P = 0.25, and probability {varepsilon} for errors independent of base. Each of the three possible incorrect bases thus has probability {varepsilon}/3, and substituting in Equation (1) gives


Formula 2

(2)

A substitution matrix corresponds to a specific distance between the sequences being aligned. In most cases, this distance is evolutionary distance, but the principle applies equally well to distance incurred by random errors.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 
For each position in the called sequence, the base calling software outputs a quality value that corresponds to the estimated probability of an incorrect base call. A given quality Q corresponds to the error probability {varepsilon} where {varepsilon} = 10Q/10.

In order to determine the score of a substitution between two sequences at a position where the sequences have error rates of {varepsilon}1 and {varepsilon}2, we first calculate a combined error rate


Formula 3

(3)
since there is a probability of Formula , there is a match between two erroneous base calls.

The combined error in Equation (3) can then be applied to Equation (2). The resulting function generates the substitution scores dynamically for each pair of positions, effectively adjusting the substitution matrix for each position depending on the reliability of the base calls. Modifying the standard dynamic programming algorithms to use this function instead of static values is straightforward, and it should apply equally to similar heuristic methods.

2.1 Implementation
In the current implementation, quality adjustment is applied to the standard dynamic programming algorithms with affine gap penalties for local and global alignment. Quality adjustment is currently supported for nucleotide sequences only, and as discussed above, a uniform distribution of nucleotides is assumed.

Calculating substitution scores as a function of quality values is a computationally expensive operation, and the substitution scores for quality values from 0 to 99 are pre-calculated and stored in a table. Although the present implementation has not been optimized, this allows the quality adjusted algorithm to run with a speed close to the standard algorithm with fixed substitution scores.

BLASTN uses a default match reward of 1 and mismatch penalty of –3. This corresponds to comparing sequences with constant quality values of approximately 22. (This is equivalent to a single sequence error rate of 0.63%, and a combined error rate of 1.25%.) For sequences where no quality value is provided, the implementation therefore defaults to a quality value of 22.

Given that low-quality sequence is likely to contain insertion or deletion errors, a reasonable argument can be made for reducing the severity of gap penalties according to quality. Unfortunately, the theory for choosing gap penalties is less developed than for substitution scores, and the current implementation makes the conservative choice of retaining fixed values for gap opening (–10) and gap extension (–4).


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 
EST clustering is a common application that relies on comparing nucleotide sequences. ESTs are typically clustered using sequence similarity, attempting to group sequences that originate from the same gene. In practice, sequences that are found to have an adequate match are clustered together, and consequently, sequences that fail to have such matches remain as singletons.

Note that the comparison is usually not simply an alignment score, but often based on specific limits on, e.g. sequence identity, alignment length and unaligned overlap. In addition, chimeric sequences and splice variants can match other sequences partially. It is therefore not uncommon that singletons have good matches against other clustered sequences, and this does not necessarily reflect weaknesses in the clustering algorithm.

As an example usage of sequence alignment with quality adjustment, we will investigate the clustering of a set of 33 852 EST sequences from sea louse (Lepeophtheirus salmonis). Clustering these sequences using TGICL (Pertea et al., 2003) with default options resulted in 3441 contigs and 14 185 singletons.

The sequences were classified according to average quality score, and sequences with an average quality score of less than 15 were selected for further study. Of 6003 low-quality sequences, only 384 were clustered with other sequences, while 5619 low-quality sequences were left as singletons by the clustering process.

The low-quality sequences were aligned against the contigs using BLASTN with the e-value threshold set to 10–5. Each match was then recalculated using Smith–Waterman local alignment, and using quality-adjusted Smith–Waterman.

Figure 1 shows the alignment scores for Smith–Waterman and quality-adjusted Smith–Waterman plotted against the BLAST scores for the same sequence pair. As expected, the standard Smith–Waterman alignment achieves scores close to the corresponding BLAST score, while the quality adjusted alignment tends to achieve higher scores.


Figure 1
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Alignment scores as a function of BLASTN scores. The alignments are calculated using Smith–Waterman local alignment and BLASTN default scores (red) and alignment scores using quality-adjusted alignment scores (blue). The line is f(x) = x, indicating equal BLAST and Smith-Waterman alignment scores.

 
Of the 1285 distinct sequences that scored hits above the given threshold of 10–5, 530 achieved BLAST scores higher than 300 bits. For quality-adjusted alignment, 765 sequences scored higher than 300 bits.

Figure 2 shows the alignment score as a function of alignment length. For short alignments, scores close to the optimum of two bits per nucleotide can be seen—representing alignments involving a short but high-quality segment.


Figure 2
View larger version (24K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Alignment scores as a function of alignment lengths for Smith–Waterman local alignments with default BLASTN parameters (red), and using quality-adjusted alignment scores (blue).

 
Figure 3 shows the difference in alignment score as a function of the difference in alignment length. The quality-adjusted alignment tends to achieve somewhat longer alignments by being able to extend more easily across low-quality regions, but for the majority of alignments, the difference is small. Alignment score is generally substantially higher for quality adjusted alignment, reflecting that most sequencing errors occur in positions with low-quality values, and that high-quality positions tend to match.


Figure 3
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Difference in alignment score and alignment length between Smith–Waterman local alignment, and the quality-adjusted local alignment. Positive values indicate that the quality-adjusted alignments are longer or higher scoring.

 
Figure 4 shows the quality of an example sequence. BLASTN reports a match between this sequence and a contig from position 202 to position 337, with an additional short, exact match from position 428 to 444. Local alignment with Smith–Waterman identifies the former as the highest scoring match. Quality-adjusted alignment identifies a match from position 192 to 549. This alignment scores 138.7 bits, compared to 48.1 bits with BLASTN and 48.9 bits with Smith–Waterman alignment.


Figure 4
View larger version (20K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. The quality values of a sequence plotted against sequence position. Each position is represented by one point, the dashed line shows the sliding average over a 20-base window. The vertical bars show the trimming parameters suggested by Phred, indicating a high-quality segment between positions 49 and 253.

 

    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 
As we have seen, quality information can be incorporated in standard sequence alignment algorithms. This generally improves the accuracy of score estimation, and in some cases, reveals long alignments, typically because they include regions of low quality which would otherwise terminate the alignment prematurely.

As noted by States et al. (1991), using scoring matrices with the correct PAM distance (Dayhoff et al., 1978) is important to achieve correct results. As quality tend to vary substantially along the sequence length, a static substitution matrix will necessarily be suboptimal, at least for parts of the sequence. For instance, the average sequence quality of the sequence from Figure 4 is 12.7. Running BLASTN with the corresponding parameters –q –2 –r 4 gives a match from position 192 to 376, which is longer than the default parameters, but shorter than the quality-adjusted alignment.

Sequence assembly is one main area where quality information is routinely used, but only for calculating the final consensus. As trimming is often conservative in order to retain as much sequence as possible, sequences often have low-quality regions, typically near the ends, which complicates the overlap alignments used in the assembly process. One way to address this is to have a configurable overhang parameter which allows matches to have an unaligned segment below a certain length at the end of the sequence (Pertea et al., 2003). Using a quality-based alignment method is likely to produce more accurate alignments, while eliminating this parameter.

A similar problem occurs when ESTs are matched against the genome to build clusters (Malde and Sczyrba, unpublished). ESTs occasionally match the genome with a short match between the end of the EST and a position genome sequence at considerable distance downstream from the rest of the match. In many cases, this is more likely to be caused by poor quality of the EST sequence than by a very long intron. Using a quality-adjusted alignment would help to classify this correctly.

Database searching with nucleotide sequences is an important tool for annotating unidentified sequences. When a poor quality sequence is aligned using static matching scores, matches will often be terminated prematurely. In addition to the incorrect significance estimate, this means that even if the sequence has a full length match in the database, the short alignments can easily be mistaken as matches against conserved domains, motifs or binding sites. In contrast, quality-adjusted alignments can bridge or incorporate the low-quality segments in the alignment, and thus help to provide more accurate annotations.

Quality adjusted scores can be implemented as a table look-up, which incurs a constant time and space penalty compared to a static matrix. One optimization commonly used is scaling the scoring matrix (by careful selection of the {lambda} parameter) in order to use integer calculations. The quality adjusted matrix requires a much wider range of values, and it is not clear whether an integer approximation is practical. Using floating point values can incur an additional constant time penalty.

The current implementation provides local (Smith–Waterman) and global (Needleman–Wunsch) alignments, and scoring for nucleotide sequences. The general method is applicable to any algorithm with a similar scoring scheme, and it should be straightforward to extend BLAST or other search tools based on, e.g. PSSMs or hidden Markov models with similar functionality.

Quality could also be incorporated in nucleotide–protein and protein–protein alignments, especially in the context of translated searches where the sequences may be less reliable. One possible option is to average qualities when translating ORFs, but as low-quality sequences are likely to cause frame shifts, using an approach suggested by (Guan and Uberbacher, 1996) is likely to be more sensitive.


    ACKNOWLEDGEMENT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 
This work is funded by The Institute of Marine Research, Bergen, Norway.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Limsoon Wong

Received on September 13, 2007; revised on February 1, 2008; accepted on February 1, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 

    Altschul SF. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol, ( (1991) ) 219, : 555–565.[CrossRef][ISI][Medline].

    Altschul S, et al. A basic local alignment search tool. J. Mol. Biol, ( (1990) ) 215, : 403–410.[CrossRef][ISI][Medline].

    Chou H-H, Holmes MH. DNA sequence quality trimming and vector removal. Bioinformatics, ( (2001) ) 17, : 1093–1104.[Abstract/Free Full Text].

    Dayhoff MO, et al. A Model of Evolutionary Change in Proteins, ( (1978) ) 5, . Washington, DC: National Biomedical Research Foundation. 345–352..

    Ewing B, Green P. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res, ( (1998) ) 8, : 186–194.[Abstract/Free Full Text].

    Ewing B, et al. Base-calling of automated sequencer traces using Phred. I. Accuracy asessment. Genome Res, ( (1998) ) 8, : 175–185.[Abstract/Free Full Text].

    Gotoh O. An improved algorithm for matching biological sequences. J. Mol. Biol, ( (1982) ) 162, : 705–708.[CrossRef][ISI][Medline].

    Green P. Phrap documentation. ( (1999) ) Available at http://www.phrap.org/phredphrap/phrap.html..

    Guan X, Uberbacher EC. Alignments of DNA and protein sequences containing framshift errors. CABIOS – Comp. Appl. Biosci, ( (1996) ) 12, : 31–40..

    Huang X, Madan A. CAP3: a DNA sequence assembly program. Genome Res, ( (1999) ) 9, : 868–877.[Abstract/Free Full Text].

    Lipman DJ, Pearson WR. Improved tools for biological sequence comparison. Proc. Nat. Acad. of Sci. USA, ( (1988) ) 85, : 2444–2448.[CrossRef].

    Li S, Chou H-H. LUCY2: an interactive DNA sequence quality trimming and vector removal tool. Bioinformatics, ( (2004) ) 20, : 2865–2866.[Abstract/Free Full Text].

    Marth GT, et al. A general approach to single-nucleotide polymorphism discovery. Nat. Genet, ( (1999) ) 23, : 452–456.[CrossRef][ISI][Medline].

    Needleman S, Wunsch C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol, ( (1970) ) 48, : 443–453.[CrossRef][ISI][Medline].

    Pertea G, et al. TIGR gene indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics, ( (2003) ) 19, : 651–652.[Abstract/Free Full Text].

    Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol, ( (1981) ) 147, : 195–197.[CrossRef][ISI][Medline].

    Staden R, et al. The Staden Package, ( (1998) ) Totowa: The Humana Press Inc..

    States DJ, et al. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. METHODS: A Companion Methods Enzymol, ( (1991) ) 3, : 66–70.[CrossRef].

    Walther D, et al. Basecalling with LifeTrace. Genome Res, ( (2001) ) 11, : 875–888.[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
24/7/897    most recent
btn052v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Malde, K.
PubMed
Right arrow PubMed Citation
Right arrow Articles by Malde, K.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?