Bioinformatics Advance Access originally published online on February 23, 2008
Bioinformatics 2008 24(7):897-900; doi:10.1093/bioinformatics/btn052
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
The effect of sequence quality on sequence alignment
Institute of Marine Research, Bergen, Norway
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The nucleotide sequencing process produces not only the sequence of nucleotides, but also associated quality values. Quality values provide valuable information, but are primarily used only for trimming sequences and generally ignored in subsequent analyses.
Results: This article describes how the scoring schemes of standard alignment algorithms can be modified to take into account quality values to produce improved alignments and statistically more accurate scores. A prototype implementation is also provided, and used to post-process a set of BLAST results. Quality-adjusted alignment is a natural extension of standard alignment methods, and can be implemented with only a small constant factor performance penalty. The method can also be applied to related methods including heuristic search algorithms like BLAST and FASTA.
Availability: Software is available at http://malde.org/~ketil/qaa.
Contact: ketil.malde{at}imr.no
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
Nucleotide sequencing is a process that is subject to errors, and when base calling software analyses chromatograms to determine the sequence, they commonly output quality values that estimates the error probability of each position. Although errors in the sequences are frequent, the quality values have been shown to accurately reflect the probability of incorrect base calls (Ewing and Green, 1998; Walther et al., 2001). Quality values thus provide important information to subsequent stages in the analysis process.
Quality values are commonly used in sequence assembly (Green, 1999; Huang and Madan, 1999) to assist in determining the correct base for the consensus sequence. A similar use is in SNP analysis (e.g. Marth et al. 1999), where using quality information can help to differentiate between read errors and true polymorphism.
Aside from the sequence assembly and SNP analysis, quality information is rarely utilized directly, and in particular, the ubiquitous database search tools like BLAST (Altschul et al., 1990) and FASTA (Lipman and Pearson, 1988) make no particular provision for sequence quality. Consequently, it is sometimes suggested that low-quality fragments of sequences be removed before searching, a process often referred to as trimming.
There exists a multitude of tools to perform quality-based trimming—e.g. Pregap4 from the Staden package (Staden et al., 1998) and Lucy and Lucy2 (Chou and Holmes, 2001; Li and Chou, 2004), and Phred also incorporates sequence trimming information in its output (Ewing et al., 1998)—trimming puts the user in the unfortunate situation to choose between retaining dubious sequence parts or discard a potentially large fraction of the data set. The former will reduce specificity of searches, while the latter will reduce sensitivity, and striking on optimal balance can be difficult.
This article describes how standard sequence alignment algorithms can be extended to incorporate quality information in the alignments. The current implementation supports the standard dynamic programming algorithms for local and global alignment (Gotoh, 1982; Needleman and Wunsch, 1970; Smith and Waterman, 1981), but the method extends naturally to more commonly used heuristic methods like BLAST.
1.1 Substitution matrices
A substitution matrix defines the score for the substitution of each possible pair of letters, including the substitution of a letter with itself. For nucleotide sequences, often a uniform background distribution is assumed and the substitution matrix is simplified to two values: a positive score (or reward) in the case of identical letters and a negative score (penalty) for a mismatch.
Substitution matrices are normally calculated with log-odds scores. If the substitution x
y occurs with frequency qxy in alignments, and x and y occur with respective frequencies fx and fy in the data set, the substitution is given the score (Altschul, 1991)
|
| (1) |
The scaling factor
can be chosen freely, but a common choice is
= ln 2, which scales the alignment scores to bit units. All scores reported here will use this choice for
. When the background distribution is uniform with each letter occurring with frequency f, and noticing that qxy = fx P(x
y|x), the scores can be simplified to sxy = log2(P(x
y|x)/f).
For nucleotides, we assume uniform background distribution of P = 0.25, and probability
for errors independent of base. Each of the three possible incorrect bases thus has probability
/3, and substituting in Equation (1) gives
|
| (2) |
A substitution matrix corresponds to a specific distance between the sequences being aligned. In most cases, this distance is evolutionary distance, but the principle applies equally well to distance incurred by random errors.
| 2 METHODS |
|---|
|
|
|---|
For each position in the called sequence, the base calling software outputs a quality value that corresponds to the estimated probability of an incorrect base call. A given quality Q corresponds to the error probability
where
= 10–Q/10.
In order to determine the score of a substitution between two sequences at a position where the sequences have error rates of
1 and
2, we first calculate a combined error rate
|
| (3) |
The combined error in Equation (3) can then be applied to Equation (2). The resulting function generates the substitution scores dynamically for each pair of positions, effectively adjusting the substitution matrix for each position depending on the reliability of the base calls. Modifying the standard dynamic programming algorithms to use this function instead of static values is straightforward, and it should apply equally to similar heuristic methods.
2.1 Implementation
In the current implementation, quality adjustment is applied to the standard dynamic programming algorithms with affine gap penalties for local and global alignment. Quality adjustment is currently supported for nucleotide sequences only, and as discussed above, a uniform distribution of nucleotides is assumed.
Calculating substitution scores as a function of quality values is a computationally expensive operation, and the substitution scores for quality values from 0 to 99 are pre-calculated and stored in a table. Although the present implementation has not been optimized, this allows the quality adjusted algorithm to run with a speed close to the standard algorithm with fixed substitution scores.
BLASTN uses a default match reward of 1 and mismatch penalty of –3. This corresponds to comparing sequences with constant quality values of approximately 22. (This is equivalent to a single sequence error rate of 0.63%, and a combined error rate of 1.25%.) For sequences where no quality value is provided, the implementation therefore defaults to a quality value of 22.
Given that low-quality sequence is likely to contain insertion or deletion errors, a reasonable argument can be made for reducing the severity of gap penalties according to quality. Unfortunately, the theory for choosing gap penalties is less developed than for substitution scores, and the current implementation makes the conservative choice of retaining fixed values for gap opening (–10) and gap extension (–4).
| 3 RESULTS |
|---|
|
|
|---|
EST clustering is a common application that relies on comparing nucleotide sequences. ESTs are typically clustered using sequence similarity, attempting to group sequences that originate from the same gene. In practice, sequences that are found to have an adequate match are clustered together, and consequently, sequences that fail to have such matches remain as singletons.
Note that the comparison is usually not simply an alignment score, but often based on specific limits on, e.g. sequence identity, alignment length and unaligned overlap. In addition, chimeric sequences and splice variants can match other sequences partially. It is therefore not uncommon that singletons have good matches against other clustered sequences, and this does not necessarily reflect weaknesses in the clustering algorithm.
As an example usage of sequence alignment with quality adjustment, we will investigate the clustering of a set of 33 852 EST sequences from sea louse (Lepeophtheirus salmonis). Clustering these sequences using TGICL (Pertea et al., 2003) with default options resulted in 3441 contigs and 14 185 singletons.
The sequences were classified according to average quality score, and sequences with an average quality score of less than 15 were selected for further study. Of 6003 low-quality sequences, only 384 were clustered with other sequences, while 5619 low-quality sequences were left as singletons by the clustering process.
The low-quality sequences were aligned against the contigs using BLASTN with the e-value threshold set to 10–5. Each match was then recalculated using Smith–Waterman local alignment, and using quality-adjusted Smith–Waterman.
Figure 1 shows the alignment scores for Smith–Waterman and quality-adjusted Smith–Waterman plotted against the BLAST scores for the same sequence pair. As expected, the standard Smith–Waterman alignment achieves scores close to the corresponding BLAST score, while the quality adjusted alignment tends to achieve higher scores.
|
Of the 1285 distinct sequences that scored hits above the given threshold of 10–5, 530 achieved BLAST scores higher than 300 bits. For quality-adjusted alignment, 765 sequences scored higher than 300 bits.
Figure 2 shows the alignment score as a function of alignment length. For short alignments, scores close to the optimum of two bits per nucleotide can be seen—representing alignments involving a short but high-quality segment.
|
Figure 3 shows the difference in alignment score as a function of the difference in alignment length. The quality-adjusted alignment tends to achieve somewhat longer alignments by being able to extend more easily across low-quality regions, but for the majority of alignments, the difference is small. Alignment score is generally substantially higher for quality adjusted alignment, reflecting that most sequencing errors occur in positions with low-quality values, and that high-quality positions tend to match.
|
Figure 4 shows the quality of an example sequence. BLASTN reports a match between this sequence and a contig from position 202 to position 337, with an additional short, exact match from position 428 to 444. Local alignment with Smith–Waterman identifies the former as the highest scoring match. Quality-adjusted alignment identifies a match from position 192 to 549. This alignment scores 138.7 bits, compared to 48.1 bits with BLASTN and 48.9 bits with Smith–Waterman alignment.
|
| 4 DISCUSSION |
|---|
|
|
|---|
As we have seen, quality information can be incorporated in standard sequence alignment algorithms. This generally improves the accuracy of score estimation, and in some cases, reveals long alignments, typically because they include regions of low quality which would otherwise terminate the alignment prematurely.
As noted by States et al. (1991), using scoring matrices with the correct PAM distance (Dayhoff et al., 1978) is important to achieve correct results. As quality tend to vary substantially along the sequence length, a static substitution matrix will necessarily be suboptimal, at least for parts of the sequence. For instance, the average sequence quality of the sequence from Figure 4 is 12.7. Running BLASTN with the corresponding parameters –q –2 –r 4 gives a match from position 192 to 376, which is longer than the default parameters, but shorter than the quality-adjusted alignment.
Sequence assembly is one main area where quality information is routinely used, but only for calculating the final consensus. As trimming is often conservative in order to retain as much sequence as possible, sequences often have low-quality regions, typically near the ends, which complicates the overlap alignments used in the assembly process. One way to address this is to have a configurable overhang parameter which allows matches to have an unaligned segment below a certain length at the end of the sequence (Pertea et al., 2003). Using a quality-based alignment method is likely to produce more accurate alignments, while eliminating this parameter.
A similar problem occurs when ESTs are matched against the genome to build clusters (Malde and Sczyrba, unpublished). ESTs occasionally match the genome with a short match between the end of the EST and a position genome sequence at considerable distance downstream from the rest of the match. In many cases, this is more likely to be caused by poor quality of the EST sequence than by a very long intron. Using a quality-adjusted alignment would help to classify this correctly.
Database searching with nucleotide sequences is an important tool for annotating unidentified sequences. When a poor quality sequence is aligned using static matching scores, matches will often be terminated prematurely. In addition to the incorrect significance estimate, this means that even if the sequence has a full length match in the database, the short alignments can easily be mistaken as matches against conserved domains, motifs or binding sites. In contrast, quality-adjusted alignments can bridge or incorporate the low-quality segments in the alignment, and thus help to provide more accurate annotations.
Quality adjusted scores can be implemented as a table look-up, which incurs a constant time and space penalty compared to a static matrix. One optimization commonly used is scaling the scoring matrix (by careful selection of the
parameter) in order to use integer calculations. The quality adjusted matrix requires a much wider range of values, and it is not clear whether an integer approximation is practical. Using floating point values can incur an additional constant time penalty.
The current implementation provides local (Smith–Waterman) and global (Needleman–Wunsch) alignments, and scoring for nucleotide sequences. The general method is applicable to any algorithm with a similar scoring scheme, and it should be straightforward to extend BLAST or other search tools based on, e.g. PSSMs or hidden Markov models with similar functionality.
Quality could also be incorporated in nucleotide–protein and protein–protein alignments, especially in the context of translated searches where the sequences may be less reliable. One possible option is to average qualities when translating ORFs, but as low-quality sequences are likely to cause frame shifts, using an approach suggested by (Guan and Uberbacher, 1996) is likely to be more sensitive.
| ACKNOWLEDGEMENT |
|---|
|
|
|---|
This work is funded by The Institute of Marine Research, Bergen, Norway.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Limsoon Wong
Received on September 13, 2007; revised on February 1, 2008; accepted on February 1, 2008
| REFERENCES |
|---|
|
|
|---|
Altschul SF. Amino acid substitution matrices from an information theoretic perspective. J. Mol. Biol (1991) 219:555–565.[CrossRef][Web of Science][Medline]
Altschul S, et al. A basic local alignment search tool. J. Mol. Biol (1990) 215:403–410.[CrossRef][Web of Science][Medline]
Chou H-H, Holmes MH. DNA sequence quality trimming and vector removal. Bioinformatics (2001) 17:1093–1104.
Dayhoff MO, et al. A Model of Evolutionary Change in Proteins (1978) 5. Washington, DC: National Biomedical Research Foundation. 345–352.
Ewing B, Green P. Base-calling of automated sequencer traces using Phred. II. Error probabilities. Genome Res (1998) 8:186–194.
Ewing B, et al. Base-calling of automated sequencer traces using Phred. I. Accuracy asessment. Genome Res (1998) 8:175–185.
Gotoh O. An improved algorithm for matching biological sequences. J. Mol. Biol (1982) 162:705–708.[CrossRef][Web of Science][Medline]
Green P. Phrap documentation. (1999) Available at http://www.phrap.org/phredphrap/phrap.html.
Guan X, Uberbacher EC. Alignments of DNA and protein sequences containing framshift errors. CABIOS – Comp. Appl. Biosci (1996) 12:31–40.
Huang X, Madan A. CAP3: a DNA sequence assembly program. Genome Res (1999) 9:868–877.
Lipman DJ, Pearson WR. Improved tools for biological sequence comparison. Proc. Nat. Acad. of Sci. USA (1988) 85:2444–2448.[CrossRef]
Li S, Chou H-H. LUCY2: an interactive DNA sequence quality trimming and vector removal tool. Bioinformatics (2004) 20:2865–2866.
Marth GT, et al. A general approach to single-nucleotide polymorphism discovery. Nat. Genet (1999) 23:452–456.[CrossRef][Web of Science][Medline]
Needleman S, Wunsch C. A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol (1970) 48:443–453.[CrossRef][Web of Science][Medline]
Pertea G, et al. TIGR gene indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics (2003) 19:651–652.
Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol (1981) 147:195–197.[CrossRef][Web of Science][Medline]
Staden R, et al. The Staden Package (1998) Totowa: The Humana Press Inc.
States DJ, et al. Improved sensitivity of nucleic acid database searches using application-specific scoring matrices. METHODS: A Companion Methods Enzymol (1991) 3:66–70.[CrossRef]
Walther D, et al. Basecalling with LifeTrace. Genome Res (2001) 11:875–888.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||



