Bioinformatics Advance Access originally published online on September 6, 2005
Bioinformatics 2005 21(22):4125-4132; doi:10.1093/bioinformatics/bti658
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Optimal word sizes for dissimilarity measures and estimation of the degree of dissimilarity between DNA sequences
1Department of Statistics, National Cheng-Kung University Tainan, Taiwan 70101
2Institute of Statistical Science, Academia Sinica Taipei, Taiwan 11529
3Institute of Bioinformatics, National Yang-Ming University Taipei, Taiwan 11221
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Several measures of DNA sequence dissimilarity have been developed. The purpose of this paper is 3-fold. Firstly, we compare the performance of several word-based or alignment-based methods. Secondly, we give a general guideline for choosing the window size and determining the optimal word sizes for several word-based measures at different window sizes. Thirdly, we use a large-scale simulation method to simulate data from the distribution of SKLD (symmetric KullbackLeibler discrepancy). These simulated data can be used to estimate the degree of dissimilarity ß between any pair of DNA sequences.
Results: Our study shows (1) for whole sequence similiarity/dissimilarity identification the window size taken should be as large as possible, but probably not >3000, as restricted by CPU time in practice, (2) for each measure the optimal word size increases with window size, (3) when the optimal word size is used, SKLD performance is superior in both simulation and real data analysis, (4) the estimate
of ß based on SKLD can be used to filter out quickly a large number of dissimilar sequences and speed alignment-based database search for similar sequences and (5)
is also applicable in local similarity comparison situations. For example, it can help in selecting oligo probes with high specificity and, therefore, has potential in probe design for microarrays.
Availability: The algorithm SKLD, estimate
and simulation software are implemented in MATLAB code, and are available at http://www.stat.ncku.edu.tw/tjwu
Contact: tjwu{at}stat.ncku.edu.tw
Supplementary information: Tables A1A3, and Remarks 111 at http://www.stat.ncku.edu.tw/tjwu
| 1 INTRODUCTION |
|---|
|
|
|---|
In the past two decades, the number of DNA sequence records has grown exponentially over time. The characterization of new sequence data presents the biologist with many methods of sequence comparison. Several measures of DNA sequence similarity/dissimilarity have been developed in the past. The purpose of this paper is 3-fold. First, we compare the performance of several word-based or alignment-based methods. Second, we give a general guideline for choosing the (sliding) window size and determine the optimal word sizes for several word-based measures at different window sizes. Third, we approximate the distribution of the SKLD (symmetric KullbackLeibler discrepancy) In, where n denotes the word size, and use such approximation to estimate the degree of dissimilarity, denoted by ß herein, between any pair of DNA sequences.
Throughout we focus on the U-I (uniform-independent) model of base composition, where the probability of encountering any one of the four bases (or letters), A, C, G and T, is taken to be 0.25 independent of the other bases. This model is appropriate because all sequences we generate in this paper can be treated as sequences drawn from the unlimited nucleotide virtual pool, see Sege and Saxberg (1982) for details of this viewpoint. Furthermore, the independence of bases is an approximation to the actual dependence in DNA sequences. Arratia et al. (1990) evaluated this approximation and found it to be quite good. Section 2 briefly reviews several word-based dissimilarity measures and shows that the time complexity of computing SKLD is significantly lower than that of an alignment-based method. Section 3 contains the main results. All results are obtained through extensive simulations that involve generating a large number of pairs of ms (motherson) sequences, where a mother sequence is randomly generated according to the U-I model of base composition, and the son sequence is a mutated version of the mother sequence. The type of mutation considered is the point mutation including three equally likely operations, namely, inserting, deleting and substituting a base at a randomly selected position of the mother sequence, while the selection of a base for insertion and substitution is done uniformly over the possible bases. The point mutation is the most common type of copying error and are actual chemical changes to the structure of the constituent DNA. See, e.g. Alberts et al. (1994) for details. In Section 3.1, we use the Spearman's rank statistic to determine the optimal word size for the word-based methods ED (Euclidean distance)
, SED (standardized Euclidean distance)
, In and SCRD [symmetric CressieRead family of discrepancies (Cressie and Read, 1984)]
, respectively, at window size varying from 10 to 6100. We then compare their performance with Hamming distance (see, e.g. Pinheiro et al., 2000) and the benchmark method BLAST [Basic Local Alignment Search Tool (Altschul et al., 1990, 1997)] that requires sequence alignment. The results are shown in Figure 2. Figures 2a and b show, when the optimal word size is used, SKLD performs the best among all the aforementioned methods. Figure 2c shows, for whole sequence similarity/dissimilaity comparison, the window size taken should be as large as possible (but probably not >3000, as restricted by CPU time in practice). A practical implementation is to let the window size be the minimum of 3000 and the lengths of the two DNA sequences under comparison. Section 3.2 constructs Table A1 at http://www.stat.ncku.edu.tw/tjwu that approximates the percentile point of the distribution of SKLD (at optimal word size) based on a large set of simulated data. This table can be used to obtain estimate
of ß between any pair of DNA sequences. Real data analysis in Section 4 shows
performance that is superior with a wide range of real sequences having U-I, skewed or first order to fifth order Markov chain base compositions, and is very robust to the misspecification of the model of base composition in practically realistic settings. In particular,
performs more favorably than BLAST and the alignment-free PSM (probabilistic similar measure) of Pham and Zuegg (2004) and improves the combined KL-D in Wu et al. (2001). Moreover, a complete genome analysis in Section 4.4 shows
is also applicable in local similarity comparison situations. Specifically, it shows
can help in selecting oligo probes with high specificity and has potential in probe design for gene expression microarrays. Finally, some remarks (Remarks 111) are given at http://www.stat.ncku.edu.tw/tjwu
|
| 2 DISSIMILARITY MEASURES |
|---|
|
|
|---|
Many rigorous DNA sequence comparison algorithms like FASTA (Pearson and Lipman, 1988; Pearson, 1990) and BLAST involve sequence alignment at some stage and become computationally prohibitive when comparison against a large database is required. Comparison algorithms using word frequencies as a measure of dissimilarity do not require sequence alignment. They measure the frequencies of words within a DNA sequence and then compare these frequencies between DNA sequences using statistical distances. In this way they can determine the relative dissimilarity in a large database of DNA sequences very rapidly. They have already been used as pre-selection filters to filter out highly dissimilar sequences and speed alignment-based database search for similar sequences. These filtration methods are currently being increasingly explored to optimize database search and gradually being incorporated in widely used bioinformatics applications. It is noteworthy that these word-based algorithms can also find some new functional similarities or dissimilarities that are invisible to other algorithms like FASTA (Blaisdell, 1989a; Hide et al., 1994) and are useful in detection of coding regions (Fichant and Gautier, 1987) and evolutionary tree reconstruction (Blaisdell, 1989a,b). Several word-based algorithms (Blaisdell, 1986, 1989a; Cressie and Read, 1984; Hide et al., 1994; Pevzner, 1992a,b; Torney et al., 1990; Wu et al., 1997, 2001, among others) have been developed. Vinga and Almeida (2003) review these algorithms and predict that the next few years will see some of them become widely used for functional annotation and phylogenetic study.
An n-word is a subsequence of n adjacent letters. Let
be any n-word within a strand of DNA of length m + n 1. Define Xi = 1 if
begins at position i and Xi = 0 otherwise. Also, define
as the frequency of
. Given two strands of DNA sequences Q and L (for the query and a library sequence in a database), let VL,n = (NL,
1/m, ..., NL,
4n/m) be the vector of relative frequencies of n-words over a segment WL, which is a window of length m + n 1 from the sequence L, where
1, ...,
4n are all of the possible n-words. Let VQ,n = (NQ,
1/m, ..., NQ,
4n/m) and WQ be defined similarly for Q. Thus,
is an expression of the dissimilarity of WL and WQ with respect to word composition. In what follows, WL and WQ are shifted over L and Q, respectively. A distance (say, window distance) is taken for each pair W = (WL, WQ). The distance between L and Q is taken to be the minimum of all window distances. The (squared) ED
![]() | (1) |
![]() | (2) |
,
<
<
, where the values at
= 0, 1 are defined by continuity. To avoid the possibility that
, we need to modify
. By the approach in Frith et al. (2004), the discrepancy becomes
![]() | (3) |
n = 0.5 x 4n. Since
is not symmetric in its arguments, it is better to use the SCRD defined by
![]() | (4) |
with
![]() | (5) |
and
can be computed fairly fast because they do not depend on the model of base composition and do not require computation of variances. Wu et al. (2001) showed that both
and
have better sensitivity and selectivity than
. Now, let lQ, lL and l denote the lengths of the query, library and window, respectively. It can be shown (Remark 1) that for computing both In and
the time complexity is O(lQlLl1), which becomes the linear O(max{lQ,lL}) if we take l = min{lQ, lL}. This time complexity is significantly lower than that of an alignment-based method, whose time complexity of finding the smallest penalty alignment (Waterman, 1989) is the quadratic O(lQlL). We report the actual CPU time and memory space required on a PC in computing SKLD over a test dataset in Section 4.2. | 3 METHODOLOGY |
|---|
|
|
|---|
In Section 1 we have described in detail the way of generating pairs of ms sequences. In order to use Equations (1)(5) to compute
,
, In and
between a mother and her son, we take the window size l = min{lm, ls} where lm and ls denote the length of the mother and her son, respectively. The difference between lm and ls is small (evidently, El = Els = lm). Nevertheless, the window is shifted from left to right over the longer sequence between the mother and her son, and we take 90% overlap on the windows throughout our simulation.
3.1 Comparison of measures and their optimal word sizes
For several particular examples of real genomic databases Hide et al. (1994) give a heuristic solution to the problem of optimizing the word size for
. In this section, we shall give a simple and general solution to this problem for each of
,
, In and
. We generate 5000 independent mother sequences of the same length, while the lengths considered are 10, 20, 50, 100 + 50j, j = 0, 1, 2, ..., 120 bases, respectively. For each mother, we generate a son sequence at each of the following 100 mutation rates: 1%, 2%, ..., 100%, where a mutation rate
% means the son is obtained by randomly selecting
% of the bases in the mother sequence for mutation and other bases are unchanged. For an alternative way of generating m-s pairs, see Remark 2.
It is quite worth analyzing the sensitivity of word-based dissimilarity measures to mutation, window size and word size. We have done analysis on the sample mean and variance of scores for ED [see also Torney et al. (1990)], SED, SKLD and some other members of SCRD family. Since the lessons learned are the same, we shall just present the results for SKLD here. Figure 1a shows the sample mean of the 5000 SK-LD scores, resulting from the above 5000 independent ms pairs, at every mutation rate
% = 1%, 2%, ..., 100% for window sizes 250 and 1600, and word size n = 2, 4, ... and 10 (results for other cases of window size and n are similar), where for each window size and n the mean score is normalizeddivided by the mean score at 100% mutation. It shows that the window size has a smaller influence than n on the SKLD mean score. For each n as
increases, the mean score increases. The mean score increases rapidly when
is small and slowly when
is large, and this phenomenon becomes more distinct as n gets larger. In terms of the slope of the curves, the larger the n the larger the slope at smaller
, and the smaller the n the larger the slope at larger
. Therefore, from the viewpoint of discerning the mean scores, longer word sizes are better for lower mutation rates and shorter word sizes are better for higher mutation rates. On the other hand, Figure 1b shows the logarithm of the sample SD of the 5000 SKLD scores at every word size n = 1,2, ..., 20 for window size 600 and
varying from 5 to 100%, where for each n the SD of scores is normalizeddivided by the SD of scores at 100% mutation. It shows that for larger n the SD tends to be smaller at larger
, and for smaller n the SD tends to be smaller at smaller
(results for other cases of window size and
are similar). Therefore, from the viewpoint of minimizing the SD of scores, longer word sizes are better for higher mutation rates and shorter word sizes are better for lower mutation rates. These two conflicting viewpoints demonstrate that the choice of word size implies a trade-off between discerning the mean and minimizing the variance, i.e. between reducing systematic and random errors. This implies that moderate word sizes are preferable because they balance the systematic and random errors and lead to smaller overall error over the whole spectrum of mutation rates
% = 1%, 2%, ..., 100%. In what follows we shall describe a method based on rank statistics that exactly chooses moderate word sizes as optimal solutions.
|
Let Xi
denote the dissimilarity score between the i-th mother and her son at mutation rate
% and Ri
the rank of Xi
among {Xi
, 1
100} arranged in ascending order of magnitude, where the smallest (least dissimilarity) score is taken as rank 1. In theory, Xi
should increase as
increases (i.e. an upward trend). Therefore, the average value
of 5000 Ai scores, with
being the Spearman's rank statistic for testing an upward trend, is used to compare the performance of dissimilarity measures, where the measure with the smallest
favors the upward trend the most and is considered to be most advantageous. The ranks for BLAST are computed by putting the 100 similarity scores in descending order of magnitude, where the largest (most similarity) score is taken as rank 1. The scores that are <20 (the version BLASTN 2.1.3 is used in our study, see http://www.ncbi.nlm.nih.gov/BLAST, which only provides this bound for very non-similar pairs of sequences) are treated as ties. If several scores are tied, then they are assigned the same rank which is the average of all the tied ranks (called mid-ranks in the area of rank statistics). For an alternative way of comparing the performance of measures, see Remark 3.
Figure 2a shows that at every window size considered (only the results at some selected window sizes are shown for clarity), if the optimal word size, obtained by searching over 1
n
20, is used, then the performance of SKLD is the best among the family of SCRD. Indeed, the curves in Figure 2a are symmetric with respect to
= 1/2 and are nearly horizontal over the interval [1,0] with the value at
= 0 a little smaller than those at 1 <
< 0. For the rest, we concentrate on the comparison of SKLD with other type of measures. The comparison of
,
, In with the Hamming distance and BLAST at window size 600 is given in Figure 2b, where the optimal word size, associated with the smallest
, is marked. It shows that when the optimal word size is used, the performance of SKLD is the best, followed by SED, then ED, then BLAST, and Hamming distance is the least favorable. The same performance rankings also hold at all the other window sizes considered. We also find that the optimal word size for each measure increases with window size, as shown in Figure 2c (only the case of SKLD at some selected window sizes is shown to save space). For example, at window sizes that are multiples of 50 and are (1) in the groups 100250, 300700, 7502500, 25504950 and 50006100, the optimal word sizes for SKLD are 5, 6, 7, 8 and 9, respectively; (2) in the groups 100250, 3001150, 12005100 and 51506100, the optimal word sizes for ED are 5, 6, 7 and 8, respectively; and (3) in the groups 100350, 400850, 9001550 and 16003000, the optimal word sizes for SED are 5, 6, 7 and 8, respectively (See Remark 4 for a discussion). They also show that for each measure, the smallest
, resulting from using the optimal word size, decreases with window size. This implies that, for whole sequence similarity/dissimilarity comparison, the window size taken should be as large as possible (but probably not >3000, as restricted by CPU time in practice). A practical implementation is to let the window size be the minimum of 3000 and the lengths of the two DNA sequences under comparison.
For the rest,
denotes the optimal word size associated with window size l for SKLD. The scatter plot (not shown to save space) of the pairs
, l
B = {10, 20, 50, 100 + 50j: j = 0, 1, 2, ..., 120} shows
increases with l. Moreover, the correlation coefficient of these pairs is 0.871 and that of the pairs
is 0.957. This provides strong statistical evidence of the monotonic relationship between
and l. Table 1 give the (interpolated) optimal word size for SKLD at every window size between 10 and 6100, where the boundaries (i.e. l = 11, 12, 24, 25, etc.) are determined by actual simulation (see Remark 5 for a discussion). It is worth noting that the optimal n we recommend for the ED
is consistent with the word size used in examples of Torney et al. (1990) and Hide et al. (1994). See Remark 6 for details.
|
3.2 Estimation of the degree of dissimilarity
This section describes how to estimate the degree of dissimilarity ß between any pair of DNA sequences using the SKLD In. Note that 0
ß
1 with ß = 0 standing for the least dissimilar case and ß = 1 for the most dissimilar case. At each window size l = 11, 12, 24, 25,70, 71, 268, 269, 715, 716 (see boundaries between groups in Table 1) and 10, 20, 50 and 100 + 50j, j = 0, 1, 2, ..., 48 we generate 5000 In scores, 1
n
20, resulting from 5000 independent ms pairs, with the length of every mother being l, at each of the 100 mutation rates
%,
= 1, 2, ..., 100. If we identify the degree of dissimilarity ß with the mutation rate, then at each window size l and word size n: (1) the 5000 In scores at each mutation rate
% (=ß) can be viewed as a random sample from the population Un,
of all In scores resulting from the unlimited virtual pool of all ms pairs of DNA sequences at that mutation rate ß and (2) the 500 000 In scores, resulting from pooling together the scores for all mutation rates, can be viewed as an approximately random sample from the population
. Therefore, a ß-th quantile of Un can be estimated by the counterpart of the empirical distribution function associated with the 500 000 In scores. Picking
leads to Table A1. Thus, based on the statistic
, we can use Tables A1 to estimate ß. This way of estimating ß may be called the pooling method (see Remark 7 for a discussion).
Figure 3 shows the relation between
and all window sizes (for clarity only the case when window size
25 is shown) included in Table A1 at some selected SKLD scores, where the jumps of the curves are due to change of word size and occur at vertical lines at 70.5, 268.5 and 715.5 (as boundaries between word sizes 4 and 5, 5 and 6, and 6 and 7, respectively, see Table 1). Figure 3 shows that for every fixed SKLD score,
is monotonically decreasing with window size. Therefore, the method of linear interpolation can be applied to obtain the approximate estimate
of ß at any window size l satisfying l1
l
l2, where both l1 and l2 are window sizes that are included in Tables A1 and are closest to l. Although the true relationship between the estimate and window size may not be linear, the linear interpolation should provide a good approximation, especially when l2 l1 is small. Let
denote the estimate of ß by matching the observed SKLD score
against Table A1 with window size lj and optimal word size
(note that we have designed Table A1 so that
and l2l1
50 always). Then
is between
and
. Thus,
, where a = (l2l)/(l2l1). A numerical example is given in Section 4.2.
|
| 4 EXPERIMENTAL RESULTS |
|---|
|
|
|---|
Throughout the data analysis, for any pair of DNA sequences under comparison, the window size is taken to be the minimum of their lengths for Experiments #1#3 and the probe length for Experiment #4, and the SKLD is computed at the optimal word size (Table 1), which leads to
via Table A1. Since the word size may vary with the pair of sequences under comparison, the present setup is more efficient than the one in Wu et al. (1997, 2001) in which the word size is not optimally chosen and does not vary with window size. For base compositions of all the DNA sequences used in our experiments (which vary from U-I, skewed to Markov chain of order up to 5), see Remarks 811.
4.1 Experiment #1
Six DNA sequences are taken from the threonine operons of Escherichia coli K-12 (gi:1786181) and Shigella flexneri (gi:30039813) of Genbank. The three sequences taken from each threonine operons are thrA (aspartokinase I-homoserine dehydrogenase I, 2463 bp), thrB (homoserine kinase, 933 bp) and thrC (threonine synthase, 1287 bp), using the ORF's 3372799 (ec-thrA), 28013733 (ec-thrB) and 37345020 (ec-thrC) in the case of E.coli K-12, and using 3362798 (sf-thrA), 28003732 (sf-thrB) and 37335019 (sf-thrC) in the case of S.flexneri. In addition, a sequence (rand-thrA) is randomly generated according to the base probabilities and length of ec-thrA for comparison.
The estimated degree of dissimilarity
using SKLD and similarity scores using BLAST, at the default (search) parameter setting, between the 7 sequences are shown in Table 2. BLAST scores at many other parameter settings have also been obtained but not shown in Table 2 to save space. The results obtained by
agree with those by the chaos game representation (Almeida et al., 2001), by PSM of Pham and Zuegg (2004) and by BLAST at a few parameter settings (e.g., the setting: -G 5 -E 2 -q -2 -r 3 -W 8), in which the thrA sequences are closer to thrC than to thrB, and thrB closer to thrA than to thrC. However, for BLAST, the result obtained at most parameter settings is the same as that at the default parameter setting which shows a slightly different relationship between the three sequences. Specifically, it put the thrB sequences closer to thrA than to thrC, and thrC closer to thrB than to thrA. The difference is also illustrated by the dendrogram in Figure 4. Moreover, the dendrogram shows that all the real DNA sequences are more closely clustered using
than using BLAST at the default setting. This suggests that, at distinguishing a randomized sequence from a group of related real DNA sequences as a whole,
is no worse than BLAST at all parameter settings and better than BLAST at most parameter settings.
|
|
4.2 Experiment #2
Both
(or equivalently, SKLD) and BLAST are used to perform a search for dissimilarities/similarities of the query sequence HSLIPAS (1612 bp) human lipoprotein lipase against a test dataset of 63 library sequences chosen from many different divisions of Genbank and vary from mammals, invertebrates, viruses, plants, bacteria, etc. [see Table A2 at http://www.stat.ncku.edu.tw/tjwu. It expands both Table 2 of Hide et al. (1994) and Table 1 of Wu et al. 1997)]. The 63 sequences of the test dataset vary in length from 322 to 2 462 499 bases. Every member of the test dataset is classified as being related or not related in biological function to the query sequence. There are 35 sequences classified as being related (they are numbered from 1 to 35 herein), and 28 sequences classified as being not related (they are numbered from 36 to 63 herein).
The
and BLAST scores between HSLIPAS and 63 library sequences are sorted from the highest to lowest similarity, respectively, and the sensitivity and selectivity are used to quantify their performances. Sensitivity is defined to be the number of HSLIPAS-related sequences found among the first 35 library sequences. Selectivity is measured in terms of consecutive correct classifications, which means, starting from the first sequence, the total number of sequences are counted until the first non-HSLIPAS-related library sequence occurs. Thus, a score of 35 for sensitivity and selectivity is a perfect score for this collection of sequences.
We find the sensitivity and selectivity for
are 34 and 30, respectively, and those for BLAST are 29 and 22, respectively, at the default parameter setting and are no better than 33 and 28, respectively, at other parameter settings (the optimal result is obtained, e.g. at: -G 5 -E 2 -q -1 -r 1 -W 7). Hence,
performs better than BLAST. Also,
improves the combined K-LD (Wu et al., 2001), whose sensitivity and selectivity are 31 and 24, respectively. Finally, the sensitivity and selectivity of PSM of Pham and Zuegg (2004) are 32 and 26, respectively. Thus,
performs more favorably than PSM.
As a numerical example to explain how to use Tables A1 to estimate ß, we compare the library SSLPLRNA (2963 bp) and the query HSLIPAS. We take l = 1612 and
(Table 1) to obtain
. Since l1 < l < l2 with l1 = 1600 and l2 = 1650 and a = (1650 1612)/(1650 1600) = 0.76, it follows by linear interpolation (see Section 3.2) and Table A1 that
.
It is worth mentioning that, using a PC with Pentium 4 processor running at 3.4 GHz CPU and 1 GB RAM, it takes only
4.4 CPU seconds to finish the computation of SKLD between the query and all the 59 library sequences: #1#31 and #36#63 (varying from 322 to 22 257 bp and average = 3845 bp in length, see Table A2), and 189 CPU seconds between the query and all the 4 library sequences: #32#35 (varying from 1 173 390 to 2 462 499 bp and average = 1 838 935 bp in length). Here the total memory space required, in addition to that for running the MATLAB program, varies from 4 to 22 MB over all library sequences. Therefore, our algorithm is fairly efficient.
4.3 Experiment #3
A protein is translated from a gene composed of several exons in the DNA sequence. Most likely, some of the exons are shuffled during evolution. Human and mouse genomes are not far away in the evolution tree and share many common segments. However, segments in a chromosome of mouse might have their similar segments in more than one chromosome of human. This is an example of DNA subsequences shuffling during evolution.
Fifteen severe acute respiratory syndrome ORF sequences, varying in length from 231 to 8628 bases, are chosen from Genbank. The names of their loci are AY707461, AY536760, AY536759, AY536758, AY702026, AY648300, AY569693, AY365036, AY525636, AY444813, AY322205S4, AY451866, AY707854, AY609081, AY322205S3 (labelled herein as g1, ..., g15). We choose g1 (1605 bp) as the query sequence and cut it into three segments of equal lengths, by permuting them, we obtain five new sequences g11, ..., g15. Let T(a,b) denote the similarity/dissimilarity score between sequences a and b using the measure T.We compare the ratio T(g1,gi)/T(g1j,gi) for T = SKLD or BLAST. We find that for the five groups (j = 1, ..., 5), while each group contains 14 ratios (i = 2, ..., 15), the group mean varies from 0.944 to 0.97 and SD from 0.057 to 0.088 for SKLD, whereas the group mean varies from 1.244 to 1.978 and SD from 0.261 to 1.045 for BLAST at the default parameter setting (similar results are obtained at other parameter settings). Therefore, SKLD is much less sensitive than BLAST to segment shuffling. Such a property of SKLD is desirable in reconstructing the evolution tree because shuffled DNA sequences (if no other biological factor is present to affect the changes) should not be far away from one another in the tree.
4.4 Experiment #4
In Experiments #1#3, we focus on whole sequence similiarity/dissimilarity identification. However, in many applications it is the local similarity that matters the most. As a demonstration of the applicability of our method in local similarity comparison situations, we choose the T7 phage genome (39 937 bp), which includes 60 genes with lengths varying from 90 to 3957 bp, as the test dataset to show how
can help in selecting oligo probes for use in gene expression microarray design. For simplicity, we focus on the selection of a single 70mer oligo probe for each gene.
Pick any gene, say G0, and any window W0 of size l = 70 bp from G0. The SKLD between the window W0 and the remaining set
of 59 genes, with optimal word size
(Table 1), is defined by I4(W0,
) = minG
I4(W0, G) where I4(W0, G) is the SKLD between W0 and G (recalling Section 2). The window that maximizes I4(W0,
) over all W0 is taken as a probe for G0 and its corresponding
value can be obtained from Table A1. By rejecting 13 redundant genes (with
) and a gene that is excessively similar to non-target genes (with
; threshold value other than 0.25 may be used), we obtain a probe for each of the remaining 46 genes. Their loci names, probes and
can be found in Table A3 at http://www.stat.ncku.edu.tw/tjwu.
The above approach does not use BLAST at all. It uses
along with a threshold value set by the user to avoid probes with excessive sequence similarity to a non-target gene that may be present during the hybridization. Thus it should increase the specificity of individual probes. A probe must satisfy three conditions (Hughes et al., 2001; Kane et al., 2000; Nordberg, 2005) to be specific: (1) total percent identity must be
7580% with a non-target gene; (2) it must not include a stretch of identical sequence >15 contiguous bases with a non-target gene; (3) it must not include any low complexity region (e.g. long stretches of the same base, homopolymeric runs, etc.).
We compare our method with OP (OligoPicker, Wang and Seed, 2003) and YODA (Nordberg, 2005). OP also selects 46 probes corresponding to the same 46 genes as ours, while YODA only selects 45 probes corresponding to 45 genes contained in ours (YODA rejects gene T7p25). All the 137 (= 46 + 46 + 45) probes satisfy the specificity conditions (1)(3), as can be seen by using the Hamming distance, BLAST and DUST programs (Hancock and Armstrong, 1994). For example, using Hamming distance the average percent identity and SD of probes with non-target genes selected by
, OP and YODA are 49.16 and 2.3%, 48.82 and 2.5 and 50.19% and 3%, respectively, and there is little difference among them. Next, let the BLAST score of a probe equal the largest BLAST score between that probe and all non-target genes, and then combine all such 137 BLAST scores and rank them in ascending order of magnitude (recalling Section 3.1). Evidently, a smaller average rank corresponds to less similarity, and hence higher specificity. Table 3 shows probes selected by
has much smaller average rank than those by OP and YODA. In conclusion, our study, although very primitive, shows
can help in locating probes with high specificity and has potential for probe design.
|
| 5 DISCUSSION |
|---|
|
|
|---|
After computing a similarity/dissimilarity score between any pair of DNA sequences, we are sometimes not certain about the relative magnitude of the score since the terms small score, moderate score, and large score are relative terms. Consequently, the term dissimilar sequences is not clearly defined using these scores. This motivates us to introduce in Section 3.2 the so-called degree of dissimilarity ß between any pair of DNA sequences and develop a method to estimate the degree of dissimilarity based on SKLD score. In this way a clear cut definition of dissimilar sequences is made possible through the estimated values
of ß, by
for some
. The exact choice of
depends on the user's objective. For example, if
is used as a filter in the database search, then small
results in filtering out a large number of dissimilar sequences, and thus dramatically speeds the database search for similar sequences. The subjects of future research are (1) the expansion of Table A1 to include larger window sizes (currently it only includes results up to window size 2500 due to restriction on our computing environment), (2) the extension of our method to (hidden) Markov chain model of base composition and other types of mutation, (3) the extensive investigation of the specificity, along with sensitivity and consistency [see, e.g. Nordberg (2005) for definitions], of single or multiple oligo probes selected by
for use in microarray design, (4) the enhancement of our simple consecutive base sliding word model to incorporate frame shifts (e.g. using not just one but three sliding words each picking up 1 in every 3 bases and can be computed independently for statistical characteristics), and/or to use nonconsecutive base word models similar to those presented in Huang et al. (2004) and (v) the use of our method in protein sequence comparisons (specifically, the use of our method in producing a better amino acid matching table such as the BLOSUM or PAM matrices).
| Acknowledgments |
|---|
The authors would like to express their sincere thanks to two referees for many valuable suggestions that significantly improved the quality and presentation of the article. Part of the research was done while T.-J.W. was visiting the Institute of Statistical Science, Academia Sinica, Taipei, Taiwan.
Conflict of Interest: none declared.
Received on May 25, 2005; revised on August 24, 2005; accepted on August 31, 2005
| REFERENCES |
|---|
|
|
|---|
Alberts, B., Bray, D., Lewis, J., Raff, M., Roberts, K., Watson, J.D. Molecular Biology of the Cell, (1994) 3rd ed , New York Garland.
Almeida, J.S., et al. (2001) Analysis of genomic sequences by chaos game representation. Bioinformatics, 17, 429437
Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol., 215, 403410[CrossRef][ISI][Medline].
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 33893402
Arratia, R., et al. The Erdös-Rényi law in distribution, for coin tossing and sequence matching. Ann. Stat., 18, 539570.
Blaisdell, B.E. (1986) A measure of the similarity of sets of sequences not requiring sequence alignment. Proc. Natl Acad. Sci. USA, 83, 51555159
Blaisdell, B.E. (1989a) Effectiveness of measures requiring and not requiring prior sequence alignment of estimating the dissimilarity of natural sequences. J. Mol. Evol., 29, 526537[ISI][Medline].
Blaisdell, B.E. (1989b) Average value of a dissimilarity measure not requiring sequence alignment are twice the averages of conventional mismatch count requiring sequence alignment for a computer-generated model system. J. Mol. Evol., 29, 538549[ISI][Medline].
Cressie, N. and Read, T.R.C. (1984) Multinomial goodness-of-fit tests. J. R. Stat. Soc. Ser. B, 46, 440464.
Fichant, G. and Gautier, C. (1987) Statistical method for predicting protein coding regions in nucleic acid sequences. CABIOS, 3, 287295.
Frith, M.C., et al. (2004) Finding functional sequence elements by multiple local alignment. Nucleic Acids Res., 32, 189200
Gentleman, J.F. and Mullin, R.C. (1989) The distribution of the frequency of occurrence of nucleotide subsequences, based on their overlap capability. Biometrics, 45, 3552[CrossRef][ISI][Medline].
Hancock, J.M. and Armstrong, J.S. (1994) SIMPLE34: an improved and enhanced implementation for VAX and Sun computers of the SIMPLE algorithm for analysis of clustered repetitive motifs in nucleotide sequences. Comput. Appl. Biosci., 10, 6770
Hide, W., et al. (1994) Biological evaluation of d2, an algorithm for high performance sequence comparison. J. Computat. Biol., 1, 199215.
Huang, X., et al. (2004) Efficient combination of multiple word models for improved sequence comparison. Bioinformatics, 20, 25292533
Hughes, T.R., et al. (2001) Expression profiling using microarrays fabricated by an ink-jet oligonucleotide synthesizer. Nat. Biotechnol., 19, 342347[CrossRef][ISI][Medline].
Kane, M.D., et al. (2000) Assessment of the sensitivity and specificity of oligonucleotide (50mer) microarrays. Nucleic Acids Res., 28, 45524557
Nordberg, E.K. (2005) YODA: selecting signature oligonucleotides. Bioinformatics, 21, 13651370
Pearson, W.R. (1990) Rapid and sensitive sequence comparison with FASTA and FASTP. Methods Enzymol., 183, 6398[ISI][Medline].
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 24442448
Pevzner, P.A. (1992a) Nucleotide sequences versus Markov models. Comput. Chem., 16, 103106.
Pevzner, P.A. (1992b) Statistical distance between texts and filtration methods in sequence comparison. CABIOS, 8, 121127.
Pham, T.D. and Zuegg, J. (2004) A probabilistic measure for alignment-free sequence comparison. Bioinformatics, 20, 34553461
Pinheiro, H.P., Moiseiwitsch, F.S., Sen, P.K. (2000) Analysis of variance based on the hamming distance. In Sen, P.K. and Rao, C.R. (Eds.). Handbook of Statistics Volume 18: Bioenvironmental and Public Health Statistics, , North-Holland Oxford, pp. 735.
Sege, R.D. and Saxberg, B.E.H. (1982) A statistical test for comparing several nucleotide sequences. Nucleic Acids Res., 10, 375389
Torney, D.C., Burks, C., Davison, D., Sirkin, K.M. (1990) Computation of d2: a measure of sequence dissimilarity. In Bell, G. and Mrarr, T. (Eds.). Computers and DNA, Santa Fe Institute Studies in the Sciences of Complexity, , New York Addison-Wesley, pp. 109125.
Vinga, S. and Almeida, J. (2003) Alignment-free sequence comparison-a review. Bioinformatics, 19, 513523
Wang, X. and Seed, B. (2003) Selection of oligonucleotide probes for protein coding sequences. Bioinformatics, 19, 796802
In Waterman, M.S. (Ed.). Mathematical Methods for DNA Sequences, (1989) , Boca Raton, FL CRC.
Wu, T.-J., et al. (1997) A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words. Biometrics, 53, 14311439[CrossRef][ISI][Medline].
Wu, T.-J., et al. (2001) Statistical measures of DNA sequences dissimilarity under Markov chain models of base composition. Biometrics, 57, 441448[CrossRef][ISI][Medline].
This article has been cited by other articles:
![]() |
A. E. Pozhitkov, D. Tautz, and P. A. Noble Oligonucleotide microarrays: widely applied poorly understood Brief Funct Genomic Proteomic, July 20, 2007; (2007) elm014v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. S. Vernikos and J. Parkhill Interpolated variable order motifs for identification of horizontally acquired DNA: revisiting the Salmonella pathogenicity islands Bioinformatics, September 15, 2006; 22(18): 2196 - 2203. [Abstract] [Full Text] [PDF] |
||||
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








and all window sizes (only the case when window size
. The jumps of the curves are due to changes of word size 

