A statistical method for alignment-free comparison of regulatory sequences
1Department of Computer Science, 2Department of Entomology, 3Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Illinois, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: The similarity of two biological sequences has traditionally been assessed within the well-established framework of alignment. Here we focus on the task of identifying functional relationships between cis-regulatory sequences that are non-orthologous or greatly diverged. Alignment-free measures of sequence similarity are required in this regime.
Results: We investigate the use of a new score for alignment-free sequence comparison, called the
score. It is based on comparing the frequencies of all fixed-length words in the two sequences. An important, novel feature of the score is that it is comparable across sequence pairs drawn from arbitrary background distributions. We present a method that gives quadratic improvement in the time complexity of calculating the
score, over the naïve method. We then evaluate the score on several tissue-specific families of cis-regulatory modules (in Drosophila and human). The new score is highly successful in discriminating functionally related regulatory sequences from unrelated sequence pairs. The performance of the
score is compared to five other alignment-free similarity measures, and shown to be consistently superior to all of these measures.
Availability: Our implementation of the
score will be made freely available as source code, upon publication of this article, at: http://veda.cs.uiuc.edu/d2z/
Contact: sinhas{at}cs.uiuc.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| 1 INTRODUCTION |
|---|
|
|
|---|
We begin this research with the question How do we measure the similarity between two biological sequences? The most well-studied version of this problem is measuring the similarity between proteins or coding sequences in order to detect homology. The widely accepted solution to this question is based on the notion of alignment. The BLAST program Altschul et al. (1990), which reports the statistical significance of sequence alignment scores, is the de facto standard for comparing two sequences.
Here, we are interested in answering this question with respect to regulatory sequences (e.g. enhancers or promoters) of genes, rather than the coding sequences. Moreover, we focus on the case where the sequences being compared do not demonstrate any statistically significant alignment. This case may arise where the two sequences being compared are not orthologous, yet are functionally related; or they are highly diverged evolutionarily. Thus, we can rephrase our motivating question as How do we measure the similarity between two regulatory DNA sequences, in an alignment-free manner? The ability to compare two regulatory sequences for similarity will have very important consequences, as discussed next.
MOTIVATION 1: One of the first, routine steps in the analysis of newly sequenced genomes is to scan the genome for coding sequences homologous to known genes in a related species, typically using a tool like BLAST. Similarly, with growing scientific interest in gene regulation, it will become necessary to be able to detect regulatory regions in the new genome that are homologous to known enhancers or promoters. It is well known that orthologous regulatory sequences show a significantly less level of alignment than their corresponding coding sequences, hence a sensitive method for alignment-free sequence comparison will be required.
MOTIVATION 2: Regulatory sequence comparison can prove vital for the ab initio discovery of cis-regulatory modules (CRM's) with a common function. A CRM may be loosely defined as a contiguous non-coding sequence that contains multiple transcription factor binding sites and drives some aspect of a gene's expression profile. Suppose we are given a set of co-regulated genes in a single species, and wish to find in their upstream and downstream regions (henceforth called the control regions), the CRMs that mediate the common aspect of their expression profiles. The control regions may be tens of Kbp long for each gene (especially for metazoan genomes), while the CRMs to be discovered are often only hundreds of bp long. One must therefore search in the control regions for subsequences (the candidate CRMs) that share some functional similarity. The CRM search algorithm thus requires a method that can discern functional similarity among candidate CRMs based on their sequence similarity. Also, since the different CRMs are only functionally related, and not orthologous, we need the comparison method to be alignment-free.
1.1 Comparison of k-word frequency distributions
One approach to alignment-free sequence comparison is to fix a short word length k, compute the frequency distribution of all k-length words (also called k-words) in each sequence, and assess the similarity of the two frequency distributions. This approach has been adopted in earlier work on alignment-free comparison of coding sequences (Wu et al., 1997). We expect the same idea to apply to regulatory sequence comparison with k set to the typical length of binding site cores, e.g. k = 6. If each of the two regulatory sequences being compared has several binding sites for the same transcription factors [as is commonly seen in Drosophila developmental enhancers Rajewsky et al. (2002)], comparison of their k-word frequency distributions should reveal this featural similarity. Such a strategy becomes all the more promising if we are interested in comparing one set of (related) regulatory sequences to another set of related regulatory sequences. (The k-word distribution of each set will have high frequencies for the shared motifs.)
1.2 Previous work
Various quantitative measures for the similarity of two sequences, in terms of their k-word frequency distributions, have been proposed in the past, as summarized in the excellent review by Vinga et al. (2003). The simplest similarity score is the Euclidian distance between the two 4k-dimensional vectors of k-word counts (Blaisdell, 1986). Information theoretic measures like the Kullback-Leibler distance (also called relative entropy) (Wu et al., 2001), geometric measures such as the cosine of the angle between the count vectors (Stuart et al., 2002), and statistical measures such as the correlation coefficient (Fichant and Gantier, 1987) have been investigated by different authors in the context of alignment-free sequence comparison. There are also measures that do not treat every k-word's count equally, in view of the fact that different k-words' counts have different probability distributions (Waterman, 2000). Thus, the Standardized Euclidian distance and the Mahalanobis distance (Wu et al., 1997, 2001) account for the variances of k-words in computing the Euclidian distance between the count vectors.
Furthermore, there have been approaches that do not base the sequence similarity score on counts of all k-words; rather, they use certain techniques to short-list a smaller subset of k-words and compare the counts of these words in the two sequences. For example, Stuart et al. (2002) use singular value decomposition to identify the most important of the 4k k-words before computing the cosine of the angle between the k-word count vectors of two sequences. Similarly, van Helden (2004) uses a motif-finding program to identify statistically significant k-words and bases the similarity score of two regulatory sequences on the counts of these significant k-words only.
1.3 Contributions of this work
This paper explores the D2 score (Lippert et al., 2002), a natural alignment-free similarity measure between two sequences, defined as the number of k-word matches between the sequences. The D2 score can also be expressed as the inner product of the k-word count vectors of the two sequences. However, in an application where several different pairs of sequences drawn from different distributions are to be compared, the D2 score is not suitable. For such applications, any sequence similarity measure must be normalized to account for the different background distributions of sequences. In this article:
- We propose and demonstrate the use of the D2z score as such a normalized measure that captures the statistical significance of the D2 score. It is defined as the number of standard deviations by which the observed value of D2 deviates from its expected value under the background distribution. We present an efficient algorithm for calculating this statistic, and the implementation will be made publicly available.
- The D2z score is calculated under the general assumption of the two sequences being generated by Markov chains, which may be different for the two sequences. We exploit a known result on infinite series of stochastic matrices (Kleffe and Borodovsky, 1992) to reduce the algorithm's time complexity from O(
) to O(4k), a speedup that is crucial to the feasibility of the algorithm for realistic values of k.
- We perform extensive tests with the
score on tissue-specific families of known enhancers (CRMs) in Drosophila and human. Our results show that the
score can accurately discriminate (a) functionally related CRMs from unrelated sequence pairs, and (b) orthologous CRMs from orthologous non-regulatory non-coding sequences. To our knowledge, we have demonstrated for the first time that alignment-free similarity scores may be potentially applied to the discovery and analysis of regulatory sequences in metazoan genomes.
- The performance of the D2z score is compared to several other similarity measures, and shown to be consistently superior to the compared methods.
| 2 METHODS |
|---|
|
|
|---|
Let
. The background models may be different for the two sequences. In the special case of
= 0, the background model is iid (independent and identically distributed) letters.
The D2 statistic is defined to be the number of k-word matches between the two sequences A and B, including overlaps. It can be computed as
|
| (1) |
Another way to think of D2 is as the inner product of the vectors of word counts in A and B. More explicitly, let
be the set of all k-words on the alphabet
. For
let
(resp.
) be the number of times the word w appears in the sequence A (overlaps allowed). Then
is the count vector for the sequence A. Then, we can also write the D2 score as
The
score for the sequences A and B is defined by
|
| (2) |
Next we outline the key steps in the computation of
. We give formulas for the mean and variance of D2 which can be computed efficiently. We treat the iid case (
) and the Markov model case (
) separately since the formulas for the iid case are much simpler. In either case, our implementation computes the
score of two
1 kbp long sequences in less than 1 second on a workstation.
2.1 Expectation
2.1.1 IID model For
, write f Aa (resp. f Ba) for the background probability of letter a in the sequence A (resp. B). Also define
Then
|
|
2.1.2 Markov Model (MM) We work with MM of order one and leave the extension to the general case to Supplementary Material. Here we follow notation from Kleffe and Borodovsky (1992) and Sinha and Tompa (2000). For sequence
, let pjT(a) be the probability of an a in position j in T. As justified in Kleffe and Borodovsky (1992), we may assume pjT is independent of j and denote it by pT. The probabilities
are calculated as the steady state probabilities of the background Markov model. For a k-word
, write
for
, i.e. the probability of a k-word in T being w, given that its first letter is w1. With this notation,
. Then
|
| (3) |
We use techniques from Lippert et al. (2002) and Waterman (2000) to compute the covariance terms
. Any pair
corresponds to two pairs of k-words, whose relative locations may be described by one of three different scenarios, as shown in Figure 1. For any
, let
be those pairs (s, t) whose location with respect to (i, j) is as shown in Figure 1b or 1c. Formally,
. Pairs of the type shown in Figure 1c are called accordions and form the set
, while pairs of the type Figure 1b are called crabgrasses and form the set
.
|
2.2.1 IID model The covariances
Case 1:
(no overlap).
are independent; hence
.
Case 2:
(crabgrass terms). Consider the case where
and
, i.e. overlap in A, not in B. Then
|
|
Case 3:
("accordion" terms). This is shown in the Supplementary Materials.
2.2.2 Markov model We work with MM of order one. Extension to the general case is explained in the Supplementary Material. As in the iid model, we divide the computation into three cases (Fig. 1). However, now, in the first case (Fig. 1a), with no overlap in either sequence, the covariance
may not be zero. Here we work out a formula for this case and also illustrate how we reduce the complexity of computation from
to
. Computations of the crabgrass and accordion terms use the same ideas and are shown in the Supplementary Material.
Case 1:
(no overlaps). Let
and
.
|
|
|
| (4) |
|
|
|
| (5) |
The total contribution of such terms to the variance is then obtained from this sum by subtracting
, where
is as computed in Equation (3).
2.3 Normality of the D2 statistics
In the iid model, the D2 statistic is asymptotically normal by Lippert et al. (2002) for non-uniform background model and by Kantorovitz et al. (in preparation) in the uniform case. (Hence,
is standard normal.) For the Markov Models, we used simulations to test normality. Our simulations showed that even when the length of the sequences is as short as 400, the Kolmogorov–Smirnov (K-S) test give good p-values. For example, when k = 5 and Markov order
, with sample size of
, and sequence lengths n = 400, the K-S test gave p-value
. (High K-S p-value corresponds to good normal approximation.)
2.4 Background models
In practice, we are only given the two sequences to compare, not the Markov chains. We infer these generative models by maximum likelihood estimation—the (
+ 1)-word frequencies in each sequence are used to train the Markov chain of that sequence. This practice of assuming a local background distribution has been adopted in several previous bioinformatics analysis, and is especially meaningful for the longer metazoan genomes that demonstrate great variability in local sequence composition. Of course, the program can also be made to run with a global background model assumed to generate all sequences.
Markov chains are a popular and intuitive model of biological sequence generation, and most regulatory sequence analysis tools utilize this model. A first-order Markov chain, for instance, captures adjacent nucleotide dependencies, such as the abundance of CG dinucleotides in some parts of the genome. We feel it is imperative to allow for such higher-order Markov chains in assessing the statistical significance of D2 scores.
2.5 Other measures
For evaluation purposes, we also implemented other alignment-free measures of pairwise sequence similarity (or dissimilarity). This includes
- ed: The euclidean distance between the k-word count vectors
and
(Blaisdelle, 1986).
- kld: Kullback–Leibler discrepancy (Wu et al., 2001) between the two k-word frequency distributions {fiA} and {fiB}, defined as
. We made this score symmetric by taking the mean of
and
.
- p.p and p.a: van Helden's Poisson scores (van Helden 2004), which are defined as follows. Let
, let F(x, m) be the Poisson distribution with mean m, and
(resp.
) be the expected number of occurrences of the word w in A (resp. B). Then
Also,

- lcc: Pearson correlation coefficient between k-word count vectors
and
(3).
- cos: The cosine of the angle between the k-word count vectors
and
(12).
In each of the above scores, all possible k-words' counts are considered. Thus, the cos score is computed without the preceding singular value decomposition (SVD) step of [12], and the Poisson score is computed without the initial motif-finding step of (14). Our intention here is to assess the scores as measures of sequence similarity, without any assistance from feature selection techniques such as motif-finding or SVD. Such techniques may be used in conjunction with any of the evaluated scores, and their interplay with each of the scores is beyond the scope of this article. Moreover, one of the compelling applications of alignment-free scores are regulatory sequences that may not contain statistically over-represented words (motifs), yet share a significant similarity in their overall word counts. In such applications, motif-finding is not expected to improve similarity measurement.
| 3 EVALUATION AND COMPARISON |
|---|
|
|
|---|
Our goal here is to evaluate if functionally and/or evolutionarily related sequence pairs are scored better than unrelated pairs of sequences randomly chosen from the genome (Sections 3.1 and 3.3). We will also evaluate if two sets of related sequences score better (when compared to each other) than two sets of unrelated sequences (Section 3.2).
3.1 Evaluation on functionally related regulatory sequences
To assess the performance on functionally related sequences, we construct each data set as follows. A set of CRMs, known to regulate expression in the same tissue and/or development stage, is taken as the positive set. A set of equally many randomly chosen non-coding sequences, with lengths matching the CRMs, is taken as the negative set. Each pair of sequences in the positive set is compared, and so is each pair in the negative set. We assess if the pairs from the positive set score higher than pairs from the negative set. This is done by sorting all pairs, whether they are from the positive set or the negative set, in one combined list, then counting how many of the pairs in top half of this list are from the positive set. That is, we count the number of positive pairs in the top
scoring pairs, where n is the number of sequences in the positive set. If
, we only look at the top 300 pairs. The ideal score should place only positive pairs in the top K, while a random score should place K / 2.
In the following paragraphs, we enumerate the data sets analyzed and the performance of different scores on these. The scores evaluated are: our D2z score and the five scores outlined in Section 2.5 (ed, kld, poisson: p.p & p.a, lcc and cos). All scores are run with word lengths
. The D2z score and the Poisson score are run with background models of Markov order 0,1,2. The Poisson score is run both with the additive model (p.a) and the product model (p.p). For each method, separate tests are done with each combination of parameter values, and the best performing combination is chosen to represent that score in the performance charts. The following are all the data sets that were analyzed.
FLY_BLASTODERM: These are 82 non-overlapping CRMs with expression in the blastoderm-stage embryo of the fruitfly, D. melanogaster, as listed in the REDfly database (Gallo et al., 2006).
FLY_PNS: These are 23 experimentally validated CRMs (average length 998 bp) driving expression in the peripheral nervous system in the fruitfly (Gallo et al., 2006).
FLY_TRACHEAL: Nine CRMs (average length 1220 bp) involved in regulation of the tracheal system in the fruitfly (Gallo et al., 2006).
FLY_EYE: A set of 17 CRMs (average length 894 bp) expressing in the Drosophila eye (Gallo et al., 2006).
HUMAN_MUSCLE: A set of human CRMs regulating muscle specific gene expression (Wasserman and Fickett, 1998). This set had 28 CRMs with an average length of 450 bp.
HUMAN_LIVER: Nine CRMs driving expression specific to the human liver (Krivan and Wasserman 2001), each being of length 201 bp.
HUMAN_HBB: We took the CRMs regulating the HBB complex (a set of developmentally regulated, erythroid-specific genes encoding ß-globin in human), obtained from King et al. (2005). This set has 17 sequences with mean length 453 bp.
Thus, we created four sets of tissue-specific CRMs from the D. melanogaster genome, and three sets from the human genome. (We only considered sequences greater than 200 bp in length.) For each such set of CRMs, which constituted the positive set, a negative set was created by randomly choosing non-coding sequences from the same genome, with lengths matching those of the CRM set.
We begin with a careful examination of the results on the two largest data sets FLY_BLASTODERM and HUMAN_MUSCLE, shown in Figure 2. In the FLY_BLASTODERM experiment, the top 300 pairs in terms of the D2z score include 220 pairs (73%) from the positive set, i.e. the pairs scored as being most similar are predominantly the functionally related pairs. Compared to a random expectation of 150 pairs (50%), this is a highly significant result demonstrating that the D2z score is successful at detecting the functional similarity of blastoderm CRMs. The D2z score is significantly better than all other scores, the next best being the Poisson score, with 188 positive pairs (62%) in the top 300. Similar results are obtained from the HUMAN_MUSCLE data set. Here, the top 300 scoring pairs included 206 (69%) positive pairs under the D2z scoring scheme, significantly better than random (150; 50%). The next best scoring method was the Linear Correlation Coefficient (LCC), with 194 (65%) positive pairs in the top 300. The Poisson score performed significantly worse than the D2z score, with only 172 (57%) positives in the top 300.
|
We summarize the results on all seven data sets analyzed, in Table 1. The D2z score outperformed all other scores in each of the data sets analyzed, with the Poisson score being the next best method in five data sets. In each case, the fraction of positive pairs was significantly better than 50%. In five data sets, the same combination of parameters (k = 5,
|
3.2 Comparing sets of sequences
Here we extend the above tests, which compared one sequence with another, to comparing sets of sequences. The motivation of this exercise comes from the challenging bioinformatics task of predicting sets of related CRMs, given a set of co-regulated genes. To solve this problem, the search algorithm needs a way to measure the overall similarity shared by a set of candidate CRMs, and it will be useful in this context to have a score that can identify two related subsets of CRMs. We therefore adapted the seven data sets constructed above, as follows. The positive set was randomly partitioned into two disjoint subsets of equal size, and each subset of sequences was concatenated to produce one longer sequence. Thus, each random partitioning produced a pair of sequences, which are then compared against each other. A hundred such random partitionings were done, producing 100 pairs of random sequences. The negative set also produced 100 pairs of sequences, by a similar construction. Each of our scores of interest was then computed for each of these 200 pairs, and we plotted how many positive pairs were included in the top 100 pairs. The ideal score would place only positive pairs in the top 100.
On all but one data set (HUMAN_HBB), the D2z score produced the ideal results, i.e., all positive pairs scored above all negative pairs. On five of these data sets, the Poisson score also showed close to ideal performance. However, on the HUMAN_LIVER data set, all other methods gave below 71% accuracy, compared to D2z score's 100% accuracy.
These results strongly suggest the potential for using D2z in ab initio CRM discovery, though building such algorithms that exploit this score is beyond the scope of this article.
3.3 Evaluation on orthologous regulatory sequences
Here, we assessed the ability of different scoring methods to distinguish orthologous CRMs from randomly chosen pairs of orthologous sequences. We chose to work with a set of 52 CRMs related to the blastoderm development in D. melanogaster, whose orthologs in a distantly related fruitfly D. mojavensis could be obtained based on rough alignment maps. With these pairs of orthologous blastoderm CRMs forming our positive set, a negative set was constructed by choosing 52 length-matched random sequences from the D. melanogaster genome, and pairing them up with their D. mojavensis orthologs. These two fruitfly species were chosen because they are close enough evolutionarily to have rough alignment maps that reveal orthologs on the scale of
1 kbp, yet are far enough that non-functional ortholog pairs are expected to have relatively little sequence similarity.
As in the previous subsection, we then used each scoring scheme to rank all 104 sequence pairs (i.e. both CRMs and random pairs), and asked how many positive pairs (orthologous CRMs) show up in the top K pairs, for various
. A random method, which cannot distinguish the two classes, should place K / 2 positive pairs in the top K. Figure 3 shows the results of this test. We first observe that the D2z score performed better than all other similarity scores. As the figure shows, all ten of the top 10 pairs were positive, and the number of positive in the top 20, 30, 40 and 50 pairs were 15 (75%), 21 (70%), 27 (68%) and 31 (62%), respectively. These numbers mean for example that at a sensitivity of 40% (21 of 52), D2z score predicts orthologous CRM pairs with 70% specificity (21 of 30). We therefore interpret these results as strong evidence that the D2z score can be used to detect functional similarity between orthologous regulatory sequences, at a level much higher than random orthologs. This would pave the way for using cross-species comparison to locate CRMs genome-wide, in the absence of any motif information.
|
| 4 DISCUSSION AND CONCLUSIONS |
|---|
|
|
|---|
We have shown how a new sequence similarity score, the D2z score, can be used to detect functional and/or evolutionary similarities among regulatory sequences, in an alignment-free manner. The score is normalized, under the commonly used assumption of sequences being generated by Markov chains, and can hence be compared across different sequence pairs.
An advantage of the D2 score is that its definition can be easily generalized to allow mismatches while measuring sequence similarity. The mean and variance of this mismatch-allowed version of D2 can be computed efficiently, assuming a zeroth order Markov chain as the background model.
The
score assesses how unlikely the observed D2 score is if the two sequences are unrelated (generated independently). This is similar in spirit to the BLAST e-value that measures how surprising a particular alignment score is if the two sequences are independent.
The
score can be adapted to a more limited set of k-words (motifs), as in van Helden (2004). The computations of the mean and variance of D2 in this case can be easily adjusted from our calculations.
The D2 score and its z-score may also be used to compare two protein sequences or coding sequences, even though we have mostly been concerned above with regulatory sequences. For coding sequences we may set k to be some multiple of 3, and use a second-order Markov chain as background.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
MRK was funded in part by Illinois Sociogenomics Initiative.
Conflict of Interest: none declared.
| REFERENCES |
|---|
|
|
|---|
Altschul SF, et al. Basic Local Alignment Search Tool. J. Mol. Biol, ( (1990) ) 215, : 403–410.[CrossRef][ISI][Medline].
Blaisdell BE. A Measure of the Similarity of Sets of Sequences not Requiring Sequence Alignment. Proc. Natl Acad. Sci. USA, ( (1986) ) 83, : 5155–5159.
Fichant G, Gautier C. Statistical method for predicting protein coding regions in nucleic acid sequences. Comput. Appl. Biosci, ( (1987) ) 3, : 287–295.
Gallo SM, et al. REDfly: a Regulatory Element Database for Drosophila. Bioinformatics, ( (2006) ) 22, : 381–383.
Kleffe J, Borodovsky M. First and second moment of counts of words in random texts generated by Markov chains. Comput. Appl. Biosci, ( (1992) ) 8, : 433–441.
Kantorovitz MR, et al. Asymptotic behavior of k-word matches between two random sequences. in preparation..
Lippert RA, et al. Distributional regimes for the number of k-word matches between two random sequences. Proc. Natl Acad. Sci. USA, ( (2002) ) 99, : 13980–13989.
King DC, et al. Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome Res, ( (2005) ) 15, : 1051–1060.
Krivan W, Wasserman WW. A predictive model for regulatory sequences directing liver-specific transcription. Genome Res, ( (2001) ) 11, : 1559–1566.
Rajewsky N, et al. Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics, ( (2002) ) 3, : 30.[CrossRef][Medline].
Sinha S, Tompa M. A statistical method for finding transcription factor binding sites. Proc. Int. Conf. Intell. Syst. Mol. Biol, ( (2000) ) 8, : 344–354.[Medline].
Stuart GW, Moffett K, Baker S. Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics, ( (2002) ) 18, : 100–108.
Vinga S, Almeida JS. Alignment-free sequence comparison - a review. Bioinformatics, ( (2003) ) 19, : 513–523.
van Helden J. Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics, ( (2004) ) 20, : 399–406.
Wasserman WW, Fickett JW. Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol, ( (1998) ) 278, : 167–81.[CrossRef][ISI][Medline].
Waterman MS. Introduction to Computational Biology, ( (2000) ) New York: Chapman & Hall/CRC..
Wu TJ, Burke JP, Davison DB. A Measure of DNA Sequence Dissimilarity Based on Mahalanobis Distance between Frequencies of Words. Biometrics, ( (1997) ) 53, : 1431–1439.[CrossRef][ISI][Medline].
Wu TJ, Hsieh YC, Li LA. Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition. Biometrics, ( (2001) ) 57, : 441–443.[CrossRef][ISI][Medline].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







