Skip Navigation


Bioinformatics Advance Access originally published online on August 18, 2008
Bioinformatics 2008 24(20):2296-2302; doi:10.1093/bioinformatics/btn436
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/20/2296    most recent
btn436v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Dai, Q.
Right arrow Articles by Wang, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Dai, Q.
Right arrow Articles by Wang, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison

Qi Dai *, Yanchun Yang and Tianming Wang

Department of Applied Mathematics, Dalian University of Technology, Dalian 116024, China

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULT AND DISCUSSION
 4 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Many proposed statistical measures can efficiently compare biological sequences to further infer their structures, functions and evolutionary information. They are related in spirit because all the ideas for sequence comparison try to use the information on the k-word distributions, Markov model or both. Motivated by adding k-word distributions to Markov model directly, we investigated two novel statistical measures for sequence comparison, called wre.k.r and S2.k.r.

Results: The proposed measures were tested by similarity search, evaluation on functionally related regulatory sequences and phylogenetic analysis. This offers the systematic and quantitative experimental assessment of our measures. Moreover, we compared our achievements with these based on alignment or alignment-free. We grouped our experiments into two sets. The first one, performed via ROC (receiver operating curve) analysis, aims at assessing the intrinsic ability of our statistical measures to search for similar sequences from a database and discriminate functionally related regulatory sequences from unrelated sequences. The second one aims at assessing how well our statistical measure is used for phylogenetic analysis. The experimental assessment demonstrates that our similarity measures intending to incorporate k-word distributions into Markov model are more efficient.

Availability: The software, data and supplement material are freely available at http://math.dlut.edu.cn/daiqi/mplusd.html.

Contact: daiailiu2004{at}yahoo.com.cn

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULT AND DISCUSSION
 4 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Among biological sequence analysis, some important computational methods are similarity search, phylogenetic analysis and sequence clustering. The similarity search (Altschul et al., 1990, 1997; Pham, 2007; Pham and Zuegg, 2004) is to search a database of known function sequences and uses the structures and functions of the most closely matched known sequences to analyze the query sequence. Phylogenetic analysis (Felsenstein, 1996; Huelsenbeck and Roquist, 2001; Kumar et al., 2004; Lin et al., 2002; Ronquist and Huelsenbeck, 2003; Waddell and Steel, 1997; Waddell et al., 1999a, b, 2001) is the study of the evolutionary history among species. It can also provide useful information for pharmaceutical researchers to determine which medicinal species share the same medical qualities (Komatsu et al., 2001). Clustering sequences (Mohseni-Zadeh et al., 2004; Pipenbacher et al., 2002) is to get a biologically meaningful partition. When sequences are grouped into a family, it can give us some clues about the general features of that family. However, these efficient computational methods rely heavily on the similarity measures defined among biological sequences.

Because of the importance of research into similarity measure, numerous efficient algorithms have been developed, but challenges remain. Moreover, we believe that further improvements in the similarity measures will allow us to design more effective tools. One kind of similarity measures in this area is alignment-based measure. Waterman (1995) and Durbin et al. (1998) provided comprehensive reviews about this method. The search for optimal solutions using alignment-based method encounters difficulties: (i) computational aspect with regard to large databases (Pham and Zuegg, 2004); (ii) the scoring schemes chosen (Vinga and Almeida, 2003). Therefore, the emergence of research into alignment-free measure is apparent and necessary to overcome critical limitations of alignment-based measure.

Up to now, many efficient alignment-free measures have been proposed, but they are still in the early development compared with alignment-based measure. One of the comprehensive reviews (Vinga and Almeida, 2003) reported several similarity measures, such as the Euclidean distance (Blaisdell, 1986), Mahalanobis distance (Wu et al., 1997), Kullback–Leibler discrepancy (Wu et al., 2001), Cosine distance (Stuart et al., 2002), Pearson's correlation coefficient (Fichant and Gautier, 1987), Kolmogorov complexity (Li et al., 2001) and Lempel-Ziv (LZ) complexity (Otu and Sayood, 2003). Recently, several novel statistical measures have been designed for DNA sequence comparison based on Markov model, such as SimMM (Pham and Zuegg, 2004) and D2z (Kantorovitz et al., 2007).

These statistical measures are often associated with k-word distributions, Markov model or both. Markov model and k-word distributions contain the important information about biological sequences. But, they are not absolutely equivalent. Indeed, in k-word distributions, the number of occurrence of each word is fixed, whereas in Markov model it is given on average. The idea for k-word analysis with background of Markov model was used by Van Helden (2004) and Kantorovitz et al. (2007). Whereas, our work presents two novel statistical measures for sequence analysis, which directly adds k-word distributions to Markov model. The contents can be summarized as follows:

  1. Two statistical measures wre.k.r and S2.k.r, as the extended Jensen–Shannon divergence, are proposed. They are defined with the k-word concerns and calculated under Markov chains which may be different for two sequences. Particularly, the statistical measure, S2.k.r, is proved to be a valid distance measure.
  2. Our measures are applied to extensive tests, e.g. similarity search and evaluation on functionally related regulatory sequences. The performance of our measures is compared with other methods based on alignments or alignment-free. The results demonstrate that they are promising statistical measures for sequence comparison with future possible improvement on structure and function prediction.
  3. The distance measure, S2.k.r, is further used to construct phylogenetic tree. The results are in good agreement with the authoritative phylogenies, which indicates that S2.k.r is an efficient measure for phylogenetic analysis.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULT AND DISCUSSION
 4 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 The construction of Markov model for DNA sequences
Markov chain involves two probabilistic measures: transition probability matrix P and initial state distribution {pi}, and can be denoted in a compact form:


Formula 1

(1)
P=[pij] denotes the transition matrix of a discrete-time Markov chain. Each state transition probability pij is defined as follows:


Formula 2

(2)
where an stands for the actual state at time tn (n=1,2,···), and Si and Sj are the i-th and j-th states of a space with N distinct states. In the context of DNA sequences, its state space is {A, C, G, T}. The state transition probabilities are subject to


Formula 3

(3)


Formula 4

(4)

Additional {pi}={pi}i denotes the initial state distribution where


Formula 5

(5)

What we have presented above is the 1-step Markov chain, and the k-step Markov chain can be deduced from the 1-step Markov chain (Isaacson and Madsen, 1976). pFormula is the probability that Markov chain is in state j after k steps from state i, so


Formula 6

(6)
For any three events A, B and C, the following identity is known


Formula 7

(7)
Interpreting A as xk=j, B as xk–1=r and C as x0=i, we have


Formula 8

(8)
which is known as the Chapman–Kolmogorov equation. Hence, the matrix with the elements pFormula are


Formula 9

(9)

2.1.1 Estimate of the parameters
Since the transition probabilities are priori unknown, we have to estimate them according to the observed sequences. Here, we estimate the parameters of Markov model by using the maximum likelihood method (Durbin et al., 1998).

2.1.2 Test of hypotheses
A DNA sequence can be regarded as a discrete-time Markov chain. We test this hypotheses based on the following test statistic (Durbin et al., 1998)


Formula 10

(10)
where P0 is the given transition probability matrix, N(x,y) is the number of appearence of the pair states (x,y) in the given sequence, and S={A, C, G, T}.

2.2 Statistical measures
2.2.1 Previous similarity measures
We now describe seven similarity measures for biological sequences. Many similarity measures, for sequence comparison, are to fix a short word with length k, compute the frequency distributions of all k-length words (also called ‘k-word’) in each sequence and assess the similarity of the two frequency distributions.

  1. Euclidian distance: the Euclidian distance is one of the simple and common similarity measures of biological sequences. The similarity score between two biological sequences is the Euclidian distance between their 4k-dimensional vectors of k-word distributions (Blaisdell, 1986).
  2. Cosine of the angle (cos): in order to derive estimate of relatedness from the vector definitions of biological sequences, Stuart et al. (2002) proposed the pairwise cosine for phylogenetic analysis from whole genome sequences.
  3. Pearson's correlation coefficient (pcc): pcc measures the strength of the linear relationship between two variables. Fichant and Gautier (1987) presented the pearson's correlation to predict protein coding regions of a genome fragment.
  4. Kullback–Leibler discrepancy (kld): the relative entropy, also known as the Kullback–Leibler's distance or the divergence, was introduced by Kullback and Leibler (1951). Wu et al. (2001) applied kld to compare DNA sequences based on the frequencies of all k-words.In contrast to the methods described above, several similarity measures are defined based on Markov model. DNA and protein sequences have been realized to comprise a mixture of local regions that consist of compositional characteristics and pseudo-periodic sequence patterns. Markov chain takes into account this ‘periodical’ behavior of the bio-signal by making use of state transition probability.
  5. D2: the ‘D2’ score (Lippert et al., 2002) is a natural alignment-free similarity measure between two sequences, which is defined as the number of k-word matches between the sequences A and B. Let Formula , and Formula be the index set, where Formula and Formula , and Y(i,j) is the indicator variable for a match between the k-words starting at position i in A and at position j in B, then the D2 score can be computed by


    Formula 11

    (11)

  6. D2z: in an application where several different pairs of sequences drawn from different distributions are to be compared, the previous measure D2 is not suitable. Kantorovitz et al. (2007) proposed and demonstrated the use of the ‘D2z’ score as such a ‘normalized’ measure which captures the statistical significance of the D2 score is as follows:


    Formula 12

    (12)
    In Equation (13), E(D2) and {sigma}(D2) are the expectation and the SD of D2(A, B), respectively. The D2z score is calculated under the general assumption of the two sequences generated by Markov chains, which may be different for the two sequences.

  7. SimMM (D): SimMM is a probabilistic measure based on the concept of comparing the similarity/dissimilarity between two constructed Markov models (Pham and Zuegg, 2004). If {lambda}1=(P1, {pi}1) and {lambda}2=(P2, {pi}2) denote two Markov models of two biological sequences, their probabilistic distance, denoted by D({lambda}1, {lambda}2), can be computed as:


    Formula 13

    (13)


    Formula 14

    (14)
    where O{lambda}j=o1o2···oTj is the biological sequence j, and Tj is the length of the sequence O{lambda}j.

2.2.2 Novel statistical measures
In this subsection, we present two novel statistical measures for biological sequence comparison.

  1. Revised relative entropy (rre): relative entropy is the most important concept in both statistical biology and information theory. It has been explored as similarity measures, such as kld and SimMM to compare biological sequences. However, in an application where pk or P(O{lambda}j|{lambda}i) is equal to 0 or 1, kld(P1, P2) or D({lambda}i, {lambda}j)->{infty}. So kld and SimMM measures become unsuitable. For such an application, we propose a similar statistical measure between two Markov models of order r with the help of Jensen–Shannon divergence, denoted by rre.k.r({lambda}Formula,{lambda}2r), as follows:


    Formula 15

    (15)
    where the set Sk consists of all possible sequences of length k with symbol from the alphabet A. In the context of DNA sequences, A is {A, C, G, T}. Moreover, P(O|{lambda}r) denotes the probability of sequence O under the Markov model {lambda} of order r. From now on, we will restrict our discussion to Markov model with order from 0 to 2. Given Markov model {lambda}r, the probability of sequence O=o1o2···ok, 1≤k≤N, can be computed as follows:


    Formula 16

    (16)
    The symmetric form of rre.k.r, denoted by S1.k.r, is defined by


    Formula 17

    (17)

  2. Adding k-word distributions to Markov model: if o1o2···on is a given sequence of length n, O=oioi+1···oi+k, 1≤i≤nk, 1≤k≤n is a k-word of the given sequence. Its distributions can be estimated by


    Formula 18

    (18)
    where CO is the count of the k-word O in the given sequence.By adding the k-word distributions to Markov model, we present a novel statistical model in a compact form {varsigma}=({lambda}, W). It contains the information from both k-word distributions and Markov model. If {varsigma}Formula=({lambda}1r, W1k) and {varsigma}Formula=({lambda}2r, W2k) are two statistical models of the two bio-sequences of lengths n1 and n2, the weighted relative entropy (wre.k.r) is defined as follows:


    Formula 19

    (19)
    where the set Sk consists of all possible sequences of length k with symbol from the alphabet A; 1≤k≤min{n1,n2}. Moreover, {phi}(O|{varsigma}Formula) denotes the entry of sequence O under the statistical model {varsigma}Formula, which can be computed by


    Formula 20

    (20)
    In addition, if {phi}(O|{varsigma}Formula)={phi}(O|{varsigma}2,kr)=0, we assume that


    Formula 21

    (21)
    Similarly, the symmetric form, denoted as S2.k.r, is defined by S2.k.r({varsigma}Formula,{varsigma}2,kr)


    Formula 22

    (22)
    where |Sk| is the size of the set Sk.

A distance metric, D(·, ·), should satisfy the following conditions:

  1. D(S,Q)≥0 where the equality is satisfied iff S=Q (identity).
  2. D(S,Q)=D(Q,S) (symmetry).
  3. D(S,Q)≤D(S,T)+D(T,Q) (triangle inequality).

In the supplementary material, we prove that S2.k.r satisfies the three conditions and is, therefore, a valid distance metric.

2.3 Evaluation methods
Receiver operating curve (ROC) goes back to signal detection and classification problems and is now widely used (Egan, 1975). This approach is employed in binary classification of continuous data, usually categorized as positive (1) or negative (0) cases. The classification accuracy can be measured by plotting, for different threshold values, the number of true positives (TP), also named sensitivity or coverage versus false positives (FP), or (1-specificity), encountered for each threshold, properly normalized [Equation (23)].


Formula 23

(23)

A ROC curve is simply the plot of sensitivity versus (1-specificity) for different threshold values. The area under a ROC curve (AUC) is a widely employed parameter to quantify the quality of a classificator because it is a threshold independent performance measure and is closely related to the Wilcoxon signed-rank test (Bradley, 1997). For a perfect classifier, the AUC is 1 and for a random classifier the AUC is 0.5. For additional results and comprehensive discussion on AUC measure, see Green and Brenner (2002).


    3 RESULT AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULT AND DISCUSSION
 4 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Similarity search
The proposed statistical measure is used to search for similar sequences of a query sequence from a database of 39 library sequences, of which 20 sequences are known to be similar in biological function to the query sequence, and the remaining 19 sequences are known as being not similar in biological function to the query sequence. This dataset has been studied in Wu et al. (2001), Pham and Zuegg (2004). These 39 sequences were selected from mammals, viruses, plants, etc., of which lengths vary from 322 to 14 121 bases. The query sequence is HSLIPAS (Human mRNA for lipoprotein lipase), which has 1612 bases. These sequences are described in the Supplementary Material.

ROC curves are computed to evaluate and compare the performance of our measures with other measures. All statistical measures based on k-word distributions run with k from 2 to 8. The D2.k.r, D2z.k.r, rre.k.r, wre.k.r, S1.k.r and S2.k.r scores run with background models of Markov order r from 0 to 2 and word length k from 2 to 8. For each measure, the best combination is chosen to represent that score in the performance. The ROC curves obtained for the similarity search are presented in Figure 1a and b. Figure 1a denotes the comparison of similarity measures based on alignment and k-word distributions. Figure 1b denotes the comparison of similarity measures based on Markov model and Makrov model plus k-word distributions.


Figure 1
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. ROC curves for similarity search dataset, ROC(a) denotes the comparison of the similarity measures based on alignment or k-word distributions. ROC(b) denotes the comparison of the similarity measures based on Markov model. Similarity measure names are presented in Section 2, with parameter values as suffix. A random classifier would generate equal proportions of FP and TP classifications, which corresponds to the ROC diagonal (dashed line).

 
The AUC value is typically used as a measure of overall discrimination accuracy. Table 1 provides the AUC obtained from all the similarity measures. Figure 1 and Table 1 show that wre.2.1 performs better than other alignment-based or alignment-free measures on similarity search. Its AUC is 0.980 with the small standard error 0.018 for this estimate. ClustalW outperforms other alignment-free measures, such as cos.3, ed.3, kld.4, D2.8.0, rre.2.1, S1.4.1 and S2.3.1. Among the similarity measures based on k-word distributions, pcc.3 is clearly more efficient than other measures. Among the similarity measures based on Markov model, D outperforms D2.8.0, D2z.5.0, rre.2, S1.4.1 and S2.3.1. The main surprise of this analysis is that when we add the information on k-word distributions to Markov model in our way, wre.2.1 performs better than other similarity measures based on k-word distributions, Markov model or both. But the symmetrical form S2.3.1 has no significant improvement in similarity search. The inspection of the ROC curves themselves (Fig. 1) further illustrates this comparison between similarity measures.


View this table:
[in this window]
[in a new window]

 
Table 1. Comparison of AUCs obtained from all the similarity measures

 
3.2 Evaluation on functionally related regulatory sequences
Regulatory sequence comparison plays an important role in the ab initio discovery of cisregulatory modules (CRMs) with a common function. If a set of co-regulated genes in a single species is given, we wish to find, in their upstream and downstream regions (henceforth called the ‘control regions’), the CRMs that mediate the common aspect of their expression profiles. The control regions may be tens of Kilobase long for each gene (especially for metazoan genomes), while the CRMs to be discovered are often only hundreds of base pair long. One must therefore search in the control regions for subsequences (the candidate CRMs) that share some functional similarity (Kantorovitz et al., 2007).

The proposed statistical measures are further tested to evaluate if functionally or evolutionarily related sequence pairs are scored better than unrelated pairs of sequences randomly chosen from the genome. To assess the performance on functionally related sequences, we construct each dataset as follows. A set of CRMs, known to regulate expression in the same tissue, is taken as the ‘positive’ set. A set of equally many randomly chosen non-coding sequences, with lengths matching the CRMs, is taken as the ‘negative’ set. The following seven datasets are chosen to analyze: FLY BLASTODERM ( 82 CRMs with expression in the blastoderm-stage embryo of the fruitfly, Drosophila melanogaster); FLY PNS [23 CRMs (average length 998 bp) driving expression in the peripheral nervous system in the fruitfly]; FLY TRACHEAL [9 CRMs (average length 1220 bp) involved in regulation of the tracheal system in the fruitfly]; FLY EYE [17 CRMs (average length 894 bp) expressing in the Drosophila eye ]; HUMAN MUSCLE [28 human CRMs (average length 450) regulating muscle specific gene expression]; HUMAN LIVER [9 CRMs (average length 201) driving expression specific to the human liver]; HUMAN HBB [17 CRMs (average length 453) regulating the HBB complex]. They are well studied by Gallo et al. (2006) and Kantorovitz et al. (2007).

Each pair of sequences in the positive set is compared, and so is each pair in the negative set. The evaluation procedure is based on a binary classification of each sequence pair, where 1 corresponds to the pairs from positive set, 0 corresponds to the pairs from negative set. Let n be the number of sequences in the positive set, all the pairs constitute a vector of length 2Formula , which is used as prediction. Also, we can get a vector of length 2Formula consisting of 1 and 0 as class labels. A perfect measure would completely separate negative from positive set. Of course, this does not happen in practice, and the classes are interspersed. The ROC curves permit to assess the level of accuracy of this separation without choosing any distance threshold for the separation point. In particular, the AUC will give us a unique number of the relative accuracy of each measure.

The similarity measures evaluated here should satisfy the symmetry condition. The evaluated measures are: the similarity measures based on alignment, S1.k.r, S2.k.r and the seven measures outlined in Section 2, where the similarity measures based on alignment are Needleman–Wunsch (global alignment) or Smith–Waterman (local alignment) raw scores, with no correction for statistical significance, using linear gap penalties or affine gap penalties, with a gap penalty of 2. All statistical measures based on k-word distributions run with k-word from 2 to 8. The D2.k.r, D2z.k.r, S1.k.r and S2.k.r scores run with Markov order r from 0 to 2 and word length k from 2 to 8. For each measures, separate tests are done with each combination of parameter values, and the best combination is chosen to represent that score in the performance. ROC curves are computed to evaluate and compare the performances of our measures and other measures. The ROC curves obtained for PNS and EYE are presented in Figure 2. All the color figures for seven datasets are presented in the Supplementary Material.


Figure 2
View larger version (43K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. ROC curves for data PNS and EYE. Similarity measure names are presented in Section 2, with parameter values as suffix. A random classifier would generate equal proportions of FP and TP classifications, which corresponds to the ROC diagonal (dashed line).

 
Table 2 summarizes the AUCs obtained from all the similarity measures for seven datasets. In the BLASTODERM experiment, S2.7.2 performs better than other alignment-based or alignment-free measures, with the AUC 0.664. The next best measure is D, and the other measures lag behind. In the PNS experiment, S2.8.2 measure is significantly better than all other scores, its AUC is 0.854. The AUCs of all other measures are below 0.7. In the Fly TRACHEAL experiment, S2.8.2 significantly outperforms other methods, and its AUC is 0.975. It is followed by D and D2z.4.0. This is a highly significant result demonstrating that the S2.8.2 measure is successful at detecting the functional similarity of tracheal CRMs. In the EYE experiment, the AUC of S2.8.2 is 0.833, significantly better than other statistical measures. The next best measures are D and D2z.4.0. In the MUSCLE experiment, D2z.2.0 outperforms other methods, and its AUC is 0.703. It is followed by S2.8.0. In LIVER experiments, S2.5.1 performs significantly better than other measures. The next best measure is the D2z.2.0. In HBB experiments, S2.2.0 achieves the best performance, followed by D2z.3.0. From the seven experiments, we can see that S2.k.r performs significantly better than other measures among six experiments, because it incorporates the k-word information into Markov model directly. The inspection of the ROC curves themselves (in the Supplementary Material) further illustrates these comparisons among similarity measures. Since the different CRMs are only functionally related and not orthologous, the CRM search algorithm requires a method that can discern functional similarity among candidate CRMs based on their sequence similarity. From Table 2, we can note that the alignment-based methods lag behind some alignment-free methods.


View this table:
[in this window]
[in a new window]

 
Table 2. Comparison of AUCs obtained from all the similarity measures for seven datasets

 
3.3 Phylogenetic analysis
Since the proposed measure S2.k.r is a statistical distance measure, it can be further tested by phylogenetic analysis. Given a set of DNA, their phylogenetic relationship can be obtained through the following main operations: first, we construct the statistical models for biological sequences and calculate their similarity distance by using our statistical distance measure S2.k.r; second, by arranging all the similarity distance into a matrix, we obtain a pairwise distance matrix and finally, we put the pairwise distance matrix into the neighbor-joining program in the PHYLIP package (Felsenstein, 1989).

The phylogeny of eutherian orders has been unresolved due to conflicting results obtained from comparison of whole mtDNA sequences and individual proteins encoded by mtDNA (Cao et al., 1998; Lin et al., 2002; Waddell et al., 1999a, b, 2001). We choose a more controversial dataset (29 mammalian species, of which five are rodents deals) that have been widely studied in Reyes et al., (2000), Li et al. (2001) and Otu and Sayood (2003). These sequences are described in the Supplementary Material.

The phylogenetic tree constructed by our measure is presented in Figure 3. The bootstrap technique is employed to evaluate the tree topologies by resampling the sequence 100 times. We obtain the phylogenetic trees drawn by MEGA program (Kumar et al., 2004), bootstrap values, lower than 50, are hidden. Generally, an independent method can be developed to evaluate the accuracy of a phylogenetic tree, or the validity of a phylogenetic tree can be tested by comparing it with authoritative ones. Here, we adopt the latter one to test the validity of our phylogenetic tree.


Figure 3
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Phylogenetic tree obtained by our measure using complete mtDNA. Bootstraps are based on 100 replications.

 
It has been debated which two of the three main groups of placental mammals are more closely related: Primate, Ferungulates and Rodents. This is because by the maximum likelihood method, some protein support the (Ferungulates, (Primates, Rodents)) grouping, while other proteins support the (Rodents, (Ferungulates, Primates)) grouping (Cao et al., 1998). Our method has reconfirmed the hypothesis of (Rodents, (Primates, Ferungulates)). Figure 3 shows that our tree is quite consistent with the authoritative ones (Cao et al., 1998; Li et al., 2001; Otu and Sayood, 2003; Reyes et al., 2000) in the following four aspects: first of all, marsupials and monotremes are grouped closely, the bootstrap values (98% and 100%) are better than Li et al. (2001) (96%); second, five rodents are grouped closely, three non-murid rodents (squirrel, dormouse and guinea pig) are separated from murid rodents (rat and mouse, 100%), but the bootstrap values of three non-murid rodents are lower than Cao et al. (1998), Reyes et al. (2000) and Li et al. (2001); third, primates are closely related to each other, their bootstrap values (100%) are higher than Cao et al. (1998) (89% and 99%) and Reyes et al. (2000) (77% and 90%) and finally, ferungulates are grouped in a branch and clustered with primates with a high bootstrap value 98% [it is better than Cao et al. (1998) (94%), Reyes et al. (2000) (65%) and Li et al. (2001) (89%)]. These results confirm that S2.k.r can be considered as another efficient distance measure for phylogenetic analysis.


    4 CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULT AND DISCUSSION
 4 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Many bioinformatics applications rely heavily on sequence comparison techniques, from searching a database with a query DNA sequence to the phylogenetic tree construction. Despite the prevalence of alignment-based methods, it is also noteworthy that alignment-based method is computationally intensive and consequently unpractical for querying large datasets, which forces the use of some heuristics to reduce the running times, as exemplified by BLAST. Alignment-free comparison method is therefore of great value as it reduces the technical constraints of alignments.

The statistical measures wre.k.r and S2.k.r presented in this work are alignment-free methods for sequence comparison. By incorporating the k-word distributions into Markov model, we construct a novel statistical model {varsigma} for each sequence. The similarity between two sequences can be obtained by computing the log-likelihood difference between their corresponding statistical models. Therefore, they are simple alignment-free methods that yield results reasonably. The test of our methods are to perform similarity search and evaluate the functionally related regulatory sequences. Our statistical measures are evaluated by comparison with methods based on alignment or not. The comparison demonstrates that our measures intending to incorporate k-word distributions into Markov model are more efficient (Tables 1 and 2). We also demonstrate that S2.k.r can be used as a novel distance measure to construct phylogenetic tree, and performs as well as the Kolmogorov complexity and LZ complexity.

Overall our results highlight the necessity for alignment-free measures to extract more information as possible. Thus, this understanding can then be used to guide development of more powerful measures for sequence comparison with future possible improvement on evolutionary study, structure and function prediction.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULT AND DISCUSSION
 4 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
The authors thank all the anonymous referees for their valuable suggestions and support. In particular, the authors thank Prof. Tuan D. Pham for providing all MATLAB code for SimMM and Prof. Saurabh Sinha for provding the software of D2z, datasets and the technical help. The authors also thank Dr Weiping Wang for his kindly help.

Funding: National Natural Science Foundation of China (10571019).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Limsoon Wong

Received on April 15, 2008; revised on August 16, 2008; accepted on August 17, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULT AND DISCUSSION
 4 CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol. (1990) 215:403–410.[CrossRef][Web of Science][Medline]

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.[Abstract/Free Full Text]

    Blaisdell BE. A measure of the similarity of sets of sequences not requiring sequence alignment. Proc. Natl Acad. Sci. USA (1986) 83:5155–5159.[Abstract/Free Full Text]

    Bradley AP. The use of the area under the ROC curve in the evaluation of machine learning algorithms. Pattern Recog. (1997) 30:1145–1159.[CrossRef]

    Cao Y, et al. Conflict among individual mitochondrial proteins in resolving the phylogeny of eutherian orders. J. Mol. Evol. (1998) 47:307–322.[CrossRef][Web of Science][Medline]

    Durbin R, et al. Biological Sequence Analysis (1998) Cambridge, UK: Cambridge University Press.

    Egan JP. Signal Detection Theory and ROC-Analysis (1975) New York: Academic Press.

    Felsenstein J. PHYLIP-Phylogeny inference package (version 3.2). Cladistics (1989) 5:164–166.

    Felsenstein J. Inferring phylogenies from protein sequences by parsimony, distance and likelihood methods. Meth. Enzymol. (1996) 266:418–427.[Web of Science][Medline]

    Fichant G, Gautier C. Statistical method for predicting protein coding regions in nucleic acid sequences. Comput. Appl. Biosci. (1987) 3:287–295.[Abstract/Free Full Text]

    Gallo SM, et al. REDfly: a regulatory element database for Drosophila. Bioinformatics (2006) 22:381–383.[Abstract/Free Full Text]

    Green RE, Brenner SE. Bootstrapping and normalization for enhanced evaluations of pairwise sequence comparison. Proc. IEEE (2002) 90:1834–1847.[CrossRef]

    Huelsenbeck JP, Ronquist F. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics (2001) 17:754–755.[Abstract/Free Full Text]

    Isaacson DL, Madsen RW. Markov Chains Theory and It's Applications (1976) New York, London, sydney, Toronto: John Wiley & Sons.

    Kantorovitz MR, et al. A statistical method for alignment-free comparison of regulatory sequences. Bioinformatics (2007) 23:i249–i255.[Abstract/Free Full Text]

    Komatsu K, et al. Phylogenetic analysis based on 18S rRNA gene and matK gene sequences of Panax vietnamensis and five related species. Planta Med. (2001) 67:461–465.[CrossRef][Web of Science][Medline]

    Kullback S, Leibler RA. On information and sufficiency. Ann. Math. Stat. (1951) 22:79–86.[CrossRef]

    Kumar S, et al. MEGA3: integrated software for molecular evolutionary genetics analysis and sequence alignment. Brief. Bioinform. (2004) 5:150–163.[Abstract/Free Full Text]

    Li M, et al. An information-based sequence distance and its application to whole mitochondrial genome phylogeny. Bioinformatics (2001) 17:149–154.[Abstract/Free Full Text]

    Lippert RA, et al. Distributional regimes for the number of k-word matches between two random sequences. Proc. Natl Acad. Sci. USA (2002) 99:13980–13989.[Abstract/Free Full Text]

    Lin Y, et al. Pika and vole mitochondrial genomes increase support for both rodent monophyly and Glires. Gene (2002) 294:119–129.[CrossRef][Web of Science][Medline]

    Mohseni-Zadeh S, et al. Cluster-C: an algorithm for the large-scale clustering of protein sequences based on the extraction of maximal cliques. Comput. Biol. Chem. (2004) 28:211–218.[CrossRef][Web of Science][Medline]

    Otu HH, Sayood K. A new sequence distance measure for phylogenetic tree construction. Bioinformatics (2003) 19:2122–2130.[Abstract/Free Full Text]

    Pham TD, Zuegg J. A probabilistic measure for alignment-free sequence comparison. Bioinformatics (2004) 20:3455–3461.[Abstract/Free Full Text]

    Pham TD. Spectral distortion measures for biological sequence comparisons and database searching. Pattern Recog. (2007) 40:516–529.[CrossRef]

    Pipenbacher P, et al. ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics (2002) 18:S182–S191.[Abstract]

    Reyes A, et al. Where do rodents fit? Evidence from the complete mitochondrial genome of Sciurus vulgaris. Mol. Biol. Evol. (2000) 17:979–983.[Free Full Text]

    Ronquist F, Huelsenbeck JP. MrBayes 3: Bayesian phylogenetic inference under mixed models. Bioinformatics (2003) 19:1572–1574.[Abstract/Free Full Text]

    Stuart GW, et al. Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics (2002) 18:100–108.[Abstract/Free Full Text]

    Van Helden J. Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics (2004) 20:399–406.[Abstract/Free Full Text]

    Vinga S, Almeida J. Alignment-free sequence comparison-a review. Bioinformatics (2003) 19:513–523.[Abstract/Free Full Text]

    Waddell PJ, Steel MA. General time reversible distances with unequal rates across sites: mixing {gamma} and inverse Gaussian distributions with invariant sites. Mol. Phyl. Evol. (1997) 8:398–414.[CrossRef][Web of Science][Medline]

    Waddell PJ, et al. Towards resolving the interordinal relationships of placental mammals. Syst. Biol. (1999a) 48:1–5.[CrossRef][Web of Science][Medline]

    Waddell PJ, et al. Assessing the Cretaceous superordinal divergence times within birds and placental mammals using whole mitochondrial protein sequences and an extended statistical framework. Syst. Biol. (1999b) 8:119–137.

    Waddell PJ, et al. A phylogenetic foundation for comparative mammalian genomics. Genome Inform. Ser. (2001) 12:141–154.

    Waterman MS. Introduction to Computational Biology: Maps, Sequences, and Genomes: Interdisciplinary Statistics (1995) Boca Raton, FL: Chapman and Hall/CRC.

    Wu TJ, et al. A measure of DNA sequence dissimilarity based on Mahalanobis distance between frequencies of words. Biometrics (1997) 53:1431–1439.[CrossRef][Web of Science][Medline]

    Wu TJ, et al. Statistical measures of DNA dissimilarity under Markov chain models of base composition. Biometrics (2001) 57:441–448.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
24/20/2296    most recent
btn436v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Dai, Q.
Right arrow Articles by Wang, T.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Dai, Q.
Right arrow Articles by Wang, T.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?