Skip Navigation

Bioinformatics 2007 23(13):i249-i255; doi:10.1093/bioinformatics/btm211
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Kantorovitz, M. R.
Right arrow Articles by Sinha, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kantorovitz, M. R.
Right arrow Articles by Sinha, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

A statistical method for alignment-free comparison of regulatory sequences

Miriam R. Kantorovitz 1,2, Gene E. Robinson 2,3 and Saurabh Sinha 1,3,*

1Department of Computer Science, 2Department of Entomology, 3Institute for Genomic Biology, University of Illinois, Urbana-Champaign, Illinois, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EVALUATION AND COMPARISON
 4 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: The similarity of two biological sequences has traditionally been assessed within the well-established framework of alignment. Here we focus on the task of identifying functional relationships between cis-regulatory sequences that are non-orthologous or greatly diverged. ‘Alignment-free’ measures of sequence similarity are required in this regime.

Results: We investigate the use of a new score for alignment-free sequence comparison, called the Formula score. It is based on comparing the frequencies of all fixed-length words in the two sequences. An important, novel feature of the score is that it is comparable across sequence pairs drawn from arbitrary background distributions. We present a method that gives quadratic improvement in the time complexity of calculating the Formula score, over the naïve method. We then evaluate the score on several tissue-specific families of cis-regulatory modules (in Drosophila and human). The new score is highly successful in discriminating functionally related regulatory sequences from unrelated sequence pairs. The performance of the Formula score is compared to five other alignment-free similarity measures, and shown to be consistently superior to all of these measures.

Availability: Our implementation of the Formula score will be made freely available as source code, upon publication of this article, at: http://veda.cs.uiuc.edu/d2z/

Contact: sinhas{at}cs.uiuc.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EVALUATION AND COMPARISON
 4 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We begin this research with the question ‘How do we measure the similarity between two biological sequences?’ The most well-studied version of this problem is measuring the similarity between proteins or coding sequences in order to detect homology. The widely accepted solution to this question is based on the notion of alignment. The BLAST program Altschul et al. (1990), which reports the statistical significance of sequence alignment scores, is the de facto standard for comparing two sequences.

Here, we are interested in answering this question with respect to regulatory sequences (e.g. enhancers or promoters) of genes, rather than the coding sequences. Moreover, we focus on the case where the sequences being compared do not demonstrate any statistically significant alignment. This case may arise where the two sequences being compared are not orthologous, yet are functionally related; or they are highly diverged evolutionarily. Thus, we can rephrase our motivating question as ‘How do we measure the similarity between two regulatory DNA sequences, in an alignment-free manner?’ The ability to compare two regulatory sequences for similarity will have very important consequences, as discussed next.

MOTIVATION 1: One of the first, routine steps in the analysis of newly sequenced genomes is to scan the genome for coding sequences homologous to known genes in a related species, typically using a tool like BLAST. Similarly, with growing scientific interest in gene regulation, it will become necessary to be able to detect regulatory regions in the new genome that are homologous to known enhancers or promoters. It is well known that orthologous regulatory sequences show a significantly less level of alignment than their corresponding coding sequences, hence a sensitive method for alignment-free sequence comparison will be required.

MOTIVATION 2: Regulatory sequence comparison can prove vital for the ab initio discovery of cis-regulatory modules (‘CRM's) with a common function. A CRM may be loosely defined as a contiguous non-coding sequence that contains multiple transcription factor binding sites and drives some aspect of a gene's expression profile. Suppose we are given a set of co-regulated genes in a single species, and wish to find in their upstream and downstream regions (henceforth called the ‘control regions’), the CRMs that mediate the common aspect of their expression profiles. The control regions may be tens of Kbp long for each gene (especially for metazoan genomes), while the CRMs to be discovered are often only hundreds of bp long. One must therefore search in the control regions for subsequences (the candidate CRMs) that share some functional similarity. The CRM search algorithm thus requires a method that can discern functional similarity among candidate CRMs based on their sequence similarity. Also, since the different CRMs are only functionally related, and not orthologous, we need the comparison method to be alignment-free.

1.1 Comparison of k-word frequency distributions
One approach to alignment-free sequence comparison is to fix a short word length k, compute the frequency distribution of all k-length words (also called ‘k-words’) in each sequence, and assess the similarity of the two frequency distributions. This approach has been adopted in earlier work on alignment-free comparison of coding sequences (Wu et al., 1997). We expect the same idea to apply to regulatory sequence comparison with k set to the typical length of binding site cores, e.g. k = 6. If each of the two regulatory sequences being compared has several binding sites for the same transcription factors [as is commonly seen in Drosophila developmental enhancers Rajewsky et al. (2002)], comparison of their k-word frequency distributions should reveal this featural similarity. Such a strategy becomes all the more promising if we are interested in comparing one set of (related) regulatory sequences to another set of related regulatory sequences. (The k-word distribution of each set will have high frequencies for the shared motifs.)

1.2 Previous work
Various quantitative measures for the similarity of two sequences, in terms of their k-word frequency distributions, have been proposed in the past, as summarized in the excellent review by Vinga et al. (2003). The simplest similarity score is the Euclidian distance between the two 4k-dimensional vectors of k-word counts (Blaisdell, 1986). Information theoretic measures like the ‘Kullback-Leibler distance’ (also called ‘relative entropy’) (Wu et al., 2001), geometric measures such as the cosine of the angle between the count vectors (Stuart et al., 2002), and statistical measures such as the correlation coefficient (Fichant and Gantier, 1987) have been investigated by different authors in the context of alignment-free sequence comparison. There are also measures that do not treat every k-word's count equally, in view of the fact that different k-words' counts have different probability distributions (Waterman, 2000). Thus, the ‘Standardized Euclidian distance’ and the ‘Mahalanobis’ distance (Wu et al., 1997, 2001) account for the variances of k-words in computing the Euclidian distance between the count vectors.

Furthermore, there have been approaches that do not base the sequence similarity score on counts of all k-words; rather, they use certain techniques to short-list a smaller subset of k-words and compare the counts of these words in the two sequences. For example, Stuart et al. (2002) use singular value decomposition to identify the most ‘important’ of the 4k k-words before computing the cosine of the angle between the k-word count vectors of two sequences. Similarly, van Helden (2004) uses a motif-finding program to identify statistically significant k-words and bases the similarity score of two regulatory sequences on the counts of these significant k-words only.

1.3 Contributions of this work
This paper explores the ‘D2’ score (Lippert et al., 2002), a natural alignment-free similarity measure between two sequences, defined as the number of k-word matches between the sequences. The D2 score can also be expressed as the inner product of the k-word count vectors of the two sequences. However, in an application where several different pairs of sequences drawn from different distributions are to be compared, the D2 score is not suitable. For such applications, any sequence similarity measure must be ‘normalized’ to account for the different background distributions of sequences. In this article:

  1. We propose and demonstrate the use of the ‘D2z score as such a ‘normalized’ measure that captures the statistical significance of the D2 score. It is defined as the number of standard deviations by which the observed value of D2 deviates from its expected value under the background distribution. We present an efficient algorithm for calculating this statistic, and the implementation will be made publicly available.
  2. The D2z score is calculated under the general assumption of the two sequences being generated by Markov chains, which may be different for the two sequences. We exploit a known result on infinite series of stochastic matrices (Kleffe and Borodovsky, 1992) to reduce the algorithm's time complexity from O(Formula ) to O(4k), a speedup that is crucial to the feasibility of the algorithm for realistic values of k.
  3. We perform extensive tests with the Formula score on tissue-specific families of known enhancers (CRMs) in Drosophila and human. Our results show that the Formula score can accurately discriminate (a) functionally related CRMs from unrelated sequence pairs, and (b) orthologous CRMs from orthologous non-regulatory non-coding sequences. To our knowledge, we have demonstrated for the first time that alignment-free similarity scores may be potentially applied to the discovery and analysis of regulatory sequences in metazoan genomes.
  4. The performance of the D2z score is compared to several other similarity measures, and shown to be consistently superior to the compared methods.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EVALUATION AND COMPARISON
 4 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Let Formula and Formula be two sequences of lengths n1 and n2 respectively, over an alphabet Formula of size Formula . Each sequence is generated independently of the other by a (background) Markov Model of order {omega}. The background models may be different for the two sequences. In the special case of {omega} = 0, the background model is iid (independent and identically distributed) letters.

The D2 statistic is defined to be the number of k-word matches between the two sequences A and B, including overlaps. It can be computed as


Formula 1

(1)
where Formula is the indicator variable for a match between the k-words starting at position i in A and at position j in B, and the index set Formula where Formula and Formula .

Another way to think of D2 is as the inner product of the vectors of word counts in A and B. More explicitly, let Formula be the set of all k-words on the alphabet Formula . For Formula let Formula (resp. Formula ) be the number of times the word w appears in the sequence A (overlaps allowed). Then Formula is the count vector for the sequence A. Then, we can also write the D2 score as Formula

The Formula score for the sequences A and B is defined by


Formula 2

(2)
where E(D2) and Formula are the expectation and the standard deviation of Formula respectively. This score measures the number of standard deviations by which the observed value of D2 deviates from the mean. The Formula statistic is approximately standard normal when the lengths of the sequences are large enough, as discussed in Section 2.3.

Next we outline the key steps in the computation of Formula . We give formulas for the mean and variance of D2 which can be computed efficiently. We treat the iid case (Formula ) and the Markov model case (Formula ) separately since the formulas for the iid case are much simpler. In either case, our implementation computes the Formula score of two ~1 kbp long sequences in less than 1 second on a workstation.

2.1 Expectation
2.1.1 IID model For Formula , write f Aa (resp. f Ba) for the background probability of letter a in the sequence A (resp. B). Also define Formula Then


Formula

and by Equation (1), Formula .

2.1.2 Markov Model (MM) We work with MM of order one and leave the extension to the general case to Supplementary Material. Here we follow notation from Kleffe and Borodovsky (1992) and Sinha and Tompa (2000). For sequence Formula , let pjT(a) be the probability of an a in position j in T. As justified in Kleffe and Borodovsky (1992), we may assume pjT is independent of j and denote it by pT. The probabilities Formula are calculated as the steady state probabilities of the background Markov model. For a k-word Formula , write Formula for Formula , i.e. the probability of a k-word in T being w, given that its first letter is w1. With this notation, Formula . Then


Formula 3

(3)

2.2 Variance
By definition, Formula

Formula We use techniques from Lippert et al. (2002) and Waterman (2000) to compute the covariance terms Formula . Any pair Formula corresponds to two pairs of k-words, whose relative locations may be described by one of three different scenarios, as shown in Figure 1. For any Formula , let Formula be those pairs (s, t) whose location with respect to (i, j) is as shown in Figure 1b or 1c. Formally, Formula . Pairs of the type shown in Figure 1c are called ‘accordions’ and form the set Formula , while pairs of the type Figure 1b are called ‘crabgrasses’ and form the set Formula .


Figure 1
View larger version (3K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Three scenarios for two pairs of matching words between sequences A and B. (a) No overlaps, (b) overlap in one sequence only: ‘crabgrass’, (c) overlap in both sequences: ‘accordion’.

 
2.2.1 IID model The covariances Formula are computed by looking separately at the three cases of Figure 1.

Case 1: Formula (no overlap). Formula are independent; hence Formula .

Case 2: Formula (‘crabgrass’ terms). Consider the case where Formula and Formula , i.e. overlap in A, not in B. Then Formula


Formula

Hence Formula By switching the roles of A and B in the above computation and using symmetry, we can compute the contribution of all crabgrasses to the variance. This formula is left for the Supplementary Material.

Case 3: Formula ("accordion" terms). This is shown in the Supplementary Materials.

2.2.2 Markov model We work with MM of order one. Extension to the general case is explained in the Supplementary Material. As in the iid model, we divide the computation into three cases (Fig. 1). However, now, in the first case (Fig. 1a), with no overlap in either sequence, the covariance Formula may not be zero. Here we work out a formula for this case and also illustrate how we reduce the complexity of computation from Formula to Formula . Computations of the crabgrass and accordion terms use the same ideas and are shown in the Supplementary Material.

Case 1: Formula (no overlaps). Let Formula and Formula .


Formula

where Formula is the probability of seeing u1, l steps after wk in the Markov chain of T. Summing over all such terms we get the total contribution to variance as


Formula 4

(4)
where Formula . This formula has computational complexity Formula , due to the double summation on u and w. We use results from Kleffe and Borodovsky (1992) to reduce the complexity to O(d k) as follows. For Formula , let Formula where Formula , and similarly define Formula . Then the sum in Equation (4) can be written as


Formula

which is a computation of order Formula . The n2 in the complexity estimation comes from computing the term Formula . To further reduce computational complexity to O(d k) we use the following approximation of Sb,c. By Kleffe and Borodovsky (1992), when the length of the sequence is large enough,


Formula 5

(5)
where P is the Formula matrix of transition probabilities, Formula , with Formula and p is the vector of steady state probabilities. The vector ec is the unit vector that picks up the column corresponding to c.

The total contribution of such terms to the variance is then obtained from this sum by subtracting Formula , where Formula is as computed in Equation (3).

2.3 Normality of the D2 statistics
In the iid model, the D2 statistic is asymptotically normal by Lippert et al. (2002) for non-uniform background model and by Kantorovitz et al. (in preparation) in the uniform case. (Hence, Formula is standard normal.) For the Markov Models, we used simulations to test normality. Our simulations showed that even when the length of the sequences is as short as 400, the Kolmogorov–Smirnov (K-S) test give good p-values. For example, when k = 5 and Markov order Formula , with sample size of Formula , and sequence lengths n = 400, the K-S test gave p-value Formula . (High K-S p-value corresponds to good normal approximation.)

2.4 Background models
In practice, we are only given the two sequences to compare, not the Markov chains. We infer these generative models by maximum likelihood estimation—the ({omega} + 1)-word frequencies in each sequence are used to train the Markov chain of that sequence. This practice of assuming a ‘local background distribution’ has been adopted in several previous bioinformatics analysis, and is especially meaningful for the longer metazoan genomes that demonstrate great variability in local sequence composition. Of course, the program can also be made to run with a ‘global background model’ assumed to generate all sequences.

Markov chains are a popular and intuitive model of biological sequence generation, and most regulatory sequence analysis tools utilize this model. A first-order Markov chain, for instance, captures adjacent nucleotide dependencies, such as the abundance of CG dinucleotides in some parts of the genome. We feel it is imperative to allow for such higher-order Markov chains in assessing the statistical significance of D2 scores.

2.5 Other measures
For evaluation purposes, we also implemented other alignment-free measures of pairwise sequence similarity (or dissimilarity). This includes

  1. ‘ed’: The euclidean distance between the k-word count vectors Formula and Formula (Blaisdelle, 1986).
  2. ‘kld’: Kullback–Leibler discrepancy (Wu et al., 2001) between the two k-word frequency distributions {fiA} and {fiB}, defined as Formula . We made this score symmetric by taking the mean of Formula and Formula .
  3. ‘p.p’ and ‘p.a’: van Helden's ‘Poisson scores’ (van Helden 2004), which are defined as follows. Let Formula , let F(x, m) be the Poisson distribution with mean m, and Formula (resp. Formula ) be the expected number of occurrences of the word w in A (resp. B). Then Formula


    Formula

    Also, Formula


    Formula


  4. ‘lcc’: Pearson correlation coefficient between k-word count vectors Formula and Formula (3).
  5. ‘cos’: The cosine of the angle between the k-word count vectors Formula and Formula (12).

In each of the above scores, all possible k-words' counts are considered. Thus, the ‘cos’ score is computed without the preceding singular value decomposition (SVD) step of [12], and the ‘Poisson’ score is computed without the initial motif-finding step of (14). Our intention here is to assess the scores as measures of sequence similarity, without any assistance from feature selection techniques such as motif-finding or SVD. Such techniques may be used in conjunction with any of the evaluated scores, and their interplay with each of the scores is beyond the scope of this article. Moreover, one of the compelling applications of alignment-free scores are regulatory sequences that may not contain statistically over-represented words (motifs), yet share a significant similarity in their overall word counts. In such applications, motif-finding is not expected to improve similarity measurement.


    3 EVALUATION AND COMPARISON
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EVALUATION AND COMPARISON
 4 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
Our goal here is to evaluate if functionally and/or evolutionarily related sequence pairs are scored better than unrelated pairs of sequences randomly chosen from the genome (Sections 3.1 and 3.3). We will also evaluate if two sets of related sequences score better (when compared to each other) than two sets of unrelated sequences (Section 3.2).

3.1 Evaluation on functionally related regulatory sequences
To assess the performance on functionally related sequences, we construct each data set as follows. A set of CRMs, known to regulate expression in the same tissue and/or development stage, is taken as the ‘positive’ set. A set of equally many randomly chosen non-coding sequences, with lengths matching the CRMs, is taken as the ‘negative’ set. Each pair of sequences in the positive set is compared, and so is each pair in the negative set. We assess if the pairs from the positive set score higher than pairs from the negative set. This is done by sorting all pairs, whether they are from the positive set or the negative set, in one combined list, then counting how many of the pairs in top half of this list are from the positive set. That is, we count the number of positive pairs in the top Formula scoring pairs, where n is the number of sequences in the positive set. If Formula , we only look at the top 300 pairs. The ideal score should place only positive pairs in the top K, while a random score should place K / 2.

In the following paragraphs, we enumerate the data sets analyzed and the performance of different scores on these. The scores evaluated are: our D2z score and the five scores outlined in Section 2.5 (ed, kld, poisson: p.p & p.a, lcc and cos). All scores are run with word lengths Formula . The D2z score and the Poisson score are run with background models of Markov order 0,1,2. The Poisson score is run both with the ‘additive’ model (p.a) and the ‘product’ model (p.p). For each method, separate tests are done with each combination of parameter values, and the best performing combination is chosen to represent that score in the performance charts. The following are all the data sets that were analyzed.

FLY_BLASTODERM: These are 82 non-overlapping CRMs with expression in the blastoderm-stage embryo of the fruitfly, D. melanogaster, as listed in the REDfly database (Gallo et al., 2006).

FLY_PNS: These are 23 experimentally validated CRMs (average length 998 bp) driving expression in the peripheral nervous system in the fruitfly (Gallo et al., 2006).

FLY_TRACHEAL: Nine CRMs (average length 1220 bp) involved in regulation of the tracheal system in the fruitfly (Gallo et al., 2006).

FLY_EYE: A set of 17 CRMs (average length 894 bp) expressing in the Drosophila eye (Gallo et al., 2006).

HUMAN_MUSCLE: A set of human CRMs regulating muscle specific gene expression (Wasserman and Fickett, 1998). This set had 28 CRMs with an average length of 450 bp.

HUMAN_LIVER: Nine CRMs driving expression specific to the human liver (Krivan and Wasserman 2001), each being of length 201 bp.

HUMAN_HBB: We took the CRMs regulating the HBB complex (a set of developmentally regulated, erythroid-specific genes encoding ß-globin in human), obtained from King et al. (2005). This set has 17 sequences with mean length 453 bp.

Thus, we created four sets of tissue-specific CRMs from the D. melanogaster genome, and three sets from the human genome. (We only considered sequences greater than 200 bp in length.) For each such set of CRMs, which constituted the ‘positive’ set, a ‘negative’ set was created by randomly choosing non-coding sequences from the same genome, with lengths matching those of the CRM set.

We begin with a careful examination of the results on the two largest data sets FLY_BLASTODERM and HUMAN_MUSCLE, shown in Figure 2. In the FLY_BLASTODERM experiment, the top 300 pairs in terms of the D2z score include 220 pairs (73%) from the positive set, i.e. the pairs scored as being most similar are predominantly the functionally related pairs. Compared to a random expectation of 150 pairs (50%), this is a highly significant result demonstrating that the D2z score is successful at detecting the functional similarity of blastoderm CRMs. The D2z score is significantly better than all other scores, the next best being the Poisson score, with 188 ‘positive’ pairs (62%) in the top 300. Similar results are obtained from the HUMAN_MUSCLE data set. Here, the top 300 scoring pairs included 206 (69%) positive pairs under the D2z scoring scheme, significantly better than random (150; 50%). The next best scoring method was the Linear Correlation Coefficient (LCC), with 194 (65%) positive pairs in the top 300. The Poisson score performed significantly worse than the D2z score, with only 172 (57%) positives in the top 300.


Figure 2
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Performance of similarity scores on FLY_BLASTODERM and HUMAN_MUSCLE data sets. Score names are as in Section 2.5, with parameter values as suffix.

 
We summarize the results on all seven data sets analyzed, in Table 1. The D2z score outperformed all other scores in each of the data sets analyzed, with the Poisson score being the next best method in five data sets. In each case, the fraction of positive pairs was significantly better than 50%. In five data sets, the same combination of parameters (k = 5, Formula ) of D2z gave the best performance; on the remaining two also, this combination was better than or equal to any of the non D2z scores (data not shown).


View this table:
[in this window]
[in a new window]

 
Table 1. Performance of sequence similarity scores on seven data sets from fruitfly and human. Column 4: K, the number of top-scoring pairs examined, and the number of positive pairs expected in these, by chance. Col. 5: The best performing method, the number of positive pairs found by it in top K, and percentage relative to K. Col. 6: The next best method, with the same information

 
3.2 Comparing sets of sequences
Here we extend the above tests, which compared one sequence with another, to comparing sets of sequences. The motivation of this exercise comes from the challenging bioinformatics task of predicting sets of related CRMs, given a set of co-regulated genes. To solve this problem, the search algorithm needs a way to measure the overall similarity shared by a set of candidate CRMs, and it will be useful in this context to have a score that can identify two related subsets of CRMs. We therefore adapted the seven data sets constructed above, as follows. The positive set was randomly partitioned into two disjoint subsets of equal size, and each subset of sequences was concatenated to produce one longer sequence. Thus, each random partitioning produced a pair of sequences, which are then compared against each other. A hundred such random partitionings were done, producing 100 pairs of random sequences. The negative set also produced 100 pairs of sequences, by a similar construction. Each of our scores of interest was then computed for each of these 200 pairs, and we plotted how many positive pairs were included in the top 100 pairs. The ideal score would place only positive pairs in the top 100.

On all but one data set (HUMAN_HBB), the D2z score produced the ideal results, i.e., all positive pairs scored above all negative pairs. On five of these data sets, the Poisson score also showed close to ideal performance. However, on the HUMAN_LIVER data set, all other methods gave below 71% accuracy, compared to D2z score's 100% accuracy.

These results strongly suggest the potential for using D2z in ab initio CRM discovery, though building such algorithms that exploit this score is beyond the scope of this article.

3.3 Evaluation on orthologous regulatory sequences
Here, we assessed the ability of different scoring methods to distinguish orthologous CRMs from randomly chosen pairs of orthologous sequences. We chose to work with a set of 52 CRMs related to the blastoderm development in D. melanogaster, whose orthologs in a distantly related fruitfly D. mojavensis could be obtained based on rough alignment maps. With these pairs of orthologous blastoderm CRMs forming our ‘positive’ set, a ‘negative’ set was constructed by choosing 52 length-matched random sequences from the D. melanogaster genome, and pairing them up with their D. mojavensis orthologs. These two fruitfly species were chosen because they are close enough evolutionarily to have rough alignment maps that reveal orthologs on the scale of ~1 kbp, yet are far enough that non-functional ortholog pairs are expected to have relatively little sequence similarity.

As in the previous subsection, we then used each scoring scheme to rank all 104 sequence pairs (i.e. both CRMs and random pairs), and asked how many positive pairs (orthologous CRMs) show up in the top K pairs, for various Formula . A random method, which cannot distinguish the two classes, should place K / 2 positive pairs in the top K. Figure 3 shows the results of this test. We first observe that the D2z score performed better than all other similarity scores. As the figure shows, all ten of the top 10 pairs were positive, and the number of positive in the top 20, 30, 40 and 50 pairs were 15 (75%), 21 (70%), 27 (68%) and 31 (62%), respectively. These numbers mean for example that at a sensitivity of 40% (21 of 52), D2z score predicts orthologous CRM pairs with 70% specificity (21 of 30). We therefore interpret these results as strong evidence that the D2z score can be used to detect functional similarity between orthologous regulatory sequences, at a level much higher than random orthologs. This would pave the way for using cross-species comparison to locate CRMs genome-wide, in the absence of any motif information.


Figure 3
View larger version (18K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Discriminating orthologous CRMs from randomly chosen orthologous sequences. The y-axis shows how many of the top K (x-axis) scoring pairs are orthologous CRMs. Score names are as in Section 2.5, with parameter values as suffix.

 

    4 DISCUSSION AND CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EVALUATION AND COMPARISON
 4 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have shown how a new sequence similarity score, the D2z score, can be used to detect functional and/or evolutionary similarities among regulatory sequences, in an alignment-free manner. The score is ‘normalized’, under the commonly used assumption of sequences being generated by Markov chains, and can hence be compared across different sequence pairs.

An advantage of the D2 score is that its definition can be easily generalized to allow mismatches while measuring sequence similarity. The mean and variance of this ‘mismatch-allowed’ version of D2 can be computed efficiently, assuming a zeroth order Markov chain as the background model.

The Formula score assesses how unlikely the observed D2 score is if the two sequences are unrelated (generated independently). This is similar in spirit to the BLAST e-value that measures how surprising a particular alignment score is if the two sequences are independent.

The Formula score can be adapted to a more limited set of k-words (motifs), as in van Helden (2004). The computations of the mean and variance of D2 in this case can be easily adjusted from our calculations.

The D2 score and its z-score may also be used to compare two protein sequences or coding sequences, even though we have mostly been concerned above with regulatory sequences. For coding sequences we may set k to be some multiple of 3, and use a second-order Markov chain as background.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EVALUATION AND COMPARISON
 4 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 
MRK was funded in part by Illinois Sociogenomics Initiative.

Conflict of Interest: none declared.


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 EVALUATION AND COMPARISON
 4 DISCUSSION AND CONCLUSIONS
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Altschul SF, et al. Basic Local Alignment Search Tool. J. Mol. Biol (1990) 215:403–410.[CrossRef][Web of Science][Medline]

    Blaisdell BE. A Measure of the Similarity of Sets of Sequences not Requiring Sequence Alignment. Proc. Natl Acad. Sci. USA (1986) 83:5155–5159.[Abstract/Free Full Text]

    Fichant G, Gautier C. Statistical method for predicting protein coding regions in nucleic acid sequences. Comput. Appl. Biosci (1987) 3:287–295.[Abstract/Free Full Text]

    Gallo SM, et al. REDfly: a Regulatory Element Database for Drosophila. Bioinformatics (2006) 22:381–383.[Abstract/Free Full Text]

    Kleffe J, Borodovsky M. First and second moment of counts of words in random texts generated by Markov chains. Comput. Appl. Biosci (1992) 8:433–441.[Abstract/Free Full Text]

    Kantorovitz MR, et al. Asymptotic behavior of k-word matches between two random sequences. in preparation.

    Lippert RA, et al. Distributional regimes for the number of k-word matches between two random sequences. Proc. Natl Acad. Sci. USA (2002) 99:13980–13989.[Abstract/Free Full Text]

    King DC, et al. Evaluation of regulatory potential and conservation scores for detecting cis-regulatory modules in aligned mammalian genome sequences. Genome Res (2005) 15:1051–1060.[Abstract/Free Full Text]

    Krivan W, Wasserman WW. A predictive model for regulatory sequences directing liver-specific transcription. Genome Res (2001) 11:1559–1566.[Abstract/Free Full Text]

    Rajewsky N, et al. Computational detection of genomic cis-regulatory modules applied to body patterning in the early Drosophila embryo. BMC Bioinformatics (2002) 3:30.[CrossRef][Medline]

    Sinha S, Tompa M. A statistical method for finding transcription factor binding sites. Proc. Int. Conf. Intell. Syst. Mol. Biol (2000) 8:344–354.[Medline]

    Stuart GW, Moffett K, Baker S. Integrated gene and species phylogenies from unaligned whole genome protein sequences. Bioinformatics (2002) 18:100–108.[Abstract/Free Full Text]

    Vinga S, Almeida JS. Alignment-free sequence comparison - a review. Bioinformatics (2003) 19:513–523.[Abstract/Free Full Text]

    van Helden J. Metrics for comparing regulatory sequences on the basis of pattern counts. Bioinformatics (2004) 20:399–406.[Abstract/Free Full Text]

    Wasserman WW, Fickett JW. Identification of regulatory regions which confer muscle-specific gene expression. J. Mol. Biol (1998) 278:167–81.[CrossRef][Web of Science][Medline]

    Waterman MS. Introduction to Computational Biology (2000) New York: Chapman & Hall/CRC.

    Wu TJ, Burke JP, Davison DB. A Measure of DNA Sequence Dissimilarity Based on Mahalanobis Distance between Frequencies of Words. Biometrics (1997) 53:1431–1439.[CrossRef][Web of Science][Medline]

    Wu TJ, Hsieh YC, Li LA. Statistical measures of DNA sequence dissimilarity under Markov chain models of base composition. Biometrics (2001) 57:441–443.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
Q. Dai, Y. Yang, and T. Wang
Markov model plus k-word distributions: a synergy that produces novel statistical measures for sequence comparison
Bioinformatics, October 15, 2008; 24(20): 2296 - 2302.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Kantorovitz, M. R.
Right arrow Articles by Sinha, S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kantorovitz, M. R.
Right arrow Articles by Sinha, S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?