Skip Navigation


Bioinformatics Advance Access originally published online on October 28, 2004
Bioinformatics 2005 21(7):938-943; doi:10.1093/bioinformatics/bti090
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/7/938    most recent
bti090v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zheng, W.-M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zheng, W.-M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Relation between weight matrix and substitution matrix: motif search by similarity

Wei-Mou Zheng

Institute of Theoretical Physics, Academia Sinica Beijing 100080, China


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 SUBSTITUTION MATRIX AND...
 3 MOTIF SEARCH BY...
 4 DISCUSSION
 REFERENCES
 

Motivation: The discovery of patterns shared by several sequences that differ greatly is a basic task in sequence analysis, and still a challenge. Several methods have been developed for detecting patterns. Methods commonly used for motif search include the Gibbs sampler, Expectation-Maximization (EM) algorithm and some intuitive greedy approaches. One cannot guarantee the optimality of the result produced by the Gibbs sampler in a single run. The deterministic EM methods tend to get trapped by local optima. Solutions found by greedy approaches are rarely sufficiently good.

Results: A simple model describing a motif or a portion of local multiple sequence alignment is the weight matrix model, in which a motif is characterized with position-specific probabilities. Two substitution matrices are proposed to relate the sequence similarity with the weight matrix. Combining the substitution matrix and weight matrix, we examine three typical sets of protein sequences with increasing complexity. At a low score threshold for pair similarity, sliding windows are compared with a seed window to find the score sum, which provides a measure of statistical significance for multiple sequence comparison. Such a similarity analysis reveals many aspects of motifs. Blocks determined by similarity can be used to deduce a primary weight matrix or an improved substitution matrix. The algorithm successfully obtains the optimal solution for the test sets by just greedy iteration.

Availability: Softwares and sequence datasets are available on request from the author.

Contact: zheng{at}itp.ac.cn


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 SUBSTITUTION MATRIX AND...
 3 MOTIF SEARCH BY...
 4 DISCUSSION
 REFERENCES
 
A motif or pattern is a portion of some local multiple alignment of sequences that is highly conserved. The discovery of patterns shared by several sequences that differ greatly is a basic task in sequence analysis, and still a challenge. Several methods have been developed for detecting patterns (Helden et al., 1998; Hertz et al., 1990; Bailey and Elkan, 1995; Hughey and Krogh, 1996; Eddy, 1995; Conlon et al., 2003; Thompson et al., 2003). A type of motif often discussed in the literature is called a ‘block’ which has no insert or delete positions. We consider only blocks here to avoid the explicit treatment of gaps.

A block may be generated by using sequence alignment tools such as BLAST. However, the quality of blocks derived directly from an alignment is not assured. Methods for finding a localized region of sequence similarity in a set of sequences without first having to produce an alignment have been developed. Methods commonly used for motif search include the Gibbs sampler, Expectation-Maximization (EM) algorithm and some intuitive greedy approaches. One cannot guarantee the optimality of the result produced by the Gibbs sampler in a single run. The deterministic EM methods tend to get trapped by local optima. Solutions found by greedy approaches are rarely sufficiently good.

In sequence alignment, one uses a substitution matrix to score similarity. The statistical motif search methods use a position-specific scoring matrix or weight matrix, which represents the variation found in the aligned positions. We shall discuss the relation between these two kinds of matrices, and present an approach combining both matrices for motif search. We shall test the method with the three sets of protein sequences prepared by Lawrence et al. (1993). Although the method is also valid for DNA sequences, we shall mention only protein sequences.


    2 SUBSTITUTION MATRIX AND WEIGHT MATRIX
 TOP
 Abstract
 1 INTRODUCTION
 2 SUBSTITUTION MATRIX AND...
 3 MOTIF SEARCH BY...
 4 DISCUSSION
 REFERENCES
 
In principle, a substitution matrix can be deduced from given aligned blocks. Motif search generally uses the weight matrix model, in which a motif is described by a position-specific matrix. Suppose the motif width is W. For amino acids, the weight matrix is a W x 20 matrix, whose entries are logarithmic odds, the ratio of the probability qi{alpha} of finding a certain amino acid {alpha} in position i of the motif to p{alpha}, that of finding it in a random background model, i.e.

(1)
When estimating qi{alpha} from counts ni{alpha}, which is the number of amino acid {alpha} at position i of the motif in a block of size n = {sum}{alpha} ni{alpha}, a pseudocount may be added to ni{alpha} to improve statistics especially for small n. Following Lawrence et al. (1993), we use the background pseudocounts to replace ni{alpha} with {rho}{alpha}, where {rho}{alpha} is the frequency of {alpha} in the complete dataset. Taking the logarithmic odds as scores, one maximizes the total score in the motif search. That is, the objective function for optimization, which is also used in this paper, is the average score,

(2)
Introducing the pair scores

(3)
we have

(4)
Thus, pair scores M{alpha}ß can be equivalently used for optimization. The 20 x 20 matrix of M{alpha}ß describes the similarity between a pair of amino acids in motif segments. While each individual sequence of the motif carries position-specific information, position-independent M{alpha}ß measures the closeness between sequences belonging to the same motif.

A popular substitution matrix for amino acids is BLOSUM Henikoff and Henikoff (1992) which is derived from local multiple alignment of distantly related sequences. A substitution matrix like BLOSUM62 is for general purposes, so it may not be very suitable for a motif search. However, a specific substitution matrix good for a motif may be constructed from the motif or a single aligned block. Following Henikoff and Henikoff (1992), in the above notation, we may write the entries S{alpha}ß of the matrix S as

(5)
Compared with Equation (3), the order of taking the logarithm and average is reversed (arithmetic mean versus geometric mean). Written in terms of amino acid count ni{alpha}, the pair counts in column i are ni{alpha} (niß {delta}{alpha}ß), so that, replacing qi{alpha} with its unbiased estimator, expression (1) becomes

(6)
where {delta}{alpha}ß = 1 for {alpha} = ß, and 0 otherwise, and pseudocounts may be added to improve the statistics. We use the simple Laplace rule by adding pseudocount 1 for each ni{alpha}. The formula can be extended to more than one block.

A substitution matrix reflects our knowledge about amino acid similarity. We shall use such knowledge to help the motif search in three sets of sequences of diverse complexity prepared by Lawrence et al. (1993).


    3 MOTIF SEARCH BY SIMILARITY
 TOP
 Abstract
 1 INTRODUCTION
 2 SUBSTITUTION MATRIX AND...
 3 MOTIF SEARCH BY...
 4 DISCUSSION
 REFERENCES
 
3.1 Single block motif
The first set is a set of helix–turn–helix (HTH) proteins containing the HTH motif for DNA-binding involved in gene regulation. It consists of 30 sequences , each of which contains a common pattern of 18 residues; i.e. the motif width is 18. The lengths of these sequences range between 61 and 524, with the average being 239. The common pattern or motif is determined by the motif starting position {ak} at each sequence. Once {ak} are known, the weight matrix M can be estimated. In contrast, if the weight matrix is known, it can be taken as the scores to scan the sequences to locate {ak}. However, neither {ak} nor the weight matrix is known. One way to break the loop is to start with an initialization, e.g. a random choice of {ak(0)}. At step n, weight matrix M(n) estimated from {ak}(n) may be used to update {ak}(n) to {ak}(n+1). There are roughly three ways of updating: fuzzy updating (e.g. the EM method), stochastic updating [e.g. the Gibbs sampler, or more general Metropolis–Hastings sampling Hastings (1970)] and greedy updating. The simplest is the greedy updating, where the one with the highest score is picked up.

The EM and greedy methods are local. Even stochastic samplers are prone to being trapped by a local optimal. Furthermore, random initialization does not employ the knowledge of amino acid similarity. Although at the beginning we do not have the substitution matrix specific for the motif to be found, we still have matrices like BLOSUM62 for general purposes. Let us start the motif search with BLOSUM62. Two segments with high similarity are close neighbors. Intuitively, we regard a motif as a group of close neighbors, in which some members would have more near neighbors than others. The similarity of two sequences Y = y1y2...yW and Z = z1z2 ... zW may be measured with the score sum, . We give a rough estimation of the threshold s0 of s for pair comparison as follows. If both Y and Z are motif segments, the expected number of times for the motif match to occur when two L long sequences are compared is estimated to be . If Y and Z are extracted from the background, we have a similar estimation except that now q{alpha} is replaced by p{alpha}. From the definition of the substitution matrix, the ratio of the expected number of matches found in the random background to that in sequences containing the motif is es. Since the expected number of motif matches is about 1, a reasonable s0 is determined from

(7)
which gives s0 {approx} 20. (The fact that BLOSUM is in units of half-bit has been considered in this estimation of statistical significance.) Taking each sliding window of width W from {Xk} as a seed, skipping the sequence that the seed is in, we search every sequence for the most similar neighbor of the seed in each sequence which has a similarity not less than s0, and find the score sum of these neighbors. A seed and its near neighbors form a block, from which a primitive weight matrix can be deduced. The weight matrix is then used to scan each sequence for updating positions {ak}. For locating the motif with the weight matrix score, we estimate another threshold w0 of the weight matrix sum score w in a similar way as that for s0 from

(8)
which gives w0 {approx} 7. The difference between w0 and s0 is the trivial difference in the base of the logarithm and scaling.

While we do only greedy updating, to be less greedy, we keep motifs of the top K highest sum scores derived from different seeds, instead of just the top one. The top 10 seeds for s0 = 20 are listed in Table 1. (Note that positions 223 and 94 for the first and second sequences in the original set correspond to 225 and 198 in the set used here, respectively. This is due to the longer lengths of these sequences in the new set. However, the motif segments do not change.)


View this table:
[in this window]
[in a new window]
 
Table 1 Top 10 seeds for s0 = 20 in the set with single motif

 
The top one seed is at position a2 = 198 of the second sequence, which has 22 near neighbors. Using the block of these 23 sequences, we estimate the weight matrix, and then perform iteration of the greedy updating with w0 = 7. The iteration converges right at the second step, but the position of the motif for sequence 7 is located at 77 instead of 5, and no motif is identified for sequences 3, 4, 14 and 24. The threshold w0 = 7 is good for estimating statistical significance when the weight matrix for the whole set is finally obtained. For the above iteration, a lower w0, say 0 or –1, should be used in scanning sequences with a non-optimal weight matrix. Since we look for only the optimal position, the exact value of w0 is not so important. With such a low w0, now we are able to find the motif in all sequences, but positions of sequences 3 and 7 are not correct. The result from the top second seed a25 = 205 is a little better: only the position of sequence 3 is incorrect.

We construct a substitution matrix S from the block picked up by the top one seed a2 = 198 according to Equation (6). Using this S, we rebuild a block from the same seed a2 = 198, and then estimate a weight matrix from the block for iteration. This gives the right answer. The details are shown in Table 2. In Table 1 there are four seeds, a2 = 198, 199, 197 and 201, which are close to each other, and all belong to the same sequence. This indicates that a wider motif should be examined.


View this table:
[in this window]
[in a new window]
 
Table 2 Searching steps for the fixed seed a2 = 198

 
We have also tried the other way of constructing a substitution matrix with Equation (3). With this new matrix M, at s0 = 20, the seed a2 = 198 has a neighbor in each of the other 29 sequences, but the positions in sequences 4 and 24 are not correct. The score s from M is much higher than that from S. Even at s0 = 31, the seed a2 = 198 still has 29 neighbors (with block size 30). If a higher s0, say 36, is used, the block size reduces to 27, which is comparable with the above case of substitution matrix S. After estimating a weight matrix from the rebuilt block for iteration, we can again find the right answer. We shall not discuss matrix M any further.

3.2 Double block motif
Most protein sequence families contain multiple colinear elements separated by gaps of variable length. One example which was once regarded as ‘one of the most difficult’ is the set of five lipocalins containing two weak motifs of width 16 recognized from structural comparisons. The average length of these sequences is 182. It is known that conventional automated sequence alignment tools fail to align these motifs (Boguski et al., 1992). For this task of motif search, we need to determine two sets of positions and for the two motifs. Still taking s0 = 20 and just using BLOSUM62 as the substitution matrix for similarity comparison, we find the sum of the top six seeds of high score as shown in Table 3. As before, only the neighbor most similar to a seed with s ≥ s0 in each sequence is kept. From the table we see first that, from the similarity point of view, the two motifs are not so weak, one of which is even supported by two seeds. Second, all of the positions are aligned correctly, but the positions of the second motif is shifted by 1. Third, the first motif might be wider than 16. We use the two blocks aligned by the seeds a4 = 14 and a1 = 105 to construct the substitution matrix S. Both the motifs are then correctly located by means of S. Of course, one can also correct the phase shift by a shifting operation as described in Lawrence et al. (1993). Furthermore, if we take s0 = 19 instead of 20, the correct seed a1 = 104 turns to have a higher sum score than the shifted one a1 = 105. (The similarity score between a1 = 104 and a5 = 109 is 19, while that between a1 = 105 and a5 = 110 is 22.) A s0 not greater than 19 does not change the top K seeds. For this reason, one should adjust s0 in motif searching by similarity.


View this table:
[in this window]
[in a new window]
 
Table 3 Top six seeds for s0 = 20 in the set with two motifs

 
3.3 Repeating motifs
The last set of five sequences involves subunit repeats of the heterodimeric protein isoprenyl transferases. The average length of the sequences is 380. It was suggested that the subunit repeat consists of three motifs of width 12. A large number of weak patterns cover up to 80% of the sequence length. One needs to determine three sets of positions, where i {1,2,3} is the motif index, k {1,2,3,4,5} is the sequence index and j the repeat index.

Taking BLOSUM62 as the substitution matrix and using each sliding window of width 12 as a seed, we search for the most similar segments to the seed in all sequences other than that which the seed is in. At pair threshold s0 = 20, there are a large number of seeds whose sum score is greater than 4x 30 = 120. A member of a block formed by a seed may have some new neighbors not belonging to the block. This might be regarded as an indication of the existence of subunit repeats. Many seeds are contiguous to each other, which implies close subunits, or means that a wider motif should be considered. However, no seeds are found coincident in position with the first motif of Lawrence et al. (1993). There is a phase shift of about 6. In order to discover subunit repeats, we next remove the restriction that at most a single neighbor in a sequence is allowed, and that no neighbor is searched for in the sequence where the given seed is in. That is, setting threshold s0, we search for all neighbors of a seed in the whole set. The overlap of neighbors may be either allowed or not allowed. At s0 = 20, a seed has at most 13 neighbors. A lower threshold s0 = 18 allows a seed to have up to 16 neighbors. The seeds with at least 12 non-overlapping neighbors are listed in Table 4. We group them into four groups via just their neighbors in Sequence 1, except for two seeds for which their neighbors (marked with ‘*’ in the table) in Sequence 2 have to be referred. For example, group I consists of four seeds. Three of them have a neighbor right at or close to a1 = 278. Although the fourth seed a3 = 257 has no neighbors which are close to any neighbors of the first three seeds, it does have a neighbor a2 = 175 near a2 = 173, which is a neighbor of the third seed. Thus, these four seeds form a group. In other words, any two member seeds in the same group must have at least a pair of close neighbors of each member. In fact, the first three groups correspond to the identified motifs, while the fourth (which may be marked as a1 = 236) overlaps with both the first and second motifs. Thus, we may focus on only the first three groups.


View this table:
[in this window]
[in a new window]
 
Table 4 Seeds are s0 = 18 in the set with repeated motifs

 
It is not easy to define the ‘best’ solution, even with the number and widths of motifs fixed. In this case, there is still an important parameter: the weight matrix score threshold w0, which is also related to the total number of motif repeats and the total score or objective function. For the solution given by Lawrence et al., 1993 the lowest motif score is found at , and is only 4.1. Removing this site results in a higher total score, although the total number of subunit repeats decreases from 80 to 79. It is not our purpose here to find the optimal solution for that specific set of sequences, but we do see ‘better’ solutions. For example, the solution shown in Table 5 with 80 repeats and w0 = 6.6 has a higher total score.


View this table:
[in this window]
[in a new window]
 
Table 5 A possible assignment of subunit repeats for the set of heterodimeric protein isoprenyl transferases

 
We now describe the procedure to find the above solution. We start with a seed. After finding a seed at s0 = 18 with a high total score or high number of neighbors, it and its neighbors form a block. A larger block can be found for the same seed at a reduced s0. Regarding a certain number, say 3, of residues flanking the block as the background, we construct the log-odds weight matrix from the block. By scanning sequences with the weight matrix at some w0 we can update the block and hence the weight matrix itself. Three seeds taken from different groups can generate three non-overlapping blocks. We estimate three weight matrices from the blocks, then a dynamic programming scan for non-overlapping motif locations is conducted at a certain w0. We may iterate the process several times until a convergence is reached. During the iteration, a changing w0 may be used. Since we only perform greedy updating, several combinations of seeds should be taken for running greedy searching.

An alternative is to determine the sum score for each seed at a low s0, and then keep the seeds whose sum score t is above some threshold t0. There can be many overlaps among the seeds. With t taken as score, a non-overlapping assignment of subunits can be obtained by dynamic programming. This directly provides the combinations of nearby seeds.


    4 DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 SUBSTITUTION MATRIX AND...
 3 MOTIF SEARCH BY...
 4 DISCUSSION
 REFERENCES
 
Amino acid similarity described by substitution matrices is a part of our knowledge. This is not fully used in motif search methods with random initialization. (In some sense, MEME's Dirichlet mixture priors partly use such knowledge.) We have shown that two substitution matrices may be constructed for the weight matrix model, one equivalent to the weight matrix, and the other similar to BLOSUM. They work well in typical examples. The method is not very sensitive to the choice of the starting matrix. We have analyzed the set of double block motif starting with PAM250, and obtained a similar performance.

In motif search we use a not very high threshold for the pair comparison score, but the sum score provides another measure of statistical significance for multiple sequence comparison. The determination of proper thresholds is related to the estimation of statistical significance, which is a hard issue. We have given a rough estimate. Generally speaking, when the sequence lengths are short or the expected frequency of motif occurrence is high, a low threshold may be used. Indeed, similarity analysis reveals many aspects of motifs in a sequence set.

In a motif search, the exact motif pattern, i.e. motif positions, is rather sensitive to model parameters. However, the strong segments with high scores are robust to parameter changes. We have mentioned only the weight matrix model for motifs. The similarity analysis may be applied to more general models as well, e.g. models including gaps.

Although we have only discussed protein sequences, the same idea works also for DNA sequences, for which a BLOSUM counterpart may be constructed. For example, we may extract a substitution matrix from some datasets of transcription factor binding sites, and then use it in searching for new types of binding sites.


    Acknowledgments
 
This work was supported in part by the Special Funds for Major National Basic Research Projects and the National Natural Science Foundation of China.

Received on July 15, 2004; revised on October 1, 2004; accepted on October 4, 2004

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 SUBSTITUTION MATRIX AND...
 3 MOTIF SEARCH BY...
 4 DISCUSSION
 REFERENCES
 

    Bailey, T. and Elkan, C. (1995) Unsupervised learning of multiple motifs in biopolymers using expectation maximization. Machine Learning, 21, 51–80.

    Boguski, M.S., Ostell, J., States, D.J. In Ries, A.R., Sternberg, M.J.E., Wetzel, R. (Eds.). Protein engineering: a practical approach, (1992) , Oxford IRL Press, pp. 57–88.

    Conlon, E.M., Liu, X.S., Lieb, J.D., Liu, J.S. (2003) Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci., USA, 100, 3339–3344[Abstract/Free Full Text].

    Eddy, S. (1995) Multiple alignment using hidden Markov models. Proceedings of the International Conference on Intelligent Systems for Molecular BiologyJuly 16–19Cambridge, UK , Cambridge AAAI/MIT Press, pp. 114–120.

    Hastings, W.K. (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57, 97–109[Abstract/Free Full Text].

    Helden, J.V., Andre, B., Collado-Vides, J. (1998) Extracting regulatory sites from the upstream region of yeast genes by computational analysis of oligonucleotide frequencies. J. Mol. Biol., 281, 827–842[CrossRef][ISI][Medline].

    Henikoff, S. and Henikoff, J.G. (1992) Amino acid substitution matrices from protein blocks. Proc. Natl Acad. Sci., USA, 89, 10915–10919[Abstract/Free Full Text].

    Hertz, G., Hartzell, G., III, Stormo, G. (1990) Identification of consensus patterns in unaligned dna sequences known to be functionally related. Comput. Appl. Biosci., 6, 81–92[Abstract/Free Full Text].

    Hughey, R. and Krogh, A. (1996) Hidden Markov models for sequence analysis: extension and analysis of the basic method. Comput. Appl. Biosci., 12, 95–107[Abstract/Free Full Text].

    Lawrence, C.E., Altshul, S., Boguski, M., Liu, J., Neuwald, A., Wootton, J. (1993) Detecting subtle sequence signals: a Gibbs sampling strategy for multiple alignments. Science, 262, 208–214[Abstract/Free Full Text].

    Thompson, W., Rouchka, E.C., Lawrence, C.E. (2003) Gibbs recursive sampler: finding transcription factor binding sites. Nucleic Acids Res., 31, 3580–3585[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/7/938    most recent
bti090v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Zheng, W.-M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Zheng, W.-M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?