Skip Navigation


Bioinformatics Advance Access originally published online on January 25, 2005
Bioinformatics 2005 21(9):1838-1845; doi:10.1093/bioinformatics/bti286
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/9/1838    most recent
bti286v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Li, H.-L.
Right arrow Articles by Fu, C.-J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Li, H.-L.
Right arrow Articles by Fu, C.-J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

A linear programming approach for identifying a consensus sequence on DNA sequences

Han-Lin Li * and Chang-Jui Fu

Institute of Information Management, National Chiao Tung University Taiwan, China

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 PROPOSED METHOD
 IMPLEMENTATION
 EXTEND TO FIND UNKNOWN...
 DISCUSSION
 REFERENCES
 

Motivation: Maximum-likelihood methods for solving the consensus sequence identification (CSI) problem on DNA sequences may only find a local optimum rather than the global optimum. Additionally, such methods do not allow logical constraints to be imposed on their models. This study develops a linear programming technique to solve CSI problems by finding an optimum consensus sequence. This method is computationally more efficient and is guaranteed to reach the global optimum. The developed method can also be extended to treat more complicated CSI problems with ambiguous conserved patterns.

Results: A CSI problem is first formulated as a non-linear mixed 0-1 optimization program, which is then converted into a linear mixed 0-1 program. The proposed method provides the following advantages over maximum-likelihood methods: (1) It is guaranteed to find the global optimum. (2) It can embed various logical constraints into the corresponding model. (3) It is applicable to problems with many long sequences. (4) It can find the second and the third best solutions. An extension of the proposed linear mixed 0-1 program is also designed to solve CSI problems with an unknown spacer length between conserved regions. Two examples of searching for CRP-binding sites and for FNR-binding sites in the Escherichia coli genome are used to illustrate and test the proposed method.

Availability: A software package, Global Site Seer for the Microsoft Windows operating system is available by http://www.iim.nctu.edu.tw/~cjfu/gss.htm

Contact: hlli{at}cc.nctu.edu.tw


    INTRODUCTION
 TOP
 Abstract
 INTRODUCTION
 PROPOSED METHOD
 IMPLEMENTATION
 EXTEND TO FIND UNKNOWN...
 DISCUSSION
 REFERENCES
 
The methods for determining a consensus pattern can be split into two parts. The first part is the model for describing the shared pattern; the second part is the algorithm for identifying the optimal common site according to its shared pattern. This study belongs to the second part. A consensus sequence identification (CSI) problem is, given a set of sequences known to contain binding sites for a common factor but not knowing where the sites are, discovers the location of the sites in each sequence (Stormo, 2000).

The CSI problem is critical in research on gene expression such as the protein-binding site in a DNA strand. For the last decade several good methods have been developed for solving such problems (Brazma et al., 1998). Of these, the maximum-likelihood approach (Stormo and Hartzell, 1989; Hertz et al., 1990) is the best known. The traditional maximum-likelihood approach, which measures information content to determine alignments, works fairly well and can be relied upon to discover the common sites. However, they are still not able to determine the complete set of regulatory interactions for complicated promoters typical of metazoans (Stormo, 2000).

Recently, Ecker et al. (2002) utilized optimization techniques to reformulate the maximum-likelihood approach for solving CSI problems. They adopted a probabilistic model and formulated a well-designed non-linear model with reference to the expectation maximization algorithm of Lawrence and Reilly (1990). Their method, however, occasionally only finds a feasible solution or a local optimum: which means the best solution may not be found. Additionally, no further structural feature in a CSI problem can be embedded conveniently in their model.

This study proposes a linear programming method for solving a CSI problem to reach the globally optimal consensus sequence. Two examples of searching for CRP-binding sites and for FNR-binding sites in the Escherichia coli genome are used to illustrate the proposed method. The CSI problem is first formulated as a non-linear mixed 0-1 program for alignment of DNA sequences; each of the four bases are coded with two binary variables and a matching score is designed. This non-linear mixed 0-1 program is then converted into a linear mixed 0-1 program by linearization techniques. This study decomposes a CSI problem into several subprograms to be solved by a set of distributed computers linked via the internet. Owing to some special features of the binary relationships, this linear 0-1 program includes 2m binary variables where m is the number of active letters in the common site. Some very attractive properties of this method are:

  1. The required number of binary variables is independent of the number of sequences and the size of each sequence. That means, the proposed method is computationally efficient in solving a CSI problem with a large data size.
  2. The proposed method is guaranteed to find the global optimum instead of a local optimum.
  3. Many kinds of specific features accompanied with a CSI problem can be formulated straightforwardly as logical constraints and embedded into the linear program.

An example of searching CRP-binding sites, as discussed in Stormo and Hartzell (1989) and Ecker et al. (2002) is described as follows. Given 18 letter sequences, each 105 positions long, where each position contains a letter from the set {A, T, C, G}, find a common site of length 16 with the pattern

where Li, {square} {A, T, C, G} and {square}s mean the positions of ignored letters.

Restated, the problem is to specify

  1. the Lis of the common site pattern and
  2. the location of the site in each given sequence, which can fit most closely the common site.

The following are difficulties associated with the method of Ecker et al. (2002) and other maximum-likelihood methods (as reviewed in Brazma et al., 1998) for solving a CSI problem:

(1) Only a local optimal or feasible solution is obtained Since Ecker et al. (2002) formulated a CSI problem as a non-convex non-linear program, their method may only find local optima, as has been acknowledged. Other maximum-likelihood methods, which intend to maximize the probability of binding to the promoters in the sequences, may only find a feasible solution instead of finding a local optimal solution. It is not guaranteed that current maximum-likelihood methods can reach the global optimum for general CSI problems.

(2) Heavy computational burden. The non-linear program in Ecker et al. (2002) contains too many non-linear terms. The heavy computational burden in their method prohibits it from treating a CSI problem with a large number of sequences.

(3) Difficulty of adding logical constraints. When identifying protein binding sites, there usually exists some specific features to be considered as logical constraints. For example, the letters of position Li and L11–i are expected to be complementary (i.e. G–C and A–T). Formulating such a constraint in maximum-likelihood approaches is a complex task. It is even impossible to formulate even more complicated logical constraints (e.g. those with some ambiguity) when applying these approaches.

(4) Fixed number of ignored letters. Maximum-likelihood methods are mainly used to solve CSI problems with fixed number of ignored letters (e.g. six in the above example). However, in the real world this number is unknown and needs to be found by some preliminary processes.

(5) Difficulty of finding the second and the third best solutions. Since current methods may only find a local optimum, it is hard to find other solutions next to the best solution.

In order to overcome the above difficulties of solving a CSI problem, this study proposes a novel method to treat the same problem that molecular biologists actually are interested in solving. We formulate a CSI problem as the identification of a consensus sequence that minimizes the number of differences between the proposed sites. Our basic concept is to reformulate a CSI problem as a mixed 0-1 linear program which only contains a limited number of 0-1 variables and most variables are continuous. Such a mixed 0-1 linear program can be solved effectively by commonly used branching-and-bound algorithms or a branch-cut algorithm (Balas et al., 1996). The advantages of the proposed method are listed below:

  1. It is guaranteed to find the globally optimal solution. Since the objective function and constraints are all linear, the program should converge to the global optimum.
  2. It can effectively solve a CSI problem by a set of on-line computers as illustrated by our numerical experiments.
  3. It is convenient to add logical constraints. Since the binary variables are suited to express logical relationship, various complicated constraints can be embedded directly into the proposed method.
  4. It can be extended to treat CSI problems with unknown number of ignored letters.
  5. It is very straightforward to find the complete set of the second, third, etc. best consensus sequences.

In the following section we will discuss the linear programming technique for solving a CSI problem.


    PROPOSED METHOD
 TOP
 Abstract
 INTRODUCTION
 PROPOSED METHOD
 IMPLEMENTATION
 EXTEND TO FIND UNKNOWN...
 DISCUSSION
 REFERENCES
 
This study first formulates a CSI problem as a non-linear mixed 0-1 program. Then it converts this non-linear mixed 0-1 program into a linear mixed 0-1 program using linearization techniques. To reduce the computational burden, many 0-1 variables in this linear mixed 0-1 program can actually be solved as continuous variables by an all or nothing assignment technique which greatly improves the computational efficiency of this program.

Non-linear mixed 0-1 program
Here we use the example data in Stormo and Hartzell (1989) as listed in Appendix section, to describe the proposed method. First, represent the data in Appendix section as an 18*105 data matrix D:

(1)
where bl,p is the letter in the position p of the sequence l.

Recall the example discussed in the previous section. The common site we want to find has 16 positions (10 Lis and 6 ignored letters), a sequence has 90 corresponding sites, so an 18 * 900 data matrix D' is generated from D'.

(2)
where

and s = 1,...,90 is the starting position of each candidate site.

For Li {A, T, C, G}, two binary variables ui and vi can be used to express Li, an element of the consensus sequence, as shown in Table 1.


View this table:
[in this window]
[in a new window]
 
Table 1 Base code in the determined common site

 
Table 1 indicates that if Li is A, T, C or G, then ai = 1, ti = 1, ci = 1 or gi = 1, which implies the following conditions:

(3)

Now let Scorel be the degree of fitting to the common site found, specified as

(4)
where is the element of candidate sites extracted from D'. The constraints associated with Equation (4) are:

(5)

(6)

Clearly, 0 ≤ Scorel ≤ 10. And the objective is to maximize the total sum of Scorel.

Consider the sample data in Figure 1 for instance.

(7)

(8)



View larger version (31K):
[in this window]
[in a new window]
 
Fig. 1 A small example of finding consensus sequence: (a) two sequences to be compared; (b) schematic representation of the candidate sites; (c) the associated D' matrix.

 
All zl,s in Equation (4) are binary variables. Equation (5) implies that for a sequence l, only one site is chosen and no other sites contribute to Scorel. Suppose the k-th site is selected, then zl,k = 1 and zl,s = 0 for all s {1,2,...,90}, s != k. Since a huge amount of zl,s (i.e. |l|*|s|) are involved, to treat zl,s as binary variables would cause a heavy computational burden. Therefore zl,s should be resolved as continuous variables rather than binary variables. An important proposition is introduced below:

PROPOSITION 1
(All or nothing assignment). Let zl,s ≥ 0 be continuous variables instead of binary variables. If there is a k, k {1,2,...,90}, such that , then assigning zl,k = 1 and zl,s = 0 for all s != k, s {1,2,...,90}, can maximize the value of Scorel.

PROOF
Since {sum}s zl,s = 1 and zl,s ≥ 0, it is true that

REMARK
The objective function of a CSI problem f(x) can be rewritten as

(9)
where , , and for i = 1,2,...,10.

This result implies that SAi (or STi, SCi,SGi) is a set composed of (l, s) in which the product term zl,s ai (or zl,s ti, zl,s ci, zl,s gi) appears on the right hand side of Equation (4) because that .

For instance, the sum of Score1 and Score2 in Equations (7) and (8) becomes

(10)

Some logical constraints can be conveniently expressed by binary variables. For instance, the constraint that a CRP dimer binds a symmetrical site requires that

Such a logical structure can be formulated conveniently as the following constraints.

(11)

With reference to Table 1, clearly if Li = A (i.e. ui = 0 and vi = 0) then L11–i = T (i.e. u11–i = 1 and v11–i = 1) and vice versa; if Li = C (i.e. ui = 0 and vi = 1) then L11–i = G (i.e. u11–i = 1 and v11–i = 0) and vice versa. A CSI problem can then be formulated as a non-linear mixed 0-1 program below based on these constraints:

Program 1 (Non-linear 0-1 CSI program)

(12)
subject to , zl,s ≥ 0 for all l,s



This program intends to solve {ai, ti, ci, gi} for i = 1,2,...,10 thus to maximize the total degree of fitting to the common site for the given 18 sequences, subjected to a possible logical constraint. A very important feature of Program 1 is that we can treat zl,s as continuous variables rather than binary variables, which can improve the computational efficiency dramatically. We can ensure all found zl,s will still have binary values as discussed in the next section.

Linearization of Program 1
Program 1 is a mixed non-linear 0-1 program where qi {sum}zl,s for qi {ai, ti, gi, ci} and uivi are product terms. These product terms can be linearized directly by the following propositions:

PROPOSITION 2
The product term {lambda}i = qi {sum}zl,s where {lambda}i is to be maximized and qi {0,1} can be linearized as follows:

(13)
where M is a big constant ≥ the number of sequences.

PROOF
If qi = 1 then {lambda} i = {sum}zl,s; and otherwise {lambda}i = 0.

PROPOSITION 3
The product term wi = ui vi where ui, vi {0,1} can be linearized as follows:

(14)
Denote Z(ai ) = ai {sum}(l,s)SAi zl,s, Z(ti) = ti {sum}(l,s)STi zl,s, Z(ci ) = ci {sum}(l,s)SCi zl,s and Z(gi) = gi {sum}(l,s)SGi zl,s. Program 1 is then linearized into Program 2 based on Propositions 2 and 3.

Program 2 (Linear mixed 0-1 CSI program)

(15)
subject to , zl,s ≥ 0 for all l,s



zl,s's are treated as non-negative continuous variables for l = 1,2,...,18 and s = 1,2,...,90 where M can be any value greater than or equal to 18.

In Program 2, since ui and vi are binary variables, ai, ti, ci and gi should have binary values following Equation (3). Although zl,s are treated as continuous variables, the values of zl,s should be 0 or 1. This is because the optimal solution of a linear program should be a vertex point satisfying {sum}s zl,s = 1 for all l.

Consider the following proposition.

>PROPOSITION 4
Let the optimal solution of Program 2 be x* = (Z*,u*, v*) and {sum}s zl,s = 1. Assume that a sequence l contains sites s1,s2,...,sk such that for j =1,2,...,k, then,

where are specified in Equation (6).

PROOF
For {sum}s zl,s = 1, if sp,sq {s1,s2,...,sk} where , then to maximize requires zl,sq = 0. This conflicts with the observation that 0 < zl,sq < 1, therefore.

After solving Program 2 we can obtain the globally optimum solution TGTGA{square}{square}{square}{square}{square}{square}TCACA with objective value 147. The related non-zero zl,s values indicate the starting positions of the binding sites in the 18 sequences, as listed below:

All other zl,ss have value 0.

In Program 2 the total number of 0-1 variables is 2m and the total number of the continuous variables is 20m+|l|*|s|. Since the number of 0-1 variables is independent of the lengths of l and s, a CSI problem with many long sequences can be solved effectively.

Suboptimal common sites
Program 2 can find the exact global optimum solution. Sometimes the second best and the third best solutions may also be useful. It is very convenient for the proposed method to find a complete set of common sites by adding some extra constraints. For instance, the second best solution of Program 2 can be obtained conveniently by solving the following program:

(16)
subject to

  1. The same constraints in Model 1
  2. t1 + g2 + t3 + g4 + a5 + t6 + c7 + a8 + c9 + a10 ≤ 9 (new constraint)

The new constraint is used to force the program to find a new solution different from the solution of Program 2. The second best common site found is TTTGA{square}{square}{square}{square}{square}{square}TCAAA with score 129. Similarly we can find another solution by adding following constraint into Equation (16).

The third best common site found is AAATT{square}{square}{square}{square}{square}{square}AATTT with score 129.


    IMPLEMENTATION
 TOP
 Abstract
 INTRODUCTION
 PROPOSED METHOD
 IMPLEMENTATION
 EXTEND TO FIND UNKNOWN...
 DISCUSSION
 REFERENCES
 
Several experiments are tested here, using the example in the Appendix section, to analyze the effect of sequence length and number of sequences on the computational time. All examples are solved by LINGO (Schrage, 1999), a widely used optimization software, on a personal computer with a Pentium 4 2.0G CPU. A software package named Global Site Seer is developed, based on Program 2 for solving CSI problems. This software is available from http://www.iim.nctu.edu.tw/~cjfu/gss.htm

Figure 2 illustrates the experimental results for analyzing the time complexity. Figure 2a is the computational time given various sequence lengths, where the number of sequences is fixed at 18. The results show that the computational time changes slightly even if the sequence length is increased from 105 to 1050. Figure 2b is the computational time with various numbers of sequences. It shows that the solving time is roughly proportional to the number of sequences. The proposed model is quite promising for treating CSI problems with large sequence length and a large number of sequence numbers. Figure 2c shows that the computational time rises exponentially as the number of independent positions increases.



View larger version (29K):
[in this window]
[in a new window]
 
Fig. 2 The relationship between computational time and various factors involved in a CSI problem. This figure illustrates the computational time of solving Program 2 with (a) various sequences sizes, (b) various number of sequences and (c) various independent positions.

 
Time complexity and distributed computing
From the results of Figure 2 we know that the time complexity is roughly proportional to the number of sequences and is influenced slightly by the length of the sequences. However, the computational time rises exponentially as the number of independent positions increases. The worst case of time complexity of solving Program 2 on a single machine is estimated as O(|l|22m), where |l| is the number of sequences and m is the number of independent positions.

To treat CSI problems with more independent positions, a method of distributed computing is discussed in this section. Suppose there are n PCs available for solving Program 2, we can decompose Program 2 into n subprograms by specifying different values on some uis and vis. For instance, if n = 32 then the first subprogram can be formulated as

Subprogram 1

(17)
subject to

  1. The same constraints sets as in Program 2
  2. u1 = v1 = u2 = v2 = u3 = 0 (new constraints).

The new constraint (ii) is used to reduce the number of 0-1 variables from 10 to 5. Similarly, constraint (ii) for the second subprogram can be set as u1 = 1 and v1 = u2 = v2 = u3 = 0. Constraint (ii) for the 32nd subprogram could be u1 = v1 = u2 = v2 = u3 = 1. All these 32 subprograms are solved simultaneously. Such a distributed computation algorithm can enhance the computational efficiency greatly. The computational time of Program 2 can be estimated as follows:

(18)
where {alpha} and ß are parameters, m is the number of independent positions, n is the number of available PCs.

Figure 3 is the results of some experiments for solving Problem 2 with various m and n while |l| = 18. For the example of finding CRP-binding sites, the estimated {alpha} and ß values are {alpha} = 0.014 and ß = 0.621.



View larger version (47K):
[in this window]
[in a new window]
 
Fig. 3 Computational time of distributed computing with various m (independent positions) and n (number of available PCs).

 

    EXTEND TO FIND UNKNOWN BINDING SITES
 TOP
 Abstract
 INTRODUCTION
 PROPOSED METHOD
 IMPLEMENTATION
 EXTEND TO FIND UNKNOWN...
 DISCUSSION
 REFERENCES
 
A more complicated CSI problem is to search for the common site in an uncertain pattern format where the number of ignored letters between the two half sites is unknown. An example is to find a common site of length 2 * 5+k with the pattern

where k, the number of {square}s, is an unknown integer between 0 and 10.

Program 2 can be modified slightly to treat this type of extended CSI problems. First we expand D in (1) as D' below:

in which

where k {0,1,...,10}.

The cases with k > 10 are not considered since they are relatively rare. A linear mixed 0–1 program for solving this example is formulated below:

Program 3

(19)
subject to

  1. , zl,s,k ≥ 0 for all l,s,k
  2. for k {0,1,...,10}
  3. the same conservative and logical constraints in Program 2
  4. the same constraints for linearizing product terms in Program 2 but replace zl,s by zl,s,k.

Constraints (i) and (ii) are used to ensure that when a specific k is chosen {sum}s zl,s,k = 1 and {sum}s zl,s,k' = 0 for k' != k.

Using Program 3 to search CRP binding sites we obtain the globally optimal solution as TGTGA{square}{square}{square}{square}{square}{square}TCACA with score 147, which is exactly the solution found in Program 2. And the second best solution is GTGAA{square}{square}{square}{square}TTCAC with score 134. The relationship between the computational time and the number of possible ks (i.e. |k|) is linear, as shown in the experiment result listed in Figure 4. The number of ignored letter k is between 0 and , the upper bound of k, and thus we have |k| = +1 in this experiment.



View larger version (19K):
[in this window]
[in a new window]
 
Fig. 4 Computational time of Program 3 with various numbers of possible k's. The number enclosed in the common site is the solution of k.

 
Finding FNR-binding sites
Program 3 is also applied to solve an example of searching for binding sites of fumarate and nitrate reduction regulatory protein (FNR) in E.coli. Both CRP and FNR belong to the CRP/FNR helix-turn-helix transcription factor superfamily (Tan et al., 2001). The sequence data, which is taken from GenBank, contains 12 DNA sequences with lengths varying from 96 to 781. Owing to the dimer structure of the binding protein, the common site in this example also has a constraint of inverse symmetry. The RegulonDB database (Huerta et al., 1998) lists the regulatory binding sites found for 8 of these 12 sequences while the exact positions of the other 4 sequences are not listed yet. Solving this example by Program 3 we obtained the global optimal common site as TTGAT{square}{square}{square}{square}ATCAA with score 107, which is the same common site as indicated by Tan et al. (2001). Table 2 illustrates the result including the common site and the predicted binding sites for all of the 12 sequences. Some sites downstream of the transcription start (i.e. with positive indices) are also listed because there are a few known cases in which regulatory sites appear within transcription units (Tan et al., 2001). The proposed method has found some sites not listed in RegulonDB but having scores higher than those listed in RegulonDB (e.g. the third solution in the Operon ansB row of Table 2). The best predicted sites in the four undetermined sequences are also listed in Table 2.


View this table:
[in this window]
[in a new window]
 
Table 2 FNR binding sites found by Program 3

 

    DISCUSSION
 TOP
 Abstract
 INTRODUCTION
 PROPOSED METHOD
 IMPLEMENTATION
 EXTEND TO FIND UNKNOWN...
 DISCUSSION
 REFERENCES
 
This study proposes a linear mixed 0-1 programming approach for solving CSI problems. Compared to the widely used maximum-likelihood methods, the proposed method can reach a global optimum rather than finding a local optimum or a feasible solution. Additionally, by utilizing binary variables some logical constraints can be embedded into the models. It is also convenient to find the complete set of the second, third, etc. best common sites. Since the number of binary variables is fully independent of the number of sequences and the length of a sequence, the proposed method can treat a large CSI problem with many long sequences. For treating a CSI problem with many independent positions in an acceptable time, this study also proposes a method for distributed computing.

The proposed method can also be conveniently extended to treat more complicated CSI problems. In this study an extension of the linear program is designed to solve CSI problems with an unknown number of ignored letters between the two half sites. The results of searching for FNR-binding sites show that the extended model can not only find the locations of known binding sites listed in the RegulonDB database but also those not yet delimitated.

There are two issues remaining for further study. The first is to extend this method to treat various practical CSI problems. The second is to develop a more refined distributed algorithm to solve some CSI problems by numerous computers via the internet.


    Acknowledgments
 
The authors express deep appreciation for the referees’ valuable comments on this paper. On the basis of these comments, this paper has been substantially improved.

Received on August 6, 2003; revised on January 11, 2005; accepted on January 18, 2005

    REFERENCES
 TOP
 Abstract
 INTRODUCTION
 PROPOSED METHOD
 IMPLEMENTATION
 EXTEND TO FIND UNKNOWN...
 DISCUSSION
 REFERENCES
 

    Balas, E., et al. (1996) Mixed 0-1 programming by lift-and-project in a branch-and-cut framework. Manage Sci., 42, 1229–1246.

    Brazma, A., et al. (1998) Approaches to the automatic discovery of patterns in biosequences. J. Comput. Biol., 5, 279–305[ISI][Medline].

    Ecker, J.G., et al. (2002) An application of nonlinear optimization in molecular biology. Eur. J. Oper. Res., 138, 452–458[CrossRef].

    Hertz, G.Z., et al. (1990) Identification of consensus patterns in unaligned DNA sequences known to be functionally related. Comput. Appl. Biosci., 6, 81–92[Abstract/Free Full Text].

    Huerta, A.M., et al. (1998) RegulonDB: a database on transcriptional regulation in Escherichia coli. Nucl. Acids Res., 26, 55–59[Abstract/Free Full Text].

    Lawrence, C.E. and Reilly, A.A. (1990) An expectation maximization (EM) algorithm for the identification and characterization of common sites in unaligned biopolymer sequences. Proteins, 7, 41–51[CrossRef][ISI][Medline].

    Schrage, L. Optimization Modeling with Lingo, (1999) , Chicago, IL LINDO Systems Inc.

    Stormo, G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 16–23[Abstract/Free Full Text].

    Stormo, G.D. and Hartzell, G.W. (1989) Identifying protein-binding sites from unaligned DNA fragments. Proc. Natl Acad. Sci. USA, 86, 1183–1187[Abstract/Free Full Text].

    Tan, K., et al. (2001) A comparative genomics approach to prediction of new members of regulons. Genome Res., 11, 566–584[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
P. Larranaga, B. Calvo, R. Santana, C. Bielza, J. Galdiano, I. Inza, J. A. Lozano, R. Armananzas, G. Santafe, A. Perez, et al.
Machine learning in bioinformatics
Brief Bioinform, March 1, 2006; 7(1): 86 - 112.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/9/1838    most recent
bti286v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Li, H.-L.
Right arrow Articles by Fu, C.-J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Li, H.-L.
Right arrow Articles by Fu, C.-J.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?