Bioinformatics Advance Access originally published online on October 28, 2004
Bioinformatics 2005 21(7):1062-1068; doi:10.1093/bioinformatics/bti094

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Extracting relations between promoter sequences and their strengths from microarray data

Hisanori Kiryu 1,2,*, Taku Oshima 1 and Kiyoshi Asai 1,2,3

1Graduate School of Information Sciences, Nara Institute of Science and Technology 8916-5 Takayama-cho, Ikoma, Nara 630-0192, Japan
2Computational Biology Research Center, The National Institute of Advanced Industrial Science and Technology Aomi Frontier Building 2-43 Aomi, 17F, Koto-ku, Tokyo 135-0064, Japan
3Department of Computational Biology, Faculty of Frontier Science, The University of Tokyo 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8561, Japan

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 SYSTEMS AND METHODS
 3 RESULTS
 4 CONCLUSION
 REFERENCES
 

Motivation: The relations between promoter sequences and their strengths were extensively studied in the 1980s. Although these studies uncovered strong sequence-strength correlations, the cost of their elaborate experimental methods has been too high for them to be applied to a large number of promoters. In contrast, the recent increase in microarray data allows us to compare the expression of thousands of genes with their DNA sequences.

Results: We studied the relations between the promoter sequences and their strengths using the Escherichia coli microarray data. We modeled those relations using a simple weight matrix, which was optimized with a novel support vector regression method. It was observed that several non-consensus bases in the ‘–35’ and ‘–10’ regions of promoter sequences act positively on the promoter strength and that certain consensus bases have a minor effect on the strength. We analyzed outliers for which the observed gene expressions deviate from the promoter strength predictions, and identified several genes with enhanced expressions due to multiple promoters and genes under strong regulation by transcription factors. Our method is applicable to other procaryotes for which both the promoter sequences and the microarray data are available.

Contact: hisano-k{at}is.aist-nara.ac.jp


    1 INTRODUCTION
The relations between promoter sequences and their strengths were extensively studied in the 1980s (Mulligan et al., 1984; Mulligan et al., 1985; Mulligan and McClure, 1986; Kobayashi et al., 1990; Szoke et al., 1987; Ayers et al., 1989; O'Neill, 1989; Stefano and Gralla, 1982; Youderian et al., 1982; Gardella et al., 1989; Burr et al., 2000; Strohl, 1992; Kumar et al., 1993; Straney et al., 1994). Several studies used Escherichia coli promoters corresponding to the σ70 subunit of RNA polymerase. In this case, a promoter sequence comprises two separately conserved regions, called the ‘–35 region’ and the ‘–10 region’. Their consensus sequences are ‘TTGACA’ and ‘TATAAT’, respectively, and they are separated by approximately 17 base pairs (bp) (Siebenlist et al., 1980; Strohl, 1992; Hawley and McClure, 1983; Harley and Reynolds, 1987; Lisser and Margalit, 2000; Mulligan et al., 1985). These studies uncovered strong sequence-strength correlations. Promoters closer to the consensus sequence have stronger activities. Promoters with a spacer of length 17 are more active than promoters with other spacer lengths. Matching the consensus sequence in the –10 region is much more important for the promoter activity than matching it in the –35 region.

However, since most of their methods were based on base-wise mutations to a specific promoter, and experimental conditions varied between different experiments, it was difficult to analyze their results statistically. Moreover, the cost of their elaborate methods was too high to be applied to various organisms.

In contrast, the recent increase in microarray data provides opportunities to compare the strengths of hundreds of promoters under the same experimental conditions for various organisms. In this paper, we propose a statistical learning method to extract the promoter sequence-strength relations from the microarray data and apply our method to E.coli σ70 promoters.


    2 SYSTEMS AND METHODS
2.1 The model
Weight matrix models of protein–DNA binding are useful in recognizing the protein-binding sites in a given DNA fragment (Stormo, 2000). These models assign partial energies to each base at each position in a putative binding site and define the protein–DNA binding energy as the sum of them. Subsequently, the sites with high binding energies are considered as candidates for the protein-binding sites.

We apply the weight matrix method to recognize the promoter sequences, which are known as the binding sites of the σ70 subunit of RNA polymerase. The σ70-promoter binding energy E is defined by

\[ E = \sum_{k=1}^{12} \sum_{i \in \{A,C,G,T\}} \varepsilon_{\mathrm{base}}(k,i)\, n_{\mathrm{base}}(k,i) + \sum_{l=15}^{19} \varepsilon_{\mathrm{spacer}}(l)\, n_{\mathrm{spacer}}(l) \qquad (1) \]

where εbase(k,i) represents the partial energy between base i and the DNA binding region of σ70 associated with the k-th promoter site. The values k = 1 to 6 and k = 7 to 12 correspond to the –35 and –10 regions, respectively. We also consider the spacer length contribution εspacer(l) to the σ70-promoter binding energy in order to compare the relative importance of the spacer length variations to the base binding energies. The εspacer(l) are interpreted as the stress energies of the σ70-promoter complex for different spacer lengths. For each promoter site with sequences ‘s1s2 ... s6’ and ‘s7 ... s12’ for the –35 and –10 regions, respectively, and spacer length n, we associate nbase(k,i) and nspacer(l) defined by

\[ n_{\mathrm{base}}(k,i) = \delta_{i,s_k}, \qquad n_{\mathrm{spacer}}(l) = \delta_{l,n} \]

where δi,j is the Kronecker delta. We limit the range of spacer lengths to between 15 and 19 bases, as most of the spacers fall within this range. In the following equations, we denote the right-hand side of Equation (1) as a dot product of 53-dimensional vectors:

\[ E = W \cdot N = \sum_{k=1}^{13} \sum_{j} w(k,j)\, n(k,j) \]

Here, k ranges from 1 to 13, and j takes the values j = A, C, G, T for k = 1 to 12 and j = 15 to 19 for k = 13. w(k, j) and n(k, j) are defined by

\[ w(k,j) = \begin{cases} \varepsilon_{\mathrm{base}}(k,j) & (k = 1, \ldots, 12) \\ \varepsilon_{\mathrm{spacer}}(j) & (k = 13) \end{cases} \qquad n(k,j) = \begin{cases} n_{\mathrm{base}}(k,j) & (k = 1, \ldots, 12) \\ n_{\mathrm{spacer}}(j) & (k = 13) \end{cases} \]

We call W a weight vector, N a feature vector, and the entire set of N vectors the feature space.
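The encoding described above can be illustrated with a minimal sketch. It assumes the component ordering given in the caption of Figure 4 (24 components for the –35 region, 24 for the –10 region, then five for the spacer lengths 15 to 19) and an A, C, G, T order within each position; the function names `feature_vector` and `binding_energy` are illustrative and do not come from the paper.

```python
BASES = "ACGT"                      # assumed ordering of bases within a position
SPACERS = (15, 16, 17, 18, 19)      # spacer lengths covered by the model
DIM = 4 * 12 + len(SPACERS)         # 53 components in total


def feature_vector(minus35, minus10, spacer):
    """Build the 53-dimensional 0/1 feature vector N for one promoter site."""
    if len(minus35) != 6 or len(minus10) != 6:
        raise ValueError("both hexamers must have exactly six bases")
    if spacer not in SPACERS:
        raise ValueError("spacer length must be between 15 and 19")
    n = [0.0] * DIM
    # Positions k = 1..12: one component per (position, base) pair.
    for k, base in enumerate(minus35 + minus10):
        n[4 * k + BASES.index(base)] = 1.0
    # Position k = 13: one component per spacer length.
    n[48 + SPACERS.index(spacer)] = 1.0
    return n


def binding_energy(w, n):
    """Dot product W . N, i.e. the sum of the selected weight components."""
    return sum(wi * ni for wi, ni in zip(w, n))


consensus = feature_vector("TTGACA", "TATAAT", 17)
print(len(consensus), sum(consensus))   # 53 components, 13 of them set
```

Because exactly one component is set per position, W · N simply picks out 13 entries of W and adds them, which is the sum in Equation (1).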

We impose the following constraints on the weight vector W in order to fix the base energy for each position k and the overall energy scale of W, which are not related to the sequence specificity of W,

(2)
These constraints reduce the number of parameters of W from 53 to 39.

The method employed to obtain the best estimate of W from the given protein-binding sites is described in Heumann et al. (1994). According to their method, the weight vector W = W0 for promoter recognition is, in our normalization convention, given by

(3)
where,

Here, for k = 1 to 12, fk(j) are the observed frequencies of base j at the k-th position, and for k = 13, fk(j) are the observed frequencies of spacers of length j. pk(j) represent the background base frequencies of the E.coli genome. Since our results are not sensitive to the detailed differences of pk(j), we simply choose pk(j) = 0.25 for k = 1 to 12 and pk(j) = 0.2 for k = 13. This formula is derived by approximating the E.coli genome as a random DNA sequence with base frequencies pk(j). The weight vector W0 is shown in Figure 4, with the base frequency data adapted from Hawley and McClure (1983).
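As a rough illustration of a frequency-based weight matrix, the sketch below computes plain log-odds weights log(fk(j)/pk(j)). This is an assumption about the general form only: it does not reproduce Equation (3) or the normalization of Equation (2), and the pseudo-frequency floor and the example frequencies are invented for illustration.

```python
import math


def log_odds_weights(freqs, background, floor=1e-3):
    """Return one dict per position mapping symbol -> log(f / p).

    freqs      : list of dicts, observed frequency of each symbol per position
    background : background frequency p (0.25 for bases, 0.2 for spacers)
    floor      : hypothetical lower bound that avoids log(0) for unseen symbols
    """
    return [
        {symbol: math.log(max(f, floor) / background) for symbol, f in f_k.items()}
        for f_k in freqs
    ]


# Invented frequencies for a single position, NOT the Hawley-McClure data.
toy = [{"A": 0.5, "C": 0.25, "G": 0.125, "T": 0.125}]
print(log_odds_weights(toy, 0.25)[0])
```

A symbol observed at exactly its background frequency receives weight zero, which matches the remark in the text that most feature vectors sit near zero energy.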



Fig. 4 Top: components of the weight vector W = W0 defined by Equation (3). The first 4 × 6 = 24 components correspond to the –35 region, the next 24 components correspond to the –10 region, and the last five components are related to the spacer length. Bottom: the trained weight vector W that represents the promoter sequence-strength relations.

 
The solid line in Figure 1 represents the energy distribution in the feature space with respect to W0, which is obtained by generating random feature vectors N and plotting the histogram of the energy E = W0 · N. Due to our normalization convention, most of the feature vectors are located around zero energy (Sengupta et al., 2002). As the energy E increases from zero, the number of feature vectors with energy E decreases rapidly. We also generated 10 000 random sequences of 60 bases and found the maximal binding energy among all possible binding sites for each sequence. The corresponding energy distribution is shown as the broken line with crosses in Figure 1. It is shifted toward higher energies compared with the energy distribution in the feature space.
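The scan over all possible binding sites of a sequence can be sketched as follows. The toy weights (1 for a consensus base, 0 otherwise, plus a small bonus for a 17-base spacer) are invented for illustration and are not W0; the function names are likewise hypothetical.

```python
import random

BASES = "ACGT"
SPACERS = (15, 16, 17, 18, 19)


def site_energy(w35, w10, w_spacer, seq, start, spacer):
    """Energy of the site whose -35 hexamer begins at index `start`."""
    minus35 = seq[start:start + 6]
    start10 = start + 6 + spacer
    minus10 = seq[start10:start10 + 6]
    energy = sum(w35[k][base] for k, base in enumerate(minus35))
    energy += sum(w10[k][base] for k, base in enumerate(minus10))
    return energy + w_spacer[spacer]


def best_site(w35, w10, w_spacer, seq):
    """Return (energy, start, spacer) of the highest-energy site, or None."""
    best = None
    for spacer in SPACERS:
        for start in range(len(seq) - (12 + spacer) + 1):
            energy = site_energy(w35, w10, w_spacer, seq, start, spacer)
            if best is None or energy > best[0]:
                best = (energy, start, spacer)
    return best


def random_sequence(length, rng):
    """Uniform random DNA sequence (equal base frequencies assumed)."""
    return "".join(rng.choice(BASES) for _ in range(length))


def indicator_weights(consensus):
    """Toy weights: 1.0 for the consensus base at each position, else 0.0."""
    return [{b: 1.0 if b == c else 0.0 for b in BASES} for c in consensus]


W35 = indicator_weights("TTGACA")
W10 = indicator_weights("TATAAT")
W_SPACER = {15: 0.0, 16: 0.0, 17: 0.5, 18: 0.0, 19: 0.0}

# A sequence with a planted consensus promoter is recovered by the scan.
planted = "C" * 10 + "TTGACA" + "C" * 17 + "TATAAT" + "C" * 10
print(best_site(W35, W10, W_SPACER, planted))

# Maximal energies for random 60-base sequences (100 here, 10 000 in the text).
rng = random.Random(0)
maxima = [best_site(W35, W10, W_SPACER, random_sequence(60, rng))[0]
          for _ in range(100)]
```

Taking the maximum over many candidate sites is what pushes the random-sequence distribution above the distribution of single random feature vectors.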



Fig. 1 Energy distributions with respect to W0. Solid line: energy distribution of random feature vectors. Broken line with crosses: distribution of maximal binding energy among all possible binding sites for a random sequence of 60 bp. Dotted line with white boxes: energy distribution of real promoter sequences in the E.coli genome.

 
Experimentally identified promoters were obtained from the RegulonDB database (Salgado et al., 2004). Among the 656 promoters associated with the σ70 factor of RNA polymerase, we obtained 114 promoters by excluding those annotated in the EcoCyc database (Karp et al., 2004) as belonging to a set of multiple promoters or as regulated by any transcription factor. We identified the exact positions of the –35 and –10 regions as the site with the highest energy with respect to W0. The corresponding energy histogram is shown as the dotted line with white boxes in Figure 1. The energies of true promoters are higher than those of random sequences: the mean energy is E = 1.79, and only 3.8% of the random sequences have an energy higher than 1.79.

We now describe the model for the promoter sequence-strength relations. We formulate it as the linear relation of promoter strength z to the binding energy E = W · N,

(4)
Here, the weight vector W is optimized so as to satisfy the above equation. The promoter strength z is essentially the logarithm of fluorescent intensity of gene expression downstream of the promoter. Its precise definition is described in the following section.

The rough ideas that lead to Equation (4) are as follows. It is reasonable to assume that promoter sequences affect their strengths via the binding energies of the σ70-promoter interactions. This binding energy is a part of the activation energy of the transcription reaction. We assume that the other part of the activation energy is irrelevant to the variety of promoter strengths. According to simple chemical kinetics, the logarithm of a chemical reaction rate depends linearly on the activation energy. If the primary mRNA degradation processes are insensitive to the mRNA sequence, then the abundance of mRNA is proportional to the production rate. Therefore, the logarithm of the fluorescent intensities is linearly related to the σ70-promoter binding energies.
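The chain of reasoning above can be written out as a short derivation. This is a sketch under stated assumptions, not the paper's own formulation: the symbols E_rest (the sequence-independent part of the activation energy), β (an effective inverse temperature), A (a rate prefactor), γ (a sequence-independent degradation rate), m (mRNA abundance) and I (fluorescent intensity) are introduced here for illustration, and the sign convention assumes that a larger binding energy E lowers the activation barrier.

```latex
\begin{align*}
E_a &= E_{\mathrm{rest}} - E
  && \text{(binding energy lowers the barrier)} \\
r &= A\, e^{-\beta E_a}
  \;\Rightarrow\; \log r = \log A - \beta E_{\mathrm{rest}} + \beta E
  && \text{(Arrhenius-type rate)} \\
\frac{dm}{dt} &= r - \gamma m = 0
  \;\Rightarrow\; m = \frac{r}{\gamma}
  && \text{(steady state)} \\
I &\propto m
  \;\Rightarrow\; \log I = \beta E + \mathrm{const}
  && \text{(linear in } E = W \cdot N\text{)}
\end{align*}
```

Every sequence-independent quantity is absorbed into the constant, so only the slope and offset of a linear relation between log intensity and W · N remain to be fitted.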

Of course, transcription is a notoriously complicated cellular process, including multiple steps of conformational transitions of large proteins. However, complicated models display worse performance than simple models when the data quality is low. The above arguments are made only to derive a model that is sufficiently simple to be applied to the statistical method, yet capable of expressing the promoter sequence-strength relations.

2.2 Microarray data
We obtained the E.coli microarray data from the KEGG database (Kanehisa et al., 2004; Mori et al., 2000) containing results of 48 gene depression experiments. For each experiment, the expression levels of all open reading frames of the E.coli genome are measured.

We treated ‘control’ and ‘target’ data of each gene depression experiment separately, as we were concerned with the absolute values of the gene expressions. Therefore, the number of gene expression profiles amounted to 96. For each profile, we subtracted the background intensities from the signal intensities. We shifted these values so that their median vanished. Next, we took the logarithm of the data, neglecting the negative-valued data. The histograms of the resulting data acquired similar bell-shaped forms, although their center positions varied irregularly among profiles. We standardized each profile in order to set the mean value and standard deviation to zero and unity, respectively. After these normalizations on each profile, we collected the values associated with each gene from the 96 profiles. We calculated their median as the representative value of intensity.
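The per-profile normalization steps listed above can be sketched as follows. The choices of natural logarithm and population standard deviation are assumptions (the logarithm base does not affect the result after standardization), discarded spots are represented by `None`, and the function names are illustrative.

```python
import math
import statistics


def normalize_profile(signal, background):
    """Normalize one expression profile; discarded spots become None."""
    # 1. Subtract the background intensity from the signal intensity.
    corrected = [s - b for s, b in zip(signal, background)]
    # 2. Shift the values so that their median vanishes.
    med = statistics.median(corrected)
    shifted = [c - med for c in corrected]
    # 3. Take logarithms, discarding non-positive values.
    logged = [math.log(v) if v > 0 else None for v in shifted]
    kept = [v for v in logged if v is not None]
    if len(kept) < 2:
        raise ValueError("need at least two positive values to standardize")
    # 4. Standardize to zero mean and unit standard deviation.
    mean = statistics.mean(kept)
    sd = statistics.pstdev(kept)
    return [None if v is None else (v - mean) / sd for v in logged]


def representative_intensity(profiles, gene_index):
    """Median of a gene's normalized values over all profiles that kept it."""
    values = [p[gene_index] for p in profiles if p[gene_index] is not None]
    return statistics.median(values) if values else None


profile = normalize_profile([110, 120, 140, 180, 260], [10] * 5)
print(profile)
```

Note that shifting the median to zero before taking logarithms discards roughly the lower half of each profile, so a gene's representative intensity is generally based on fewer than 96 values.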

Figure 2 shows the scatter plot of median and absolute deviation of the normalized fluorescent intensities. The figure shows that genes with higher intensities tend to be expressed stably. The absolute deviation of the entire gene intensities is 0.87; therefore, variations of most genes with positive intensities are smaller than the width of the gene intensities.



Fig. 2 Scatter plot of the median and absolute deviation of fluorescent intensity for each gene (gray points). The white boxes are the dataset used as promoter strengths.

 
It may be suspected that these normalized fluorescent intensities do not faithfully represent mRNA concentrations in the cell, due to the sequence dependence of mRNA–cDNA hybridization efficiencies. To assess this issue, one of the authors (T. O.) examined a microarray experiment in which mRNA transcripts are competitively hybridized with DNA fragments obtained by cutting the E.coli genome with restriction enzymes. Since the concentrations of these genomic DNA fragments are constant across all genes in the genome, the intensities of mRNA transcripts relative to those of the DNA fragments should faithfully represent the gene expression strengths.

Figure 3 shows the scatter plot of the relative intensities of mRNA to those of the genomic DNA fragments against the normalized absolute intensities described previously. The two datasets correlate well, with a correlation coefficient of 0.70. This indicates that the absolute values of microarray expression correlate strongly with the mRNA concentrations in the cell. Unfortunately, the method of competitive hybridization with genomic DNA fragments has certain limitations: the estimates of the strengths of weakly expressed genes are inaccurate because of the hybridization of reverse cDNA strands with mRNA transcripts. The figure shows only the 697 strongly expressed genes that have non-vanishing relative intensities. Hereafter, we identify the normalized intensities with the absolute values of gene expression strengths.



Fig. 3 Scatter plot of the relative mRNA intensities to the reference DNA intensities and the normalized absolute intensities. The number of data points is 697 and the correlation coefficient is 0.70. Both scales are standardized.

 
We now relate these expression strengths to the promoter strengths. For each promoter sequence, the median of the expression strengths of the genes transcribed only by that promoter is taken as the promoter strength. The white boxes in Figure 2 show the 114 data points used as promoter strengths. We used these data to obtain the optimal weight vector W.

There are a number of factors that may alter the mRNA levels from the promoter strengths. Apart from unidentified transcription factors and overlapping transcription units, the mRNA attenuation, the sequence-specific mRNA instability and the operon length may modify the mRNA concentrations. Unfortunately, the annotations on these factors are still very limited in the database and we simply assume that the majority of mRNA levels in our dataset are not affected by these factors.

2.3 Support vector regression
In this section, we describe our regression method to train the weight vector W, which represents the promoter sequence-strength correlations. We use a novel kind of support vector regression, which yields a better correlation between the promoter strengths and W · N than more popular regression methods such as ε-SVR (Schölkopf and Smola, 2002). Similar regression problems were considered in the search for novel sequence motifs in eukaryotic genomes by Bussemaker et al. (2001) and Conlon et al. (2003). Note also that the relatively large number of optimization parameters prevents us from using the simplest least-squares regression, which has no mechanism for avoiding overfitting to the training data.

We express W as a sum of W0 and the residual W1. It is W1 that is actually optimized by our support vector machine algorithm, which is defined by

(5)
subject to

where zi represent the promoter strengths obtained in the previous section, zc represents a given threshold parameter, which we set to 0.39, and is the median of promoter strengths zi. Ci and yi are defined by


where, C is a positive real parameter, which determines the trade-off between the fit of the model to training data and the generalization error. np and nn represent the number of training data that satisfy zi ≥ zc, zi ≤ zc, respectively. z1i is defined by

where λ is a positive real parameter that determines the relative scale of the promoter strength to W0 and b0 is defined by the median of the set {(z W0 · N)i}.

The reason for the separation of W into W0 and W1 is as follows: since our regression method uses only the promoter sequences existing in the E.coli genome, it occurs that certain promoter positions do not have sufficient base variations to determine the corresponding components of the weight vector W. In such a case, ordinary support vector methods that do not include W0 tend to eliminate these components and to emphasize the components of less-conserved positions. The addition of W0 reduces the risk of neglecting the most-conserved positions such that the weight vector components that are not well-determined are set to the corresponding components of W0. Since W0 is essentially the logarithm of base frequencies, this can be considered as the inclusion of prior knowledge obtained in the 1980s that the more conserved bases act positively on the promoter strengths.

We next describe our regression method. Our regression method is based on the support vector classification (SVC) algorithm (Schölkopf and Smola, 2002). In ordinary support vector classification problems, one divides the training data into two classes and seeks the optimal linear function f(N) = W · N + b such that f(Ni) ≥ 1 for any Ni in one class and f(Ni) ≤ –1 for the other. Similarly, we divide the promoters into strong and weak promoters according to the threshold value zc, and we seek the linear function that satisfies f(Ni) ≥ zi for strong promoters and f(Ni) ≤ zi for weak ones. This problem can be solved with the same algorithms as used to solve ordinary SVC problems.
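The one-sided constraints described above can be illustrated with a simplified soft-constraint version. This is a sketch under several assumptions, not the optimization problem of Equation (5): it omits the prior W0, the parameter λ and the class-dependent Ci, uses a single penalty C, and replaces the interior point quadratic programming solver with plain subgradient descent. The function name and all parameter defaults are hypothetical.

```python
def fit_one_sided(features, strengths, z_c, C=10.0, lr=0.01, epochs=300):
    """Minimize 0.5*|w|^2 + C * sum_i max(0, y_i * (z_i - f(N_i))).

    y_i = +1 for strong promoters (z_i >= z_c): penalized when f(N_i) < z_i.
    y_i = -1 for weak promoters   (z_i <  z_c): penalized when f(N_i) > z_i.
    f(N) = w . N + b, and b is left unregularized.
    """
    dim = len(features[0])
    w = [0.0] * dim
    b = 0.0
    labels = [1.0 if z >= z_c else -1.0 for z in strengths]
    for _ in range(epochs):
        grad_w = list(w)              # gradient of the 0.5*|w|^2 term
        grad_b = 0.0
        for n, z, y in zip(features, strengths, labels):
            f = sum(wi * ni for wi, ni in zip(w, n)) + b
            if y * (z - f) > 0:       # one-sided constraint is violated
                for d in range(dim):
                    grad_w[d] -= C * y * n[d]
                grad_b -= C * y
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


# Toy data: one strong and one weak promoter with orthogonal features.
w, b = fit_one_sided([[1.0, 0.0], [0.0, 1.0]], [1.0, -1.0], z_c=0.0)
print(w, b)
```

The norm penalty pulls f(Ni) toward zero while each constraint holds it beyond zi on one side only, so with strengths centered near the threshold the fitted values settle close to the targets.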

We solved Equation (5) using the pr_loqo routine (http://www.kernel-machines.org/), which implements the dual-primal interior point method of quadratic programming and provides extremely accurate optimization results. For each value of the parameter λ, we performed a 10-fold cross-validation and determined the parameter C. The value of λ was determined such that the result yielded the best correlation between the strength z and the energy W · N. It was found that our regression method showed higher correlations (correlation coefficient 0.63) than the ε-SVR regression method (Schölkopf and Smola, 2002) (correlation coefficient 0.58).
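A generic 10-fold cross-validation scored by the Pearson correlation can be sketched as below. How the folds were formed and how the correlation was aggregated are not stated in the text, so the deterministic modulo split and the pooling of held-out predictions are assumptions; `fit` and `predict` are placeholders for any regression method.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def kfold_indices(n_items, k=10):
    """Yield (train, test) index lists; item i goes to test fold i % k."""
    for fold in range(k):
        test = [i for i in range(n_items) if i % k == fold]
        train = [i for i in range(n_items) if i % k != fold]
        yield train, test


def cross_validated_correlation(fit, predict, features, strengths, k=10):
    """Correlation between strengths and predictions made on held-out folds."""
    predicted = [None] * len(features)
    for train, test in kfold_indices(len(features), k):
        model = fit([features[i] for i in train], [strengths[i] for i in train])
        for i in test:
            predicted[i] = predict(model, features[i])
    return pearson(strengths, predicted)
```

A grid search would call `cross_validated_correlation` once per candidate pair of λ and C and keep the pair with the highest value.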

Because of the rather large number of optimization parameters relative to the available dataset, we were unable to set aside a large test set separate from the training dataset. However, for the optimal λ and C, the standard deviation of the trained weight vectors W across the validation folds was 10 times smaller than the scale of the components of W. Thus, we are confident that W is not overfitted to any particular training set.


    3 RESULTS
Figure 4 shows the weight vector W0 (top), which is defined in Equation (3), and is essentially the logarithm of base and spacer length frequencies. This figure shows that the conventional consensus sequence (TTGACA, TATAAT, 17 bp) is the collection of the most likely bases at each position and spacer length. Figure 4 also shows the weight vector W (bottom), which is obtained using our support vector regression method, and represents the relations between the promoter sequences and their strengths.

At most positions, the bases cytosine and guanine have inhibitory effects on the promoter activity, except for the positive guanine contributions at the positions k = 3 and 4. These positions are also the only positions at which thymine acts negatively on the strength. It may be noted that the non-consensus bases ‘A’ at k = 1, 2, 9 and ‘T’ at k = 6 make positive contributions to the strength comparable to those of most consensus bases, although the consensus promoter is among the strongest promoters. It may also be noted that no significant contributions are found from the consensus bases ‘C’ at k = 5 and ‘A’ at k = 6. The figure also shows large differences in effect among the bases that are inhibitory to the promoter strength. For example, cytosine at the position k = 9 has a far more adverse effect than guanine at the same position, despite the similar observed frequencies of these bases. The mean contributions of the –35 and –10 regions and the spacer length to the binding energy with respect to the obtained W are given by

(6)
In contrast, the contributions of these three elements to the strength variation of promoters in the E.coli genome are given by

(7)
where f>(f<) represents the mean of the feature vectors with strengths z greater (less) than the mean strength. Equations (6) and (7) imply that, while the –10 region and the spacer length are more important for discriminating σ70-DNA binding sites from random sequences, the base sequences in the –35 region affect the variety of genomic promoter strengths to a degree comparable to the –10 region.

Figure 5 shows the scatter plot of the promoter strength z and the binding energies W0 · N (top) and W · N (bottom). One can observe a better correlation in the z–W · N plot than in the z–W0 · N plot. The corresponding correlation coefficients are 0.63 and 0.40, respectively. We also numbered three outliers in the z–W · N plot, and listed the corresponding promoter–gene pairs and promoter sequences in Table 1.



Fig. 5 Top: scatter plot of promoter strength z and energy W0 · N. The correlation coefficient is 0.40. Bottom: scatter plot of promoter strength z and energy W · N. The correlation coefficient is 0.63. Three outliers are numbered and their corresponding promoter–gene names and sequences are shown in Table 1. Scales of z, W0 · N, and W · N are all standardized.

 

Table 1 Promoter–gene pairs and promoter sequences

 
The metYp2 promoter (numbered 1 in Fig. 5) is associated with the transcription unit including the gene rbfA. It is known (Nakamura and Mizusawa, 1985) that there is a relatively efficient ρ-independent terminator upstream of rbfA, which is consistent with the low expression of rbfA despite the almost complete coincidence of metYp2 with the consensus promoter. For the promoter–gene pair (fimBp1, fimB) (numbered 2), it is known (Schwan et al., 1994) that there is an uncharacterized protein-binding site upstream of fimBp1, which may explain the low expression level of fimB. For the pair (hscB, hscA) (numbered 3), another transcription unit including hscA is known (Seaton and Vickery, 1994). Although the corresponding σ factor is not annotated in the EcoCyc database, its activity may have a dominant effect on hscA expression, raising it above the predicted strength of the hscB promoter.

We now investigate the effects of multiple promoters and transcription factors on gene expression, which are among the various factors that cause gene expression strengths to deviate from the predicted promoter strengths. In Figure 6, we plotted the strengths of the genes transcribed by multiple transcription units and of the genes under transcriptional regulation, together with the data at the bottom of Figure 5. The crosses in Figure 6 show the expression strengths of multiply transcribed genes paired with the maximal predicted strength W · N among their promoters. We plotted only the stably expressed genes (sample size over 80 among the 96 microarray profiles and absolute deviation under 1.0) with no known regulatory sites. One can observe that the expressions of the genes pepD, nlpD and ompA (numbered 4, 5 and 6, respectively) are clearly enhanced by the multiple promoter effect.
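The selection and pairing rule described above can be sketched as follows. The thresholds (more than 80 samples, absolute deviation below 1.0) come from the text; the function names and the example numbers are invented for illustration.

```python
def is_stably_expressed(sample_size, abs_deviation,
                        min_samples=80, max_deviation=1.0):
    """Stability filter: enough profiles and a small absolute deviation."""
    return sample_size > min_samples and abs_deviation < max_deviation


def excess_over_best_promoter(gene_strength, predicted_strengths):
    """Observed strength minus the largest predicted W . N among the promoters.

    A clearly positive value suggests expression enhanced beyond what the
    strongest single promoter would predict.
    """
    return gene_strength - max(predicted_strengths)


# Invented example: a gene with two promoters predicted at 0.2 and 0.9.
if is_stably_expressed(sample_size=92, abs_deviation=0.4):
    print(excess_over_best_promoter(2.0, [0.2, 0.9]))
```

Pairing each gene with its strongest predicted promoter is a conservative choice: any remaining excess cannot be explained by that promoter alone.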



Fig. 6 Scatter plot of the strength z and the predicted promoter strength W · N in the case of multiple promoters (crosses) and single promoters with only activators (white boxes) and inhibitors (black boxes), as well as single, unregulated promoters (dots) which are the re-plot of data in Figure 5. Scales are the same as those of Figure 5. Several outliers are numbered for which the promoter and gene names are listed in Table 1.

 
The white and black boxes in Figure 6 show the regulated genes with a single promoter for which all the regulatory sites consist of activators and inhibitors, respectively. As in the case of multiple promoters, we plotted only the genes with stable expressions. We numbered several outliers for which the expression strengths are significantly different from the promoter strengths.

The cpdB promoter is known to be regulated by the cyclic AMP–cyclic AMP receptor protein complex (cAMP–CRP) (Liu and Beacham, 1990). Although only positive regulation by cAMP–CRP is described in Liu and Beacham (1990), the low expression level of cpdB may indicate an inhibitory effect of cAMP–CRP at this site, since cAMP–CRP is known to be a dual regulator (Kolb et al., 1999). There are constitutive activators MarA, Rob and SoxS for the promoter inaAp (Martin et al., 1999). Currently, no facts are available to explain the low basal expression level of inaA. For the promoter yihEp, there exists an activator CpxR (Danese and Silhavy, 1997; Pogliano et al., 1997). The expression level of rdoA, which is higher than the predicted strength, is consistent with the existence of the CpxR binding site. Only activators are known for the promoter sodBp, which contradicts the higher expression level of sodB than the predicted strength. However, Dubrac and Touati (2000) describe that the mRNA transcripts of sodB undergo post-transcriptional regulation by the regulator protein Fur, which enhances the expression of sodB 7-fold by protecting sodB mRNA from degradation. The same reference also describes that the effect of the activators is much smaller than that of Fur. Our result is consistent with these facts.

As can be seen from the figure, even when only activators (inhibitors) are annotated to the promoters, their expressions do not necessarily show higher (lower) mRNA levels than the predicted promoter strengths. This may occur if the activities of the annotated regulators are so weak that their effects are buried in the noise of the microarray data, or if there are still unidentified factors whose activities are opposite to the annotated ones. These discrepancies from the expectations might indicate that experiments frequently fail to identify all the components participating in the complicated regulatory activities.


    4 CONCLUSION
In this paper, we analyzed the relations of promoter sequences to their strengths. We presented a method to extract those relations from microarray data, using a novel kind of regression method. This analysis has been possible due to the availability of abundant microarray data.

It was observed that several non-consensus bases act positively on the promoter strength and that certain consensus bases have a minor effect on the strength. It was also found that certain bases with similar observed frequencies have large differences in the strength of inhibitory activity.

We calculated the individual contributions of the –35 region, the –10 region and the spacer length to the promoter strength, and showed that the base sequences in the –35 region affect the variety of genomic promoter strengths to a degree comparable to the –10 region, although the –10 region and the spacer length are more important for discriminating σ70-DNA binding sites from random sequences.

Our model describes only the simplest promoters, whose associated mRNA levels are not modified from the basal promoter strengths. However, once we have optimized the weight vector W using the non-regulated promoters, we can use it to detect promoters under strong regulation by analyzing outliers in the z–W · N plot (Fig. 6), for which the observed mRNA levels are significantly different from the predicted promoter strengths. We identified several genes with expression enhanced by multiple promoters, and genes under strong regulation by transcription factors. This analysis of outliers is a promising approach for the discovery of genes with strongly modified expression.

Our method uses only the promoter sequences existing in the genome. Since these promoters have highly biased base frequencies, certain promoter positions do not have sufficient base variations to determine certain components of the weight vector W accurately. Although we reduced the influence of this problem by the normalization convention (Equation 2) to reduce the number of parameters and by the introduction of the prior weight vector W0 (Equation 3), it will be useful to perform microarray experiments for the E.coli strains mutated to have several promoters rarely observed in wild-type strains.

Our method is applicable to other organisms if a collection of promoter sequences and microarray data are available. The present study also points to the rich information contained in the absolute fluorescent intensities of microarray experiments.


    Acknowledgments
 
We are grateful to Prof. Ogasawara for providing helpful comments on the absolute values of fluorescent intensities.

Received on August 5, 2004; revised on October 10, 2004; accepted on October 10, 2004

    REFERENCES

    Ayers, D.G., Auble, D.T., deHaseth, P.L. (1989) Promoter recognition by Escherichia coli RNA polymerase. Role of the spacer DNA in functional complex formation. J. Mol. Biol., 207, 749–756[CrossRef][ISI][Medline].

    Burr, T., Mitchell, J., Kolb, A., Minchin, S., Busby, S. (2000) DNA sequence elements located immediately upstream of the –10 hexamer in Escherichia coli promoters: a systematic study. Nucleic Acids Res., 28, 1864–1870[Abstract/Free Full Text].

    Bussemaker, H.J., Li, H., Siggia, E.D. (2001) Regulatory element detection using correlation with expression. Nat. Genet., 27, 167–171[CrossRef][ISI][Medline].

    Conlon, E.M., Liu, X.S., Lieb, J.D., Liu, J.S. (2003) Integrating regulatory motif discovery and genome-wide expression analysis. Proc. Natl Acad. Sci. USA, 18, 3339–3344.

    Danese, P.N. and Silhavy, T.J. (1997) The sigma(E) and the Cpx signal transduction systems control the synthesis of periplasmic protein-folding enzymes in Escherichia coli. Genes Dev., 11, 1183–1193[Abstract/Free Full Text].

    Dubrac, S. and Touati, D. (2000) Fur positive regulation of iron superoxide dismutase in Escherichia coli: functional analysis of the sodB promoter. J. Bacteriol., 182, 3802–3808[Abstract/Free Full Text].

    Gardella, T., Moyle, H., Susskind, M.M. (1989) A mutant Escherichia coli sigma 70 subunit of RNA polymerase with altered promoter specificity. J. Mol. Biol., 206, 579–590[CrossRef][ISI][Medline].

    Hawley, D.K. and McClure, W.R. (1983) Compilation and analysis of Escherichia coli promoter DNA sequences. Nucleic Acids Res., 11, 2237–2255[Abstract/Free Full Text].

    Harley, C.B. and Reynolds, R.P. (1987) Analysis of E.coli promoter sequences. Nucleic Acids Res., 15, 2343–2361[Abstract/Free Full Text].

    Heumann, J.M., Lapedes, A.S., Stormo, G.D. (1994) Neural networks for determining protein specificity and multiple alignment of binding sites. Proc. Int. Conf. Intell. Syst. Mol. Biol., 2, 188–194[Medline].

    Kanehisa, M., Goto, S., Kawashima, S., Okuno, Y., Hattori, M. (2004) The KEGG resources for deciphering the genome. Nucleic Acids Res., 32, D277–D280[Abstract/Free Full Text].

    Karp, P.D., Arnaud, M., Collado-Vides, J., Ingraham, J., Paulsen, I.T., Saier, M.H., Jr. (2004) The E.coli EcoCyc database: no longer just a metabolic pathway database. ASM News, 70, 25–30.

    Kobayashi, M., Nagata, K., Ishihama, A. (1990) Promoter selectivity of Escherichia coli RNA polymerase: effect of base substitutions in the promoter –35 region on promoter strength. Nucleic Acids Res., 18, 7367–7372[Abstract/Free Full Text].

    Kolb, A., Busby, S., Buc, H., Garges, S., Adhya, S. (1999) Transcriptional regulation by cAMP and its receptor protein. Annu. Rev. Biochem., 62, 749–795[CrossRef].

    Kumar, A., Malloch, R.A., Fujita, N., Smillie, D.A., Ishihama, A., Hayward, R.S. (1993) The minus 35-recognition region of Escherichia coli sigma 70 is inessential for initiation of transcription at an "extended minus 10" promoter. J. Mol. Biol., 232, 406–418[CrossRef][ISI][Medline].

    Lisser, S. and Margalit, H. (2000) Compilation of E.coli mRNA promoter sequences. Nucleic Acids Res., 21, 1507–1516.

    Liu, J. and Beacham, I.R. (1990) Transcription and regulation of the cpdB gene in Escherichia coli K12 and Salmonella typhimurium LT2: evidence for modulation of constitutive promoters by cyclic AMP–CRP complex. Mol. Gen. Genet., 222, 161–165[CrossRef][ISI][Medline].

    Martin, R.G., Gillette, W.K., Rhee, S., Rosner, J.L. (1999) Structural requirements for marbox function in transcriptional activation of mar/sox/rob regulon promoters in Escherichia coli: sequence, orientation and spatial relationship to the core promoter. Mol. Microbiol., 34, 431–441[CrossRef][ISI][Medline].

    Mori, H., Isono, K., Horiuchi, T., Miki, T. (2000) Functional genomics of Escherichia coli in Japan. Res. Microbiol., 151, 121–128[Medline].

    Mulligan, M.E. and McClure, W.R. (1986) Analysis of the occurrence of promoter-sites in DNA. Nucleic Acids Res., 14, 109–126[Abstract/Free Full Text].

    Mulligan, M.E., Hawley, D.K., Entriken, R., McClure, W.R. (1984) Escherichia coli promoter sequences predict in vitro RNA polymerase selectivity. Nucleic Acids Res., 12, 789–800[ISI][Medline].

    Mulligan, M.E., Brosius, J., McClure, W.R. (1985) Characterization in vitro of the effect of spacer length on the activity of Escherichia coli RNA polymerase at the TAC promoter. J. Biol. Chem., 260, 3529–3538[Abstract/Free Full Text].

    Nakamura, Y. and Mizusawa, S. (1985) In vivo evidence that the nusA and infB genes of E.coli are part of the same multi-gene operon which encodes at least four proteins. EMBO J., 4, 527–532[ISI][Medline].

    O'Neill, M.C. (1989) Consensus methods for finding and ranking DNA binding sites. Application to Escherichia coli promoters. J. Mol. Biol., 207, 301–310[CrossRef][ISI][Medline].

    Pogliano, J., Lynch, A.S., Belin, D., Lin, E.C., Beckwith, J. (1997) Regulation of Escherichia coli cell envelope proteins involved in protein folding and degradation by the Cpx two-component system. Genes Dev., 11, 1169–1182[Abstract/Free Full Text].

    Salgado, H., Gama-Castro, S., Martinez-Antonio, A., Diaz-Peredo, E., Sanchez-Solano, F., Peralta-Gil, M., Garcia-Alonso, D., Jimenez-Jacinto, V., Santos-Zavaleta, A., Bonavides-Martinez, C., Collado-Vides, J. (2004) RegulonDB (version 4.0): transcriptional regulation, operon organization and growth conditions in Escherichia coli K-12. Nucleic Acids Res., 32, 303–306.

    Schölkopf, B. and Smola, A.J. Learning with Kernels, (2002) , Cambridge, MA MIT Press.

    Schwan, W.R., Seifert, H.S., Duncan, J.L. (1994) Analysis of the fimB promoter region involved in type 1 pilus phase variation in Escherichia coli. Mol. Gen. Genet., 242, , pp. 623–630[CrossRef][ISI][Medline].

    Seaton, B.L. and Vickery, L.E. (1994) A gene encoding a DnaK/hsp70 homolog in Escherichia coli. Proc. Natl Acad. Sci., USA, 91, 2066–2070[Abstract/Free Full Text].

    Sengupta, A.M., Djordjevic, M., Shraiman, B.I. (2002) Specificity and robustness in transcription control networks. Proc. Natl Acad. Sci. USA, 99, 2072–2077[Abstract/Free Full Text].

    Siebenlist, U., Simpson, R.B., Gilbert, W. (1980) E.coli RNA polymerase interacts homologously with two different promoters. Cell, 20, 269–281[CrossRef][ISI][Medline].

    Stefano, J.E. and Gralla, J.D. (1982) Mutation-induced changes in RNA polymerase-lac ps promoter interactions. J. Biol. Chem., 257, 13924–13929[Abstract/Free Full Text].

    Stormo, G.D. (2000) DNA binding sites: representation and discovery. Bioinformatics, 16, 16–23[Abstract/Free Full Text].

    Straney, R., Krah, R., Menzel, R. (1994) Mutations in the –10 TATAAT sequence of the gyrA promoter affect both promoter strength and sensitivity to DNA supercoiling. J. Bacteriol., 176, 5999–6006[Abstract/Free Full Text].

    Strohl, W.R. (1992) Compilation and analysis of DNA sequences associated with apparent streptomycete promoters. Nucleic Acids Res., 20, 961–974[Abstract/Free Full Text].

    Szoke, P.A., Allen, T.L., deHaseth, P.L. (1987) Promoter recognition by Escherichia coli RNA polymerase: effects of base substitutions in the –10 and –35 regions. Biochemistry, 26, 6188–6194[CrossRef][Medline].

    Youderian, P., Bouvier, S., Susskind, M.M. (1982) Sequence determinants of promoter activity. Cell, 30, 843–853[CrossRef][ISI][Medline].

