Bioinformatics Advance Access originally published online on September 28, 2004
Bioinformatics 2005 21(5):601-607; doi:10.1093/bioinformatics/bti047
Improving promoter prediction for the NNPP2.2 algorithm: a case study using Escherichia coli DNA sequences
1 Department of Mathematics and Applied Statistics, University of Wollongong, Wollongong, NSW 2522, Australia
2 Department of Biological Sciences, University of Wollongong, Wollongong, NSW 2522, Australia
*To whom correspondence should be addressed.
| Abstract |
|---|
Motivation: Although a great deal of research has been undertaken in the area of promoter prediction, prediction techniques are still not fully developed. Many algorithms tend to exhibit poor specificity, generating many false positives, or poor sensitivity. The neural network prediction program NNPP2.2 is one such example.
Results: To improve the NNPP2.2 prediction technique, the distance between the transcription start site (TSS) associated with the promoter and the translation start site (TLS) of the subsequent gene coding region has been studied for Escherichia coli K12 bacteria. An empirical probability distribution that is consistent for all E.coli promoters has been established. This information is combined with the results from NNPP2.2 to create a new technique called TLSNNPP, which improves the specificity of promoter prediction. The technique is shown to be effective using E.coli DNA sequences; however, it is applicable to any organism for which a set of promoters has been experimentally defined.
Availability: The data used in this project and the prediction results for the tested sequences can be obtained from http://www.uow.edu.au/~yanxia/E_Coli_paper/SBurden_Results.xls
Contact: alh98{at}uow.edu.au
| 1 INTRODUCTION |
|---|
Owing to the availability of vast amounts of genomic data, there is a need for prediction techniques that can rapidly and accurately evaluate sequences for the presence of promoters. Although a great deal of research has been undertaken in the area of promoter prediction, the problem has not yet been resolved. Algorithms tend to exhibit either poor specificity, generating many false positives, or poor sensitivity.
The problems inherent in promoter prediction occur because the transcription process is initiated by specific interactions between proteins and the DNA sequence in the promoter region. The interactions are complex and the various elements involved are highly degenerate, so recognition of characteristic sequences within promoter elements is difficult. Consequently, the most distinguishing feature of many promoters is often the presence of one or more characteristic motifs upstream of the +1 position and the spacing and distance between the conserved elements. Various other signals, motifs and/or clusters of elements in the surrounding region have also been reported (Chasov et al., 2002; Ozoline et al., 1997; Pedersen et al., 2000; Ioshikhes et al., 1999).
An additional complication arises because the detectable motifs within the promoters also occur randomly throughout the genome. As such, the promoter elements do not possess specific identifying characteristics, and prediction algorithms must incorporate multiple features that may or may not be present in any given promoter.
Even though the promoter elements are degenerate in nature, they contain the nucleotide sequences which indicate the starting point for RNA synthesis. Hence, they are always located immediately upstream of the first nucleotide (or bp) that is transcribed during gene expression. This nucleotide is often called the +1 position or transcription start site (TSS).
By proxy, the promoters are also associated with at least one gene coding region and after transcription, the relevant coding regions in the RNA are translated into a protein. Hence immediately downstream of the promoter region and TSS there exists at least one gene coding region. In this paper, the first nucleotide downstream of the promoter region to be translated (i.e. the first nucleotide in the immediately subsequent gene coding region) will be denoted by the relevant translation start site (TLS) for the promoter.
Currently, features that are commonly utilized in promoter prediction algorithms include homology with known promoters, the presence of particular motifs within the sequence, DNA structural characteristics and the relative signatures of different regions in the sequence.
Algorithms that use the presence of particular motifs within the sequence include those based on position weight matrices (PWMs). The usefulness of simple examples such as TATA (Bucher, 1990) is limited because they assume independence between adjacent bases and do not allow for the presence of multiple promoter elements, insertions, deletions or variable spacing between elements. However, more complex algorithms such as Eponine (Down and Hubbard, 2002) incorporate multiple promoter elements using probability distributions relative to the TSS.
Markov models in several forms have also been used for promoter prediction (Ohler et al., 1999; Ohler et al., 2000; Audic and Claverie, 1997). They take advantage of the different characteristics of particular signatures in the sequence. Unlike PWMs, they do not rely on the presence of a motif and allow for a departure from independence between adjacent nucleotides.
Word frequency techniques are also based on over-represented patterns of nucleotides of different lengths. Some algorithms such as PromoterInspector (Scherf et al., 2000) utilize classifiers to match sequences against a database of known promoter elements. However, as the probability of a word decreases exponentially with its length, balancing sensitivity and specificity using single words is problematic. Others, such as PromFD (Chen et al., 1997), find over-represented patterns of 5-10 bp length and create information matrices to predict promoter location. Multiple promoter elements are incorporated into the Mitra Algorithm (Eskin et al., 2003) using disjunct groups of words with a given separation to find frequent signals in several bacterial genomes. Finally, the evolutionary algorithm has also been applied to promoter prediction (Corne et al., 2001).
Neural networks (NNs), another pattern recognition technique, have also been applied to promoter prediction. Despite the simple architecture of earlier algorithms (Demeler and Zhou, 1991) prediction accuracy was high but there was a correspondingly high incidence of false positives. More complex architectures including algorithms incorporating several NNs in series (Knudsen, 1999); multiple hidden layers and time delay neural networks (TDNNs) (Reese and Eeckman, 1995) have also been applied to the problem. Interactive optimization of the number of nodes during training has been used (Yang et al., 1999) as has preprocessing of promoter sequences to extract features based on their information content (Ma et al., 2001).
Recent algorithms often incorporate several analysis techniques. PromH (Solovyev and Shahmuradov, 2003) uses linear discriminant functions to evaluate the structural characteristics of DNA in the promoter regions and also detects conservation features and nucleotide sequences of promoters from pairs of orthologous genes. Similarly, CONPRO (Liu and States, 2001) compares predictions from five algorithms, including the NNPP program, for sequences upstream from the TLS. Finally, Dragon gene finder (Bajic and Seah, 2003) compares sequences to the oligonucleotide positional distributions for a particular functional region of a gene. The output from several modules is then passed to an NN which integrates the information and provides a prediction.
Although all of these techniques attempt to maximize the number of promoters detected while minimizing the number of false predictions (incorrect predictions), there is always a trade-off between sensitivity (the percentage of true positives) and specificity (the ratio of true to false predictions).
To improve currently available algorithms, it is therefore necessary to introduce other measures to help in reducing the number of false predictions. The use of structural information has a lot of potential for assisting with promoter prediction and work is continuing in this area. For example, methods to predict areas of helix destabilization are being developed to help in reducing false positives (Benham, 1996), and distance correlations between large sets of elements have been used to identify over-represented correlations without the need for training (Quandt et al., 1996).
The distance between the TSS and TLS has not been explicitly utilized in promoter prediction algorithms. Some algorithms informally incorporate this information by restricting their search areas using the TLS as a reference (Liu and States, 2001) while others incorporate probability distributions. However, none incorporate distance information into their algorithms. Consequently, another measure to assist with promoter prediction can be defined: the distance between a promoter (defined by its associated TSS) and the TLS of the subsequent coding region. Using the calculated distance from known promoters to the start of the subsequent gene coding region, it is possible to estimate the probability of a promoter occurring at a given distance from the TLS and use this information to improve prediction specificity for existing promoter prediction tools.
The structure of this paper is as follows. In the next section the application of NNs to promoter prediction is described. The subsequent two sections first describe the distribution of distances between the TSS and TLS of known promoter sequences and then assess the sensitivity of this measure to the number of known promoters. The usefulness of the distance distribution for improving promoter prediction from NN techniques is tested in the following section. Finally the methods utilized are discussed.
| 2 APPROACH |
|---|
2.1 Neural network prediction
NNs attempt to model the learning process in the brain. NNs have been used extensively for promoter prediction and a comprehensive summary of their application can be obtained from Wu (1997). In particular, they have been applied successfully to the Escherichia coli genome (Demeler and Zhou, 1991; Ma and Wang, 1999) and other bacteria (Kalate et al., 2003). NNs have the advantage that they can learn to recognize the degenerate patterns that characterize promoter motifs, although they are often unable to distinguish novel promoters.
Although the results were not published, the program NNPP2.2 has also been applied to the E.coli genome (M. Reese, personal communication, http://www.fruitfly.org/seq_tools/~promoter.html). The program uses a series of TDNNs to incorporate multiple promoter elements with variable spacing. To train the TDNN for prokaryotes, a carefully cross-validated test set of 272 E.coli promoters was used.
To use the program, a DNA sequence of length at least 51 bp is input into the program, and the output of NNPP2.2 is a list of predictions with scores greater than the user-defined threshold. Each prediction includes the prediction score and the associated 46 bp sequence corresponding to the positions -41 to +5, where +1 is the TSS. All scores lie between 0 and 1 and represent the probability of the instance being a promoter. Because promoter elements may appear at different relative positions to one another and the TSS, the positional accuracy of promoter prediction using NNPP2.2 is ±3 bp. In this paper, a threshold score of 0.1 is used to provide an initial limit to the number of predictions obtained.
The results obtained for eukaryotic organisms from this technique have been cited by several references. Most recently, Liu and States (2001) compared several available techniques during the development of their own technique, showing that NNPP2.2 is competitive with several other freely available techniques. Their comparison shows that NNPP2.2 gives reasonable results, although the technique suffers from a high level of false positives.
2.2 TSS-TLS distance
The TSS-TLS distance can be used to improve promoter prediction for several reasons. First, the TSS and TLS can be defined for every gene and both locations can be experimentally defined (or verified); second, the location of the TSS is closely related to the promoter region.
The TLS in a gene is readily identifiable due to the structure of the coding regions in DNA. In many genomes, the first annotation of a sequence defines open reading frames that are essentially the predicted coding regions in the genome. As the TLS corresponds to the first nucleotide in the gene coding region, its position is easily defined.
Promoter regions are not well defined. They generally consist of several characteristic elements, but the exact number and the location of individual promoter elements vary. Since they are always located immediately upstream of the TSS, it is expedient to define a promoter region in terms of its TSS. By definition, the TSS is also located upstream of the associated TLS. Hence, all promoters are related to at least one gene coding region downstream of the promoter sequence. In E.coli, the distance between the TSS and TLS is relatively small, ranging from 0 to 1000 bp at the known extremes. As large samples of experimentally defined promoters exist for the E.coli genome, the distribution of the distance between the TSS and TLS can be estimated if all the TLS are given.
Formally, the distance between the TSS and the TLS is defined in terms of the number of nucleotides between these positions. As shown in Figure 1, the +1 position in the promoter (the TSS) is the first nucleotide counted and the nucleotide before the TLS is the last to be counted. This measure will be referred to as the TSS-TLS distance.
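The counting rule can be illustrated with a minimal Python sketch. The function name, the use of 1-based genome coordinates and the strand argument are assumptions made for illustration; they are not taken from the paper.

```python
def tss_tls_distance(tss: int, tls: int, strand: str = "+") -> int:
    """Count nucleotides from the TSS (+1) up to, but excluding, the TLS.

    Assumes 1-based genome coordinates. On the reverse strand,
    downstream corresponds to decreasing coordinates.
    """
    if strand not in ("+", "-"):
        raise ValueError("strand must be '+' or '-'")
    distance = tls - tss if strand == "+" else tss - tls
    if distance < 0:
        raise ValueError("the TLS must not lie upstream of the TSS")
    return distance


# A TSS at position 100 and a TLS at position 161 span positions 100..160.
print(tss_tls_distance(100, 161))
```

Under this convention a distance of 0 means that the TSS and the TLS coincide.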
For the E.coli K12 bacteria, we found 771 experimentally defined promoters from the EcoCyc database V7.1 (Karp et al., 2002, http://biocyc.org/ecocyc/) associated with 587 genes. For each promoter we calculated the TSS-TLS distance and evaluated the distribution of the distances. The resulting distribution has a range of 0-920 bp, with a mean of 104.5 bp, a median of 61 bp and an SD of 119.0 bp. As shown by the histogram in Figure 2, the distribution is highly skewed towards the lower values but contains several large outliers that cause the mean to be significantly higher than the median. The peak probability is obtained at a distance of 27 bp and for 95% of promoters the distance is <325 bp.
We found that the TSS-TLS distance distribution does not follow any standard discrete probability distribution functions. Consequently, we estimated the distribution function empirically using the data available. The resulting distribution function is shown in Figure 3. For the purposes of promoter prediction, we use the distribution to estimate the probability of a prediction being a true promoter given its distance from the TLS of the subsequent gene.
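One simple way to build such an empirical estimate is sketched below in Python. It uses raw relative frequencies, whereas the paper refers to a smoothed density, so it should be read as an approximation. The function names, the inclusive window and the toy distances are assumptions for illustration only.

```python
from collections import Counter


def empirical_pmf(distances):
    """Relative frequency of each observed TSS-TLS distance."""
    counts = Counter(distances)
    total = len(distances)
    return {d: c / total for d, c in counts.items()}


def window_probability(pmf, x, a=3):
    """Estimate P(x - a <= d <= x + a) by summing empirical frequencies.

    Treating the window as inclusive is an assumption of this sketch.
    """
    return sum(pmf.get(d, 0.0) for d in range(x - a, x + a + 1))


# Toy distances invented for illustration (not EcoCyc data).
toy = [20, 25, 27, 27, 27, 30, 35, 61, 61, 104, 320, 920]
pmf = empirical_pmf(toy)
p_near_27 = window_probability(pmf, 27)
print(p_near_27)
```

With a small sample many distances have zero frequency, which is one reason a smoothed estimate may be preferable in practice.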
To check whether all promoters share the same distribution, we evaluated several known parameters that may affect the distribution of TSS-TLS distance. These parameters included the strand on which the promoter is located; the associated sigma factor; the functional class of the associated gene (Serres and Riley, 2000); the number of promoters grouped together; the presence and conservation of characteristic motifs; and the nucleotide sequence in the promoter region. Promoters were separated into groups by each of these factors and the resulting sub-distributions were tested using contingency table analysis. In all cases, there was no significant difference in the distribution of distances within the different groups. As such, we assume that the distribution of TSS-TLS distances is independent of the nucleotide sequence, the conservation of promoter motifs and the other tested parameters. The following research is based on this assumption.
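The paper does not state which statistic was used for the contingency table analysis. As an illustration only, the Python sketch below computes a Pearson chi-square statistic for a table of promoter counts cross-classified by group and distance bin. The bin boundaries and all counts are hypothetical.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table.

    Rows or columns with a zero total should be merged beforehand,
    otherwise the expected count is zero and the division fails.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return stat, dof


# Hypothetical counts: rows are strands (+, -); columns are distance
# bins (0-49 bp, 50-149 bp, 150 bp or more).
table = [[40, 35, 25],
         [38, 37, 25]]
stat, dof = chi_square_statistic(table)
print(stat, dof)
```

A statistic well below the critical value for the given degrees of freedom (5.991 at the 5% level for 2 degrees of freedom) gives no evidence that the groups differ.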
2.3 Sensitivity of TSS-TLS distance
In the next section, a new measure for predicting promoter locations based on the empirical probability distribution of TSS-TLS distance is developed. The empirical probability distribution used in this paper is based on the information from the 771 E.coli promoters. Theoretically, to accurately predict promoters by using the measure developed in this paper, this empirical distribution should be updated frequently whenever any new promoter is identified. However, it is not practical to update the empirical distribution so often. Thus, it is interesting to know whether the empirical probability distribution is very sensitive to the number of promoters used when the number is already reasonably high.
To carry out this study, a subset of promoters was randomly selected from the set of 771 promoters and the smoothed density function for the TSS-TLS distance was recreated. Subsets of 400, 200 and 100 promoter sequences were tested.
As shown in Figure 4, the resulting distributions closely approximated the shape of the final distribution. However, when fewer promoters were included, the peak density was reduced and more weight was observed at the tail of the distribution.
The distribution appears to be robust when the number of known promoters exceeds 200. As such the technique is still applicable when significantly fewer promoters have been experimentally defined.
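A subsampling check of this kind can be sketched in Python as follows. The sketch compares binned frequencies rather than the smoothed density used in the paper, and it summarizes the difference with the total variation distance, which is a choice made here and not the paper's. The distances are synthetic, and the bin width and random seed are arbitrary assumptions.

```python
import random


def binned_frequencies(distances, bin_width=25, max_distance=1000):
    """Fraction of distances falling in each bin of width bin_width."""
    n_bins = max_distance // bin_width + 1
    counts = [0] * n_bins
    for d in distances:
        counts[min(d // bin_width, n_bins - 1)] += 1
    return [c / len(distances) for c in counts]


def total_variation(p, q):
    """Half the summed absolute difference between two distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


rng = random.Random(0)
# Synthetic right-skewed distances standing in for the 771 measured ones.
full = [min(int(rng.expovariate(1 / 100)), 1000) for _ in range(771)]
reference = binned_frequencies(full)

for size in (400, 200, 100):
    subset = rng.sample(full, size)
    tv = total_variation(reference, binned_frequencies(subset))
    print(size, round(tv, 3))
```

Repeating the draw several times for each subset size would give a clearer picture of how the variability grows as the number of promoters falls.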
| 3 ALGORITHM |
|---|
The NN technique NNPP2.2 is based on the nucleotide sequence alone. It recognizes only the presence and relative location of patterns and motifs within a promoter, rather than the location of promoter motifs relative to the TLS. It predicts the probability that a tested sequence ±3 bp (denoted by $s$) belongs to the class of true promoters $\mathcal{P}$, that is, $P(s \in \mathcal{P})$. Given a potential promoter sequence $s$, the relevant TSS and TLS can be determined. The TSS-TLS distance is denoted by $d_s$ and measures the distance between the TSS and TLS for sequence $s$. Since the position of sequence $s$ is usually randomly located, $d_s$ is a random variable.
Here, we propose a new approach for improving the predictive ability of NNPP2.2 using additional information provided by $d_s$. Instead of just predicting the probability that sequence $s$ belongs to the class of true promoters, $P(s \in \mathcal{P})$, we predict the probability that sequence $s$ belongs to the class of true promoters and that $d_s \in (x - a, x + a)$. This probability can be written as $P[s \in \mathcal{P} \cap d_s \in (x - a, x + a)]$, where $x$ is an integer and $a$ is a predefined non-zero integer.
As it is relatively easy to identify the locations of TLS in the sequence, incorporating this information into the prediction algorithm improves specificity. That is, instead of predicting whether a sequence is a promoter by just using the nucleotide content of the sequence, we are also finding out whether the sequence is located in proximity to a gene coding region. As such we can theoretically guarantee that $P[s \in \mathcal{P} \cap d_s \in (x - a, x + a)]$ predicts promoter sequences more accurately. This probability can be calculated as follows:
$$P[s \in \mathcal{P} \cap d_s \in (x - a, x + a)] = P[d_s \in (x - a, x + a) \mid s \in \mathcal{P}] \, P(s \in \mathcal{P}) \qquad (1)$$
Therefore, $P[s \in \mathcal{P} \cap d_s \in (x - a, x + a)]$ is estimated by
$$\hat{P}[s \in \mathcal{P} \cap d_s \in (x - a, x + a)] = \hat{P}[d_s \in (x - a, x + a) \mid s \in \mathcal{P}] \, \hat{P}(s \in \mathcal{P}) \qquad (2)$$
where $\hat{P}(s \in \mathcal{P})$ is provided by NNPP2.2 and $\hat{P}[d_s \in (x - a, x + a) \mid s \in \mathcal{P}]$ is given by the empirical distribution produced in this paper. We chose $a = 3$, as this corresponds to the positional accuracy of NNPP2.2. Hereafter, this prediction technique is referred to as the TLSNNPP technique.
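A minimal Python sketch of this estimate follows, assuming (as the implementation section describes) that it is the product of the NNPP2.2 score and the empirical probability of the observed distance window. The function name and the input values are illustrative and do not come from the paper.

```python
def tlsnnpp_score(nnpp_score, distance_window_prob):
    """Product of the NNPP2.2 score and the empirical distance probability."""
    for value in (nnpp_score, distance_window_prob):
        if not 0.0 <= value <= 1.0:
            raise ValueError("both inputs must be probabilities in [0, 1]")
    return nnpp_score * distance_window_prob


# Illustrative values only: a high NNPP2.2 score combined with a
# modest empirical probability for the observed distance window.
adjusted = tlsnnpp_score(0.9, 0.05)
print(adjusted)
```

Because the distance term is a probability over a narrow window, adjusted scores are much smaller than raw NNPP2.2 scores, so cut-offs need to be chosen on the adjusted scale.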
| 4 IMPLEMENTATION |
|---|
In general, implementing the TLSNNPP technique for an organism involves several steps. First, the empirical distribution of TSS-TLS distances for known promoters must be generated. Then, DNA sequences must be passed through the NNPP2.2 prediction algorithm to obtain the locations of predicted promoters. These predictions must be associated with the closest subsequent TLS in the sequence and the TSS-TLS distance for each prediction determined. Finally, for each prediction, Equation (2) is calculated and the final set of predictions created using an appropriate cut-off probability.
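The step that associates each prediction with the closest subsequent TLS can be sketched with a binary search. This Python example assumes forward-strand, 1-based coordinates and a sorted list of TLS positions; the function name and the coordinates are hypothetical.

```python
from bisect import bisect_left


def next_tls(predicted_tss, tls_positions):
    """Return the closest TLS at or downstream of a predicted TSS, or None.

    tls_positions must be sorted in ascending order.
    """
    i = bisect_left(tls_positions, predicted_tss)
    return tls_positions[i] if i < len(tls_positions) else None


tls_sites = [501, 1420, 2750]  # hypothetical TLS coordinates
for tss in (440, 501, 900, 3000):
    tls = next_tls(tss, tls_sites)
    distance = None if tls is None else tls - tss
    print(tss, tls, distance)
```

Predictions with no downstream TLS in the tested sequence receive no distance and would need separate handling.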
For the E.coli genome, implementation involved each of these steps. First, as described, the empirical TSS-TLS distance distribution was generated. Then, using the NNPP2.2 module for prokaryotes, 510 sequences from the E.coli genome containing a total of 671 known promoters were tested. Tested sequences were randomly chosen from the set of known promoter sequences. Each sequence was 500 bp long, starting 501 bp upstream from the TLS and finishing at the 1 position before the TLS. Sequences contained between 1 and 7 known promoters, with an average of 1.32 promoters per sequence. Promoters were considered to be correctly predicted when the actual TSS of the promoter fell within ±3 bp of a predicted TSS. Only one true positive prediction was allowed for each tested promoter. The results are shown in Table 1.
In Table 1, TP denotes the number of true predictions with an NNPP2.2 score greater than the chosen threshold of 0.1. No. Pred. denotes the total number of tested sequences with a score above the threshold. $\hat{P}(s \in \mathcal{P})$ denotes the estimated probability from the data that a tested sequence is a true promoter and % Recog denotes the percentage of correct predictions obtained in total (% Recog=TP/671), which is a measure of the coverage of the NNPP2.2 program. Using a threshold of 0.1, a total of 7584 predictions resulted and of these, only 470 corresponded to true (or known) promoters. The number of false predictions generally exceeded 85% of the total number of predictions, even at very high thresholds.
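The matching rule and the recognition percentage can be sketched in Python as below. The function name and the positions are hypothetical. The sketch counts a known promoter once however many predictions fall within the tolerance, and it does not prevent one prediction from matching two closely spaced promoters, a case the paper does not discuss.

```python
def count_true_positives(predicted_tss, known_tss, tolerance=3):
    """Count known promoters with a predicted TSS within +/- tolerance bp.

    Each known promoter contributes at most one true positive.
    """
    return sum(
        any(abs(p - k) <= tolerance for p in predicted_tss)
        for k in known_tss
    )


predicted = [98, 101, 250, 402]  # hypothetical predicted TSS positions
known = [100, 400, 470]          # hypothetical known TSS positions
tp = count_true_positives(predicted, known)
recognition = 100 * tp / len(known)
print(tp, round(recognition, 1))
```

Here the promoter at position 100 is matched by two predictions but is counted only once.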
To incorporate the new prediction measure proposed in this paper, for each of the 7584 sequences obtained from NNPP2.2 with a cut-off >0.1, the TSS-TLS distance, denoted by $x$, was calculated. Using this distance, $P(d_s \in x \pm 3 \mid s \in \mathcal{P})$, the probability of a distance within 3 bp of $x$ given that $s$ is a true promoter, was calculated from the TSS-TLS distance empirical distribution. $\hat{P}[s \in \mathcal{P} \cap d_s \in x \pm 3]$ was then calculated by multiplying the score from NNPP2.2 by $P(d_s \in x \pm 3 \mid s \in \mathcal{P})$. Hence the probability of each prediction obtained from NNPP2.2 was modified to give an adjusted probability of the sequence being a promoter. It was found that after adjustment, the maximum prediction probability was 0.0672. New cut-off values were created as a percentage of this value and a modified set of predictions was created.
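Setting cut-offs as a fraction of the maximum adjusted score can be sketched in Python as follows. The function name and the use of an inclusive comparison are assumptions. Of the scores listed, 0.0672 and 0.0383 appear in the text; the others are invented for illustration.

```python
def predictions_above(adjusted_scores, fraction_of_max):
    """Keep scores at or above a cut-off set as a fraction of the maximum."""
    cutoff = fraction_of_max * max(adjusted_scores)
    return [s for s in adjusted_scores if s >= cutoff]


scores = [0.0672, 0.0383, 0.0101, 0.0020]
kept = predictions_above(scores, 0.5)
print(kept)
```

Expressing the cut-off relative to the maximum keeps thresholds comparable even though adjusted scores lie on a much smaller scale than raw NNPP2.2 scores.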
Table 2 shows the results of the TLSNNPP technique. In general, compared with the results from NNPP2.2 alone, the total number of true predictions has declined, but the number of false predictions has also declined to a great extent. Hence, the percentage of false predictions is now always <85%, and reaches as low as 52%. The benefit of the TLSNNPP technique can also be seen by comparing the total number of predictions for a given number of true predictions. From Table 1, at a threshold of 0.9, NNPP2.2 provides 135 true predictions from a total of 1055 predictions. Using the TLSNNPP technique to obtain 135 true predictions, a threshold of 0.0383 is required (not included in Table 2). At this threshold, a total of only 423 predictions are obtained, which is a 60% reduction in the total number of predictions compared with the unmodified algorithm.
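The arithmetic behind this comparison can be checked directly from the counts quoted above. The fractions below are computed here from those counts and are not figures reported in the paper:

$$\frac{135}{1055} \approx 0.128, \qquad \frac{135}{423} \approx 0.319, \qquad \frac{1055 - 423}{1055} \approx 0.599$$

That is, the share of predictions that are true rises from roughly 13% to roughly 32% for the same 135 true predictions, while the total number of predictions falls by about 60%.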
Figure 5 shows the reduction in false positives obtained when using the TLSNNPP technique compared with NNPP2.2 alone. Both techniques suffer from overall low recognition rates for the promoters because the prediction program NNPP2.2 was trained only on 293 E.coli promoters, and therefore many of the newer promoters are not recognized by the program.
| 5 DISCUSSION |
|---|
This study has shown that the TSS-TLS distance distribution can be used to reduce the incidence of false predictions from an existing prediction technique. By combining the distance information with the pre-existing NNPP2.2 prediction algorithm, we have demonstrated that the prediction of promoters can be significantly improved. In addition, the technique ensures that each promoter that is predicted is associated with a gene coding region.
For the E.coli genome, there is one limitation for using the empirical TSS-TLS distribution to improve promoter prediction by the NNPP2.2 program. Promoters that are located further upstream than 500 bp will, by the very nature of the empirical distance distribution for E.coli, score very low and will not be located using this technique. Hence, while it is useful for most E.coli promoters, prediction for those few located further upstream will not be improved.
Several sources of error have been identified in this study. The size of the dataset used in this study does not necessarily make it representative of the E.coli genome. The fact that these promoters have been experimentally defined may mean that they are more homogeneous than the set of unknown promoters within the genome. As such, the presence of bias in these results is possible.
Conversely, within each sequence there could exist other promoter sequences that have not yet been identified. As such some of the false positives may in fact be unknown or as yet unreported promoters, suggesting that the results obtained are conservative.
The additional benefit of this technique is that there is a potential to use it more generally. That is, the distance information could be used to post-process results from other traditional prediction algorithms. Potentially, it could also be incorporated into a new promoter prediction technique for E.coli bacteria. Hence, while this paper has focused on the NNPP2.2 prediction algorithm, the technique could also be applied to other prediction algorithms and there is a scope for further research in this area.
Finally, although the results cannot be directly generalized to other organisms, there is a potential for this technique to be applied more widely. Even higher organisms, whose promoter structure and TSS-TLS distance distribution may be significantly different from those found in E.coli promoters, may have a similarly stable TSS-TLS distance distribution that could be used to improve the prediction specificity. The only requirement is access to a sufficiently large number of experimentally defined promoter sequences from similar organisms.
Received on July 22, 2004; revised on September 23, 2004; accepted on September 24, 2004
| REFERENCES |
|---|
Audic, S. and Claverie, J. (1997) Detection of eukaryotic promoters using Markov transition matrices. Comput. Chem., 21, 223-227.
Bajic, V.B. and Seah, S.H. (2003) Dragon Gene Start Finder identifies approximate locations of the 5' ends of genes. Nucleic Acids Res., 31, 3560-3563.
Benham, C.J. (1996) Computation of DNA structural variability - a new predictor of DNA regulatory regions. Comput. Appl. Biosci., 12, 375-381.
Bucher, P. (1990) Weight matrix descriptions of four eukaryotic RNA polymerase II promoter elements derived from 502 unrelated promoter sequences. J. Mol. Biol., 212, 563-578.
Chasov, V.V., Deev, A.A., Masulis, I.S., Ozoline, O.N. (2002) Distribution and functional significance of A/T tracts in promoter sequences of Escherichia coli. Mol. Biol., 36, 682-688 (in Russian).
Chen, Q.K., Hertz, G.Z., Stormo, G.D. (1997) PromFD 1.0: a computer program that predicts eukaryotic pol II promoters using strings and IMD matrices. Comput. Appl. Biosci., 13, 29-35.
Corne, D., Meade, A., Sibly, R. (2001) Evolving core promoter motifs. In Proceedings of Congress on Evolutionary Computation, Seoul, Korea, May 27-30, IEEE, vol. 2, pp. 1162-1169.
Demeler, B. and Zhou, G.W. (1991) Neural network optimization for E.coli promoter prediction. Nucleic Acids Res., 19, 1593-1599.
Down, T.A. and Hubbard, T.J.P. (2002) Computational detection and location of transcription start sites in mammalian genomic DNA. Genome Res., 12, 458-461.
Eskin, E., Keich, U., Gelfand, M.S., Pevzner, P.A. (2003) Genome-wide analysis of bacterial promoter regions. Pac. Symp. Biocomput., 2003, 29-40.
Ioshikhes, I., Trifonov, E.N., Zhang, M.Q. (1999) Periodical distribution of transcription factor sites in promoter regions and connection with chromatin structure. Proc. Natl Acad. Sci. USA, 96, 2891-2895.
Kalate, R.N., Tambe, S.S., Kulkarni, B.D. (2003) Artificial neural networks for prediction of mycobacterial promoter sequences. Comput. Biol. Chem., 27, 555-564.
Karp, P.D., Riley, M., Saier, M., Paulsen, I.T., Collado-Vides, J., Paley, S.M., Pellegrini-Toole, A., Bonavides, C., Gama-Castro, S. (2002) The EcoCyc Database. Nucleic Acids Res., 30, 56-58.
Knudsen, S. (1999) Promoter2.0: for the recognition of polII promoter sequences. Bioinformatics, 15, 356-361.
Liu, R. and States, D. (2001) Consensus promoter identification in the human genome utilizing expressed gene markers and gene modeling. Genome Res., 12, 462-469.
Ma, Q. and Wang, J. (1999) Recognizing promoters in DNA using Bayesian neural networks. IASTED International Conference on Artificial Intelligence and Soft Computing, Honolulu, HI, August 9-12, pp. 301-305.
Ma, Q., Wang, J., Shasha, D., Wu, C. (2001) DNA sequence classification via an expectation maximisation algorithm and neural networks: a case study. IEEE Trans. Syst. Man Cybernet., 31, 468-475.
Ohler, U., Harbeck, S., Niemann, H., Nöth, E., Reese, M. (1999) Interpolated markov chains for eukaryotic promoter recognition. Bioinformatics, 15, 362-369.
Ohler, U., Stemmer, G., Harbeck, S., Niemann, H. (2000) Stochastic segment models of eukaryotic promoter regions. Pac. Symp. Biocomput., 2000, 380-381.
Ozoline, O.N., Deev, A.A., Arkhipova, M.V. (1997) Non-canonical sequence elements in the promoter structure. Cluster analysis of promoters recognised by Escherichia coli RNA polymerase. Nucleic Acids Res., 25, 4703-4709.
Pedersen, A.G., Jensen, L.J., Brunak, S., Stærfeldt, H., Ussery, D. (2000) A DNA structural atlas for Escherichia coli. J. Mol. Biol., 299, 907-930.
Quandt, K., Grote, K., Werner, T. (1996) GenomeInspector: a new approach to detect correlation patterns of elements on genomic sequences. Comput. Appl. Biosci., 12, 405-413.
Reese, M. and Eeckman, F. (1995) Novel neural network prediction systems for human promoters and splice sites. In Searls, G.S.D., Fickett, J., Noordewier, M. (Eds.). Proceedings of the Workshop on Gene-Finding and Gene Structure Prediction, Philadelphia, PA.
Scherf, M., Klingenhoff, A., Werner, T. (2000) Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J. Mol. Biol., 297, 599-606.
Serres, M. and Riley, M. (2000) MultiFun, a multifunctional classification scheme for Escherichia coli K-12 gene products. Microb. Comp. Genomics, 5, 205-222.
Solovyev, V. and Shahmuradov, I.A. (2003) PromH: promoters identification using orthologous genomic sequences. Nucleic Acids Res., 31, 3540-3545.
Wu, C. (1997) Artificial neural networks for molecular sequence analysis. Comput. Chem., 21, 237-256.
Yang, J., Parekh, R., Honavar, V., Dobbs, D. (1999) Data-driven theory refinement algorithms for bioinformatics. Proceedings of the International Joint Conference on Neural Networks (IJCNN 99), Washington, DC, IEEE, vol. 6, pp. 4064-4068.