Skip Navigation


Bioinformatics Advance Access originally published online on September 25, 2006
Bioinformatics 2006 22(22):2722-2728; doi:10.1093/bioinformatics/btl482
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/22/2722    most recent
btl482v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Xie, X.
Right arrow Articles by Yan, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Xie, X.
Right arrow Articles by Yan, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

PromoterExplorer: an effective promoter identification method based on the AdaBoost algorithm

Xudong Xie 1,2, Shuanhu Wu 1,3, Kin-Man Lam 4 and Hong Yan 1,5,*

1 Department of Electronic Engineering, City University of Hong Kong Hong Kong
2 Department of Electronic Engineering, Tsinghua University Beijing, China
3 School of Computer Science and Technology, Yantai University China
4 Department of Electronic and Information Engineering, The Hong Kong Polytechnic University Hong Kong
5 School of Electronic and Information Engineering, University of Sydney NSW 2006, Australia

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURE EXTRACTION FROM...
 3 FEATURE SELECTION AND...
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 

Motivation: Promoter prediction is important for the analysis of gene regulations. Although a number of promoter prediction algorithms have been reported in literature, significant improvement in prediction accuracy remains a challenge. In this paper, an effective promoter identification algorithm, which is called PromoterExplorer, is proposed. In our approach, we analyze the different roles of various features, that is, local distribution of pentamers, positional CpG island features and digitized DNA sequence, and then combine them to build a high-dimensional input vector. A cascade AdaBoost-based learning procedure is adopted to select the most ‘informative’ or ‘discriminating’ features to build a sequence of weak classifiers, which are combined to form a strong classifier so as to achieve a better performance. The cascade structure used for identification can also reduce the false positive.

Results: PromoterExplorer is tested based on large-scale DNA sequences from different databases, including the EPD, DBTSS, GenBank and human chromosome 22. Experimental results show that consistent and promising performance can be achieved.

Contact: h.yan{at}cityu.edu.hk


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURE EXTRACTION FROM...
 3 FEATURE SELECTION AND...
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 
With the publication and preliminary analysis of the human genome sequence (Lander et al., 2001; Venter et al., 2001), which marks a significant milestone in the field of biology, it has become important to characterize and functionally annotate genes based on the collected large-scale sequences. In the past decade, a number of reliable methods have been developed for protein-coding regions prediction (Claverie, 1997). However, for regulatory regions of genes, although a number of algorithms have been proposed (Bajic and Seah, 2003; Davuluri et al., 2001; Ohler et al., 2002), accurate promoter identification remains a challenge. A promoter is the region of a genomic sequence, which is close to a gene's transcription start site (TSS), and it largely controls the biological activation of the gene (Pedersen et al., 1999). Based on the characteristics of promoters, the intrinsic relations between these promoters and the corresponding genes their control can be discovered, and this is useful to our understanding of gene transcription. Therefore, promoter identification can be considered a fundamental and important step in gene annotation.

In order to discriminate a promoter region from non-promoters regions, with the quantity of the latter being overwhelmingly larger than the former, many different features are considered, such as CpG islands (Bajic and Seah, 2003; Davuluri et al., 2001), TATA boxes (Knudsen, 1999; Ohler et al., 2002), CAAT boxes (Knudsen, 1999; Ohler et al., 2002), some specific transcription factor binding sites (TFBSs) (Knudsen, 1999; Ohler et al., 2002; Solovyev and Shahmuradov, 2003), pentamer matrix (Bajic et al., 2002) and oligonucleotides (Scherf et al., 2000). Also, various pattern recognition technologies have been adopted for classification, e.g. neural networks (Bajic and Seah 2003; Knudsen, 1999; Ohler et al., 2002; Bajic et al., 2002), linear and quadratic discriminant analyses (Davuluri et al., 2001; Solovyev and Shahmuradov 2003), interpolated Markov model (Ohler et al., 2002), independent component analysis (ICA) (Matsuyama and Kawamura, 2004; Hiisila and Bingham, 2004) and non-negative matrix factorization (NMF) (Hiisila and Bingham, 2004). Experimental results and analyses in (Bajic et al., 2004) show that selection of the right biological signals to be implemented in promoter prediction programs remains an open issue. In fact, none of these signals can cover all promoter representations, and each feature abstracted from promoter sequences has its own limitation.

In this paper, a new algorithm called PromoterExplorer is proposed to locate the positions of promoters. In our method, not only the local distribution of pentamers, but also the positional CpG island features are considered. Furthermore, we adopt the original DNA sequence as the input features. All of these features are combined to form a high-dimensional vector, and then a cascade AbaBoost algorithm is used both to select a small subset of possible features and to train the classifiers. We evaluate the performance of the proposed algorithm for promoter detection using different DNA sequences, including the human chromosome 22 sequence. Consistent and promising results have been obtained, which show that our method can greatly improve the promoter identification performance, and that it outperforms other methods, such as PromoterInspector (Scherf et al., 2000), Dragon Promoter Finder (DPF) (Bajic et al., 2002) and First Exon Finder (FirstEF) (Davuluri et al., 2001).

This paper is organized as follows. In Section 2, the features used in our approach are introduced. Section 3 describes our method for selecting features and training classifiers based on the AdaBoost algorithm. Experimental results are given in Section 4, and the performance of our proposed algorithm is compared with existing methods. Finally, conclusions are drawn in Section 5.


    2 FEATURE EXTRACTION FROM DNA SEQUENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURE EXTRACTION FROM...
 3 FEATURE SELECTION AND...
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 
In our method, we consider three different kinds of features: the local distribution of pentamers, positional CpG island features and digitized DNA sequence. These are described in the following sections, respectively.

2.1 Local distribution of pentamers
The main idea of PromoterInspector is based on the context technique. For two sets of oligonucleotides, which are promoter-related and non-promoter-related, respectively, several wildcards at multiple positions are introduced to reduce the effect of mismatching. The number of wildcards and the length of the elements in the oligonucleotide sets are optimized for classification, and are then used for testing. For DPF, pentamers (all sequences of five consecutive nucleotides) are considered input features. The pentamers, which most significantly contribute to the separation between the promoter and non-promoter regions, are selected, and a positional weight matrix is used to represent the positional distribution of pentamers. From the two methods discussed above, we can argue that: (1) the discrimination information between promoters and non-promoters may be concealed in a certain set of combinations of nucleotides; in other words, the local structure plays an important role and (2) the frequencies of occurrences of these combinations are related to their positions considered in a promoter sequence.

In our method, we also select pentamers as input features. For an input DNA sequence, a set of pentamers ai, i = 1, 2, ..., W, can be obtained, where the maximal value of W is 45 = 1024. In order to select the most informative pentamers for discriminating promoters and non-promoters, we consider the posterior probability of I given ai, P(I|ai), where I is an indicator which equals 1 when the input sequence is a promoter, otherwise I = 0. If P(I = 1 | ai) > P(I = 0 | ai), the input sequence should be a promoter with a higher probability, and vice versa. Define

Formula 1(1)
and compute the value of {eta} for each pentamer. According to the Bayes' theorem, we have

Formula 2(2)
and

Formula 3(3)
From Equations (1) to (3), we can obtain that


Formula 4

(4)

Assuming that P(I = 1) and P(I = 0) are constant, we define {eta} as follows:

Formula 5(5)

The pentamers are then ranked according to their {eta} values. The 250 pentamers with the highest values are selected to form a pentamer set Pset. Here 250 pentamers are considered because their {eta} values are <1.5. In Table 1, the first 10 ranked pentamers and their corresponding {eta} values are shown. The pentamers that include a TATA box or a CAAT box are also tabulated.


View this table:
[in this window]
[in a new window]

 
Table 1 Some selected pentamers in the Pset

 
For each pentamer included in Pset, DPF uses positional weight matrices to represent their positional distributions. This method can describe the distribution of each pentamer exactly at each position in a DNA sequence. However, because there are a limited number of promoters for training, i.e. from a hundred to several thousands (here we have 1024 pentamer patterns), the positional weight matrices based on statistics may not be reliable. In order to solve this small sample-size problem, two modifications are made in our algorithm. First, all pentamers in Pset are considered as one class, and the others as another class. In other words, 1024 pentamer patterns are converted into two kinds of patterns: pentamers in Pset and pentamers out of Pset. Second, the pentamer at each position in a DNA sequence as well as those within its neighborhood are considered. A window of 51 bp moves across the sequence at an interval of 1 bp, and the number of pentamers in Pset within this window is taken as a feature at the center of the window. Therefore, for a DNA sequence with a length l, the number of features, which represent local distributions of pentamers in Pset, is l – 4.

2.2 Positional CpG island features
CpG islands are regions of a DNA sequence near and in the promoter of a mammalian gene where a large concentration of phosphodiester-linked cytosine (C) and guanine (G) pairs exists. The usual formal definition of a CpG island is a region with at least 200 bp, with a GC percentage <50%, and with an observed/expected CpG ratio <0.6 (Gardiner-Garden and Frommer, 1987). Because about 56% of human genes and 47% of mouse genes are associated with CpG islands (Antequera and Bird, 1993), CpG islands can be used to locate promoters across genomes (Pedersen et al., 1999; Bajic and Seah, 2003; Davuluri et al., 2001). In fact, from Table 1, we can find that most pentamers in Pset with high {eta} values are also G + C rich. The most widely used CpG island features are the GCp and the observed/expected CpG ratio (o/e), which are defined as follows:

Formula 6(6)
and

Formula 7(7)
where P(CG), P(C) and P(G) are percentages of CG, C and G in a DNA sequence, respectively.

GCp and o/e are two global features for G + C rich or G + C related promoters. However, for the promoters that are G + C poor, CpG island features cannot be used to predict the position of a promoter. It is a reasonable assumption that there are some short regions, which are G + C rich, even in a G + C poor promoter sequence. These regions can then be used for promoter identification. In other words, if we consider GCp and o/e as a sequence of local features instead of global features, more promoters can be found based on CpG islands. Similar to pentamer feature extraction, a sliding window 51 bp in length is used, and GCp and o/e are calculated for each window. Then, for an l-length DNA sequence, the number of extracted positional CpG island features is 2l.

2.3 Digitized DNA sequence
Besides the local distribution of pentamers and the positional CpG island features, we also adopt the digitized DNA sequence as an input feature. This is because there may be some intrinsic features in the original sequence, which are useful for discriminating promoters and non-promoters and are still unknown. In fact, some of the early research on promoter identification is based on digitized DNA sequence (Mahadevan and Ghosh, 1994) and various digitization algorithms have been proposed (Demeler and Zhou, 1991; Parbhane et al., 2000). In our method, each nucleotide is represented using a single integer, as given by: A = 0; T = 1; G = 2 and C = 3.

From the discussion above, we can see that for an l-length input DNA sequence, the number of extracted features, including the local distribution of pentamers, positional CpG island features and digitized DNA sequence, is l – 4 + 2l + l = 4l 4. These features are concatenated to form a high-dimensional vector, and then a cascade AdaBoost (Adaptive Boosting) learning algorithm is used for feature selection and classifier training for promoter identification.


    3 FEATURE SELECTION AND CLASSIFIER TRAINING WITH ADABOOST
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURE EXTRACTION FROM...
 3 FEATURE SELECTION AND...
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 
Boosting is a supervised learning algorithm used to improve the accuracy of any given learning algorithm. In this algorithm, a classifier with accuracy based on the training set greater than an average performance is created, and then new component classifiers are added to form an ensemble whose joint decision rule has an arbitrarily high level of accuracy in the training set (Duda et al., 2001). In such a case, we say that the classification performance has been ‘boosted'. In general, the algorithm trains successive component classifiers with a subset of the entire training data that is ‘most informative’ given the current set of component classifiers (Duda et al., 2001).

AdaBoost is a boosting algorithm, which runs a given weak learner several times on slightly altered training data, and combines the hypotheses to one final hypothesis in order to achieve greater accuracy than the weak learner's hypothesis would have (Freund and Schapire, 1997). The main idea of AdaBoost is that each example of the training set should act in a different role for discrimination at different training stages. Those examples that can be easily recognized should be considered less in the training that follows; while those examples that are incorrectly classified in the previous rounds should be paid more attention to. In this way, the weak learner is forced to focus on the more informative or the ‘difficult’ examples of the training set. The importance of each example is represented by a weight. At the beginning, all the weights are equal, and they are then adaptively adjusted according to the classification results based on a hypothesis in every round. The final hypothesis is a combination of the hypotheses of all rounds, namely a weighted majority vote, where hypotheses with lower classification errors have higher weights (Freund and Schapire, 1997).

As discussed in Section 2, for an input DNA sequence with a length l, the number of features extracted is N = 4l 4, e.g. if l = 250, then N = 996. We assume that only a small number of these features are necessary to form an effective, strong classifier. We therefore define our weak classifier as follows:

Formula 8(8)
where X is an input feature vector, xj is the j-th feature of X, and {theta}j is a threshold. Suppose we have a set of training samples: (X1, y1), ..., (Xm, ym), where Xi isin X and yi isin Y = {1, – 1} (‘1’denotes positive examples and ‘–1’ is used for negative examples). In order to create a strong classifier, the following procedure is used:

  1. Initialize the weights for each training example:

    Formula 9(9)
    where N+ and N are the number of positives and negatives, respectively.

  2. For t = 1, ... , T:
  1. For each feature xj, train a classifier hj(X), which implies selecting the optimal {theta}j to produce the lowest error. The error for the classifier hj considers all the input samples with the condition of {theta}j, which is defined as Formula 9
  2. Find the classifier Formula 9 that minimizes the error with respect to the distribution wt: Formula 9. Here Formula should be <0.5.
  3. Update the weights of the samples for the next round Formula 9, where ei = 0 if sample Xi is correctly classified, otherwise ei = 1, and Formula 9.
  4. Normalize the weights to make wt+1 a probability distribution: Formula 9.

After T iterations, the resulting strong classifier is:

Formula 10(10)
where Formula 9. The procedure described above not only selects features that will produce the lowest error Formula 9 when used as weak classifiers, but also trains the weak classifiers and the combined strong classifiers, that is, the optimal values of {theta}j, wt and {alpha}t are determined based on the training set.

For an input DNA sequence, the number of non-promoter segments is much larger than the number of promoters. Therefore, it is desirable to remove as many non-promoter segments from consideration as possible early on. Then we can cascade our classifiers to filter out most of the non-promoters, as shown in Figure 1.


Figure 1
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1 Promoter identification based on a cascade AdaBoost algorithm.

 
From Figure 1, we can see that a number of strong classifiers are used. In the early stage, few features, or weak classifiers, are considered, which can rapidly filter out most of the non-promoters while retaining most of the promoters. In the later stages, increasingly more complex features are adopted. For each stage, only the positive samples (promoters) and the negative samples (non-promoters) that are incorrectly classified in the previous stage are used for training. This structure can speed up the detection, and can also reduce the effect of a heavy imbalance between the number of promoters and non-promoters training samples. In our method, a five-layer cascade is used, and for each strong classifier, the numbers of weak classifiers used are 10, 20, 50, 100 and 200, respectively.


    4 EXPERIMENTAL RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURE EXTRACTION FROM...
 3 FEATURE SELECTION AND...
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 
In this section, we will evaluate the performance of our proposed algorithm, namely PromoterExplorer, for promoter identification based on different databases. The training set is from the Eukaryotic Promoter Database (EPD), Release 86 (Schmid et al., 2006), and the testing databases include EPD, the database of transcription start sites (DBTSS), Version 5.2.0 (Suzuki et al., 2002), six GenBank genomic sequences and human chromosome 22 (http://www.sanger.ac.uk/HGP/Chr22).

For training, the positive samples are 2426 promoter sequences in EPD, which are from 200 bp upstream to 50 bp downstream of the TSS. The negative samples are randomly extracted sequences of 250 bp, which are out of the range (–1000–1000) relative to the TSS locations. Here we do not divide the negative samples into different classes according to their categories, i.e. intron, exon or 3'-untranslated region (3'-UTR), but combine them for training. Furthermore, we extract these non-promoter sequences from the EPD instead of using annotated intron, exon or 3'-UTR sets. This can ensure that the training sequences and the testing set are close. In fact, for a DNA sequence to be analyzed, the positions of intron, exon, promoter and 3'-UTR are unknown; also, the lengths of these patterns vary. Therefore, for a segment to be classified, the characters may be from different kinds of patterns. Certainly, this mechanism cannot ensure that there are no partial promoter segments in the negative samples. However, considering the very small percentage of promoters in a DNA sequence and the large number of negative samples for training, i.e. 11 515, the useful and effective discriminating features can still be extracted based on the training set for classification. In our experiments, all training sequences are constructed only by A, T, G and C; in other words, the sequencethat includes the letter ‘N’ is excluded.

When testing is performed, an input DNA sequence is divided into a set of segments of 250 bp, which overlap each other with a 10 bp shift. As described in Section 2, features including the local distribution of pentamers, the positional CpG island features, and the digitized DNA sequence are obtained, followed by a cascade AdaBoost for classification. From Equation (10), if the final output is larger than zero, a TSS candidate is marked. Those TSS candidates, which have no more than 1000 nt apart from their closest neighboring prediction should be merged into a cluster. Then a new TSS prediction is used to represent this cluster, which is obtained by averaging all the TSS candidates within the cluster. In order to fairly compare the performance of PromoterExplorer with other methods, for DPF, the minimum length of gaps in the output predictions is set as 1000, and a similar merging mechanism is also adopted for other methods, such as PromoterInspector, FirstEF and DragonGSF (Bajic et al., 2003). Similar to the criteria proposed in (Bajic et al., 2004), when one or more predictions fall in the region (–2000, +2000) relative to the reference TSS location, a true positive is counted; otherwise the predictions are denoted as false positives. When the known gene is missed by this count, it represents a false negative. The effect of different distance criteria for promoter prediction will be discussed in the following section. Sensitivity (Sn) and specificity (Sp) are two criteria widely used to evaluate the performance of promoter prediction program, which are defined as follows:

Formula 11(11)

Formula 12(12)
where TP, FP and FN denote the numbers of true positives, false positives and false negatives, respectively. Generally, the larger the value of Sn, the more false positives are reported, and the smaller the value of Sp. It is a trade-off to balance Sn and Sp. For DPF, the values of Sn can be preset, which is used to control the predictions. In our algorithm, the sensitivity can be modulated by the number of TSS candidates within a cluster. For each cluster to be merged, if the number of TSS candidates within this cluster is larger than a threshold, the merged TSS prediction is considered a true prediction; otherwise, the cluster is removed from the output. Various thresholds will result in different outputs. The larger the value of the threshold, the fewer the false positives, and the fewer the true positives predicted.

4.1 Experimental results based on EPD
In this section, we will test the PromoterExplorer based on the whole 2541 vertebrate promoters in EPD, which include a total of 40 656 000 bp in the genome sequences. The sensitivity-specificity curve is shown in Figure 2.


Figure 2
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2 The sensitivity-specificity curve based on EPD using PromoterExplorer.

 
From Figure 2, we can see that when the sensitivity is 68.5%, the specificity is 68.6%. In this case, the average of the distance between the predicted TSS and the real TSS location is ~320. Because the 2541 vertebrate promoters in EPD have been used for training, we use DBTSS, a different database, to verify the performance of our classifier.

4.2 Experimental results based on DBTSS
DBTSS includes 30 964 promoter sequences. Each sequence includes 1201 bp, which are from –1000 to 201 relative to the TSS locations. As the length of a sequence is short, we consider only the sensitivity, and also investigate the effect of different distance criteria on prediction.

For DBTSS, 17 434 promoters are detected by PromoterExplorer. In other words, the sensitivity is 56.3%, which is lower than the result in Section 4.1. This is because the size of DBTSS is much larger than EPD, and most of the testing samples are unused for training.

We calculate the distance between each prediction and the corresponding correct position of a promoter. The histogram is shown in Figure 3. We can see that different distance criteria will result in different predictions. Generally, a larger distance criterion will increase the sensitivity; however, more false positives will also be reported. Because most distances (83.3%) are >300 bp, PromoterExplorer has the ability to exactly predict the position of TSS. This is a very useful characteristic for gene annotation. Sections 4.3 and 4.4 contain further discussions of this property.


Figure 3
View larger version (29K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3 The distribution of the distance between the prediction and the real position of TSS.

 
4.3 Experimental results based on GenBank
We also evaluate the performance of PromoterExplorer using another test set, which contains six GenBank genomic sequences with a total length of 1.38 Mb and 35 known TSSs in these sequences. In (Bajic et al., 2002; Scherf et al., 2000), PromoterInspector and DPF outperform other methods based on this set. In another study (Bajic et al., 2004), DragonGSF and FirstEF achieved the best performance in the test. Thus, in this paper, we compare our method with these four methods.

Figure 4 shows the sensitivity-specificity curves of these five algorithms. In Figure 4a, the distance criterion is set as (–2000, 2000), and in Figure 4b, the criterion (–500, 500) is adopted. We can see that when a large distance criterion is used, DragonGSF achieves the best performance, and PromoterExplorer performs second best. The average distances between the predicted TSS and the real TSS for PromoterExplorer, PromoterInspector, DPF, DragonGSF and FirstEF are 440 (Sn = 45.7%), 586 (Sn = 40.0%), 509 (Sn = 57.1%), 571 (Sn = 45.7%) and 865 (Sn = 74.3%), respectively. This shows that PromoterExplorer has the highest locational precision. When a smaller distance criterion is adopted, PromoterExplorer outperforms the others. This result coincides with the observation in Section 4.2, i.e. most of the predictions are close to the real TSS locations.


Figure 4
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4 The sensitivity-specificity curves based on Genbank. (a) Distance criterion is (–2000, 2000). (b) Distance criterion is (–500, 500).

 
4.4 Experimental results based on human chromosome 22
Finally, we evaluate PromoterExplorer on Release 3 of the human chromosome 22, which includes 34 748 585 bp and 393 known genes. The annotation data were produced by the Chromosome 22 Gene Annotation Group at the Sanger Center. The comparative experimental results are shown in Figure 5. Similar to the observation in Section 4.3, PromoterExplorer performs the second best in case of a large distance criterion, and outperforms the others with a small distance criterion. When the distance criterion is (–2000, 2000), the average distances between the predicted TSS and the real TSS for PromoterExplorer, PromoterInspector, DPF, DragonGSF and FirstEF are 306 (Sn = 63.9), 351 (Sn = 63.6), 401 (Sn = 67.9), 376 (Sn = 66.2) and 371 (Sn = 41.2), respectively.


Figure 5
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5 The sensitivity-specificity curves based on Human Chromosome 22. (a) Distance criterion is (–2000, 2000). (b) Distance criterion is (– 500, 500).

 

    5 CONCLUSIONS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURE EXTRACTION FROM...
 3 FEATURE SELECTION AND...
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 
In this paper, we have proposed an effective promoter identification algorithm, which is called PromoterExplorer. In our approach, different kinds of features, i.e. the local distribution of pentamers, positional CpG island features and digitized DNA sequence, are extracted and combined. Then a cascade AdaBoost algorithm is adopted to perform feature selection and classifier training. An advantage of our algorithm is that the most ‘informative’ features in the different classifying stages can be selected for classification. For each feature, a weak classifier is built, and the importance of the classifier is represented by a weight. A number of weak classifiers combine to form a strong classifier, which can achieve a better performance. In order to reduce the number of false positives, we train a sequence of strong classifiers to build a cascade structure. PromoterExplorer is tested based on large-scale DNA sequences, which are from different databases. Our algorithm has a favorable performance compared with several best-known existing techniques. Our method cannot only achieve a balance between the sensitivity and specificity of the predictions, but also has the capability of more exact TSS localization. Therefore, it can be used to detect unknown prompter locations in a new DNA sequence.

In our future research, more different kinds of features should be considered, and their intrinsic relations should be considered in feature selection and classifier design. Because our method is based on the statistical characteristics of the training set, more representative training data should be helpful for extracting features, and the algorithm should be tested based on the whole human genome sequence to evaluate the performance. Also, our method can be used for detecting other functional segments, e.g. exon, intron or 3'-UTR, within a DNA sequence.


    Acknowledgments
 
This work is supported by research grants from City University of Hong Kong (Projects 9010003 and 9610034).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alex Bateman

Received on July 12, 2006; revised on August 28, 2006; accepted on September 11, 2006

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 FEATURE EXTRACTION FROM...
 3 FEATURE SELECTION AND...
 4 EXPERIMENTAL RESULTS
 5 CONCLUSIONS
 REFERENCES
 

    Antequera, F. and Bird, A. (1993) Number of CpG islands and genes in human and mouse. Proc. Natl Acad. Sci. USA, . 90, 11995–11999[Abstract/Free Full Text].

    Bajic, V.B. and Seah, S.H. (2003) Dragon gene start finder: an advanced system for finding approximate locations of the start of gene transcriptional units. Genome Res, . 13, 1923–1929[Abstract/Free Full Text].

    Bajic, V.B., et al. (2002) An intelligent system for vertebrate promoter recognition. IEEE Intell. Syst. Mag, . 17, 64–70.

    Bajic, V.B., et al. (2004) Promoter prediction analysis on the whole human genome. Nat. Biotechnol, . 22, 1467–1473[CrossRef][ISI][Medline].

    Claverie, J.M. (1997) Computational methods for the identification of genes in vertebrate genomic sequences. Hum. Mol. Genet, . 6, 1735–1744[Abstract/Free Full Text].

    Davuluri, R.V., et al. (2001) Computational identification of promoters and first exons in the human genome. Nature Genet, . 29, 412–417[CrossRef][ISI][Medline].

    Demeler, B. and Zhou, G.W. (1991) Neural network optimization for E. coli promoter prediction. Nucleic Acids Res, . 19, 1593–1599[Abstract/Free Full Text].

    Duda, R.O., Hart, P.E., Stork, D.G. Pattern Classification, (2001) 2nd edn John Wiley & Sons Inc.

    Freund, Y. and Schapire, R.E. (1997) A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci, . 55, 119–139.

    Gardiner-Garden, M. and Frommer, M. (1987) CpG islands in vertebrate genomes. J. Mol. Biol, . 196, 261–282[CrossRef][ISI][Medline].

    Hiisila, H. and Bingham, E. (2004) Dependencies between transcription factor binding sites: comparison between ICA, NMF, PLSA and frequent sets. Proc. IEEE Int. Conf. Data Mining, 4, 114–121.

    Knudsen, S. (1999) Promoter2.0: for the recognition of PoIII promoter sequences. Bioinformatics, 15, 356–361[Abstract/Free Full Text].

    Lander, E.S., et al. (2001) Initial sequencing and analysis of the human genome. Nature, 409, 860–921[CrossRef][Medline].

    Mahadevan, I. and Ghosh, I. (1994) Analysis of E. coli promoter structures using neural networks. Nucleic Acids Res, . 22, 2158–2165[Abstract/Free Full Text].

    Matsuyama, Y. and Kawamura, R. (2004) Promoter recognition for E. coli DNA segments by independent component analysis. Proc. Comput. Syst. Bioinformatics Conf, . 686–691.

    Ohler, U., et al. (2002) Computational analysis of core promoters in the Drosophila genome. Genome Biol, . 3, research0087.1-0087.12.

    Parbhane, R.V., et al. (2000) ANN modeling of DNA sequences: new strategies using DNA shape code. Comp. Chem, . 24, 699–711[CrossRef].

    Pedersen, A.G., et al. (1999) The biology of Eukaryotic promoter prediction: a review. Comp. Chem, . 23, 191–207.

    Scherf, M., et al. (2000) Highly specific localization of promoter regions in large genomic sequences by PromoterInspector: a novel context analysis approach. J. Mol. Biol, . 297, 599–606[CrossRef][ISI][Medline].

    Schmid, C.D., et al. (2006) EPD in its twentieth year: towards complete promoter coverage of selected model organisms. Nucleic Acids Res, . 34, 82–85.

    Solovyev, V.V. and Shahmuradov, I.A. (2003) PromH: promoters identification using orthologous genomic sequences. Nucleic Acids Res, . 31, 3540–3545[Abstract/Free Full Text].

    Suzuki, Y., et al. (2002) DBTSS: DataBase of human transcriptional start sites and full-length cDNAs. Nucleic Acids Res, . 30, 328–331[Abstract/Free Full Text].

    Venter, J.C., et al. (2001) The sequence of the human genome. Science, 291, 1304–1351[Abstract/Free Full Text].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
T. Abeel, Y. Saeys, P. Rouze, and Y. Van de Peer
ProSOM: core promoter prediction based on unsupervised clustering of DNA physical profiles
Bioinformatics, July 1, 2008; 24(13): i24 - i31.
[Abstract] [PDF]


Home page
Genome Res.Home page
T. Abeel, Y. Saeys, E. Bonnet, P. Rouze, and Y. Van de Peer
Generic eukaryotic core promoter prediction using structural features of DNA
Genome Res., February 1, 2008; 18(2): 310 - 323.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/22/2722    most recent
btl482v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (2)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Xie, X.
Right arrow Articles by Yan, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Xie, X.
Right arrow Articles by Yan, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?