

Bioinformatics Advance Access originally published online on July 29, 2004
Bioinformatics 2005 21(5):671-673; doi:10.1093/bioinformatics/bth437

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

DNAFSMiner: a web-based software toolbox to recognize two types of functional sites in DNA sequences

Huiqing Liu*, Hao Han, Jinyan Li and Limsoon Wong

Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613

*To whom correspondence should be addressed.


    Abstract

Summary: DNAFSMiner (DNA Functional Sites Miner) is a web-based software toolbox to recognize functional sites in nucleic acid sequences. The toolbox currently provides two programs: TIS Miner and Poly(A) Signal Miner. TIS Miner predicts translation initiation sites in vertebrate DNA/mRNA/cDNA sequences, and Poly(A) Signal Miner predicts polyadenylation [poly(A)] signals in human DNA sequences. On two benchmark applications, the prediction results are better than those of previously published methods. This good performance is mainly attributable to our unique learning method. DNAFSMiner is available free of charge to academic and non-profit organizations.

Availability: http://research.i2r.a-star.edu.sg/DNAFSMiner/

Contact: huiqing{at}i2r.a-star.edu.sg


    INTRODUCTION
DNA sequences are an important type of biomedical data. They contain many biologically meaningful functional sites, such as transcription start sites, coding regions, translation initiation sites (TISs), splice sites and polyadenylation signals (PASs). These functional sites are important components of the primary structure of genes and play crucial roles in DNA transcription and translation. Accurately identifying them is an important application of computational biology and bioinformatics.

Recently, software programs have been developed to detect TISs or PASs in DNA sequences. For example, ATGpr (Salamov et al., 1998) is a web application that predicts TISs in cDNA sequences using a linear discriminant function that combines several statistical features derived from the sequence. It can be accessed at http://www.hri.co.jp/atgpr/. Polyadq (Tabaska and Zhang, 1999) and Erpin (Legendre and Gautheret, 2003) are two programs that detect PASs in human DNA and mRNA sequences by analysing upstream and downstream sequence elements around PASs. Polyadq finds PASs using a pair of quadratic discriminant functions. It is available at http://rulai.cshl.org/tools/polyadq/polyadq_form.html. Erpin was built on a bioinformatics analysis of expressed sequence tag (EST) and genomic sequences to characterize biases in the regions encompassing 600 nt around the cleavage site. The program can be found at http://tagc.univ-mrs.fr/erpin/.

Our DNAFSMiner is also a web-based toolbox. It recognizes TISs in vertebrate DNA/mRNA/cDNA sequences (via TIS Miner) and PASs in human DNA sequences [via Poly(A) Signal Miner]. The software implements our unique statistical and data mining algorithms. Specifically, our method for constructing the prediction models consists of three steps (http://www.bioinfo.de/isb/2004/04/0022/): (1) generating a large number of candidate features from the sequences, (2) selecting relevant features and (3) integrating the selected features with a learning algorithm to build a classification and prediction system. The prediction models are trained and evaluated on several datasets, including public ones and ones that we established ourselves. For example, TIS Miner was trained on a set of well annotated and verified sequences and evaluated on some recently published data (Liu et al., 2004). The prediction results achieved by our method are comparable or superior to previously reported ones. See Liu et al. (2004) for a performance comparison of our model with ATGpr on TIS prediction, and Liu et al. (2003) for a comparison of our model with Erpin and Polyadq on PAS prediction.
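Step (1) can be illustrated with a minimal Python sketch that counts k-grams (runs of k consecutive nucleotides) in windows upstream and downstream of a candidate site. This is an illustration under stated assumptions, not the toolbox's actual code: the function name `kgram_features`, the default `k`, the window size and the 3-nt candidate length are all hypothetical choices, and the amino acid patterns used by the published method are omitted.

```python
from collections import Counter
from itertools import product


def kgram_features(seq, site, k=2, window=9):
    """Count nucleotide k-grams upstream and downstream of a candidate site.

    `site` is the 0-based index of the first base of a 3-nt candidate
    (e.g. ATG). The window size and k are illustrative assumptions.
    """
    seq = seq.upper()
    upstream = seq[max(0, site - window):site]
    downstream = seq[site + 3:site + 3 + window]  # skip the candidate itself
    features = {}
    for label, region in (("up", upstream), ("down", downstream)):
        # Slide a window of width k over the region and tally each pattern.
        counts = Counter(region[i:i + k] for i in range(len(region) - k + 1))
        # Emit every possible k-gram so all samples share one feature space.
        for gram in map("".join, product("ACGT", repeat=k)):
            features[f"{label}_{gram}"] = counts[gram]
    return features


example = kgram_features("GCCACCATGGCGGCG", site=6)
```

Each candidate site becomes one integer vector of fixed length (here 2 x 16 features), which is the form a feature selection step and a classifier can consume.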


    TOOLBOX OVERVIEW
Technologies. To construct the prediction models, TIS Miner and Poly(A) Signal Miner first generate candidate features from k-gram nucleotide or amino acid patterns, that is, patterns of k consecutive nucleotide or amino acid symbols. The occurrence rate of a pattern within a certain number of base pairs upstream and downstream of a candidate functional site is used as the value of the corresponding feature. In this new feature space, the original nucleotide sequences are transformed into data in the form of integer values. In the second step, an entropy-based feature selection algorithm is applied to the transformed training data to select the features that discriminate most sharply between true and false functional sites. In the third step, a support vector machine (SVM) is used to build the prediction model. As is well known, an SVM selects a small number of critical boundary samples from each class of the training data and builds a discriminant function that separates them as widely as possible. The decision function for a test sample T is usually defined as:

$$f(T) = \sum_{i} \alpha_i \, y_i \, K(T, x_i) + b$$

where $x_i$ are the training data points, $y_i$ is the class label of $x_i$ (a true functional site is mapped to 1 and a non-functional site to –1), and $b$ and $\alpha_i \ge 0$ are parameters determined through training. K(·) is the kernel function, which defines an inner product. The SVM uses the kernel function to map the training data into a higher dimensional space when linear separation is impossible in the original one. In our applications, f(T) > 0 if the sample T is more likely to be a functional site, and f(T) < 0 if T is more likely to be a non-functional site. As f(T) is an unbounded function, we propose a function s(T) to normalize f(T):

$$s(T) = \frac{1}{1 + e^{-f(T)}}$$

It is easy to see that s(T) normalizes f(T) into the range (0,1). For each candidate functional site, we use the score s(T) to make the prediction. Note that if f(T) > 0 then s(T) > 0.5, and if f(T) < 0 then s(T) < 0.5. For more information about the background of the three-step method above and the datasets used for training and validating the models, please refer to our recent publications (Liu and Wong, 2003; Liu et al., 2003; Liu et al., 2004).
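The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the toolbox's implementation: the RBF kernel, its `gamma` value and all function names are assumptions, and the normalization is assumed to be the logistic function, which satisfies the stated properties (range (0,1), and s(T) > 0.5 exactly when f(T) > 0).

```python
import math


def rbf_kernel(u, v, gamma=0.5):
    """Gaussian (RBF) kernel; the kernel choice and gamma are assumptions."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def decision_value(sample, support_vectors, labels, alphas, bias, kernel=rbf_kernel):
    """f(T): sum over support vectors of alpha_i * y_i * K(T, x_i), plus b."""
    total = sum(a * y * kernel(sample, x)
                for a, y, x in zip(alphas, labels, support_vectors))
    return total + bias


def normalized_score(f_value):
    """Map an unbounded decision value into (0, 1) with a logistic function.

    The two branches are algebraically identical; splitting them avoids
    overflow in math.exp for large negative inputs.
    """
    if f_value >= 0:
        return 1.0 / (1.0 + math.exp(-f_value))
    e = math.exp(f_value)
    return e / (1.0 + e)


# Toy model with two support vectors, one per class (made-up numbers).
f = decision_value([0.0, 0.0],
                   support_vectors=[[0.0, 0.0], [1.0, 1.0]],
                   labels=[1, -1], alphas=[1.0, 1.0], bias=0.0)
score = normalized_score(f)
```

In this toy case the sample coincides with the positive support vector, so f is positive and the score lands above 0.5.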

Input. For prediction, both TIS Miner and Poly(A) Signal Miner require a nucleic acid sequence, which can be submitted in either raw or FASTA format through our website. Each submitted sequence is limited to 50 000 base pairs to avoid long waiting times for users. The ‘Number of predictions’ is the number of top-scoring candidates of the predicted functional site that will be displayed on the result page (the default is 5). When predicting PASs, users can also select a hexamer poly(A) signal consensus other than the default ‘AATAAA’. The options include ‘ATTAAA’ and any variant of the ‘NNTANA’ type.
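As a rough sketch of how such input might be handled, the Python below accepts raw or FASTA text, enforces the 50 000 bp limit and locates candidate sites for a chosen motif, treating 'N' as any nucleotide. The function names and the error behaviour are hypothetical; the web server's actual parsing rules are not described in this article.

```python
import re

MAX_BASES = 50_000  # per-sequence limit stated in the text


def read_sequence(text):
    """Return an upper-case sequence from raw or single-record FASTA input."""
    lines = [ln.strip() for ln in text.strip().splitlines()]
    # Drop blank lines and the FASTA header line (starts with '>').
    body = [ln for ln in lines if ln and not ln.startswith(">")]
    seq = re.sub(r"\s+", "", "".join(body)).upper()
    if len(seq) > MAX_BASES:
        raise ValueError(f"sequence has {len(seq)} bases; limit is {MAX_BASES}")
    return seq


def find_candidates(seq, motif="AATAAA"):
    """Return 0-based start positions of every (possibly overlapping) match.

    'N' in the motif matches any of A, C, G, T, so 'NNTANA' covers the
    hexamer variants mentioned in the text.
    """
    pattern = "(?=" + motif.upper().replace("N", "[ACGT]") + ")"
    return [m.start() for m in re.finditer(pattern, seq)]


seq = read_sequence(">example record\nggaataaagg\nccattaaacc\n")
default_hits = find_candidates(seq)             # AATAAA only
variant_hits = find_candidates(seq, "NNTANA")   # any NNTANA-type hexamer
```

The lookahead in the regular expression lets overlapping candidates be reported, which matters for short motifs such as ATG repeats.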

Output. The output of the TIS Miner is displayed as a table with six columns described below, while the output of the Poly(A) Signal Miner is also a table but with only three columns, i.e. the columns (1), (2) and (3) of the following description. Figure 1 shows the output page of the TIS Miner.

  1. No. of ATG(s)/AATAAA(s) from the 5' end. The number i in this column indicates that the corresponding candidate is the i-th candidate functional site from the 5' end. In general, a sequence may contain multiple candidates of the functional site [e.g. ATG for TIS and AATAAA for poly(A) signal].
  2. Score. This column shows the likelihood score, ranging in (0,1), of the prediction that ‘the corresponding candidate is a true functional site’. It is given by the prediction model built by SVM on the training sequences. The higher the score is, the more likely the corresponding candidate is a true functional site. We also provide the information of accuracy, sensitivity, specificity and precision under different thresholds of the score based on our validation results, for both the TIS Miner and the Poly(A) Signal Miner. Table 1 is a summary of this information for the TIS Miner. For example, if the threshold is set as 0.6 (i.e. if the prediction score of a candidate is greater than 0.6, then it will be predicted as a true TIS; otherwise, it will be predicted as a non-TIS), the accuracy, sensitivity, specificity and precision are 72.2, 54.6, 89.7 and 84.1%, respectively.
  3. Position (bp). This column is the position of the corresponding candidate in the submitted nucleic acid sequence.
  4. Identity to Kozak consensus [AG]XXATGG. According to Kozak's weight matrix (Kozak, 1987) developed for TIS prediction, a G residue tends to follow a true TIS, while an A or G residue tends to be found 3 nt upstream of a true TIS. This column shows how well the candidate ATG fits this consensus.
  5. Is any ATG in 100 base pairs upstream? This column indicates whether an ATG exists within the 100 bp upstream of the candidate.
  6. Is any in-frame stop codon in 100 base pairs downstream? This column indicates whether an in-frame stop codon is found within the 100 bp downstream of the candidate.
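Columns (4) to (6) are simple sequence checks, which the following Python sketch illustrates. It is a minimal reading of the column descriptions above, assuming 0-based positions, the three standard stop codons and a window that starts right after the ATG; the function names are hypothetical and the server's exact boundary conventions may differ.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def kozak_identity(seq, atg_pos):
    """Return (purine 3 nt upstream?, G right after the ATG?)."""
    minus3 = atg_pos >= 3 and seq[atg_pos - 3] in ("A", "G")
    plus4 = atg_pos + 3 < len(seq) and seq[atg_pos + 3] == "G"
    return minus3, plus4


def has_upstream_atg(seq, atg_pos, span=100):
    """Is there an ATG lying entirely within `span` bp upstream?"""
    return "ATG" in seq[max(0, atg_pos - span):atg_pos]


def has_inframe_stop(seq, atg_pos, span=100):
    """Is there a stop codon in the candidate's reading frame downstream?"""
    start = atg_pos + 3                    # first codon after the ATG
    end = min(len(seq), start + span)
    # Step by 3 so only codons in frame with the ATG are examined.
    return any(seq[i:i + 3] in STOP_CODONS for i in range(start, end - 2, 3))


seq = "GCCACCATGGCGTAAGG"
atg = seq.find("ATG")
row = (kozak_identity(seq, atg), has_upstream_atg(seq, atg), has_inframe_stop(seq, atg))
```

For this toy sequence the candidate matches both consensus positions, has no upstream ATG and is followed by an in-frame TAA.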



Fig. 1 The output page of TIS Miner.

 

Table 1 TIS Miner—overall accuracy, sensitivity, specificity and precision under different thresholds of the score based on the validation results on Human Chromosome data (Liu et al. 2004)
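The four measures named in the Table 1 caption can be computed from scored candidates as in the Python sketch below. It assumes the standard confusion-matrix definitions (sensitivity = TP/(TP+FN), specificity = TN/(TN+FP), precision = TP/(TP+FP)); the article does not spell these out, and the scores and labels in the example are made up. The figures quoted above for a threshold of 0.6 appear consistent with these definitions if the validation set holds roughly equal numbers of true and false sites.

```python
def ratio(numerator, denominator):
    """Safe division; returns NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def threshold_metrics(scores, labels, threshold):
    """Accuracy, sensitivity, specificity and precision at one threshold.

    A candidate is predicted as a true site when its score is strictly
    greater than the threshold, as described in the text.
    """
    tp = fp = tn = fn = 0
    for score, is_true_site in zip(scores, labels):
        predicted = score > threshold
        if predicted and is_true_site:
            tp += 1
        elif predicted:
            fp += 1
        elif is_true_site:
            fn += 1
        else:
            tn += 1
    return {
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
    }


# Hypothetical validation scores: three true sites, three false ones.
metrics = threshold_metrics(
    scores=[0.9, 0.7, 0.4, 0.2, 0.65, 0.1],
    labels=[True, True, True, False, False, False],
    threshold=0.6,
)
```

Raising the threshold generally trades sensitivity for specificity and precision, which is why a table of several thresholds is more informative than a single operating point.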

 
Future development. As a toolbox to recognize functional sites in DNA sequences, DNAFSMiner is expected to provide functions to identify other functional sites, such as splice sites. We are working on this. In addition, we plan to expand the system incrementally with newer sequences, in particular to make it work on additional organisms not covered by the current datasets.

Received on April 30, 2004; revised on July 4, 2004; accepted on July 23, 2004

    REFERENCES

    Kozak, M. (1987) An analysis of 5'-noncoding sequences from 699 vertebrate messenger RNAs. Nucleic Acids Res., 15, 8125–8148.

    Legendre, M. and Gautheret, D. (2003) Sequence determinants in human polyadenylation site selection. BMC Genomics, 4, 7.

    Liu, H., Han, H., Li, J. and Wong, L. (2003) An in-silico method for prediction of polyadenylation signals in human sequences. Proceedings of 14th International Conference on Genome Informatics (GIW 2003), pp. 84–93.

    Liu, H., Han, H., Li, J. and Wong, L. (2004) Using amino acid patterns to accurately predict translation initiation sites. In-Silico Biol., 4, 0022. http://www.bioinfo.de/isb/2004/04/0022/

    Liu, H. and Wong, L. (2003) Data mining tools for biological sequences. J. Bioinform. Comput. Biol., 1, 139–168.

    Salamov, A.A., Nishikawa, T. and Swindells, M.A. (1998) Assessing protein coding region integrity in cDNA sequencing projects. Bioinformatics, 14, 384–390.

    Tabaska, J.E. and Zhang, M.Q. (1999) Detection of polyadenylation signals in human DNA sequences. Gene, 231, 77–86.





