Skip Navigation


Bioinformatics Advance Access originally published online on May 12, 2008
Bioinformatics 2008 24(13):1532-1533; doi:10.1093/bioinformatics/btn225
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/13/1532    most recent
btn225v2
btn225v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Van Bel, M.
Right arrow Articles by Van de Peer, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Van Bel, M.
Right arrow Articles by Van de Peer, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

FunSiP: a modular and extensible classifier for the prediction of functional sites in DNA

Michiel Van Bel 1,2, Yvan Saeys 1,2 and Yves Van de Peer 1,2,*

1Department of Plant Systems Biology, VIB and 2Department of Molecular Genetics, Ghent University, Technologiepark 927, 9052 Gent, Belgium

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 methods
 3 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 

Motivation: Many problems in genome annotation are tackled by using a classification model to predict functional sites such as splice sites, translation start sites or stop codons. Locating the correct position of these sites remains one of the most important but also one of the most difficult issues in the structural annotation of genomes. Most of the software currently in use is written for a very specific problem, thereby limiting the possibilities for reuse.

Summary: We developed a software platform that uses a very general approach towards the classification of functional sites in DNA sequences. The program uses an ab initio approach towards the identification of these sites, and extends SpliceMachine, a previously developed splice site predictor that shows state-of-the-art performance for both donor and acceptor splice site recognition in the human and Arabidopsis thaliana genome.

Availability: The program is developed as a stand-alone Java application, and is available as GPLv3 open-source software. The program, source and documentation can be obtained from the ‘Software’ section at http://bioinformatics.psb.ugent.be/

Contact: Yves.VandePeer{at}psb.ugent.be

Supplementary information: Supplementary data is available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 methods
 3 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 
Identifying functional sites such as splice sites, translation start sites or stop codons in a DNA sequence is the first step in the genome annotation process. Each of these sites can be identified by a short conserved sequence, namely GT/GC for the donor sites, AG/AC for the acceptor sites, ATG for the start sites and TAG/TGA/TAA for stop codons. Evidently, context information is of vital importance when distinguishing between true functional sites and pseudo sites. By extracting various features around each putative site, we gain the ability to classify all sites with a high level of confidence, as was shown in the state-of-the-art performance of SpliceMachine (Degroeve et al., 2005). We now generalized the approach of SpliceMachine and built a modular platform which can predict any type of functional site in a sequence, provided there exists a short conserved sequence (e.g. AG for acceptors) characterizing the site, and provided some training data for this site is available.

The results obtained from the classifier can then be used by a gene prediction tool, such as EuGène (Schiex et al., 2001) to obtain the full gene structure.


    2 methods
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 methods
 3 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 
2.1 Technology
The program is implemented as a stand-alone Java application, making use of the WEKA machine learning library (Witten and Frank, 2005) to perform feature selection and classification, and of the Log4j library (Gupta, 2005) for formatted and directed output. The program requires at least Java version 1.5.0 to work and runs on a whole range of platforms including MacOS, Windows and different flavors of Linux and Unix. A modular design was chosen for the program, leading to an extensible platform that can be used for a broad range of classifications types.

2.2 General outline
The program has several modes of operation, but the ones most frequently used will be either building a classification model or using a previously built model to classify sites of interest in a sequence. Both modes of operation rely on several variables, which are defined in an editable configuration file. The main steps when building a classification model are: (1) collect training data, (2) extract information from the data and (3) build a classification model. The main steps when evaluating a genomic sequence are: (1) extract information from the sequence and (2) classify using the classification model.

2.3 Features
The program provides numerous features, including:

  1. Built-in support for feature selection. By applying univariate feature selection techniques, the predictive performance of the classifier increases significantly (Saeys et al., 2007). Support for saving and loading the results of feature selection itself is also included.
  2. Built-in support for feature extraction from the DNA sequence: the set of possible (but not limiting) features include k-mer patterns, amino acids and RNA secondary structure features.
  3. A general framework and workflow, which is the same for all genomes: (a) collect training data, (b) optimize the extraction methods, (c) build a classifier and (d) use the classifier to distinguish between true or pseudo sites. Detailing the extraction methods and the type of site of interest is implemented in a configurable and extensible way.
  4. The optimization of the extraction methods and the classification of large numbers of genome files can take quite some time on a single machine. These problems are largely parallelizable, and are thus very well suited for distributed computing. Therefore, support for Grid Engine (Sun Microsystems, 2008), a well-known software package for distributed computing and is supported by FunSiP.
  5. A graphical user interface to launch the program, edit the configuration options and evaluate the results is available (Fig. 1). Because the number of configuration options is quite large, extensive documentation and a number of example configuration files is supplied with the program.


Figure 1
View larger version (35K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Display of the graphical user interface of Fun Si P, showing a wide variety of configurable options.

 
2.4 Availability
The program is available at http://bioinformatics.psb.ugent.be/ under ‘Software’. The program is open source and is licensed under the GPLv3 license or later. The program, source code and full documentation are also available at the website. The documentation includes installation notes, a guide on editing configuration files, a description of the internal working of the platform and a guide for extending the platform with user-defined methods for data extraction.


    3 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 methods
 3 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 
The program presented here is an extensible platform for recognizing and classifying functional sites in a genome sequence. Each functional site is assigned a score, which reflects how well it compares with both positive and negative training data. These results can then be used as input for a gene-prediction tool. Table 1 compares the results of FunSip with and without feature selection to SpliceMachine (Degroeve et al., 2005), a state-of-the-art splice site predictor. Values in the table denote the F-measure (harmonic mean of precision and recall; Van Rijsbergen, 1979) on a 10-fold cross-validation experiment using 1000 positive and 10 000 negative examples.


View this table:
[in this window]
[in a new window]

 
Table 1. Performace comparison of FunSiP

 

    ACKNOWLEDGEMENT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 methods
 3 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 
Funding: This work was supported by a grant (G031805) from The Research Foundation-Flanders. Y.S. is a Postdoctoral Researcher of the Research Foundation-Flanders.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Dmitrij Frishman

Received on January 25, 2008; revised on January 25, 2008; accepted on May 6, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 methods
 3 DISCUSSION
 ACKNOWLEDGEMENT
 REFERENCES
 

    Degroeve S, et al. Predicting splice sites from high-dimensional local context representations. Bioinformatics (2005) 21:1332–1340.[Abstract/Free Full Text]

    Gupta S. Pro Apache log4j (2005) 2nd edn. Berkeley, CA: APress.

    Schiex T, et al. EuGène, an eukaryotic gene finder that combines several types of evidence. Lect. Notes Comput. Sci. (2001) Vol. 2066:111–125.[CrossRef]

    Saeys Y, et al. A review of feature selection techniques in bioinformatics. Bioinformatics (2007) 23:2507–2517.[Abstract/Free Full Text]

    Sun Microsystems. (2008) (Last accessed date May 29, 2008). Grid Engine. Available at http://gridengine.sunsource.net/.

    Witten I-H, Frank E. Data Mining: Practical Machine Learning Tools and Techniques (2005) 2nd edn. San Francisco: Morgan Kaufmann.

    Van Rijsbergen C. Information Retrieval. (1979) 2nd edn. London: Butterworths.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/13/1532    most recent
btn225v2
btn225v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Van Bel, M.
Right arrow Articles by Van de Peer, Y.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Van Bel, M.
Right arrow Articles by Van de Peer, Y.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?