Bioinformatics Advance Access originally published online on September 16, 2004
Bioinformatics 2005 21(5):669-670; doi:10.1093/bioinformatics/bti030
ESTminer: a Web interface for mining EST contig and cluster databases
1 Center for Applied Genetic Technologies, University of Georgia, 111 Riverbend Road, Athens, GA 30602, USA
2 Janie Pumphrey 2028 Spruce St Apt 3R, Philadelphia, PA 19103, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
Summary: ESTminer is a Web application and database schema for interactive mining of expressed sequence tag (EST) contig and cluster datasets. The Web interface contains a query frame for selecting contigs/clusters that have a specific cDNA library makeup or a threshold number of members. Results are displayed as color-coded tree nodes, in which the color indicates the fractional size of each cDNA library component. The nodes are expandable, revealing library statistics as well as EST or contig members, with links to sequence data, GenBank records or user-configurable targets. The interface also supports nested queries, in which the result set of one query is further filtered by a subsequent query.
Availability: ESTminer is implemented in Java/JSP and the package, including MySQL and Oracle schema creation scripts, is available from http://cggc.agtec.uga.edu/Data/download.asp
Contact: agingle{at}uga.edu
| INTRODUCTION |
|---|
ESTminer is a Web application for interactive mining of expressed sequence tag (EST) contig and cluster datasets. The importance of EST assembly and clustering is well established, as evidenced by the number of data processing pipelines, such as STACK (Christoffels et al., 2001) and XGI (http://www.ncgr.org/xgi), and database resources, such as UniGene (Wheeler et al., 2003) and the TIGR gene indices (Quackenbush et al., 2001), that have been developed for these data types. Mining these datasets is an important component of gene discovery and expression profiling. However, their typically large size makes it difficult to develop compact displays that both provide an overview and support focused queries for identifying expressed genes associated with particular tissues or experimental conditions.
| DESCRIPTION |
|---|
ESTminer was originally developed as a component (http://cggc.agtec.uga.edu/estMiner/estMiner.jsp) of the CGGC (http://cggc.agtec.uga.edu/) resource for sorghum, to provide user-friendly querying and visualization of the large volume of EST data on the website. Similarly, the downloadable interface allows users to access their own EST contig, cluster and unigene datasets stored in a MySQL or Oracle relational database management system (RDBMS). The downloadable installation package includes schema creation scripts and sample data.
Views of the associated interface components are shown in Figure 1. A query interface (Fig. 1A) allows the selection of contigs/clusters with a specific library makeup or a threshold number of members. The interface also allows nested queries, in which the result of one query is further filtered by a subsequent query, thereby enhancing its drill-down capabilities. In addition, contigs and clusters can be selected from an alphanumerically ordered list (Fig. 1C and D) or by name and GenBank accession ID (Fig. 1E). Query results are displayed in an expandable tree structure (Fig. 1B) in which each contig or cluster is represented by a dynamic color-coded bar graph indicating the relative number of members from each of its cDNA library components. The nodes are expandable, revealing library statistics, sequence data and GenBank records, as well as expandable subnodes that correspond to EST members for contigs, or to singleton ESTs and contig members for similarity-based clusters.
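As an illustration of the quantity behind the color-coded bar graph, the following minimal sketch computes the fraction of a contig's members contributed by each cDNA library. It is not taken from the ESTminer source; the class name `LibraryFractions`, the method `fractions` and the library labels are assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the ESTminer package.
final class LibraryFractions {
    private LibraryFractions() {}

    /**
     * Takes one library label per member EST and returns, for each library,
     * the fraction of members it contributes. Libraries are kept in order of
     * first appearance so that bar segments would be drawn consistently.
     */
    static Map<String, Double> fractions(List<String> memberLibraries) {
        // Count members per library.
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String library : memberLibraries) {
            counts.merge(library, 1, Integer::sum);
        }
        // Convert counts to fractions of the total member count.
        Map<String, Double> result = new LinkedHashMap<>();
        int total = memberLibraries.size();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            result.put(entry.getKey(), entry.getValue() / (double) total);
        }
        return result;
    }
}
```

For a contig with three members from a library labeled "leaf" and one from "root", `fractions` would report 0.75 and 0.25, which a display layer could then map to segment widths and colors.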
In the CGGC environment, the interface allows users to search for candidate sorghum genes associated with environmental conditions (e.g. biotic and abiotic stresses), species, tissues and developmental stages. The query interface (Fig. 1A) allows users to filter results by the presence or absence of any combination of cDNA libraries, as well as by setting ranges on contig or cluster size. In addition, a range of clustering parameters, such as the alignment length or percentage identity threshold for BLAST-based clustering, can be selected to meet the specific needs of an individual study. The downloadable version provides the same flexibility and is compatible with datasets that involve multiple clustering algorithms/methodologies.
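The presence/absence and size-range criteria can be sketched as a simple predicate over contig summaries. This is an in-memory illustration only, under the assumption that each contig is summarized by its member count and the set of libraries represented in it; ESTminer itself evaluates queries against the database, and the names `ContigFilter`, `Contig` and `Criteria` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical in-memory model, not the ESTminer implementation.
final class ContigFilter {
    private ContigFilter() {}

    /** Summary of one contig or cluster: its size and the libraries represented in it. */
    static final class Contig {
        final String name;
        final int memberCount;
        final Set<String> libraries;

        Contig(String name, int memberCount, String... libraries) {
            this.name = name;
            this.memberCount = memberCount;
            this.libraries = new HashSet<>(Arrays.asList(libraries));
        }
    }

    /** Libraries that must be present, libraries that must be absent, and an inclusive size range. */
    static final class Criteria {
        final Set<String> required;
        final Set<String> excluded;
        final int minSize;
        final int maxSize;

        Criteria(Set<String> required, Set<String> excluded, int minSize, int maxSize) {
            this.required = required;
            this.excluded = excluded;
            this.minSize = minSize;
            this.maxSize = maxSize;
        }

        boolean matches(Contig contig) {
            return contig.libraries.containsAll(required)
                    && Collections.disjoint(contig.libraries, excluded)
                    && contig.memberCount >= minSize
                    && contig.memberCount <= maxSize;
        }
    }

    /** Returns the contigs in the input that satisfy the criteria, preserving input order. */
    static List<Contig> apply(List<Contig> input, Criteria criteria) {
        List<Contig> output = new ArrayList<>();
        for (Contig contig : input) {
            if (criteria.matches(contig)) {
                output.add(contig);
            }
        }
        return output;
    }
}
```

Because `apply` takes and returns a list, passing the output of one call as the input of the next models a nested query: the second set of criteria only ever sees contigs that survived the first.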
The ESTminer application was developed for a multi-tier Internet architecture and can be deployed on platforms compatible with the Apache/Tomcat Web/application server and either the MySQL or Oracle RDBMS. So far, we have tested it successfully on Windows and Linux operating systems. The project was developed with JBuilder 7 (Borland) and is structured as Object-Oriented CVM, with Java JSPs and servlets generating the front-end interface components, such as the color-coded bar graph tree nodes, and Java classes handling all non-database computing functions. All SQL queries are encapsulated in two Java classes so that they can be modified easily to accommodate changes in the database schema or RDBMS. The database schema contains tables for cDNA library, EST sequence, contig and cluster data; the Oracle schema additionally employs table partitioning and materialized views to improve performance with large datasets.
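The benefit of keeping SQL in dedicated classes can be illustrated with a small sketch in which one class owns the query text and hides the difference between the two supported RDBMSs. The class `ContigQueries`, the table `contig` and the columns `contig_name` and `member_count` are assumptions for this example and do not reflect the actual ESTminer schema or classes; only the row-limiting syntax (`LIMIT` in MySQL, `ROWNUM` in Oracle) is standard for the respective systems.

```java
// Hypothetical query holder, not one of the two ESTminer SQL classes.
final class ContigQueries {
    enum Dialect { MYSQL, ORACLE }

    private final Dialect dialect;

    ContigQueries(Dialect dialect) {
        this.dialect = dialect;
    }

    /**
     * Parameterized SQL for the largest contigs above a size threshold.
     * Placeholder 1 is the minimum member count, placeholder 2 the maximum
     * number of rows to return. Table and column names are assumed.
     */
    String largestContigsSql() {
        String base = "SELECT contig_name, member_count FROM contig "
                + "WHERE member_count >= ? ORDER BY member_count DESC";
        switch (dialect) {
            case MYSQL:
                // MySQL limits rows with a trailing LIMIT clause.
                return base + " LIMIT ?";
            case ORACLE:
                // Oracle applies ROWNUM to an ordered subquery.
                return "SELECT * FROM (" + base + ") WHERE ROWNUM <= ?";
            default:
                throw new IllegalStateException("Unsupported dialect: " + dialect);
        }
    }
}
```

With this arrangement, calling code would prepare the returned string with a `java.sql.PreparedStatement` and bind the two parameters in the same order for either RDBMS, so a schema or dialect change touches only the query class.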
| FUTURE PLANS |
|---|
At the time of writing, we have added fuzzy search capabilities to the name-based lookup form (Fig. 1E), a Perl script loader for populating the MySQL schema from a combination of file formats, and pop-up help tips to supplement the existing documentation. These will be incorporated in upcoming versions of the installation package. We are planning to develop a version compatible with the GMOD CHADO schema (http://www.gmod.org/), which will be made available as a separate installation package. We plan to leverage the developing GMOD schema standards to facilitate more seamless data exchange with other databases and integration with related GMOD tools. We are also considering the development of an alignment viewer for EST contigs, a feature that is not currently part of the interface package.
| Acknowledgments |
|---|
The authors wish to thank the collaborating laboratories for providing data, access to Web resources and advice. We are grateful to the National Science Foundation, the Georgia Research Alliance, the National Grain Sorghum Producers and the University of Georgia Research Foundation for financial support.
Received on June 25, 2004; revised on July 30, 2004; accepted on September 9, 2004
| REFERENCES |
|---|
Christoffels, A., van Gelder, A., Greyling, G., Miller, R., Hide, T., Hide, W. (2001) STACK: Sequence Tag Alignment and Consensus Knowledgebase. Nucleic Acids Res., 29, 234-238
Quackenbush, J., Cho, J., Lee, D., Liang, F., Holt, I., Karamycheva, S., Parvizi, B., Pertea, G., Sultana, R., White, J. (2001) The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res., 29, 159-164
Wheeler, D.L., Church, D.M., Federhen, S., Lash, A.E., Madden, T.L., Pontius, J.U., Schuler, G.D., Schriml, L.M., Sequeira, E., Tatusova, T.A., Wagner, L. (2003) Database resources of the National Center for Biotechnology. Nucleic Acids Res., 31, 28-33
This article has been cited by other articles:
S. H. Nagaraj, R. B. Gasser, and S. Ranganathan. A hitchhiker's guide to expressed sequence tag (EST) analysis. Brief Bioinform, January 1, 2007; 8(1): 6-21.