Bioinformatics Advance Access originally published online on November 5, 2004
Bioinformatics 2005 21(5):691-693; doi:10.1093/bioinformatics/bti075
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Published by Oxford University Press 2004.
ESTminer: a suite of programs for gene and allele identification
USDA-ARS, CICGR Ames, IA, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Summary: ESTminer is a collection of programs that use expressed sequence tag (EST) data from inbred genomes to identify unique genes within gene families. The algorithm utilizes Cap3 to perform an initial clustering of related EST sequences to produce a consensus sequence of a gene family. These consensus sequences are then used to collect all ESTs in the original EST library that are related using BLAST. A redundancy based criterion is applied to each EST to identify reliable unique gene-sequences. Using a highly inbred genome as a source of ESTs eliminates the necessity of computing covariance on each polymorphism to identify alleles of the same gene, thus making this algorithm more streamlined than other alternatives which must computationally attempt to distinguish genes from alleles.
Availability: The programs were written in PERL and are freely available at http://www.soybase.org/publication_data/Nelson/ESTminer/ESTminer.html
Contact: nelsonrt{at}iastate.edu
Supplementary information: Figures and dataset can be obtained from: http://www.soybase.org/publication_data/Nelson/ESTminer/ESTminer.html
| INTRODUCTION |
|---|
|
|
|---|
Many crop species are polyploid which means that any gene in the genome may have one or more paralogous copies. These copies vary in number and similarity depending on how many duplication events have occurred, the time elapsed between the events and the genomic mutation rate. For example, the genome of soybean (Glycine max, L. Merr.) has duplicated at least twice during its history (Schlueter et al., 2004). Hybridization-based experiments have indicated that, on average, each soybean gene is present 2.5 times in the genome (Shoemaker et al., 1996). Because of duplication events the genomes of many species tend to be large and internally redundant. Thus, few of their genomes have been fully sequenced and annotated. However, large publicly available EST libraries exist for many species. Because of this, it is important to have a method to analyze ESTs, where they exist, from species with unfinished or un-sequenced genomes in order to study basic questions of organismal biology and evolutionary genetics until genomic sequence data become available.
There are a number of disadvantages in clustering ESTs to identify genes, the two main being (1) for most organisms, EST libraries were made from different genomes/cultivars and (2) a single genome/cultivar could be heterozygous at many loci. Thus most EST libraries really represent an unknown mix of genes and alleles of those genes. This complicates the identification of sequence variations which distinguish closely related genes in family groups from polymorphisms which define alleles of genes within those families. Using only EST sequence data from a duplicated heterozygous genome, it would be difficult at best to reliably distinguish alleles of a single gene in mixtures of EST sequences from very similar paralogous genes. Limiting the gene identification step to ESTs from a highly inbred and essentially homozygous genome should eliminate the need to distinguish alleles of a gene from alleles of highly similar paralogous genes. To circumvent these problems as well as others related to the study of EST sequences, we developed the ESTminer suite of PERL programs to study EST sequences from inbred species. This suite of programs along with an existing EST sequence data from a highly inbred genome will allow us to distinguish between unique gene sequences and alleles of those genes. This makes our method more straightforward, easy to customize and more applicable to laboratories with modest bioinformatic infrastructure than some previously described methods such as CLOBB (Parkinson et al., 2002) which required 6 days to fully analyze a 20K EST and autoSNP which required almost 7 days to analyze a 103K EST dataset (Barker et al., 2003). Furthermore, this method is applicable to any species in which EST libraries are available from a highly homozygous diploid variety, or haploid species, and does not rely on quality scores or trace files for the identification of single nucleotide polymorphisms which may not be available for all datasets.
Algorithm
The ESTminer suite of PERL programs is designed to be run from the command line. Each program produces multiple output files reporting different aspects of the input files as well as commentary on the analysis progress to the terminal. A flowchart of the overall algorithm is presented in Figure 1.
Cap assembly of contigs
The EST data is collected for any inbred or nearly homozygous line. These sequences are clustered using Cap3 (Huang and Madan, 1999) with default options except for minimum overlap size (-o) of 21 bp and clipping range (-y) of 10 to remove from consideration the first and last 10 bases of each sequence. These parameters have profound effects on the behavior of Cap3 and were chosen to provide a lenient assembly of each consensus sequence. This clustering step is designed to cluster ESTs from all closely related genes representing paralogs and other homologous genes into a consensus sequence representing a family of genes related by high sequence similarity. These consensus sequences will be used in BLAST analysis to recover all ESTs which share any sequence similarity above a threshold expectation value. This step allows us to provide a framework to identify as many members of a gene family as possible by BLAST analysis. In genomes which are highly repetitive as the result of multiple rounds of gene, chromosomal and/or genome duplication, this will collect most or all duplicated and highly similar gene transcripts into a single set. These gene family-specific collections of ESTs can then easily be examined for sequence differences that distinguish paralogs.
BLAST analysis
All EST sequences used to construct the Cap3 contigs as well as the contigs themselves were collected into a BLAST (Altschul et al., 1997) database. If ESTs derived from non-inbred genomes are available, these can be added to the database after coding their GenBank accession numbers with a leading XX to distinguish them from ESTs from the original genome. BLAST analysis is used to collect all ESTs which share significant sequence similarity and hence are closely related. The BLAST database is interrogated with each Cap3 contig using an expectation cut-off of 1 x 109 and with filtering of low complexity sequences off. Turning the filtering off is necessary so that all matching bases of a sequence are represented in the BLAST alignment even in areas of low complexity of which there are many in the soybean ESTs. The output option (-m 4) was chosen as it produced the most easily parsed report with all sequence alignments.
Gene identification using SNPfind.pl
The identification of putative gene sequences is accomplished by using the program SNPfind.pl. This program takes each BLAST report and parses the alignment. It then calculates the length of each hit. Any EST's hit to a Cap3 contig sequence which is less than 90% of the EST full length is removed from the alignment. This quality control measure is necessary to remove BLAST hits which result from short regions of high similarity (motifs) that are shared by non-homologous genes. The remaining sequences from the original genome are then examined for the presence of locus (gene) defining sequence differences or locus defining polymorphisms (LDPs). An LDP is defined as a base substitution or insertion/deletion which appears in two or more EST sequences. If an EST represents a unique sequence, non-redundant within the original genome, it is then compared to EST sequences from the other genomes if they were included in the dataset. If this sequence matches any other EST from a different genome at the positions in question, then the EST sequence is considered to be validated and it is returned to the alignment. Any sequence which fails validation is removed from the alignment. If ESTs of other genomes are not included in the dataset, all sequences with a unique sequence will be removed from the alignment. Snpfind.pl produces a number of output files including a tab-delimited file containing accession numbers of each validated EST in a contig group and its parsed sequence. This file is used as input into the program HaplotypeSorter.pl, designed to reduce the data to a sequence representative of each gene in a family group.
Data reduction and haplotype construction using HaplotypeSorter.pl
Since we are dealing with EST sequences derived from a single, highly inbred genome, we assume that each gene in that genome is represented by a single allele of each gene. Thus, each gene sequence represents a single haplotype of each gene. The sequences collected in each gene family group are reduced to a single representative sequence of each gene present in the group in a series of steps. First, all sequences were compared to each other based on sequence identity. Any ESTs with an identical sequence were grouped together. The sequence of the longest member of each group was then chosen to represent the sequence (gene) and the other shorter sequences were removed from consideration. The remaining sequences were compared to each other in a pair-wise fashion. Any potential-gene sequence which was a complete subset of a longer potential-gene subset was also removed from consideration. The remaining sequences represented the longest version of each potential unique gene sequence or haplotype in the group. A consensus sequence was produced for any family group which was represented by a single potential-gene sequence. In families that contained multiple potential-gene sequences, construction of a single consensus sequence for each sequence was not attempted due to ambiguities present in the assembly of closely related sequences which differ by LDPs not present in the overlap region (Fig. 2). Any sequences that remain at this point are considered to be potential-gene haplotypes or pHaps and are considered to represent partial or whole gene sequences. HaplotypeSorter produces a number of output files that keep track of the ESTs coalesced into the individual pHaps as well as a primary output file containing the final pHaps. This file is a tab-delimited flat file which contains the pHap sequence and could be used to create a FASTA formatted database suitable for BLAST similarity searching. A sample of the output from HaplotypeSorter is presented in Figure 3. If ESTs from other genomes are included in the BLAST EST database, this output file can also be used as input into the program AlleleFinder.pl that idenitifies the pHap(s) for which each EST from the other genomes is consistent. This program uses a method similar to that employed by Snpfind.pl to identify allele-defining polymorphisms with the exception that any match to a haplotype is accepted; thus there need not be any redundancy in the observations. Once a sequence has been associated with a potential haplotype, AlleleFinder identifies sequence differences between the potental haplotype and the EST sequence which are potential single nucleotide polymorphisms (SNPs) because they represent sequence differences between homologous loci from different genomes. A sample of the output file is presented in Figure 4.
| RESULTS AND DISCUSSION |
|---|
|
|
|---|
ESTminer has been used to identify 39,443 potential haplotypes present in soybean EST libraries from the cultivar Williams and its near isogenic line Williams82 as well as alleles of genes in other soybean genotypes (unpublished observations).
In contrast to other algorithms, this algorithm is easily modified to use any clustering software since only the file containing the FASTA formatted contigs are used. BLAST is used to gather all related EST sequences as well as to provide a common alignment although the Snpfind program could easily be modified to accept the output of any alignment program. Therefore the ESTminer suite of programs provide a powerful and customizable platform to identify and analyze EST sequences in order to identify putative unique gene sequences and ultimately identify SNPs between alleles of these gene sequences in other non-inbred genomes.
Performance
The performance of the individual steps in the algorithm are hardware-dependent. The clustering step of the algorithm using Cap3 is dependent on the original EST database size and is the most computationally demanding part of the algorithm. Clustering 190K ESTs took approximately 48 h using a Macintosh G3 400 MHz computer with 1.5 GB of RAM. As mentioned above, the use of Cap3 for the initial clustering could be substituted with the output of any clustering program which returns a FASTA formatted consensus sequence run on any available local or internet accessible computer. BLAST analysis took approximately 16 h using a Macintosh dual processor G4 800 MHz. SNPfind required approximately 4 h of running time using a Macintosh dual processor 800 MHz computer with 1.5 GB RAM while HaplotypeSorter only required 2 h on this computer.
| Acknowledgments |
|---|
We wish to thank two anonymous reviewers for their thoughtful review of our manuscript and helpful suggestions. Mention of trade names or commercial products in this article is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture.
Received on June 16, 2004; revised on August 25, 2004; accepted on September 20, 2004
| REFERENCES |
|---|
|
|
|---|
Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., Lipman, D.J. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 33893402
Barker, G., Batley, J., O'Sullivan, H.O., Edwards, K.J., Edwards, D. (2003) Redundancy based detection of sequence polymorphisms in expressed sequence tag data using autoSNP. Bioinformatics, 19, 421422
Huang, X. and Madan, A. (1999) CAP3: a DNA sequence assembly program. Genome Res., 9, 868877
Parkinson, J., Guiliano, D.B., Blaxter, M. (2002) Making sense of EST sequences by CLOBBing them. BMC Bioinformatics, 3, 31[CrossRef][Medline].
Schlueter, J., Dixon, P., Granger, S., Grant, D., Clark, L., Doyle, J.J., Shoemaker, R.C. (2004) Mining EST databases to resolve evolutionary events in major crop species. Genome, 47, 868876[Medline].
Shoemaker, R.C., Polzin, K., Labate, J., Specht, J., Brummer, E.C., Olsen, T., Young, N., Concibido, V., Wilcox, J., Tamulonis, J.P., Kochert, G., Boerma, H.R. (1996) Genome duplication in soybean (Glycine subgenus soja). Genetics, 144, 329338[Abstract].
This article has been cited by other articles:
![]() |
I.-Y. Choi, D. L. Hyten, L. K. Matukumalli, Q. Song, J. M. Chaky, C. V. Quigley, K. Chase, K. G. Lark, R. S. Reiter, M.-S. Yoon, et al. A Soybean Transcript Map: Gene Distribution, Haplotype and Single-Nucleotide Polymorphism Analysis Genetics, May 1, 2007; 176(1): 685 - 696. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. H. Nagaraj, R. B. Gasser, and S. Ranganathan A hitchhiker's guide to expressed sequence tag (EST) analysis Brief Bioinform, January 1, 2007; 8(1): 6 - 21. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

