Bioinformatics Advance Access originally published online on June 9, 2004
Bioinformatics 2004 20(17):2973-2984; doi:10.1093/bioinformatics/bth342
Bioinformatics vol. 20 issue 17 © Oxford University Press 2004; all rights reserved.
EST clustering error evaluation and correction
1 Department of Statistics, Northwestern University, Evanston, IL 60208, USA, 2 Department of Statistics and 3 Department of Biology, Pennsylvania State University, University Park, PA 16802, USA
Received on February 23, 2004; revised on May 13, 2004; accepted on May 18, 2004
Advance Access Publication June 9, 2004
Motivation: The gene expression intensity information conveyed by Expressed Sequence Tag (EST) data can be used to infer important cDNA library properties, such as gene number and expression patterns. However, EST clustering errors, which often lead to greatly inflated estimates of the number of unique genes obtained, have become a major obstacle in these analyses. The EST clustering error structure, the relationship between clustering error and clustering criteria, and possible error correction methods need to be systematically investigated.
Results: We identify and quantify two types of error, Type I and Type II, in EST clustering with the CAP3 assembly program. A Type I error occurs when ESTs from the same gene do not form a cluster, whereas a Type II error occurs when ESTs from distinct genes are falsely clustered together. While the Type II error rate is <1.5% for both 5' and 3' EST clustering, the Type I error in the 5' EST case is 10 times higher than in the 3' EST case (30% versus 3%). An over-stringent identity rule, e.g., requiring an identity P of 95%, may even inflate the Type I error in both cases. We demonstrate that 80% of the Type I error in 5' EST clustering is due to insufficient overlap among sibling ESTs (ISO error). A novel statistical approach is proposed to correct ISO error and so provide more accurate estimates of the true gene cluster profile.
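The two error types defined above can be illustrated with a minimal Python sketch. It assumes the true gene of every EST is known (as in a benchmark set) and flags a gene as a Type I case when its ESTs are spread over more than one cluster, and a cluster as a Type II case when it contains ESTs from more than one gene. The function name, the EST and cluster labels, and this gene-level and cluster-level way of counting are illustrative assumptions; the paper's own error-rate definitions may differ.

```python
from collections import defaultdict


def clustering_errors(true_gene, cluster):
    """Flag split genes (Type I) and mixed clusters (Type II).

    true_gene: dict mapping EST id -> true gene label (assumed known).
    cluster:   dict mapping EST id -> cluster label from the assembler.
    Hypothetical helper for illustration, not the ESTstat method.
    """
    clusters_per_gene = defaultdict(set)
    genes_per_cluster = defaultdict(set)
    for est, gene in true_gene.items():
        c = cluster[est]
        clusters_per_gene[gene].add(c)
        genes_per_cluster[c].add(gene)
    # Type I: ESTs from the same gene fail to form a single cluster.
    type_i = {g for g, cs in clusters_per_gene.items() if len(cs) > 1}
    # Type II: ESTs from distinct genes are clustered together.
    type_ii = {c for c, gs in genes_per_cluster.items() if len(gs) > 1}
    return type_i, type_ii


# Toy data (made up): geneA is split across c1 and c2; c3 mixes geneB and geneC.
true_gene = {"est1": "geneA", "est2": "geneA", "est3": "geneA",
             "est4": "geneB", "est5": "geneC"}
cluster = {"est1": "c1", "est2": "c1", "est3": "c2",
           "est4": "c3", "est5": "c3"}

type_i, type_ii = clustering_errors(true_gene, cluster)
print("Split genes (Type I):", sorted(type_i))
print("Mixed clusters (Type II):", sorted(type_ii))
```

In this toy case the split gene adds a cluster and the mixed cluster removes one, so the cluster count alone hides both errors; when Type I errors dominate, as reported for 5' ESTs, the cluster count overstates the number of genes.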
Availability: The methods developed in this paper are automated in ESTstat, a web-based software tool, at http://cwdg5.bio.psu.edu/eststat.
Supplementary information: http://cwdg5.bio.psu.edu/eststat
Contact: jzwang{at}northwestern.edu
* To whom correspondence should be addressed.