Bioinformatics Advance Access originally published online on July 1, 2009
Bioinformatics 2009 25(18):2302-2308; doi:10.1093/bioinformatics/btp410
k-link EST clustering: evaluating error introduced by chimeric sequences under different degrees of linkage
1CSIRO Mathematical and Information Sciences, North Ryde, NSW 2113 and 2Preventative Health National Research Flagship, Locked Bag 17, North Ryde, NSW 1670, Australia
*To whom correspondence should be addressed.
Abstract
Motivation: The clustering of expressed sequence tags (ESTs) is a crucial step in many sequence analysis studies that require a high level of redundancy. Chimeric sequences, while uncommon, can make achieving the optimal EST clustering a challenge. Single-linkage algorithms are particularly vulnerable to the effects of chimeras. To avoid chimera-facilitated erroneous merges, researchers using single-linkage algorithms are forced to use stringent sequence-similarity thresholds. Such thresholds reduce the sensitivity of the clustering algorithm.
Results: We introduce the concept of k-link clustering for EST data. We evaluate how clustering error rates vary over a range of linkage thresholds. Using k-link, we show that Type II error decreases as the number of shared ESTs (i.e. links) required increases. We observe a base level of Type II error, likely caused by the presence of unmasked low-complexity or repetitive sequence. We find that Type I error increases gradually with increased linkage. To minimize the Type I error introduced by increased linkage requirements, we propose an extension to k-link that modifies the required number of links with respect to the size of the clusters being compared.
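The linkage idea can be illustrated with a minimal sketch. It is not the k-link implementation (which is written in C++): it assumes, purely for illustration, that candidate groups are sets of EST identifiers that may overlap, and that two groups are merged only when they share at least k ESTs. The function names and the toy identifiers are hypothetical.

```python
def shared_ests(group_a, group_b):
    """Number of ESTs present in both groups (the 'links', under this sketch's assumption)."""
    return len(group_a & group_b)


def should_merge(group_a, group_b, k):
    """Merge decision: require at least k shared ESTs. k = 1 behaves like single linkage."""
    return shared_ests(group_a, group_b) >= k


# Hypothetical toy data: two groups from different genes that share
# only one chimeric EST ("chimera"), and two groups from the same gene
# that share two genuine ESTs.
gene_x = {"x1", "x2", "x3", "chimera"}
gene_y = {"y1", "y2", "y3", "chimera"}
gene_x_other = {"x2", "x3", "x4"}

for k in (1, 2):
    print(k, should_merge(gene_x, gene_y, k), should_merge(gene_x, gene_x_other, k))
```

With k = 1 the single chimeric EST is enough to join the two unrelated groups, whereas k = 2 blocks that merge while still joining the two groups that share two ESTs. Raising k further would eventually split groups that genuinely belong together, which corresponds to the gradual rise in Type I error described above.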
Availability: k-link is written in C++ and is available under the terms of the GNU General Public License (GPL) from http://www.bioinformatics.csiro.au/products.shtml.
Contact: lauren.bragg{at}csiro.au
Supplementary information: Supplementary data are available at Bioinformatics online.
Associate Editor: Dmitrij Frishman
Received on April 6, 2009; revised on June 4, 2009; accepted on June 26, 2009