Bioinformatics Advance Access originally published online on October 18, 2005
Bioinformatics 2005 21(24):4414-4415; doi:10.1093/bioinformatics/bti709
Evaluating and improving cDNA sequence quality with cQC
1Department of Plant Sciences, University of Arizona, Tucson, AZ 85721-0036, USA
2Department of Computer Science, University of Arizona, Tucson, AZ 85721-0077, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: Errors are prevalent in cDNA sequences but the extent to which sequence collections differ in frequencies and types of errors has not been investigated systematically. cDNA quality control, or cQC, was developed to evaluate the quality of cDNA sequence collections and to revise those sequences that differ from a higher quality genomic sequence. After removing rRNA, vector, bacterial insertion sequence and chimeric cDNA contaminants, small-scale nucleotide discrepancies were found in 51% of cDNA sequences from one Arabidopsis cDNA collection, 89% from a second Arabidopsis collection and 75% from a rice collection. These errors created premature termination codons in 4 and 42% of cDNA sequences in the respective Arabidopsis collections and in 7% of the rice cDNA sequences.
Availability: A web-based version of cQC, source code and revised cDNA collections are available at http://genomics.arizona.edu/software/cQC/
Contact: raj{at}ag.arizona.edu
Supplementary information: Further text, tables and figures are available at the above website or on Bioinformatics online.
| INTRODUCTION |
|---|
Expressed sequence tags (ESTs) and full-length cDNA sequences have been used to confirm or revise computational annotations of genomic sequences; however, the quality of the original cDNA sequence datasets has not been investigated. Because ESTs and full-length cDNAs are generated by single-pass sequencing, errors are frequent. Substitutions, deletions and insertions can alter reading frames or introduce premature termination codons (PTCs). In addition, bacterial insertion sequences (ISs) (Hill et al., 2000), ribosomal RNA (Gonzalez and Sylvester, 1997) and chimeric cDNA sequences (Burke et al., 1998) can contaminate cDNA libraries. We therefore developed cDNA quality control (cQC) to evaluate the quality of cDNA sequences and to provide corrected cDNA sequences.
| PROGRAM OVERVIEW |
|---|
cQC is a program written in Perl which
- removes chimeric cDNAs and identifies cDNAs with similarity to rRNA sequences,
- identifies cDNA sequences with similarity to bacterial IS elements for manual removal,
- identifies cDNAs lacking sequence similarity to the genomic sequence,
- identifies small-scale discrepancies (substitutions, insertions and deletions) in remaining cDNA sequences, calculates the number of occurrences in the 5'-untranslated region (5'-UTR), major open reading frame (ORF) and 3'-untranslated region (3'-UTR) and corrects the cDNA sequence to match the genomic sequence,
- assesses whether these discrepancies create a frame shift or introduce PTCs in the ORF, and
- generates a set of corrected cDNA sequences based on genomic sequence.
Arabidopsis thaliana (Arabidopsis) and Oryza sativa (rice) are good targets for analysis by cQC because high-quality genomic sequences and large libraries of full-length cDNA sequences are available. Genomic and cDNA sequences have been derived from the same highly inbred lines, minimizing the occurrence of allelic variants and therefore the number of incorrectly labeled discrepancies. Two Arabidopsis (Seki et al., 2002; Castelli et al., 2004) and one rice (Rice Full-length cDNA Consortium, 2003) full-length cDNA collections were analyzed by cQC.
Prevalence of cDNAs misaligning or not aligning to genomic sequence
After identifying cDNAs containing IS elements and rDNA sequences (Supplementary text and table S4), cQC compares cDNA sequences with genomic sequence using MegaBLAST (Zhang et al., 2000) and identifies clusters of exons by grouping high-scoring pairs that are proximal to each other within genomic sequence, resulting in (1) normal cDNAs, corresponding to a single cluster, (2) chimeric cDNAs, corresponding to two or more unrelated clusters and (3) cDNAs with no genomic counterpart (Supplementary figure S1). The rice collection contained the highest proportion of sequences lacking a genomic counterpart (Table 1), probably due to the 78 gaps in its genome sequence (Yuan et al., 2005). Chimeric cDNAs occurred at a much higher frequency (1% of cDNAs) in the rice collection than in the Arabidopsis collections (Table 1), similar to what was found for rDNA-containing cDNAs, many of which are chimeric cDNA clones.
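The grouping of high-scoring pairs (HSPs) into clusters can be illustrated with a minimal Python sketch. It is not taken from the cQC Perl source: the `HSP` record, the `max_gap` threshold of 20 000 bp and the category labels are assumptions made for illustration, and the text does not state the proximity criterion that cQC actually applies. The sketch also treats any cDNA with two or more clusters as a chimeric candidate, whereas cQC requires the clusters to be unrelated.

```python
from typing import List, NamedTuple


class HSP(NamedTuple):
    """One high-scoring pair from a cDNA-versus-genome search (hypothetical record)."""
    chrom: str
    strand: str
    g_start: int  # genomic start, 1-based inclusive
    g_end: int    # genomic end, 1-based inclusive


def cluster_hsps(hsps: List[HSP], max_gap: int = 20_000) -> List[List[HSP]]:
    """Group HSPs on the same chromosome and strand that lie within max_gap bp.

    max_gap is an assumed threshold, not a documented cQC parameter.
    """
    clusters: List[List[HSP]] = []
    cluster_end = 0  # furthest genomic end seen in the current cluster
    for hsp in sorted(hsps, key=lambda h: (h.chrom, h.strand, h.g_start)):
        last = clusters[-1][-1] if clusters else None
        same_locus = (
            last is not None
            and last.chrom == hsp.chrom
            and last.strand == hsp.strand
            and hsp.g_start - cluster_end <= max_gap
        )
        if same_locus:
            clusters[-1].append(hsp)
            cluster_end = max(cluster_end, hsp.g_end)
        else:
            clusters.append([hsp])
            cluster_end = hsp.g_end
    return clusters


def classify_cdna(hsps: List[HSP], max_gap: int = 20_000) -> str:
    """Map the number of HSP clusters onto the three categories in the text."""
    n_clusters = len(cluster_hsps(hsps, max_gap))
    if n_clusters == 0:
        return "no_genomic_counterpart"
    if n_clusters == 1:
        return "normal"
    return "chimeric_candidate"
```

For example, two HSPs a few hundred base pairs apart on the same chromosome and strand form one cluster and the cDNA is labeled normal, whereas HSPs on different chromosomes yield two clusters and flag the cDNA as a chimeric candidate.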
Next, sim4 (Florea et al., 1998) alignments of cDNA to genomic sequence distinguish cDNAs (1) lacking genomic sequence similarity at one or both ends, (2) lacking sequence similarity internally and (3) possessing continuous similarity. cQC removes cDNAs in the second category, trims the misaligning terminal sequences from cDNAs in the first category and appends the trimmed cDNAs to a cleaned sequences file. cDNAs with aberrant termini or internal irregularities represented 0.7–4.4% of the tested collections (Table 1). In total, cDNA sequences misaligning or not aligning to genomic sequence comprised 1–8% of the sequences in these cDNA collections (Supplementary table S5).
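The three alignment categories can be sketched as follows, assuming the aligned blocks have already been parsed into cDNA coordinates. The function names, the `tolerance` parameter and the category labels are hypothetical, parsing of sim4 output is not shown, and the rule that an internal gap takes precedence over a terminal gap is an assumption based on the statement that such cDNAs are removed.

```python
from typing import List, Tuple

Block = Tuple[int, int]  # aligned cDNA interval, 1-based inclusive


def classify_alignment(cdna_len: int, blocks: List[Block], tolerance: int = 0) -> str:
    """Classify a cDNA by where it fails to align to the genome.

    tolerance is the number of unaligned bases allowed before a gap is
    reported; it is an assumed parameter, not a documented cQC setting.
    """
    if not blocks:
        raise ValueError("at least one aligned block is required")
    ordered = sorted(blocks)
    covered_end = ordered[0][1]
    for start, end in ordered[1:]:
        # Unaligned cDNA bases between consecutive blocks indicate an internal gap.
        if start - covered_end - 1 > tolerance:
            return "internal_gap"
        covered_end = max(covered_end, end)
    head = ordered[0][0] - 1      # unaligned bases at the 5' end
    tail = cdna_len - covered_end  # unaligned bases at the 3' end
    if head > tolerance or tail > tolerance:
        return "terminal_gap"
    return "continuous"


def trim_terminal(seq: str, blocks: List[Block]) -> str:
    """Return the cDNA with unaligned terminal sequence removed."""
    start = min(b[0] for b in blocks)
    end = max(b[1] for b in blocks)
    return seq[start - 1:end]
```

In this sketch a cDNA labeled `terminal_gap` would be passed through `trim_terminal` before being written to the cleaned file, and one labeled `internal_gap` would be discarded.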
Small-scale discrepancies: substitutions, deletions and insertions
Finally, cQC identifies small-scale discrepancies between each cDNA and the corresponding genomic sequence, reports them in an altered sequences file, changes the cDNA sequence to match its genomic counterpart and adds the corrected cDNAs to the cleaned sequences file along with cDNAs showing perfect alignments.
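As an illustration of this step, the following Python sketch lists discrepancies from a gapped pairwise alignment of the exonic cDNA and genomic sequences and derives the corrected cDNA. The input representation (two equal-length strings with `-` marking gaps, introns already removed), the function names and the tuple layout are assumptions; the text does not describe the data structures used in cQC.

```python
from typing import List, Tuple

# (cDNA position, kind, cDNA base, genomic base)
Discrepancy = Tuple[int, str, str, str]


def find_discrepancies(cdna_aln: str, genome_aln: str) -> List[Discrepancy]:
    """List substitutions, insertions and deletions relative to the genome.

    The position is the 1-based count of cDNA bases consumed up to and
    including the column; for a deletion it is the cDNA base after which
    the genomic base is missing.
    """
    if len(cdna_aln) != len(genome_aln):
        raise ValueError("aligned strings must have equal length")
    found: List[Discrepancy] = []
    cdna_pos = 0
    for c, g in zip(cdna_aln.upper(), genome_aln.upper()):
        if c != "-":
            cdna_pos += 1
        if c == g:
            continue
        if c == "-":
            kind = "deletion"      # base present in genome, absent from cDNA
        elif g == "-":
            kind = "insertion"     # extra base in cDNA
        else:
            kind = "substitution"
        found.append((cdna_pos, kind, c, g))
    return found


def correct_cdna(genome_aln: str) -> str:
    """Corrected cDNA: the genomic bases of the alignment with gaps removed."""
    return genome_aln.replace("-", "").upper()
```

Tallying the discrepancies by region (5'-UTR, ORF, 3'-UTR) would then only require comparing each reported position with the ORF boundaries.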
Discrepancies were found in 51–89% of intact cDNAs (Table 1). Locations of discrepancies varied within cDNAs, and collections differed with respect to the frequency of discrepancies in the ORF as compared with the 5'- and 3'-UTRs (Supplementary table S6). In total, Genoscope sequences have a 10-fold greater discrepancy rate than RIKEN sequences (Supplementary table S7). The effect of these changes on protein prediction was most marked in the Genoscope collection, in which 46% of cDNAs carried frameshift mutations and/or PTCs, whereas the frequencies in the other collections were more moderate, ranging from 5 to 9% (Table 1).
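One simple way to test an ORF for these effects is sketched below in Python. It is an assumption-laden illustration rather than the cQC procedure, which the text does not detail: it uses the standard stop codons, treats a net indel length that is not a multiple of three as a frameshift, and calls a stop codon premature when it occurs in frame before the final codon. All function names are hypothetical.

```python
from typing import Optional

STOP_CODONS = {"TAA", "TAG", "TGA"}


def first_stop_index(orf: str) -> Optional[int]:
    """Return the 0-based codon index of the first in-frame stop, or None."""
    orf = orf.upper()
    for i in range(0, len(orf) - 2, 3):
        if orf[i:i + 3] in STOP_CODONS:
            return i // 3
    return None


def has_premature_stop(orf: str) -> bool:
    """True if an in-frame stop codon occurs before the final codon of the ORF."""
    stop = first_stop_index(orf)
    if stop is None:
        return False
    return stop < len(orf) // 3 - 1


def shifts_frame(n_inserted: int, n_deleted: int) -> bool:
    """True if the net number of inserted and deleted ORF bases is not a multiple of 3."""
    return (n_inserted - n_deleted) % 3 != 0
```

Here `has_premature_stop` would be applied to the uncorrected cDNA read in the frame of the genomic ORF, and `shifts_frame` to the counts of inserted and deleted bases found within the ORF.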
Clearly, the quality of full-length cDNA sequence collections can be quite variable. If this important resource is to be used effectively as a primary source of data, information about the quality of these collections will be valuable and corrected sequence collections must be available. cQC can provide this for any species for which there is a high-quality genome sequence, including highly redundant (e.g. 10x) draft genome sequences.
| Acknowledgments |
|---|
We thank the Biotechnology Computing Facility (BCF) for hosting the cQC website. This research was supported by a University of Arizona NSF IGERT Genomics Initiative fellowship to C.A.H. and T.J.W. Funding to pay the Open Access publication charges for this article was provided by University of Arizona.
Conflict of Interest: none declared.
Received on May 27, 2005; revised on September 1, 2005; accepted on October 6, 2005
| REFERENCES |
|---|
Burke, J., et al. (1998) Alternative gene form discovery and candidate gene selection from gene indexing projects. Genome Res., 8, 276–290.
Castelli, V., et al. (2004) Whole genome sequence comparisons and full-length cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation. Genome Res., 14, 406–413.
Florea, L., et al. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res., 8, 967–974.
Gonzalez, I.L. and Sylvester, J.E. (1997) Incognito rRNA and rDNA in databases and libraries. Genome Res., 7, 65–70.
Hill, F., et al. (2000) An estimate of large-scale sequencing accuracy. EMBO Reports, 1, 29–31.
Rice Full-length cDNA Consortium. (2003) Collection, mapping and annotation of over 28 000 cDNA clones from japonica rice. Science, 301, 376–379.
Seki, M., et al. (2002) Functional annotation of a full-length Arabidopsis cDNA collection. Science, 296, 141–145.
Yuan, Q., et al. (2005) The Institute for Genomic Research Osa1 rice genome annotation database. Plant Physiol., 138, 18–26.
Zhang, Z., et al. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.