Bioinformatics Advance Access originally published online on October 18, 2005
Bioinformatics 2005 21(24):4414-4415; doi:10.1093/bioinformatics/bti709

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oxfordjournals.org
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions{at}oxfordjournals.org

Evaluating and improving cDNA sequence quality with cQC

Celine A. Hayden 1, Travis J. Wheeler 2 and Richard A. Jorgensen 1,*

1Department of Plant Sciences, University of Arizona, Tucson, AZ 85721-0036, USA
2Department of Computer Science, University of Arizona, Tucson, AZ 85721-0077, USA

*To whom correspondence should be addressed.


    ABSTRACT

Summary: Errors are prevalent in cDNA sequences but the extent to which sequence collections differ in frequencies and types of errors has not been investigated systematically. cDNA quality control, or cQC, was developed to evaluate the quality of cDNA sequence collections and to revise those sequences that differ from a higher quality genomic sequence. After removing rRNA, vector, bacterial insertion sequence and chimeric cDNA contaminants, small-scale nucleotide discrepancies were found in 51% of cDNA sequences from one Arabidopsis cDNA collection, 89% from a second Arabidopsis collection and 75% from a rice collection. These errors created premature termination codons in 4 and 42% of cDNA sequences in the respective Arabidopsis collections and in 7% of the rice cDNA sequences.

Availability: A web-based version of cQC, source code and revised cDNA collections are available at http://genomics.arizona.edu/software/cQC/

Contact: raj{at}ag.arizona.edu

Supplementary information: Further text, tables and figures are available at the above website or on Bioinformatics online.


    INTRODUCTION
Expressed sequence tags (ESTs) and full-length cDNA sequences have been used to confirm or revise computational annotations of genomic sequences; however, the quality of the original cDNA sequence datasets has not been investigated. Because ESTs and full-length cDNAs are generated by single-pass sequencing, errors are frequent. Substitutions, deletions and insertions can alter reading frames or introduce premature termination codons (PTCs). In addition, bacterial insertion sequences (ISs) (Hill et al., 2000), ribosomal RNA (Gonzalez and Sylvester, 1997) and chimeric cDNA sequences (Burke et al., 1998) can contaminate cDNA libraries. We therefore developed cDNA quality control (cQC) to evaluate the quality of cDNA sequences and to provide corrected cDNA sequences.


    PROGRAM OVERVIEW
cQC is a program written in Perl that

  • removes chimeric cDNAs and identifies cDNAs with similarity to rRNA sequences,
  • identifies cDNA sequences with similarity to bacterial IS elements for manual removal,
  • identifies cDNAs lacking sequence similarity to the genomic sequence,
  • identifies small-scale discrepancies (substitutions, insertions and deletions) in remaining cDNA sequences, calculates the number of occurrences in the 5'-untranslated region (5'-UTR), major open reading frame (ORF) and 3'-untranslated region (3'-UTR) and corrects the cDNA sequence to match the genomic sequence,
  • assesses whether these discrepancies create a frame shift or introduce PTCs in the ORF, and
  • generates a set of corrected cDNA sequences based on genomic sequence.
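
The frame-shift and PTC assessment in the list above can be illustrated with a minimal Python sketch. This is not the cQC Perl code: the function names are hypothetical, the ORF is assumed to be already extracted as a nucleotide string in frame, and only the three standard stop codons are considered.

```python
# Hypothetical helpers, not part of cQC: they only illustrate the idea of
# checking an ORF for premature termination codons (PTCs) and frame shifts.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def has_premature_stop(orf: str) -> bool:
    """Return True if an in-frame stop codon occurs before the last complete codon."""
    orf = orf.upper()
    # Start index of the last complete codon; codons before it are inspected.
    last_codon_start = len(orf) - len(orf) % 3 - 3
    return any(
        orf[i:i + 3] in STOP_CODONS
        for i in range(0, max(last_codon_start, 0), 3)
    )


def causes_frameshift(inserted: int, deleted: int) -> bool:
    """A net indel length that is not a multiple of 3 shifts the reading frame."""
    return (inserted - deleted) % 3 != 0
```

For example, has_premature_stop("ATGTAAAAATAA") reports a PTC at the second codon, whereas a single inserted base (causes_frameshift(1, 0)) shifts the frame without necessarily creating a stop codon at that position.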

Arabidopsis thaliana (Arabidopsis) and Oryza sativa (rice) are good targets for analysis by cQC because high-quality genomic sequences and large libraries of full-length cDNA sequences are available. Genomic and cDNA sequences have been derived from the same highly inbred lines, minimizing the occurrence of allelic variants and therefore the number of incorrectly labeled discrepancies. Two Arabidopsis (Seki et al., 2002; Castelli et al., 2004) and one rice (Rice Full-length cDNA Consortium, 2003) full-length cDNA collections were analyzed by cQC.

Prevalence of cDNAs misaligning or not aligning to genomic sequence
After identifying cDNAs containing IS elements and rDNA sequences (Supplementary text and table S4), cQC compares cDNA sequences with genomic sequence using MegaBLAST (Zhang et al., 2000) and identifies clusters of exons by grouping high-scoring pairs that are proximal to each other within the genomic sequence. This yields three classes: (1) normal cDNAs, corresponding to a single cluster; (2) chimeric cDNAs, corresponding to two or more unrelated clusters; and (3) cDNAs with no genomic counterpart (Supplementary figure S1). The rice collection contained the highest proportion of sequences lacking a genomic counterpart (Table 1), probably due to the 78 gaps in its genome sequence (Yuan et al., 2005). Chimeric cDNAs occurred at a much higher frequency (~1% of cDNAs) in the rice collection than in the Arabidopsis collections (Table 1), a pattern also seen for rDNA-containing cDNAs, many of which are chimeric cDNA clones.
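
The grouping of high-scoring pairs into clusters described above might be sketched in Python as follows. This is an illustration under stated assumptions rather than the cQC implementation: each high-scoring pair is reduced to a (chromosome, genomic start, genomic end) tuple, the max_gap threshold of 10 000 bp is an arbitrary placeholder (the distance cQC actually uses is not given here), and strand is ignored.

```python
from typing import List, Tuple

# Assumed representation of one high-scoring pair on the genome:
# (chromosome, genomic start, genomic end).
Hsp = Tuple[str, int, int]


def cluster_hsps(hsps: List[Hsp], max_gap: int = 10_000) -> List[List[Hsp]]:
    """Group HSPs on the same chromosome that lie within max_gap of each other."""
    clusters: List[List[Hsp]] = []
    cluster_end = 0  # right-most genomic coordinate of the current cluster
    for hsp in sorted(hsps):
        chrom, start, end = hsp
        same_chrom = bool(clusters) and clusters[-1][0][0] == chrom
        if same_chrom and start - cluster_end <= max_gap:
            clusters[-1].append(hsp)
            cluster_end = max(cluster_end, end)
        else:
            clusters.append([hsp])
            cluster_end = end
    return clusters


def classify_cdna(hsps: List[Hsp], max_gap: int = 10_000) -> str:
    """Map the number of clusters to the three classes named in the text."""
    n_clusters = len(cluster_hsps(hsps, max_gap))
    if n_clusters == 0:
        return "no genomic counterpart"
    return "normal" if n_clusters == 1 else "chimeric"
```

With these assumptions, two nearby hits on one chromosome are classified as "normal", while hits on two different chromosomes, or far apart on the same chromosome, are classified as "chimeric".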


Table 1 Frequency of cDNAs misaligning to genomic sequence and consequences for protein prediction

 
Next, sim4 (Florea et al., 1998) alignments of cDNA to genomic sequence distinguish cDNAs (1) lacking genomic sequence similarity at one or both ends, (2) lacking sequence similarity internally and (3) possessing continuous similarity. cQC removes cDNAs in the second category, trims the misaligning terminal sequences from cDNAs in the first category and appends the trimmed cDNAs to a cleaned sequences file. cDNAs with aberrant termini or internal irregularities represented 0.7–4.4% of the tested collections (Table 1). In total, cDNA sequences misaligning or not aligning to genomic sequence comprised 1–8% of sequences in these cDNA collections (Supplementary table S5).
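
The three alignment categories and the trimming step can be sketched in Python. The sketch does not parse sim4 output and is not the cQC code: it assumes each alignment has already been reduced to a list of aligned cDNA intervals (1-based, inclusive), and the function names are hypothetical.

```python
from typing import List, Tuple

# Assumed input: aligned blocks in cDNA coordinates, 1-based and inclusive.
Block = Tuple[int, int]


def classify_alignment(blocks: List[Block], cdna_len: int) -> str:
    """Return 'internal', 'terminal' or 'continuous' for one cDNA alignment."""
    if not blocks:
        raise ValueError("no aligned blocks; cDNA has no genomic counterpart")
    blocks = sorted(blocks)
    # Internal irregularity: consecutive blocks leave cDNA bases unaligned.
    for (_, prev_end), (next_start, _) in zip(blocks, blocks[1:]):
        if next_start > prev_end + 1:
            return "internal"
    # Aberrant terminus: unaligned bases at the 5' or 3' end of the cDNA.
    if blocks[0][0] > 1 or blocks[-1][1] < cdna_len:
        return "terminal"
    return "continuous"


def trim_termini(seq: str, blocks: List[Block]) -> str:
    """Keep only the cDNA span between the first and last aligned base."""
    start = min(b[0] for b in blocks)
    end = max(b[1] for b in blocks)
    return seq[start - 1:end]
```

In this sketch the internal check takes precedence, mirroring the text: cDNAs lacking internal similarity are removed, whereas those with only terminal problems are trimmed and kept.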

Small-scale discrepancies: substitutions, deletions and insertions
Finally, cQC identifies small-scale discrepancies between each cDNA and the corresponding genomic sequence, reports the discrepancies in an altered sequences file, changes the cDNA sequence to match its genomic counterpart and adds the corrected cDNAs to the cleaned sequences file along with cDNAs showing perfect alignments.
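
As a rough illustration of this correction step, the following Python sketch reports discrepancies and rebuilds a cDNA from its genomic counterpart. It is not the cQC implementation: it assumes the cDNA and the spliced genomic sequence are supplied as two gapped alignment strings of equal length with "-" marking gaps, and the function name is hypothetical.

```python
from typing import List, Tuple


def correct_cdna(cdna_aln: str, genomic_aln: str) -> Tuple[str, List[Tuple[int, str]]]:
    """Return the genome-matched cDNA and a list of (alignment column, type)."""
    if len(cdna_aln) != len(genomic_aln):
        raise ValueError("aligned strings must have equal length")
    corrected: List[str] = []
    discrepancies: List[Tuple[int, str]] = []
    for col, (c, g) in enumerate(zip(cdna_aln.upper(), genomic_aln.upper())):
        if c != g:
            if c == "-":
                kind = "deletion"      # base present in genome, missing in cDNA
            elif g == "-":
                kind = "insertion"     # extra base in cDNA, absent from genome
            else:
                kind = "substitution"
            discrepancies.append((col, kind))
        if g != "-":
            corrected.append(g)        # corrected cDNA follows the genome
    return "".join(corrected), discrepancies
```

For instance, correct_cdna("ATGC-TAA", "ATGCATAA") restores the missing base and records a single deletion at alignment column 4 (0-based).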

Discrepancies were found in 51–89% of intact cDNAs (Table 1). Locations of discrepancies varied within cDNAs, and collections differed with respect to the frequency of discrepancies in the ORF as compared with the 5'- and 3'-UTRs (Supplementary table S6). Overall, Genoscope sequences have a ~10-fold greater discrepancy rate than RIKEN sequences (Supplementary table S7). The effect of these discrepancies on protein prediction was most marked in the Genoscope collection, where the frequency of frameshift mutations and/or PTCs was 46%, whereas the other collections' frequencies were more moderate, ranging from 5 to 9% (Table 1).

Clearly, the quality of full-length cDNA sequence collections can be quite variable. If this important resource is to be used effectively as a primary source of data, information about the quality of these collections will be valuable and corrected sequence collections must be available. cQC can provide this for any species for which there is a high-quality genome sequence, including highly redundant (e.g. 10x) draft genome sequences.


    Acknowledgments
 
We thank the Biotechnology Computing Facility (BCF) for hosting the cQC website. This research was supported by a University of Arizona NSF IGERT Genomics Initiative fellowship to C.A.H. and T.J.W. Funding to pay the Open Access publication charges for this article was provided by University of Arizona.

Conflict of Interest: none declared.

Received on May 27, 2005; revised on September 1, 2005; accepted on October 6, 2005

    REFERENCES

    Burke, J., et al. (1998) Alternative gene form discovery and candidate gene selection from gene indexing projects. Genome Res., 8, 276–290.

    Castelli, V., et al. (2004) Whole genome sequence comparisons and ‘full-length’ cDNA sequences: a combined approach to evaluate and improve Arabidopsis genome annotation. Genome Res., 14, 406–413.

    Florea, L., et al. (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res., 8, 967–974.

    Gonzalez, I.L. and Sylvester, J.E. (1997) Incognito rRNA and rDNA in databases and libraries. Genome Res., 7, 65–70.

    Hill, F., et al. (2000) An estimate of large-scale sequencing accuracy. EMBO Reports, 1, 29–31.

    Rice Full-length cDNA Consortium. (2003) Collection, mapping and annotation of over 28 000 cDNA clones from japonica rice. Science, 301, 376–379.

    Seki, M., et al. (2002) Functional annotation of a full-length Arabidopsis cDNA collection. Science, 296, 141–145.

    Yuan, Q., et al. (2005) The Institute for Genomic Research Osa1 rice genome annotation database. Plant Physiol., 138, 18–26.

    Zhang, Z., et al. (2000) A greedy algorithm for aligning DNA sequences. J. Comput. Biol., 7, 203–214.

