Bioinformatics Advance Access originally published online on September 24, 2007
Bioinformatics 2007 23(21):2942-2944; doi:10.1093/bioinformatics/btm451
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Extending assembly of short DNA sequences to handle error
1Department of Biology, University of Carolina—Chapel Hill, Chapel Hill, NC 27599, 2Department of Genetics, Washington University School of Medicine, St. Louis, MO 63108 and 3Carolina Center for Genome Sciences, University of Carolina—Chapel Hill, Chapel Hill, NC 27599, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Inexpensive de novo genome sequencing, particularly in organisms with small genomes, is now possible using several new sequencing technologies. Some of these technologies such as that from Illumina's Solexa Sequencing, produce high genomic coverage by generating a very large number of small reads (
30 bp). While prior work shows that partial assembly can be performed by k-mer extension in error-free reads, this algorithm is unsuccessful with the sequencing error rates found in practice. We present VCAKE (Verified Consensus Assembly by K-mer Extension), a modification of simple k-mer extension that overcomes error by using high depth coverage. Though it is a simple modification of a previous approach, we show significant improvements in assembly results on simulated and experimental datasets that include error.
Availability: http://152.2.15.114/~labweb/VCAKE
Contact: william.jeck{at}gmail.com
| 1 INTRODUCTION |
|---|
|
|
|---|
Rapid and inexpensive sequencing of genomes is now possible via new high-throughput sequencing technologies that produce large numbers of small sequencing reads (Sundquist et al., 2007). Theoretical and computational work suggests that de novo assembly of small genomes is possible using these technologies (Warren et al., 2007; Whiteford et al., 2005). For example, the SSAKE assembler can assemble a viral genome using error-free short read sequences. However, the two leading short read sequencing technologies, Illumina and 454 Life Sciences, show appreciable error rates in their reads (Bentley, 2006). A viable de novo assembler using these technologies must account for this error. With a purely greedy approach to k-mer extension, erroneous bases will be incorporated into the assembly with a frequency equal to the error rate. We exhibit that even low error rates result in significantly poorer assembly for SSAKE.
Our algorithm, VCAKE (Verified Consensus Assembly by K-mer Extension), makes significant improvements in handling error. VCAKE extends seed sequences using a k-mer extension method much like SSAKE, where sequences are efficiently found from a hash table. Rather than using greedy extension or simply using Q values, which are too poorly assigned to result in adequate assembly, VCAKE considers all reads overlapping with the seed sequence. VCAKE extends the seed sequence one base at a time using the most commonly represented base from these matching reads, provided that the set of reads passes certain conditions.
Using actual and simulated data, we show that VCAKE behaves well with short reads with and without error. Our results imply that complete assembly of viral genomes and partial assembly of bacterial genomes may be possible using presently available short read technology even with prevalent error and in taxa far diverged from any reference.
| 2 METHODS |
|---|
|
|
|---|
2.1 Material
Whole genome sequences for SARS-TOR2 and Pseudomonas syringae pv. tomato str. DC3000 (DC3000) came from GenBank (AY274119 [GenBank] and AE016853 [GenBank] , respectively). Random 30mer simulated reads were generated at different coverage levels for each whole genome sequence. For those simulations including errors, each base of the sequences was randomly and independently changed to another base with a fixed probability. Three lanes of Solexa sequencing of DC3000 whole DNA were also provided (Baltrus et al., unpublished data) and used for test assembly. All lanes combined were 202 884 352 bases in 6 340 136 reads, making a coverage depth of 31.7x.
2.2 VCAKE algorithm
Initially, the VCAKE assembly process is nearly identical to SSAKE, except that two multi-FASTA files separately populate bin and set hash tables from a pool of reads. Divergence from the SSAKE method occurs during extension of the seed sequences from set. VCAKE finds all exact k-mer matches of the 3' end of the sequence up to a user-defined minimum n. The first 11 bases of the k-mer are used to efficiently search bin and all returned keys are checked against the remainder of the k-mer. Those matching perfectly are pushed into an array a number of times equal to the values keyed by that sequence in bin. As in SSAKE, the bin hash table consists of a treed hash table keyed by the first 11 bases of the read (or its reverse complement) followed by the sequence itself, with the value containing the number of appearances of that read or its reverse complement, allowing efficient searching. Having reached the minimum k-mer length n, if the total matching sequence occurrences (repetition of reads included) is less than a user-defined value, t, then the algorithm will proceed further. In this case, VCAKE extracts all k-mer matches up to a lower user-defined length m. If t reads are still not found, then k-mer matches up to user-defined length, e, are considered. This last group may have one mismatch in overlaps with the k-mer after the first 11 bases. This last procedure halts when t total sequence occurrences have been found or the minimum overlap, e, is reached. If a given sequence matches two different k-mer frames, then contig extension on that side is terminated.
The array of matching sequences is then considered. Each sequence offers a vote for the first overhanging base called by that read. The votes are totaled and the base exceeding a threshold of representation, c, is added to the contig. However, the contig will be terminated if the number of reads calling the second most common base exceeds the user-defined threshold, v. This value, v, represents the number of occurrences at which a base call can be considered an indication of duplication of the sequence elsewhere in the genome, rather than due to sequencing error. The user may also define a number of found reads, x, at which the algorithm terminates the contig. This is used when high representation of reads from a given sequence suggests repetitive sequence in the genome. Regardless of contig termination, all sequences retrieved that occur completely within the contig, are deleted both from bin and set.
Extension proceeds, one base at a time, until no matches are found or the contig is terminated for one of the other reasons above. The contig is then reverse complemented and extension of the opposite end is performed by the same method. The program finally outputs the contig to a file in multi-FASTA format and begins again with an unused seed from set.
| 3 RESULTS |
|---|
|
|
|---|
Both SSAKE and VCAKE assemblies of reads without error at 50x coverage reproduced 100% of the original SARS-TOR2 genome perfectly (Table 1). In assembly of reads with error, SSAKE showed no coverage by contigs perfectly matching the original genome. VCAKE, in contrast, showed 100% coverage by perfectly matching contigs, of which the largest was 28 332 bp.
|
For assembly of error-free simulated DC3000 reads at 50x coverage, SSAKE produced perfect contigs covering 46.5% of the genome with a longest contig of 103 158 bp. VCAKE showed superior coverage of 98.9%, but a shorter longest contig (31 446 bp). Assembly of reads with an error rate of 1% at higher coverage showed unambiguously superior assembly by VCAKE for those parameters analyzed. VCAKE had a longest contig of 34 229 bp as opposed to SSAKE's 669 bp, and VCAKE covered 98.9% of the genome where SSAKE covered 67.7%.
Using actual Solexa data, VCAKE also showed unambiguously improved assembly, covering 97.5% of the genome as opposed to 31.0% for SSAKE. Of the contigs perfectly matching the reference, VCAKE produced the longest contig (2749 bp versus 219 bp for SSAKE assembly) and a higher N50 of 249 bp. SSAKE assembly of the Solexa data showed no N50, as contigs perfectly matching the genome covered <50% of the reference.
| 4 CONCLUSION |
|---|
|
|
|---|
VCAKE can partially assemble a de novo genome from short reads in a range of genomic contexts, error rates and sequencing coverage. VCAKE shows significant improvements compared to SSAKE in situations with error. Our analysis suggests that viral genomes are tractable for de novo assembly using current short read sequencing technology with VCAKE. We also believe that VCAKE may be a step towards the assembly of larger bacterial genomes from short reads, particularly with the development of superior base calling or paired end technology.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank Jarret Glasscock for discussion and Rene Warren for providing the foundation for the method. This publication was supported by Carolina Center for Genome Sciences to C.D.J. and National Institutes of Health Grant RO1GM066025 to J.L.D. The University of North Carolina at Chapel Hill's Office of the Vice Chancellor for Research and Economic Development provided support for open access publication.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alex Bateman
Received on July 23, 2007; revised on August 21, 2007; accepted on August 24, 2007
| REFERENCES |
|---|
|
|
|---|
Bentley DR. Whole-genome re-sequencing. Curr. Opin. Genet. Dev. (2006) 16:545–552.[CrossRef][Web of Science][Medline]
Sundquist A, et al. Whole-genome sequencing and assembly with high-throughput, short-read technologies. PLoS ONE (2007) 2:e484.[CrossRef]
Warren RL, et al. Assembling millions of short DNA sequences using SSAKE. Bioinformatics (2007) 23:500–501.
Whiteford N, et al. An analysis of the feasibility of short read sequencing. Nucleic Acids Res. (2005) 33:e171.
This article has been cited by other articles:
![]() |
M. Imelfort and D. Edwards De novo sequencing of plant genomes using second-generation technologies Brief Bioinform, November 1, 2009; 10(6): 609 - 618. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. J. Suchland, K. M. Sandoz, B. M. Jeffrey, W. E. Stamm, and D. D. Rockey Horizontal Transfer of Tetracycline Resistance among Chlamydia spp. In Vitro Antimicrob. Agents Chemother., November 1, 2009; 53(11): 4604 - 4611. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. S. Horner, G. Pavesi, T. Castrignano, P. D. De Meo, S. Liuni, M. Sammeth, E. Picardi, and G. Pesole Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing Brief Bioinform, October 27, 2009; (2009) bbp046v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. Schmidt, R. Sinha, B. Beresford-Smith, and S. J. Puglisi A fast hybrid short read fragment assembly algorithm Bioinformatics, September 1, 2009; 25(17): 2279 - 2280. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. Pop Genome assembly reborn: recent computational challenges Brief Bioinform, July 1, 2009; 10(4): 354 - 366. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. T. Simpson, K. Wong, S. D. Jackman, J. E. Schein, S. J.M. Jones, and I. Birol ABySS: A parallel assembler for short read sequence data Genome Res., June 1, 2009; 19(6): 1117 - 1123. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. L. Warren, B. H. Nelson, and R. A. Holt Profiling model T-cell metagenomes with short reads Bioinformatics, February 15, 2009; 25(4): 458 - 464. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. A. Reinhardt, D. A. Baltrus, M. T. Nishimura, W. R. Jeck, C. D. Jones, and J. L. Dangl De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae Genome Res., February 1, 2009; 19(2): 294 - 305. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. J. Chaisson, D. Brinza, and P. A. Pevzner De novo fragment assembly with short mate-paired reads: Does the read length matter? Genome Res., February 1, 2009; 19(2): 336 - 346. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. J. Suchland, B. M. Jeffrey, M. Xia, A. Bhatia, H. G. Chu, D. D. Rockey, and W. E. Stamm Identification of Concomitant Infection with Chlamydia trachomatis IncA-Negative Mutant and Wild-Type Strains by Genomic, Transcriptional, and Biological Characterizations Infect. Immun., December 1, 2008; 76(12): 5438 - 5446. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. A. Holt and S. J.M. Jones The new paradigm of flow cell sequencing Genome Res., June 1, 2008; 18(6): 839 - 846. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. Butler, I. MacCallum, M. Kleber, I. A. Shlyakhter, M. K. Belmonte, E. S. Lander, C. Nusbaum, and D. B. Jaffe ALLPATHS: De novo assembly of whole-genome shotgun microreads Genome Res., May 1, 2008; 18(5): 810 - 820. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. R. Zerbino and E. Birney Velvet: Algorithms for de novo short read assembly using de Bruijn graphs Genome Res., May 1, 2008; 18(5): 821 - 829. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||




