Bioinformatics Advance Access originally published online on January 28, 2008
Bioinformatics 2008 24(5):713-714; doi:10.1093/bioinformatics/btn025
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SOAP: short oligonucleotide alignment program
1Beijing Genomics Institute at Shenzhen, Shenzhen 518083, China and 2Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense M, DK-5230, Denmark
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Summary: We have developed a program SOAP for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The program is designed to handle the huge amounts of short reads generated by parallel sequencing using the new generation Illumina-Solexa sequencing technology. SOAP is compatible with numerous applications, including single-read or pair-end resequencing, small RNA discovery and mRNA tag sequence mapping. SOAP is a command-driven program, which supports multi-threaded parallel computing, and has a batch module for multiple query sets.
Availability: http://soap.genomics.org.cn
Contact: soap{at}genomics.org.cn
The new DNA sequencing technologies, which have been developed and implemented recently, have significantly improved throughput and dramatically reduced the cost compared with the capillary-based electrophoresis systems (Shendure et al., 2004). In a single experiment using one instrument, the Illumina-Solexa system using sequencing-by-synthesis (SBS) can determine up to 40 million sequences of up to 50 bases in length, whereas the ABI-SOLiD system using ligation technology allows determination of 3 Gb mappable data. The ultra high throughput and short read length make these technologies particularly suitable for large-scale resequencing of large cohorts of individuals with known reference for studies of genetic variations (Bentley, 2006). Traditional sequence alignment software like blast (Altschul et al., 1997) and blat (Kent, 2002) are unable to cope efficiently with the huge amount of reads generated in such applications, while SSAHA (Ning et al., 2001) is optimized to find long alignments and fails in practice on most short queries. To our knowledge, there have been several programs developed or under developing to match the new sequencing technologies. ELAND, an alignment tool integrated in Illumina-Solexa data processing package, can do ungapped alignment for reads with size up to 32 bp (Cox, unpublished). Maq is another program for ungapped alignment, which implemented sophisticated probability models to measure alignment quality of each read using sequence quality information (Li, unpublished). Here, we present a new program SOAP, which can do both ungapped and gapped alignment, and has special modules for alignment of pair-end, small RNA and mRNA tag sequences.
SOAP will allow either a certain number of mismatches or one continuous gap for aligning a read onto the reference sequence. The best hit of each read which has minimal number of mismatches or smaller gap will be reported. For multiple equal-best hits, the user can instruct the program to report all, or randomly report one, or disregard all of them. Since the typical read length is 25–50 bp, hits with too many mismatches are unreliable which are hard to distinguish with random matches. By default, the program will allow at most two mismatches. Between two haplotype genome sequences, occurrence of single nucleotide polymorphism is much higher than that of small insertions or deletions, so ungapped hits have precedence over gapped hits. For gapped alignment only one continuous gap with a size ranging from 1 to 3 bp is accepted, while no mismatches are permitted in the flanking regions to avoid ambiguous gaps. The gap could be either insertion or deletion in the query or the reference sequence. As the intrinsic character of the sequencing technology, errors will accumulate during the sequencing process. Reads always exhibit a much higher number of sequencing errors at the 3'-end, which sometimes make them unalignable to the reference sequences. To deal with the problem, SOAP can iteratively trim several basepairs at the 3'-end and redo the alignment, until hits are detected or the remaining sequence is too short for specific alignment.
Pair-end sequencing means to sequence both ends of a DNA fragment. So the two reads belonging to a pair will always have the settled relative orientation and approximate distance between each other on the genome. The technology can significantly improve the accuracy of resequencing mapping, and is a powerful method for detection of structural variants including copy number variations (CNVs), rearrangements, inversions and etc. SOAP is able to align a pair of reads simultaneously. A pair will be aligned when two reads are mapped with the right orientation relationship and proper distance. Similar filter as single-read alignment, a certain number of mismatches are allowed in one or both reads of the pair. For gapped alignment, gap is only permitted on one read, and the other end should match exactly.
Apart from genome resequencing, The high throughput sequencing technology lends itself to numerous applications. For some applications (ex. ChIP-Seq), the data analysis process is essentially identical to that of resequencing. Additionally, SOAP provides special modules for small RNA discovery and mRNA tag profiling analysis. Small RNAs have a size between 18 to 26 bp. According to the experimental protocol, the 3'-end of RNA sequence will be flanked by adapter sequences. SOAP will filter adapter sequence, and then align the remaining candidate small RNA to the reference sequence. A small RNA will be annotated if an adapter sequence is detected and the insert sequence match well with the reference sequence. Considering sequencing errors, one or two mismatches can be allowed insider either the adapter or the candidate RNA region according to user settings. On mRNA tag sequencing, there are two types of restriction enzyme digestion: (i) DpnII, which will specially recognize the site GATC and cuts a 16 bp tag after the site; (ii) NlaIII, which exclusively recognizes the site CATG and cuts 17 bp downstream. SOAP checks and trims off the 3'-end adapter sequence according to the enzyme type. Aligned hits should contain the enzyme site, and have at most one mismatch in the tag region.
SOAP uses seed and hash look-up table algorithm to accelerate alignment. Both reads and the reference sequences are converted to numeric data type using 2-bits-per-base encoding. A read will do exclusive-OR comparison with the reference sequence. Then the value is used as suffix to check the look-up table to know how many bases are different. In order to have a tradeoff between memory usage and efficiency, SOAP uses unsigned 3-bytes data type as the table element. To admit two mismatches, a read is splitted into four fragments, the two mismatches can exist in at most two of the fragments on the same time, then if we try all six combinations of the two fragments as seed, we can however catch all hits with two mismatches (the algorithm is the same as Eland and Maq). Since mismatches are not allowed in gapped hits, SOAP used the enumeration algorithm which tries to insert a continuous gap or delete a fragment at each possible position in a read. The algorithm outputs the identical alignments as that of dynamic programming while runs much faster. Not alike Eland and Maq which load read sequences into memory and build seed index tables for reads, SOAP loads reference sequences into memory as an unsigned 3-bytes array and builds the seed index tables for all the reference sequences. Then for each read, create seeds and search the corresponding index table for candidate hits, perform alignment and report the results. The RAM required for storing the reference sequences and seed index tables can be calculated as:
|
|
Evaluated on a real dataset containing 9 914 527 Illumina-Solexa single-end resequencing reads (length 32 bp), which were generated from a 5 Mb human genome region, SOAP was almost 300 (gapped) to 1200 (ungapped) times faster than blastn, while having better sensitivity (Table 1). The iterative feature of SOAP improved sensitivity. And gapped alignment can further identify hits accommodating small indels which compose only a small fraction of all hits but are a very important class of mutation. Since SOAP loads reference sequences into memory, while Eland and Maq load reads, the memory usage varies in different datasets.
|
SOAP accepts FASTA format for reference, and either FASTA or FASTQ format for query reads. It's a command-driven program, which employs single command line model and batch computing model. On batch computing model, the reference sequences and hash index tables will reside in the memory and alignment procedure can be performed for multiple query datasets in a order. This model avoids I/O and time wasted on loading reference and creating hash tables multiple times, and is also suitable for real-time web service. SOAP is written in standard C++ language and runs well on Macintosh or any 64-bit Linux/Unix systems. It supports multithreaded parallel computing.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We are indebted to Wei Fan, Qibin Li, Xiaodong Fang, Zhike Lu, Guoqing Li, Junjie Qin and the other users who tested the beta version of the program for identifying bugs and proposing all kinds of improvements. We thank Shengting Li for setting up the website. The project is supported by the National Natural Science Foundation of China (30725008), and grants from the Danish Natural Science Research Council (272-05-0344 and 272-07-0196).
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Keith Crandall
Received on November 10, 2007; revised on December 20, 2007; accepted on January 14, 2008
| REFERENCES |
|---|
|
|
|---|
Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.
Bentley DR. Whole-genome re-sequencing. Curr. Opin. Genet. Dev (2006) 16:545–552.[CrossRef][Web of Science][Medline]
Cox A. ELAND: Efficient Local Alignment of Nucleotide Data. (unpublished).
Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res (2002) 12:656–664.
Li H. Mapping and assembly with quality. (unpublished) http://maq.sourceforge.net/.
Ning Z, et al. SSAHA: a fast search method for large DNA databases. Genome Res (2001) 11:1725–1729.
Shendure J, et al. Advanced sequencing technologies: methods and goals. Nat. Rev. Genet (2004) 5:335–344.[Web of Science][Medline]
This article has been cited by other articles:
![]() |
H. Li and R. Durbin Fast and accurate short read alignment with Burrows-Wheeler transform Bioinformatics, July 15, 2009; 25(14): 1754 - 1760. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Kawaoka, N. Hayashi, Y. Suzuki, H. Abe, S. Sugano, Y. Tomari, T. Shimada, and S. Katsuma The Bombyx ovary-derived cell line endogenously expresses PIWI/PIWI-interacting RNA complexes RNA, July 1, 2009; 15(7): 1258 - 1264. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Gaidatzis, K. Jacobeit, E. J. Oakeley, and M. B. Stadler Overestimation of alternative splicing caused by variable probe characteristics in exon arrays Nucleic Acids Res., June 15, 2009; (2009) gkp508v1. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Bao, H. Guo, J. Wang, R. Zhou, X. Lu, and S. Shi MapView: visualization of short reads alignment on a desktop computer Bioinformatics, June 15, 2009; 25(12): 1554 - 1555. [Abstract] [Full Text] [PDF] |
||||
![]() |
R. Li, Y. Li, X. Fang, H. Yang, J. Wang, K. Kristiansen, and J. Wang SNP detection for massively parallel whole-genome resequencing Genome Res., June 1, 2009; 19(6): 1124 - 1132. [Abstract] [Full Text] [PDF] |
||||
![]() |
Y. Jung Kim, N. Teletia, V. Ruotti, C. A. Maher, A. M. Chinnaiyan, R. Stewart, J. A. Thomson, and J. M. Patel ProbeMatch: rapid alignment of oligonucleotides to genome allowing both gaps and mismatches Bioinformatics, June 1, 2009; 25(11): 1424 - 1425. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. C. Schatz CloudBurst: highly sensitive read mapping with MapReduce Bioinformatics, June 1, 2009; 25(11): 1363 - 1369. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. D. Passalacqua, A. Varadarajan, B. D. Ondov, D. T. Okou, M. E. Zwick, and N. H. Bergman Structure and Complexity of a Bacterial Transcriptome J. Bacteriol., May 15, 2009; 191(10): 3203 - 3211. [Abstract] [Full Text] [PDF] |
||||
![]() |
N. Fahlgren, C. M. Sullivan, K. D. Kasschau, E. J. Chapman, J. S. Cumbie, T. A. Montgomery, S. D. Gilbert, M. Dasenko, T. W.H. Backman, S. A. Givan, et al. Computational and analytical framework for small RNA profiling by high-throughput sequencing RNA, May 1, 2009; 15(5): 992 - 1002. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. A. Ebhardt, H. H. Tsang, D. C. Dai, Y. Liu, B. Bostan, and R. P. Fahlman Meta-analysis of small RNA-sequencing errors reveals ubiquitous post-transcriptional RNA modifications Nucleic Acids Res., May 1, 2009; 37(8): 2461 - 2470. [Abstract] [Full Text] [PDF] |
||||
![]() |
M. J. Fullwood, C.-L. Wei, E. T. Liu, and Y. Ruan Next-generation DNA sequencing of paired-end tags (PET) for transcriptome and genome analyses Genome Res., April 1, 2009; 19(4): 521 - 532. [Abstract] [Full Text] [PDF] |
||||
![]() |
D. Campagna, A. Albiero, A. Bilardi, E. Caniato, C. Forcato, S. Manavski, N. Vitulo, and G. Valle PASS: a program to align short sequences Bioinformatics, April 1, 2009; 25(7): 967 - 968. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. L. Eaves and Y. Gao MOM: maximum oligonucleotide mapping Bioinformatics, April 1, 2009; 25(7): 969 - 970. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. V. Voelkerding, S. A. Dames, and J. D. Durtschi Next-Generation Sequencing: From Basic Research to Diagnostics Clin. Chem., April 1, 2009; 55(4): 641 - 658. [Abstract] [Full Text] [PDF] |
||||
![]() |
X. Wang, A. A. Elling, X. Li, N. Li, Z. Peng, G. He, H. Sun, Y. Qi, X. S. Liu, and X. W. Deng Genome-Wide and Organ-Specific Landscapes of Epigenetic Modifications and Their Relationships to mRNA and Small RNA Transcriptomes in Maize PLANT CELL, April 1, 2009; 21(4): 1053 - 1069. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. A. Reinhardt, D. A. Baltrus, M. T. Nishimura, W. R. Jeck, C. D. Jones, and J. L. Dangl De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae Genome Res., February 1, 2009; 19(2): 294 - 305. [Abstract] [Full Text] [PDF] |
||||
![]() |
G. Li, L. Ma, C. Song, Z. Yang, X. Wang, H. Huang, Y. Li, R. Li, X. Zhang, H. Yang, et al. The YH database: the first Asian diploid genome database Nucleic Acids Res., January 1, 2009; 37(suppl_1): D1025 - D1028. [Abstract] [Full Text] [PDF] |
||||
![]() |
K. Kapur, H. Jiang, Y. Xing, and W. H. Wong Cross-hybridization modeling on Affymetrix exon arrays Bioinformatics, December 15, 2008; 24(24): 2887 - 2893. [Abstract] [Full Text] [PDF] |
||||
![]() |
B. D. Ondov, A. Varadarajan, K. D. Passalacqua, and N. H. Bergman Efficient mapping of Applied Biosystems SOLiD sequence data to a reference genome for functional genomic applications Bioinformatics, December 1, 2008; 24(23): 2776 - 2777. [Abstract] [Full Text] [PDF] |
||||
![]() |
J. R. ten Bosch and W. W. Grody Keeping Up With the Next Generation: Massively Parallel Sequencing in Clinical Diagnostics J. Mol. Diagn., November 1, 2008; 10(6): 484 - 492. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Jiang and W. H. Wong SeqMap: mapping massive amount of oligonucleotides to the genome Bioinformatics, October 15, 2008; 24(20): 2395 - 2396. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







