Bioinformatics Advance Access originally published online on January 28, 2008
Bioinformatics 2008 24(5):713-714; doi:10.1093/bioinformatics/btn025

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

SOAP: short oligonucleotide alignment program

Ruiqiang Li 1,2, Yingrui Li 1, Karsten Kristiansen 2 and Jun Wang 1,2,*

1Beijing Genomics Institute at Shenzhen, Shenzhen 518083, China and 2Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense M, DK-5230, Denmark

*To whom correspondence should be addressed.


    ABSTRACT

Summary: We have developed a program SOAP for efficient gapped and ungapped alignment of short oligonucleotides onto reference sequences. The program is designed to handle the huge amounts of short reads generated by parallel sequencing using the new generation Illumina-Solexa sequencing technology. SOAP is compatible with numerous applications, including single-read or pair-end resequencing, small RNA discovery and mRNA tag sequence mapping. SOAP is a command-driven program, which supports multi-threaded parallel computing, and has a batch module for multiple query sets.

Availability: http://soap.genomics.org.cn

Contact: soap{at}genomics.org.cn

The new DNA sequencing technologies developed and implemented in recent years have significantly improved throughput and dramatically reduced cost compared with capillary-based electrophoresis systems (Shendure et al., 2004). In a single experiment on one instrument, the Illumina-Solexa system, which uses sequencing-by-synthesis (SBS), can determine up to 40 million sequences of up to 50 bases in length, whereas the ABI-SOLiD system, which uses ligation technology, allows determination of 3 Gb of mappable data. The ultra-high throughput and short read length make these technologies particularly suitable for large-scale resequencing of large cohorts of individuals against a known reference in studies of genetic variation (Bentley, 2006). Traditional sequence alignment software such as BLAST (Altschul et al., 1997) and BLAT (Kent, 2002) cannot cope efficiently with the huge number of reads generated in such applications, while SSAHA (Ning et al., 2001) is optimized to find long alignments and fails in practice on most short queries. To our knowledge, several programs have been developed or are under development to match the new sequencing technologies. ELAND, an alignment tool integrated in the Illumina-Solexa data processing package, performs ungapped alignment for reads of up to 32 bp (Cox, unpublished). Maq is another program for ungapped alignment; it implements sophisticated probability models that use sequence quality information to measure the alignment quality of each read (Li, unpublished). Here, we present a new program, SOAP, which performs both ungapped and gapped alignment and has special modules for the alignment of pair-end, small RNA and mRNA tag sequences.

SOAP allows either a certain number of mismatches or one continuous gap when aligning a read onto the reference sequence. For each read, the best hit, that is, the one with the fewest mismatches or the smallest gap, is reported. For multiple equal-best hits, the user can instruct the program to report all of them, report one at random, or disregard all of them. Since the typical read length is 25–50 bp, hits with too many mismatches are unreliable because they are hard to distinguish from random matches. By default, the program allows at most two mismatches. Between two haplotype genome sequences, single nucleotide polymorphisms occur much more frequently than small insertions or deletions, so ungapped hits take precedence over gapped hits. For gapped alignment, only one continuous gap of 1 to 3 bp is accepted, and no mismatches are permitted in the flanking regions, to avoid ambiguous gaps. The gap can be either an insertion or a deletion, in either the query or the reference sequence. Owing to the intrinsic character of the sequencing technology, errors accumulate during the sequencing process. Reads typically exhibit a much higher number of sequencing errors at the 3'-end, which sometimes makes them unalignable to the reference sequences. To deal with this problem, SOAP can iteratively trim several base pairs from the 3'-end and redo the alignment, until hits are detected or the remaining sequence is too short for specific alignment.
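As a rough illustration of the iterative 3'-end trimming described above, the following minimal Python sketch retries an ungapped search after each trimming step. It is not SOAP's implementation: the function names, the naive linear scan over the reference, the trimming step of 2 bp and the minimum length of 20 bp are assumptions made for this example. Only the default limit of two mismatches and the preference for the hit with the fewest mismatches come from the text.

```python
def count_mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def best_ungapped_hits(read, reference, max_mismatches=2):
    """Return (mismatches, [start positions]) for the best hits, or None.

    Naive scan of every reference position (hypothetical helper; a real
    aligner would use a seed index instead).
    """
    best, positions = None, []
    for pos in range(len(reference) - len(read) + 1):
        mm = count_mismatches(read, reference[pos:pos + len(read)])
        if mm > max_mismatches:
            continue
        if best is None or mm < best:
            best, positions = mm, [pos]   # strictly better hit found
        elif mm == best:
            positions.append(pos)         # equal-best hit
    return None if best is None else (best, positions)


def align_with_trimming(read, reference, min_length=20, step=2,
                        max_mismatches=2):
    """Trim the 3'-end until a hit is found or the read becomes too short.

    Returns (aligned_length, (mismatches, positions)) or None.
    `min_length` and `step` are assumed values, not taken from SOAP.
    """
    trimmed = read
    while len(trimmed) >= min_length:
        hit = best_ungapped_hits(trimmed, reference, max_mismatches)
        if hit is not None:
            return len(trimmed), hit
        trimmed = trimmed[:-step]         # drop bases from the 3'-end
    return None
```

In this sketch, equal-best hits are all returned, which leaves the choice between reporting all, one at random, or none to the caller.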

Pair-end sequencing means sequencing both ends of a DNA fragment, so the two reads belonging to a pair always have a fixed relative orientation and an approximately known distance from each other on the genome. The technology can significantly improve the accuracy of resequencing mapping, and is a powerful method for detecting structural variants, including copy number variations (CNVs), rearrangements and inversions. SOAP is able to align a pair of reads simultaneously. A pair is aligned when the two reads map with the correct relative orientation and a proper distance. As in single-read alignment, a certain number of mismatches is allowed in one or both reads of the pair. For gapped alignment, a gap is permitted in only one read, and the other read must match exactly.
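The pairing condition above (correct relative orientation and proper distance) can be sketched as a simple predicate. This is only an illustration under stated assumptions: the `Hit` record, the convention that a proper pair has one read on each strand facing inwards, and the insert-size bounds passed by the caller are all hypothetical and are not taken from SOAP.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    pos: int      # leftmost reference coordinate of the aligned read
    length: int   # aligned read length in bases
    strand: str   # '+' or '-'


def is_proper_pair(a, b, min_insert, max_insert):
    """True if the two hits face each other and span an allowed distance.

    Assumes an inward-facing library: the '+' read lies upstream of the
    '-' read, and the span is measured from the start of the '+' read to
    the end of the '-' read.
    """
    if a.strand == b.strand:
        return False
    fwd, rev = (a, b) if a.strand == '+' else (b, a)
    if fwd.pos > rev.pos:
        return False                      # reads point away from each other
    span = rev.pos + rev.length - fwd.pos
    return min_insert <= span <= max_insert
```

A pair that fails this test is a candidate for separate single-read alignment or, when the individual hits are reliable, a hint of a structural variant.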

Apart from genome resequencing, high-throughput sequencing technology lends itself to numerous applications. For some applications (e.g. ChIP-Seq), the data analysis process is essentially identical to that of resequencing. Additionally, SOAP provides special modules for small RNA discovery and mRNA tag profiling analysis. Small RNAs are between 18 and 26 bp in size. As a result of the experimental protocol, the 3'-end of the RNA sequence is flanked by adapter sequence. SOAP filters out the adapter sequence and then aligns the remaining candidate small RNA to the reference sequence. A small RNA is annotated if an adapter sequence is detected and the insert sequence matches the reference sequence well. To account for sequencing errors, one or two mismatches can be allowed inside either the adapter or the candidate RNA region, according to user settings. In mRNA tag sequencing, there are two types of restriction enzyme digestion: (i) DpnII, which specifically recognizes the site ‘GATC’ and cuts a 16 bp tag after the site; (ii) NlaIII, which exclusively recognizes the site ‘CATG’ and cuts 17 bp downstream. SOAP checks for and trims off the 3'-end adapter sequence according to the enzyme type. Aligned hits should contain the enzyme site and have at most one mismatch in the tag region.
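The adapter filtering step for small RNA reads might look roughly like the sketch below. It is a simplified illustration, not SOAP's code: the function name, the `min_overlap` threshold and the rule of taking the shortest acceptable insert are assumptions. The 18 to 26 bp size range and the tolerance of a small number of adapter mismatches follow the text.

```python
def clip_adapter(read, adapter, min_insert=18, max_insert=26,
                 max_mismatches=1, min_overlap=6):
    """Return the candidate small RNA preceding the 3' adapter, or None.

    Tries each allowed insert length and accepts the first one for which
    the rest of the read matches the start of the adapter with at most
    `max_mismatches` differences. `min_overlap` (assumed) is the minimum
    number of adapter bases that must be visible in the read.
    """
    for insert_len in range(min_insert, max_insert + 1):
        tail = read[insert_len:]
        overlap = min(len(tail), len(adapter))
        if overlap < min_overlap:
            break                          # too little adapter left to judge
        # zip() stops at the shorter string, so only the overlap is compared
        mismatches = sum(x != y for x, y in zip(tail, adapter))
        if mismatches <= max_mismatches:
            return read[:insert_len]
    return None
```

The returned insert would then be aligned to the reference like any other read; a read for which no adapter is found is not annotated as a small RNA.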

SOAP uses a seed and hash look-up table algorithm to accelerate alignment. Both reads and reference sequences are converted to a numeric data type using 2-bits-per-base encoding. A read is compared with the reference sequence by an exclusive-OR operation, and the resulting value is used as an index into the look-up table to find how many bases differ. To balance memory usage against efficiency, SOAP uses an unsigned 3-byte data type for the table elements. To admit two mismatches, a read is split into four fragments; the two mismatches can fall in at most two of the fragments at the same time, so by trying all six combinations of two fragments as the seed, all hits with two mismatches can be caught (the algorithm is the same as in ELAND and Maq). Since mismatches are not allowed in gapped hits, SOAP uses an enumeration algorithm that tries to insert a continuous gap or delete a fragment at each possible position in a read. The algorithm outputs the same alignments as dynamic programming but runs much faster. Unlike ELAND and Maq, which load read sequences into memory and build seed index tables for the reads, SOAP loads the reference sequences into memory as an unsigned 3-byte array and builds seed index tables for all the reference sequences. Then, for each read, it creates seeds, searches the corresponding index table for candidate hits, performs the alignment and reports the results. The RAM required for storing the reference sequences and seed index tables can be calculated as:


Formula

where L is the total length of the reference sequences and S is the seed size. For a small reference such as yeast, with L = 12 Mb and a selected seed size S = 10 bp, about 200 MB of RAM is needed; for the whole human genome, with L = 3 Gb and a selected seed size S = 12 bp, about 14 GB of RAM in total is needed.
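The encoding and seeding ideas described above can be illustrated with a short Python sketch. It is a toy version under stated assumptions, not SOAP's code: the table here is indexed by 8-bit values (four bases) rather than SOAP's 3-byte elements, the function names are hypothetical, and the seed check compares a read against a single same-length target instead of querying an index of the reference.

```python
from itertools import combinations

CODE = {"A": 0, "C": 1, "G": 2, "T": 3}

# Look-up table: for each 8-bit XOR value (four bases), the number of
# non-zero 2-bit groups, i.e. the number of differing bases.
DIFF_TABLE = [sum(1 for shift in (0, 2, 4, 6) if (v >> shift) & 0b11)
              for v in range(256)]


def encode(seq):
    """Pack a DNA string into an integer using two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | CODE[base]
    return value


def count_mismatches(x, y):
    """Count differing bases between two equal-length encoded sequences."""
    diff = x ^ y                       # identical bases XOR to 00
    total = 0
    while diff:
        total += DIFF_TABLE[diff & 0xFF]
        diff >>= 8
    return total


def fragments(read, parts=4):
    """(start, end) bounds of `parts` contiguous, near-equal fragments."""
    bounds = [i * len(read) // parts for i in range(parts + 1)]
    return [(bounds[i], bounds[i + 1]) for i in range(parts)]


def seed_pairs(read):
    """The six ways of choosing two of the four fragments as a seed."""
    return list(combinations(fragments(read), 2))


def matching_seeds(read, target):
    """Seed pairs whose two fragments both match `target` exactly."""
    return [pair for pair in seed_pairs(read)
            if all(read[s:e] == target[s:e] for s, e in pair)]
```

Because two mismatches can touch at most two of the four fragments, at least one of the six fragment pairs is always left intact, so `matching_seeds` is never empty for a target that differs from the read at two positions.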

Evaluated on a real dataset of 9 914 527 Illumina-Solexa single-end resequencing reads (length 32 bp) generated from a 5 Mb human genome region, SOAP was almost 300 (gapped) to 1200 (ungapped) times faster than blastn, while having better sensitivity (Table 1). The iterative feature of SOAP improved sensitivity, and gapped alignment further identified hits containing small indels, which make up only a small fraction of all hits but are a very important class of mutation. Since SOAP loads the reference sequences into memory, while ELAND and Maq load the reads, memory usage varies between datasets.
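The enumeration approach to gapped alignment described earlier (one continuous gap of 1 to 3 bp with exactly matching flanks) can be sketched as follows. This is an illustrative reading of the rule, not SOAP's implementation: the function name, the fixed candidate start position and the `min_flank` parameter, which keeps both flanks long enough to be meaningful, are assumptions.

```python
def single_gap_alignments(read, ref, start, max_gap=3, min_flank=4):
    """Enumerate one-gap alignments of `read` beginning at ref[start].

    Returns a list of (kind, cut, size), where the gap opens after `cut`
    read bases. 'deletion' means `size` reference bases are missing from
    the read; 'insertion' means `size` read bases are missing from the
    reference. Both flanks must match exactly.
    """
    results = []
    n = len(read)
    for size in range(1, max_gap + 1):
        for cut in range(min_flank, n - min_flank + 1):
            if ref[start:start + cut] != read[:cut]:
                break                  # left flank broken; longer cuts fail too
            # Deletion: skip `size` reference bases after the left flank.
            right = read[cut:]
            r0 = start + cut + size
            if ref[r0:r0 + len(right)] == right:
                results.append(("deletion", cut, size))
            # Insertion: skip `size` read bases after the left flank.
            right = read[cut + size:]
            r0 = start + cut
            if len(right) >= min_flank and ref[r0:r0 + len(right)] == right:
                results.append(("insertion", cut, size))
    return results
```

In repetitive sequence the same indel can be placed at several positions, in which case the sketch returns every equivalent placement; how such ties are resolved is left open here.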


Table 1. Comparison of performance and sensitivity among short oligonucleotide alignment programs

 
SOAP accepts FASTA format for the reference, and either FASTA or FASTQ format for query reads. It is a command-driven program that offers a single command line mode and a batch computing mode. In batch mode, the reference sequences and hash index tables reside in memory, and the alignment procedure can be performed for multiple query datasets in sequence. This mode avoids the I/O and time wasted on loading the reference and creating the hash tables multiple times, and is also suitable for a real-time web service. SOAP is written in standard C++ and runs well on Macintosh or any 64-bit Linux/Unix system. It supports multithreaded parallel computing.


    ACKNOWLEDGEMENTS
We are indebted to Wei Fan, Qibin Li, Xiaodong Fang, Zhike Lu, Guoqing Li, Junjie Qin and the other users who tested the beta version of the program for identifying bugs and proposing all kinds of improvements. We thank Shengting Li for setting up the website. The project is supported by the National Natural Science Foundation of China (30725008), and grants from the Danish Natural Science Research Council (272-05-0344 and 272-07-0196).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Keith Crandall

Received on November 10, 2007; revised on December 20, 2007; accepted on January 14, 2008

    REFERENCES

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3402.

    Bentley DR. Whole-genome re-sequencing. Curr. Opin. Genet. Dev (2006) 16:545–552.

    Cox A. ELAND: Efficient Local Alignment of Nucleotide Data. (unpublished).

    Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res (2002) 12:656–664.

    Li H. Mapping and assembly with quality. (unpublished) http://maq.sourceforge.net/.

    Ning Z, et al. SSAHA: a fast search method for large DNA databases. Genome Res (2001) 11:1725–1729.

    Shendure J, et al. Advanced sequencing technologies: methods and goals. Nat. Rev. Genet (2004) 5:335–344.

