Skip Navigation


Bioinformatics Advance Access originally published online on August 12, 2008
Bioinformatics 2008 24(20):2395-2396; doi:10.1093/bioinformatics/btn429
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/20/2395    most recent
btn429v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Jiang, H.
Right arrow Articles by Wong, W. H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Jiang, H.
Right arrow Articles by Wong, W. H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

SeqMap: mapping massive amount of oligonucleotides to the genome

Hui Jiang 1 and Wing Hung Wong 2,*

1Institute for Computational and Mathematical Engineering and 2Department of Statistics, Stanford University, Stanford, California 94305, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 ACKNOWLEDGEMENTS
 REFERENCES
 

Summary: SeqMap is a tool for mapping large amount of short sequences to the genome. It is designed for finding all the places in a reference genome where each sequence may come from. This task is essential to the analysis of data from ultra high-throughput sequencing machines. With a carefully designed index-filtering algorithm and an efficient implementation, SeqMap can map tens of millions of short sequences to a genome of several billions of nucleotides. Multiple substitutions and insertions/deletions of the nucleotide bases in the sequences can be tolerated and therefore detected. SeqMap supports FASTA input format and various output formats, and provides command line options for tuning almost every aspect of the mapping process. A typical mapping can be done in a few hours on a desktop PC. Parallel use of SeqMap on a cluster is also very straightforward.

Contact: whwong{at}stanford.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 ACKNOWLEDGEMENTS
 REFERENCES
 
With the rapid development of sequencing technology, there are now several next generation sequencing platforms available. Enormous amount of short sequences can be generated by these sequencing machines in a short time. For example, the Illumina-Solexa system can generate over 50 million sequences of length 30–50 nt in a single run taking <3 days. Using a different technology, the ABI-SOLiD system can generate data at a similar rate. The Roche-454 system generates fewer, but longer sequences. To date, the sequencing technology is still in the developing phase with a very fast pace of increase in throughput.

Due to the large amount of data generated by the above systems, efficient algorithms for mapping short oligonucleotides to a reference genome are in great demand. Popular alignment programs that are designed to align smaller amount of longer sequences, such as BLAST (Altschul et al., 1997) and its successor BLAT (Kent, 2002), are not the best options to accomplish this task. Therefore, several short sequence mapping programs have been developed recently for this particular objective. They use effective techniques such as hashing and short key indexing, but differ in algorithm, implementation details and capability. ELAND (Cox, unpublished software), developed by Solexa, can map reads of length 20–32 nt to the genome, allowing up to two substitutions in the mapping. SOAP (Li et al., 2008) can also handle up to two substitutions, or a gap of 1–3 nt without any other substitution. It can also handle longer reads or pair-end reads. RMAP (Smith et al., 2008) is another program for ungapped mapping, which takes read qualities into account.

Compared to these existing programs, SeqMap offers more flexibility in the mapping. It allows up to five mixed substitutions and inserted/deleted nucleotides in the mapping, which is considered sufficient for most mapping applications. FASTA input format and various output formats (e.g. the ELAND format) are supported by SeqMap for the convenience of users. It also provides many command line options for tuning almost every aspect of the mapping process. For instance, SeqMap allows sequences to contain N's, and to have unequal lengths. Such flexibility is beneficial for the analysis since both sequencing errors and SNPs may cause substitutions and inserted/deleted nucleotides in the reads. It is especially useful for the analysis of cancer genomes where substitutions and insertions/deletions happen more often. For reads that are longer than 30 nt, the 3 bp mismatch mapping gives mostly true signal rather than noise. We map 11M RNA-Seq reads of length 30 nt from (Mortazavi et al., 2008) to mouse chr19 using 2 bp and 3 bp mismatch mapping, respectively. The 3 bp mismatch mapping gives 18.5% more uniquely mapped reads. This is achieved without large drop in specificity—60.3% of the uniquely mapped reads in 3 bp mismatch mapping are mapped to RefSeq genes versus 63.6% in 2 bp mismatch mapping. Moreover, there are other applications in which this flexibility is helpful, such as mapping of exon tiling array probes in (Xing et al., 2008) for probe level cross-hybridization analysis.

A website (http://biogibbs.stanford.edu/~jiangh/SeqMap/) has been setup for maintaining the SeqMap program, its source code and documentations.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 ACKNOWLEDGEMENTS
 REFERENCES
 
Many of the short sequence mapping programs, including ELAND, SOAP and RMAP, are based on the pigeonhole principle. This principle was also used in detecting near-duplicated web pages (Manku et al., 2007). The idea, also used by SeqMap, is to split each read into several parts. By requiring some of the parts instead of all of them to be perfectly matched in the mapping, the noncandidates can be filtered out very quickly. For example, for a mapping up to two substitutions, we can split each read into four parts. Since the substitutions can only occur in at most two of the four parts, at least two of the four parts will be perfectly matched to the target. Therefore, we can use the sequence combined from the two perfectly matched parts as the key to index all the candidates. A hash table is an effective and efficient data structure to implement this task. By enumerating all combinations of two parts chosen from all of the four parts, we need only six scans to find all the candidates. After that, a second stage of filtering is done between the sequence to be mapped and all the candidates from the first stage to determine all the targets in the genome. Insertions and deletions can also be incorporated into this algorithm in a similar fashion if carefully implemented. The only change will be that more rounds of scans are needed in the first stage of filtering. To allow one substitution and one insertion/deletion, each read will still be split into four parts. However, we need to scan not only all of the combinations of the two of the four parts, but also the combinations of the two parts with one of them shifted one nucleotide to its left or to its right.

SeqMap is written in ANSI C++. It uses bit operations to accelerate the mapping. Each nucleotide is encoded into 2 bits in the memory. In common with ELAND or RMAP, SeqMap indexes and hashes the reads before scanning the reference genome. This is different from some other alignment programs such as BLAT and SOAP, which indexes and hashes the reference genome instead. Hashing the genome usually needs large memory (e.g. SOAP needs 14 GB memory when mapping to the human genome) and therefore prohibits the program from running on most desktop PCs. As a comparison, SeqMap runs smoothly and quickly on a PC with 2 GB memory and a single 32-bit CPU when working with the human genome. It can also be compiled and run on any other platform, including 64-bit workstations with >16 GB memory, where it can take advantage of the large memory and the high performance CPU. Furthermore, by simply splitting the reads or the genome or both into several parts, SeqMap can be used in parallel on large scale data sets to speed up the mapping process.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 ACKNOWLEDGEMENTS
 REFERENCES
 
To evaluate the mapping efficiency and effectiveness of SeqMap, we take more than 11 million Solexa reads from RNA-Seq experiments (Mortazavi et al., 2008), and use several short sequence mapping programs including ELAND, SOAP, RMAP and SeqMap to map these reads to mouse chrX. ELAND, SOAP and RMAP are known to be among the best available short sequence mapping programs. Other alignment programs such as BLAST and BLAT are not evaluated because they have been shown to be much slower than the programs designed for mapping short reads (Li et al., 2008). We take the first 25 nt of the reads and do a mapping with up to two substitutions, since this is supported by all these programs. As we can see from Table 1, ELAND is the fastest, and SeqMap is faster than SOAP and RMAP. Given the fact that ELAND is optimized for <32 nt reads and up to two substitutions mapping only, it is reasonable that SeqMap is slower than ELAND since SeqMap can map longer reads, with more substitutions and even several additional insertions and deletions. In terms of the mapping accuracy, all four programs give similar results. SeqMap and ELAND gives the most number of mapped reads. The memory usage of the programs is also given in Table 1, where we can see that SeqMap and RMAP use more memory than ELAND and SOAP. We need to point out that the comparison with SOAP is very data and parameter dependent, since SOAP indexes the reference genome rather than the reads as in the other three programs.


View this table:
[in this window]
[in a new window]

 
Table 1. Benchmark results of SeqMap, ELAND, SOAP and RMAP

 


View this table:
[in this window]
[in a new window]

 
Table 2. Mapping 100 000 randomly perturbed reads with SeqMap, ELAND, SOAP and RMAP

 
To show that SeqMap can deal with more substitutions and also insertion/deletions, we randomly generate a DNA sequence of a length of 1 Mb, add to it 100 Kb random substitutions, N's and insertion/deletions, and then randomly sample 100k short reads of 30 nt from it. Finally, we map these short reads back to the original DNA sequence using all four programs. The results we get are shown in Table 2. We can see that SeqMap is able to detect a much larger number of matches.


    Funding
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 ACKNOWLEDGEMENTS
 REFERENCES
 
National Institute of Health (R01HG003903 and U54GM62119 to W.H.W.)

Conflict of Interest: none declared.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank Karen Kapur, Wenxiu Ma and Yi Xing for helpful advice and Ying Xu for the (Manku et al., 2007) reference.


    FOOTNOTES
 
Associate Editor: Limsoon Wong

Received on July 21, 2008; revised on August 8, 2008; accepted on August 8, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.[Abstract/Free Full Text]

    Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res. (2002) 12:656–664.[Abstract/Free Full Text]

    Li R, et al. SOAP: short oligonucleotide alignment program. Bioinformatics (2008) 24:713–714.[Abstract/Free Full Text]

    Manku GS, et al. Detecting near-duplicates for web crawling. In: Proceedings of the 16th international conference on World Wide Web (2007) ACM, New York, NY, USA. 141–150.

    Mortazavi A, et al. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nat. Methods (2008) 5:621–628.[CrossRef][Web of Science][Medline]

    Smith AD, et al. Using quality scores and longer reads improves accuracy of Solexa read mapping. BMC Bioinformatics (2008) 9:128.[CrossRef][Medline]

    Xing Y, et al. MADS: a new and improved method for analysis of differential alternative splicing by exon-tiling microarrays. RNA (2008) 14:1470–1479.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
D. S. Horner, G. Pavesi, T. Castrignano, P. D. De Meo, S. Liuni, M. Sammeth, E. Picardi, and G. Pesole
Bioinformatics approaches for genomics and post genomics applications of next-generation sequencing
Brief Bioinform, October 27, 2009; (2009) bbp046v1.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
Y. Chen, T. Souaiaia, and T. Chen
PerM: efficient mapping of short sequencing reads with periodic full sensitive spaced seeds
Bioinformatics, October 1, 2009; 25(19): 2514 - 2521.
[Abstract] [Full Text] [PDF]


Home page
Genome ResHome page
D. Weese, A.-K. Emde, T. Rausch, A. Doring, and K. Reinert
RazerS--fast read mapping with sensitivity control
Genome Res., September 1, 2009; 19(9): 1646 - 1654.
[Abstract] [Full Text] [PDF]


Home page
Gen Biol EvolHome page
X. Chen, L. J. Collins, P. J. Biggs, and D. Penny
High Throughput Genome-Wide Survey of Small RNAs from the Parasitic Protists Giardia intestinalis and Trichomonas vaginalis
Gen Biol Evol, July 24, 2009; 2009(0): 165 - 175.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
H. Li and R. Durbin
Fast and accurate short read alignment with Burrows-Wheeler transform
Bioinformatics, July 15, 2009; 25(14): 1754 - 1760.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
L. Lin, S. Liu, H. Brockway, J. Seok, P. Jiang, W. H. Wong, and Y. Xing
Using high-density exon arrays to profile gene expression in closely related species
Nucleic Acids Res., July 1, 2009; 37(12): e90 - e90.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
H. Bao, H. Guo, J. Wang, R. Zhou, X. Lu, and S. Shi
MapView: visualization of short reads alignment on a desktop computer
Bioinformatics, June 15, 2009; 25(12): 1554 - 1555.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
Y. Jung Kim, N. Teletia, V. Ruotti, C. A. Maher, A. M. Chinnaiyan, R. Stewart, J. A. Thomson, and J. M. Patel
ProbeMatch: rapid alignment of oligonucleotides to genome allowing both gaps and mismatches
Bioinformatics, June 1, 2009; 25(11): 1424 - 1425.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
H. Jiang and W. H. Wong
Statistical inferences for isoform expression in RNA-Seq
Bioinformatics, April 15, 2009; 25(8): 1026 - 1032.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
D. Campagna, A. Albiero, A. Bilardi, E. Caniato, C. Forcato, S. Manavski, N. Vitulo, and G. Valle
PASS: a program to align short sequences
Bioinformatics, April 1, 2009; 25(7): 967 - 968.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/20/2395    most recent
btn429v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (12)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Jiang, H.
Right arrow Articles by Wong, W. H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Jiang, H.
Right arrow Articles by Wong, W. H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?