Bioinformatics Advance Access originally published online on November 30, 2004
Bioinformatics 2005 21(7):1265-1266; doi:10.1093/bioinformatics/bti122
© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

MapLinker: a software tool that aids physical map-linked whole genome shotgun assembly

Jian Xu * and Jeffrey I. Gordon *

Center for Genome Sciences and Department of Molecular Biology and Pharmacology, Washington University School of Medicine, St. Louis, MO 63108, USA

*To whom correspondence should be addressed.


    Abstract

Summary: MapLinker is an analysis tool, as well as a browsing interface, that facilitates integration of whole genome sequence assembly with a clone-based physical map. Using the locations of sequence markers on the physical map, MapLinker generates a tentative sequence map of the genome that serves to verify the map and to guide genome-wide finishing.

Availability: MapLinker is freely available at http://gordonlab.wustl.edu/MapLinker

Contact: jgordon{at}molecool.wustl.edu, jxu{at}watson.wustl.edu

Identification of gaps among sequence contigs is typically the most difficult part of the finishing phase of whole-genome shotgun (WGS) sequencing projects. To facilitate this process, many projects use a fingerprinted BAC- or fosmid-based physical map generated from sets of overlapping clones, in addition to WGS assembly (e.g., Waterston et al., 2002). In WGS, reads are collected and assembled into a number of sequence contigs (draft sequencing). At this stage, even after taking into account read pair information, finishers usually encounter many unordered, unoriented sequence contigs or supercontigs. Repetitive elements may cause global misassemblies. A physical map, generated from BACs or fosmids that have been end-sequenced, provides a useful set of sequence tags that can be used to guide finishing and to confirm the correctness of the sequence assembly. However, placing sequence contigs onto a physical map is a tedious process: the amount of data that has to be analyzed is large, and the positions and orientations of sequence contigs have to be inferred from a wide variety of evidence. We have developed MapLinker to markedly reduce the human effort required to integrate information from physical maps and WGS assembly.

The principle behind this tool is that the relative positions and orientations of sequence contigs can be predicted based on the location of sequence markers on the physical map. If a sequence contig includes a BAC or fosmid end read, the contig can be tentatively anchored to the position of that BAC/fosmid on the physical map. However, since the orientations of BACs or fosmids are initially all unknown, there are two possibilities for the location and orientation of each contig. Figure 1 shows how this problem can be overcome by using the multiple ‘links’ that exist on the same sequence contig.
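The principle described above can be illustrated with a minimal Python sketch. This is not MapLinker's code (MapLinker is implemented in Perl/Tk); the `Link` record, the `infer_placement` function and the use of clone midpoints as map positions are assumptions made for illustration. The sketch also simplifies the method by using only the order of the links, not the strand of each end read. With a single link the orientation stays undecided; with two or more links from clones at different map positions, the order of the end reads along the contig is compared with the order of their clones along the map.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Link:
    """One BAC/fosmid end read placed in a sequence contig (hypothetical record)."""
    clone: str        # clone name
    map_pos: float    # clone midpoint on the physical map (map units)
    contig_pos: int   # end-read position within the sequence contig (bp)


def infer_placement(links: List[Link]) -> Tuple[float, Optional[str]]:
    """Return (approximate map position, orientation) for one sequence contig.

    Orientation is '+' if contig coordinates increase along the map,
    '-' if they decrease, and None if the links cannot decide.
    """
    if not links:
        raise ValueError("a contig needs at least one link to be anchored")
    position = sum(link.map_pos for link in links) / len(links)
    votes = 0
    # Compare every pair of links: does their order on the contig
    # agree with the order of their clones on the map?
    for i, a in enumerate(links):
        for b in links[i + 1:]:
            d_map = b.map_pos - a.map_pos
            d_contig = b.contig_pos - a.contig_pos
            if d_map == 0 or d_contig == 0:
                continue  # this pair carries no order information
            votes += 1 if (d_map > 0) == (d_contig > 0) else -1
    if votes > 0:
        return position, "+"
    if votes < 0:
        return position, "-"
    return position, None


links = [Link("bac_A", 100.0, 500), Link("bac_B", 300.0, 40_000)]
print(infer_placement(links))  # (200.0, '+')
```

A contig with only one link returns `None` for its orientation, which mirrors the two-possibilities case described in the text.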



Fig. 1. Algorithm used to predict the positions and orientations of WGS contigs on a BAC- or fosmid-based physical map. Black fragments represent BACs or fosmids. Blue fragments are sequence contigs (SeqContigs). Red arrowheads indicate the positions and relative orientation of BAC/fosmid-end reads on BAC/fosmids and SeqContigs. Each red dot represents a set of linking clones that spans two SeqContigs. Each green line is a ‘MapLinker’ that helps anchor a sequence contig onto the physical map. The following steps are executed by the algorithm: (1) Display Level I SeqContigs, i.e., those that can be unambiguously anchored with respect to position and orientation (as shown in the figure). (2) Display Level II SeqContigs, which are anchored because of their connection to Level I SeqContigs through linking BACs or fosmids. (3) Display Level III SeqContigs, which are anchored due to their connection to Level I SeqContigs through linking clones. The relative positions and orientations of Level I, II, and III SeqContigs are unambiguous. (4) Display Level IV SeqContigs. These are anchored because the corresponding BAC or fosmid was assembled in that position by FPC. The position and orientation of a Level IV SeqContig are ambiguous: it could be anchored to either end of the BAC/fosmid (as shown in the figure) since the orientation of the BAC/fosmid is unknown. The resulting positions and orientations of SeqContigs can be adjusted manually.

 
The input of MapLinker consists of two parts. The first is physical mapping data from fingerprint software such as FPC (Soderlund et al., 1997), where the relative physical coordinates and estimated size of BACs or fosmids are stored. The second part is sequence assemblies from software such as Phrap (Green, P., http://www.phrap.org/) and PCAP (Huang et al., 2003), where the location and orientation of all BAC/fosmid end-reads relative to sequence contigs are stored. All data are parsed and stored in a relational database. MapLinker, implemented in Perl/tk, acts as a client to query the database, analyzes the ‘conformation’ for each FPC contig, and presents the output in a graphic interface for user manipulation and evaluation (for a screenshot, see http://gordonlab.wustl.edu/MapLinker). Results of the analysis can be saved in the database and printed.
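To make the two-part input concrete, the following Python sketch stores both kinds of data in a small relational schema and joins them. It is an illustration only: the table and column names and the `build_demo_db` and `links_for_fpc_contig` helpers are hypothetical, the rows are made-up toy data, and SQLite (via Python's standard `sqlite3` module) stands in for the database server that MapLinker actually uses.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database holding toy map and assembly data."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE clone (            -- from the fingerprint map
            name        TEXT PRIMARY KEY,
            fpc_contig  TEXT NOT NULL,
            map_start   INTEGER NOT NULL,
            map_end     INTEGER NOT NULL
        );
        CREATE TABLE end_read (         -- from the sequence assembly
            read_name   TEXT PRIMARY KEY,
            clone       TEXT NOT NULL REFERENCES clone(name),
            seq_contig  TEXT NOT NULL,
            contig_pos  INTEGER NOT NULL,
            strand      TEXT NOT NULL CHECK (strand IN ('U', 'C'))
        );
    """)
    conn.executemany(
        "INSERT INTO clone VALUES (?, ?, ?, ?)",
        [("b1", "ctg1", 0, 150), ("b2", "ctg1", 100, 260), ("b3", "ctg2", 0, 140)],
    )
    conn.executemany(
        "INSERT INTO end_read VALUES (?, ?, ?, ?, ?)",
        [
            ("b1.f", "b1", "s1", 500, "U"),
            ("b2.r", "b2", "s1", 42000, "C"),
            ("b3.f", "b3", "s2", 900, "U"),
        ],
    )
    return conn


def links_for_fpc_contig(conn: sqlite3.Connection, fpc_contig: str) -> list:
    """Return every end-read link that anchors a sequence contig to one FPC contig."""
    return conn.execute(
        """
        SELECT e.seq_contig, c.name, c.map_start, c.map_end, e.contig_pos, e.strand
        FROM end_read AS e
        JOIN clone AS c ON c.name = e.clone
        WHERE c.fpc_contig = ?
        ORDER BY c.map_start, e.contig_pos
        """,
        (fpc_contig,),
    ).fetchall()


conn = build_demo_db()
for row in links_for_fpc_contig(conn, "ctg1"):
    print(row)
```

Keeping the map data and the assembly data in separate tables lets a client ask, per FPC contig, which sequence contigs are linked to it without re-parsing the input files.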

The positions of BACs or fosmids on the physical map may be inaccurate, mistakes may have been made during shotgun assembly, or there may be insufficient physical markers (or constraints). In these cases, MapLinker allows users to manipulate (e.g., move, invert or delete a sequence contig) and save the data output. It also provides a number of features that help the user to improve the analysis. For example, when a sequence contig has links from two distant parts of the physical map, the location supported by the most links will be used: links that are inconsistent are marked and relevant information is provided.
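The rule that the location supported by the most links is used can be sketched as a simple vote. The Python below is an assumption-laden illustration, not MapLinker's implementation: the `choose_location` function is hypothetical, each link is reduced to a (clone, map region) pair, and a tie is left undecided for manual review, which the source does not specify.

```python
from collections import Counter
from typing import List, Optional, Tuple


def choose_location(links: List[Tuple[str, str]]) -> Tuple[Optional[str], List[str]]:
    """Pick the map region supported by the most links for one sequence contig.

    `links` holds (clone name, map region) pairs. Returns the winning region
    and the clones whose links disagree with it. On a tie, the region is None
    and every clone is reported so that a user can decide.
    """
    if not links:
        raise ValueError("no links to vote on")
    ranked = Counter(region for _, region in links).most_common()
    best_region, best_count = ranked[0]
    if len(ranked) > 1 and ranked[1][1] == best_count:
        return None, sorted(clone for clone, _ in links)
    inconsistent = sorted(clone for clone, region in links if region != best_region)
    return best_region, inconsistent


links = [("b1", "ctg12"), ("b2", "ctg12"), ("b3", "ctg40")]
print(choose_location(links))  # ('ctg12', ['b3'])
```

Returning the disagreeing clones alongside the chosen region corresponds to marking inconsistent links rather than silently discarding them.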

The utility of MapLinker was tested as we sequenced the complete 6.3 Mb circular chromosome of Bacteroides thetaiotaomicron, a prominent bacterial symbiont in the microbial community that resides in the distal adult human intestine (Xu et al., 2003). Our sequencing strategy involved the following steps: (1) creation of a physical map composed of overlapping BAC clones, each of which was fingerprinted by digestion with a restriction endonuclease (Marra et al., 1997) and end-sequenced; (2) end-sequencing of two WGS libraries, and assembly of the data, together with the BAC end reads, into sequence contigs using Phrap; (3) MapLinker-based anchoring of sequence contigs onto the physical map with coincident (i) predictions of misassemblies based on discrepancies between sequence contigs and the physical map, and (ii) identification of physical gaps in addition to sequence gaps; (4) verifying predicted gaps by PCR, and closing them by sequencing linking clones or PCR fragments. This approach made full use of the time and cost advantage of WGS, and with the help of the physical map, allowed the finishing process to proceed with simultaneous targeting of sequence and physical gaps. We also coupled MapLinker with annotation tools such as Artemis (Rutherford et al., 2000) so that the gene content of individual sequence contigs could be examined as finishing progressed.
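The distinction between sequence gaps and physical gaps in step (3) can be illustrated with a short Python sketch. It rests on an interpretive assumption that is not spelled out in the text: a gap between two neighboring anchored contigs is treated as a sequence gap when at least one clone has end reads in both contigs, and as a possible physical gap otherwise. The `classify_gaps` function and its inputs are hypothetical.

```python
from typing import Dict, List, Set, Tuple


def classify_gaps(
    ordered_contigs: List[str],
    clones_by_contig: Dict[str, Set[str]],
) -> List[Tuple[str, str, str, List[str]]]:
    """Label the gap between each pair of neighboring anchored contigs.

    `ordered_contigs` lists sequence contigs in map order.
    `clones_by_contig` maps a contig to the clones with an end read in it.
    Each result is (left contig, right contig, gap kind, spanning clones).
    """
    gaps = []
    for left, right in zip(ordered_contigs, ordered_contigs[1:]):
        shared = clones_by_contig.get(left, set()) & clones_by_contig.get(right, set())
        kind = "sequence gap" if shared else "possible physical gap"
        gaps.append((left, right, kind, sorted(shared)))
    return gaps


order = ["s1", "s2", "s3"]
clones = {"s1": {"b1", "b2"}, "s2": {"b2"}, "s3": {"b9"}}
for gap in classify_gaps(order, clones):
    print(gap)
```

Under this assumption, the spanning clones reported for a sequence gap are natural candidates for the linking clones that could be sequenced to close it.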

MapLinker can also be useful for physical mapping when draft sequences are available. Even a low-coverage draft WGS assembly can help merge small FPC contigs into larger ones when a sequence contig or supercontig spans two FPC contigs. Furthermore, conflicts in the order or location of sequence contigs suggested by WGS assembly versus the physical map serve as warning signs of misassembly of sequence contigs or FPC contigs. For example, MapLinker was used to integrate draft WGS assembly into the process of producing a high-quality physical map of Histoplasma capsulatum strain G217B (Magrini et al., 2004; estimated genome size, 39 Mb): MapLinker automatically generated a list of potential connections between FPC contigs that were later verified by manual examination of the restriction patterns of overlapping clones, and local misassemblies of BACs/fosmids on the physical map were identified in the MapLinker interface.
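A list of potential connections between FPC contigs of the kind mentioned above could be derived along the following lines. This Python sketch is an illustration under stated assumptions, not the procedure used by MapLinker: the `merge_candidates` function is hypothetical, and each end-read link is reduced to a (sequence contig, FPC contig) pair.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Iterable, List, Tuple


def merge_candidates(
    placements: Iterable[Tuple[str, str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Find pairs of FPC contigs that share at least one sequence contig.

    `placements` holds one (sequence contig, FPC contig) pair per link.
    The result maps each FPC contig pair to the sequence contigs spanning it.
    """
    fpc_by_seq = defaultdict(set)
    for seq_contig, fpc_contig in placements:
        fpc_by_seq[seq_contig].add(fpc_contig)

    candidates = defaultdict(list)
    for seq_contig in sorted(fpc_by_seq):
        # A sequence contig linked to two or more FPC contigs suggests
        # that those FPC contigs may be adjacent on the genome.
        for pair in combinations(sorted(fpc_by_seq[seq_contig]), 2):
            candidates[pair].append(seq_contig)
    return dict(candidates)


placements = [
    ("s1", "ctg1"), ("s1", "ctg2"),
    ("s2", "ctg2"),
    ("s3", "ctg2"), ("s3", "ctg1"),
]
print(merge_candidates(placements))  # {('ctg1', 'ctg2'): ['s1', 's3']}
```

Each candidate pair would still need independent confirmation, for example by examining the restriction patterns of the overlapping clones as described in the text.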

MapLinker was tested on two much larger genomes that are currently being finished: Gallus gallus (1.1 Gb; Washington University Genome Sequencing Center) and Mus musculus (2.8 Gb; Mouse Genome Sequencing Consortium). Given the fingerprinted physical maps and the whole genome assemblies, it took less than 8 h for MapLinker to automatically parse the text-based input files (.fpc and .ace) for either genome and load them into a relational database. MapLinker was then able to automatically construct a genome map in less than 4 h [tests performed on a MySQL (Version 4.0.18) server with two Pentium 4 Xeon 3.2 GHz processors and 2 GB of RAM].

In summary, MapLinker can be used as an analysis tool and as a browsing interface to fulfill a number of important functions in genome sequencing projects of varying size. It predicts the relative order and orientations of sequence contigs, allows prediction of physical or sequence gaps among these contigs, detects misassemblies, and permits ready comparisons of assemblies produced by different assembly programs. In addition to its utility for evaluating and monitoring progress during finishing, it provides a framework for annotation of draft (6X coverage), improved draft (8X coverage), or fully finished genomes. MapLinker can also be used for improving physical maps using a sequence assembly.


    Acknowledgments
 
This work was supported by grants from the NIH (DK30292) and NSF (0333284).

Received on August 3, 2004; revised on October 19, 2004; accepted on October 25, 2004

    REFERENCES

    Huang, X., Wang, J., Aluru, S., Yang, S.P., Hillier, L. (2003) PCAP: a whole-genome assembly program. Genome Res., 13, 2164–2170.

    Magrini, V., Warren, W.C., Wallis, J., Goldman, W.E., Xu, J., Mardis, E.R., McPherson, J.D. (2004) Fosmid-based physical mapping of the Histoplasma capsulatum genome. Genome Res., 14, 1603–1609.

    Marra, M.A., Kucaba, T.A., Dietrich, N.L., Green, E.D., Brownstein, B., Wilson, R.K., McDonald, K.M., Hillier, L.W., McPherson, J.D., Waterston, R.H. (1997) High throughput fingerprint analysis of large-insert clones. Genome Res., 7, 1072–1084.

    Rutherford, K., Parkhill, J., Crook, J., Horsnell, T., Rice, P., Rajandream, M.A., Barrell, B. (2000) Artemis: sequence visualization and annotation. Bioinformatics, 16, 944–945.

    Soderlund, C., Longden, I., Mott, R. (1997) FPC: a system for building contigs from restriction fingerprinted clones. Comput. Appl. Biosci., 13, 523–535.

    Waterston, R.H., Lindblad-Toh, K., Birney, E., Rogers, J., Abril, J.F., Agarwal, P., Agarwala, R., Ainscough, R., Alexandersson, M., An, P., et al. (2002) Initial sequencing and comparative analysis of the mouse genome. Nature, 420, 520–562.

    Xu, J., Bjursell, M.K., Himrod, J., Deng, S., Carmichael, L.K., Chiang, H.C., Hooper, L.V., Gordon, J.I. (2003) A genomic view of the human-Bacteroides thetaiotaomicron symbiosis. Science, 299, 2074–2076.





