

Bioinformatics Advance Access originally published online on January 18, 2008
Bioinformatics 2008 24(6):744-750; doi:10.1093/bioinformatics/btm608

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

A machine-learning approach to combined evidence validation of genome assemblies

Jeong-Hyeon Choi 1,*, Sun Kim 1,2, Haixu Tang 1,2, Justen Andrews 1,3, Don G. Gilbert 1,3 and John K. Colbourne 1

1 The Center for Genomics and Bioinformatics, 2 School of Informatics and 3 Department of Biology, Indiana University, IN 47405, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: While it is common to refer to ‘the genome sequence’ as if it were a single, complete and contiguous DNA string, it is in fact an assembly of millions of small, partially overlapping DNA fragments. Sophisticated computer algorithms (assemblers and scaffolders) merge these DNA fragments into contigs, and place these contigs into sequence scaffolds using the paired-end sequences derived from large-insert DNA libraries. Each step in this automated process is susceptible to producing errors; hence, the resulting draft assembly represents (in practice) only a likely assembly that requires further validation. Knowing which parts of the draft assembly are likely free of errors is critical if researchers are to draw reliable conclusions from the assembled sequence data.

Results: We develop a machine-learning method to detect assembly errors in sequence assemblies. Several in silico measures for assembly validation have been proposed by various researchers. Using three benchmarking Drosophila draft genomes, we evaluate these techniques along with some new measures that we propose, including the good-minus-bad coverage (GMB), the good-to-bad ratio (RGB), the average Z-score (AZ) and the average absolute Z-score (ASZ). Our results show that the GMB measure performs better than the others in both its sensitivity and its specificity for assembly error detection. Nevertheless, no single method performs sufficiently well to reliably detect genomic regions requiring attention for further experimental verification. To utilize the advantages of all these measures, we develop a novel machine-learning approach that combines these individual measures to achieve a higher prediction accuracy (i.e. greater than 90%). Our combined evidence approach avoids the difficult and often ad hoc selection of the many parameters that the individual measures require, and significantly improves the overall precision on the benchmarking data sets.

Availability: http://people.cgb.indiana.edu/jeochoi/gav/

Contact: jeochoi{at}indiana.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
Since the shotgun strategy was first introduced by Sanger and colleagues in sequencing the genome of bacteriophage λ (Sanger et al., 1982), significant progress has been made in applying this method to progressively larger genomes. These improvements stem mainly from rapid advances in DNA sequencing technologies and in algorithms for DNA fragment assembly. Today, whole genome shotgun (WGS) sequencing and fragment assembly are applied to entire eukaryotic genomes in a nearly fully automatic fashion; compared to phage genomes of ~2–200 kilobases, some sequenced eukaryotic genomes contain several billion base pairs, like those of human (Venter et al., 2001), mouse (Mouse Genome Sequencing Consortium, 2002), dog (Lindblad-Toh et al., 2005), chicken (International Chicken Genome Sequencing Consortium, 2004), and opossum (Mikkelsen et al., 2007). Yet despite its high throughput and low cost in reaching a draft genome sequence assembly, WGS sequencing is still ineffective at ultimately achieving an accurate and complete genome sequence (see footnote 1). A finished genome project requires extensive (and expensive!) efforts to validate the draft assembly and to fill in sequence gaps (Green, 1997). As a result, most WGS assemblies are released in ‘draft’ form, whose quality is seldom reported and largely unknown.

Although draft genome sequences are incomplete, they are extremely valuable for biologists, especially for the functional and evolutionary study of protein-coding genes. However, recent analyses show disturbingly large numbers (from hundreds to thousands for each genome) of potential mistakes in draft genome sequences (Salzberg and Yorke, 2005). Note that these mistakes are not individual base-calling errors, which are often corrected with additional sequencing. Instead, they are mis-assemblies: assembled sequence fragments that incorrectly delete long stretches of DNA or misplace their location and/or orientation. Most of these mis-assemblies result from repetitive DNA, which is a common feature of large eukaryotic genomes and some microbial genomes. Fragment assemblers mainly follow the ‘overlap-layout-consensus’ paradigm (Bonfield et al., 1995; Kececioglu and Myers, 1995; Kim et al., 2007; Myers, 1995). Since many repeated DNA segments have nearly identical sequences, assemblers mistakenly overlap sequence reads belonging to different copies of a repeat in the genome (Pop et al., 2002). With pseudo-overlaps incorrectly promoted to the layout step, the assemblers can create large-scale rearrangements of DNA segments in the final consensus genome sequence (Tang, 2007).

Unlike base-calling errors, mis-assemblies are rarely fixed, since finishing projects mainly focus on generating additional reads targeting gaps in the genome sequence and do not attempt to validate the existing assemblies. Furthermore, even if finishing projects were to experimentally verify assemblies, until recently there was no automatic method to help guide these experiments toward areas deserving attention (Nelson et al., 2005; Schmutz et al., 2003, 2004; West et al., 2006). However, several types of measures are known to help identify assembly errors. For example, assembly errors often appear within regions that have low read coverage (RC), contain chimeric or recombined reads, or have paired-end reads that are wrongly oriented or placed at compressed distances. Based on these commonly observed characteristics, several simple measures have been suggested for assembly validation, such as measuring the RC and the clone coverage (CC) (Table 1). More sophisticated methods include the proposals made by Kim and Liao (Kim et al., 2001), which are based on null distributions built from randomly sampled reads, and an entropy measure within contigs derived from probabilistic models. Sutton and colleagues detected the breakpoints between mis-assembled DNA segments by scanning the number of unsatisfied mate-pairs, i.e. the paired reads from clone ends that show distances (measured in kilobases) within the assembly that deviate from the known distribution of insert sizes in sequenced genomic libraries (Dew et al., 2005). Yorke and colleagues proposed the compression/expansion (CE) statistics for unsatisfied mate-pairs, and identified the regions containing potentially collapsed repeats (Zimin et al., 2005). Finally, visualization tools for assembly validation have also been developed. For example, BACCardI is a graphical tool for the construction of virtual clone maps by using paired-end reads (Bartels et al., 2005). Hawkeye is a visual analytic tool for fragment assembly, which can be used to aid in manually finding and correcting mis-assemblies (Schatz et al., 2007).


Table 1. Summary of individual measures for detecting potential errors in draft genome assemblies, including several new measures proposed here (see text for detailed descriptions)

 
In this article, we propose several new measures for assembly validation. We compare their performance with those of existing measures, by evaluating them using three benchmarking Drosophila draft genomes. Our results show that, although the new measures are more accurate than the existing ones, the performance of any individual measure is not satisfactory for all assembled genomes. To improve on the potential strengths of each formulation in general applications, we develop a machine learning approach that combines multiple measures, and we demonstrate that this combined evidence approach is far better than any individual measure, and generally achieves acceptable results in genome assembly validation. The implementation of this algorithm provides a useful software tool for the genome sequencing communities and for biologists to gain confidence in the draft genome sequences.


    2 METHODS
2.1 Individual measures for assembly validation
We explore the performance of five existing and four new measures on assembly validation. Table 1 summarizes the definition of all nine individual measures.

RC and CC are two basic measures that count the number of local reads and clones (i.e. paired-end reads), respectively, which span a specific segment in the genome assembly. A deviation of the RC/CC from the average coverage along the entire genome (e.g. RC = 8, for a typical shotgun genome sequencing project) indicates a putative mis-assembly within this segment, such as the collapse of repeats or the insertion of a DNA segment. As an example that utilizes this information, the CE statistic (Zimin et al., 2005) computes the length distribution of the clones spanning a specific genomic segment, and compares it with the length distribution of all assembled clones. To define CE precisely, we begin by classifying clones into four groups (Fig. 1):

  • intra-contig clones: paired-end reads placed within the same contig;
  • intra-scaffold clones: paired-end reads placed within the same scaffold, yet anchored in different contigs;
  • inter-scaffold clones: paired-end reads placed among different scaffolds; and
  • half-placed clones: one of the paired-end reads placed in the assembly, but not the other.
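The four groups can be told apart from where the two reads of a clone were placed. The following is a minimal sketch, not the authors' implementation; the `ReadPlacement` record and the extra "unplaced" label (for clones with neither read in the assembly) are assumptions introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ReadPlacement:
    """Hypothetical record: where one read of a pair landed in the assembly."""
    scaffold: str
    contig: str


def classify_clone(left: Optional[ReadPlacement],
                   right: Optional[ReadPlacement]) -> str:
    """Assign a clone to one of the four groups listed above.

    `None` means that the read was not placed in the assembly.
    """
    if left is None and right is None:
        # Not one of the four groups: neither end is in the assembly.
        return "unplaced"
    if left is None or right is None:
        return "half-placed"
    if left.scaffold != right.scaffold:
        return "inter-scaffold"
    if left.contig != right.contig:
        return "intra-scaffold"
    return "intra-contig"
```

Checking the scaffold before the contig keeps the order of the tests consistent with the definitions: two reads in different scaffolds are necessarily in different contigs, so that case must be ruled out first.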
For each size-specific DNA library, the clone length distribution is calculated from the respective intra-contig clones by counting the number of nucleotides spanning the length of the assembly between the paired-end reads. Not surprisingly, clone length distributions best fit a Gaussian model with narrow skews, as shown in Figure 2. After we obtain the overall length distributions, we compute the mean and standard deviation of the clone lengths. Then, the CE statistics are defined as the magnitude in standard deviations by which the average length of local clones differs from the average length of all clones for each clone library (Zimin et al., 2005). Similar to the CE statistics, the Z-score is defined as the number of standard deviations by which the length of one clone differs from the average length of all clones from the same clone library. An intra-contig or intra-scaffold clone is called good if the absolute Z-score of its length is smaller than a threshold (e.g. 2); otherwise the clone is called bad, as shown in Figure 1. All half-placed clones and clones with paired-end reads that are placed in the same or outer orientation (Fig. 1A) are also called bad. After good and bad clones are classified, we compute the measure of good-minus-bad (GMB) by subtracting the number of bad clones from the number of good clones that span a specific genomic segment in the assembly. Similarly, the measure of good-to-bad ratio (RGB) is computed as the logarithmic ratio between the numbers of good and bad clones. Finally, the measure of average Z-score (AZ) is computed by averaging the Z-scores of local clones, and the measures of positive and negative Z-scores (ASZ) are computed by averaging the positive and negative Z-scores, respectively. GMB was tested previously on the assembly of a small bacterial genome, Mycoplasma genitalium, using Phrap. In this article, we extend and improve the method for assemblies of larger genomes.
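The clone-based measures defined above can be sketched for one genomic segment and one clone library as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the `extra_bad` argument (clones that are bad regardless of length, such as half-placed or wrongly oriented ones), the natural logarithm and the pseudocount in RGB (to avoid division by zero), and the use of the standard error σ/√n as the scale for CE are all choices made here, since the text does not fix them.

```python
import math
from statistics import mean


def z_score(length, lib_mean, lib_sd):
    """Number of standard deviations a clone length lies from its library mean."""
    return (length - lib_mean) / lib_sd


def local_measures(lengths, lib_mean, lib_sd, threshold=2.0,
                   extra_bad=0, pseudocount=1.0):
    """Compute GMB, RGB, AZ and CE for the clones spanning one segment.

    lengths   -- lengths of the intra-contig/intra-scaffold clones spanning it
    extra_bad -- spanning clones that are bad regardless of length (assumed input)
    """
    zs = [z_score(x, lib_mean, lib_sd) for x in lengths]
    good = sum(1 for z in zs if abs(z) < threshold)
    bad = (len(zs) - good) + extra_bad
    measures = {
        "good": good,
        "bad": bad,
        "GMB": good - bad,
        # Assumption: natural log with a pseudocount; the base is not specified.
        "RGB": math.log((good + pseudocount) / (bad + pseudocount)),
        "AZ": mean(zs) if zs else 0.0,
        "CE": 0.0,
    }
    if lengths:
        # Assumption: the local mean is scaled by the standard error of the mean.
        standard_error = lib_sd / math.sqrt(len(lengths))
        measures["CE"] = (mean(lengths) - lib_mean) / standard_error
    return measures


# Hypothetical segment in a library with mean 3699 bp and s.d. 355 bp.
example = local_measures([3700, 3600, 3900, 2500, 5000], 3699, 355, extra_bad=1)
```

In this made-up example three clones fall within two standard deviations of the library mean and two do not; together with the one clone that is bad by placement, the segment has as many bad clones as good ones, so GMB is 0.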


Fig. 1. Clone classification. (A) intra-contig clones; (B) intra-scaffold clones; (C) inter-scaffold clones; (D) half-placed clones. The green and red lines represent good and bad clones, respectively.

 

Fig. 2. Distribution of clone lengths deduced from locations of paired-end reads placed in the 2004 draft assembly of Drosophila virilis. The Gaussian distribution (µ = 3699, σ = 355) fitting the library of 3704 bp is shown as the blue dotted line.

 
2.2 Machine-learning approach to combined evidence assembly validation
We test five different machine-learning algorithms that are implemented in the Weka package (Witten and Eibe, 2001), including the decision tree (J48) (Quinlan, 1993), the random forest (RF) (Breiman, 2001), the random tree (RT) (Dietterich, 2000), the naïve Bayes classifier (NB) (Duda and Hart, 1973) and the Bayesian network (BN) (Heckerman et al., 1995). For each method, we supply three kinds of features: coverage statistics, length statistics and repeat measurements. The coverage statistics include RC, CC, GMB and RGB; the length statistics include the average Z-score, the average positive and negative Z-scores and the CE statistics; the repeat measurements include the number of repeats identified by RepeatMasker and by a self-comparison of the genomic segment.

2.3 Evaluation of predicted mis-assembled regions
To evaluate the assembly validation methods, we use both simulated datasets and draft assemblies from eukaryotic genome projects.

For each simulated dataset, we first generate a random DNA sequence of length 10 Mbp with 70% GC content, then insert multiple copies of nearly identical (~99%) repeats with a total length of 3.5 Mbp, consisting of 200, 1000, 2000 and 5000 copies of repeats of length 5000, 1000, 500 and 100 bp, respectively. The average difference between repeat copies is set to 1% substitutions and 1% indels. Afterwards, we randomly sample 107 994 paired-end reads (i.e. coverage ~8) with expected clone lengths of 3, 10 and 150 Kbp, each allowed 2% variance. The clone length, i.e. the distance between the two reads of a pair, is drawn from a uniform distribution. The reads are sampled with an expected length of 1000 bp, again allowed 2% variance, and their quality scores are all assigned the same value. The sequencing error rate is assumed uniform across the reads, and is set at various levels (i.e. 0.001, 0.003 and 0.005) to test the performance of the validation methods under different conditions. Finally, the simulated reads are assembled automatically using the Arachne assembler (Batzoglou et al., 2002), and the resulting assemblies are used for evaluating the assembly validation methods.
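The figures in this paragraph can be checked with a little arithmetic, and the construction of the repeat-rich sequence can be sketched at a reduced scale. The code below is only an illustration: all function names are hypothetical, repeat copies are diverged by substitutions only (indels are omitted for brevity), and the string-slicing insertion is meant for small test sequences, not for a 10 Mbp genome.

```python
import random

# (copies, length in bp) of each repeat family, as stated in the text.
REPEAT_FAMILIES = [(200, 5000), (1000, 1000), (2000, 500), (5000, 100)]


def total_repeat_length(families):
    """Total number of inserted repeat bases."""
    return sum(copies * length for copies, length in families)


def expected_coverage(n_reads, read_len, genome_len):
    """Average number of reads covering each base."""
    return n_reads * read_len / genome_len


def random_sequence(length, gc_content, rng):
    """Random DNA in which G and C together make up `gc_content` of the bases."""
    at, gc = (1 - gc_content) / 2, gc_content / 2
    return "".join(rng.choices("ACGT", weights=[at, gc, gc, at], k=length))


def mutate(seq, sub_rate, rng):
    """Diverge a repeat copy by random substitutions (indels not modelled)."""
    out = []
    for base in seq:
        if rng.random() < sub_rate:
            base = rng.choice([b for b in "ACGT" if b != base])
        out.append(base)
    return "".join(out)


def simulate_genome(background_len, families, gc_content, sub_rate, seed=0):
    """Insert diverged repeat copies at random positions of a random background.

    Later insertions may land inside earlier copies; this sketch allows that.
    """
    rng = random.Random(seed)
    genome = random_sequence(background_len, gc_content, rng)
    for copies, length in families:
        unit = random_sequence(length, gc_content, rng)
        for _ in range(copies):
            pos = rng.randrange(len(genome) + 1)
            genome = genome[:pos] + mutate(unit, sub_rate, rng) + genome[pos:]
    return genome


# 3.5 Mbp of repeats on a 10 Mbp background gives about 13.5 Mbp in total.
genome_len = 10_000_000 + total_repeat_length(REPEAT_FAMILIES)
coverage = expected_coverage(107_994, 1000, genome_len)  # close to 8
```

A scaled-down call such as `simulate_genome(10_000, [(2, 500), (5, 100)], 0.7, 0.01)` yields an 11 500 bp sequence that is convenient for testing downstream code.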

In addition to the simulated data, we also use the draft assemblies of three Drosophila species, i.e. D. mojavensis, D. erecta and D. virilis, for evaluation. For each genome, we choose two versions of draft assemblies that were generated using Arachne in 2004 and 2005, respectively. The mis-assemblies in the test draft assemblies are determined by aligning the assembled contigs with the corresponding final genome sequences, downloaded from the Drosophila Assembly/Alignment/Annotation website (http://rana.lbl.gov/drosophila/). After alignment, each contig in the draft assembly may contain matched and unmatched regions (Fig. 3). A matched region may be further classified as (1) a unique match, e.g. the segment [a, b] (Fig. 3A), (2) two or more overlapping matches, e.g. the segment [c, e] or (3) a match along with other alternative matches, e.g. the segment [f, g]. We then define two classes of breakpoints: (1) the breakpoints corresponding to the overlapping matches, e.g. around d with overlapping matches [c, d] and [d, e]; (2) the breakpoints defined by unmatched regions, e.g. [b, c] and [e, f]. These breakpoints are considered true mis-assemblies, and are used as the reference to evaluate assembly validation methods. We note that these breakpoints typically represent very short genome regions; hence, it is difficult to predict their accurate positions within mis-assembled regions (Fig. 3B). Therefore, we consider a prediction (of mis-assembly) to be true if it is located within 500 bp of an actual breakpoint.


Fig. 3. Determining true mis-assembly regions by comparing the draft assembly with a finished genome. (A) Classification of mis-assembly breakpoints. The matches between a contig in the draft assembly and the finished genome are represented by the blue and red lines. The first match is unique. Therefore, there is no evidence of mis-assembly. The second and third matches along the contig have overlapping matches, which represent mis-assembly regions. In total, seven breakpoints (a–g) are considered true mis-assemblies from this evaluation. (B) Uncertainties up to 500 bp are allowed in the predicted mis-assembly breakpoints for a fair evaluation of the validation methods. To achieve this, a contig is split into blocks of 500 bp, and the numbers of blocks that are true positives (TP), false positives (FP), false negatives (FN) and true negatives (TN) are counted at the block level. For this example, if the blue lines are mis-assembly regions predicted by a validation method and the red blocks are true mis-assembled blocks, we count 5 TPs, 7 FPs, 1 FN and 11 TNs, respectively.
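The block-level counting described in panel (B) can be sketched as follows. This is one plausible reading, not the authors' code: a block is treated as truly mis-assembled if it contains a breakpoint and as predicted if it overlaps a predicted region, and the function name, the half-open interval convention and the toy coordinates are assumptions for illustration (they do not reproduce the counts in the figure).

```python
def block_confusion(contig_len, predicted_regions, true_breakpoints, block=500):
    """Count TP, FP, FN and TN over fixed-size blocks of one contig.

    predicted_regions -- list of half-open intervals (start, end), 0-based
    true_breakpoints  -- list of 0-based breakpoint positions
    """
    n_blocks = -(-contig_len // block)  # ceiling division
    predicted = set()
    for start, end in predicted_regions:
        first = max(start, 0) // block
        last = (min(end, contig_len) - 1) // block
        predicted.update(range(first, last + 1))
    truth = {pos // block for pos in true_breakpoints if 0 <= pos < contig_len}
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    tn = n_blocks - tp - fp - fn
    return tp, fp, fn, tn


# Toy contig of 5000 bp (10 blocks): one predicted region, two breakpoints.
counts = block_confusion(5000, [(400, 1100)], [1050, 3200])
```

Here the predicted region touches blocks 0 to 2 and the breakpoints fall in blocks 2 and 6, so the result is 1 TP, 2 FPs, 1 FN and 6 TNs.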

 
The draft assemblies of all three Drosophila genomes are downloaded from the AAA (Assembly, Alignment and Annotation of the now 12 sequenced Drosophila genomes) website at http://rana.lbl.gov/drosophila/. The statistics of these published genome sequence assemblies and the versions of draft assemblies are listed in Table 2 and Supplementary Table 1. The final assemblies in CAF1 (Comparative Analysis Freeze 1), which are used as the finished genome sequences in our experiments, are reconciliations of independent assemblies performed using Arachne and the Celera Assembler (Zimin et al., 2005). The statistics of these genomes, as published on FlyBase (http://flybase.bio.indiana.edu/), are listed in Table 3.


Table 2. Statistics of draft assemblies for D. erecta, D. mojavensis and D. virilis

 

Table 3. Statistics describing the finished genome sequences (i.e. CAF1) of three Drosophila species that are used for benchmarking purposes in this article (FlyBase http://flybase.bio.indiana.edu/)

 
2.4 Implementation details
We implemented the methods described above in C++ and Perl. The whole program consists of three steps. First, we compute all single measures for the given training and testing draft assemblies from the read layout (in ACE or Washington University format) and the mate-pair information (in tab-delimited format or the XML format used by the NCBI Trace Archive). Next, we analyze the repetitive structure of the draft assembly with RepeatMasker and with a self-comparison using BLAST and MUMmer. Finally, based on these pre-computed measures, we predict putative mis-assembly regions using the Weka package with the data prepared in the first and second steps.
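As an illustration of the last step, the per-block measures could be handed to Weka as an ARFF file, a plain-text format that Weka reads. The sketch below is an assumption about how such a file might be produced; the source does not state which input format the authors used, and the function name, the feature subset and the class labels are hypothetical.

```python
def to_arff(relation, feature_names, rows,
            class_values=("correct", "misassembled")):
    """Render numeric feature vectors with a nominal class as ARFF text.

    rows -- iterable of (features, label) pairs, one per genomic block
    """
    lines = [f"@relation {relation}", ""]
    for name in feature_names:
        lines.append(f"@attribute {name} numeric")
    lines.append("@attribute class {" + ",".join(class_values) + "}")
    lines += ["", "@data"]
    for features, label in rows:
        if len(features) != len(feature_names):
            raise ValueError("feature vector does not match the header")
        if label not in class_values:
            raise ValueError(f"unknown class label: {label}")
        lines.append(",".join(str(v) for v in features) + "," + label)
    return "\n".join(lines) + "\n"


# Two hypothetical blocks described by a subset of the measures.
arff_text = to_arff(
    "assembly_blocks",
    ["RC", "CC", "GMB", "RGB"],
    [((8, 12, 10, 1.5), "correct"), ((3, 4, -2, -0.7), "misassembled")],
)
```

Validating the row length and the label before writing keeps a malformed block from silently producing a file that the classifier would later reject.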


    3 RESULTS
3.1 Performance of nine individual measures
We first tested the performance of these measures on the draft assemblies of three Drosophila genomes. The results are presented as Receiver Operating Characteristic (ROC) curves (Fig. 4). Among all individual measures, GMB reached the best performance (accuracy between 0.7 and 0.8). Yet the simple minimum clone coverage measure (CCN) achieved surprisingly high accuracy, performing better than, or as well as, more sophisticated measures such as CE or the Z-value-based measures in the 2005 assembly of D. erecta. Although the accuracy of some validation measures appears high, their precision is extremely low, typically below 10%. Because the number of mis-assembled genomic regions is small compared to the number of correctly assembled regions, the predicted mis-assembled regions may consist mostly (> 90%) of false positives. It is therefore impractical to rely on these individual measures to guide experiments for correcting errors in the assembly. We thus attempt to improve the precision of assembly validation to a reasonably high level (i.e. > 30%), so as to provide useful information for finishing efforts in genome projects.
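The gap between accuracy and precision follows directly from the class imbalance, as a short calculation shows. The numbers below are hypothetical and chosen only to illustrate the effect; the sketch assumes the usual definitions of accuracy, precision, sensitivity and specificity, which the text does not spell out.

```python
def metrics(tp, fp, fn, tn):
    """Standard summary statistics from block-level confusion counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
    }


# Hypothetical assembly: 100 000 blocks, of which 500 are mis-assembled.
# A measure with 80% sensitivity and 80% specificity then gives:
#   TP = 400, FN = 100, FP = 19 900, TN = 79 600
imbalanced = metrics(tp=400, fp=19_900, fn=100, tn=79_600)
```

With these made-up counts the accuracy is 0.8, yet only 400 of the 20 300 flagged blocks are real errors, a precision of about 2%.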


Fig. 4. ROC curves of individual measures and the combined evidence approaches based on different machine learning methods on validating the draft assemblies of (A) D. mojavensis, (B) D. erecta and (C) D. virilis genomes, respectively.

 
3.2 Performance of combined evidence approach
We adopt a combined evidence approach that uses a machine-learning method to integrate the individual validation measures. We evaluate five different machine-learning classifiers that are implemented in Weka: the decision tree (J48), the RF, the RT, the NB and the BN. Their relative performance is tested using both simulated data and draft assemblies of Drosophila genomes.

Table 4 summarizes the results of different machine-learning classifiers for the combined evidence assembly validation on simulated DNA sequences. The five machine-learning classifiers exhibit similar trends in their prediction accuracy across the varying sequencing error rates used for our simulations; in most cases, the machine-learning classifiers perform better than individual measures (Supplementary Fig. 1). In the experiments employing higher sequencing error rates, a greater number of assembly errors are made by the Arachne assembler, while the validation methods achieve higher precision but slightly lower prediction accuracy. Overall, the decision tree (J48) and random forest (RF) classifiers outperform the other classifiers (with higher accuracy and precision) across different experiments. Nevertheless, although the precision of the combined evidence approach is improved over the individual measures, it remains modest (e.g. only ~0.2–0.3), owing to the small total number of mis-assembled regions in the simulated data. The performance of the combined evidence approaches compared to individual measures is shown in Figure 4 for the various machine-learning methods used to validate the Drosophila genome draft assemblies. Most machine-learning classifiers outperform individual measures on these real data sets. Again, random forest (RF) performs the best among the five machine-learning classifiers (with the highest precision) in all experiments (Table 5 and Supplementary Table 2, which are represented by dots in Fig. 4). The decision tree classifier reaches a slightly higher accuracy than RF, although its precision is lower. In general, the performance of RF is satisfactory (accuracy > 0.9 and precision ~0.3) except for one case in which a limited number of true mis-assembled regions are identified (D. erecta, 2004 draft assembly).


Table 4. Results of different machine learning methods for the combined evidence assembly validation on simulated DNA sequences of about 13.5 Mbp that contain multiple copies of nearly identical repeats

 

Table 5. Results of a cross-evaluation of the machine learning approaches for validating the draft assemblies of Drosophila genomes

 
3.3 Cross-species evaluation
To employ the machine-learning method, models must be trained using a learning data set of known true mis-assemblies. In the previous experiments, we trained the models on a learning set obtained by randomly sampling n of the mis-assembled blocks and 5n of the correctly assembled blocks, where n is 50% of the mis-assembled regions. We then used the remaining blocks to test the model. This procedure is impractical, since in reality we expect to validate the draft assembly of a whole genome de novo, and the true breakpoints needed for training are not known. Therefore, a last experiment is performed, in which we use one of the Drosophila genomes as the training set and test the model on the draft assembly of another genome. Table 6 and Supplementary Table 3 show the results of this cross-species evaluation. Similar to the previous experiments, the decision tree (J48) is the most accurate classifier, whereas random forest (RF) reaches the highest precision.
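The within-genome sampling scheme described above (n mis-assembled blocks plus 5n correct blocks for training, the rest for testing) can be sketched as follows. The function name and the block identifiers are hypothetical, and capping the number of correct blocks when fewer than 5n are available is an assumption added here.

```python
import random


def split_blocks(mis_ids, ok_ids, fraction=0.5, ratio=5, seed=0):
    """Draw a learning set of n mis-assembled and ratio * n correct blocks.

    n is `fraction` of the mis-assembled blocks; all other blocks are
    returned as the test set.
    """
    rng = random.Random(seed)
    n = int(len(mis_ids) * fraction)
    train_mis = rng.sample(list(mis_ids), n)
    # Assumption: cap at the number of available correct blocks.
    train_ok = rng.sample(list(ok_ids), min(ratio * n, len(ok_ids)))
    train = set(train_mis) | set(train_ok)
    test = [b for b in list(mis_ids) + list(ok_ids) if b not in train]
    return sorted(train), test


# Hypothetical genome with 10 mis-assembled and 300 correct blocks.
train_ids, test_ids = split_blocks(list(range(10)), list(range(100, 400)))
```

For this toy input the learning set holds 5 mis-assembled and 25 correct blocks, and the remaining 280 blocks form the test set. Sampling correct blocks at a fixed multiple of n keeps the training classes far less skewed than the genome itself.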


Table 6. Results of a cross-species evaluation of the machine-learning approaches for assembly validation

 

    4 DISCUSSION
In this article, we tested several individual measures for assembly validation. The results show that no single measure achieves satisfactory performance. Hence, we proposed a combined evidence approach to validating draft assemblies of large eukaryotic genomes. We showed that the machine-learning-based methods consistently outperformed the individual measures on both simulated data and draft assemblies from real sequencing projects. Although the requirement for training data appears to be a limitation of our approach, we argue that in practice two sources can provide sufficient training data: experimentally validated mis-assemblies and cross-species learning. Many genome projects initiate cDNA sequencing, BAC fingerprinting and genetic mapping projects, in addition to genome sequencing. Recently developed high-throughput technologies, such as the genome-scale DNA tiling array (Samanta et al., 2007), are also able to experimentally validate a significant fraction of a draft assembly, which can then be used as training data in our machine-learning approach. Furthermore, the performance of our approach remains satisfactory (as shown in Table 6) when the model is trained on one genome assembly and tested on another assembly from a closely related genome. We note that the performance improvement of the machine-learning methods in cross-species evaluations is not uniform. One classifier may achieve better results when trained using data from one genome rather than another. Nevertheless, among the five machine-learning algorithms we tested, the decision tree and random forest algorithms in general showed better performance than the other non-linear learning algorithms, and thus can be practically applied.

We emphasize that among the many automatic methods we tested, the best precision of mis-assembly detection is only around 60%, which means that finishing efforts are still unavoidable to verify these predicted mis-assemblies. However, since false-negative rates are very low and the majority of the assembly is not found to be ‘suspicious’, finishing efforts can be focused on the regions that are flagged as potential errors. Wrongly predicted mis-assemblies in fact reflect the difficulty of sequence validation. They are predicted as mis-assemblies because they fall into ‘difficult-to-assemble’ regions, even though the assemblers may still assemble them correctly. Therefore, the true performance of our method may be better than it appears in the evaluation.

The combined evidence approach described here significantly improves on the existing methods based on individual measures, and is a useful tool for establishing confidence in genome assemblies. Currently, we are working on applying our approach to the draft assemblies of Daphnia and Drosophila genomes. We will report our findings in future publications.


    ACKNOWLEDGEMENTS
We are grateful to anonymous reviewers for their valuable comments. This research was supported in part by the Indiana METACyt Initiative of Indiana University, funded in part through a major grant from the Lilly Endowment, Inc. and by NSF Career DBI-0237901. Computer support was provided by an allocation TG-MCB060059N through the TeraGrid Advanced Support, by the University Information Technology Services (UITS) and by The Center for Genomics and Bioinformatics computing group. We thank Richard Repasky (UITS) who helped conceive this project.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alex Bateman

1 According to the Bermuda standard, a complete genome means a single DNA sequence (with no gap) for each chromosome, containing no more than 1 in 10 000 (0.01%) erroneous or ambiguous bases.

Received on October 9, 2007; revised on November 29, 2007; accepted on December 5, 2007

    REFERENCES

    Bartels D, et al. BACCardI – a tool for the validation of genomic assemblies, assisting genome finishing and intergenome comparison. Bioinformatics (2005) 21:853–859.

    Batzoglou S, et al. ARACHNE: A whole-genome shotgun assembler. Genome Res (2002) 12:177–189.

    Bonfield JK, et al. A new DNA sequence assembly program. Nucl. Acids Res (1995) 23(24):4992–4999.

    Breiman L. Random forests. Machine Learning (2001) 45:5–32.

    Dew IM, et al. A tool for analyzing mate pairs in assemblies (TAMPA). J. Comput. Biol (2005) 12:497–513.

    Dietterich T. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Machine Learning (2000) 40:139–157.

    Duda RO, Hart PE. Bayes decision theory. In: Pattern Classification and Scene Analysis. Richard OD, Peter EH, David GS, eds. (1973) John Wiley & Sons Inc. 10–43.

    Gilbert DG. DroSpeGe: rapid access database for new Drosophila species genomes. Nucl. Acids Res (2007) 35(Suppl_1):D480–485.

    Green P. Against a whole-genome shotgun. Genome Res (1997) 7(5):410–417.

    Heckerman D, et al. Learning Bayesian networks: the combination of knowledge and statistical data. Machine Learning (1995) 20:197–243.

    International Chicken Genome Sequencing Consortium. Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature (2004) 432:695–716.

    Kececioglu JD, Myers EW. Combinatorial algorithms for DNA sequence assembly. Algorithmica (1995) 13:7–51.

    Kim S, et al. A probabilistic approach to sequence assembly validation. (2001) Proceedings of the ACM SIGKDD Workshop on Data Mining in Bioinformatics (BIOKDD'01). ACM, San Francisco, CA, pp. 38–43.

    Kim S, et al. Genome Sequencing Technology and Algorithms. (2007) Artech House.

    Lindblad-Toh K, et al. Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature (2005) 438:803–819.

    Mikkelsen TS, et al. Genome of the marsupial Monodelphis domestica reveals innovation in non-coding sequences. Nature (2007) 447:167–177.

    Mouse Genome Sequencing Consortium. Initial sequencing and comparative analysis of the mouse genome. Nature (2002) 420:520–562.

    Myers EW. Towards simplifying and accurately formulating fragment assembly. J. Comp. Biol (1995) 2:275–290.

    Nelson WM, et al. Whole-genome validation of high-information-content fingerprinting. Plant Physiol (2005) 139(1):27–38.

    Pop M, et al. Genome sequence assembly: Algorithms and issues. IEEE Computer (2002) 35:47–54.

    Quinlan JR. C4.5: Programs for Machine Learning. (1993) San Mateo, CA: Morgan Kaufmann.

    Salzberg SL, Yorke JA. Beware of mis-assembled genomes. Bioinformatics (2005) 21(24):4320–4321.

    Samanta MP, et al. In-depth query of large genomes using tiling arrays. Methods Mol Biol (2007) 377:163–174.

    Sanger F, et al. Nucleotide sequence of bacteriophage λ DNA. J. Mol. Biol (1982) 162:729–773.

    Schatz M, et al. Hawkeye: an interactive visual analytics tool for genome assemblies. Genome Biology (2007) 8:R34.[CrossRef][Medline]

    Schmutz J, et al. Assessing the quality of finished genomic sequence. Cold Spring Harb Symp. Quant. Biol (2003) 68:31–37.[CrossRef][Web of Science][Medline]

    Schmutz J, et al. Quality assessment of the human genome sequence. Nature (2004) 429:365–368.[CrossRef][Medline]

    Tang H. Genome assembly, rearrangement and repeats. Chem. Rev (2007) 107:3391–3406.[CrossRef][Web of Science][Medline]

    Venter JC, et al. The sequence of the human genome. Science (2001) 291:1304–1351.[Abstract/Free Full Text]

    West J, et al. Validation of S. pombe sequence assembly by microarray hybridization. J. Comput. Biol (2006) 13:1–20.[CrossRef][Web of Science][Medline]

    Witten IH, Eibe F. Data Mining. (2001) Hanser Fachbuch: Hanser Fachbuch.

    Zimin AV, et al. Assembly reconciliation method (2005) unpublished. Available at http://www.genome.umd.edu/reconciliation.htm.

