Bioinformatics Advance Access originally published online on November 11, 2004
Bioinformatics 2005 21(7):889-896; doi:10.1093/bioinformatics/bti129
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
A parallel graph decomposition algorithm for DNA sequencing with nanopores
1Department of Electrical Engineering, UET Lahore-54890, Pakistan
2Eagle Research & Development 1898 S. Flatiron Court, Suite 128, Boulder, CO 80301, USA
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Motivation: With the potential availability of nanopore devices that can sense the bases of translocating single-stranded DNA (ssDNA), it is likely that reads of length
105 will be available in large numbers and at high speed. We address the problem of complete DNA sequencing using such reads.
We assume that
102 copies of a DNA sequence are split into single strands that break into randomly sized pieces as they translocate the nanopore in arbitrary orientations. The nanopore senses and reports each individual base that passes through, but all information about orientation and complementarity of the ssDNA subsequences is lost. Random errors (both biological and transduction) in the reads create further complications.
Results: We have developed an algorithm that addresses these issues. It can be considered an extreme variation of the well-known Eulerian path approach. It searches over a space of de Bruijn graphs until it finds one in which (a) the impact of errors is eliminated and (b) both possible orientations of the two ssDNA sequences can be identified separately and unambiguously.
Our algorithm is able to correctly reconstruct real DNA sequences of the order of 106 bases (e.g. the bacterium Mycoplasma pneumoniae) from simulated erroneous reads on a modest workstation in about 1 h. We describe, and give measured timings of, a parallel implementation of this algorithm on the Cray Multithreaded Architecture (MTA-2) supercomputer, whose architecture is ideally suited to this unstructured problem. Our parallel implementation is crucial to the problem of rapidly sequencing long DNA sequences and also to the situation where multiple nanopores are used to obtain a high-bandwidth stream of reads.
Contact: shb{at}acm.org
| INTRODUCTION |
|---|
|
|
|---|
The development of nanopore devices to sequence DNA is an area of vigorous research that promises exciting results. The essential idea is to make single stranded DNA (ssDNA) pass through (or translocate) a nanopore whose diameter is of the order of a few atoms. The passage of individual bases in the ssDNA is sensed using electrical or optical means. The resulting signal is processed to reveal the identity (i.e. A, C, G or T) of the individual bases as they pass through the nanopore. Several research groups are active in this area. A partial, and by no means complete, sampling includes Branton and Meller (2002), Howorka et al. (2001); Kasianowicz et al. (1996); Li et al. (2001) and Storm et al. (2003). Our own efforts are aimed at developing a silicon-based nanosequencer that has field-effect transistors fabricated around the nanopore. As different bases translocate the nanopore, changes in charge distribution are sensed directly by these transistors, thereby potentially providing an accurate and reliable means of differentiating between different bases. The structure of this nanosequencer is described in a US patent (Sauer and Van Zeghbroeck, 2002). Figure 1 shows how this sequencer would be used in practice.
|
The key issues associated with the use of this nanosequencer are as follows:
Read length. The translocation of an entire ssDNA string would be a matter of incredible luck. It is expected that, in practice, the physical stress of translocation would cause the ssDNA to break in several places. We expect that substrings of length around
105 bases would translocate at one time (Li et al., 2003; Sauer-Budge et al., 2003; Storm et al., 2004.
Complementarity. Each DNA helix would be split up into the original and its WatsonCrick complement. Since these would be further broken up into smaller pieces, there is no way for the nanosequencer's signals to reveal which of these two is passing through at any time.
Orientation. There is no control on the orientation in which the strings pass through the nanopore. DNA has well-defined orientations (the 3'5' direction and vice versa). Since the ssDNA is broken into subsequences, all information on orientation is lost.
Errors. An individual DNA sequence might have errors in the form of insertions or deletions (randomly added or deleted bases, collectively called indels). It may also have mutations (some bases might be randomly modified). Furthermore, the nanosequencer itself might make errors and sense bases incorrectly. The reconstruction of the original DNA sequence must be able to tolerate these errors.
We have developed an algorithm that takes these constraints into account and is able to reconstruct DNA sequences given realistic inputs of ssDNA subsequences, as might be expected from a nanopore device. Our algorithm is able to sequence a small organism such as Mycoplasma pneumoniae, which has 816 394 bases, on a modest workstation in under 1 h. We have parallelized this algorithm for the Cray MTA-2 supercomputer and our preliminary results show good performance. A parallel implementation for large supercomputers is important because mammalian DNA has
109 bases and cannot be sequenced on small conventional machines. Furthermore, in the future, practical devices may employ multiple nanopores in parallel to speed up the process of sequencing. High performance computing will thus be necessary to keep up with the high bandwidth data supplied by such devices.
While this research is motivated primarily by the silicon-based nanosequencer proposed by Sauer and Van Zeghbroeck (2002), we expect that all devices capable of providing long reads of DNA sequences will be able to utilize our algorithm.
In the following section we present the essentials of our algorithm, which is based on the well-known Eulerian path algorithm, modified to take advantage of the very long reads promised by nanosequencers. We next discuss the impact of indels and mutations. These are handled by breaking up each read into smaller subsequences or k-mers, combining these using a hashing scheme and then using frequency thresholding to reject those that contain errors. Following this we present the results of an extensive series of experiments that explore how the behavior of the algorithm varies with mer size, error probability and number of input sequences. We then describe parallelization and present our preliminary timings on the Cray MTA-2 when processing simulated reads from the full M.pneumoniae sequence. Our conclusions and plans for future research, which include proposals for extending this approach to mammalian DNA with multiple chromosomes, are given in the final section.
| THE PATH DECOMPOSITION ALGORITHM |
|---|
|
|
|---|
The Eulerian path approach
Our algorithm is based on the well-known Eulerian path approach for reconstructing DNA sequences. The Eulerian path algorithm (Idury and Waterman, 1995; Pevzner et al., 2001) is best known for its applications to sequencing by hybridization (SBH). (Waterman (1995), Gusfield (1997) and Pevzner (2000) provide authoritative introductions to this field. SBH uses the output of VLSI-based DNA arrays to reconstruct portions of DNA sequences. DNA arrays provide all length k subsequences of a given DNA string. These subsequences, or k-mers, are used to develop a de Bruijn graph in which edges represent k-mers and nodes represent the two (k 1)-mers at either end of a single k-mer. Any Eulerian path in such a de Bruijn graph corresponds to a reconstructed DNA sequence. If the Eulerian path is unique the reconstruction is assumed to be correct (assuming no erroneous k-mers).
Technology constrains the length of available k-mers from universal arrays to
101 bases. This limits the length of the sequence that can be reconstructed. However, many impressive results have been obtained even with these limited sizes.
With the potential availability of silicon nanopore devices, such as the SauerVan Zeghbroeck sequencer discussed above, it is likely that subsequences or reads of length
105 bases will be available in large numbers and at high speed. In this case the complexion of the problem changes significantly and it becomes feasible to aim at the target of reconstructing entire DNA sequences, rather than exploring selected areas of interest.
Reconstruction of a DNA sequence from a set of variable length reads can be done using the well-known technique of breaking each read up into a set of overlapping k-mers, creating a de Bruijn graph, and then finding an Eulerian path in that graph. There are a number of complicating factors, however. As discussed in the Introduction, the orientation of each read inside its parent string cannot be determined, nor can we know if a read comes from the original ssDNA or its WatsonCrick complement. Furthermore, errors in the reads, whether biological (i.e. indels or mutations) or electrical (i.e. caused by electrical noise or imperfect sensing), disturb the structure of the graph and destroy the possibility of obtaining the correct de Bruijn graph and hence the sought-for DNA sequence. Finally, there is the problem of repeats in the DNA sequence.
Let us first address the problem of orientation and complementarity of the reads. In essence the problem, in terms of sequence recognition, is that there are four different strings to be recognized. These are the original 3'5' string, the reversed 5'3' string and the two WatsonCrick complements of these. Our approach is to identify all four simultaneously. This is made possible by the long reads provided to us by the nanosequencer. These long reads (in combination with multiple DNA strings) make it possible for the mer size to be made very large. As the mer size gets large, the de Bruijn graph transforms from a convoluted graph that contains all four sequences in overlapped form, to a graph that is composed precisely of four disjoint paths, with nothing left over. This is described in Figure 2.
|
Formally, the graph G must contain four paths P1,...,P4, such that P1
P2
P3
P4 = G and Pi
Pj =
for all i,j, i
j. For brevity, we call this the E4 property and assume we have a function E4(G) that returns true if graph G satisfies it. This property is easy to establish as it merely requires checking if the graph has four components and (a) exactly four source nodes with indegree = 0, outdegree = 1, (b) exactly four sink nodes with indegree = 1, outdegree = 0 and (c) all remaining nodes with indegree = outdegree = 1.
Once the paths P1,...,P4 have been identified we need to establish if they are indeed complements and reversals of each other. Let R(p) [C(p)] represent the reversal (WatsonCrick complement) of a sequence p. We need to find a permutation
(Pi) such that
(P1) = R(
(P2)) = C(
(P3)) = R(C(
(P4))). To verify that this permutation exists it is enough to perform pairwise comparisons between any one path, say P1, and the remaining paths P2, P3 and P4. If the comparisons yield exactly one each of the transformations R(), C() and R(C()), we can be sure that the required permutation
exists. This process takes 3d steps, where d is the length of the target string, and confirms that the sequence found is indeed a bona fide DNA sequence.
The impact of errors
The problem of errors in the reads is tackled by sampling multiple copies of the DNA sequence. The k-mers generated by these multiple reads are collated and only those that occur more than a predetermined number of times are selected, the rest being discarded. This is detailed in the section entitled Experimental Investigation.
The error rates achieved by nanopore technologies remain to be seen, as no fully functioning device has yet been created. We expect that silicon-based nanosequencers will have much better error rates than nanosequencers based on other technologies because (1) the geometries of such devices can be precisely controlled using well-understood VLSI technology and (2) the passage of a base is sensed by measuring its charge at atomic distances, a process that is expected to be very accurate. Other proposed nanosequencers may have different error rates. In all cases, the errors can be compensated for by using larger numbers of reads.
The problem of repeats
Repeats in DNA sequences dictate a lower bound on the mer size k. The chosen k must be greater than the longest maximal repeat in the sequence, otherwise the algorithm will be unable to generate a de Bruijn graph with a unique path in it. Experimental results presented later in this paper demonstrate this limitation.
Clearly, the length of the reads provided by the nanosequencer must also be greater than the longest repeat in the input DNA, otherwise an E4 graph cannot be created. Three points are noteworthy in this context. Firstly, even if most reads are shorter than the expected repeat length, we could still use the remaining reads for our graph generation. This would require a larger number of input DNA strings, which may or may not be a problem, depending on the specific method of extracting DNA. Secondly, given the graph theoretic nature of our algorithm, repeats are only a problem if they are exact. Two long subsequences that are 99% similar do not constitute a repeat as far as the Eulerian approach is concerned. Finally, from a technological standpoint, we should expect that the average length of reads provided by silicon nanosequencers would increase gradually as experience is gained in fabrication and usage. This is typically the way in which silicon devices evolve and improve over time.
| CREATING AND ANALYZING THE DE BRUIJN GRAPH |
|---|
|
|
|---|
We assume that a large number of reads of broken pieces of ssDNA are input to the algorithm and stored in one contiguous array. The k-mers are then generated. We use hashing to store the k-mers, with pointers to the ssDNA array to differentiate between strings that have the same hash value. A k-mers can be encountered multiple times; the multiplicity is stored in the hash table itself. Once all k-mers have been generated, we can carry out a thresholding process to eliminate erroneous k-mers.
The de Bruijn graph is generated using the k-mers above the threshold frequency ft. Since we will only employ the de Bruijn graph further if it satisfies the E4 property, there is no need to use linked lists to store adjacent edges. Each node in an E4 graph can have at most one edge whose identity can be stored in a static array. This leads to a simple data structure that is easy to manipulate in parallel. The (k 1)-mers (k-mers) corresponding to these nodes (edges) are stored in hash tables that contain pointers to the ssDNA array so that the exact composition of each mer is known.
The essential steps of our algorithm, for a given mer size k and threshold ft, are as follows. We assume that n strings of length d are input, so that the total bases input are nd.
- generate and store all k-mers
- threshold k-mers
ft
- generate de Bruijn graph G
- if E4(G) is true then
- extract the paths P1,...,P4
- if

(Pi) such that
(P1) = R(
(P2))
=C(
(P3)) = R(C(
(P4))) then
- output P1; halt
else sequencing impossible with given k,ft
The time required for step [1] is O(nkd) on average when hashing is used to store k-mers (as is the case in our implementation). Step [2] needs O(nd) time while the remaining steps are O(d). The algorithm, as stated above, only serves to check if a reconstruction can be found for a given mer size k and threshold frequency ft. The following section describes how the (k,ft) space would be explored in practice.
| EXPERIMENTAL INVESTIGATION |
|---|
|
|
|---|
We now describe the results of an extensive series of experiments in which we evaluated our algorithm for various frequency thresholds, mer sizes, error probabilities and number of sequences. We used the first 20,000 bases of M.pneumoniae as a test case. This small, but realistic, subsequence was chosen so as to permit the experiments to be performed in an acceptable period of time. Later in this section we shall describe the behavior of the algorithm over a smaller space but using the full 816 394 base M.pneumoniae sequence.
The given DNA string (of length 20 kb) was replicated 30, 40 and 50 times. Indels and mutations were introduced randomly with probability 0.0005. Each string was then split into two ssDNA, which were randomly broken into five pieces each.
3D histograms of mer frequency f and mer size k are shown in Figure 3(ac). A slice at a fixed value of k results in a 2D histogram that shows the distribution of mer frequencies.1 In Figure 3(a) the slice at k = 200 shows that there are a large number of mers that have frequency 1 and 2. The number of mers with frequency 39 is zero while those with frequency 10 or more is again very large. Since the total number of sequences in this case is 50, any mers that have frequency 1 or 2 must be erroneous and should be rejected. In general, mers that have low frequency are erroneous and can be ignored while those that have high frequency are genuine and should not be.
|
The valleys in Figure 3(ac) indicate values of (f,k) pairs for which the erroneous and valid mers are clearly differentiated. The valleys shrink in area as the number of subsequences decreases; the one in Figure 3(c) is only a few points.
The behavior of the algorithm over the range of frequencies and mer sizes of Figure 3(ac) is shown in part (d). Each point on this plot indicates an (ft,k) pair at which the algorithm is able to reconstruct the DNA string correctly. A threshold frequency of ft implies that we include all mers that occur ft or more times.
To avoid a congested figure we have not plotted all points in the k (mer size) direction. The smallest value of k for which the algorithm succeeds is
20.
As is to be expected, the region over which the algorithm succeeds shrinks as the number of input DNA strings is reduced. For a fixed error probability (0.0005 in the present case) the larger the number of strings, the better the chance of creating a graph with the E4 property.
It is clear that the region of success increases in size with increasing number of sequences. This is because more sequences tend to nullify the impact of errors. There is a sharp cutoff in mer size below which the algorithm does not succeed (about 20 for the experiment of Fig. 3). This is because the cutoff limit is determined by the longest maximal repeat in the specific DNA sequence. The length of the longest maximal repeat in the sequence used in Figure 3 is 18, which is consistent with our observation of a lower limit of around 20.
The threshold frequency needs to be chosen so that the algorithm can successfully reject erroneous mers. The algorithm always succeeds when it is run in a clearly defined floor, such as the rectangle defined by the diagonal points (3,100), (7,300) in Figure 3(a). This suggests a practical technique for automating the search for a solution: obtain histograms for a few small values of k and select a frequency threshold that lies within a flat valley. If a flat valley (indicating clear distinction between erroneous and valid mers) is not found, the algorithm can signal the nanopore sequencer to provide more reads. Note, however, that is it is not necessary to find a flat valley: any value of frequency threshold that can differentiate between erroneous and valid mers is sufficient.
The performance of the algorithm for a fixed number of sequences but varying probability of error is given in Figure 4. When there are no errors, the algorithm succeeds with threshold ft = 1 and at almost all values of k (the smallest value of k for which the algorithm succeeds is again
20 in this case). The flat valleys decrease with increasing error probability. [Figs 3(c) and 4(c) show identical surfaces though truncated by different amounts.]
|
The performance of the algorithm on the complete M.pneumoniae sequence appears in Figure 5. In this case we used 60, 70 and 80 sequences with error probability 0.0005. Each sequence was randomly broken in 20 places. We would expect that the smallest mer size for which the algorithm succeeds on this longer DNA sequence would be higher than in the test problem of Figures 3 and 4 because of the longer maximal repeat size. This is indeed the case: the longest maximal repeat in M.pneumoniae is 470, which is consistent with the lower limit of k in Figure 5.
|
For any given threshold, there is also a well-defined upper limit on mer size beyond which the algorithm does not succeed. This arises because longer mers have a higher probability of being corrupted by errors. Even when there are no errors [Fig. 4(a and d)] longer mers have a greater chance of being intercepted by random breaks and thus being rejected by the thresholding process. In Figure 4(d) we see that when the threshold is set to 1 (i.e. accept all mers), the algorithm succeeds for all mer sizes shown. In this case it would fail when the size approached the average size of the broken ssDNA string.
There is a tension between the frequency threshold and mer size that leads to a hyperbolic boundary of the region of success. This is especially evident in Figure 3(d). A formal study of this phenomenon is an interesting area for future research.
| PARALLELIZATION |
|---|
|
|
|---|
The target of our parallelization efforts is the innovative Cray MTA-2 (Multithreaded Architecture) supercomputer (Alverson et al., 1990, 1992). This machine has hardware supported multithreading and a large flat shared memory without locality. Special full-empty tag bits are associated with each memory location to permit efficient synchronization (http://www.cray.com/products/programs/mta_2). When developing code for this machine, the programmer does not have to be concerned with issues of mapping, partitioning, load balancing and interprocessor communication. This machine is thus ideally suited for unstructured graph-based problems of the type that are encountered in the bioinformatics field. Our prior experiences with the MTA-2 in a bioinformatics context may be found in Bokhari et al. (2002) and Bokhari and Sauer (2004).
The bulk of the time taken by our algorithm lies in the mer generation, thresholding and graph generation steps. These are easily parallelized on the MTA-2. Establishing if the resulting graph has the E4 property involves checking the in- and outdegrees of each node, a process that is also efficiently parallelizable. Actually finding the four component Eulerian paths is an inherently serial process, but its contribution to the overall run time is negligible compared to the remaining steps.
Our algorithm was implemented in C on Linux/Solaris workstations. Once it was running correctly, we were able to port it to the MTA-2 at the Electronic Navigation Research Institute (ENRI, http://www.enri.go.jp) in only a few days. This was made possible by the powerful compiler of the MTA, which parallelizes only those portions of code that it can provably transform. Thus an existing C program will run correctly (although with unremarkable efficiency) when moved to the MTA. Once results have been verified, the programmer may choose to modify those selected portions of the code where parallelization is of greatest benefit.
The following fragment shows a part of a hash table insertion function. The string with start index pos is to be inserted. The function READFE(x) (wait until full then read and set empty) uses the hardware tag bits of the MTA to read and lock the address x. Conversely WRITEEF(&x, y) (wait until empty then write and set full), writes y into location x and then unlocks x. Each of these operations corresponds to one hardware instruction. A thread waiting for a lock is suspended and later resumed using a special hardware mechanism, without overhead.
insertHashtable(hash *table, int pos){...
// h is hash value of string
// starting at pos
this = READFE(&table[h]);
if (this == 1){// empty location
WRITEEF(&table[h], pos); return;
}
if (this == pos){
WRITEEF(&table[h], this); return;
// pos already there, do nothing
}
if (compareStrings(this, pos) == 1){
...
WRITEEF(&table[h], this); return;
// matches some other position
}
WRITEEF(&table[h], this); return;
...
}
The following fragment shows how we count the total number of edges inserted in the edge hash table as well as the number of these that are above a specified threshold. The pragma is needed to reassure the compiler that it is safe to parallelize this loop. This is because the compiler is unable to determine if there are any dependencies within dynamically allocated C arrays. The int_fetch_add implements the well known fetch and add primitive which ensures that the respective locations are incremented atomically.
#pragma mta assert parallelfor(i = 0; i < max; i++)}
if(table[i] ! = 1)}
int_fetch_add(&edges,1);
if (occur[i] > threshold)}
int_fetch_add(&above,1);
}
}
}
Judicious use of the few features described above is enough to obtain a good parallel implementation on the MTA.
| DISCUSSION |
|---|
|
|
|---|
Our parallel algorithm has only recently reached the implementation stage. Nevertheless we have some encouraging results. Table 1 shows the results of an experiment in sequencing M.pneumoniae. Our algorithm was able to reconstruct its sequence in about 40 min using a 4-processor MTA-2.
|
It is obvious that the overall run times are dominated by the mer generation and thresholding processes. It is this aspect of the code that deserves further attention. The times required for graph generation are also significant. Both thresholding and graph generation scale well with increasing processors. Path finding, which is inherently a serial operation, does not scale, but its contribution to the total run time is negligible.
| CONCLUSION |
|---|
|
|
|---|
The algorithm
We have developed a parallel graph decomposition algorithm that can be used to completely sequence DNA using the output from a nanopore sequencing device. The actual determination of the sequence is done using a variant of the Eulerian path approach. While this path approach has been known for many years, our variant introduces several innovations that permit it to be used efficiently for long erroneous reads.
We have shown that for long reads and correspondingly long k-mers, the target is to find four disjoint paths in the graph that are all equal in length and are WatsonCrick complements or reversals of each other. We have demonstrated that this technique can be used to obtain DNA sequences of realistic sizes. Our other innovations include using a frequency thresholding technique to eliminate erroneous mers and searching over a range of de Bruijn graphs to select one that unambiguously determines the sequence.
The bulk of the time taken by our algorithm lies in eliminating erroneous regions of DNA. These errors may be biological (such as indels or mutations) and may also be caused by noise in the sensing device itself. Our computational simulations have shown that these errors can be compensated for by reading multiple DNA strings. We have used a hashing approach to efficiently carry out the thresholding process to eliminate errors.
Parallelization on Cray MTA-2
The architecture of the Cray MTA-2 allows us to parallelize this algorithm with a minimum of modification to the serial code. Whatever changes are necessary do not preclude running the identical code on Linux PCs, Sun Workstation and the MTA-2 itself. This is an indication of the power and versatility of the MTA-2 architectural concept. We expect that the MTA-2, its successor machines and related architectures will provide high-productivity environments for the development of parallel algorithms for the unstructured problems that typically arise in bioinformatics. Such machines will provide alternatives to the more traditional cluster architectures that are in vogue today.
Future research
The results described in this paper provide the proof of concept for our approach. Our ongoing work involves the following.
Data structure
Exploring if a more elaborate data structure could be used to further improve the efficiency of the thresholding process. This data structure will necessarily have to permit efficient parallelization.
Thresholding
Investigating a more sophisticated thresholding technique. The technique used by us at present is simple but effective. A more sophisticated technique might allow us to reduce the number of input strings at the cost of greater execution time. It may also permit the algorithm to successfully sequence the DNA even if most, but not all, reads are shorter than the longest maximal repeat length.
Detailed evaluation of the solution space
We have described the sensitivity of the algorithm to variations in mer length k, frequency threshold ft, number of DNA strings and to the error probabilities. These sensitivities need to be formalized so that a completely autonomous sequencing procedure can be developed.
Using the Cray MTA-2 to sequence longer DNAs
We have demonstrated our algorithm with M.pneumoniae, a bacterium with a comparatively short DNA sequence (0.8 x 106 bases). This organism was chosen because it could comfortably be experimented with on a modest workstation (for comparison purposes). We now plan to move on to larger organisms such as Hemophilus influenzae (1.8 x 106) and Neisseria meningitidis (2.1 x 106 bases). We already have preliminary results on these organisms on serial machines, wherein we have reconstructed their DNA assuming error-free reads. The huge shared memory (16160 GB) and multiple processors (440) of the Cray MTA-2 will permit us to carry out realistic experiments on these and larger organisms in convenient time frames. We plan to pursue a program of careful evaluation of our algorithm so as to obtain detailed timings for a set of real organisms. We also plan to investigate the parallel scaling aspects of the algorithm in greater detail.
Circular DNA
We have carried out our initial experiments with linearized DNA so as to prepare the way for our long-term objective, the sequencing of linear mammalian DNA (see below). Nevertheless, it is of interest to explore the extension of this algorithm to circular DNA. We expect that the techniques presented in this paper can easily be modified for circular DNA. Prior to translocation through the nanopores of the sequencer, the DNA would, of course, be split into linear segments using chemical or physical means. Instead of searching for four disjoint paths, the modified algorithm would search for four disjoint cycles.
Mammalian DNA
Our long-term objective is to simulate the reconstruction of mammalian (and ultimately human) DNA which is
109 bases. In addition to the obvious difficulties created by large size, there is also the problem of multiple chromosomes. In this case we expect that a graph will be generated as before, but the objective will be to search for a decomposition into 4 x 2 x the number of chromosomes. The start points of chromosomes will be found, as before, by searching for nodes with indegree = 0 and outdegree = 1. The different orientations and complements of each chromosome will then be matched to each other. Although this is a computationally intensive process, it is made manageable by the low number of chromosomes (
50) of most mammals and is, in any case, easily parallelizable on the Cray MTA-2.
Long repeats, as described by Bailey et al. (2002), could be a problem when sequencing human DNA. In addition to the longest maximal repeats within chromosomes, we will have to contend with longest common substrings across chromosomes. The techniques for handling these are the same as those described for long repeats elsewhere in this paper.
| Acknowledgments |
|---|
This research was supported by the National Institutes of Health, grant R21HG02167-01. Access to the Cray MTA-2 was provided by Cray Inc., Cray Japan Inc., and the Electronic Navigation Research Institute (ENRI), Japan. Additional support was provided by the Optoelectronic Computing Systems Center, University of Colorado, Boulder.
We are grateful to Cray Inc., Cray Japan Inc. and ENRI for their generous assistance. This work would not have been possible without the support and encouragement of Dick Russel, Bracy Elton and Simon Kahan. We thank Kazuki Watanabe, Toshiomi Muto, Susumu Igawa, Toshiaki Ishimoto and Satoru Yoshioka for their help. The comments of Jeff Schloss on an earlier version of this paper are gratefully acknowledged.
We especially wish to thank the two anonymous reviewers for their detailed comments. These have helped improve the paper very significantly.
| Footnotes |
|---|
1 This could properly be called a mer frequency spectrum, as it denotes the strength of each frequency. However, this would lead to unnecessary confusion as the term spectrum is widely used in the SBH literature to mean the set of all length
substrings of a DNA fragment (Pevzner, 2000, p. 68.
Received on August 5, 2004; revised on October 2, 2004; accepted on October 25, 2004
| REFERENCES |
|---|
|
|
|---|
Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., Smith, B. (1990) The Tera computer system. Proceedings of the Fourth International Conference on Supercomputing ACM Press, pp. 16.
Alverson, G., Alverson, R., Callahan, D., Koblenz, B., Porterfield, A., Smith, B. (1992) Exploiting heterogeneous parallelism on a multithreaded multiprocessor. Proceedings of the Sixth International Conference on Supercomputing ACM Press, pp. 188187.
Bailey, J.A., Gu, Z., Clark, R.A., Reinert, K., Samonte, R.V., Schwartz, S., Adams, M.D., Myers, E.W., Li, P.W., Eichler, E. (2002) Recent segmental duplications in the human genome. Science, 297, 945947
Bokhari, S.H., Glaser, M.A., Jordan, H.F., Lansac, Y., Sauer, J.R., Van Zeghbroeck, B. (2002) Parallelizing a DNA simulation code for the Cray MTA-2. Proceedings of the IEEE Computer Society Bioinformatics Conference IEEE, pp. 291302.
Bokhari, S.H. and Sauer, J.R. (2004) Sequence alignment on the Cray MTA-2. Concurrency Comput., 16, 823839[CrossRef].
Branton, D. and Meller, A. (2002) Using nanopores to discriminate between single molecules of DNA. Structure and Dynamics of Confined Polymers, , Dordrecht Kluwer Academic Publishers, pp. 17185.
Gusfield, D. Algorithms on Strings, Trees, and Sequences, (1997) , Combridge Cambridge University Press.
Howorka, S., Cheley, S., Bayley, H. (2001) Sequence-specific detection of individual DNA strands using engineered nanopores. Nature Biotechnol., 19, , pp. 636639[CrossRef][Web of Science][Medline].
Idury, R. and Waterman, M. (1995) A new algorithm for DNA sequence assembly. J. Comput. Biol., 2, 291306[Medline].
Kasianowicz, J.J., Brandin, E., Branton, D., Deamer, D.W. (1996) Characterization of individual polynucleotide molecules using a membrane channel. Proc. Natl Acad. Sci. USA, 93, 1377013773
Li, J., Stein, D., McMullan, C., Branton, D., Aziz, M., Golovchenko, J. (2001) Ion-beam sculpting at nanometre length scales. Nature, 412, .
Li, J., Gershow, M., Stein, D., Brandin, E., Golovchenko, J. (2003) DNA molecules and configurations in a solid-state nanopore microscope. Nature Mater., 2, 611615[CrossRef][Web of Science][Medline].
Pevzner, P.A. Computational Molecular BiologyAn Algorithmic Approach, (2000) , Cambridge, MA The MIT Press.
Pevzner, P.A., Tang, H., Waterman, M.S. (2001) An Eulerian path approach to DNA fragment assembly. Proc. Natl Acad. Sci. USA, 98, , pp. 87489753.
Sauer, J. and Van Zeghbroeck, B. (2002) Ultra-fast nucleic acid sequencing device and a method for making and using the same. US Patent No. 6,413,792 .
Sauer-Budge, A.F., Nyarnwanda, J.A., Lubensky, D.K., Branton, D. (2003) Unzipping kinetics of double-stranded DNA in a nanopore. Phys. Rev. Lett., 90, 23801-123801-4.
Storm, A.J., Chen, J., Ling, X., Zandbergen, H., Dekker, C. (2003) Fabrication of solid-state nanopores with single-nanometer precision. Nature Mater., 2, 537540[CrossRef][Web of Science][Medline].
Storm, A.J., Storm, C., Chen, J., Zandbergen, H., Joanny, J.-F., Dekker, C. (2004) Fast DNA translocation through a solid-state nanopore. Preprint: arXiv:q-bio. BM/0404041.
Waterman, M.S. Introduction to Computational Biology, (1995) , London Chapman and Hall.
This article has been cited by other articles:
![]() |
D. R. Zerbino and E. Birney Velvet: Algorithms for de novo short read assembly using de Bruijn graphs Genome Res., May 1, 2008; 18(5): 821 - 829. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||





