Bioinformatics Advance Access originally published online on November 25, 2004
Bioinformatics 2005 21(8):1705-1706; doi:10.1093/bioinformatics/bti163
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
DSEARCH: sensitive database searching using distributed computing
Department of Computer Science, National University of Ireland Maynooth, Ireland
*To whom correspondence should be addressed.
| Abstract |
|---|
|
|
|---|
Summary: We present a distributed and fully cross-platform database search program that allows the user to utilize the idle clock cycles of machines to perform large searches using the most sensitive algorithms. For those in an academic or corporate environment with hundreds of idle desktop machines, DSEARCH can deliver a free database search supercomputer.
Availability: The software is publicly available under the GNU general public licence from http://www.cs.may.ie/distributed
Contact: tom.naughton{at}may.ie
Supplementary information: Full documentation and a user manual is available from http://www.cs.may.ie/distributed
| 1 INTRODUCTION |
|---|
|
|
|---|
Database searching for similar sequences is one of the fundamental tasks in bioinformatics. The two most rigorous search algorithms are the NeedlemanWunsch (Needleman and Wunsch, 1970) and SmithWaterman (Smith and Waterman, 1981) algorithms. However for large databases it is not feasible to perform searches using these algorithms. Many heuristic search algorithms have been developed but these reduce the sensitivity of a search and can fail to detect certain matches. When given the choice, most biologists would prefer to use the more rigorous algorithms for their searches.
One way to significantly reduce the runtime of long searches is to parallelize the search process across multiple processors. Many approaches to parallelizing database searching have been investigated because database searching is both computationally intensive and easily parallelized. However the overriding problem with many of these programs is that specialized parallel hardware and software is often required, making these programs either prohibitively expensive or simply too complicated to set up. DSEARCH is a fully cross-platform parallel database search program that does not require any specialized parallel hardware or software.
| 2 DESCRIPTION |
|---|
|
|
|---|
DSEARCH operates in a masterslave environment and the search is parallelized by splitting the database into fixed sized units that are subsequently searched on the donor machines. DSEARCH is one of a number of distributed applications that runs on our general purpose distributed computing platform which provides the user with a remote interface to monitor the progress of the application as their search executes (Keane, 2004). The parallel granularity is dynamically controlled during each search to match the processing abilities of the current set of donor machines (Keane, 2004). The user edits a straightforward configuration file to tailor their computation and chooses one of the built-in search algorithms (Smith and Waterman, 1981; Needleman and Wunsch, 1970; Crochemore et al., 2003). The inputs to the program are a FASTA database file, a FASTA query sequences file, a scoring scheme and a configuration file.
The generality of DSEARCH is demonstrated by the fact that DSEARCH, written in Java, can run on virtually any architecture and operating system simultaneously while only using the spare clock cycles of donor machines. No specialized computer hardware or software is required, and no expense is incurred if idle computing resources are harnessed. This would not be as straightforward for a parallel application written in a native language because the application would have to be compiled for each particular architecture and operating system. We have demonstrated the ease of use and platform heterogeneity of DSEARCH with experiments that utilize the spare computing resources of several architectures and operating systems simultaneously. DSEARCH's main features are:
- Full cross-platform compatibility.
- Ability to run several database searches simultaneously.
- Remote real-time progress updates via the remote interface.
- Full support for arbitrary donor machine failure.
| 3 DEPLOYMENT AND PERFORMANCE |
|---|
|
|
|---|
We have deployed DSEARCH in our university by running it as a low priority background service in a number of computing laboratories, consisting of approximately 200 desktop PCs of various modest specifications (Pentium IIs up to Pentium IVs running assorted versions of Windows and Linux OSs) and on every node of an IBM Linux cluster (32 Dual PIII 1 GHz nodes) with all machines connecting via a 10 Mbit/s network to a single server (Pentium III 500 MHz). We tested the performance of DSEARCH with SSEARCH (Pearson and Lipman, 1988) by running DSEARCH on a single processor and searching a 13 MB EST database with a 173 KB set of query sequences. The comparative runtimes were 1207 and 493 min, respectively. DSEARCH's performance reduction is overcome by its greater cross-platform compatibility. Figure 1 shows how DSEARCH scales with increasing numbers of processors. In our tests, DSEARCH reached a speed of 511 million cells per second on a network of 83 semi-idle workstations (Pentium III 900 MHz).
|
| Acknowledgments |
|---|
This research has been funded by the Embark Initiative from the Irish Research Council for Science, Engineering and Technology funded by the National Development Plan. We thank Chris Creevey and Andrew Page for their assistance. DSEARCH uses the NeoBio Java library (http://www.neobio.sourceforge.net) to perform all comparisons.
Received on July 9, 2004; revised on October 18, 2004; accepted on November 17, 2004
| REFERENCES |
|---|
|
|
|---|
Crochemore, M., Landau, G., Ziv-Ukelson, M. (2003) A subquadratic sequence alignment algorithm for unrestricted scoring matrices. SIAM J. Comput., 32, 16541673[CrossRef].
Keane, T.M. (2004) A general-purpose heterogeneous distributed computing system. , Maynooth M.Sc. Thesis Department of Computer Science, National University of Ireland.
Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequences of two proteins. J. Mol. Biol., 48, 443453[CrossRef][Web of Science][Medline].
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence analysis. Proc. Natl Acad. Sci., 85, 24442448
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol., 147, 195197[CrossRef][Web of Science][Medline].
This article has been cited by other articles:
![]() |
T. M. Keane, T. J. Naughton, and J. O. McInerney MultiPhyl: a high-throughput phylogenomics webserver using distributed computing Nucleic Acids Res., July 13, 2007; 35(suppl_2): W33 - W37. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||

