Bioinformatics Advance Access originally published online on May 5, 2007
Bioinformatics 2007 23(11):1437-1439; doi:10.1093/bioinformatics/btm120
BioDownloader: bioinformatics downloads and updates in a few clicks
Fox Chase Cancer Center, 333 Cottman Ave., Philadelphia PA 19111 USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Summary: There are many ftp or http servers storing data required for biological research. While some download applications are available, there is no user-friendly download application with a graphical interface specifically designed and adapted to meet the requirements of bioinformatics. BioDownloader is a program for downloading and updating files from ftp and http servers. It is optimized to work robustly with large numbers of files. It allows the selective retrieval of only the required files (batch downloads, multiple file masks, ls-lR file parsing, recursive search, recent updates, etc.). BioDownloader has a built-in repository containing the settings for common bioinformatics file-synchronization needs, including the Protein Data Bank (PDB) and National Center for Biotechnology Information (NCBI) databases. It can post-process downloaded files, including archive extraction and file conversions.
Availability: The program can be installed from http://dunbrack.fccc.edu/BioDownloader. The software is freely available for both non-commercial and commercial users under the BSD license.
Contact: Roland.Dunbrack{at}fccc.edu
| 1 INTRODUCTION |
|---|
Various types of data serve as input in almost any bioinformatics research problem. Vast amounts of data are stored as computer files, primarily located on remote ftp and http servers for public access. There is a common need in bioinformatics to keep local copies of data sets, potentially comprising thousands of files, for fast access. For some projects, data sets are assembled from multiple servers. Moreover, given the significant progress in high-throughput biological initiatives, a large amount of data becomes available on a weekly or daily basis. For instance, the Worldwide Protein Data Bank (wwPDB or PDB) adds over 100 new structures every week (Berman et al., 2007). Every week, the wwPDB updates an average of over 100 structure files.
Many researchers write their own or borrow shell, Perl or Python scripts to download the required remote files to their local workstations. These scripts are usually highly customized for a specific server and a specific data type. Therefore, they do not provide an integrated solution where all data sets are managed and updated within the same application. Given the importance of keeping up-to-date with the latest data and the current size of biological databases, the reliability of the update process becomes crucial. The available scripts designed for biological databases tend to lack rigorous error-checking mechanisms required for complex tasks. For instance, a well-known problem is that the connection with a remote server can be dropped, thus interrupting the whole download process.
Another common problem with existing download scripts is that they are difficult to install and run, especially for scientists without extensive computer programming experience. Most download programs written specifically for bioinformatics have a primitive graphical interface, if they have one at all.
We tested a number of free or commercial download programs, including WinFtp Client 1.5 (Network Soft, www.wftpserver.com), ReGet Deluxe 4.2 (ReGet Software, www.reget.com), FlashGet 1.73 (www.flashget.com), FTP Synchronizer (Liuxz Software, www.ftpsynchronizer.com), and Star Downloader Pro 1.52 (www.stardownloader.com), for common bioinformatics tasks. While many of these programs are user friendly and are controlled via graphical interfaces, we found that they lack the specific features required to navigate and retrieve the data stored on biological data servers. Some applications are designed to download only a single file at a time, which is impractical for thousands of files. Others can read a batch of URLs, but the user must first prepare such a listing. Some allow creating an exact mirror of another ftp or http site or a part of that site. This approach tremendously increases internet traffic, download time and local storage overhead. Some of the download applications support only ftp servers or only http servers. Most download applications do not provide a convenient updating option, forcing the user to re-download the whole data set. Even the more sophisticated download applications are difficult to configure for bioinformatics ftp and http sites, which often have complex and unusual structures (for instance, the PDB ftp site). Many of the generic downloaders cannot properly process large sets of remote files, and they often crash without even attempting to download them. These programs also do not usually provide for any kind of processing, such as compression or format conversion. For instance, we routinely download XML- and mmCIF-format PDB files and convert them to the legacy-PDB format for some applications.
We have created a program called BioDownloader as an easy-to-use, all-in-one application for downloading, updating and processing data files from remote ftp and http servers commonly used in bioinformatics. Its features are described below.
| 2 FEATURES |
|---|
BioDownloader provides a unique combination of features specifically designed to meet the most common bioinformatics update needs:
- developed and optimized specifically for downloading and updating batches of files from biological data ftp or http servers, including very large data sets from multiple sources,
- has a very friendly, easy-to-use interface as shown in Figure 1 and an exhaustive hint and help system,
- has a built-in repository containing the settings for the most common bioinformatics databases and
- provides a built-in wizard for batch post-download processing of files.
The repository for downloading common bioinformatics databases includes the RCSB PDB, EBI and NCBI ftp and http servers. Those download tasks and other repository examples can be easily modified to create user-customized download tasks. This bioinformatics download repository will be populated with more tasks as BioDownloader users contribute newly defined tasks that we will add to the repository. User-created download task settings are saved in separate XML-formatted files, and can be easily shared with other researchers in the field.
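As an illustration of how a download task could be stored and shared as an XML file, the following Python sketch writes a task to XML and reads it back. The element names (`task`, `server`, `remote_dir`, `mask`) and the host `ftp.example.org` are hypothetical; the actual BioDownloader settings schema is not described in this article.

```python
import xml.etree.ElementTree as ET


def task_to_xml(name, server, remote_dir, masks):
    """Serialize a download task to an XML string (hypothetical schema)."""
    task = ET.Element("task", {"name": name})
    ET.SubElement(task, "server").text = server
    ET.SubElement(task, "remote_dir").text = remote_dir
    masks_el = ET.SubElement(task, "masks")
    for mask in masks:
        ET.SubElement(masks_el, "mask").text = mask
    return ET.tostring(task, encoding="unicode")


def task_from_xml(xml_text):
    """Parse the XML produced by task_to_xml back into a plain dict."""
    task = ET.fromstring(xml_text)
    return {
        "name": task.get("name"),
        "server": task.findtext("server"),
        "remote_dir": task.findtext("remote_dir"),
        "masks": [m.text for m in task.iter("mask")],
    }
```

Because the whole task is a small text file, a task defined by one user could in principle be sent to a colleague and loaded unchanged.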
Since bioinformatics databases sometimes consist of very many files, we developed a robust mechanism to download files even when the remote server drops connection with its client or some files in the download listing are missing or cannot be retrieved from the server. In some cases, the user does not need the whole database—just a subset or a few files. The software allows the retrieval of only the required files. The user can request a file set by providing a list of specific filenames or single or multiple wildcard file masks (e.g. *.mmol, 1c*.cif.Z, pdb????.ent.Z, etc.).
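The two ideas in this paragraph, selecting files by several wildcard masks and continuing when individual transfers fail, can be sketched in Python as follows. This is not BioDownloader's implementation: the `fetch` callback, the retry count and the linear backoff are assumptions made for the example.

```python
import fnmatch
import time


def select_files(listing, masks):
    """Keep names from `listing` that match at least one wildcard mask."""
    return [name for name in listing
            if any(fnmatch.fnmatchcase(name, mask) for mask in masks)]


def download_all(names, fetch, retries=3, delay=0.0):
    """Try every file; retry on errors and skip files that keep failing.

    `fetch` is a caller-supplied function (hypothetical) that downloads one
    file and raises OSError on failure, e.g. a dropped connection or a
    missing remote file.
    """
    done, failed = [], []
    for name in names:
        for attempt in range(1, retries + 1):
            try:
                fetch(name)
                done.append(name)
                break
            except OSError:
                if attempt == retries:
                    failed.append(name)          # give up on this file only
                else:
                    time.sleep(delay * attempt)  # simple linear backoff
    return done, failed
```

Returning the failed names separately lets the caller report or re-queue them without interrupting the rest of the batch.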
An important aspect is the way in which BioDownloader obtains the directory structure of a server. For ftp servers, when the number of files in a remote ftp directory is small, there is usually no problem, since the listing can be easily retrieved by automatic directory listing ftp requests. When the number of files is large (on the order of 10 000 or more), the server may simply fail to respond. In such a situation, the only possible solution might be reading a plain-text listing file or parsing an ls-lR formatted listing file, if these are available on the server (e.g. the PDB provides an ls-lR on its ftp server and a flat file listing all current PDB codes on its http server). BioDownloader implements both these ways of retrieving the ftp server structure. It can read and process such directory listings taken from a local computer or from a remote server. This way of retrieving a large set of files is also preferable when the files are spread across many different directories on an ftp server. In the latter case, the multiple ftp directory-content requests may take a lot of time, and may simply fail or time out. In the case of http servers, the directories are quite often not browsable. BioDownloader gets the http-server file listing from a file that can be stored locally or on any http or ftp server. The listing file is a plain-text file containing file names with or without the full path.
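To make the ls-lR approach concrete, here is a minimal Python sketch that turns the text of such a listing into full file paths with sizes. It assumes the common nine-column long format (mode, links, owner, group, size, month, day, time or year, name); listings produced by a particular server may differ, and this is not BioDownloader's own parser.

```python
import posixpath
import re

# A long-format entry starts with a file-type character and nine mode bits.
MODE = re.compile(r"^[-dlbcps][-rwxsStT]{9}")


def parse_ls_lr(text):
    """Return (path, size) for each regular file in `ls -lR` output."""
    files = []
    current_dir = ""
    for line in text.splitlines():
        line = line.rstrip()
        if not line or line.startswith("total "):
            continue
        if not MODE.match(line):
            if line.endswith(":"):
                current_dir = line[:-1]      # directory header, e.g. "./ab:"
            continue
        if line.startswith("-"):             # regular files only
            fields = line.split(None, 8)     # the name may contain spaces
            if len(fields) == 9:
                path = posixpath.normpath(
                    posixpath.join(current_dir, fields[8]))
                files.append((path, int(fields[4])))
    return files
```

A single pass over one listing file replaces what would otherwise be one directory request per remote directory.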
We built a batch processing wizard into BioDownloader for routine archive extracting (e.g. gzip), file conversions from one format into another (e.g. our xml2pdb program, http://dunbrack.fccc.edu/xml2pdb.php) and batch viewing of PDB structure files using predefined session templates.
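The archive-extraction step of such a batch process might look like the following Python sketch, which decompresses every gzip file in a directory and skips damaged archives. The function name and the `keep_archives` option are assumptions for the example, not part of BioDownloader.

```python
import gzip
import shutil
import zlib
from pathlib import Path


def extract_gzip_batch(directory, keep_archives=True):
    """Decompress every *.gz file in `directory`; return the new file paths."""
    extracted = []
    for archive in sorted(Path(directory).glob("*.gz")):
        target = archive.with_suffix("")       # "x.xml.gz" -> "x.xml"
        try:
            with gzip.open(archive, "rb") as src, open(target, "wb") as dst:
                shutil.copyfileobj(src, dst)
        except (OSError, EOFError, zlib.error):
            target.unlink(missing_ok=True)     # drop partial output, go on
            continue
        extracted.append(target)
        if not keep_archives:
            archive.unlink()
    return extracted
```

Note that Python's `gzip` module reads gzip archives only; files in the older Unix compress (`.Z`) format would need a different tool.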
It is common for computer users to mistype some input values. BioDownloader parses and verifies values for consistency as they are entered and warns about possible errors. For example, the application checks if a certain server exists and is available as soon as its name is entered. Error-checking and on-the-fly help systems make the whole process very easy and intuitive. BioDownloader is implemented with two user-interface modes, basic and advanced, allowing a new user to start using the application quickly.
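A check of this kind can be sketched in Python as below. The function validates the syntax of a typed host name and then tries to resolve it; the messages and the injectable `resolve` parameter are assumptions for the example. Name resolution only shows that the host exists, so a real availability check would also need to attempt a connection.

```python
import re
import socket

# One DNS label: letters, digits and hyphens, not starting or ending with "-".
_LABEL = re.compile(r"^(?!-)[A-Za-z0-9-]{1,63}(?<!-)$")


def check_server_name(name, resolve=socket.getaddrinfo):
    """Return (ok, message) for a host name typed by the user."""
    host = name.strip()
    if not host:
        return False, "server name is empty"
    if "/" in host:
        return False, "enter the host name only, without protocol or path"
    if len(host) > 253 or not all(_LABEL.match(p) for p in host.split(".")):
        return False, "server name contains invalid characters"
    try:
        resolve(host, None)                  # DNS lookup; raises on failure
    except OSError:
        return False, "server name could not be resolved"
    return True, "ok"
```

Passing the resolver as a parameter keeps the syntax checks usable, and testable, without network access.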
BioDownloader is based on platform-independent .NET technology. Currently, BioDownloader runs on Microsoft Windows, or on Mac OS under Windows virtual machine software. It does not yet run properly on UNIX-based operating systems under the Mono project (http://mono-project.com) implementation of C#, but this should become possible in the near future as Mono improves.
| 3 DISCUSSION |
|---|
What makes BioDownloader unique and novel is the combination of features tailored for the most common bioinformatics uses, the user friendliness of the graphical interface and the tested robustness of the program during download tasks involving very large numbers of files coming from various data servers. With an all-in-one application, the user can rely on downloaded data sets that are up-to-date and processed, and can avoid the drawbacks associated with borrowed Perl scripts and other programs. BioDownloader can be easily used without any prior knowledge of computer programming or of network protocols and configurations.
| ACKNOWLEDGEMENTS |
|---|
This work was supported by NIH grants R01-HG02302 and R01-GM73784 to R.L.D. and P30-CA06972 to Fox Chase Cancer Center.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Alfonso Valencia
Received on January 6, 2007; accepted on March 17, 2007
| REFERENCE |
|---|
Berman H, et al. The worldwide Protein Data Bank (wwPDB): ensuring a single, uniform archive of PDB data. Nucleic Acids Res. (2007) 35:D301–D303.