Bioinformatics Advance Access originally published online on November 23, 2007
Bioinformatics 2008 24(2):299-301; doi:10.1093/bioinformatics/btm570
© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Automatic synchronization and distribution of biological databases and software over low-bandwidth networks among developing countries

Unitsa Sangket 1,*, Amornrat Phongdara 1, Wilaiwan Chotigeat 1, Darran Nathan 2, Woo-Yeon Kim 3, Jong Bhak 3, Chumpol Ngamphiw 4, Sissades Tongsima 4, Asif M. Khan 5, Honghuang Lin 5 and Tin Wee Tan 2,5

1Center for Genomics and Bioinformatics Research, Prince of Songkla University, Thailand, 2Asia-Pacific Bioinformatics Network, 3Korean BioInformation Center (KOBIC), KRIBB, Korea, 4Biostatistics and Informatics Laboratory, Genome Institute, National Center for Genetic Engineering and Biotechnology, Thailand and 5Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore

*To whom correspondence should be addressed.


    ABSTRACT
 

Summary: Bioinformatics involves the collection, organization and analysis of large amounts of biological data, using networks of computers and databases. Developing countries in the Asia-Pacific region are just moving into this new field of information-based biotechnology. However, the computational infrastructure and network bandwidths available in these countries are still at a basic level compared to that in developed countries. In this study, we assessed the utility of a BitTorrent-based Peer-to-Peer (btP2P) file distribution model for automatic synchronization and distribution of large amounts of biological data among developing countries. The initial country-level nodes in the Asia-Pacific region comprised Thailand, Korea and Singapore. The results showed a significant improvement in download performance using btP2P—three times faster overall download performance than conventional File Transfer Protocol (FTP). This study demonstrated the reliability of btP2P in the dissemination of continuously growing multi-gigabyte biological databases across the three Asia-Pacific countries. The download performance for btP2P can be further improved by including more nodes from other countries into the network. This suggests that the btP2P technology is appropriate for automatic synchronization and distribution of biological databases and software over low-bandwidth networks among developing countries in the Asia-Pacific region.

Availability: http://everest.bic.nus.edu.sg/p2p/

Contact: usangket{at}yahoo.com


    1 INTRODUCTION
 
Bioinformatics research often involves processing large amounts of biological data (Gilbert et al., 2004; Lim et al., 2003) that are regularly updated, on schedules ranging from daily to quarterly. Consequently, bioinformatics centers around the world have to update their database repositories frequently with the latest releases. These updates are normally carried out over the Internet using traditional client/server file distribution protocols such as File Transfer Protocol (FTP) and Hypertext Transfer Protocol (HTTP). However, this process requires large network bandwidth to ensure that the latest database releases are downloaded reliably and within a reasonable timeframe. In 1997, the Bio-Mirrors project (Gilbert et al., 2004; http://bio-mirror.net), linking up a network of database mirror sites in the Asia Pacific, was established to assist in this dissemination of data. Developing countries in the Asia-Pacific region have in recent years started moving into the field of bioinformatics (Ranganathan et al., 2002, 2006), but the computational infrastructure and network bandwidths available between and within these countries are still at a primitive level compared to those in more developed countries. Network bandwidth within these countries is very low, and the low reliability of connections means that breaks or aborts in downloads are common. Therefore, in spite of the availability of Bio-Mirrors nodes, many developing countries still face a major problem in regularly updating these databases. With the growing sizes of these Bio-Mirror databases (approximately 10 GB in 1998, 150 GB in 2003 and 707 GB as of 18 August 2007) (Gilbert et al., 2004; http://bio-mirror.net/biomirror/docs/about-databanks.txt), the problem will only worsen as the growth of databases surpasses the rate of bandwidth increase. For example, even with APAN, the Asia Pacific Advanced Network (http://www.apan.net), and TEIN2, the Trans-Eurasia Information Network (http://tein2.net), universities in developing countries still have significant difficulty obtaining the bandwidth that can guarantee regular nightly updates of these databases.

In the late 1990s, the Internet community witnessed the start of a major revolution in the way people shared files. The Peer-to-Peer (P2P) file sharing model was introduced with the widely popular Napster. Since then, the technology has continued to evolve and improve. BitTorrent (btP2P) is a more recent P2P communications protocol that has become very popular (Hales et al., 2005). Any program that uses the BitTorrent protocol is termed a btP2P client. The protocol allows the client to prepare, request and transmit any type of computer file over a network. A peer is any computer hosting the client; other peers can connect to it to transfer data. Peers usually do not have the complete file; those that do have the complete file and offer it for upload to other peers are called seeds. In contrast, a leech is a client that has the complete file, but does not share it with other peers in the network (for more definitions visit: http://www.azureuswiki.com/index.php/This_funny_word). A peer starts sharing file(s) by creating a ‘.torrent’ file, which contains the meta-data about the files to be shared and the tracker that coordinates the file distribution. The client divides the file to be shared into smaller fragments, typically a quarter of a megabyte each. Clients requesting to download the file first obtain the torrent file for it, through which they connect to the specified tracker, which responds by providing information on peers that can be connected to in order to download the fragments of the file.
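
The fragmenting step can be illustrated with a minimal Python sketch. It assumes a fragment size of 256 KiB (the quarter of a megabyte mentioned above) and computes one SHA-1 digest per fragment, which is the kind of per-fragment checksum that BitTorrent meta-data carries so that a downloader can verify each fragment independently of its source. The function name fragment_digests and the in-memory input are illustrative choices, not part of Azureus or of the setup used in this study.

```python
import hashlib

PIECE_SIZE = 256 * 1024  # assumed fragment size: a quarter of a megabyte


def fragment_digests(data: bytes, piece_size: int = PIECE_SIZE) -> list[str]:
    """Split data into fixed-size fragments and return one SHA-1 hex digest per fragment."""
    if piece_size <= 0:
        raise ValueError("piece_size must be positive")
    return [
        hashlib.sha1(data[offset:offset + piece_size]).hexdigest()
        for offset in range(0, len(data), piece_size)
    ]


if __name__ == "__main__":
    payload = bytes(600 * 1024)  # 600 KiB of zero bytes as stand-in data
    digests = fragment_digests(payload)
    # Two full 256 KiB fragments plus one shorter final fragment.
    print(len(digests), "fragments")
```

Because each fragment carries its own digest, a fragment received from any peer can be checked and, if corrupted, requested again from a different peer without restarting the whole download.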

The difference between the traditional client/server file distribution model and the P2P file distribution model using btP2P is illustrated in Figures 1 and 2. As the number of downloading clients in the traditional distribution architecture increases, the demand for bandwidth placed on servers increases dramatically, which eventually leads to network bottlenecks. With the btP2P architectural model, a single site no longer bears the sole burden of supplying the data for others to download. The more peers there are, the more nodes are available to distribute fragments of the file (Guo et al., 2007). High demand actually leads to greater throughput as more bandwidth from additional nodes becomes available to the group. Therefore, btP2P technology simultaneously addresses two major problems plaguing the distribution of biological data to developing countries: (1) low international bandwidth and (2) unreliable connections.
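
The scaling argument can be made concrete with a deliberately simplified model, sketched here in Python. It assumes that every node has a fixed upload capacity, that download capacity is never the limiting factor and that fragments are exchanged with perfect efficiency. The function names and the capacities in the example are made-up values for illustration, not measurements from this study.

```python
def client_server_rate(server_upload: float, n_clients: int) -> float:
    """Per-client rate when one server's upload capacity is shared equally."""
    return server_upload / n_clients


def p2p_rate(seed_upload: float, peer_upload: float, n_peers: int) -> float:
    """Idealised per-peer rate when every downloading peer also uploads.

    The rate cannot exceed the seed's own upload capacity, because every
    fragment has to leave the seed at least once.
    """
    aggregate = seed_upload + n_peers * peer_upload
    return min(seed_upload, aggregate / n_peers)


if __name__ == "__main__":
    # Hypothetical capacities in Mbit/s: one 10 Mbit/s source, 2 Mbit/s peers.
    for n in (1, 10, 100):
        print(n, client_server_rate(10.0, n), p2p_rate(10.0, 2.0, n))
```

Under these assumptions the client/server rate falls in proportion to the number of clients, whereas the P2P rate levels off near the upload capacity of a single peer, which is the behaviour described above.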


Fig. 1. Traditional client/server distribution of files.

 

Fig. 2. P2P distribution of files using BitTorrent protocol (btP2P).

 
With a btP2P architecture, downloads need not come from a central server in one country: every network-connected peer that synchronizes its databases or software, whether from the same institute, state, country or region, acts as a server and provides additional bandwidth that speeds up the overall download rate for all the peers. In the conventional client/server architecture, all downloads come from a single server, and if this connection becomes very slow or unreliable, there is no failover to automatically continue downloading from another source.

Given the benefits of btP2P over the traditional file transfer model, in this study we assessed the utility of btP2P for automatic synchronization and distribution of large amounts of biological data among developing countries in the Asia-Pacific region. The initial country-level nodes in the region comprise Thailand, Korea and Singapore.


    2 METHODS
 
After extensive analyses and trials of various btP2P clients, such as BitTorrent (http://www.bittorrent.com/download), uTorrent (http://www.utorrent.com/download.php), BitComet (http://www.bitcomet.com/doc/download.htm) and Azureus (http://azureus.sourceforge.net/download.php), the Azureus suite version 2.5.0.4 was selected for the following reasons: (1) it is open-source and has a large active development community, (2) it runs on Java, allowing it to be deployed on any operating system and (3) it has a well-documented plug-in interface that makes it easy to develop additional enhancements that may be necessary for this work.

Four trial nodes were set up for the first phase of testing the btP2P network in the Asia-Pacific region. These sites comprised (1) Prince of Songkla University (PSU, Thailand), (2) Korean Bioinformation Center (KOBIC, Korea), (3) National University of Singapore (NUS, Singapore) and (4) National Center for Genetic Engineering and Biotechnology (BIOTEC, Thailand). We set up three tracker sites to publish the .torrent files, as shown in Table 1. An RSSFeed Scanner Plugin (http://azureus.sourceforge.net/plugin_list.php) was used to trigger automatic synchronization of data at regular intervals. This allows the Azureus client program to download data automatically from the seed nodes without user intervention.
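
The idea behind feed-triggered synchronization can be sketched in a few lines of Python. This is not the RSSFeed Scanner Plugin itself, whose internals are not described here; it is a standalone illustration that assumes the tracker publishes an RSS 2.0 feed in which each item carries an enclosure pointing at a .torrent file. The feed content, the example.org URLs and the function name new_torrent_urls are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real client would fetch this from the tracker.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Hypothetical database release feed</title>
    <item>
      <title>db-release-a</title>
      <enclosure url="http://tracker.example.org/db-release-a.torrent"
                 type="application/x-bittorrent"/>
    </item>
    <item>
      <title>db-release-b</title>
      <enclosure url="http://tracker.example.org/db-release-b.torrent"
                 type="application/x-bittorrent"/>
    </item>
  </channel>
</rss>"""


def new_torrent_urls(feed_xml: str, seen: set[str]) -> list[str]:
    """Return .torrent enclosure URLs not yet in `seen`, and record them there."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for enclosure in root.iter("enclosure"):
        url = enclosure.get("url", "")
        if url.endswith(".torrent") and url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh


if __name__ == "__main__":
    seen: set[str] = set()
    print(new_torrent_urls(SAMPLE_FEED, seen))  # first poll: both releases are new
    print(new_torrent_urls(SAMPLE_FEED, seen))  # second poll: nothing new
```

In a deployment along these lines, the function would be called on a timer and each newly discovered URL handed to the btP2P client, so that no manual check for new releases is needed.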


Table 1. Tracker sites to publish torrents in the Asia-Pacific region

 
To compare the download performance between FTP and btP2P, we downloaded biological databases using both methods at the PSU node over 7 days. To achieve uniformity, we performed downloads using both methods on the same machine, on the same dates and over the same network; this ensured that the load on the network was the same for both methods during the test period. For evaluation of the FTP performance, the PSU node was set to download data from the KOBIC FTP server (ftp://ftp.kobic.re.kr/, Proftpd program version 1.2.10) by using the FileZilla program version 3.0.0-beta7 (http://filezilla.sourceforge.net/). Because the FTP client does not provide a function for automatic downloads like btP2P, we had to manually check for new files every day and download them. For evaluation of btP2P, on the other hand, the PSU node was set to download data from three seeds or peers: the KOBIC, NUS and BIOTEC nodes. The btP2P client was set to automatically download new .torrent files every 2 h from the KOBIC tracker.


    3 RESULTS AND DISCUSSION
 
Trace results from only the PSU node are shown in Figure 3. With only three nodes available as peers/seeds to the PSU node, the results demonstrate a significant improvement in download performance using btP2P over FTP. After 7 days, 23.2 GB of data had been successfully downloaded using FTP and about 70 GB using btP2P. Although the download throughputs for both FTP and btP2P dropped slightly after the third day, they increased slightly after the fifth day. Variations in the rates of transmission for the two protocols are quite similar and are likely to be due to fluctuations in daily Internet traffic and the underlying network quality of service. The use of btP2P appears more effective and is approximately three times faster in terms of overall download performance than conventional FTP in this test.
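
The headline ratio follows directly from the two totals reported above, and the same figures give a rough average throughput over the test period. The short Python sketch below performs this arithmetic. It assumes decimal gigabytes (10^9 bytes) and treats the 'about 70 GB' figure as exactly 70 GB, so the results are approximations; the function names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60
BYTES_PER_GB = 10**9  # assumption: decimal gigabytes


def speedup(btp2p_gb: float, ftp_gb: float) -> float:
    """Ratio of data volumes downloaded over the same period."""
    return btp2p_gb / ftp_gb


def average_mbit_per_s(gigabytes: float, days: float) -> float:
    """Average throughput in Mbit/s for a volume downloaded over a number of days."""
    bits = gigabytes * BYTES_PER_GB * 8
    return bits / (days * SECONDS_PER_DAY) / 1e6


if __name__ == "__main__":
    ftp_gb, btp2p_gb, days = 23.2, 70.0, 7  # totals reported for the PSU node
    print(f"speed-up: {speedup(btp2p_gb, ftp_gb):.2f}x")
    print(f"FTP:   {average_mbit_per_s(ftp_gb, days):.3f} Mbit/s")
    print(f"btP2P: {average_mbit_per_s(btp2p_gb, days):.3f} Mbit/s")
```

Under these assumptions both averages are below 1 Mbit/s, which is consistent with the low-bandwidth setting that motivates the study.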


Fig. 3. Download performance comparison between FTP and btP2P in the Asia-Pacific region.

 
Our results were in agreement with the findings of Scully L (http://cs.winona.edu/CSConference/2007proceedings/lincoln.pdf), who compared the download performance between FTP and btP2P on different subnets of the Saint Mary's University network in the USA.

In conclusion, the btP2P protocol appears significantly more effective than traditional FTP for synchronizing large multi-gigabyte public biological databases across the three Asia-Pacific countries, Thailand, Korea and Singapore. The download performance for btP2P can be further improved by including more nodes/seeds from other countries in the network (http://cs.winona.edu/CSConference/2007proceedings/lincoln.pdf). This suggests that the btP2P technology may be appropriate for automatic synchronization and distribution of biological databases and software over low-bandwidth networks among developing countries in the Asia-Pacific region. In the next test phase, more nodes from various Asia-Pacific countries will be included for large-scale real-time tests of the performance of the btP2P network.

A potential drawback of btP2P, purely from a network traffic point of view, is that it can waste network resources (http://portal.acm.org/citation.cfm?id=1146882). For example, because there is no assurance that the seeds listed by the tracker are ‘good’, i.e. offer fast downloads, a peer searches for the best set of seeds by trying different seeds in the list, which creates many unnecessary network connections and thus increases network traffic (http://portal.acm.org/citation.cfm?id=1146882). The Azureus program provides a function to limit this traffic by allowing the user to set the maximum number of connections allowed. Additionally, recent research suggests that, in future, this waste of network resources can be avoided by selecting the best set of seeds using a best-neighbors list from Internet Service Providers (ISPs) (Aggarwal et al., 2007) or by using a Content Distribution Network (CDN) mechanism (http://portal.acm.org/citation.cfm?id=1146882).


    ACKNOWLEDGEMENTS
 
This work was supported by the International Development Research Centre (IDRC) Canada, the Asia-Pacific Bioinformatics Network (APBioNet), and MOST under grant number M10508040002-07N0804-0021 and KADO of MIC in Korea.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on July 4, 2007; revised on October 9, 2007; accepted on November 10, 2007

    REFERENCES
 

    Aggarwal D, et al. Can ISPs and P2P users cooperate for improved performance? ACM SIGCOMM Comput. Commun. Rev. (2007) 37:29–40.

    Gilbert D, et al. Bio-Mirror project for public bio-data distribution. Bioinformatics (2004) 20:3238–3240.

    Guo L, et al. A Performance Study of BitTorrent-like Peer-to-Peer Systems. IEEE J. Selected Areas Commun. (2007) 25:155–169.

    Hales D, Patarin S. Computational Sociology for Systems "In the Wild": The Case of BitTorrent. IEEE Distrib. Syst. (2005) 6:1–6.

    Lim YP, et al. The S-Star trial bioinformatics course – an on-line learning success. Biochem. Mol. Biol. Educ. (2003) 31:20–23.

    Ranganathan S, et al. APBioNet: the Asia Pacific regional consortium for bioinformatics. Appl. Bioinformatics (2002) 1:101–105.

    Ranganathan S, et al. Establishing bioinformatics research in the Asia Pacific. BMC Bioinformatics (2006) 7(Suppl. 5). S1.

