Compression of nucleic acid and protein sequence data
Department of Information Studies, University of Sheffield, Western Bank, Sheffield S10 2TN, UK
*To whom reprint requests should be sent.
This paper describes the application of text compression methods to machine-readable files of nucleic acid and protein sequence data. Two main methods, n-gram coding and run-length coding, are used to reduce the storage requirements of such files. A Pascal program combining both techniques achieved a compression figure of 74.6% for the GenBank database, and a program using only n-gram coding achieved a compression figure of 42.8% for the Protein Identification Resource database.
Received on November 29, 1985; accepted on February 24, 1986
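The two techniques named in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of the Pascal programs, so everything below is an assumption made for illustration: the function names, the minimum run length of 4, the use of digrams (n = 2), the table size of 16, and the greedy left-to-right substitution are all hypothetical choices and are written in Python rather than Pascal.

```python
from collections import Counter
from itertools import groupby


def run_length_encode(seq, min_run=4):
    """Replace runs of one symbol with (symbol, count); shorter runs stay literal.

    min_run=4 is an illustrative threshold, not a value taken from the paper.
    """
    out = []
    for symbol, group in groupby(seq):
        n = len(list(group))
        if n >= min_run:
            out.append((symbol, n))
        else:
            out.extend(symbol * n)  # adds the symbol n times as literals
    return out


def run_length_decode(tokens):
    """Invert run_length_encode."""
    return "".join(t if isinstance(t, str) else t[0] * t[1] for t in tokens)


def build_ngram_table(seq, n=2, size=16):
    """Map the `size` most frequent n-grams in seq to small integer codes."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return {gram: code for code, (gram, _) in enumerate(counts.most_common(size))}


def ngram_encode(seq, table, n=2):
    """Greedy left-to-right scan: emit a code for a known n-gram, else a literal."""
    out, i = [], 0
    while i < len(seq):
        gram = seq[i:i + n]
        if gram in table:
            out.append(table[gram])  # integer code stands for n symbols
            i += n
        else:
            out.append(seq[i])       # single literal symbol
            i += 1
    return out


def ngram_decode(tokens, table):
    """Invert ngram_encode using the same table."""
    inverse = {code: gram for gram, code in table.items()}
    return "".join(inverse[t] if isinstance(t, int) else t for t in tokens)
```

For example, `run_length_encode("AAAAAACGT")` returns `[('A', 6), 'C', 'G', 'T']`, and decoding the output of either encoder reproduces the input sequence. A real implementation would also need to pack the tokens into bytes; the sketch stops at the token level because the abstract does not describe the storage format.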
This article has been cited by other articles:
M. F. Lynch and P. Willett. Information retrieval research in the Department of Information Studies, University of Sheffield: 1965-1985. Journal of Information Science, January 1, 1987; 13(4): 221-234.
