Natural sequence code representations for compression and rapid searching of human-genome style databases
1Proteus Molecular Design Limited, Proteus House 48 Stockport Road, Marple, Cheshire SK6 6AB, UK
2Department of Molecular Biology and Molecular Biology, University of Manchester M13 9PT UK
Department of Structural Properties of Materials and Biophysics, Danmarks Tekniske Højskole, Lyngby DK-2800, Denmark
Numeric descriptions (bio-informatic descriptions) of amino acid residues have been developed which will be of value whenever the quality and quantity of information in very large (i.e. human-genome style) gene and protein sequences is to be compared or manipulated. These codes are as natural as possible by our criteria (the same principles could be applied if the criteria were revised). In particular, in storing and searching large amounts of sequence data, natural codes, which relate to the properties of amino acids, can be combined with existing fast-search algorithms while introducing several advantages. The code can be assigned such that subselection of bits leads to compressed databases in which residues are defined less specifically, by classes of properties. The most compressed representation specifies a residue only as polar or non-polar, while the most extended representation used at present also allows specification of, for example, glyco-asparagine and phosphoserine. Preliminary studies on both a supercomputer and smaller machines suggest a worst-case speed-up of 4.5-fold. For more intelligent searching, coding extensions mixed with the basic sequence data give the sequence data some of the character of a computer program.
Received on November 7, 1991; accepted on February 14, 1992
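The bit-subselection idea in the abstract can be illustrated with a minimal sketch. The 4-bit code table below is hypothetical and covers only ten residues; it is not the assignment developed in the paper, and the names `CODES`, `encode` and `find` are introduced here purely for illustration. The only property it shares with the description above is that the most significant bit distinguishes polar from non-polar residues, so keeping fewer leading bits yields a coarser, more compressed alphabet.

```python
# Hypothetical 4-bit residue codes (NOT the published assignments).
# Top bit: 0 = non-polar, 1 = polar. Lower bits refine the class.
CODES = {
    "L": 0b0000, "I": 0b0001, "V": 0b0010, "F": 0b0100,  # non-polar
    "S": 0b1000, "T": 0b1001,                            # polar, uncharged
    "D": 0b1100, "E": 0b1101,                            # polar, acidic
    "K": 0b1110, "R": 0b1111,                            # polar, basic
}
CODE_BITS = 4


def encode(seq, keep_bits=CODE_BITS):
    """Encode a sequence, keeping only the leading `keep_bits` bits per residue."""
    if not 1 <= keep_bits <= CODE_BITS:
        raise ValueError("keep_bits must be between 1 and CODE_BITS")
    shift = CODE_BITS - keep_bits
    return [CODES[residue] >> shift for residue in seq]


def find(query, target, keep_bits=CODE_BITS):
    """Return start positions where query matches target at the chosen resolution."""
    q = encode(query, keep_bits)
    t = encode(target, keep_bits)
    return [i for i in range(len(t) - len(q) + 1) if t[i:i + len(q)] == q]


if __name__ == "__main__":
    print(encode("LSK", keep_bits=1))        # polar / non-polar only
    print(find("IV", "LLSKVL", keep_bits=1))  # coarse search
    print(find("IV", "LLSKVL", keep_bits=4))  # exact search
```

With one bit kept, the search matches any run of two non-polar residues; with all four bits it requires the exact residues. A coarse pass of this kind could serve as a cheap pre-filter before a full-resolution comparison, although whether that matches the search strategy actually used in the paper is an assumption.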