A novel method of protein sequence classification based on oligopeptide frequency analysis and its application to search for functional sites and to domain localization
Institute of Cytology and Genetics, Russian Academy of Sciences, Novosibirsk 630090, Russia
1To whom correspondence should be addressed
A new method for distinguishing among protein families based on the analysis of the oligopeptide composition of amino acid sequences is presented. It is assumed that any protein family can be characterized by a set of essential oligopeptides (an oligopeptide vocabulary). A simple approach for finding such a vocabulary is suggested. It is shown that comparison of the vocabularies can distinguish different families from one another and from random sequences. This comparison can be made successfully with a small set of frequencies of 25 dipeptides (or tripeptides). No preliminary alignment is necessary. It is established that the characteristic peptides are located in functionally important regions, as shown for the GTP-binding domains of the translation elongation factors. It is demonstrated that this method is reasonably efficient for localizing functional domains in amino acid sequences. The average error of prediction does not exceed three or four amino acid residues, as shown for several functional domains.
Received on January 27, 1992; accepted on July 29, 1992
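The idea of an alignment-free dipeptide vocabulary can be illustrated with a minimal Python sketch. It is not the authors' procedure: the abstract does not state how essential oligopeptides are selected, so the enrichment criterion used here (family frequency minus background frequency) is an assumption, as are the function names `dipeptide_frequencies`, `vocabulary`, `vocabulary_score` and `window_profile`. Only the vocabulary size of 25 is taken from the text.

```python
from collections import Counter


def dipeptide_frequencies(sequences):
    """Relative frequency of each overlapping dipeptide over all sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - 1):
            counts[seq[i:i + 2]] += 1
    total = sum(counts.values())
    return {dp: n / total for dp, n in counts.items()} if total else {}


def vocabulary(family, background, size=25):
    """Pick the `size` dipeptides most enriched in the family.

    ASSUMPTION: enrichment is measured as the frequency difference between
    the family and a background set; the paper may use another criterion.
    """
    fam = dipeptide_frequencies(family)
    bg = dipeptide_frequencies(background)
    # Sort by descending enrichment; break ties alphabetically.
    ranked = sorted(fam, key=lambda dp: (-(fam[dp] - bg.get(dp, 0.0)), dp))
    return ranked[:size]


def vocabulary_score(sequence, vocab):
    """Fraction of the sequence's overlapping dipeptides found in `vocab`."""
    n = len(sequence) - 1
    if n <= 0:
        return 0.0
    vocab = set(vocab)
    return sum(sequence[i:i + 2] in vocab for i in range(n)) / n


def window_profile(sequence, vocab, width):
    """Vocabulary score of every window of `width` residues.

    Peaks in this profile are a hypothetical way to suggest where a
    family-specific region lies, with no alignment required.
    """
    return [vocabulary_score(sequence[i:i + width], vocab)
            for i in range(len(sequence) - width + 1)]
```

A sequence can then be assigned to the family whose vocabulary gives it the highest `vocabulary_score`, and `window_profile` can be scanned for maxima to propose a domain location. Tripeptides would need only a change of the slice length from 2 to 3.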