Automatic evaluation of protein sequence functional patterns
Molecular Biology Computer Research Resource, Department of Biostatistics, Dana-Farber Cancer Institute and Harvard School of Public Health, LG-127, 44 Binney Street, Boston, MA 02115, USA
A procedure is described that automatically evaluates the diagnostic ability of a protein sequence functional pattern. The procedure identifies, among the sets that can be defined in terms of the functional annotation of a protein sequence database, the one closest to the set of database instances containing a given pattern. Assuming that the annotation in the protein sequence database is correct and complete, the degree of statistical association between these two sets provides an appropriate measure of the diagnostic ability of the pattern. An experimental implementation of the procedure, using the NBRF/PIR protein database, has been applied to a diverse collection of published sequence patterns. The results reveal that it is frequently not possible to define (in NBRF/PIR database terminology) the set of database instances containing a given pattern, which suggests either that the pattern lacks diagnostic ability or that the protein database annotation is incomplete or inconsistent.
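The idea behind the procedure can be illustrated with a minimal sketch, given here under stated assumptions rather than as the implementation used in this work. The sketch assumes that a pattern can be written as a regular expression, that each database entry carries a set of annotation terms, that candidate annotation-defined sets are restricted to single terms, and that the phi coefficient of the 2x2 contingency table serves as the measure of statistical association. The function names, the toy entries, and the annotation terms are all hypothetical and do not come from NBRF/PIR.

```python
import math
import re


def phi_coefficient(tp, fp, fn, tn):
    """Phi coefficient of a 2x2 table (an assumed association measure)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def evaluate_pattern(pattern, entries):
    """Return the single annotation term whose entry set is most strongly
    associated with the set of entries matching `pattern`, and its score.

    `entries` maps an entry id to (sequence, set of annotation terms).
    """
    regex = re.compile(pattern)
    all_ids = set(entries)

    # Set of database instances containing the pattern.
    matched = {eid for eid, (seq, _) in entries.items() if regex.search(seq)}

    # Sets definable from the annotation (single terms only in this sketch).
    by_term = {}
    for eid, (_, terms) in entries.items():
        for term in terms:
            by_term.setdefault(term, set()).add(eid)

    best_term, best_score = None, -1.0
    for term in sorted(by_term):  # sorted for a deterministic tie-break
        annotated = by_term[term]
        tp = len(matched & annotated)
        fp = len(matched - annotated)
        fn = len(annotated - matched)
        tn = len(all_ids) - tp - fp - fn
        score = phi_coefficient(tp, fp, fn, tn)
        if score > best_score:
            best_term, best_score = term, score
    return best_term, best_score


# Toy, made-up entries for illustration only.
entries = {
    "E1": ("MKGAGKSTLLA", {"kinase", "ATP-binding"}),
    "E2": ("AAGSGKTTLV", {"ATP-binding"}),
    "E3": ("MLLPPWW", {"kinase"}),
    "E4": ("QQRRNNDD", {"protease"}),
}
best_term, best_score = evaluate_pattern(r"G.GK[ST]", entries)
```

A score near 1 would indicate that the pattern coincides with an annotation-defined set; a low best score would correspond to the situation noted above, in which no annotation-defined set describes the pattern's instances. A fuller treatment would also consider combinations of annotation terms.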
Received on November 30, 1989; accepted on July 20, 1990