Dot-plot comparisons by multivariate analysis (DOCMA): a tool for classifying protein sequences
Centre de Génétique Moléculaire du CNRS, Laboratoire Associé à l'Université P. et M. Curie, 91198 Gif-sur-Yvette Cedex, France
A method for classifying protein sequences without resorting to pairwise alignment is presented. Called DOCMA (DOt-plot Comparisons by Multivariate Analysis), it is based on a multivariate analysis of the pairwise dot-plots between all the sequences in the set. The dot-plots are first simplified by considering only the projections of the "diagonal" segments of similarity onto the axes. From these projections a data matrix is built, in which each column represents the comparisons of one given sequence with all the others. This data matrix is then transformed into a distance matrix by a chi-squared analysis, from which the coordinates of the sequences in an orthonormal Euclidean space are obtained. The sequences are finally classified by a dynamic clustering procedure followed by a search for strong clusters. Application of this method to protein families such as the globins, the cytochromes c and the aminoacyl-tRNA synthetases shows that it is effective in delineating subgroups, including subgroups that contain distantly related sequences.
Received on April 14, 1992; accepted on July 9, 1992
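The step in which the data matrix is turned into a distance matrix can be illustrated with a minimal sketch. The abstract does not give the exact weighting used by DOCMA, so the code below assumes the standard chi-squared distance between column profiles used in correspondence analysis; the function name `chi2_distances` and the toy matrix are hypothetical and serve only as an illustration.

```python
from math import sqrt


def chi2_distances(matrix):
    """Chi-squared distances between the columns of a non-negative matrix.

    Assumption: the standard correspondence-analysis form is used, i.e.
    d^2(j, k) = sum_i (p_ij - p_ik)^2 / r_i, where p_ij is entry (i, j)
    divided by the sum of column j and r_i is the mass of row i.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    total = sum(sum(row) for row in matrix)
    col_sum = [sum(matrix[i][j] for i in range(n_rows)) for j in range(n_cols)]
    if total == 0 or any(s == 0 for s in col_sum):
        raise ValueError("every column must have a non-zero sum")

    # Row masses: share of the grand total carried by each row.
    row_mass = [sum(row) / total for row in matrix]
    # Column profiles: each column rescaled to sum to 1.
    profiles = [
        [matrix[i][j] / col_sum[j] for i in range(n_rows)]
        for j in range(n_cols)
    ]

    dist = [[0.0] * n_cols for _ in range(n_cols)]
    for j in range(n_cols):
        for k in range(j + 1, n_cols):
            # Rows with zero mass are all-zero and contribute nothing.
            d2 = sum(
                (profiles[j][i] - profiles[k][i]) ** 2 / row_mass[i]
                for i in range(n_rows)
                if row_mass[i] > 0
            )
            dist[j][k] = dist[k][j] = sqrt(d2)
    return dist


# Hypothetical toy data: columns 0 and 1 are identical, column 2 differs.
toy = [
    [4, 4, 0],
    [2, 2, 1],
    [0, 0, 5],
]
d = chi2_distances(toy)
```

In this toy case the first two columns have identical profiles, so their distance is zero, while the third column lies at the same non-zero distance from both. Coordinates in a Euclidean space could then be derived from such a distance matrix, for instance by a principal-coordinates style decomposition, which is not shown here.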