Bioinformatics Vol. 18 no. 11 2002
Pages 1523-1534
© 2002 Oxford University Press
Euclidian space and grouping of biological objects
1 Department of Biochemistry
2 Howard Hughes Medical Institute, University of Texas Southwestern Medical Center,
5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA
Received on February 2, 2002; revised on April 17, 2002; accepted on April 23, 2002
Motivation: Biological objects tend to cluster into discrete groups, and objects within a group typically share similar properties. Fast and efficient tools are needed that group objects into biologically meaningful clusters. Protein sequences reflect biological diversity and offer an extraordinary variety of objects for refining clustering strategies. Grouping of sequences should reflect their evolutionary history and their functional properties. Visualization of relationships between sequences is no less important, and tree-building methods are typically used for it. An alternative approach to visualization is a multidimensional sequence space, in which proteins are represented as points and the distances between the points reflect the relationships between the proteins. Such a space can also serve as a basis for model-based clustering strategies, which typically produce results that correlate better with the biological properties of proteins.
Results: We developed an approach to the classification of biological objects that combines evolutionary measures of their similarity with a model-based clustering procedure, and we apply the methodology to amino acid sequences. In the first step, given a multiple sequence alignment, we estimate evolutionary distances between proteins, measured in expected numbers of amino acid substitutions per site. These distances are additive and are suitable for evolutionary tree reconstruction. In the second step, we find the best-fit approximation of the evolutionary distances by Euclidian distances and thus represent each protein by a point in a multidimensional space. The Euclidian space may be projected onto two or three dimensions, and the projections can be used to visualize relationships between proteins. In the third step, we find a non-parametric estimate of the probability density of the points and place the points that belong to the same local maximum of this density in one group. The number of groups is controlled by a parameter that determines the shape of the density estimate and the number of maxima in it. The grouping procedure outperforms commonly used methods such as UPGMA and single linkage clustering.
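The third step can be illustrated with a minimal Python sketch. It is not the EESG implementation: it assumes the proteins have already been embedded as points, it assumes a Gaussian kernel for the non-parametric density estimate, and it finds the local maximum for each point by mean-shift hill climbing. The function name `density_mode_clusters`, the bandwidth name `sigma`, the merge radius of `0.1 * sigma`, and the sample coordinates are all hypothetical choices made for illustration.

```python
import math


def density_mode_clusters(points, sigma, tol=1e-6, max_iter=1000):
    """Label each point by the local maximum of a Gaussian kernel density
    estimate that it reaches by hill climbing (mean-shift iterations)."""
    points = [tuple(p) for p in points]

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    def shift(x):
        # Kernel-weighted mean of all points; this moves x uphill on the density.
        w = [math.exp(-sq_dist(x, p) / (2.0 * sigma ** 2)) for p in points]
        total = sum(w)
        return tuple(
            sum(wi * p[k] for wi, p in zip(w, points)) / total
            for k in range(len(x))
        )

    modes, labels = [], []
    for p in points:
        x = p
        for _ in range(max_iter):
            nxt = shift(x)
            done = sq_dist(x, nxt) < tol ** 2
            x = nxt
            if done:
                break
        # Reuse a known maximum if x landed on it (within 0.1 * sigma, an
        # arbitrary illustrative radius); otherwise record a new maximum.
        for i, m in enumerate(modes):
            if sq_dist(x, m) < (0.1 * sigma) ** 2:
                labels.append(i)
                break
        else:
            modes.append(x)
            labels.append(len(modes) - 1)
    return labels


# Hypothetical 2-D coordinates standing in for embedded proteins.
pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
narrow = density_mode_clusters(pts, sigma=0.5)
wide = density_mode_clusters(pts, sigma=10.0)
print(narrow, len(set(wide)))
```

With the small bandwidth the two tight triplets form separate groups, while the large bandwidth smooths the density into a single maximum and merges them. This is the sense in which one parameter of the density estimate can control the number of groups.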
Availability: The code of the EESG program for Mathematica 4 (Wolfram Research), as well as the details of the analysis, is freely available at ftp://iole.swmed.edu/pub/EESG/.
Contact: grishin{at}chop.swmed.edu
* To whom correspondence should be addressed.