Skip Navigation

This Article
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow FREE Full Text (Screen PDF)
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Kawaji, H.
Right arrow Articles by Matsuda, H.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kawaji, H.
Right arrow Articles by Matsuda, H.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

Vol. 20 no. 2 2004, pages 243-252
Bioinformatics © Oxford University Press 2004; all rights reserved.

Graph-based clustering for finding distant relationships in a large set of protein sequences

Hideya Kawaji 1,2, Yoichi Takenaka 1 and Hideo Matsuda 1,*

1 Department of Bioinformatic Engineering, Graduate School of Information Science and Technology, Osaka University, 1-3 Machikaneyama, Toyonaka, Osaka 560-8531, Japan and 2 NTT Software Corporation, 223-1 Yamashita-cho, Naka-ku, Yokohama, Kanagawa 231-8554, Japan

Received on May 26, 2003 ; revised on July 22, 2003 ; accepted on July 26, 2003

Motivation: Clustering of protein sequences is widely used for the functional characterization of proteins. However, it is still not easy to cluster distantly-related proteins, which have only regional similarity among their sequences. It is therefore necessary to develop an algorithm for clustering such distantly-related proteins.

Results: We have developed a time and space efficient clustering algorithm. It uses a graph representation where its vertices and edges denote proteins and their sequence similarities above a certain cutoff score, respectively. It repeatedly partitions the graph by removing edges that have small weights, which correspond to low sequence similarities. To find the appropriate partitions, we introduce a score combining the normalized cut and a locally minimal cut capacities. Our method is applied to the entire 40 703 human proteins in SWISS-PROT and TrEMBL. The resulting clusters shows a 76% recall (20 529 proteins) of the 26 917 classified by InterPro. It also finds relationships not found by other clustering methods.

Availability: The complete result of our algorithm for all the human proteins in SWISS-PROT and TrEMBL, and other supplementary information are available at http://motif.ics.es.osaka-u.ac.jp/Ncut-KL/

Contact: matsuda{at}ist.osaka-u.ac.jp

* To whom correspondence should be addressed.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Brief BioinformHome page
B. Andreopoulos, A. An, X. Wang, and M. Schroeder
A roadmap of clustering algorithms: finding a match for a biomedical application
Brief Bioinform, May 1, 2009; 10(3): 297 - 314.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
M. Hou, P. Berman, C.-H. Hsu, and R. S. Harris
HomologMiner: looking for homologous genomic groups in whole genomes
Bioinformatics, April 15, 2007; 23(8): 917 - 925.
[Abstract] [Full Text] [PDF]



Disclaimer: Please note that abstracts for content published before 1996 were created through digital scanning and may therefore not exactly replicate the text of the original print issues. All efforts have been made to ensure accuracy, but the Publisher will not be held responsible for any remaining inaccuracies. If you require any further clarification, please contact our Customer Services Department.