Bioinformatics Advance Access originally published online on October 2, 2006
Bioinformatics 2006 22(24):2980-2987; doi:10.1093/bioinformatics/btl495
A novel algorithm for identifying low-complexity regions in a protein sequence
Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL 32611, USA
*To whom correspondence should be addressed.
Motivation: We consider the problem of identifying low-complexity regions (LCRs) in a protein sequence. LCRs are regions of biased composition, normally consisting of different kinds of repeats.
Results: We define new complexity measures to compute the complexity of a sequence based on a given scoring matrix, such as BLOSUM 62. Our complexity measures also consider the order of amino acids in the sequence and the sequence length. We develop a novel graph-based algorithm called GBA to identify LCRs in a protein sequence. In the graph constructed for the sequence, each vertex corresponds to a pair of similar amino acids. Each edge connects two pairs of amino acids that can be grouped together to form a longer repeat. GBA finds short subsequences as LCR candidates by traversing this graph. It then extends them to find longer subsequences that may contain full repeats with low complexities. Extended subsequences are then post-processed to refine repeats to LCRs. Our experiments on real data show that GBA has significantly higher recall compared to existing algorithms, including 0j.py, CARD, and SEG.
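The graph construction described above can be illustrated with a minimal sketch. It is not the authors' GBA implementation, and the abstract does not give the exact vertex and edge rules, so everything below is an assumption made for illustration: `similar` uses plain residue identity as a stand-in for a scoring-matrix test such as a BLOSUM62 cutoff, `max_gap` and `min_len` are hypothetical parameters, and an edge joins pair (i, j) to pair (i+1, j+1). The extension and post-processing steps and the complexity measures are omitted.

```python
from itertools import combinations


def similar(a, b):
    # Assumption: identity stands in for a scoring-matrix threshold test
    # (e.g. a BLOSUM62 score above some cutoff).
    return a == b


def build_pair_graph(seq, max_gap=10):
    """Build a toy pair graph for a protein sequence.

    Vertices are position pairs (i, j), i < j, whose residues are similar
    and at most max_gap apart. An edge links (i, j) to (i+1, j+1) when both
    are vertices, i.e. when the two pairs can be grouped into a longer repeat.
    """
    vertices = {
        (i, j)
        for i, j in combinations(range(len(seq)), 2)
        if j - i <= max_gap and similar(seq[i], seq[j])
    }
    edges = {
        (i, j): (i + 1, j + 1)
        for (i, j) in vertices
        if (i + 1, j + 1) in vertices
    }
    return vertices, edges


def candidate_regions(seq, max_gap=10, min_len=3):
    """Traverse chains of linked pairs and return candidate spans.

    Each span is a half-open (start, end) interval covering both copies
    of a chain that contains at least min_len linked pairs.
    """
    vertices, edges = build_pair_graph(seq, max_gap)
    heads = vertices - set(edges.values())  # vertices with no predecessor
    regions = set()
    for head in heads:
        length, v = 1, head
        while v in edges:
            v = edges[v]
            length += 1
        if length >= min_len:
            regions.add((head[0], v[1] + 1))
    return sorted(regions)
```

For example, `candidate_regions("ACDEFQQQQQQGHIK")` reports the single span covering the poly-Q run, while a sequence with no repeated residues yields no candidates. Swapping `similar` for a substitution-matrix lookup would let the same traversal pick up inexact repeats.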
Availability: The program is available on request.
Contact: xli{at}cise.ufl.edu, tamer{at}cise.ufl.edu
Received on June 20, 2006; revised on September 1, 2006; accepted on September 22, 2006
