Bioinformatics Advance Access originally published online on May 14, 2008
Bioinformatics 2008 24(13):1534-1535; doi:10.1093/bioinformatics/btn233

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

ProBias: a web-server for the identification of user-specified types of compositionally biased segments in protein sequences

Igor B. Kuznetsov *

Gen*NY*sis Center for Excellence in Cancer Genomics, Department of Epidemiology and Biostatistics, One Discovery Drive, University at Albany, Rensselaer, NY 12144, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHOD
 3 DESCRIPTION OF THE...
 ACKNOWLEDGEMENT
 REFERENCES
 

Summary: Most proteins contain compositionally biased segments (CBS) in which one or more amino acid types are significantly overrepresented. CBS that contain amino acids with similar chemical properties can have functional and structural importance. This article describes ProBias, a web-server that searches a protein sequence for CBS composed of user-specified amino acid types. ProBias utilizes the discrete scan statistics to estimate statistical significance of CBS and is able to detect even subtle local deviations from the random independence model. The web-server also analyzes the global compositional bias of the input sequence. In the case of novel proteins that lack functional annotation, statistically significant CBS reported by ProBias can be used to guide the search for potential functionally important sites or domains.

Availability: Freely available at http://lcg.rit.albany.edu/ProBias

Contact: IKuznetsov{at}albany.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 
Most protein sequences contain segments called compositionally biased segments (CBS), whose amino acid composition is significantly different from the average amino acid usage of the proteome. Statistical analyses indicate that up to 25% of the proteome can be composed of CBS (Wootton and Federhen, 1996). A number of studies have demonstrated the existence of distinct types of CBS involved in a variety of molecular functions, such as protein–DNA and protein–protein interactions, transcription regulation, developmental control, membrane transport and essential structural roles (Brendel et al., 1992; Karlin et al., 2003; Koonin et al., 1996; Kreil and Ouzounis, 2003). It was shown that proteins involved in the same biological function tend to contain the same types of CBS (Harrison, 2006; Kuznetsov and Hwang, 2006). Many post-translational modifications are encoded in CBS located at the N- or C-terminus of the protein sequence (Eisenhaber et al., 2003). Recent observations indicate that particular types of CBS in proteins are often structurally disordered (Romero et al., 2004). CBS have also been linked to protein misfolding (DeMarco and Daggett, 2007) and a number of inherited neurological diseases (Gunawardena and Goldstein, 2005; Harrison and Gerstein, 2003; Kreil and Ouzounis, 2003). Global compositional bias, in which the entire protein sequence contains a large excess of particular amino acid types, is also known to be related to general protein function. For instance, the net positive charge of histones, which results from a bias in amino acid usage towards positively charged residues, facilitates interactions with negatively charged DNA (Karlin et al., 2003). These examples of the known roles of CBS are suggestive of future discoveries and highlight the need for readily available computational resources to study various flavors of CBS.

The most popular method used to identify and mask CBS in biological sequences during sequence database searches is SEG (Wootton and Federhen, 1996). Other empirical methods for masking CBS in biological sequences include XNU (Claverie and States, 1993), CAST (Promponas et al., 2000), SIMPLE (Alba et al., 2002), CARD (Shin and Kim, 2005) and GBA (Li and Kahveci, 2006). However, they are all general-purpose methods that do not distinguish between different types of compositional bias. The first statistical method designed for the identification of user-defined CBS, which searches for statistically significant clusters of amino acid residues with similar chemical properties, was implemented in the program SAPS (Brendel et al., 1992). More recently, a second statistical method for finding distinct types of CBS in proteins was proposed (Harrison, 2006; Harrison and Gerstein, 2003). Both methods use a conservative significance threshold to correct for multiple testing and will therefore miss low-density clusters. However, weakly significant clusters of amino acid types with certain properties can be the most likely candidates for functionally and/or structurally important sites (Kreil and Ouzounis, 2003; Kuznetsov and Hwang, 2006). Neither of these two methods was made available to the scientific community in the form of a user-friendly web-server designed to search for user-specified types of local and global compositional bias. Recently, we developed BIAS, a highly sensitive method that can identify statistically significant CBS composed of user-specified amino acid types (Kuznetsov and Hwang, 2006). This article describes ProBias, a web-server that implements the BIAS algorithm to search a protein sequence for instances of user-specified local and global compositional bias.


    2 METHOD
 
The BIAS algorithm addresses the following problem. Given a protein sequence, S, of length N, generated using the 20 amino acid types according to the random independence model, and a sub-alphabet, B, of m amino acid types (m << 20), find all CBS in which residues from B are significantly over-represented. In order to identify CBS, S is represented as a sequence of successes and failures in N independent Bernoulli trials. Successes correspond to residues from B; failures correspond to residues not included in B. Successes separated by fewer than d positions are merged into a single CBS. BIAS uses discrete scan statistics (Glaz et al., 2001) to estimate the significance of each CBS and is able to detect even subtle local deviations from the random independence model. It also determines whether observed CBS are significant because of local or global compositional bias. The significance of global compositional bias of S is estimated using the binomial distribution (see Kuznetsov and Hwang, 2006, for details).
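
The encoding and merging steps described above can be illustrated with a minimal Python sketch. It is not the BIAS implementation: the function names are hypothetical, the reading of "separated by fewer than d positions" as an index difference smaller than d is an assumption, and the scan-statistics significance estimate for each segment is deliberately left out (see Kuznetsov and Hwang, 2006). Only the Bernoulli encoding, the linkage of successes, and a binomial upper-tail probability for global bias are shown.

```python
from math import comb


def encode(sequence, sub_alphabet):
    """Represent S as Bernoulli trials: 1 (success) if the residue is in B, else 0."""
    members = set(sub_alphabet)
    return [1 if residue in members else 0 for residue in sequence]


def merge_successes(trials, d):
    """Link successes into candidate segments.

    Assumption: two consecutive successes are linked when the difference
    between their indices is smaller than d.
    Returns (start, end) pairs, 0-based and inclusive.
    """
    segments = []
    start = prev = None
    for i, hit in enumerate(trials):
        if not hit:
            continue
        if start is None:
            start = prev = i          # first success opens a segment
        elif i - prev < d:
            prev = i                  # close enough: extend the current segment
        else:
            segments.append((start, prev))
            start = prev = i          # too far: close it and open a new one
    if start is not None:
        segments.append((start, prev))
    return segments


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): chance of at least k successes in n trials."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))


if __name__ == "__main__":
    seq = "AKKRKAAAAAAAAKAR"          # toy sequence, not real data
    trials = encode(seq, "KR")
    print(merge_successes(trials, d=4))  # [(1, 4), (13, 15)]
    # Global over-representation of B, with a hypothetical background p{B} = 0.11
    print(binomial_upper_tail(sum(trials), len(trials), 0.11))
```

Each candidate segment would then be passed to a significance test. In BIAS that test is based on discrete scan statistics, which this sketch does not attempt to reproduce.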


    3 DESCRIPTION OF THE WEB-SERVER
 
The user interface consists of the sequence input field and four parameters used to run BIAS (Supplementary Fig. 1). The user can upload a protein sequence or retrieve it using its accession number. The four parameters used in ProBias are as follows:

  1. The expected probability of each of the 20 amino acid types. The user can choose to use a background model based on the amino acid frequencies observed in (a) the SwissProt database (default), (b) the Protein DataBank (PDB) or (c) the input sequence. PDB frequencies serve as a model derived from mostly globular proteins, whereas SwissProt frequencies serve as a model derived from all proteins in the protein universe. The former (PDB) is better suited to searching for non-globular domains (such as disordered regions). Option (c) is used to remove the effect of global compositional bias of the input sequence on local CBS.
  2. The linkage distance, d, used to merge positions into CBS (4 by default). The user can either specify the linkage distance or let the program estimate it as d = round(2/(3 × p{B})), where p{B} is the probability of observing a residue from sub-alphabet B. Larger values of d result in longer CBS.
  3. P-value cut-off (0.01 by default). The web-server will only report CBS with P-value less than this cut-off.
  4. Sub-alphabet(s). The user can either specify up to 10 amino acid sub-alphabets or use 10 pre-selected sub-alphabets (Supplementary Fig. 2). Each sub-alphabet is used independently to analyze the input sequence.
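
As a rough illustration of parameters 1 and 2, the sketch below computes p{B} from a table of amino acid frequencies and derives the linkage distance from it. It assumes the estimate reads d = round(2/(3 × p{B})). The function names are hypothetical, and only background option (c), frequencies taken from the input sequence, is computed here; SwissProt or PDB frequencies would have to be supplied as a dictionary by the caller.

```python
from collections import Counter


def sequence_frequencies(sequence):
    """Background option (c): amino acid frequencies observed in the input sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    return {residue: n / total for residue, n in counts.items()}


def sub_alphabet_probability(frequencies, sub_alphabet):
    """p{B}: probability of observing a residue from sub-alphabet B."""
    return sum(frequencies.get(residue, 0.0) for residue in set(sub_alphabet))


def estimate_linkage_distance(p_b):
    """Assumed form of the estimate: d = round(2 / (3 * p{B})).

    Note: Python's round() resolves exact .5 ties to the nearest even integer,
    which may differ from the rounding used by the web-server.
    """
    if not 0 < p_b <= 1:
        raise ValueError("p{B} must be in (0, 1]")
    return round(2 / (3 * p_b))


if __name__ == "__main__":
    freqs = sequence_frequencies("AAKR")       # toy sequence, not real data
    p_b = sub_alphabet_probability(freqs, "KR")
    print(p_b, estimate_linkage_distance(p_b))
```

Under this reading, a sub-alphabet with p{B} near 0.17 gives d = 4, which matches the stated default. Rarer sub-alphabets give larger d and therefore longer CBS.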

The main output page consists of two parts (Supplementary Fig. 3). The first part shows a table with the summary of the analysis of the global and local compositional bias: whether the residues from the input sub-alphabets are over- or under-represented and whether any CBS with the P-value below the cut-off were found in the input sequence. If statistically significant CBS were found, details can be obtained by clicking the corresponding hyperlink (Supplementary Fig. 4). The second part of the main output page shows the input sequence with CBS printed below the sequence. This part also shows low complexity regions (if any) found using the PSEG program (Wootton and Federhen, 1996). To the best of the author's knowledge, ProBias is the only existing web-server that can search a protein sequence for multiple user-specified types of CBS and provide a user-friendly graphical overview of both local and global compositional bias. In combination with other de novo sequence analysis methods, ProBias can be used to guide the search for potential functionally important sites or domains in novel proteins that lack functional annotation. The web-server is freely available at http://lcg.rit.albany.edu/ProBias.


    ACKNOWLEDGEMENT
 
The author thanks Run Li for help in web-server development.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: John Quackenbush

Received on March 17, 2007; revised on May 11, 2008; accepted on May 11, 2008

    REFERENCES
 

    Alba MM, et al. Detecting cryptically simple protein sequences using the SIMPLE algorithm. Bioinformatics (2002) 18:672–678.

    Brendel V, et al. Methods and algorithms for statistical analysis of protein sequences. Proc. Natl Acad. Sci. USA (1992) 89:2002–2006.

    Claverie J-M, States DJ. Information enhancement methods for large scale sequence analysis. Comput. Chem. (1993) 17:191–201.

    DeMarco ML, Daggett V. Molecular mechanism for low pH triggered misfolding of the human prion protein. Biochemistry (2007) 46:3045–3054.

    Eisenhaber F, et al. Prediction of lipid posttranslational modifications and localization signals from protein sequences: big-Pi, NMT and PTS1. Nucleic Acids Res. (2003) 31:3631–3634.

    Glaz J, et al. Scan Statistics (2001) New York: Springer-Verlag. 45–46.

    Gunawardena S, Goldstein LS. Polyglutamine diseases and transport problems: deadly traffic jams on neuronal highways. Arch. Neurol (2005) 62:46–51.

    Harrison PM. Exhaustive assignment of compositional bias reveals universally prevalent biased regions: analysis of functional associations in human and Drosophila. BMC Bioinformatics (2006) 7:441.

    Harrison PM, Gerstein M. A method to assess compositional bias in biological sequences and its application to prion-like glutamine/asparagine-rich domains in eukaryotic proteomes. Genome Biol (2003) 4:R40.

    Karlin S, et al. Genome comparisons and analysis. Curr. Opin. Struct. Biol (2003) 13:344–352.

    Koonin EV, et al. Protein sequence comparison at genome scale. Methods Enzymol (1996) 266:295–322.

    Kreil DP, Ouzounis CA. Comparison of sequence masking algorithms and the detection of biased protein sequence regions. Bioinformatics (2003) 19:1672–1681.

    Kuznetsov IB, Hwang S. A novel sensitive method for the detection of user-defined compositional bias in biological sequences. Bioinformatics (2006) 22:1055–1063.

    Li X, Kahveci T. A novel algorithm for identifying low-complexity regions in a protein sequence. Bioinformatics (2006) 22:2980–2987.

    Promponas VJ, et al. CAST: an iterative algorithm for the complexity analysis of sequence tracts. Bioinformatics (2000) 16:915–922.

    Romero P, et al. Natively disordered proteins: functions and predictions. Appl. Bioinformatics (2004) 3:105–113.

    Shin SW, Kim SM. A new algorithm for detecting low-complexity regions in protein sequences. Bioinformatics (2005) 21:160–170.

    Wootton JC, Federhen S. Analysis of compositionally biased regions in sequence databases. Methods Enzymol (1996) 266:554–571.

