Bioinformatics Advance Access originally published online on February 24, 2006
Bioinformatics 2006 22(9):1055-1063; doi:10.1093/bioinformatics/btl049
A novel sensitive method for the detection of user-defined compositional bias in biological sequences
Gen*NY*sis Center for Excellence in Cancer Genomics, Department of Epidemiology and Biostatistics, University at Albany, State University of New York One Discovery Drive, Rensselaer, NY 12144, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Most biological sequences contain compositionally biased segments in which one or more residue types are significantly overrepresented. The function and evolution of these segments are poorly understood. Usually, all types of compositionally biased segments are masked and ignored during sequence analysis. However, it has been shown for a number of proteins that biased segments that contain amino acids with similar chemical properties are involved in a variety of molecular functions and human diseases. A detailed large-scale analysis of the functional implications and evolutionary conservation of different compositionally biased segments requires a sensitive method capable of detecting user-specified types of compositional bias.
Results: We present BIAS, a novel sensitive method for the detection of compositionally biased segments composed of a user-specified set of residue types. BIAS uses discrete scan statistics, which provide a highly accurate correction for multiple tests, to compute analytical estimates of the significance of each compositionally biased segment. The method can take into account global compositional bias when computing analytical estimates of the significance of local clusters. BIAS is benchmarked against the SEG, SAPS and CAST programs. We also use BIAS to show that groups of proteins with the same biological function are significantly associated with particular types of compositionally biased segments.
Availability: The software is available at http://lcg.rit.albany.edu/bias/
Contact: ikuznetsov{at}albany.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
| INTRODUCTION |
|---|
Most protein and DNA sequences contain compositionally biased segments. In these segments one or more residue types are significantly overrepresented. Whether or not a given sequence segment has a compositional bias depends on the choice of background model used to define an unbiased sequence. In most biological applications residue positions in a protein or DNA sequence are modeled as a sequence of discrete events generated according to the random independence model. In this model a residue in sequence position j, a_j, is generated according to its probability, p(a_j), and this probability is independent of the rest of the positions (zeroth-order Markov model). Usually, p(a_j) is the same as the frequency of residue a_j computed using a large collection of protein or DNA sequences (some sequence database). The random independence model is used to assess the significance of database hits in the most popular and efficient sequence similarity search algorithms such as BLAST (Altschul et al., 1997). Therefore, compositionally biased segments that violate the assumptions of the random independence model may lead to incorrect estimates of statistical significance in sequence similarity searches and wrong functional and evolutionary inference.
Identification of compositionally biased segments is important not only in sequence similarity searches. DNA sequences contain a large fraction of diverse biased segments with poorly understood function. More than 50% of all proteins have been shown to contain biased segments (Wootton and Federhen, 1996). Some researchers believe that they are similar to junk DNA and are mostly determined by the compositional bias on the level of coding DNA sequence (Nishizawa and Nishizawa, 1998; Huntley and Golding, 2000; Singer and Hickey, 2000). Others suggest that biased segments in proteins are not just junk sequences and are involved in a variety of molecular functions: protein active sites, protein–DNA and protein–protein interactions (Karlin et al., 2003), transcription regulation (Brendel and Karlin, 1989), membrane transport (Kreil and Ouzounis, 2003), structural function (Karlin et al., 2003) and peptides at protein termini (Berezovsky et al., 1999). In the case of a new protein that does not have detectable homologs, weak but statistically significant compositionally biased segments with certain properties can be one of the few clues that the researcher can use to make a biologically meaningful guess about the protein's function and structure. Compositionally biased protein segments have also been implicated in a number of human diseases such as prion diseases, Huntington's disease, cancer and others (Harrison and Gerstein, 2003; Karlin et al., 2002). It has been proposed that certain types of these segments are overrepresented in proteins involved in neurological disorders (Karlin and Burge, 1996). Unlike proteins, DNA sequences are composed of only four residue types and can be quite long. As a result, compositional bias in DNA is considerably harder to deal with than that in proteins.
There are two types of compositional bias: global and local. In the case of global bias, the entire protein sequence contains a large excess of particular residue types (an excess of hydrophilic residues, for instance) and becomes one long compositionally biased segment. In the case of local bias, most of the protein sequence conforms to the random independence model with only relatively small local clusters of over- or underrepresented residue types. Such locally biased segments may be the most likely candidates for functionally and/or structurally important sites. A number of methods that identify and mask compositionally biased segments by replacing them with lowercase letters or a string of Xs in protein and Ns in DNA sequences have been developed. The popular SEG algorithm uses a sliding window and entropy-based measure to detect regions with low-information complexity called low-complexity regions (Wootton and Federhen, 1996) that correspond to compositionally biased segments. Masking low-complexity regions with SEG significantly improves the specificity of sequence similarity search and it has become a de facto standard in protein BLAST searches. The SEG algorithm uses equal prior probabilities of residue types and segmentation threshold, chosen based on random sequences, and therefore provides no estimates of statistical significance of the detected low-complexity regions. It has been proposed that SEG reports too many low-complexity segments and introduces an artificial bias into the distribution of their length owing to a pre-set size of the window used to identify seed segments (Kreil and Ouzounis, 2003). CAST is another method for detecting compositionally biased segments (Promponas et al., 2000). It uses pairwise sequence alignment of query sequence with homopolymer sequences (20 homopolymer sequences in the case of proteins) and assigns significance based on the local alignment statistics. 
SIMPLE is a method for the identification of simple repeats in protein and DNA sequences that estimates statistical significance by using randomly generated sequences of the same length and composition as those of the test sequence (Alba et al., 2002). None of these methods is suitable for the identification of short biased segments composed of a user-specified subset of residue types with similar chemical properties.
The first statistical method designed for the identification of compositional bias by searching for statistically significant clusters of amino acid residues with similar chemical properties was proposed by Karlin et al. (1990) and implemented in the program SAPS (Brendel et al., 1992). In this method, an amino acid sequence of length N is represented as a sequence of successes and failures in N independent Bernoulli trials. Successes correspond to residues that belong to a pre-selected group of similar amino acids (charged residues, for instance) and failures correspond to the rest of the 20 amino acids. Unusual clusters of amino acids from the group of interest are identified by counting the number of successes in a sliding window of fixed size. The significance of each window is estimated using the normal approximation to the binomial distribution. This approach has two major limitations. First, in order for the normal approximation to be valid, the window size must be sufficiently large. Second, this simple model does not provide a correction for multiple tests, a problem that arises from the use of all possible windows. In order to deal with multiple tests, SAPS uses three conservative p-value thresholds: for proteins shorter than 750 residues, proteins with length 750–1500 residues, and proteins longer than 1500 residues. Because of these limitations, SAPS may miss weak but still significant and biologically informative signals in the form of small or low-density clusters. The program was designed neither for masking compositionally biased segments nor for the analysis of DNA sequences. Recently, a statistical method for finding compositionally biased regions in proteins similar to SAPS has been proposed (Harrison and Gerstein, 2003). This method also counts the number of successes in a sliding window and uses the exact binomial test to estimate the statistical significance of each window. As in the case of SAPS, there is no explicit analytical correction for multiple testing. None of the above methods separates global compositional bias from the local one.
Here we present BIAS, a novel method and software for the identification of compositionally biased segments in protein and DNA sequences. BIAS automatically searches for unusual clusters of user-specified residue types and computes analytical estimates of the significance of each cluster. These estimates are based on scan statistics that allows one to detect even subtle local deviations from the random independence model. BIAS also distinguishes whether or not observed clusters are significant due to local or global compositional bias. Although the method is primarily intended for the analysis of proteins, it can be applied to DNA sequences as well. We use a number of biological examples to compare the performance of BIAS with that of SEG, SAPS and CAST programs. We also show that groups of proteins with the same biological function contain compositionally biased segments of the same type and that this association between type of function and type of compositional bias is highly non-random.
| SYSTEMS AND METHODS |
|---|
Overview
BIAS addresses the following problem. Given a sequence, S, of length N composed of residues generated using an alphabet, A, of size n according to the random independence model and a subalphabet, B, of m residues selected from A (m < n), find subsequences of S in which residues from B are overrepresented. These compositionally biased subsequences (segments) will correspond to clusters that have an unusually high density of residues from B and are unlikely to be observed by pure chance under the random independence model. In order to identify such clusters, S is represented as a sequence of successes and failures in N independent Bernoulli trials. Successes correspond to residues from B; failures correspond to those residues from A that are not included in B. Successes separated by less than d positions are merged into a single cluster. The statistical significance of each cluster is estimated using the discrete scan statistic. The discrete scan statistic, denoted S_w, is the maximum number of successes observed within any w contiguous trials in a sequence of N Bernoulli trials. Tests based on scan statistics use a significance level that takes into account the sliding of the window to maximize the density of successes within the cluster. The unconditional discrete scan statistic is used, in which the total number of successes in N trials is treated as a random variable. The probability of success is computed as the sum of the probabilities of observing residue types from the subalphabet B.
Formal description of BIAS
Let S = s_1, ... , s_N be a biological sequence, where each residue s_i belongs to an alphabet A = {a_1, ... , a_n}. Let B = {b_1, ... , b_m} be a subalphabet of A (m < n). Let P(j) be the probability of occurrence of residue type j. Recode S into a sequence of N independent identically distributed binomial random variables x_1, ... , x_N, where P(x_i = 1) = 1 − P(x_i = 0) = p, according to the following rule:
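$$x_i = \begin{cases} 1 & \text{if } s_i \in B \\ 0 & \text{otherwise} \end{cases}, \qquad p = \sum_{j \in B} P(j)$$ | (1) |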
A cluster C(i, j) is a subsequence x_i, ... , x_j of the recoded sequence that satisfies the following conditions:
- C(i, j) cannot be extended: subsequences x_{i−d}, ... , x_{i−1} and x_{j+1}, ... , x_{j+d} (i − d ≥ 1 and j + d ≤ N) consist only of failures, or i = 1 or j = N (positions at sequence termini).
- Any two consecutive successes x_f and x_g from the cluster are separated by less than d positions: for i ≤ f < g ≤ j, g − f ≤ d.
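The recoding and clustering rules above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the PBIAS source code: the function names `recode` and `find_clusters` are hypothetical, positions are 1-based as in the text, and two consecutive successes are linked when their positions differ by at most d.

```python
def recode(seq, subalphabet):
    """Recode a sequence into Bernoulli trials: 1 for residues in B, 0 otherwise."""
    members = set(subalphabet)
    return [1 if residue in members else 0 for residue in seq]


def find_clusters(x, d):
    """Return clusters as (i, j, k): 1-based start, 1-based end, number of successes.

    Assumption: consecutive successes at positions f < g belong to the same
    cluster when g - f <= d. A lone success forms a cluster of size 1.
    """
    clusters = []
    start = prev = None
    k = 0
    for pos, xi in enumerate(x, start=1):
        if not xi:
            continue
        if start is not None and pos - prev <= d:
            k += 1                      # extend the current cluster
        else:
            if start is not None:
                clusters.append((start, prev, k))
            start, k = pos, 1           # open a new cluster at this success
        prev = pos
    if start is not None:
        clusters.append((start, prev, k))
    return clusters


if __name__ == "__main__":
    x = recode("AADEAEAAAAAAD", "DE")
    for i, j, k in find_clusters(x, d=2):
        print(f"cluster {i}-{j}: k={k}, w={j - i + 1}")
```

Each returned triple gives the quantities needed for the significance test: the number of successes k and, from the bounds, the cluster size w = j − i + 1.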
For a given cluster C(i, j) compute the number of successes, k, and cluster size, w = j − i + 1. Estimate the significance of the cluster by computing the probability of observing a value of the unconditional discrete scan statistic, S_w, greater than or equal to k in a sequence of N trials with the probability of success p, P(S_w ≥ k | N, p). This probability can be computed analytically using the following highly accurate approximation (Glaz et al., 2001):
![]() | (2) |
![]() | (3) |
![]() | (4) |
![]() | (5) |
![]() | (6) |
![]() | (7) |
![]() | (8) |
![]() | (9) |
![]() | (10) |
For an excess of residues from B (if h ≥ N × p) we estimate P(X ≥ h | N, p). For a lack of residues from B (if h < N × p) we estimate P(X < h | N, p) by the following equations:
![]() | (11) |
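As a rough illustration of the global bias test, the sketch below computes exact binomial tail probabilities. It assumes that h is the observed number of residues from B in the whole sequence and that X follows a Binomial(N, p) distribution, which is what the Bernoulli model implies; the function name `global_bias_pvalue` is hypothetical, and the exact form of Equation (11) may differ.

```python
import math


def binom_pmf(j, n, p):
    """P(X = j) for X ~ Binomial(n, p)."""
    return math.comb(n, j) * p**j * (1 - p) ** (n - j)


def global_bias_pvalue(h, n, p):
    """Tail probability for an excess or a lack of successes.

    Assumption: h successes were observed in n trials. Returns
    P(X >= h) when h >= n*p (excess) and P(X < h) otherwise (lack).
    """
    if h >= n * p:
        return sum(binom_pmf(j, n, p) for j in range(h, n + 1))
    return sum(binom_pmf(j, n, p) for j in range(0, h))


if __name__ == "__main__":
    # Hypothetical example: 40 residues from B in a 250-residue protein, p = 0.1
    print(global_bias_pvalue(40, 250, 0.1))
```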
A flowchart that illustrates the application of the method to find clusters of negatively charged residues in amino acid sequence is shown in Figure 1.
| IMPLEMENTATION |
|---|
The method is implemented as a standard ANSI C++ program. The program (PBIAS for proteins and NBIAS for DNA) has a command-line interface, reads input sequences in standard FASTA format and is suitable for high-throughput sequence analysis pipelines. Source code and pre-compiled binaries for Windows and Linux are freely available for non-commercial use at http://lcg.rit.albany.edu/bias/. A Perl script, MPBIAS.PL, that implements a CAST-like (Promponas et al., 2000) procedure for masking of low-complexity segments in a protein sequence is also provided. This script searches for subsequences in which one of the 20 amino acid types is significantly overrepresented and then replaces all positions in such subsequences with the X character.
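The masking step can be pictured with a short Python sketch. It is not a translation of MPBIAS.PL: the function name `mask_segments` and the 1-based inclusive segment coordinates are assumptions made for illustration.

```python
def mask_segments(seq, segments, mask_char="X"):
    """Replace every position inside the given segments with mask_char.

    segments: iterable of (start, end) pairs, 1-based and inclusive.
    Coordinates outside the sequence are clipped.
    """
    chars = list(seq)
    for start, end in segments:
        for pos in range(max(start - 1, 0), min(end, len(chars))):
            chars[pos] = mask_char
    return "".join(chars)


if __name__ == "__main__":
    print(mask_segments("MKKKKAL", [(2, 5)]))                 # protein, masked with X
    print(mask_segments("ACGGGCGT", [(3, 7)], mask_char="N"))  # DNA, masked with N
```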
Probability of each of the 20 amino acid types is estimated using its frequency in a non-redundant protein database. Two sets of frequencies are used: SPROT50, derived from the SWISS-PROT database (Bairoch and Apweiler, 2000) and PDB50, derived from the Protein Databank proteins (Berman et al., 2000). Sequences in both datasets were clustered at 50% sequence identity, meaning that all sequence pairs within the same dataset have identity <50%. Clustering was performed using the CD-HIT program (Li et al., 2002). PDB50 frequencies serve as the background model derived from mostly globular proteins, whereas SPROT50 frequencies serve as the background model derived from all proteins in the protein universe. In order to remove the effect of global compositional bias, significance of each cluster is also estimated using sequence-derived residue frequencies (default option for DNA sequences). The program also has options for estimating the significance of local compositional bias by using random sequences.
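Significance estimation from random sequences can be sketched as a Monte Carlo estimate of P(S_w ≥ k | N, p). This is a simulation under the random independence model, not the analytical approximation of Glaz et al. (2001) and not the program's actual option; the function names, the number of trials and the add-one estimator are illustrative choices.

```python
import random


def scan_statistic(x, w):
    """Maximum number of successes within any w contiguous trials of x."""
    w = min(w, len(x))
    count = sum(x[:w])
    best = count
    for i in range(w, len(x)):
        count += x[i] - x[i - w]        # slide the window by one position
        best = max(best, count)
    return best


def mc_scan_pvalue(k, w, n, p, trials=2000, seed=0):
    """Monte Carlo estimate of P(S_w >= k | n, p) from random Bernoulli sequences.

    Uses the add-one estimator (hits + 1) / (trials + 1) so the estimate is never 0.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x = [1 if rng.random() < p else 0 for _ in range(n)]
        if scan_statistic(x, w) >= k:
            hits += 1
    return (hits + 1) / (trials + 1)


if __name__ == "__main__":
    # Hypothetical cluster: 8 successes in a window of 12, sequence length 250, p = 0.1
    print(mc_scan_pvalue(k=8, w=12, n=250, p=0.1))
```

Because the smallest estimate is 1 / (trials + 1), a simulation of this kind cannot resolve very small p-values without a large number of random sequences, which is one practical reason to prefer an analytical estimate.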
| DISCUSSION |
|---|
In this section we present the application of PBIAS to a number of well-annotated proteins that contain domains with a different type of chemical bias that determines specific properties of these domains and to a set of compositionally biased proteins from the Plasmodium genome. The performance of PBIAS is compared with that of SEG (Wootton and Federhen, 1996), SAPS (Brendel et al., 1992) and CAST (Promponas et al., 2000). For a reasonably fair comparison, all programs were used with default arguments (linkage distance of 4, SPROT50 frequencies and p-value threshold of 0.05 for PBIAS). These arguments were shown to perform well on a wide variety of compositionally biased proteins. Depending on the nature of each protein site, PBIAS was used with a subalphabet that contains amino acid types with similar chemical properties suitable for the identification of this particular site: hydrophobic, aliphatic, charged, etc. SAPS was used via web interface available at http://www.ch.embnet.org/software/SAPS_form.html. CAST was used via web interface available at http://biophysics.biol.uoa.gr/cgi-bin/CAST/cast_cgi. We also show that NBIAS can be successfully used to mask a cDNA sequence coding for a compositionally biased protein domain. Finally, we use BIAS to demonstrate a highly significant non-random association between type of function and type of compositionally biased segments observed in groups of proteins with the same biological function.
Prion protein
The prion protein, PrP, is a rare infectious agent that causes transmissible spongiform encephalopathy. PrP can undergo a dramatic transition from the normal mostly helical conformation to the pathogenic conformation rich in beta-sheet (Prusiner, 1998). Prion protein has been extensively characterized experimentally. It consists of a flexible unstructured N-terminal domain thought to be involved in binding copper and a structured mostly helical C-terminal domain. It has been shown that upon formation of PrP dimer the switch region located in the C-terminus of helix B unwinds (Knaus et al., 2001). PrP also contains N- and C-terminal signal peptides which are removed during post-translational modification (Lehmann et al., 1999). The results of application of PBIAS, SEG, SAPS and CAST to human prion protein are shown in Table 1 and briefly summarized below.
- PBIAS is the only program that accurately identifies the N-terminal signal peptide as an unusually dense cluster of hydrophobic residues. PBIAS, SEG and SAPS identify the C-terminal signal peptide. In this case, SAPS is most accurate with default parameters (residues 232–253), but it reports the peptide as a transmembrane region. SAPS also reports the same peptide as a hydrophobic region, 240–253 versus 240–252 in PBIAS. PBIAS accurately identifies signal peptide residues 232–252 when linkage distance is increased from the default value of 4 to 5.
- PBIAS accurately identifies most of the unstructured N-terminal domain as a highly significant long cluster of residues with high propensity for irregular (coil) conformation. CAST also identifies a significant portion of this domain. Both SEG and SAPS identify only about half of this domain. Unlike PBIAS, SEG and SAPS do not provide any insights into the properties of the residues in this domain.
- PBIAS and SEG identify the flexible C-terminal end of helix B. PBIAS identifies it as a highly significant cluster of Threonine residues and is more accurate than SEG. Neither SAPS nor CAST reports this segment.
It should be noted that all segments identified in PrP by PBIAS are statistically significant when the global compositional bias of the protein is accounted for. In other words, they all represent true cases of local compositional bias. This is especially interesting in the case of the unstructured N-terminal domain. Although PrP has a significant excess of residues with high propensity for coil conformation [Equation (11), p = 3 × 10⁻³], the cluster of these residues corresponding to the N-terminal domain is significant (p = 5 × 10⁻⁴) even when sequence-derived frequencies are used.
Ser-tRNA synthetase
The N-terminal domain of Ser-tRNA synthetase is a non-globular long helical bundle that interacts with cognate tRNA (Fig. 2) (Fujinaga et al., 1993). PBIAS accurately identifies this domain as a highly significant (p = 4 × 10⁻⁷) cluster of strong helix formers (Table 2). Despite a large excess of strong helix formers [Equation (11), p = 7 × 10⁻⁶], this cluster remains significant (p = 8 × 10⁻³) when global bias is accounted for. SEG identifies two short low-complexity segments that cover only about 35% of this domain. CAST identifies about two-thirds of this domain. SAPS does not report any unusual segments.
Collagen
Collagen is an extracellular structural protein with three long non-globular triple-helical domains rich in the amino acids Glycine and Proline, which have unusual backbone conformational preferences. These triple-helical domains are involved in the formation of coiled-coil structure (Beck and Brodsky, 1998). PBIAS is most accurate at identifying these domains as highly significant clusters of Glycine and Proline (Table 2). These clusters remain significant when global bias is accounted for. SEG identifies many short low-complexity regions in these domains. SAPS identifies the entire 275–897 segment as a highly repetitive region. CAST identifies the 269–899 segment as Gly- and Pro-rich.
Cytochrome C biogenesis protein
Cytochrome C biogenesis protein is involved in the heme delivery pathway for cytochrome c maturation (Stevens et al., 2004). This protein contains a transmembrane domain (residues 67–88). SAPS is most accurate at identifying this domain with default parameters. PBIAS with default arguments and a subalphabet of aliphatic residues (lipid-soluble residues) identifies a slightly longer segment that includes this domain. If linkage distance is changed from the default value of 4 to 3, PBIAS identifies the same domain as SAPS (Table 2). SEG identifies only a part of the transmembrane domain. Both PBIAS and SEG find another compositionally biased segment in the C-terminus. CAST does not report any segments.
Ribosomal protein L32E
Ribosomal protein L32E has two disordered regions, long (residues 1–94) and short (residues 237–340) (Klein et al., 2001). PBIAS is most accurate at identifying the long region as an unusual cluster of residues with high propensity for disordered conformation. Both SEG and SAPS identify only a small fraction of this region, with SAPS identifying it as a negatively charged region (Table 2). CAST reports the 2–113 segment as Glu-rich.
C-Myc I protein
C-Myc I protein has an acidic domain (residues 226–245) and a leucine zipper domain (residues 387–415) (Vriz et al., 1989). PBIAS is most accurate at identifying these domains as clusters of charged residues (Table 2), whereas SEG reports too many low-complexity regions. Both CAST and SAPS identify the acidic domain but miss the leucine zipper.
Masking cDNA sequence of prion protein
NBIAS was used with default arguments (linkage distance 5, sequence-derived frequencies, p-value threshold of 0.05) and subalphabet GC to mask the cDNA of human prion protein. This procedure resulted in masking almost the entire N-terminal part (codons 2–105), which is significantly compositionally biased in the protein sequence. SEG only masked codons 96–105. Application of NBIAS to the cDNA sequence of chicken PrP, which has very different nucleotide composition and shares only 35% sequence identity with human PrP, gives similar results (see Supplementary information).
Compositionally biased segments in PDB and SWISS-PROT proteins
It has been shown before using SEG that low-complexity segments are underrepresented in globular proteins available in the Protein Databank (PDB) (Huntley and Golding, 2002). Application of PBIAS to proteins from the PDB and SWISS-PROT confirms this observation. As one can see from Figure 3, the percentage of proteins with at least one compositionally biased segment that has p-value < 10⁻³ is significantly lower in PDB proteins for all tested subalphabets. The largest difference is observed for hydrophobic and hydrophilic segments that are underrepresented in PDB by more than one order of magnitude. This is consistent with the observation that sequences lacking periodicity of hydrophobic/hydrophilic residues cannot form compact globular structure (Dill, 1999; Silverman, 2005) and therefore are underrepresented in mostly globular PDB proteins.
Analysis of low-complexity segments in Plasmodium falciparum proteins
In order to compare the performance of PBIAS with that of SEG and CAST on a large sequence set we used 210 proteins from chromosome 2 of P. falciparum (Gardner et al., 1998). We chose this dataset because it represents a set of highly compositionally biased eukaryotic proteins that was extensively used before to compare the CAST and SEG methods (Promponas et al., 2000). Each of the three programs was used with the default arguments. The distribution of the number of low-complexity segments detected in this dataset by PBIAS, SEG and CAST is shown in Figure 4. We find that the performance of PBIAS is more similar to that of SEG than to that of CAST. For instance, PBIAS identifies at least one low-complexity segment in 93% of the sequences, SEG in 90%, whereas CAST only in 74%. Both PBIAS and SEG also report a significantly higher number of sequences that contain five or more low-complexity segments than CAST.
We also studied how PBIAS masking of low-complexity segments performs in database searching compared with masking with SEG and CAST. We used each of the 210 P. falciparum sequences masked by PBIAS, SEG and CAST as a query sequence in a BLASTP (Altschul et al., 1997) sequence similarity search. BLASTP was run against the NCBI NR database with the default arguments and the masking option turned off. For each query sequence, i, we computed four overlap indices normalized between 0 and 100%:
![]() | (12) |
![]() | (13) |
In these indices, the overlap count for methods X and Y at threshold E is the number of hits with E-value below E that appear both in the hit list for i masked with method X and in the hit list for i masked with method Y (the size of the overlap between the two lists), with X, Y ∈ {PBIAS, CAST, SEG}. If the value of a given overlap index for list X is 100%, all hits in list X are also included in list Y. Small values of the overlap index mean that list X contains many hits that are not included in list Y. We used two E-value thresholds, 10⁻⁶ and 10⁻³⁰. The results of BLASTP benchmarking indicate that when PBIAS is compared with SEG, identical hit lists are returned for 30% of the sequences if the E-value is set to 10⁻⁶ and for 41.4% of the sequences if the E-value is set to 10⁻³⁰. When PBIAS is compared with CAST, identical hit lists are returned for 31.4 and 39.5% of the sequences, respectively. A more detailed analysis of the overlap indices (Table 3) shows that both SEG- and CAST-masked query sequences tend to return more hits than the PBIAS-masked ones (as indicated by low average O_C and O_S overlap indices ranging from 67.6 to 76.0%). Almost all hits returned for PBIAS-masked sequences are included in SEG and CAST hit lists (as indicated by high average O_P overlap indices ranging from 95.2 to 99.7%).
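A minimal Python sketch of an overlap index with the behavior described above follows. It assumes the index for list X is the overlap size divided by the size of list X, expressed as a percentage; the function name `overlap_index`, the hit identifiers and the convention of returning `None` for an empty list are illustrative assumptions.

```python
def overlap_index(hits_x, hits_y):
    """Percentage of hits in list X that also appear in list Y.

    Returns None when list X is empty, since the index is undefined there.
    """
    x, y = set(hits_x), set(hits_y)
    if not x:
        return None
    return 100.0 * len(x & y) / len(x)


if __name__ == "__main__":
    # Hypothetical hit lists for one query masked by two methods
    pbias_hits = ["hitA", "hitB"]
    seg_hits = ["hitA", "hitB", "hitC"]
    print(overlap_index(pbias_hits, seg_hits))  # all PBIAS hits are in the SEG list
    print(overlap_index(seg_hits, pbias_hits))  # the SEG list has one extra hit
```

Under this reading, a high index for the PBIAS list together with a lower index for the SEG or CAST list corresponds to the reported pattern: the PBIAS hit list is largely a subset of the other list.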
Compositionally biased segments and protein function
An important question regarding compositionally biased segments is whether they exhibit statistically significant associations with protein function. We used functional annotation from the Gene Ontology (GO) database (Ashburner et al., 2000) to study whether clusters of residues with particular chemical properties preferentially occur in proteins with specific types of function. The GO database consists of three major categories (ontologies): cellular component, biological process and molecular function. Each ontology consists of a set of unique terms indexed by unique id numbers. A protein is classified using zero or more terms from each of the three ontologies. For instance, SWISS-PROT protein P80385 is linked to GO term 4679 (Molecular function: AMP-activated protein kinase activity) and GO term 6950 (Biological process: response to stress). We used the following procedure to identify a significant association between biased segments from a subalphabet C and a particular GO term i:
- Use PBIAS with default arguments and a given subalphabet, C, to search each protein from the SPROT50 dataset for biased segments. For GO term i count the number of proteins, n(i), that contain at least one biased segment with p-value below the threshold of 10⁻³ and are linked to this term.
- Count the total number of proteins in SPROT50, M, that contain at least one biased segment from subalphabet C. Randomly sample, without replacement, M proteins from SPROT50. Count the number of proteins in this random sample associated with GO term i, r(i). Repeat the sampling N times.
- Use random samples to compute for each r(i) its average, E(i), and standard deviation, S(i). Use these to compute the Z-score for GO term i, z(i):
$$z(i) = \frac{n(i) - E(i)}{S(i)}$$ | (14) |
where E(i) and S(i) are the mean and the standard deviation of r(i) over the N random samples. Random sampling was performed 10⁵ times (N = 10⁵, a number for which the Z-score converges up to one decimal digit).
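The sampling procedure above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the function name `go_term_zscore`, the data structures and the use of the population standard deviation are choices made here, and the default number of samples is far smaller than the 10⁵ used in the text.

```python
import random
import statistics


def go_term_zscore(n_i, annotated, population, m, n_samples=1000, seed=0):
    """Z-score of the observed count n_i against random samples of m proteins.

    n_i:        observed number of proteins with a biased segment linked to the term
    annotated:  set of protein ids linked to the GO term
    population: list of all protein ids in the dataset
    m:          number of proteins that contain at least one biased segment
    Returns None when the sampled counts have zero variance.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_samples):
        sample = rng.sample(population, m)          # sampling without replacement
        counts.append(sum(1 for prot in sample if prot in annotated))
    mean = statistics.fmean(counts)
    sd = statistics.pstdev(counts)
    if sd == 0:
        return None
    return (n_i - mean) / sd


if __name__ == "__main__":
    # Toy dataset: 100 proteins, 10 linked to the term, 20 with a biased segment
    population = list(range(100))
    annotated = set(range(10))
    print(go_term_zscore(10, annotated, population, m=20))
```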
We used two non-overlapping subalphabets, RKENDSTQ (hydrophilic amino acids) and LVIFM (major hydrophobic amino acids), and a subalphabet that partially overlaps with the hydrophilic one, GPTSDN (amino acids with high propensity for coil conformation). Figure 5 shows all GO terms from Molecular function ontology that have the absolute value of z(i) greater than 6.0 for at least one of the three subalphabets. Since we cannot assume that all distributions are normal, the threshold Z-score of 6.0 was chosen to ensure that the observed associations are significant even for skewed distributions. Inspection of Figure 5 leads to the following conclusions:
- There are groups of GO terms associated with only one type of biased segments (hydrophobic, hydrophilic or coil), despite the fact that hydrophilic and coil subalphabets overlap. Hydrophobic clusters are associated with receptor activity, voltage-gated channel activity, transporter activity. Hydrophilic clusters are associated with nuclease activity, transcription, splicing and translation. Clusters of coil residues are associated with direct binding to nucleic acids, protein binding, structural constituent of cuticle.
- Presence of hydrophobic clusters is negatively correlated with the presence of hydrophilic clusters. There are, however, three exceptions: GO:4872 (receptor activity), GO:5245 (voltage-gated calcium channel activity) and GO:5486 (t-SNARE activity), all being integral membrane proteins.
Very similar trends are observed in the other two ontologies, Cellular component and Biological process. However, the number of significant associations in these two ontologies is very large and cannot be presented in a compact form suitable for the limited journal space. Detailed results for a variety of subalphabets and all three ontologies will be presented elsewhere. The primary point from Figure 5 is that we do observe that groups of proteins with the same biological function contain the same type of compositionally biased segments and this association between type of function and type of compositional bias is highly non-random. This suggests that compositionally biased segments are not just junk sequences and can be used to construct potential function-specific de novo protein signatures.
| CONCLUSIONS |
|---|
We presented BIAS, a method for the identification of compositionally biased segments composed of a user-specified set of residue types. Main features of the method are as follows.
- BIAS finds statistically significant clusters of user-specified residue types that are unlikely to be observed under the random independence model. Significance is estimated analytically using scan statistics that provides a highly accurate correction for multiple tests.
- BIAS provides an exact estimate of global compositional bias of the test sequence and can take into account global compositional bias when computing analytical estimates of the significance of local clusters.
- Both the main advantage and main shortcoming of the proposed method is that it requires the user to supply a subalphabet of residue types used to search for compositional bias. The advantage is that BIAS can be used to search for compositionally biased segments composed of amino acids with similar chemical properties such as charge, hydrophobicity, size, etc. The disadvantage is that BIAS will ignore biased regions composed of residue types that are not included into the user-supplied subalphabet.
The application of BIAS to a number of test proteins has shown that it is suitable for de novo characterization of functionally important sites. Its application to proteins from the SWISS-PROT database has also shown that groups of proteins with the same biological function contain the same type of compositionally biased segments and that this association between type of function and type of compositional bias is non-random.
| Acknowledgments |
|---|
The authors thank Dr Yaroslava Ruzankina for help and two anonymous reviewers for their valuable comments.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Christos Ouzounis
Received on June 26, 2005; revised on December 23, 2005; accepted on February 7, 2006
| REFERENCES |
|---|
|
|
|---|
Alba, M.M., et al. (2002) Detecting cryptically simple protein sequences using the SIMPLE algorithm. Bioinformatics, 8, 672678.
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, . 25, 33893402
Ashburner, M., et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology consortium. Nat. Genet, . 25, 2529[CrossRef][Web of Science][Medline].
Bairoch, A. and Apweiler, R. (2000) The SWISS-PROT protein sequence database and its supplement TrEMBL in 2000. Nucleic Acids Res, . 28, 4548
Beck, K. and Brodsky, B. (1998) Supercoiled protein motifs: the collagen triple-helix and the alpha-helical coiled coil. J. Struct. Biol, . 122, 1729[CrossRef][Web of Science][Medline].
Berezovsky, I.N., et al. (1999) Amino acid composition of protein termini are biased in different manners. Protein Eng, . 12, 2330
Berman, H.M., et al. (2000) The protein data bank. Nucleic Acids Res, . 28, 235242
Brendel, V. and Karlin, S. (1989) Association of charge clusters with functional domains of cellular transcription factors. Proc. Natl Acad. Sci. USA, 86, 56985702
Brendel, V., et al. (1992) Methods and algorithms for statistical analysis of protein sequences. Proc. Natl Acad. Sci. USA, 89, 20022006
Dill, K.A. (1999) Polymer principles and protein folding. Protein Sci, . 8, 11661180[Web of Science][Medline].
Fujinaga, M., et al. (1993) Refined crystal structure of the seryl-tRNA synthetase from Thermus thermophilus at 2.5Å resolution. J. Mol. Biol, . 234, 222233[CrossRef][Web of Science][Medline].
Gardner, M.J., et al. (1998) Chromosome 2 sequence of the human malaria parasite Plasmodium falciparum. Science, 282, 11261132
Glaz, J., Naus, J., Wallenstein, S. Scan Statistics, (2001) , NY Springer-Verlag, pp. 4546.
Harrison, P.M. and Gerstein, M. (2003) A method to assess compositional bias in biological sequences and its application to prion-like glutamine/asparagine-rich domains in eukaryotic proteomes. Genome Biol, . 4, R40[CrossRef][Medline].
Huntley, M. and Golding, G.B. (2000) Evolution of simple sequence in proteins. J. Mol. Evol, . 51, 131140[Web of Science][Medline].
Huntley, M.A. and Golding, G.B. (2002) Simple sequences are rare in the Protein Data Bank. Proteins, 48, 134140[CrossRef][Web of Science][Medline].
Karlin, S. and Burge, C. (1996) Trinucleotide repeats and long homopeptides in genes and proteins associated with nervous system disease and development. Proc. Natl Acad. Sci. USA, 93, 15601565
Karlin, S., et al. (1990) Identification of significant sequence patterns in proteins. Methods Enzymol, . 183, 388402[Web of Science][Medline].
Karlin, S., et al. (2002) Amino acid runs in eukaryotic proteomes and disease associations. Proc. Natl Acad. Sci. USA, 99, 333338
Karlin, S., et al. (2003) Genome comparisons and analysis. Curr. Opin. Struct. Biol, . 13, 344352[CrossRef][Web of Science][Medline].
Klein, D.J., et al. (2001) The kink-turn: a new RNA secondary structure motif. EMBO J, . 20, 42144221[CrossRef][Web of Science][Medline].
Knaus, K.J., et al. (2001) Crystal structure of the human prion protein reveals a mechanism for oligomerization. Nat. Struct. Biol, . 8, 770774[CrossRef][Web of Science][Medline].
Kreil, D.P. and Ouzounis, C.A. (2003) Comparison of sequence masking algorithms and the detection of biased protein sequence regions. Bioinformatics, 19, 16721681
Lehmann, S., et al. (1999) Trafficking of the cellular isoform of the prion protein. Biomed. Pharmacother, . 53, 3946[CrossRef][Medline].
Li, W., et al. (2002) Tolerating some redundancy significantly speeds up clustering of large protein databases. Bioinformatics, 8, 7782.
Nishizawa, M. and Nishizawa, K. (1998) Biased usages of arginines and lysines in proteins are correlated with local-scale fluctuations of the G + C content of DNA sequences. J. Mol. Evol, . 47, 385393[CrossRef][Web of Science][Medline].
Promponas, V.J., et al. (2000) CAST: an iterative algorithm for the complexity analysis of sequence tracts. Bioinformatics, 16, 915922
Prusiner, S.B. (1998) Prions. Proc. Natl Acad. Sci. USA, 95, 1336313383
Silverman, B.D. (2005) Underlying hydrophobic sequence periodicity of protein tertiary structure. J. Biomol. Struct. Dyn, . 22, 411423[Web of Science][Medline].
Singer, G.A. and Hickey, D.A. (2000) Nucleotide bias causes a genome wide bias in the amino acid composition of proteins. Mol. Biol. Evol, . 17, 15811588
Stevens, J.M., et al. (2004) C-type cytochrome formation: chemical and biological enigmas. Acc. Chem. Res, . 37, 9991007[CrossRef][Web of Science][Medline].
Vriz, S., et al. (1989) Differential expression of two Xenopus c-myc proto-oncogenes during development. EMBO J, . 8, 40914097[Web of Science][Medline].
Wootton, J.C. and Federhen, S. (1996) Analysis of compositionally biased regions in sequence databases. Methods Enzymol, . 266, 554571[Web of Science][Medline].