

Bioinformatics Advance Access originally published online on March 3, 2005
Bioinformatics 2005 21(10):2264-2270; doi:10.1093/bioinformatics/bti363

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

Prediction of protein interdomain linker regions by a hidden Markov model

Kyounghwa Bae 1, Bani K. Mallick 1 and Christine G. Elsik 2,*

1Department of Statistics, Texas A&M University, College Station, TX 77843-3143, USA
2Department of Animal Science and Intercollegiate Faculty of Genetics, Texas A&M University, College Station, TX 77843-2471, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 INTRODUCTION
 DATA
 MODEL
 COMPUTATION
 RESULTS
 DISCUSSION
 REFERENCES
 

Motivation: Our aim was to predict protein interdomain linker regions using sequence alone, without requiring known homology. Identifying linker regions will delineate domain boundaries, and can be used to computationally dissect proteins into domains prior to clustering them into families. We developed a hidden Markov model of linker/non-linker sequence regions using a linker index derived from amino acid propensity. We employed an efficient Bayesian estimation of the model using Markov Chain Monte Carlo, Gibbs sampling in particular, to simulate parameters from the posteriors. Our model recognizes sequence data to be continuous rather than categorical, and generates a probabilistic output.

Results: We applied our method to a dataset of protein sequences in which domains and interdomain linkers had been delineated using the Pfam-A database. The prediction results are superior to a simpler method that also uses linker index.

Contact: c-elsik{at}tamu.edu

Supplementary information: http://racerx00.tamu.edu/kbae


    INTRODUCTION
 
The fundamental unit of protein structure, the domain, is defined as a unit that can independently fold into a stable tertiary structure. Domains often evolve as independent units found in different combinations. Thus, the domain has alternatively been defined as an evolutionary unit. Domain identification within a protein sequence is valuable in numerous applications. It allows structural determination of separate domains, which is often more successful than solving whole proteins. Computational methods for clustering proteins based on sequence similarity perform better when sequences are fragmented into single domain units.

Domain boundary prediction methods that apply the structural definition consider the domain to be a compact, semi-independent unit with a hydrophobic core; these methods use atomic coordinates from experimentally determined three-dimensional (3D) structures (Holm and Sander, 1994; Islam et al., 1995; Siddiqui and Barton, 1995; Wernisch et al., 1999; Taylor, 1999) or predicted structure (George and Heringa, 2002c; Marsden et al., 2002). Other methods apply the evolutionary definition of domain, and use regions of conservation in sequence alignments to identify domain boundaries (Sonnhammer and Kahn, 1994; Gouzy et al., 1997; Gracy and Argos, 1998; George and Heringa, 2002b). The domain families in Pfam-A are created using profile hidden Markov models (HMMs) built on multiple sequence alignments (Bateman et al., 2004). The CHOP method cuts proteins into domain-like fragments using domain boundary information from proteins with known structure and from Pfam-A (homology-based) domains (Liu and Rost, 2004a). CHOPNet is a neural network method that does not rely on known homology or known structure, but uses as input both evolutionary information and predicted structure (Liu and Rost, 2004b). The DGS (domain guess by size) method makes domain boundary estimates based on the statistical distribution of protein and domain lengths in a representative set (Wheelan et al., 2000).

An alternative to delineating domain boundaries is identifying interdomain linkers. The linker is defined as a region between adjacent domains. Studies have shown that linkers can play an essential role in maintaining cooperative interdomain interactions (Gokhale and Khosla, 2000). An understanding of linker properties will aid the engineering of fusion proteins. The composition and length of linkers have been shown to affect protein stability, folding and domain–domain orientation (Robinson and Sauer, 1998). As the alternative to domain prediction, linker identification facilitates splitting multidomain proteins into single domains prior to structural analysis or computational protein clustering.

Studies of linkers in various protein families have shown that linker regions lack regular secondary structure (Argos, 1990), but a recent study has identified helical linkers (George and Heringa, 2002a). Studies agree that some amino acids are more prevalent in the linker regions than in the domain regions (Robinson and Sauer, 1998; George and Heringa, 2002a; Tanaka et al., 2003). Most methods for identifying linkers use predicted secondary structure, amino acid propensity or a combination of the two. Miyazaki et al. (2002) have applied a neural network to predict the linker boundaries based on amino acid propensity, and found that linkers possess characteristics that may distinguish them from intradomain loops. The method of Tanaka et al. (2003) combines predicted secondary structure with amino acid propensity to identify loop regions and distinguish linker and non-linker loops. The Udwary–Merski algorithm (Udwary et al., 2002) combines three properties of linkers: low sequence conservation identified by multiple sequence alignment, low secondary structure conservation and low hydrophobicity. DomCut, which predicts linker regions based on sequence alone, relies solely on amino acid propensity (Suyama and Ohara, 2003). This method simply defines a linker region to be one that has lower linker index values than a specified threshold value. Similar to the approach taken in DomCut, we will use linker index to model linker regions, and will apply our model to a dataset of evolutionarily defined domains. We will employ a HMM to predict not only linker regions, but also their boundaries.

HMMs have been employed in diverse areas of computational biology (Lander and Green, 1987; Churchill, 1989; Cardon and Stormo, 1992; Burge and Karlin, 1997). Krogh et al. (1994) applied HMMs to multiple alignment of protein families and domains. Asai et al. (1993) applied HMMs to protein secondary structure prediction. Later, Schmidler et al. (2000, 2001) applied generalized HMMs with Bayesian estimation to protein secondary structure prediction. The observations in the HMMs for protein structure prediction are recognized as strings of amino acids (categorical variables), forming the primary sequences of a protein.

In this paper, sequences are assumed to have a structure composed of regions that are homogeneous within a region but may differ between regions. We assume that protein sequence data are produced by a HMM and compositional variation is likely to reflect functional or structural differences between regions. Each region is classified into one of a finite number of states (linker and non-linker); we wish to estimate the states, given the observed protein sequence. It is important that we recognize the protein sequence data as continuous data instead of categorical data, which is an alternative approach to most HMMs in computational protein sequence analysis. Instead of recognizing the protein sequence as a string of amino acids (categorical variables), we recognize the protein sequence as a string of linker index values (continuous variables). Our objective is to identify linker index values that discriminate linker and non-linker regions.

Parameter estimation in HMMs usually relies on maximum-likelihood or the Bayesian approach. In the Bayesian approach, we consider the HMM as a mixture model with missing data. We can associate observation yi with missing data zi which represents the state (e.g. linker or non-linker) from which yi is generated (Robert and Mengersen, 1999). An important feature of our method is that we overcome the problem of missing data by employing an efficient Bayesian estimation of the model through a Markov Chain Monte Carlo (MCMC) method, particularly Gibbs sampling (Gelfand and Smith, 1990; Gilks et al., 1996). Other methods of handling missing data have been the use of the EM algorithm (Dempster et al., 1977) or a recurrent forward–backward formula. The EM algorithm was originally tailored for missing data structures, but dependency between states causes problems in mixture estimation. While the simulation of the missing data is straightforward for an independent structure, it is quite difficult to simulate from the distribution of missing data that is conditional on the observed data in HMMs. The use of a recurrent forward–backward formula, which is widespread in the literature for estimating HMM parameters, is time consuming and numerically sensitive. Instead, our method uses Gibbs sampling, which effectively reduces the problem of sampling from a high-dimensional distribution to sampling from a series of low-dimensional distributions.


    DATA
 
Data preparation
We downloaded protein sequence data from the Pfam database release 14 (Bateman et al., 2004) to construct a representative dataset of multidomain protein sequences. Pfam-A is a collection of domain families created using profile HMMs built on multiple alignments of homologous proteins. Release 14 of the Pfam database contains protein sequences from SWISS-PROT release 43.2 and SP-TrEMBL release 26.2 (Boeckmann et al., 2003). The Pfam database provides protein sequence coordinates for Pfam-A domains identified in these proteins. Protein sequences that were annotated as containing transmembrane regions in the Pfam database were removed from the dataset. We define a linker as a sequence segment of 4–20 residues that connects two adjacent regions identified by Pfam as domains. The reasoning behind this length range is that an interdomain segment >20 residues may contain a domain that has not yet been identified, instead of being one long linker region. We also define non-linker regions as sequence segments excluding linker regions. We denote a whole sequence as Full. We used only protein sequences whose entire length can be classified as linker or domain by our criteria, except we allowed up to 20 non-domain residues at the N-terminus and C-terminus. By this procedure, we obtained 11 968 sequences with at least one linker region (14 339 linkers, 28 726 corresponding domains and 824 unique domain regions).

We removed redundancy in this dataset as follows. First, we grouped the 11 968 proteins into homeomorphic families (identical domain organization). We performed an all-by-all sequence comparison of the 11 968 sequences using FASTA (Pearson and Lipman, 1988). We then applied single-linkage clustering using criteria of E-value ≤ 10⁻⁶ and at least 80% alignment coverage. Some of the resulting clusters contained sequences with different domain organizations, due to the transitive nature of single-linkage clustering. Therefore, instead of selecting only one sequence from each cluster, we selected one sequence from each domain organization within each cluster. We also removed seven protein sequences which were significantly longer than the rest (>1000 residues).
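The clustering and selection steps described above can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the tuple format of `hits`, the `organization` mapping and both function names are assumptions introduced here, and the FASTA output would first have to be parsed into that form.

```python
from collections import defaultdict


def single_linkage_clusters(ids, hits, max_evalue=1e-6, min_coverage=0.8):
    """Group sequence ids by single linkage over qualifying pairwise hits.

    hits: iterable of (id_a, id_b, evalue, coverage) tuples. This format is
    hypothetical; it is not the native FASTA output format.
    """
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent links to the cluster representative (with path halving).
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, evalue, coverage in hits:
        if evalue <= max_evalue and coverage >= min_coverage:
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    return sorted(sorted(members) for members in clusters.values())


def pick_representatives(clusters, organization):
    """Keep one sequence per distinct domain organization within each cluster."""
    chosen = []
    for cluster in clusters:
        seen = set()
        for seq_id in cluster:
            org = organization[seq_id]
            if org not in seen:
                seen.add(org)
                chosen.append(seq_id)
    return chosen
```

Because linkage is transitive, two sequences can share a cluster without a direct qualifying hit, which is why the second function selects by domain organization rather than by cluster alone.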

We obtained 802 sequences with at least one linker region. These 802 sequences contained 993 linkers and 1988 corresponding domain regions from 376 unique Pfam-A domain families. The average lengths of linkers and domains were 11.24 and 141.38 residues, respectively. The relative frequencies of individual amino acids were compared between linker regions and other regions by a z-test. Amino acids whose frequencies differ significantly between linker and domain (P-value < 10⁻³) are indicated by (*) in Table 1. The distribution of amino acids in the linker database of George and Heringa (2002a) shows similar patterns even though the definition of linker region is different. We can incorporate the difference in amino acid composition among regions into our model using the linker index.
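The text does not specify which form of z-test was used. One common choice for comparing the frequency of a single amino acid between two regions is the pooled two-proportion z-test, sketched below under that assumption. The function name is hypothetical and the counts in the usage note are made up for illustration.

```python
import math


def two_proportion_z(count_a, total_a, count_b, total_b):
    """z statistic and two-sided P-value for H0: equal proportions (pooled SE)."""
    p_a = count_a / total_a
    p_b = count_b / total_b
    pooled = (count_a + count_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / total_a + 1.0 / total_b))
    z = (p_a - p_b) / se
    # Two-sided tail probability of the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

For example, `two_proportion_z(900, 11000, 13000, 280000)` compares an amino acid seen 900 times among 11 000 linker residues with 13 000 times among 280 000 domain residues (illustrative numbers only) and returns a large positive z with a P-value far below 10⁻³.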


 
Table 1 Amino acid frequency in the different regions of the protein sequences and the linker index of amino acids

 
Linker index
Many studies have reported observations of some amino acids at higher frequency in the linker regions than in the domain regions. Proline (P), lysine (K), glutamic acid (E), serine (S), aspartic acid (D) and glutamine (Q) are preferred amino acids in linker regions. Studies (George and Heringa, 2002a; Suyama and Ohara, 2003; Tanaka et al., 2003) have shown proline to be the most preferred linker amino acid. However, there is disagreement among studies regarding the other preferred linker amino acids. It is no surprise that proline is favored because it has no amide hydrogen to donate in hydrogen bonding, and therefore structurally isolates the linker from domains (George and Heringa, 2002a). The analysis of our dataset also shows that proline is the most preferred amino acid in the linker regions.

The propensity of amino acids for linkers has been determined by other groups in three ways: (1) by comparing linker regions with domain regions (Suyama and Ohara, 2003), (2) by comparing linker regions with all non-linker regions (domains and terminal sequence, Tanaka et al., 2003) and (3) by comparing linker regions to Full sequences (linkers, domains and terminal sequences, George and Heringa, 2002a). We found amino acid frequencies to be similar among domains, non-linkers and Full sequences (see Supplementary information), so we proceeded using amino acid propensity for linkers compared with domains.

To incorporate the difference in amino acid composition between domain and linker regions, we employ the linker index, yl, from Suyama and Ohara (2003), which reflects the preference of amino acids for the linker relative to the domain region:

$$ y_l = -\ln\left(\frac{f_l^{\mathrm{linker}}}{f_l^{\mathrm{domain}}}\right) $$

where $f_l^{\mathrm{linker}}$ ($f_l^{\mathrm{domain}}$) is the relative frequency of the amino acid l in the linker (domain) region in the dataset. Because yl represents the preference for amino acid l in the linker region, we note that the value of yl will be negative if the relative frequency of amino acid l in the linker region is greater than its relative frequency in the domain region.
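As a small worked illustration, the linker index can be computed from two sets of sequence segments as the negative log ratio of relative frequencies, which matches the sign convention described above (negative when an amino acid is enriched in linkers). This is a sketch under that assumption; the function names are hypothetical and no pseudocounts are applied, so an amino acid absent from either set receives no index.

```python
import math
from collections import Counter


def relative_frequencies(segments):
    """Relative frequency of each amino acid over a list of sequence segments."""
    counts = Counter("".join(segments))
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def linker_index(linker_segments, domain_segments):
    """y_l = -ln(f_linker / f_domain) for amino acids seen in both sets."""
    f_linker = relative_frequencies(linker_segments)
    f_domain = relative_frequencies(domain_segments)
    return {
        aa: -math.log(f_linker[aa] / f_domain[aa])
        for aa in f_linker
        if aa in f_domain
    }
```

With the toy input `linker_index(["PPPA"], ["PAAA"])`, proline is three times as frequent in the linker set as in the domain set and so receives the index -ln 3, while alanine receives +ln 3.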

To calculate the smoothed linker index, we took an average of the linker index within each window of size ω and assigned this averaged linker index value y to the center amino acid of the window by sliding from the N-terminus to the C-terminus of a protein sequence. We used a window size, ω = 9, which provided the maximum difference between linker and non-linker regions among the window sizes from 3 to 20.
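The sliding-window average can be sketched as below. The text does not say how positions near the termini, where a full window is unavailable, were treated; this sketch leaves them as `None`, which is an assumption. The function name is hypothetical and an odd window size is assumed so that the window has a center residue.

```python
def smooth_linker_index(sequence, index, window=9):
    """Average the per-residue linker index over a centered sliding window.

    Positions closer than window // 2 to either terminus are set to None,
    because a full window is not available there (an assumption of this sketch).
    """
    half = window // 2
    raw = [index[aa] for aa in sequence]
    smoothed = [None] * len(raw)
    for center in range(half, len(raw) - half):
        chunk = raw[center - half:center + half + 1]
        smoothed[center] = sum(chunk) / window
    return smoothed
```

For instance, with the toy index `{"A": 1.0, "P": -2.0}` and a window of 3, the sequence `"AAAPP"` gives `[None, 1.0, 0.0, -1.0, None]`.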

In the following Model section, we describe the Bayesian model that allows us to compute probabilities of linker state for each residue. We then describe the computation of model parameters using MCMC in the Computation section. Additional background and details are provided in the Supplementary information.


    MODEL
 
We assume two hidden states corresponding to the linker and non-linker regions. Let Y = (y1, y2, ..., yn)' be the smoothed linker index data of a protein sequence generated by the corresponding hidden state S = (s1, s2, ..., sn)'. The state transition probability matrix P of the two-state HMM is {plk} = {p(si = k|si–1 = l)}, l, k ∈ {0, 1}, given by

$$ P = \begin{pmatrix} p_{00} & 1 - p_{00} \\ 1 - p_{11} & p_{11} \end{pmatrix} $$

We assume the observed data yi are independent and normally distributed. Both the mean and the variance of the observed data are parameterized in terms of the unobserved (hidden) state variable si, which follows a Markov process. If si = 0 then yi is from a linker region and if si = 1 then yi is from a non-linker region:

$$ y_i \mid s_i \sim N\bigl(\mu_0 + \mu_1 s_i,\ \sigma^2 (1 + \omega s_i)\bigr) $$

By definition of the linker index, it is reasonable to impose the restriction that the mean linker index of the linker region (µ0) is smaller than the mean linker index of the non-linker region (µ0 + µ1), because linker indexes are negative for amino acids that are more prevalent in linker regions:

$$ y_i = \mu_0 + \mu_1 s_i + (1 + \omega s_i)^{1/2} \varepsilon_i, \qquad \mu_1 > 0 $$

where the error terms εi are normally distributed with a mean of zero and variance σ² [i.e. εi ~ N(0, σ²)], and ω denotes the proportionate variance increase when si = 1.
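To make the generative assumptions concrete, the sketch below simulates a state path and smoothed linker index values from a two-state model with a mean shift µ1 and a proportionate variance increase ω in the non-linker state, in the spirit of Albert and Chib (1993). The function name and all parameter values in the usage note are illustrative assumptions, not estimates from the paper.

```python
import math
import random


def simulate(n, mu0, mu1, sigma2, omega, p00, p11, seed=0):
    """Simulate states (0 = linker, 1 = non-linker) and observations.

    Emission: y_i ~ N(mu0 + mu1 * s_i, sigma2 * (1 + omega * s_i)).
    The chain starts in the non-linker state, as assumed in the text.
    """
    rng = random.Random(seed)
    states, values = [], []
    s = 1
    for _ in range(n):
        states.append(s)
        sd = math.sqrt(sigma2 * (1.0 + omega * s))
        values.append(rng.gauss(mu0 + mu1 * s, sd))
        stay = p11 if s == 1 else p00  # probability of remaining in state s
        if rng.random() >= stay:
            s = 1 - s
    return states, values
```

A call such as `simulate(300, -0.10, 0.15, 0.002, 0.5, 0.90, 0.99)` yields long non-linker stretches interrupted by short linker stretches with a lower mean, which is the qualitative pattern the model is meant to capture.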

Our objective is to infer the hidden state S, the parameters of the model θ = (µ0, µ1, σ², ω) and the parameters of the transition probabilities η = (p00, p11) given the data Y. We use a Bayesian approach to infer the values of the parameters (S, θ, η) from the conditional joint posterior distribution P(S, θ, η|Y).

The likelihood of the data Y given the hidden state S, the parameters of the model θ and the parameters of the transition probabilities η is

$$ P(Y \mid S, \theta, \eta) = (2\pi\sigma^2)^{-n/2}\, |\Sigma|^{-1/2} \exp\left\{ -\frac{1}{2\sigma^2} (Y - \mu_0 \mathbf{1} - \mu_1 S)'\, \Sigma^{-1} (Y - \mu_0 \mathbf{1} - \mu_1 S) \right\} $$

where µ = (µ0, µ1)' is the vector of means, 1 = (1, ..., 1)' is an n × 1 vector and Σ = diag((1 + ωs1), (1 + ωs2), ..., (1 + ωsn)). We can assume P(s1 = 1) = 1 because a protein sequence must begin in the non-linker state.

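The likelihood of the hidden state S, conditioned on the initial state being non-linker, is given by

$$ P(S \mid \eta) = \prod_{i=2}^{n} p(s_i \mid s_{i-1}) = p_{00}^{n_{00}} (1 - p_{00})^{n_{01}}\, p_{11}^{n_{11}} (1 - p_{11})^{n_{10}} $$

where nij is the number of transitions from state i to state j. As a function of p00 or p11, this expression has the form of a beta density; a random variable X is said to follow a beta distribution, beta(a, b), if its density is proportional to X^(a–1) (1 – X)^(b–1).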
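The transition counts nij, and the log-likelihood of a state path that they determine for a two-state Markov chain with a fixed initial state, can be computed as in the following sketch. The function names are hypothetical, and p00 and p11 are assumed to lie strictly between 0 and 1 so that the logarithms are defined.

```python
import math


def transition_counts(states):
    """Count transitions n[(i, j)] from state i to state j along the path."""
    counts = {(0, 0): 0, (0, 1): 0, (1, 0): 0, (1, 1): 0}
    for prev, cur in zip(states, states[1:]):
        counts[(prev, cur)] += 1
    return counts


def state_path_log_likelihood(states, p00, p11):
    """log P(S | p00, p11), conditioning on the first state."""
    n = transition_counts(states)
    return (
        n[(0, 0)] * math.log(p00)
        + n[(0, 1)] * math.log(1.0 - p00)
        + n[(1, 1)] * math.log(p11)
        + n[(1, 0)] * math.log(1.0 - p11)
    )
```

The counts are sufficient statistics for (p00, p11), which is what makes a beta prior conjugate for each transition probability.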

We specify the prior distribution P(θ, η) in The prior distributions subsection below, to complete the conditional joint distribution P(S, θ, η|Y).

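Finally, we calculate the probability of state k for each residue i in a protein sequence given yi, si–1 = l, θ and η. For simplicity, here we show the conditional distribution, suppressing the conditioning on θ and η.

$$ p(s_i = k \mid y_i, s_{i-1} = l) = \frac{p(y_i \mid s_i = k)\, p_{lk}}{\sum_{j \in \{0, 1\}} p(y_i \mid s_i = j)\, p_{lj}} \qquad (1) $$

Once the simulated sample values have been obtained from Equation (1), the posterior expectation can be estimated by the sample average, using Equation (2).

$$ \hat{p}(s_i = k \mid Y) = \frac{1}{m} \sum_{t=1}^{m} p\bigl(s_i = k \mid y_i, s_{i-1}^{(t)}, \theta^{(t)}, \eta^{(t)}\bigr) \qquad (2) $$

where t denotes the iteration in the MCMC sampler, k ∈ {0, 1} and m is the number of MCMC samples taken from the posterior distribution after burn-in (early MCMC iterations that reflect the starting value, prior to convergence). We predict the state of an amino acid using the classification variable CVi:

$$ CV_i = \begin{cases} 0 \ (\text{linker}) & \text{if } \hat{p}(s_i = 0 \mid Y) > x \\ 1 \ (\text{non-linker}) & \text{otherwise} \end{cases} $$

where x is the selected cutoff.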
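The three steps just described (a per-residue state probability given the previous state, an average over retained MCMC draws, and a cutoff-based classification) can be sketched as follows. The sketch assumes normal emissions with mean µ0 + µ1·s and variance σ²(1 + ω·s), and assumes that a residue is called linker when its averaged linker probability exceeds the cutoff; the function names and the tuple layout of `theta` and `eta` are hypothetical.

```python
import math


def normal_pdf(y, mean, var):
    """Density of N(mean, var) evaluated at y."""
    return math.exp(-(y - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def state_probability(y, prev_state, k, theta, eta):
    """p(s_i = k | y_i, s_{i-1} = prev_state), by Bayes' rule."""
    mu0, mu1, sigma2, omega = theta
    p00, p11 = eta
    trans = {0: {0: p00, 1: 1.0 - p00}, 1: {0: 1.0 - p11, 1: p11}}
    weights = {
        s: trans[prev_state][s]
        * normal_pdf(y, mu0 + mu1 * s, sigma2 * (1.0 + omega * s))
        for s in (0, 1)
    }
    return weights[k] / (weights[0] + weights[1])


def average_over_draws(per_draw_probs):
    """Average per-residue probabilities over the retained MCMC draws."""
    m = len(per_draw_probs)
    n = len(per_draw_probs[0])
    return [sum(draw[i] for draw in per_draw_probs) / m for i in range(n)]


def classify(linker_probs, cutoff=0.75):
    """Label a residue 0 (linker) if its linker probability exceeds the cutoff, else 1."""
    return [0 if p > cutoff else 1 for p in linker_probs]
```

Raising the cutoff trades sensitivity for specificity, since fewer residues are labeled as linker.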

The prior distributions
We assign mutually independent prior distributions for µ and σ². The prior of µ is the conjugate normal distribution and the prior of σ² is the inverse gamma distribution. Here a random variable X is said to follow an inverse gamma distribution, IG(a/2, 2/b), if its density is proportional to (1/X)^((a/2)+1) exp(–b/(2X)).

Given the hidden state si, ω depends only on the observations for which si = 1. Following Albert and Chib (1993), we work with the quantity (ω + 1), so that ω represents the proportionate increase in variance when si = 1. The prior distribution of (ω + 1) is a truncated inverse gamma distribution.

For (p00, p11), we assign conjugate beta priors.


    COMPUTATION
 
Our challenge in applying the model is to determine the posterior distribution of each parameter. The posterior distribution is not available in explicit form, so we use the MCMC method, Gibbs sampling, to simulate the unknown parameters from the posterior distribution. Details of the computation are provided on the Supplementary information website.

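It is convenient to transform the data using qi = (1 + ωsi)^(1/2) so that the transformed data have constant variance instead of variances that depend on state:

$$ Y^* \sim N(W^* \mu,\ \sigma^2 I) $$

where I = diag{1}. Define $Y^* = (y_1/q_1, \ldots, y_n/q_n)'$ and $W^*$ as the n × 2 matrix whose i-th row is $(1/q_i,\ s_i/q_i)$.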

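The full conditional distributions of µ and σ² are as follows:

$$ \mu \mid Y, S, \sigma^2, \omega \sim N\bigl(A^{-1}(V^{-1}\mu_a + \sigma^{-2} W^{*\prime} Y^*),\ A^{-1}\bigr)\, I(\mu_1 > 0) $$

$$ \sigma^2 \mid Y, S, \mu, \omega \sim IG\left(\frac{a + n}{2},\ \frac{2}{b + (Y^* - W^*\mu)'(Y^* - W^*\mu)}\right) $$

where Y* and W* are the data vector and the n × 2 design matrix (1, S) with each row divided by qi = (1 + ωsi)^(1/2), A = (V^(–1) + σ^(–2) W*'W*), µa = (µ0a, µ1a)' is the prior mean of µ, V = diag(ξ0a, ξ1a) is its prior covariance, and a and b are the hyper-parameters of the inverse gamma prior on σ² in the IG(a/2, 2/b) notation used for the priors.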

The full conditional distribution of (ω + 1) is a truncated inverse gamma distribution. It depends only on the n1 observations whose state is 1, i.e. those indexed by J = {i | si = 1}, i = 1, ..., n.

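The full conditional distributions of η = (p00, p11) are as follows:

$$ p_{00} \mid S \sim \mathrm{beta}(u_{00} + n_{00},\ u_{01} + n_{01}), \qquad p_{11} \mid S \sim \mathrm{beta}(u_{11} + n_{11},\ u_{10} + n_{10}) $$

where uij are the hyper-parameters of the beta priors and nij is the number of transitions from state i to state j.

The full conditional distribution of {si, i = 1, ..., n} depends on the states at positions (i – 1) and (i + 1) along the sequence, since si has the Markov property:

$$ p(s_i \mid S_{-i}, Y, \theta, \eta) \propto p(s_i \mid s_{i-1})\, p(s_{i+1} \mid s_i)\, p(y_i \mid s_i, \theta) $$

where $S_{-i}$ = (s1, ..., si–1, si+1, ..., sn)', p(s1 = 1) = 1 and p(sn = 1) = 1.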
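Two of the Gibbs updates can be illustrated with a short sketch: drawing (p00, p11) from beta full conditionals given the transition counts, and a single-site sweep over the hidden states in which each si is drawn with probability proportional to p(si | si–1) p(si+1 | si) p(yi | si), with the first and last states held at 1. This is a simplified illustration, not the authors' implementation: the updates for µ, σ² and ω are omitted, a symmetric beta(u, u) prior is assumed, and the function names are hypothetical.

```python
import math
import random


def sample_transition_probs(states, rng, u=1.0):
    """Draw p00 and p11 from beta full conditionals (beta(u, u) prior on each)."""
    n = {(a, b): 0 for a in (0, 1) for b in (0, 1)}
    for prev, cur in zip(states, states[1:]):
        n[(prev, cur)] += 1
    p00 = rng.betavariate(u + n[(0, 0)], u + n[(0, 1)])
    p11 = rng.betavariate(u + n[(1, 1)], u + n[(1, 0)])
    return p00, p11


def sample_states(y, states, theta, eta, rng):
    """One sweep of single-site state updates; end states stay fixed at 1."""
    mu0, mu1, sigma2, omega = theta
    p00, p11 = eta
    trans = ((p00, 1.0 - p00), (1.0 - p11, p11))  # trans[from][to]
    new = list(states)
    for i in range(1, len(y) - 1):
        w = []
        for s in (0, 1):
            var = sigma2 * (1.0 + omega * s)
            # Normal density up to a constant shared by both states.
            dens = math.exp(-(y[i] - mu0 - mu1 * s) ** 2 / (2.0 * var)) / math.sqrt(var)
            w.append(trans[new[i - 1]][s] * trans[s][new[i + 1]] * dens)
        new[i] = 0 if rng.random() * (w[0] + w[1]) < w[0] else 1
    return new
```

Each update samples from a one-dimensional distribution, which is how Gibbs sampling replaces a draw from the joint distribution of all states and parameters.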


    RESULTS
 
We applied our model to the protein sequence dataset constructed from Pfam-A using linker index yl as described in the Data section. To evaluate the accuracy of the prediction, a 5-fold cross-validation was applied to the dataset, i.e. we divided the dataset into the training dataset and the test dataset randomly in the ratio of 4:1. We trained the model with the training dataset of 642 sequences and tested the trained model with the test dataset of 160 sequences. This procedure was repeated five times.
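A random 4:1 partition repeated five times can be sketched as below. The text reports 642 training and 160 test sequences; how the two leftover sequences were assigned when dividing 802 by five is not stated, so the fold sizes produced by this sketch (160 or 161 test sequences) are an assumption, as is the function name.

```python
import random


def five_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation after shuffling."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, test
```

Every sequence appears in exactly one test set, so the five evaluations together cover the whole dataset without overlap between training and test data within a fold.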

We ran Gibbs sampling with 40 000 iterations and 10 000 burn-in to train the model. The hyper-parameters (parameters of the prior distributions), chosen on the basis of the data and the problem at hand, were as follows: (1) We let the hyper-parameters for µ be the sample means of the training dataset for each state and gave each a sufficiently large variance of 10; (2) We assumed E(ω + 1) = 1.5, var(ω + 1) = 10 and E(σ²) = 0.1, var(σ²) = 10 and fixed the hyper-parameters accordingly; (3) We let the hyper-parameters for p00 and p11 be uij = 1, i, j ∈ {0, 1}, to have uniform priors. We then calculated the probability of linker state, p(si = 0|yi), for each residue i along a protein sequence.

Figure 1 shows cases with good predictions, in which probabilities in the linker region are much higher than in other regions. However, we need to select a cutoff value (here, 0.75) to delineate the boundary. Although our method gives high probabilities to the linker region, it also gives high probabilities to other regions that may have similar structure. Figure 2A shows one of these cases. There are two regions with high probability, but there is only one linker region in the protein. The probability of the actual linker region is slightly higher than that of the false positive linker region. Figure 2B shows that sequence termini can have high probabilities.



 
Fig. 1 Examples of good predictions: (A) SP-TrEMBL accession Q7P2M5 and (B) SP-TrEMBL accession Q7UD15 (* = 1: non-linker region and * = 0: linker region).

 


 
Fig. 2 Examples of overpredictions: (A) SP-TrEMBL accession Q89F20 and (B) SP-TrEMBL accession Q7P6J3 (* = 1: non-linker region, * = 0: linker region).

 


 
Fig. 3 Sensitivity (Sn), Specificity (Sp) and Matthews correlation coefficient (C) for residue-based evaluation. Sn = TP/(TP + FN), Sp = TP/(TP + FP) and C = [(TP)(TN) – (FN)(FP)]/√[(TP + FN)(TP + FP)(TN + FN)(TN + FP)], where TP is the number of residues correctly labeled as linker; FP is the number of residues labeled as linker while they are non-linker; FN is the number of residues labeled as non-linker while they are linker; TN is the number of residues correctly labeled as non-linker.

 
To evaluate our method, sensitivity (Sn), specificity (Sp) and correlation coefficient (C) were calculated for each of the five test datasets. We applied the definitions of sensitivity and specificity used by Miyazaki et al. (2002). Sensitivity is the percentage of actual linker residues that were predicted to be linker and specificity is the percentage of predicted linker residues that were truly linker. The correlation coefficient (Matthews, 1975) indicates how much better a given prediction is than a random one: C = 1 indicates perfect prediction, and C = 0 is expected for a prediction no better than random.
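The residue-based measures can be computed directly from true and predicted state labels, as in this sketch. It treats linker (state 0) as the positive class and uses Sn = TP/(TP + FN), Sp = TP/(TP + FP) and the Matthews coefficient as defined in the text; the function name and the choice to return NaN for undefined ratios are assumptions of the sketch.

```python
import math


def residue_metrics(true_states, predicted_states):
    """Return (Sn, Sp, C) with linker (0) as the positive class."""
    pairs = list(zip(true_states, predicted_states))
    tp = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 1 and p == 0)
    fn = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 1 and p == 1)
    sn = tp / (tp + fn) if tp + fn else float("nan")
    sp = tp / (tp + fp) if tp + fp else float("nan")
    denom = math.sqrt((tp + fn) * (tp + fp) * (tn + fn) * (tn + fp))
    c = (tp * tn - fn * fp) / denom if denom else float("nan")
    return sn, sp, c
```

Note that Sp as defined here is the fraction of predicted linker residues that are truly linker, which other fields call precision or positive predictive value.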

Figure 3 shows the effect of CV cutoff on Sn, Sp and C, each averaged over the five tests. Separate curves for each test are provided on the Supplementary website. Using a CV cutoff of 0.75, Sn and Sp were each 67%, indicating that we can identify 67% of the linker residues, and that 67% of the residues predicted to be linker are truly linker, respectively. The average Matthews correlation coefficient was 65% at the 0.75 CV cutoff.

The test of Sn and Sp described above did not exclude false positives that occurred at N-terminal and C-terminal residues. To test the effect of N-terminal and C-terminal residues on the false positive rate, we recalculated Sp, ignoring false positives within the first 20 and last 20 residues of the sequence. The recalculated Sp was 68%, indicating that sequence termini do not contribute significantly to the false positive rate.

There are several other methods for predicting protein linker regions, but it is difficult to compare, because linker definition and type of data required as input and evaluation criteria vary across methods. We compare our method with DomCut (Suyama and Ohara, 2003), because the software is freely available and the authors use a similar linker definition and the same property (linker index) as input; other methods were developed using structural domain/linker definitions and require additional input such as predicted structure or multiple alignments of homologs. DomCut predicts putative linker regions instead of giving specific linker region boundaries, so the DomCut authors evaluated their method in terms of predicted linker regions and not residues (Suyama and Ohara, 2003). This is different from the evaluation of our method described above, which considered predicted linker residues, rather than predicted regions. Therefore, we performed an additional evaluation of our method in order to compare it with DomCut; this time we excluded the unknown regions in the sequences (the N-termini and C-termini), and we evaluated linker region predictions, as done in the evaluation of DomCut (Suyama and Ohara, 2003).

A linker region was predicted by our method if the following were satisfied: the length was >4 residues, each residue had a high probability (>0.5) and the maximum probability was >0.8. In DomCut, a linker region is taken to be correctly predicted if there is a trough in the linker region and the minimum linker index value is lower than the cutoff value. We tested DomCut using our dataset. The smoothed linker index for the DomCut test was calculated with window size 9. We used a linker index cutoff of –0.08, which we had determined to be optimal for the window size. Sensitivity and specificity of our method were 63.3 and 92.9%, respectively, which appeared to improve on the DomCut method (Table 2). Our method correctly predicted 131 linker regions out of 141 total predictions, while DomCut correctly predicted 117 linker regions out of 113 total predictions. There were 207 linker regions in the dataset.
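The region-calling rule stated above (a run longer than 4 residues, every residue with probability >0.5 and a maximum probability >0.8) can be sketched as a scan over the per-residue linker probabilities. The function name and the half-open (start, end) index convention are assumptions made for illustration.

```python
def call_linker_regions(probs, min_len=5, residue_cutoff=0.5, peak_cutoff=0.8):
    """Return (start, end) pairs, end exclusive, of predicted linker regions.

    A region is a maximal run of residues with probability > residue_cutoff
    that has at least min_len residues and a peak above peak_cutoff.
    """
    probs = list(probs)
    regions = []
    start = None
    # A trailing 0.0 sentinel closes a run that reaches the end of the sequence.
    for i, p in enumerate(probs + [0.0]):
        if p > residue_cutoff and start is None:
            start = i
        elif p <= residue_cutoff and start is not None:
            run = probs[start:i]
            if len(run) >= min_len and max(run) > peak_cutoff:
                regions.append((start, i))
            start = None
    return regions
```

For example, a five-residue run peaking at 0.9 is reported, whereas a five-residue run that never exceeds 0.6, or a three-residue run at 0.9, is not.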


 
Table 2 Comparison with DomCut

 
The low false positive rate of our method in testing linker region prediction compared with the false positive rate in testing linker residue prediction can be attributed to two factors. First, the removal of sequence termini in the region-based test eliminated some false positives. Second, some false positives in the residue-based test were caused by linker boundary extension into domain regions, exemplified by the broad peak in Figure 1B. Overpredicting linker length would not be detected in the region-based test.


    DISCUSSION
 
We have developed a HMM for evolutionarily defined protein interdomain linker/non-linker regions in a protein sequence using the composition differences of amino acids. Results suggest that our method slightly improves on an existing method that uses similar biological evidence.

Our choice of dataset has important implications in surmising our method's ability to predict structurally-defined protein linkers. We used domain definitions provided by Pfam-A, based on evolutionary evidence, rather than structurally defined domain definitions. Our linker/domain dataset is similar to the dataset of Suyama and Ohara (2003) who identified domains using the term DOMAIN in the feature table of the SWISS-PROT database. SWISS-PROT domain annotation is based on the InterPro domain databases, which include Pfam (Apweiler, 2001; Apweiler et al., 2001). We chose to use Pfam-A based domain definitions for SWISS-PROT and SP-TrEMBL proteins, because the resulting non-redundant dataset of multiple domain proteins (802 proteins) was much larger than could be acquired from a structure database (e.g. 101 proteins in Tanaka et al., 2003).

A notorious problem in structural linker prediction has been distinguishing linkers from intradomain loops. Our method appears to perform well in this regard; however, the relatively low number of false positives may be due to bias in the dataset. Since Pfam-A identifies domains as evolutionarily conserved units, non-conserved intradomain loops can cause structural domains to be annotated as multiple Pfam-A domains. Thus, some of our Pfam-A defined linkers may actually be loops in structural domains. Conversely, two structural domains that are always found together may be defined by Pfam-A as a single evolutionary domain; some of our false positives may actually be structural linkers. We must test our approach using a structurally defined dataset to fully understand its ability to distinguish structural linkers from intradomain loops.

We have demonstrated the value of our method in defining linkers for evolutionary domains. Liu and Rost (2003) review methods of computational domain dissection and protein sequence clustering, and suggest that better tools are needed to dissect proteins into domains to cluster them into families. Our method can be used to delineate domains prior to clustering.

We have also demonstrated a HMM approach that considers protein sequence data as continuous variables (linker index) instead of categorical variables (amino acid), and generates probabilistic output. Existing methods that rely on amino acid propensity (Suyama and Ohara, 2003; Miyazaki et al., 2002) do not give probabilistic output. The approach presented here can be extended to other protein sequence/structure problems.

Not only is the composition of the linker region important but also its length. In general, altering the length of linker regions connecting domains has been shown to affect protein stability, folding rates and domain–domain orientation (van Leeuwen et al., 1997; Robinson and Sauer, 1998). Numerous studies, including that of George and Heringa (2002a) show that the distributions of length of linker and non-linker regions are significantly different. In the future, we can incorporate the informative characteristic of length of linker/non-linker regions by applying a variable duration HMM, which incorporates the specified state duration (i.e. length) distribution.


    Acknowledgments
 
We wish to thank the two anonymous referees who provided valuable critiques of this work.

Received on November 22, 2004; revised on February 9, 2005; accepted on February 26, 2005

    REFERENCES

    Albert, J.H. and Chib, S. (1993) Bayesian inference via Gibbs sampling of autoregression time series subject to Markov mean and variance shifts. J. Bus. Econ. Stat., 11, 1–15.

    Apweiler, R. (2001) Functional information in SWISS-PROT: the basis for large-scale characterisation of protein sequences. Brief. Bioinformatics, 2, 9–18.

    Apweiler, R., et al. (2001) InterPro—an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Res., 29, 37–40.

    Argos, P. (1990) An investigation of oligopeptides linking domains in protein tertiary structures and possible candidates for general gene fusion. J. Mol. Biol., 211, 943–958.

    Asai, K., et al. (1993) HMM with protein structure grammar. Proceedings of the 22nd Hawaii International Conference on System Sciences, Los Alamitos, CA, IEEE Computer Society Press, pp. 783–791.

    Bateman, A., et al. (2004) The Pfam protein families database. Nucleic Acids Res., 32, D138–D141.

    Boeckmann, B., et al. (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res., 31, 365–370.

    Burge, C. and Karlin, S. (1997) Prediction of complete gene structures in human. J. Mol. Biol., 268, 78–94.

    Cardon, L.R. and Stormo, G.D. (1992) Expectation maximization algorithm for identifying protein-binding sites with variable lengths from unaligned DNA fragments. J. Mol. Biol., 223, 159–170.

    Churchill, G.A. (1989) Stochastic models for heterogeneous DNA sequences. B. Math. Biol., 51, 79–94.

    Dempster, A.P., et al. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. Roy. Stat. Soc. B, 39, 1–38.

    Gelfand, A. and Smith, A.F.M. (1990) Sampling-based approaches to calculating marginal densities. J. Am. Stat. Assoc., 85, 398–409.

    George, R.A. and Heringa, J. (2002a) An analysis of protein domain linkers: their classification and role in protein folding. Protein Eng., 15, 871–879.

    George, R.A. and Heringa, J. (2002b) Protein domain identification and improved sequence similarity searching using PSI-BLAST. Proteins, 48, 672–681.

    George, R.A. and Heringa, J. (2002c) SnapDRAGON: a method to delineate protein structural domains from sequence data. J. Mol. Biol., 316, 839–851.

    Gilks, W., Richardson, S. and Spiegelhalter, D. (1996) Markov Chain Monte Carlo in Practice. Chapman and Hall, London.

    Gokhale, R.S. and Khosla, C. (2000) Role of linkers in communication between protein modules. Curr. Opin. Chem. Biol., 4, 22–27.

    Gouzy, J., et al. (1997) XDOM, a graphical tool to analyse domain rearrangements in any set of protein sequences. Comput. Appl. Biosci., 13, 601–608.

    Gracy, J. and Argos, P. (1998) Automated protein sequence database classification. II. Delineation of domain boundaries from sequence similarities. Bioinformatics, 14, 174–187.

    Holm, L. and Sander, C. (1994) Parser for protein folding units. Proteins, 19, 256–268.

    Islam, S.A., et al. (1995) Identification and analysis of domains in proteins. Protein Eng., 8, 513–525.

    Krogh, A., et al. (1994) Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol., 235, 1501–1531.

    Lander, E. and Green, P. (1987) Construction of multilocus genetic linkage maps in human. Proc. Natl Acad. Sci. USA, 84, 2363–2367.

    Liu, J. and Rost, B. (2003) Domains, motifs and clusters in the protein universe. Curr. Opin. Chem. Biol., 7, 5–11.

    Liu, J. and Rost, B. (2004a) CHOP: parsing proteins into structural domains. Nucleic Acids Res., 32, 569–571.

    Liu, J. and Rost, B. (2004b) Sequence-based prediction of protein domains. Nucleic Acids Res., 32, 3522–3530.

    Marsden, R.L., et al. (2002) Rapid protein domain assignment from amino acid sequence using predicted secondary structure. Protein Sci., 11, 2814–2824.

    Matthews, B.W. (1975) Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta, 405, 442–451.

    Miyazaki, S., et al. (2002) Characterization and prediction of linker sequences of multidomain proteins by a neural network. J. Struct. Funct. Genomics, 2, 37–51.

    Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444–2448.

    Robert, C.P. and Mengersen, K.L. (1999) Reparameterisation issues in mixture modeling and their bearing on MCMC algorithms. Comput. Stat. Data An., 29, 325–343.

    Robinson, C.R. and Sauer, R.T. (1998) Optimizing the stability of single-chain proteins by linker length and composition mutagenesis. Proc. Natl Acad. Sci. USA, 95, 5929–5934.

    Schmidler, S.C., et al. (2000) Bayesian segmentation of protein secondary structure. J. Comput. Biol., 7, 233–248.

    Schmidler, S.C., et al. (2001) Bayesian protein structure prediction. Case Studies in Bayesian Statistics, 5, 363–378.

    Siddiqui, A.S. and Barton, G.J. (1995) Continuous and discontinuous domains: an algorithm for the automatic generation of reliable protein domain definitions. Protein Sci., 4, 872–884.

    Sonnhammer, E.L.L. and Kahn, D. (1994) Modular arrangement of proteins as inferred from analysis of homology. Protein Sci., 3, 482–492.

    Suyama, M. and Ohara, O. (2003) DomCut: prediction of inter-domain linker regions in amino acid sequences. Bioinformatics, 19, 673–674.

    Taylor, W.R. (1999) Protein structural domain identification. Protein Eng., 12, 203–216.

    Tanaka, T., et al. (2003) Characteristics and prediction of domain linker sequences in multidomain proteins. J. Struct. Funct. Genomics, 4, 79–85.

    Udwary, D.W., et al. (2002) A method for prediction of linker regions within large multifunctional proteins, and its application to a type I polyketide synthase. J. Mol. Biol., 323, 585–598.

    van Leeuwen, H.C., et al. (1997) Linker length and composition influence the flexibility of Oct-1 DNA binding. EMBO J., 16, 2043–2053.

    Wernisch, L., et al. (1999) Identification of structural domains in proteins by a graph heuristic. Proteins, 35, 338–352.

    Wheelan, S.J., et al. (2000) Domain size distributions can predict domain boundaries. Bioinformatics, 16, 697–701.





