Skip Navigation


Bioinformatics Advance Access originally published online on March 12, 2008
Bioinformatics 2008 24(9):1145-1153; doi:10.1093/bioinformatics/btn097
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/9/1145    most recent
btn097v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Poleksic, A.
Right arrow Articles by Fienup, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Poleksic, A.
Right arrow Articles by Fienup, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Optimizing the size of the sequence profiles to increase the accuracy of protein sequence alignments generated by profile–profile algorithms

Aleksandar Poleksic * and Mark Fienup

Department of Computer Science, University of Northern Iowa, Cedar Falls, IA, 50614, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Profile-based protein homology detection algorithms are valuable tools in genome annotation and protein classification. By utilizing information present in the sequences of homologous proteins, profile-based methods are often able to detect extremely weak relationships between protein sequences, as evidenced by the large-scale benchmarking experiments such as CASP and LiveBench.

Results: We study the relationship between the sensitivity of a profile–profile method and the size of the sequence profile, which is defined as the average number of different residue types observed at the profile's positions. We also demonstrate that improvements in the sensitivity of a profile–profile method can be made by incorporating a profile-dependent scoring scheme, such as position-specific background frequencies. The techniques presented in this article are implemented in an alignment algorithm UNI-FOLD. When tested against other well-established methods for fold recognition, UNI-FOLD shows increased sensitivity and specificity in detecting remote relationships between protein sequences.

Availability: UNI-FOLD web server can be accessed at http://blackhawk.cs.uni.edu

Contact: poleksic{at}cs.uni.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Despite the obvious progress in the structural genomics efforts, protein and DNA remote homology detection and protein three-dimensional structure prediction are still some of the most challenging problems in computational molecular biology. One of the limiting factors for accurate modelling of a protein's structure from its primary sequence is the quality of the alignment of a query protein to the primary template (John and Sali, 2003). A number of approaches to overcome this limitation have been proposed including the profile-based methods for sequence alignment.

PSI-BLAST (Altschul et al., 1997) builds the so-called ‘position-specific scoring matrix’ (PSSM) by iteratively collecting database sequences related to the query. After the first iteration, the algorithm uses PSSM in place of the query sequence to search for new homologues. The procedure is terminated when no new database sequences are found.

Hidden Markov model (HMM)-based algorithms, such as HMMER (Eddy, 1998) and SAMT02 (Karplus et al., 2001), use an analytical model in place of the PSSM. This explicit probabilistic model allows HMM-based methods to accurately compute the likelihood that an arbitrary database sequence is generated from the model.

Profile–profile methods utilize information about aligned residues in the multiple alignments corresponding to both query and the template sequences. The sequence profiles are typically built with iterative profile-sequence algorithms, such as PSI-BLAST. The score assigned for aligning a pair of profile positions reflects the similarity between the distributions of amino-acid letters in multiple alignments at those positions.

Profile–profile methods are often able to recognize similarity between sequences sharing <20% of identical residues, as evidenced by the large-scale benchmarking experiments such as CASP (Tress et al., 2005; Vincent et al., 2005; Wang et al., 2005) CAFASP (Fischer et al., 2003), and LiveBench (Rychlewski and Fischer, 2005). But, despite their success in detecting distant sequence relationships, many aspects of current profile–profile techniques need further improvements. These include:

  1. The procedures for constructing the sequence profiles.
  2. The design of the alignment scoring function.
  3. The statistical methods for assessing the significance of the raw alignment scores.

In this article we present a new approach to constructing, aligning and scoring alignment profiles. We demonstrate a correlation between the size of the profiles (also called ‘profile thickness’) and the sensitivity of a profile–profile method. More specifically, we show that fixing and optimizing the size of the profiles, by fine tuning the average number of different residue types at various profile positions, can improve the inference of remote relationships between protein sequences. In addition, we prove that utilizing a profile dependent scoring scheme, such as profile derived background frequencies, further improves the detection of similarities between distantly related sequences.

The results of our study are incorporated into a profile–profile alignment method called UNI-FOLD. The performance of UNI-FOLD is tested in the Lindahl and SCOP benchmarks for fold recognition and in the SALIGN benchmark for the alignment accuracy.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Scoring function
2.1.1 Preliminaries
The difference between sequence–sequence and profile–profile methods is in the scoring scheme. In a profile–profile method, the score for aligning profile positions i and j reflects the similarity between the amino-acid probability distributions qi and qj at the positions i and j. The simplest such measure is the dot product, which represents the summation of the products of the amino-acid frequencies:


Formula 1

(1)
Dot product is implemented in a number of profile–profile methods, including FFAS03 (Jaroszewski et al., 2005) and ORFeus (Ginalski et al., 2003).1

More complicated scoring functions rely on the background model when scoring pairs of profile positions (Debe et al. 2006; Sadreyev and Grishin, 2003; Yona and Levitt, 2002). The scoring function in PROF_SIM (Yona and Levitt, 2002) uses the information theory measure called Jensen–Shannon (JS) divergence. The JS divergence between two probability distributions qi and qj is given by:


Formula 2

(2)
where DKL represents Kullback–Leibler divergence (relative entropy):


Formula 3

(3)
and r is the ‘most likely’ common source distribution for qi and q j:


Formula 4

(4)
The score for aligning two profile positions in PROF_SIM is defined as:


Formula 5

(5)
where Formula , Formula , and b is the ‘overall’ (background) distribution of 20 amino-acid letters.

It is easy to see that the score assigned for matching a pair of amino-acid distributions qi and q j in PROF_SIM depends not only on the similarity between qi and q j, but also on the similarity between their most likely source distribution r and the background distribution b. In other words, even the score between two similar distributions qi and q j (J close to 0) will not be very high unless both of them are far apart from the background distribution (S close to 1).

The COMPASS method (Sadreyev and Grishin, 2003) uses the effective amino-acid counts when scoring a pair of profile positions. COMPASS scoring function is given by:


Formula 6

(6)
where Formula and Formula are the ‘effective counts’ for the amino acid k at the positions i and j, respectively, bk is the background probability of k, and ci and cj are the weighting parameters. The weights ci and cj are used to balance the contribution of the two terms in formula (6). The values ci and cj are given by:


Formula 7

(7)


Formula 8

(8)

One advantage of function (6) is that it separates similar profile columns (positive scores) from non-similar columns (negative scores). It can also be shown that (6) reduces to PSI-BLAST scores if one profile consists of a single sequence.

2.1.2 UNI-FOLD scoring scheme
UNI-FOLD scoring scheme is a twofold generalization of (6). UNI-FOLD uses position specific (as opposed to fixed) background frequencies. In addition, UNI-FOLD scoring function takes into account the similarity between the predicted secondary structure states of the residues in two proteins.

Our scoring function can be described as a multinomial trials process, generalized for the non-integer setting. Aside from being intuitive, this description of the scoring function reveals our motivation for incorporating position-specific background frequencies into the alignment process.

In UNI-FOLD, the score for matching the profile positions i and j is the sum of the sequence and the structure term:


Formula 9

(9)

The sequence term is given by:


Formula 10

(10)
where


Formula 11

(11)


Formula 12

(12)


Formula 13

(13)
and


Formula 14

(14)

In (11)–(14), Formula are the effective amino-acid counts at position j and Formula are the background amino-acid frequencies at position i.

The effective amino-acid letters in column j can be viewed as the outcomes of a multinomial process in which the total number of trials is Nj and the probabilities of 20 different outcomes are Formula . The (multinomial) probability of observing exactly Formula outcomes of each amino-acid k in the column j is given by (12). In other words, if the amino-acid distribution Formula is viewed as a multinomial model, then (12) represents the probability that the residue counts in column j are generated from the model Formula . Thus, the score s(i, j) for aligning positions i and j given by (11) represents the logarithm of the ratio of the probability that the residue counts Formula are generated by the model Formula and the probability that the same counts are generated by the background model Formula .

Using some basic properties of logarithms, it can be shown that Formula has the form (6) without the weighting terms c1 and c2. We note that using non-balanced terms in (6) makes sense, because the number of effective amino-acid counts at a profile position is an indicator of the ‘confidence’ in the amino-acid probability distribution at that position.

UNI-FOLD scoring function further generalizes (6) by incorporating secondary structure matching and position-specific background frequencies. Using the secondary structure information is known to improve the sensitivity of a fold recognition method (Ginalski et al., 2003; Heringa, 2000; Söding, 2005). The secondary structure score in ORFeus is defined as a shifted dot product between the secondary structure probability vectors. In HHsearch the log-odd-based secondary structure scores are weighted and then added to each amino-acid column–column score.

In UNI-FOLD, the secondary structure contribution to the overall score, Formula , is computed using the same set of formulas (10)–(14)GoGoGoGo (the only difference is that Formula ) Therefore the secondary structure scoring in UNI-FOLD can also be viewed as a multinomial process with three outcomes, whose probabilities are the frequencies of the three secondary structure states: Formula (helix), Formula (strand) and Formula (coil).2 However, while the number of effective residues at the position j in Formula can be derived from the alignments of the amino-acid sequences, for Formula there is no notion of the effective secondary structure counts Formula (helix), Formula (strand) and Formula (coil). In lieu of the observed secondary structure counts, we set each Formula , Formula and Formula to a fixed positive number c. We tested several values for c and found c = 5 to work the best in identifying remote relationships between the sequences in a small test set (data not shown).

2.2 Background residue probabilities
The sequence score (11) for aligning a pair of positions i and j relates the probability that the effective counts Formula at the position j are generated from the model qi to the probability that nj are generated from the background model Formula . In the simplest case, the background frequencies are set to the overall probabilities of 20 amino-acid types (Robinson and Robinson, 1991). However, scoring functions that implement fixed background frequencies are often unable to distinguish true sequence similarities from those that are product of a compositional bias in two profiles. For example, an alanine-rich profile will typically score high against any other alanine-rich profile.

To compensate for compositional bias in protein families, UNI-FOLD uses profile-derived background frequencies. The background frequencies Formula at the profile position i are computed by averaging the observed amino-acid frequencies over the remaining positions in the profile, i.e.


Formula 15

(15)
where


Formula 16

(16)

In addition to using position-specific amino-acid background frequencies in Formula , UNI-FOLD incorporates protein-specific background probabilities when calculating Formula . In other words, the secondary structure background probabilities in UNI-FOLD are set to the overall probabilities of the three secondary structure states in the protein for which the profile is built.

We note that the idea of using the profile-dependent background frequencies is not new in sequence alignment methods. The profile-specific background models have been successfully used in HMM algorithms (see, for example, Barrett et al., 1997). However, to the best of our knowledge, profile-dependent background frequencies have never been used in profile–profile settings before.

As explained in Barrett et al. (1997), there are other ways to specify the background model when aligning two profiles. For example, the background frequencies can be different for every template (database) profile, but same for each query profile. We have not tested the template-specific background model, but, according to Barrett et al. it is not likely to work very well.

A head-to-head comparison in the performance between the scoring function that uses Robinson and Robinson background frequencies and the function that uses profile-derived background frequencies is given in the ‘Results’ section.

2.3 Alignment score significance
UNI-FOLD raw alignment scores are normalized by computing the P-values from the Gumbel type distributions of the background alignment scores. The Gumbel parameters {lambda} and µ are computed independently for each profile by aligning the profile to a representative set of 2500 sequence profiles computed for the FSSP family representatives (Holm et al., 1992).

The significance of a raw alignment score between the query and the template (database) profile is calculated from the Gumbel distribution specified by the template's parameters {lambda} and µ. Hence, the P-value of an alignment score S between the query and the template in UNI-FOLD represents the probability of getting the score of at least S in the comparison of the template profile against the profiles built for FSSP family representatives. We emphasize that pre-computing the distribution parameters for the database profiles is a standard technique used in many programs, such as HMMER (Eddy, 1998) and HHsearch (Söding, 2005).

2.4 Profile construction
In order for a profile–profile algorithm to recognize a weak similarity between two sequences (such as, for example, the relationship between the sequences that belong to the same SCOP fold), the sequence profiles must include as many diverse family members as possible. However, the same strategy might not be optimal when aligning two highly similar sequences. The problem is that the diverse profiles, i.e. those containing distantly related family members, often include a high percentage of false positives. The presence of false positives can compromise an otherwise accurate alignment between highly similar sequences (Sadreyev and Grishin, 2004).

The profiles in UNI-FOLD are computed using PSI-BLAST, but the number of PSI-BLAST iterations used to build the profiles is varied from one profile to another. More specifically, instead of building the profiles with a fixed number of PSI-BLAST iterations, we require that each profile has the same (or similar) size, as defined below.

We modified the PSI-BLAST program to output the effective number of different residue types observed in columns of the PSI-BLAST multiple alignments. In the PSI-BLAST paper (Altschul et al., 1997), this parameter is denoted by NC and is used to weight the contribution of the observed amino-acid counts when computing the score for matching a sequence residue to the profile column C. Since NC represents the effective number of different amino-acid letters (including the gap character), the value of NC cannot exceed 21. The variability of NC in an example profile is illustrated in the Figure 1.


Figure 1
View larger version (9K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. The average number of different residue types (NC) at various positions in the profile constructed for DNA primase from Pyrococcus horikoshii, (PDB ID: 1v33a_). The profile is built with 10 iterations of PSI-BLAST against the NCBI's non-redundant protein sequence database nr.

 
We define the profile size T as the average value of NC 1 over all positions predicted to be in regular secondary structure regions of the protein for which the profile is constructed. More precisely, the profile size is defined as


Formula 17

(17)
where n is the number of residues that are not part of a protein's loop region. We take the average value of NC – 1 over the regular secondary structure regions only, because of the significant variability of the amino-acid types in the protein's loop structures.

Figure 2 shows the distribution of the profile size (T) for profiles constructed for a set of 300 randomly chosen protein sequences.


Figure 2
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. The distribution of the profile size parameter T for profiles constructed for 300 randomly chosen protein sequences by running PSI-BLAST with two, four and eight iterations.

 
Thus, the question arises whether using the profiles that contain fixed (and ‘optimum’) amount of sequence information, as measured by the profile-size parameter T, can improve the inference of distant relationships between protein sequences.

We analysed the sensitivity of UNI-FOLD on different collections of profiles built for the set of 312 training sequences chosen at random from the scop_1.71_pdb_40 database (Chandonia et al., 2004). The description of the training set is given in Table 1.3


View this table:
[in this window]
[in a new window]

 
Table 1. The training set

 
Each collection of profiles is computed using a different profile size cutoff T. More specifically, for each value of T, we generated a set of profiles for the training sequences by running up to 10 PSI-BLAST iterations on each sequence, until the size of the profile reaches T. If T cannot be reached with the default PSI-BLAST parameters, the PSI-BLAST h-value for the inclusion of the database sequences into the profile is relaxed (from 0.001 to 0.005) and the procedure is repeated once again.

The accuracy of UNI-FOLD on each collection of profiles is measured as the percentage of the relationships between the training sequences identified at different SCOP levels (Murzin et al., 1995).

Figure 3 shows the ability of UNI-FOLD to recognize pairs of training sequences that belong to the same SCOP family, superfamily and fold. As seen in this figure, the sensitivity of UNI-FOLD in detecting SCOP family relationships is almost independent of the profile size. For proteins in the same SCOP superfamily, the sensitivity graph resembles the graph of a negative quadratic function, reaching its maximum at T = 11. At the fold level, the accuracy of UNI-FOLD is monotone increasing function of T.


Figure 3
View larger version (17K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. The percentage of SCOP relationships between the training sequences detected by UNI-FOLD as a function of the profile size (T).

 
The outcome of our study does not provide a definite answer to the question of how to optimally select the profile size parameter T. The optimum value for identifying fold relationships is at the right end of the profile size interval (Fig. 3). In contrast, most superfamily relationships are identified using the profiles of size 11. We decided to set the default value for T in UNI-FOLD to 11, because in many applications (such as protein comparative structure modelling and function annotation) finding superfamily relationships takes precedence over finding fold relationships.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Lindahl benchmark
3.1.1 Sensitivity
The Lindahl benchmark (Lindahl and Elofsson, 2000) measures the ability of a fold recognition method to place a correct member of the SCOP group (family, superfamily and fold) at the top of its rank list. In the Lindahl test, 976 proteins are compared against each other, resulting in a total of almost one million comparisons. There are 1310 pairs of related proteins divided into three groups according to SCOP structural classification: 321 fold pairs, 434 superfamily pairs and 555 family pairs. The accuracy of a fold recognition method is measured as the percentage of the true hits identified in the first and the top five ranks.

Table 2 summarizes the results of Lindahl benchmark for UNI-FOLD and 13 well-established fold recognition methods: PSI-BLAST, HMMER (Eddy, 1998), SAM-T98 (Karplus et al., 1998), BLASTLINK, SSEARCH, SSHMM (Hargbo and Elofsson, 1999), THREADER (Jones et al., 1992), FUGUE (Shi et al., 2001), RAPTOR (Xu et al., 2003), PROSPECT II (Kim et al., 2003), SPARKS (Zhou and Zhou, 2004), SP3 (Zhou and Zhou, 2005) and FOLDpro (Cheng and Baldi, 2006).


View this table:
[in this window]
[in a new window]

 
Table 2. Lindahl benchmark for fold recognition sensitivity

 
UNI-FOLD outperforms the second best method in the Lindahl benchmark by ~2%, 14% and 4%, at the family, superfamily and the fold level, respectively. Our method performs especially well at the SCOP superfamily level, recognizing 69% of related pairs in the top rank and 82% in the top five rank. In addition, 42% of the superfamily relationships identified by UNI-FOLD are not detectable by the popular PSI-BLAST method.

Table 3 shows the contribution of various modifications of our algorithm to the overall sensitivity in the Lindahl test.


View this table:
[in this window]
[in a new window]

 
Table 3. Sensitivity of five versions of UNI-FOLD in the Lindahl benchmark

 
As seen in Table 3,4 switching to profiles of fixed (optimal) size improves the sensitivity of UNI-FOLD at all SCOP levels. The benefit of this modification is largest at the SCOP superfamily level (53.2% versus 62. 2%).

Our analysis also indicates that using profile derived, as opposed to fixed background frequencies (UNI-FOLD versus UNI-FOLDss) results in an average increase in sensitivity of about 2%.

On the other hand, there is no significant difference in the performance between UNI-FOLD and UNI-FOLD06. The percentage of correct UNI-FOLD matches in the top rank is higher than that of UNI-FOLD06 at the family level by 0.3% and at the fold level by 0.6%. On the other hand, UNI-FOLD06 detects more superfamily pairs (70.0% versus 69.1%). Although these results sound surprising, a similar finding that downplays the importance of large databases for profile building has been reported before (Lindahl and Elofsson, 2000, p. 620).

The results presented in Tables 2 and 3 represent the fraction of true hits in first and top five ranks identified by different fold recognition methods, but they do not provide any insight into the reliability of those methods. In theory, even the most sensitive method can assign high scores to false hits and low scores to true matches, as long as a true match is always placed at the top of the ranked list.

3.1.2 Reliability
One way to assess the reliability of a fold recognition method is to compute specificity–sensitivity plots (Lindahl and Elofsson, 2000). The specificity is defined as the percentage of all predicted positives that are true positives:


Formula 18

(18)

The sensitivity is the percentage of true hits that are predicted as positives:


Formula 19

(19)

In formulas (18) and (19), Formula denotes the number of correct hits with score greater than s, Formula is the number of false hits with score greater than s and Formula represents the number of correct hits with score below s.

The specificity–sensitivity curves for FOLDpro and UNI-FOLD are shown in Figures 4–6GoGo.5 Both programs have much better reliability than the profile-sequence methods tested in the Lindahl benchmark (data not shown). For reference, we also include plots for the popular HMMER program.


Figure 4
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. The specificity–sensitivity curves at the family level.

 

Figure 5
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. The specificity–sensitivity curves at the superfamily level.

 

Figure 6
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. The specificity–sensitivity curves at the fold level.

 
As seen in Figure 4, the sensitivity of UNI-FOLD at the family level is similar to that of FOLDpro for specificity values between 0 and 0.8, and better for specificity values close to 1. This indicates that, unlike FOLDpro, UNI-FOLD rarely assigns high scores to false matches. At the superfamily level UNI-FOLD has better sensitivity for all specificity values (Fig. 5). Finally, at the fold level (Fig. 6) both UNI-FOLD and FOLDpro have much better tradeoff between the sensitivity and specificity than HMMER. However, no method in our benchmark is able to reliably detect the fold level similarities.

3.2 SCOP benchmark
We emphasize that Lindahl benchmark (Table 2) does not include results for some of the best fold recognition methods available today, such as, for example, HHsearch. HHsearch is a homology detection and alignment method based on the pairwise comparison of profile HMMs. An earlier version of HHsearch (v1.2) performed very well in the last rounds of CASP and LiveBench experiments.

One way to compare HHsearch and UNI-FOLD is to compute a set of HMMs for sequences in the Lindahl set and run all-against-all HHsearch on this set. However, because the Lindahl sequences come from different databases (Lindahl and Elofson, 2000), constructing HMMs for Lindahl test set (and in particular including DSSP information into the HMMs) requires string matching, which is an involved and error prone procedure. To avoid this, we compared the performance of UNI-FOLD and HHsearch on a different and much larger set of sequences from the latest release of the SCOP database.

To construct the test set, we first downloaded the models for representative SCOP1.73 sequences from the HHsearch website. An all-against-all comparison between 12 155 sequences in this set results in almost 150 million pairs, with many sequences sharing high percentage of sequence identity (up to 70%). To obtain a more representative and easier to manage test set, we identified a subset of models for which the seed sequences are members of SCOP1.73_20 (obtained by clustering SCOP_1.73 at 20% sequence identity).6 This set contains 3950 sequences (HMMs), which corresponds to about 15.6 millions pairs in an all-against-all comparison.7

Table 4 shows the ability of HHsearch and UNI-FOLD to identify pairs of sequences from the test set that belong to the same SCOP group. As seen in Table 4, UNI-FOLD performs slightly better at the family level, recognizing 80.3% (1746 out of 2175) relationships in the Top1 rank and 90.4% (1967 out of 2175) in the Top5 rank. HHsearch is able to recognize 0.9% more superfamily pairs in Top1 rank but 1.4% less in Top5 rank. The largest difference between the two methods is seen in the Fold Top5 category: 1010 pairs detected by UNI-FOLD versus 905 detected by HHsearch.


View this table:
[in this window]
[in a new window]

 
Table 4. SCOP benchmark for fold recognition

 
We also note that the parameters of HHsearch were kept at their default values. In this mode HHsearch uses DSSP secondary structure assignment stored in the template's HMM when aligning two SCOP models. In contrast, UNI-FOLD uses only predicted secondary structure for both sequences. According to Söding (2005), using DSSP (as opposed to predicted) secondary structure results in about 1.2% increase in the sensitivity of the HHsearch program.

3.3 SALIGN benchmark
We compared the accuracy of UNI-FOLD to the accuracy of five methods in the SALIGN benchmark for alignment quality. The SALIGN test set consists of 200 pairs of structurally similar proteins aligned with the structure–structure alignment programme CE. The pairs are selected so that each CE alignment matches at least 50% of residues in both chains, with at least 100 residues aligned in each sequence. In addition, at least 90% of residues of one chain are covered by CE alignment. There are on average 65% of structurally equivalent C{alpha} with an RMSD of 3.5 Å.

The alignment accuracy of an alignment method in SALIGN benchmark is defined as the percentage of the aligned positions that are identical to those in the CE alignment. We note that SALIGN is a difficult alignment test since the sequences in each pair share <40% of sequence identity.

The percent CE overlap for UNI-FOLD in SALIGN benchmark is 57.4% (Table 5). This is higher than the accuracy of SP3 (56.6%) and SALIGN (Marti-Renom et al., 2004) (56.4%), despite the fact that both of these programs are optimized for alignment accuracy (SP3 is trained in the ProSup benchmark and SALIGN is trained on a set of CE alignments), while UNI-FOLD is optimized for the fold recognition. For reference, the BLAST program is able to correctly match only 26.1% of all aligned pairs.


View this table:
[in this window]
[in a new window]

 
Table 5. SALIGN Benchmark for alignment accuracy

 
The accuracy of UNI-FOLD drops to 56.8% when the fixed background frequencies are used in the scoring scheme. An additional drop of 0.6% is seen when the secondary structure scoring in UNI-FOLDSS is turned off.

We also note that the length of UNI-FOLD alignments is similar to length of the alignments generated by CE. The number of residue pairs matched by UNI-FOLD exceeds the number of pairs aligned by CE by 1.8% (48 514 UNI-FOLD pairs versus 47 667 CE pairs).

3.4 Speed
A typical UNI-FOLD search with a query sequence of ~250 residues against the PDB90 database consisting of ~15 000 sequences takes about 70 min on a 2.8 GHz Pentium 4 computer. Thus, the running time of UNI-FOLD is better than both the mean (661 min) and the median (287 min) response time for CASP7 servers.8 However, several well-established servers have better response time than UNI-FOLD, such as HHpred3, shub (Fischer, 2003) and nFOLD (Jones et al., 2005).


    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We study the accuracy of a profile–profile alignment method as a function of the size of the alignment profiles. The profile size is defined as the average number of different residue types observed in the columns of the multiple alignment used for profile construction. The relationship between the algorithm's accuracy and the profile size is used to determine the optimal profile size parameter for the detection of protein sequence relationships in each of the three SCOP groups: family, superfamily and fold.

Our study finds that the sensitivity of a profile–profile method in detecting SCOP family relationships is independent on the profile size. On the other hand, most superfamily relationships can be identified by using the profiles of size of ~11. The similarity detection rate at the fold level is directly proportional to the size of the sequence profiles. In other words, most fold level relationships can be identified with rich profiles, built from the multiple alignments consisting of many remotely related protein sequences.

The techniques described in this article are implemented in a profile–profile alignment algorithm UNI-FOLD. UNI-FOLD uses a probabilistic scoring scheme and a profile-specific statistics for estimating the P-values of the alignment scores. To compensate for the compositional bias in the protein families, the background frequencies in UNI-FOLD are derived from the profile by averaging the observed amino-acid frequencies over the profile's positions.

We tested the sensitivity and reliability of UNI-FOLD utilizing the Lindahl and SCOP benchmarks for fold recognition and SALIGN benchmark for the alignment accuracy. To address a possible performance bias due to the increasing protein sequence databases, we also measured the sensitivity of a version of our algorithm that uses an older protein sequence database for profile construction and protein secondary structure prediction.

UNI-FOLD outperforms all other methods tested in the Lindahl benchmark in every SCOP category. However, the difference in the performance between the top fold recognition methods at the SCOP family and fold level is only 2–4%. At the SCOP superfamily level, UNI-FOLD is capable of recognizing 69% of all related pairs, while the second best method can detect only 56%. For comparison, the popular PSI-BLAST program is able to recognize only 27% of all superfamily relationships between the sequences in the Lindahl test set.

Our study finds that up to 2% of the improvement in sensitivity of UNI-FOLD is due to utilizing profile-specific background frequencies. Although we have not investigated the impact of the background model on the fold recognition sensitivity outside of the setting described in this article, other authors report their methods do not benefit from using protein-specific background model. The increase in UNI-FOLD sensitivity when switching to profile-specific background frequencies is due to the fact that the compositional bias in protein families in UNI-FOLD is treated in the alignment step. In other words, the added value of using the profile-specific background frequencies might not be as high for methods that eliminate the compositional bias in the profile-construction step, e.g. by filtering the multiple alignments used for profile construction.

UNI-FOLD can be easily incorporated into a consensus method for additional boost in sensitivity and specificity for protein structure prediction. The applicability of our method extends beyond the comparative protein structure modelling, since the algorithm does not use any protein structure information. Thus, we hope that UNI-FOLD will become a valuable tool in protein structure and functional annotation studies.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
This work is partially supported by a Battelle Grant from the Board of Regents, State of Iowa. We gratefully thank Professor Jianlin Cheng for providing us with the benchmark of FOLDpro method.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Burkhard Rost

1 Since the dot product is always positive, the alignments generated with direct application of (1) extend over the entire length of both sequences. In order for a scoring function to generate more accurate, local alignments, the average score for aligning two profile positions must be negative. One way to ‘adjust’ the expected score is to subtract a small positive value (‘zero shift’) from each score in the scoring matrix (see, for example, Jaroszewski et al., 2005). Back

2 The probabilities of the three secondary structure states in UNI-FOLD are computed with the PSIPRED program (Jones, 1999). Back

3 We excluded from the training set any sequence that belongs to the Lindahl test set. The Lindahl test set is used to assess the performance of UNI-FOLD (see the ‘Results’ section). A full list of training sequences can be found at http://blackhawk.cs.uni.edu/training_set.txt Back

4 We track the performance of UNI-FOLD06 because it helps us assess a potential performance bias associated with utilizing fast growing protein sequence databases for profile building. The nr database used by UNI-FOLD06 is released in May 2006 and it contains 3 636 980 protein sequences. For comparison, the nr database used by UNI-FOLDb in the Lindahl benchmark contains 5 470 121 sequences. Back

5 The results for HMMER are taken from Lindahl and Elofsson (2000) (http://www.sbc.su.se/~arne/protein-id/). The results for FOLDpro are kindly provided by Professor Jianlin Cheng. Back

6 The set of cluster representatives SCOP1.73_20 can be downloaded from the ASTRAL website: http://astral.berkeley.edu Back

7 Our test set can be downloaded at http://blackhawk.cs.uni.edu/scop_test_set.txt Back

8 See ftp://ftp.tuebingen.mpg.de/pub/protevo/HHsearch/CASP7/response_times.txt Back

Received on January 9, 2008; revised on March 3, 2008; accepted on March 7, 2008

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res (1997) 25:3389–3393.[Abstract/Free Full Text]

    Barrett C, et al. Scoring hidden Markov models. Comput. Appl. Biosci (1997) 13:191–199.[Abstract/Free Full Text]

    Chandonia JM, et al. The ASTRAL compendium in 2004. Nucleic Acids Res (2004) 32:D189–D192.[Abstract/Free Full Text]

    Cheng J, Baldi P. A machine learning information retrieval approach to protein fold recognition. Bioinformatics (2006) 22:1456–1463.[Abstract/Free Full Text]

    Debe DA, et al. STRUCTFAST: protein sequence remote homology detection and alignment using novel dynamic programming and profile-profile scoring. Proteins (2006) 64:960–967.[CrossRef][Web of Science][Medline]

    Eddy SR. Profile hidden Markov models. Bioinformatics (1998) 14:755–763.[Abstract/Free Full Text]

    Fischer D. 3D-SHOTGUN: a novel, cooperative, fold-recognition meta-predictor. Proteins: Struct. Funct. Genet (2003) 51:434–441.[CrossRef][Web of Science][Medline]

    Fischer D, et al. CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins (2003) 53(Suppl. 6):503–516.[CrossRef][Web of Science][Medline]

    Ginalski K, et al. ORFeus: detection of distant homology using sequence profiles and predicted secondary structure. Nucleic Acids Res (2003) 31:3804–3807.[Abstract/Free Full Text]

    Hargbo J, Elofsson A. Using hidden Markov models and predicted secondary structure in fold recognition. Proteins: Struct. Funct. Genet (1999) 36:68–76.[CrossRef][Web of Science][Medline]

    Heringa J. Computational methods for protein secondary structure prediction using multiple sequence alignments. Curr. Protein Pept. Sci (2000) 1:273–301.[CrossRef][Medline]

    Holm L, et al. A database of protein structure families with common folding motifs. Protein Sci (1992) 1:1691–1698.[Web of Science][Medline]

    Jaroszewski L, et al. FFAS03: a server for profile-profile sequence alignments. Nucleic Acids Res (2005) 33:W284–W288.[Abstract/Free Full Text]

    John B, Sali A. Comparative protein structure modeling by iterative alignment, model building and model assessment. Nucleic Acids Res (2003) 31:3982–3992.[Abstract/Free Full Text]

    Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol (1999) 292:195–202.[CrossRef][Web of Science][Medline]

    Jones DT, et al. A new approach to protein fold recognition. Nature (1992) 358:86–89.[CrossRef][Medline]

    Jones DT, et al. Prediction of novel and analogous folds using fragment assembly and fold recognition. Proteins (2005) 61:143–151.[CrossRef][Web of Science][Medline]

    Karplus K, et al. Hidden Markov models for detecting remote protein homologies. Bioinformatics (1998) 14:846–856.[Abstract/Free Full Text]

    Karplus K, et al. What is the value added by human intervention in protein structure prediction? Proteins (2001) 45(Suppl. 5):86–91.[CrossRef]

    Kim D, et al. PROSPECT II: Protein structure prediction program for the genome-scale. Protein Eng (2003) 16:641–650.[Abstract/Free Full Text]

    Lindahl E, Elofsson A. Identification of related proteins on family, superfamily and fold level. J. Mol. Biol (2000) 295:613–625.[CrossRef][Web of Science][Medline]

    Marti-Renom MA, et al. Alignment of protein sequences by their profiles. Protein Sci (2004) 13:1071–1087.[CrossRef][Web of Science][Medline]

    Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol (1995) 247:536–540.[CrossRef][Web of Science][Medline]

    Robinson AB, Robinson LR. Distribution of glutamine and asparagine residues and their near neighbors in peptides and proteins. Proc. Natl Acad. Sci. USA (1991) 88:8880–8884.[Abstract/Free Full Text]

    Rychlewski L, Fischer D. LiveBench-8: the large-scale, continuous assessment of automated protein structure prediction. Protein Sci (2005) 14:240–245.[CrossRef][Web of Science][Medline]

    Sadreyev RI, Grishin NV. COMPASS: A tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol (2003) 326:317–336.[CrossRef][Web of Science][Medline]

    Sadreyev RI, Grishin NV. Quality of alignment comparison by COMPASS improves with inclusion of diverse confident homologs. Bioinformatics (2004) 20:818–828.[Abstract/Free Full Text]

    Shi J, et al. FUGUE: sequence-structure homology recognition using environment-specific substitution tables and structure-dependent gap penalties. J. Mol. Biol (2001) 310:243–257.[CrossRef][Web of Science][Medline]

    Söding J. Protein homology detection by HMM-HMM comparison. Bioinformatics (2005) 21:951–960.[Abstract/Free Full Text]

    Tress M, et al. Assessment of predictions submitted for the CASP6 comparative modeling category. Proteins (2005) 61:27–45.[CrossRef][Web of Science][Medline]

    Vincent JJ, et al. Assessment of CASP6 predictions for new and nearly new fold targets. Proteins (2005) 61:67–83.[CrossRef][Web of Science][Medline]

    Wang G, et al. Assessment of fold recognition predictions in CASP6. Proteins (2005) 61:46–66.[CrossRef][Web of Science][Medline]

    Xu J, et al. RAPTOR: optimal protein threading by linear programming. J. Bioinform. Comput. Biol (2003) 1:95–117.[CrossRef][Medline]

    Yona G, Levitt M. Within the twilight zone: a sensitive profile–profile comparison tool based on information theory. J. Mol. Biol (2002) 315:1257–1275.[CrossRef][Web of Science][Medline]

    Zhou H, Zhou Y. Single-body knowledge-based energy score combined with sequence-profile and secondary structure information for fold recognition. Proteins (2004) 55:1005–1013.[CrossRef][Web of Science][Medline]

    Zhou H, Zhou Y. Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins (2005) 58:321–328.[CrossRef][Web of Science][Medline]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
24/9/1145    most recent
btn097v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Poleksic, A.
Right arrow Articles by Fienup, M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Poleksic, A.
Right arrow Articles by Fienup, M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?