Bioinformatics Advance Access originally published online on September 27, 2005
Bioinformatics 2005 21(23):4239-4247; doi:10.1093/bioinformatics/bti687
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Profile-based direct kernels for remote homology detection and fold recognition
Department of Computer Science and Engineering, University of Minnesota Minneapolis, MN 55455, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Protein remote homology detection is a central problem in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for remote homology detection. The performance of these methods depends on how the protein sequences are modeled and on the method used to compute the kernel function between them.
Results: We introduce two classes of kernel functions that are constructed by combining sequence profiles with new and existing approaches for determining the similarity between pairs of protein sequences. These kernels are constructed directly from these explicit protein similarity measures and employ effective profile-to-profile scoring schemes for measuring the similarity between pairs of proteins. Experiments with remote homology detection and fold recognition problems show that these kernels are capable of producing results that are substantially better than those produced by all of the existing state-of-the-art SVM-based methods. In addition, the experiments show that these kernels, even when used in the absence of profiles, produce results that are better than those produced by existing non-profile-based schemes.
Availability: The programs for computing the various kernel functions are available on request from the authors.
Contact: karypis{at}cs.umn.edu
| 1 INTRODUCTION |
|---|
|
|
|---|
Breakthroughs in large-scale sequencing have led to a surge in the available protein sequence information that has far out-stripped our ability to experimentally characterize their functions. As a result, researchers are increasingly relying on computational techniques to classify these sequences into functional and structural families based on sequence homology.
Although satisfactory methods exist to detect homologs with high levels of similarity, accurately detecting homologs at low levels of sequence similarity (remote homology detection) still remains a challenging problem. Some of the most popular approaches for remote homology prediction compare a protein with a collection of related proteins using methods such as protein family profiles (Gribskov et al., 1987), PSI-BLAST (Altschul et al., 1997), and hidden Markov models (HMMs) (Krogh et al., 1994; Baldi et al., 1994; Karplus et al., 1998). These schemes produce models that are generative in the sense that they build a model for a set of related proteins and then check to see how well this model explains a candidate protein.
In recent years, the performance of remote homology detection has been further improved through the use of methods that explicitly model the differences between the various protein families (classes) and build discriminative models. In particular, a number of different methods have been developed that build these discriminative models using support vector machines (SVM) (Vapnik, 1998) and have shown, provided there are sufficient data for training, to produce results that are in general superior to those produced by either pairwise sequence comparisons or approaches based on generative models (Jaakkola et al., 2000; Liao and Noble, 2002; Leslie et al., 2002, 2003; Ben-Hur and Brutlag, 2003; Hou et al., 2003, 2004; Saigo et al., 2004; Kuang et al., 2005).
A core component of an SVM is the kernel function, which measures the similarity between any pair of examples. Different kernels correspond to different notions of similarity and can lead to discriminative functions with different performance. One approach for deriving a kernel function is to first choose an appropriate feature space (potentially derived from the input space directly), represent each sequence as a vector in that space and then take the inner product (or a function derived from them) between these vector-space representations as a kernel for the sequences.
One of the early attempts with such feature-space-based approaches is the SVM-Fisher method (Jaakkola et al., 2000), in which a profile HMM model is estimated on a set of proteins belonging to the positive class and used to extract a vector representation for each protein. Another approach is the SVM-pairwise scheme (Liao and Noble, 2002), which represents each sequence as a vector of pairwise similarities to all sequences in the training set. A relatively simpler feature space that contains all possible short subsequences ranging from 3 to 8 amino acids (kmers) is explored in a series of papers [Spectrum kernel (Leslie et al., 2002), Mismatch kernel (Leslie et al., 2003), and profile kernel (Kuang et al., 2005)]. All three of these methods represent a sequence X as a vector in this feature space and differ on the scheme they employ to actually determine if a particular dimension u (i.e. kmer) is present (i.e. has a non-zero weight) in X's vector or not. The Spectrum kernel considers u to be present if X contains u as a substring, the Mismatch kernel considers u to be present if X contains a substring that differs with u in at most a predefined number of positions (i.e. mismatches) and the profile kernel considers u to be present if X contains a substring whose PSSM-based ungapped alignment score with u is above a user-supplied threshold. An entirely different feature space is explored by the SVM-Isites (Hou et al., 2003) and SVM-HMMSTR (Hou et al., 2004) methods that take advantage of a set of local structural motifs (SVMIsites) and their relationships (SVM-HMMSTR).
An alternative to measuring pairwise similarity through a dot-product of vector representations is to calculate an explicit protein similarity measure. The recently developed LA-Kernel method (Saigo et al., 2004) represents one such example of a direct kernel function. This scheme measures the similarity between a pair of protein sequences by taking into account all the optimal local alignment scores with gaps between all of their possible subsequences. The experiments presented in Saigo et al. (2004) show that this kernel is superior to previously developed schemes that do not take into account sequence profiles and that the overall classification performance improves by taking into account all local alignments.
In this paper we develop new kernel functions that are derived directly from explicit similarity measures and utilize sequence profiles. We present two classes of such kernel functions. The first class, referred to as window based, determines the similarity between a pair of sequences by using different schemes to combine ungapped alignment scores of certain fixed-length subsequences. The second, referred to as local alignment based, determines the similarity between a pair of sequences using SmithWaterman alignments and a position independent affine gap model, optimized for the characteristics of the scoring system. Both kernel classes utilize profiles constructed automatically via PSI-BLAST and employ a profile-to-profile scoring scheme we develop by extending a recently introduced profile alignment method (Mittelman et al., 2003).
Experiments on two benchmarks derived from SCOP, one designed to detect remote homologs and the other designed to identify folds, show that these new kernels produce results that are substantially better than those produced by all other state-of-the-art SVM-based methods. In addition, the experiments show that these newly proposed kernels, even when used in the absence of profiles, produce results that are better than those produced by existing non-profile-based schemes.
| 2 METHODS AND ALGORITHMS |
|---|
|
|
|---|
2.1 SVM and kernel functions
Key to our algorithm for protein classification is its learning methodology, which is based on support vector machines. Given a set of positive training sequences
and a set of negative training sequences
, an SVM learns a classification function f(X) of the form
![]() | (1) |
and
are non-negative weights that are computed during training by maximizing a quadratic objective function, and
is called the kernel function that is computed over the various training set and test set instances. Given this function, a new sequence X is predicted to be positive or negative depending on whether f(X) is positive or negative. In addition, the value of f(X) can be used to obtain a meaningful ranking of a set of instances, as it represents the strength by which they are members of the positive or negative class.
2.2 Sequence profiles
The inputs to our classification algorithm are the various proteins and their profiles. A protein sequence X of length n is represented by a sequence of characters X =
a1, a2, ..., an
such that each character corresponds to 1 of the 20 standard amino acids. The profile of a protein X is derived by computing a multiple sequence alignment of X with a set of sequences {Y1, ..., Ym} that have a statistically significant sequence similarity with X (i.e. they are sequence homologs). In this paper we obtain the profiles using PSI-BLAST (Altschul et al., 1997) as it combines the steps of finding the homologous sequences and computing their multiple alignment, is very fast, and has been shown to produce reasonably good results. However, the profile-based kernels developed here can be used with other methods of constructing sequence profiles as well.
The profile of a sequence X of length n is represented by two n x 20 matrices. The first is its position-specific scoring matrix PSSMX that is computed directly by PSI-BLAST using the scheme described in (Altschul et al., 1997). The rows of this matrix correspond to the various positions in X and the columns correspond to the 20 distinct amino acids. The second matrix is its position-specific frequency matrix PSFMX that contains the frequencies used by PSI-BLAST to derive PSSMX. These frequencies (also referred to as target frequencies (Mittelman et al., 2003)) contain both the sequence-weighted observed frequencies [also referred to as effective frequencies (Mittelman et al., 2003)] and the BLOSUM62 (Henikoff and Henikoff, 1992) derived-pseudocounts (Altschul et al., 1997). For each row, the frequencies were scaled so that they add up to one. In the cases in which PSI-BLAST could not produce meaningful alignments for certain positions of X, the corresponding rows of the two matrices were derived from the scores and frequencies of BLOSUM62.
2.3 Profile-based sequence similarity
Many different schemes have been developed for determining the similarity between profiles that combine information from the original sequence, position-specific scoring matrix, or position-specific target and/or effective frequencies (Mittelman et al., 2003; Wang and Dunbrack Jr, 2004; Marti-Renom et al., 2004). In our work we use a scheme that is derived from PICASSO (Heger and Holm, 2001; Mittelman et al., 2003). Specifically, the similarity score between the i-th position of protein's X profile, and the j-th position of protein's Y profile is given by
![]() | (2) |
Equation (2) determines the similarity between two profile positions by weighting the position-specific scores of one sequence according to the frequency at which the corresponding amino acid occurs in the second sequence's profile. Note that by construction, Equation (2) leads to a symmetric similarity score. The key difference between Equation (2) and the corresponding scheme used in Mittelman et al. (2003) (referred to as PICASSO3), is that our measure uses the target frequencies, whereas the scheme of (Mittelman et al., 2003) was based on effective frequencies. Our experiments (not included here) indicate that target frequencies lead to better results.
2.4 Window-based kernels
The first class of profile-based kernel functions that we developed determines the similarity between a pair of sequences by combining the ungapped alignment scores of certain fixed length subsequences (referred to as wmers). Given a sequence X of length n and a user-supplied parameter w, the wmer at position i of X (w < i
n w) is defined to be the (2w + 1)-length subsequence of X centered at position i. That is, the wmer contains xi, the w amino acids before, and the w amino acids after xi. We will denote this subsequence as wmerX(i).
2.4.1 All fixed-width wmers (AF-PSSM)
The AF-PSSM kernel computes the similarity between a pair of sequences X and Y by adding up the alignment scores of all possible wmers between X and Y that have a positive ungapped alignment score. Specifically, if the ungapped alignment score between two wmers at positions i and j of X and Y, respectively is denoted by wscoreX,Y(i, j), n and m are the lengths of X and Y, respectively and
is the set of all possible wmer-pairs of X and Y with a positive ungapped alignment score, i.e.
![]() | (3) |
i
n w and w + 1
j
m w, then the AF-PSSM kernel computes the similarity between X and Y as
![]() | (4) |
The ungapped alignment score between two wmers is computed using the profile-to-profile scoring method of Equation (2) as follows:
![]() | (5) |
Note that both the AF-PSSM kernel and the profile kernel (Kuang et al., 2005) determine the similarity between a pair of sequences by considering how all of their fixed-length subsequences are related in view of sequence profiles. However, unlike the feature space-based approach employed by Profile, the AF-PSSM kernels determine the wmer-based similarity of two sequences by comparing all of their possible wmers directly. This allows AF-PSSM to precisely determine whether two wmers are similar or not and provide better quantitative estimates of the degree to which two wmers are similar.
2.4.2 Best fixed-width wmer (BF-PSSM)
In determining the similarity between a pair of sequences X and Y, the AF-PSSM kernel includes information about all possible wmer-level local alignments between them. In light of this observation, it can be thought of as a special case of the LA kernels proposed by Saigo et al. (2004), which compute the similarity between a pair of sequences as the sum of the optimal local alignment scores with gaps between all possible subsequences of X and Y. The results reported in Saigo et al. (2004) show that taking into account all possible alignments leads to better results.
To see whether or not this is true in the context of the profile-derived wmer-based kernels, we developed a scheme that attempts to eliminate this multiplicity by computing the similarity between a pair of sequences based on a subset of the wmers used in the AF-PSSM kernel. Specifically, the BF-PSSM kernel selects a subset
of
[as defined in Equation (3)] such that (1) each position of X and each position of Y is present in at most one wmer-pair and (2) the sum of the wscores of the selected pairs is maximized. Given
, the similarity between the pair of sequences is then computed as follows:
![]() | (6) |
The relation between
and
can be better understood if the possible wmer-pairs in
are viewed as forming an n x m matrix, whose rows correspond to the positions of X, columns to the positions of Y, and values correspond to their respective wscores. Within this context,
corresponds to a matching of the rows and columns (Papadimitriou and Steiglitz, 1982) whose weight is high (bipartite graph matching problem). Since the selection forms a matching, each position of X (or Y) contributes at most one wmer in Equation (6), and as such, eliminates the multiplicity present in the AF-PSSM kernel. At the same time, since we are interested in a highly weighted matching, we try to select the best wmers for each position.
In our algorithm, we use a greedy algorithm to incrementally construct
by including the highest weight wmers that is not in conflict with the wmers already in
. This algorithm terminates when we cannot include in
any additional wmers.
Note that an alternate way of defining
is to actually look for the maximum weight matching (i.e. the matching whose weight is the highest among all possible matchings). However, the complexity of the underlying bipartite maximum weight matching problem is relatively high [O(n2m + nm2) (Papadimitriou and Steiglitz, 1982)], and for this reason we use the greedy approach.
2.4.3 Best variable-width wmer (BV-PSSM)
In fixed-width wmer-based kernels the width of the wmers is fixed for all pairs of sequences and throughout the entire sequence. As a result, if w is set to a relatively high value, it may fail to identify positive scoring subsequences whose length is smaller than 2w + 1, whereas if it is set too low, it may fail to reward sequence pairs that have relative long similar subsequences.
To overcome this problem, we developed a kernel, referred to as BV-PSSM, which is derived from the BF-PSSM kernel but operates with variable width wmers. In particular, given a user-supplied width w, it considers the set of all possible wmer-pairs whose length ranges from one to w, i.e.
![]() | (7) |
of wmer-pairs that form a high weight matching. The similarity between the pair of sequences is then computed as follows:
![]() | (8) |
is constructed by including the highest scoring wmer for i that does not conflict with the previous selections, this scheme can automatically select the highest scoring wmer whose length can vary from one up to w; thus, achieving the desired effect.
2.5 Local alignment-based kernels (SW-PSSM)
The second class of profile-based kernels that we examine compute the similarity between a pair of sequences X and Y by finding an optimal alignment between them that optimizes a particular scoring function. There are three general classes of optimal alignment-based schemes that are commonly used to compare protein sequences. These are based on global, local and globallocal (also known as end-space free) alignments (Gusfield, 1997). Our experiments with all of these schemes indicate that those based on optimal local alignments [also referred to as SmithWaterman alignments (Smith and Waterman, 1981)] tend to produce somewhat better results. For this reason we use this method to derive a profile-based alignment kernel, which is referred to as SW-PSSM.
Given two sequences X and Y of lengths n and m, respectively, the SW-PSSM kernel computes their similarity as the score of the optimal local alignment in which the similarity between two sequence positions is determined using the profile-to-profile scoring scheme of Equation (2), and a position-independent affine gap model. The actual alignment is computed using the O(nm) dynamic programming algorithm developed by Gotoh (1982).
Within this local alignment framework, the similarity score between a pair of sequences depends on the particular values of the affine gap model [i.e. gap-opening (go) and gap-extension (ge) costs] and the intrinsic characteristics of the profile-to-profile scoring scheme. In order to obtain meaningful local alignments, the scoring scheme that is used should produce alignments whose score must on average be negative with the maximum score being positive (Smith and Waterman, 1981). A scoring system whose average score is positive will tend to produce very long alignments, potentially covering segments of low biologically relevant similarity. However, if the scoring system cannot easily produce alignments with positive scores, it may fail to identify any non-empty similar subsequences.
To ensure that the SW-PSSM kernel can correctly account for the characteristics of the scoring system, we modify the profile-to-profile scores calculated from Equation (2) by adding a constant value. This scheme, commonly referred to as zero-shifting (Wang and Dunbrack Jr, 2004), ensures that the resulting alignments have scores that on the average are negative while allowing for positive maximum scores. In our scheme, the amount of zero-shifting, denoted by zs, is kept fixed for all pairs of sequences, as a limited number of experiments with sequence pair-specific zs values did not produce any better results.
2.6 From similarity measures to Mercer kernels
Any function
can be used as a kernel as long as for any number n and any possible set of distinct sequences {X1, ..., Xn}, the n x n Gram matrix defined by
is symmetric positive semidefinite. These functions are said to satisfy Mercer's conditions and are called Mercer kernels, or simply valid kernels.
The similarity based functions described in the previous sections can be used as kernel functions by setting
to be equal to one of AF-PSSMXi,Xj, BF-PSSMXi,Xj BV-PSSMXi,Xj or SW-PSSMXi,Xj. However, the resulting functions will not necessarily lead to valid Mercer kernels, because G may not be positive semidefinite.
To overcome this problem we used the approach described in Saigo et al. (2004) to convert a symmetric matrix defined on the training set instances into positive definite by subtracting from the diagonal of the training Gram matrix its smallest negative eigenvalue. The resulting matrix is identical to the similarity based Gram matrix at all positions expect those along the main diagonal. We also experimented with the empirical kernel map approach proposed in Scholkopf and Smola (2002), but we find that the eigenvalue-based scheme produced superior results.
| 3 EXPERIMENTAL DESIGN |
|---|
|
|
|---|
3.1 Dataset description
We evaluated the classification performance of the profile-based kernels on a set of protein sequences obtained from the SCOP database (Murzin et al., 1995). We formulated two different classification problems. The first was designed to evaluate the performance of the algorithms for the problem of homology detection when the sequences have low sequence similarities (i.e. the remote homology detection problem), whereas the second was designed to evaluate the extent to which the profile-based kernels can be used to identify the correct fold when there are no apparent sequence similarities (i.e. the fold detection problem).
3.1.1 Remote homology detection (superfamily detection)
Within the context of the SCOP database, remote homology detection was simulated by formulating it as a superfamily classification problem. The same dataset and classification problems (The dataset and classification problem definitions are available at http://www.cs.columbia.edu/compbio/svm-pairwise) have been used in a number of earlier studies (Liao and Noble, 2002; Hou et al., 2004; Saigo et al., 2004) allowing us to perform direct comparisons on the relative performance of the various schemes. The data consisted of 4352 sequences from SCOP version 1.53 extracted from the Astral database, grouped into families and superfamilies. The dataset was processed so that it does not contain any sequence pairs with an E-value threshold <1025. For each family, the protein domains within the family were considered positive test examples, and protein domains within the superfamily but outside the family were considered positive training examples. This yielded 54 families with at least 10 positive training examples and 5 positive test examples. Negative examples for the family were chosen from outside of the positive sequences' fold, and were randomly split into training and test sets in the same ratio as the positive examples.
3.1.2 Fold detection
Employing the same dataset and overall methodology as in remote homology detection, we simulated fold detection by formulating as a fold classification within the context of SCOP's hierarchical classification scheme. In this setting, protein domains within the same superfamily were considered to be as positive test examples, and protein domains within the same fold but outside the superfamily were considered as positive training examples. This yielded 23 superfamilies with at least 10 positive training and 5 positive test examples. Negative examples for the superfamily were chosen from outside of the positive sequences' fold and split equally into test and training sets (The classification problem definitions are available at http://bioinfo.cs.umn.edu/supplements/remote-homology/). Since the positive test and training instances were members of different superfamilies within the same fold, this new problem is significantly harder than remote homology detection, as the sequences in the different superfamilies did not have any apparent sequence similarity (Murzin et al., 1995).
3.2 Profile generation
The position-specific score and frequency matrices used by the profile-based scoring method of Equation (2) were generated using the latest version of the PSI-BLAST algorithm (available in NCBI's blast release 2.2.10), and were derived from the multiple sequence alignment constructed after five iterations using an e value of 103 (i.e. we used blastpgp j 5 e 0.001). The PSI-BLAST was performed against NCBI's nr database that was downloaded in November of 2004 and contained 2 171 938 sequences.
3.3 SVM learning
We use the publicly available support vector machine tool SVMlight (Joachims, 1999) that implements an efficient soft margin optimization algorithm. Following the approach used by the LA-Kernel (Saigo et al., 2004), for any given positive semi-definite kernel Gram matrix
to be tested, we first normalize the points to unit norm in the feature space and separate them from the origin by adding a constant value of one.
3.4 Evaluation methodology
We measured the quality of the methods by using the receiver operating characteristic (ROC) scores, the ROC50 scores, and the median rate of false positives (mRFP). The ROC score is the normalized area under a curve that plots true positives against false positives for different possible thresholds for classification (Gribskov and Robinson, 1996). The ROC50 score is the area under the ROC curve up to the first 50 false positives. Finally, the mRFP is the number of false positives scoring as high or better than the median-scoring true positives.
Among these evaluation metrics, owing to the fact that the positive class is substantially smaller than the negative class, the ROC50 is considered to be the most useful measure of performance for real-world applications (Gribskov and Robinson, 1996). For this reason, our discussions in the rest of this section will primary focus on ROC50-based comparisons. Also, the ROC50 values that are being reported for the superfamily- and fold-level evaluations correspond to the average ROC50 values over the 54 families and 23 superfamilies, respectively.
| 4 RESULTS |
|---|
|
|
|---|
4.1 Performance of the window-based kernels
Table 1 summarizes the performance achieved by the window-based kernels for the superfamily- and fold-level classification problems across a range of w values.
|
These results show that for both the superfamily and fold level classification problems, the BV-PSSM kernel achieves the best results, the AF-PSSM kernel tends to perform the worst, whereas the BF-PSSM kernel's performance is between these two. In the case of superfamily classification, the performance advantage of BV-PSSM over that of BF-PSSM is relatively small, whereas in the case of fold classification, the former has a clear advantage. It achieves an ROC50 value that is on average 16.3% better across the different window lengths.
Comparing the sensitivity of the three schemes based on the value of w, we see that, as expected, their performance is worse for w = 1, as they only consider wmers of length 3, and their performance improves as the value of w increases. In general, the BV-PSSM kernel performs better for larger windows, whereas the performance of the other kernels tends to degrade more rapidly as the length of the window increases beyond a point. Again, this result is consistent with the design motivation behind the BV-PSSM kernel. Also, the results show that the best value of w is also dependent on the particular classification problem. For most kernels, the best results for fold classification were obtained with longer windows compared with the superfamily classification.
To see the effect of using sequence profiles, we performed a sequence of classification experiments in which we used the same set of window-based kernel functions, but instead of scoring the similarity between two amino acids using the profile-based scheme [Equation (2)], we used the BLOSUM62 position-independent scoring matrix. The results obtained from these experiments, which are summarized in Table 2, illustrate the advantage of using sequence profiles in designing kernel functions for both remote homology detection and fold recognition. The profile-based kernel functions achieve significant improvements over their non-profile counterparts across all different kernel functions, classification problems and metrics.
|
Comparing the performance of the profile-based kernel functions across the two classification problems, we see that their overall effectiveness in remote homology detection (superfamily level classification) is much higher than that of fold recognition. This result is in line with the underlying complexity of the classification problem, as the sequence-based signals for fold recognition are extremely weak. This is also manifested by the relative improvement achieved by the profile-based kernel functions over their BLOSUM62-based counterparts (Tables 1 and 2). For fold recognition, the ROC50 values of the profile-based kernels are higher than those based on BLOSUM62 by a factor of two, whereas for remote homology prediction, the relative ROC50 values are higher by 2530%.
In light of the previously published results on LA-Kernels (Saigo et al., 2004), the better results achieved by the BF-PSSM and BV-PSSM kernels over those achieved by the AF-PSSM kernel (which also hold for their corresponding BLOSUM62-based instances of these kernels) were surprising. One explanation for this discrepancy may be the fact that our window-based kernels consider only short-length ungapped alignments, and the results may be different when longer alignments with gaps are considered as well.
4.2 Performance of the local alignment-based kernels
Table 3 summarizes the performance achieved by the optimal local alignment-based kernel for the superfamily- and fold-level classification problems across a representative set of values for the gap-opening, gap-extension and zero-shift parameters. These parameter values were selected after performing a study in which the impact of a large number of value combinations was experimentally studied, and represent some of the best performing combinations. Owing to space constraints, this parameter study is not included in this paper.
|
The most striking observation from these results is the major impact that the zero-shift parameter has to the overall classification performance. For both the superfamily- and fold-level classification problems, the best results are obtained by the SW-PSSM kernel for which the zero shift parameter has been optimized (i.e. the results corresponding to the last two rows of Table 3).
Comparing the classification performance of the SW-PSSM kernel against the window-based kernels (Table 1) we see that the zero-shift optimized SW-PSSM kernel leads to better results than those obtained by the window-based kernels. Moreover, the relative performance advantage of SW-PSSM is higher for fold recognition over the superfamily classification problem. However, if the SW-PSSM kernel does not optimize the zero-shift parameter (i.e. zs = 0.0), the window-based kernels consistently outperform the SW-PSSM kernel. We also performed a limited number of experiments to see the extent to which the performance of the window-based kernels can be improved by explicitly optimizing the zero-shift parameter for them as well. Our results show that these kernels are not significantly affected by such optimizations.
To also see the impact of sequence profiles in the context of kernels derived from optimal local alignments, we evaluated the classification performance of a set of kernel functions that compute the optimal local sequence alignment using the BLOSUM45 and BLOSUM62 amino acid scoring matrices. Table 4 shows some of the results obtained with these kernel functions for a representative set of values for the gap opening, gap extension and zero-shift parameters.
|
Comparing the results of Table 4 with those of Table 3 we see that, as was the case with the window-based kernels, incorporating profile information leads to significant improvements in the overall classification performance. In addition, these results show that (1) the widely used value for the gap-opening cost (go = 10) is not necessarily the best for either remote homology detection or fold recognition and (2) the classification performance achieved by local alignment kernels derived from the BLOSUM matrices can be further improved by explicitly optimizing the zero-shift parameter as well.
4.3 Comparisons with other schemes
Tables 5 and 6 compare the performance of the various kernel functions developed in this paper against that achieved by a number of previously developed schemes for the superfamily and fold level classification problems, respectively. In the case of the superfamily level classification problem, the performance is compared against SVM-Fisher (Jaakkola et al., 2000), SVM-Pairwise (Liao and Noble, 2002) and different instances of the LA-Kernel (Saigo et al., 2004), SVM-HMMSTR (Hou et al., 2004), Mismatch (Leslie et al., 2003) and Profile (Kuang et al., 2005). In the case of the fold level classification problem, we only include results for the LA-Kernel and Profile schemes, as these results could be easily obtained from the publicly available data and programs for these schemes.
|
|
The results in these tables show that both the window- and local alignment-based kernels derived from sequence profiles (i.e. AF-PSSM, BF-PSSM, BV-PSSM and SW-PSSM) lead to results that are in general better than those obtained by existing schemes. Comparing the ROC50 values obtained by our schemes, we see that each one of them outperforms all existing schemes. The performance advantage of these kernels is greater over existing schemes that rely on sequence information alone (e.g. SVM-Pairwise, LA-Kernels), but still remains significant when compared against schemes that either directly take into account profile information (e.g. SVM-Fisher, Profile) or utilize higher-level features derived by analyzing sequence-structure information (e.g. SVM-HMMSTR). Also, the relative advantage of our profile-based methods over existing schemes is greater for the much harder fold level classification problem over the superfamily-level classification problem. For example, the SW-PSSM scheme achieves ROC50 values that are 13.8 and 81.8% better than the best values achieved by existing schemes for the superfamily- and fold-level classification problems, respectively.
To get a better understanding of the relative performance of the various schemes across the different classes, Figures 1 and 2 plot the number of classes whose ROC50 was greater than a given threshold that ranges from 0 to 1. Specifically, Figure 1 shows the results for the remote homology detection problem, whereas Figure 2 shows the results for the fold detection problem. (Note that these figures contain only results for the schemes that we were able to run locally.) These results show that our profile-based methods lead to higher ROC50 values for a greater number of classes than either the Profile or LA-kernels, especially for larger ROC50 values (e.g. in the range of 0.60.95). Also, the SW-PSSM tends to consistently outperform the rest of the profile-based direct kernel methods.
|
|
In addition, the results for the BF-GSM, BV-GSM and SW-GSM kernels that rely on the BLOSUM scoring matrices show that these kernel functions are capable of producing results that are superior to all of the existing non-profile-based schemes. In particular, the properly optimized SW-GSM scheme is able to achieve significant improvements over the best LA-Kernel-based scheme (7.6% higher ROC50 value) and the best SVM-HMMSTR-based scheme (15.1% higher ROC50 value). This relative performance of BF-GSM, BV-GSM and SW-GSM kernels over existing non-profile-based schemes is further illustrated in Figure 3, which plots the number of classes whose ROC50 was greater than a given threshold. For almost all threshold values, these three new kernels outperform all previous schemes. Note that this plot does not include results for the SVM-HMMSTR scheme as this information was not reported.
|
| 5 DISCUSSION AND CONCLUSION |
|---|
|
|
|---|
This paper presented and experimentally evaluated a number of kernel functions for protein sequence classification that were derived by considering explicit measures of profile-to-profile sequence similarity. The experimental evaluation in the context of a remote homology prediction problem and a fold recognition problem show that these kernels are capable of producing superior classification performance over that produced by earlier schemes.
Three major observations can be made by analyzing the performance achieved by the various kernel functions presented in this paper. First, as was the case with a number of studies on the accuracy of protein sequence alignment (Mittelman et al., 2003; Wang and Dunbrack Jr, 2004; Marti-Renom et al., 2004), the proper use of sequence profiles lead to dramatic improvements in the overall ability to detect remote homologs and identify proteins that share the same structural fold. Second, kernel functions that are constructed by directly taking into account the similarity between the various protein sequences tend to outperform schemes that are based on a feature-space representation [where each dimension of the space is constructed as one of k-possibilities in a k-residue long subsequence or using structural motifs (Isites) in the case of SVM-HMMSTR]. This is especially evident by comparing the relative advantage of the window-based kernels over the Profile kernel. Third, time-tested methods for comparing protein sequences based on optimal local alignments (as well as global and local-global alignments), when properly optimized for the classification problem at hand, lead to kernel functions that are in general superior to those based on either short subsequences (e.g. spectrum, mismatch, profile or window-based kernel functions) or local structural motifs (e.g. SVM-HMMSTR). The fact that these widely used methods produce good results in the context of SVM-based classification is reassuring as to the validity of these approaches and their ability to capture biologically relevant information.
| Acknowledgments |
|---|
This work was supported by NSF EIA-9986042; ACI-9982274, ACI-0133464, ACI-0312828, IIS-0431135, the Army High Performance Computing Research Center contract number DAAD19-01-2-0014, and by the Digital Technology Center at the University of Minnesota. Funding to pay the Open Access publication charges for this article was provided by the National Science Foundation.
Conflict of Interest: none declared.
Received on May 12, 2005; revised on June 15, 2005; accepted on September 20, 2005
| REFERENCES |
|---|
|
|
|---|
Altschul, S.F., et al. (1997) Gapped blast and PSI-blast: a new generation of protein database search programs. Nucleic Acids Res, . 25, 33893402
Baldi, P., et al. (1994) Hidden Markov models of biological primary sequence information. Proc. Natl Acad. Sci. USA, 91, 10531063.
Ben-Hur, A. and Brutlag, D. (2003) Remote homology detection: a motif based approach. Bioinformatics, 19, i26i33[Abstract].
Gotoh, O. (1982) An improved algorithm for matching biological sequences. J. Mol. Biol, . 162, 705708[CrossRef][ISI][Medline].
Gribskov, M., et al. (1987) Profile analysis: detection of distantly related proteins. Proc. Natl Acad. Sci. USA, 84, 3554358.
Gribskov, M. and Robinson, N. (1996) Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput. Chem, . 20, 2533[CrossRef][ISI][Medline].
Gusfield, D. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, (1997) , NY Cambridge University Press.
Heger, A. and Holm, L. (2001) Picasso: generating a covering set of protein family profiles. Bioinformatics, 17, 272279
Henikoff, S. and Henikoff, J.G. (1992) Amino acid subsitution matrices from protein blocks. Proc. Natl Acad. Sci. USA, 89, 1091510919
Hou, Y., et al. (2003) Efficient remote homology detection using local structure. Bioinformatics, 19, 22942301
Hou, Y., et al. (2004) Remote homolog detection using local sequencestructure correlations. Proteins, 57, 518530[CrossRef][ISI][Medline].
Jaakkola, T., et al. (2000) A discriminative framework for detecting remote protein homologies. J. Comput. Biol, . 7, 95114[CrossRef][ISI][Medline].
Joachims, T. (1999) Making large-scale svm learning practical. In Schölkopf, B., Burges, C., Smola, A. (Eds.). Advances in Kernel MethodsSupport Vector Learning, , Cambridge, MA MIT Press.
Karplus, K., et al. (1998) Hidden Markov models for detecting remote protein homologies. Bioinformatics, 14, 846856
Krogh, A., et al. (1994) Hidden Markov models in computational biology: applications to protein modeling. J. Mol. Biol, . 235, 15011531[CrossRef][ISI][Medline].
Kuang, R., et al. (2005) Profile-based string kernels for remote homology detection and motif extraction. J. Bioinform. Comput. Biol, . 3, 527550[CrossRef][Medline].
Leslie, C., et al. (2002) The spectrum kernel: a string kernel for SVM protein classification. Pac. Symp. Biocomput, . 564575.
Leslie, C., et al. (2003) Mismatch string kernels for svm protein classification. Adv. Neural Inf. Process. Syst, . 20, 467476.
Liao, L. and Noble, W.S. (2002) Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships. Proceedings of the International Conference on Research in Computational Molecular BiologyWashington DC , pp. 225232.
Marti-Renom, M., et al. (2004) Alignment of protein sequences by their profiles. Protein Sci, . 13, 10711087
Mittelman, D., et al. (2003) Probabilistic scoring measures for profileprofile comparison yield more accurate short seed alignments. Bioinformatics, 19, 15311539
Murzin, A.G., et al. (1995) SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol, . 247, 536540[CrossRef][ISI][Medline].
Papadimitriou, C.H. and Steiglitz, K. Combinatorial Optimization: Algorithms and Complexity, (1982) , Englewood Cliffs, NJ Prentice-Hall.
Saigo, H., et al. (2004) Protein homology detection using string alignment kernels. Bioinformatics, 20, 16821689
Scholkopf, B. and Smola, A. Learning with Kernels, (2002) , Boston, MA MIT Press.
Smith, T.F. and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol, . 147, 195197[CrossRef][ISI][Medline].
Vapnik, V. Statistical Learning Theory, (1998) , New York John Wiley.
Wang, G. and Dunbrack, R.L., Jr. (2004) Scoring profile-to-profile sequence alignments. Protein Sci, . 13, 16121626
This article has been cited by other articles:
![]() |
A. R. Shah, C. S. Oehmen, and B.-J. Webb-Robertson SVM-HUSTLE--an iterative semi-supervised machine learning approach for pairwise protein remote homology detection Bioinformatics, March 15, 2008; 24(6): 783 - 790. [Abstract] [Full Text] [PDF] |
||||
![]() |
S. Hochreiter, M. Heusel, and K. Obermayer Fast model-based protein homology detection without alignment Bioinformatics, July 15, 2007; 23(14): 1728 - 1736. [Abstract] [Full Text] [PDF] |
||||
![]() |
P. Fariselli, I. Rossi, E. Capriotti, and R. Casadio The WWWH of remote homolog detection: The state of the art Brief Bioinform, March 1, 2007; 8(2): 78 - 87. [Abstract] [Full Text] [PDF] |
||||
![]() |
H. Rangwala and G. Karypis Incremental window-based protein sequence alignment algorithms Bioinformatics, January 15, 2007; 23(2): e17 - e23. [Abstract] [Full Text] [PDF] |
||||
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||












