Skip Navigation


Bioinformatics Advance Access originally published online on October 31, 2007
Bioinformatics 2007 23(23):3217-3224; doi:10.1093/bioinformatics/btm511
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
23/23/3217    most recent
btm511v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Shin, H.
Right arrow Articles by Lichtarge, O.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Shin, H.
Right arrow Articles by Lichtarge, O.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Graph sharpening plus graph integration: a synergy that improves protein functional classification

Hyunjung Shin 1,*,{dagger}, Andreas Martin Lisewski 2,{dagger} and Olivier Lichtarge 2

1Department of Industrial & Information Systems Engineering, Ajou University, San 5, Wonchun-dong, Yeoungtong-gu, 443—749, Suwon, Korea and 2Department of Molecular & Human Genetics, Baylor College of Medicine, One Baylor Plaza, Houston, Texas 77030, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GRAPH-BASED LEARNING
 3 GRAPH SHARPENING
 4 GRAPH INTEGRATION
 5 FUNCTION PREDICTION...
 6 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Predicting protein function is a central problem in bioinformatics, and many approaches use partially or fully automated methods based on various combination of sequence, structure and other information on proteins or genes. Such information establishes relationships between proteins that can be modelled most naturally as edges in graphs. A priori, however, it is often unclear which edges from which graph may contribute most to accurate predictions. For that reason, one established strategy is to integrate all available sources, or graphs as in graph integration, in the hope that the positive signals will add to each other. However, in the problem of functional prediction, noise, i.e. the presence of inaccurate or false edges, can still be large enough that integration alone has little effect on prediction accuracy. In order to reduce noise levels and to improve integration efficiency, we present here a recent method in graph-based learning, graph sharpening, which provides a theoretically firm yet intuitive and practical approach for disconnecting undesirable edges from protein similarity graphs. This approach has several attractive features: it is quick, scalable in the number of proteins, robust with respect to errors and tolerant of very diverse types of protein similarity measures.

Results: We tested the classification accuracy in a test set of 599 proteins with remote sequence homology spread over 20 Gene Ontology (GO) functional classes. When compared to integration alone, graph sharpening plus integration of four vastly different molecular similarity measures improved the overall classification by nearly 30% [0.17 average increase in the area under the ROC curve (AUC)]. Moreover, and partially through the increased sparsity of the graphs induced by sharpening, this gain in accuracy came at negligible computational cost: sharpening and integration took on average 4.66 (±4.44) CPU seconds.

Availability: Software and Supplementary data will be available on http://mammoth.bcm.tmc.edu/

Contact: shin{at}ajou.ac.kr or lisewski{at}bcm.edu

Supplementary information: Supplementary materials are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GRAPH-BASED LEARNING
 3 GRAPH SHARPENING
 4 GRAPH INTEGRATION
 5 FUNCTION PREDICTION...
 6 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The prediction of biological function is a central challenge in modern biology since few genomic or proteomic data have detailed annotations (Friedberg, 2006). A new class of computer aided solutions to this problem is currently emerging that aims to integrate diverse sources of relevant information, rather than relying only on single methods that have specific limitations (Friedberg et al., 2006; Laskowski et al., 2005; Pal and Eisenberg, 2005). Function information may come in many forms, ranging from genomic and molecular to cellular and tissue contexts. The unprecedented availability of data from primary to tertiary protein structure in particular motivated many methods that use computational comparisons of specific molecular attributes to define functional relationships (Watson et al., 2005). From this, a clearer signal for function prediction may emerge by considering multiple relevant relationships.

A natural model of relationships between proteins is a network (or graph), where nodes depict genes or proteins and edges represent their possible interactions or correlations (Kanehisa et al., 2004; Uetz et al., 2000; von Mering et al., 2002) (see Fig. 1). Among these there are networks of weighted edges that evaluate molecular similarity between protein pairs extracted from sequence or structure comparisons and classify or predict protein function (Adai et al., 2004; Friedberg and Godzik, 2005; Hou et al., 2005; Yona et al., 1999). Protein function prediction can benefit by combining a diverse set of similarity measures from protein sequences, physical interactions, gene regulatory networks and metabolic pathways, etc. And there have been such approaches to combining multiple networks, for instance, majority vote (Hishigaki et al., 2001; Schwikowski et al., 2000), Bayesian networks (Deng et al., 2003), discriminative learning methods (Lanckriet et al., 2004a; Vert and Kanehisa, 2003), probabilistic integration by log-likelihood scores (Lee et al., 2004) and semidefinite programming (SDP) based SVM method (Lanckriet et al., 2004b), etc. Recently, it was shown that graphs based on semi-supervised learning (Chapelle et al., 2006; Zhou et al., 2004) yield better classification performance for sequence similarity networks than conventional local sequence comparison (Noble et al., 2005). This suggested that graph-based semi-supervised learning can capture global network information that is functionally relevant. Later, and in a more general approach, multiple protein networks were then combined into a single computationally efficient semi-supervised learning problem, and this further not merely improved functional classification over individual networks but also enormously reduced computation time when compared with other state-of-the-art methods for multiple networks (Tsuda et al., 2005).


Figure 1
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. A graph model of relationships between proteins. Nodes depict genes or proteins and edges represent their possible interactions or correlations, e.g. molecular similarities extracted from sequence or structure comparisons. An annotated protein is labelled either by ‘ + 1’ or ‘ – 1’, indicating either it belongs to a particular functional class or not. Graph-based function prediction seeks to classify the unannotated (unlabelled) proteins marked as ‘?’.

 
Here, we further extend these ideas to improve the annotation of protein function based on graphs. Starting from a diverse set of protein similarity measures, some that are standard and based on sequence and others that are more novel and based on structure, we apply two most recent graph-based algorithms in machine-learning field. Compared to previous work on graph integration, the novelty of this approach lies in taking into explicit account the directionality of edges in each graph through a method that we call graph sharpening, since it effectively eliminates those edges that tend to corrupt functional information represented in the underlying network. With a test set of 15 out of 20 functional GO terms among 599 non-redundant protein structures from the PDBselect25 list (Hobohm and Sander, 1994), we show that sharpening and integration raise the overall classification performance by nearly 30%. In absolute values, an average increase of the AUC—the area under the Receiver Operating Characteristic (ROC) curve by 0.17 up to a level of 0.75.

This article is organized as follows. First, we introduce graph-based semi-supervised learning in Section 2. In Section 3, the basic idea and some elements of the mathematical framework behind graph sharpening are explained (Shin et al., 2006). In Section 4, we present a specific graph-integration approach (Tsuda et al., 2005). Then, we demonstrate how these methods can be successfully applied to protein function prediction (Section 5). We conclude with some future work remarks.


    2 GRAPH-BASED LEARNING
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GRAPH-BASED LEARNING
 3 GRAPH SHARPENING
 4 GRAPH INTEGRATION
 5 FUNCTION PREDICTION...
 6 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In the graph-based semi-supervised learning algorithm (Zhou et al., 2004), a data point xi (i = 1,..., n) is represented as a node i in a graph, and the relationship between data points is represented by an edge where the connection strength from each node j to each other node i is encoded in element wij of a weight matrix W. A weight wij can take a binary value (0 or 1) in the simplest case. Often, a Gaussian function of Euclidean distance between points, with length scale {sigma}, is used to specify connection strength:


Formula

The i ~ j stands for node i and j having an edge between them that can be established either by k nearest neighbors or by Euclidean distance within a certain radius r, ||xixj||2 < r.1 The labelled nodes have labels yl isin { – 1,1}, while the unlabelled nodes have zeros yu = 0. Our algorithm will output an n-dimensional real-valued vectorFormula , which can be thresholded to make label predictions on fl = 1,..., fn after learning. It is assumed that (a) fi should be close to the given label yi in labelled nodes, and (b) overall, fi should not be too different from the fj of adjacent nodes (i ~ j). One can obtain f by minimizing the following quadratic functional (Belkin et al., 2004, Chapelle et al., 2003; Zhou et al., 2004):


Formula 1

(1)
where y = (y1,...,yl, 0,...,0){top}, and the matrix L, called the graph Laplacian matrix (Chung, 1997), is defined as L = D W where D = diag(di), di = {sum}j wij. The first term corresponds to the loss function in terms of condition (a), and the second term represents the smoothness of the predicted outputs in terms of condition (b). The parameter µ trades off loss versus smoothness. The solution of this problem is obtained as


Formula 2

(2)
where I is the identity matrix.


    3 GRAPH SHARPENING
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GRAPH-BASED LEARNING
 3 GRAPH SHARPENING
 4 GRAPH INTEGRATION
 5 FUNCTION PREDICTION...
 6 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Now we present a method that is directly applicable to the weight matrix W, and which is based on the following intuition: in an undirected graph as in Figure 1, all connections are reciprocated and so the matrix of edge weights W is symmetric as shown in Figure 2a. However, when W describes relationships between labelled and unlabelled points, it is not necessarily desirable to regard all such relationships as symmetric. That is, we may differentiate the contribution of all edges to information flow by not weighing them equally. First, some edges may convey more useful information in one direction (e.g. from labelled to unlabelled) than in the reverse direction. Propagating information in the reverse direction, from unlabelled to labelled, may be undesirable since it allows points about which information is uncertain to corrupt the source of information in the system. Since we are already using the language of ‘points’ and ‘edges’, we will say that this causes the point and its outgoing edges to be ‘blunted’, reducing their effectiveness. There are many problem settings (e.g. protein function prediction and other applications in the field of bioinformatics) in which (a) there is a high degree of certainty about the input space representation of each labelled point and its label, and (b) the number of labelled points is low. In this situation, it is indicated to avoid blunting, thus to preserve the effectiveness of information sources. Second, edges directly connecting oppositely labelled points may propagate unhelpful information in either direction. The smoothness condition in Equation (1) plays the role of forcing both predicted scores to be similar to each other and again both important points, the very sources of information are blunted. Further, those edges unnecessarily incur a conflict of the opposite flows in the area of a high certainty in the system. Third, propagation of information between unlabelled points is different—-while some edges of the graph may be more helpful than others in solving the overall problem, a priori we do not know which these might be. Allowing the unlabelled points to harmonize with their neighbours (thus implementing smoothness condition) common to most such learning approaches) is a desirable process.


Figure 2
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Graph sharpening: (a) an original undirected (bi-directed) graph and its symmetric weight matrix W. Note that in the interpretation of W, the row index denotes the destination and the column index the source—so, e.g. wij should be read as ‘the weight of the edge from j to i (j -> i)’. (b) Edges from unlabelled to labelled points (denoted as dotted arrows) are disconnected. (c) Edges between oppositely labelled points are further removed. Sharpening leaves the edges between unlabelled points intact.

 
Figure 2 illustrates the concept of graph sharpening. Figure 2a shows an original graph of seven nodes (points) and its (so far) symmetric weight matrix W. For simplicity, we set the value of ‘1’ as a weight on every edge. Remember that, in the interpretation of W, the row index denotes the destination and the column index the source, and wij reads ‘the weight of the edge from j to i (j -> i)’. In Figure 2b, the edges from unlabelled to labelled points wij (denoted as dotted arrows) are disconnected where i belongs to the set of labelled points l : ={1, 2, 3} and j to the set of unlabelled points u : ={4, 5, 6, 7}. Figure 2c presents the case of connection between oppositely labelled points, w12 and w21. Therefore, both edges are further removed from Figure 2b. Finally, we obtain a sharpened graph in Figure 3. Note that the weight matrix becomes asymmetric, and the edges between unlabelled points remain intact. A detailed mathematical foundation of sharpening is given in (Shin et al., 2006). Let us give a schematic flow of the proof. We pose the general question: what if W is not considered fixed? Is it possible to change some or all of the wij such that our algorithm performs better? We begin by re-formulating the objective function Equation (1) in terms of W, which is based on the formulation of Belkin et al. (2004). Blockwise consideration of the weight matrix,


Formula

allows us to state a condition which solutions W must satisfy if the objective function is to be optimized—there are many such solutions. Exploring every class of solutions is currently regarded as intractable and is left as an open problem. However, we can specify one simple ad hoc solution, concordant with the intuition already stated. By setting Wll to a non-negative diagonal matrix (including the null matrix) and Wlu to 0 such as


Formula 3

(3)
(see the final W in Fig. 2c), we obtain the output prediction for unlabelled data points


Formula 4

(4)
In spreading activation network terms, Equation (4) is equivalent to information being propagated from labelled to unlabelled data once (Wulyl) to set the initial condition for subsequent spreading activation among u {leftrightarrow} u, analogous to Equation (2) but now excluding the labelled points. This also has intuitive appeal. First, for labelled points, it assures fl = ylthere is no loss of information on labelled data points. By disconnecting unnecessary and unhelpful edges, we allow the labelled points and their outgoing edges to stay ‘sharp’ in their influence on the rest of the graph. Second, for unlabelled points, it preserves an important principle of SSL, namely exploitation of the manifold structure inferred from unlabelled data points, by keeping the edges, u {leftrightarrow} u and l -> u, of W.


Figure 3
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. The sharpened graph: in contrast to the original in Figure 1, the sharpened graph is no longer fully reciprocated and so the matrix of edge weights W becomes asymmetric.

 

    4 GRAPH INTEGRATION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GRAPH-BASED LEARNING
 3 GRAPH SHARPENING
 4 GRAPH INTEGRATION
 5 FUNCTION PREDICTION...
 6 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
A protein graph model represents relationships between proteins. Nodes depict proteins, and edges represent their possible interactions or similarities; for instance, one can extract an edge set (or similarity measurements) from amino acid sequences, or from structures, or from interactions, etc. Since there are many ways to represent a set of edges for a given set of proteins there exist many graphs. And each graph is regarded as partly independent from and partly complementary to others. Frequently, it can be difficult to decide which graph will perform best for function prediction for unlabelled proteins. Individual graphs (from single data sources) are often not sufficient for reliable prediction. One way to enhance reliability may be to integrate the given multiple graphs. Integrating multiple graphs stands for finding an optimum value of the linear combination coefficient for the individual graphs. In the semi-supervised learning framework, this translates to finding the combination coefficients {alpha} for the individual Laplacians, as shown in Figure 4. Recently, Tsuda et al. (2005) proposed an integration method for multiple graphs. One can see how the illustration in Figure 4 is related to their formulation,


Formula 5

(5)
and its output prediction


Formula 6

(6)
where K is the number of graphs and an Lk is the corresponding graph Laplacian to graph Gk.


Figure 4
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Graph integration: every single graph Gk can solely be used for label prediction. However, since different graphs contain partly independent and partly complementary pieces of information, integrating them into one may increase the reliability of predictions. In semi-supervised learning framework, graph integration stands for finding an optimum value of the linear combination coefficient {alpha} for individual graph Laplacians Lk.

 

    5 FUNCTION PREDICTION EXPERIMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GRAPH-BASED LEARNING
 3 GRAPH SHARPENING
 4 GRAPH INTEGRATION
 5 FUNCTION PREDICTION...
 6 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
5.1 Data
In automated protein functional class classification, accurate detection of functional relationships beyond close sequence homology is desirable. Thus, we have restricted our selection to protein chains from the PDBselect25 list (version October 2004), where any pair shares no more than 25% sequence identity. Functional information was assigned in terms of the Gene Ontology (GO) in the category ‘Molecular function’ and mapped through the gene ontology annotation database (GOA PDB 24.0). We chose the 20 highly populated GO categories as in Hou et al. (2005). Of all PDBselect25 chains, 599 proteins share at least one of these functional GO terms (Table 1). Therefore, we took them as our experimental protein set.


View this table:
[in this window]
[in a new window]

 
Table 1. Twenty functional GO terms from the category ‘Molecular function’

 
To generate weighted edges between nodes (or proteins), we used four different computational measures of molecular similarity. Each measure assigned to every protein pair (i, j) a positive weight (0 ≤ wij ≤ 1) meaning the degree of molecular similarity between chain i and j. The four different similarity measures are Basic Local Alignment and Search Tool (BLAST, Altschul et al., 1990), the standard approach to alignment of primary sequence that resulted in a degree of sequence identity; length-corrected Contact Metric (CM, Lisewski and Lichtarge, 2006), a metric-based similarity score by considering vector representations of contacts from the entire tertiary structure; Fast Alignment and Search Tool (FAST, Zhu and Weng, 2005), an accurate and computationally efficient 3D geometrical alignment algorithm calculating positive similarity scores; Evolutionary Trace Annotation (ETA, Kristensen et al., 2006), a method that measures similarity based on local similarity of protein substructure, specifically 3D templates that are small structural motifs of evolutionarily important residues identified by the evolutionary trace method (Lichtarge et al., 1996). Our choice represented all currently and commonly used approaches to protein similarity measurement (Watson et al., 2005): sequence alignment (BLAST), global 3D alignment (FAST), local 3D alignment of small motifs (ETA), and alignment-independent vector-based approaches (CM). An edge-weight from BLAST/CM/FAST is a continuous value while that of ETA is a binary value from a support vector classifier, i.e. the edge-weight between two proteins is set to 1 (wij = 1) if they were predicted functionally related; no edge (wij = 0) otherwise.

5.2 Results
Since a protein can belong to several GO categories, we posed a binary-class classification problem for each GO term in Table 1. The GO categories having only a few labelled proteins (less than 10) were excluded; therefore, our experiment became 15 binary-class classification problems, determining membership or non-membership of unannotated (unlabelled) proteins for the respective GO terms. For each GO term, we calculated the 5-fold cross-validation (5 CV) AUC (area under the curve) of ROC (receiver operating characteristic) as a performance measurement. The ROC curve plots true positive rate (sensitivity) as a function of false positive rate (1-specificity) for all possible thresholds (as opposed to choosing a single threshold (Gribskov and Robinson, 1996), see Figure 7. Hence, it has become a standard qualifier for classification algorithms. An AUC of 0.5 corresponds to random guessing, while an AUC of 1.0 implies the algorithm successfully outputs a higher predicted value for any positive example than that for any negatives example (perfect classification).

We compared the proposed method integration on sharpened graphs, with the four individual graphs obtained from BLAST, FAST, CM and ETA, respectively, and also with the previous method of Tsuda et al. (2005)—integration on original graphs. In every comparison, we examined the performance change in terms of original versus sharpened and individual versus integrated. This setting enabled us to see the effect of the proposed method from two separate viewpoints, sharpening and integration. The value of smoothing parameter µ—from Equation (2) of original, Equation (4) of sharpening and Equation (6) of integration—was obtained by 5CV searching over µ isin {0.1, 1, 5, 10, 50, 100}. For each of the available settings per GO category, we compared the results at their best parameters. A table with the best parameter is published at the Internet address (http://mammoth.bcm.tmc.edu/biocomp2007/). Remarkably, sharpening does not include any additional parameters, nor integration—the linear combination coefficient {alpha} in Equation (6) is automatically set as a solution of the optimization.

5.2.1 Graph sharpening on individual graphs
Figure 5a shows the distribution of 300 (= 15 GO terms x 5 CVs x 4 graphs) AUC pairs from the original (unsharpened) and sharpened graphs. Most dots are scattered above the diagonal, thus indicating better performance of the sharpened graphs against originals. A sign test statistically rejected the null hypothesis, ‘the distribution would be evenly scattered below and above the diagonal’, with significant confidence (p < 9 x 10–16). Table 2 confirms this result: on average, sharpening increases the original AUC by 0.03 (ETA) up to 0.12 (FAST).


Figure 5
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Positive effect of sharpening in pairwise AUC comparison for two competing methods in three different cases. More dots in upper diagonal half indicate that the method in vertical axis outperforms the other assigned in the horizontal axis; P-values (from a sign test) represent the statistical deviation from a random distribution of dots in the upper and lower half. (a) Sharpened versus. original: for 221 out of 300 (= 5 repetitions x 15 GO terms x 4 graphs) pairs in total, sharpening gave a higher AUC; (b) integrated (sharpened) versus individuals (sharpened): for 212 out of 300 (= 5 x 15 x 4) experiments, integration gave a higher AUC and (c) integration with sharpening (2007) versus without sharpening (2005): for 70 out of 75 (= 5 x 15) experiments, integration with sharpening had a higher ROC score. More detailed results can be found in Appendix A from our Supplementary Material.

 

View this table:
[in this window]
[in a new window]

 
Table 2. AUC comparison for all the combination of it original versus sharpened and individual versus integrated. Other statistics such as sensitivity and specificity can be found in Appendix B in our Supplementary Material

 
5.2.2 Integrated graph versus individual graphs
The next test was whether individual prediction in sharpened graphs could be further improved through their integration as stated in Section 4. Figure 5b shows that this was the case: 222 of the 300 ROC AUC pairs are above the diagonal (p < 9 x 10–16). In Figure 6, the average over 5CV AUCs are assigned to their function classes. The figure assures that integration on sharpened graphs achieves the highest AUCs when compared with the individuals (sharpened)—except for GO 0003677 (DNA binding) and GO 0008270 (Zinc ion binding). Integration effect on sharpened individual graphs show more significance and less volatility than the effect of integration on the original (unsharpened) networks (Table 2): 0.09 (± 0.04) versus 0.02 (± 0.08) in average AUC increase by integration.


Figure 6
View larger version (22K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. AUC comparison (avg. of five cross-validation) between individual and the integrated graphs: bars within a group correspond to BLAST, CM, FAST, ETA and the integrated graph in due order. For 13 out of 15 categories, integration with sharpening significantly surpasses individual performance.

 
5.2.3 Proposed method versus previous method
In the third test, we verified how integration on sharpened graphs upgrades the performance of the previous method (Tsuda et al., 2005), integration alone, so to speak. The results show that sharpening significantly contributes to AUC improvement, see Figure 5c; 70 out of 75 ROC pairs are in the upper diagonal (p < 3 x 10–18). Integration alone can be insignificant and even worse (see, value 0.02 ± 0.08 in Table 2), particularly if given graphs are noisy, which can often be the case in protein similarity networks. But with sharpening, integration becomes more effective, which is reflected by an AUC increase of 0.17 ± 0.08, see Table 2. Hence, given the performance of single unsharpened (original) graphs, one can raise the AUC as much as 0.17 ± 0.08 by applying integration with sharpening.

5.2.4 Computation time
Sharpening does not require computation. The computation time of an original graph, the time for solving the sparse linear system in Equation (2), was nearly trivial (less than 0.001 CPU second with MATLAB in a standard 1500 MHz PC with 1 GB of memory). It became faster for a sharpened graph since the Laplacian matrix gets sparser. Integration in Equation (6) took avg. 24.27 (± 10.86) CPU seconds for original, while it was avg. 4.66 (±4.44) for sharpened weights, and graph sparsity through sharpening benefitted the computation time for integration.

5.2.5 Enrichment
Figure 7 shows a typical ROC curve of the proposed method for GO 0003824 ‘Catalytic activity’. The curve shows that sharpening and integration can reliably discriminate enzymes (catalytic proteins) from non-enzymes among proteins with < 25% sequence similarity. The inner figure shows the distribution of predicted values (scores), f, for the unannotated proteins. An enrichment of enzymes toward larger values is evident.


Figure 7
View larger version (32K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. ROC curve on GO 0003824 (‘Catalytic activity’)—the curve shows that graph sharpening and integration can reliably discriminate enzymes (catalytic proteins) from non-enzymes. Insert: the distribution of predicted values (scores), f, for the unlabelled nodes (unannotated proteins)—empty bars stand for the distribution of non-enzymes, whereas solid bars stand for that of enzymes. An enrichment of enzymes toward larger values is evident.

 

    6 CONCLUSION AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GRAPH-BASED LEARNING
 3 GRAPH SHARPENING
 4 GRAPH INTEGRATION
 5 FUNCTION PREDICTION...
 6 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We applied to the problem of protein function prediction two recent developments in machine-learning methods based on graph, sharpening (Shin et al., 2006) and integration (Tsuda et al., 2005). Graph sharpening is a formal (details were omitted for brevity) yet intuitive approach for disconnecting undesirable edges in a graph. The result yields sparser graphs that contain less noise and, in turn, reduce computational expense and improve prediction without need for additional parameters. Using an already established optimization framework, graph integration can then be applied with greater efficiency and effectiveness in order to pool information from multiple sources. Both strategies, graph sharpening and graph integration, lend themselves well to the protein function prediction problem. Graph sharpening offers a general framework to address the noise that is pervasive among functionally relevant measures of protein similarity, while graph integration allows overall predictions to take into account a wide variety of complementary types of information on protein similarity graphs, each one representing a different aspect or type of functional information. While either strategy can be applied alone, with sharpening having a greater effect than integration, the best result was reached when sharpening and integration are used together and thus yield a synergestic effect on function prediction performance at lesser computational cost. This improvement is noteworthy for its size (0.17, or nearly 30% average increase in the area under the ROC curve) and in view of the diversity of similarity scores that were integrated: BLAST is used over sequences, CM and FAST are applied over whole structures and ETA is applied to a local structural motif (and has only been optimized for enzymes, rather than for the GO annotations used here).

This work motivates possible future studies. First, the method is general and its full application for function prediction will still require a continued refinement of individual methods, as well as broadening the number of similarity measures whose graphs are sharpened and then integrated together. Towards this goal, we plan an internet accessible software tool for sharpening and integration to allow large scale function prediction using these methods. Second, from a theoretical perspective, statistical formalization or framework on how sharpening and integration work together has not been conclusively established. This should be studied further.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GRAPH-BASED LEARNING
 3 GRAPH SHARPENING
 4 GRAPH INTEGRATION
 5 FUNCTION PREDICTION...
 6 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
H.S. was supported by the Korea Research Foundation Grant funded by the Korean Government (MOEHRD) and by the grant for Post Brain Korea 21. O.L. would like to gratefully acknowledge support from NSF DBI-0547695, NIH GM066099, and the March of Dimes MOD FY06-371.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Jonathan Wren

{dagger}The authors wish it to be known that, in their option, the first two authors should be regarded as joint First Authors. Back

1 We represent scalars as lower case, vectors as boldface lower case and martrices are uppercase. 0 (or 1) is a vector or matrix of variable-dependent size containing of all zeros (or ones). Back

Received on May 4, 2006; revised on August 24, 2007; accepted on October 8, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 GRAPH-BASED LEARNING
 3 GRAPH SHARPENING
 4 GRAPH INTEGRATION
 5 FUNCTION PREDICTION...
 6 CONCLUSION AND DISCUSSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Adai AT, et al. Connecting the protein structure universe by using sparse recurring fragments. J. Mol. Biol (2004) 340:179–190.[CrossRef][Web of Science][Medline]

    Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol (1990) 215:403–410.[CrossRef][Web of Science][Medline]

    Belkin M, et al. Regularization and Semi-supervised Learning on Large Graphs. In Proceedings of the 17th Annual Conference on Learning Theory (COLT) 3120. In: Lecture Notes in Computer Science (2004) Springer. 624–638.

    Chapelle O, et al. Cluster kernels for semi-supervised learning. In: Advances in Neural Information Processing Systems (NIPS) 15—Becker S, et al, eds. (2003) Cambridge, MA: MIT Press. 585–592.

    Chapelle O, et al. Semi-Supervised Learning (2006) Cambridge, MA: MIT Press.

    Chung FRK. Spectral Graph Theory. In: Number 92 in Regional Conference Series in Mathematics (1997) Providence, RI: American Mathematical Society.

    Deng M, et al. An integrated probabilistic model for functional prediction of proteins. J. Comp. Biol (2003) 11:463–475.

    Friedberg I. Automated protein function prediction–the genomic challenge. Brief. Bioinformatics (2006) 7:225–242.[Abstract/Free Full Text]

    Friedberg I, Godzik A. Connecting the protein structure universe by using sparse recurring fragments. Structure (2005) 13:1213–1224.[Medline]

    Friedberg I, et al. Jafa: a protein function annotation meta-server. Nucleic Acids Res (2006) 34(Web server issue):W379–W381.[Abstract/Free Full Text]

    Gribskov M, Robinson NL. Use of receiver operating characteristic (roc) analysis to evaluate sequence matching. Comput. Chem (1996) 20:25–33.[CrossRef][Web of Science][Medline]

    Hishigaki H, et al. Assessment of prediction accuracy of protein function from protein-protein interaction data. Yeast (2001) 18:523–531.[CrossRef][Web of Science][Medline]

    Hobohm U, Sander C. Enlarged representative set of protein structures. Protein Sci (1994) 3:522–524.[Web of Science][Medline]

    Hou JT, et al. Global mapping of the protein structure space and application in structure-based inference of protein function. Proc. Natl Acad. Sci. USA (2005) 102:3651–3656.[Abstract/Free Full Text]

    Kanehisa M, et al. The KEGG resources for deciphering genome. Nucleic Acids Res (2004) 32:D277–D280.[Abstract/Free Full Text]

    Kristensen DM, et al. Recurrent use of evolutionary importance for functional annotation of proteins based on local structural similarity. Protein Sci (2006) 15:1530–1536.[CrossRef][Web of Science][Medline]

    Lanckriet GRG, et al. Kernel-based data fusion and its application to protein function prediction in yeast. In: In Proceedings of the Pacific Symposium on Biocomputing (PSB) (2004a) 9:300–311.

    Lanckriet GRG, et al. A statistical framework for genomic data fusion. Bioinformatics (2004b) 20:2626–2635.[Abstract/Free Full Text]

    Laskowski RA, et al. Profunc: a server for predicting protein function from 3d structure. Nucleic Acids Res (2005) 33:W89–W93.[Abstract/Free Full Text]

    Lee I, et al. A probabilistic functional network of yeast genes. Science (2004) 306:1555–1558.[Abstract/Free Full Text]

    Lichtarge O, et al. An evolutionary trace method defines binding surfaces common to protein families. J. Mol. Biol (1996) 257:342–358.[CrossRef][Web of Science][Medline]

    Lisewski AM, Lichtarge O. Rapid detection of similarity in protein structure and function through contact metric distances. Nucleic Acids Res (2006) 34:e152.[Abstract/Free Full Text]

    Noble WS, et al. Idetifying remote protein homologs by network propagation. FEBS J (2005) 272:5099–5100.[CrossRef][Medline]

    Pal D, Eisenberg D. Inference of protein function from protein structure. Structure (2005) 13:121–130.[Medline]

    Schwikowski B, et al. A network of protein-protein interactions in yeast. Nat. Biotechnol (2000) 18:1257–1261.[CrossRef][Web of Science][Medline]

    Shin H, et al. Graph-based semi-supervised learning with sharper edges. (2006) Proceedings of the 17th European Conference on Machine Learning (ECML 2006). Lecture Notes in Computer Science 4212. Springer, Berlin-Heidelberg. 402–413.

    Tsuda K, et al. Fast protein classification with multiple networks. Bioinformatics (2005) 21:59–65.[Web of Science]

    Uetz P, et al. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature (2000) 403:623–627.[CrossRef][Medline]

    Vert JP, Kanehisa M. Graph-driven features extraction from microarray data using diffusion kernels and kernel CCA. In: Advances in Neural Information Processing Systems 15—Becker S, et al, eds. (2003) Cambridge, MA: MIT Press. 1425–1432.

    von Mering C, et al. Comparative assessment of large-scale data sets of protein-protein interactions. Nature (2002) 417:399–403.[Medline]

    Watson JD, et al. Predicting protein function from sequence and structural data. Curr. Opin. Struct. Biol (2005) 15:275–284.[CrossRef][Web of Science][Medline]

    Yona G, et al. Protomap: automatic classification of protein sequences, a hierarchy of protein families, and local maps of the protein space. Proteins Struct. Funct. Genet (1999) 37:360–678.[CrossRef][Web of Science][Medline]

    Zhou D, et al. Learning with local and global consistency. In: In Advances in Neural Information Processing Systems (NIPS) 16 (2004) Cambridge, MA: MIT Press. 321–328.

    Zhu J, Weng Z. Fast: a novel protein structure alignment algorithm. Proteins (2005) 14:417–423.[CrossRef]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Supplementary Data
Right arrow All Versions of this Article:
23/23/3217    most recent
btm511v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Shin, H.
Right arrow Articles by Lichtarge, O.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Shin, H.
Right arrow Articles by Lichtarge, O.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?