Skip Navigation


Bioinformatics Advance Access originally published online on November 28, 2007
Bioinformatics 2007 23(24):3364-3373; doi:10.1093/bioinformatics/btm520
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/24/3364    most recent
btm520v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Chua, H. N.
Right arrow Articles by Wong, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chua, H. N.
Right arrow Articles by Wong, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

An efficient strategy for extensive integration of diverse biological data for protein function prediction

Hon Nian Chua 1,*, Wing-Kin Sung 2 and Limsoon Wong 2

1Graduate School for Integrative Sciences and Engineering and 2School of Computing, National University of Singapore, Singapore

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: With the increasing availability of diverse biological information, protein function prediction approaches have converged towards integration of heterogeneous data. Many adapted existing techniques, such as machine-learning and probabilistic methods, which have proven successful on specific data types. However, the impact of these approaches is hindered by a couple of factors. First, there is little comparison between existing approaches. This is in part due to a divergence in the focus adopted by different works, which makes comparison difficult or even fuzzy. Second, there seems to be over-emphasis on the use of computationally demanding machine-learning methods, which runs counter to the surge in biological data. Analogous to the success of BLAST for sequence homology search, we believe that the ability to tap escalating quantity, quality and diversity of biological data is crucial to the success of automated function prediction as a useful instrument for the advancement of proteomic research. We address these problems by: (1) providing useful comparison between some prominent methods; (2) proposing Integrated Weighted Averaging (IWA)—a scalable, efficient and flexible function prediction framework that integrates diverse information using simple weighting strategies and a local prediction method. The simplicity of the approach makes it possible to make predictions based on on-the-fly information fusion.

Results: In addition to its greater efficiency, IWA performs exceptionally well against existing approaches. In the presence of cross-genome information, which is overwhelming for existing approaches, IWA makes even better predictions. We also demonstrate the significance of appropriate weighting strategies in data integration.

Contact: hnchua{at}i2r.a-star.edu.sg

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Following the success of genome sequencing, much work has been done towards deriving greater understanding of the complex functional mechanism of gene products. Over the past few years, an arsenal of automated function prediction tools has been developed through the collective efforts of many researchers. These tools help accelerate elucidation of functional machinery by providing systematic identification of potential novel information for experimental verification. While initial approaches derive predictions from specific information such as sequence similarity (Khan et al., 2003; Martin et al., 2004) and protein interactions (Chua et al., 2006; Deng et al., 2003; Letovsky and Kasif, 2003; Vazquez et al., 2003), an ever-increasing flood of diverse biological information from concerted efforts in genomic and proteomic research has triggered the advancement of prediction approaches towards data integration.

Many approaches have been proposed to this end, and these can be generally divided into two groups: The first group focuses on the prediction of a few general function categories (Deng et al., 2004; Lanckriet et al., 2004; Troyanskaya et al., 2003; Tsuda et al., 2005), while the other group targets at predicting more specific functional annotations from the GO (Ashburner et al., 2000) for proteins (Chen and Dong, 2004; Karaoz et al., 2004; Xiong et al., 2006). Within the first group, Lanckriet et al. (2004) have been shown to perform the best. However, little comparisons have been made between methods in the second group, or between the two groups, making the evaluation of existing methods difficult.

In general, methods that take the first approach make use of more computationally demanding methods, and perform better but require more time to make the same number of predictions given the same inputs. Methods from the second group make use of less computationally demanding optimization as well as network-based approaches. These methods can scale better to larger amount of data, as well as larger number of data sources, and are able to make predictions for a larger number of functional annotations, such as the controlled vocabulary used in the GO.

Nonetheless, all these methods employ some form of machine-learning or optimization methods such as Bayesian Networks (Troyanskaya et al., 2003), Markov Random Field (Deng et al., 2004), Support Vector Machines (Lanckriet et al., 2004), Convex Optimization (Tsuda et al., 2005), Hopfield Network (Karaoz et al., 2004), Simulated Annealing (Chen and Dong, 2004) and Artificial Neural Networks (Xiong et al., 2006). This limits the scalability of each approach to larger datasets depending on their complexity. Here, we refer to complexity as a combination of: (1) the number of proteins to be predicted with annotations; (2) the number of possible annotations to be predicted; (Equation (3) the number of data sources used for prediction and (4) the number of proteins described by each data source.

With the rapidly increasing amount of biological data available, the performance differences that can be gleaned from using a more sophisticated optimization method is likely be overshadowed by the ability to make use of larger and more data sources. In fact, a recent study has shown that the use of global optimization may not actually yield significant improvement over simpler local prediction methods (Murali et al., 2006). This propelled us to look at protein function prediction based on fusing more data using a simple local prediction method. We refer to local prediction as making predictions based on direct evidence, as opposed to using propagated information (Deng et al., 2004; Karaoz et al., 2004; Troyanskaya et al., 2003) or optimizing the overall consistency of all annotations (Chen and Dong, 2004; Lanckriet et al., 2004; Tsuda et al., 2005).

Another issue is for less well-studied genomes with limited amount of related biological data. It is important to update the function predictions as long as more data is available. Hence, efficient method that provides constantly updated predictions—based on a combination of not only heterogeneous, but also cross-genomic, sources of information—will be extremely useful to biologists studying protein functions.

Here, we propose a very simple method, Integrated Weighted Averaging (IWA), which uses a uniform framework for combining multiple sources of evidence. Our method involves three steps: (1) each evidence source is first assessed with a reliability score based on their ability to infer function for proteins in the training data.; (2) each data source is then modeled into an undirected graph with proteins as vertices and relationships between proteins as edges and (Equation (3) finally, graphs from each data source is combined to form a more complete graph with edge weights computed from the reliability of the data sources. Each protein is predicted with annotations based on a simple weighted voting function that involves only its neighbors in the graph. The simplicity of the approach makes it very scalable to large amount of data. Each data source can be dynamically re-weighted as more data becomes available. Since predictions only depend on the local neighborhood in the final graph, predictions can be made on-the-fly for each protein simply by combining all related information in each data source.

We show here that, despite the simplicity of its formulation, our method performs relatively well against existing approaches. We also show that our method can include a large amount of data, including cross-genome information, to make much better predictions. Perhaps most importantly, the nature of our method makes it possible to be integrated into online applications to provide up-to-date predictions as data sources grow.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
Lee et al. (2004) used a unified log-likelihood scoring function to combine several sources of binary gene relationship data into a graph, which can be clustered into groups that show strong similarity in function. This illustrates the fact that different data source has different degree of correlation with function similarity. Hence to achieve effective integration, it is necessary to assign weights to each data source based on some common yardstick (i.e. function similarity). We adopt a similar model in our approach.

Figure 1 illustrates our approach to integrating multiple data sources using a uniform weighting scheme. Each data source is modeled as an undirected graph G = <V, E >, where V and E are the set of vertices and edges in the graph G, with each vertex representing a protein and each edge (u, v) representing a relationship between proteins u and v. Graphs G1, G2 and G3 in the figure depict graphs that represents three data sources. The edges in each graph may have a different weighting scheme (e.g. G1 and G2), or may be unweighted (e.g. G3). Edge weights from each data source are first discretized into uniform intervals [described later in Equation (1)], and subsequently re-weighted using a common benchmark, i.e. consistency with known annotations, described later in Equation (2).


Figure 1
View larger version (16K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Uniform weighting scheme to combine different data sources. G1, G2 and G3 are graphs representing three data sources. Each node is a protein, while each edge is a binary relationship. Initial edge weights from each data source are discretized into intervals using Equation (1), and reweighted into common weight that is consistent across different data sources using Equation (2). G1, G2 and G3 are then combined to form the final graph G'. Edge weights in G' are computed using Equation (3). Weights are derived separately for each function.

 
Multiple graphs derived from different data sources in this way are combined to form a larger and more complete graph G'. Each edge in G' can be weighted based on the data sources that contributed to the edge [described later in Equation (Equation (3)]. Each edge weight estimates the confidence in which a particular function will be shared between two proteins.

2.1 Discretization of data source with existing scoring functions
As mentioned earlier, some data sources have pre-computed scores. For example, for a sequence homology dataset generated from BLAST searches (Altschul et al., 1990), we not only know that two proteins u and v form an edge, but also the E-scores from BLAST. Protein pairs defined by sequence similarity with lower E-scores are more likely to indicate function similarity. To provide a better estimation of edge weights, we subdivide these data sources into subtypes. Given a set of edges E from a data source k where both vertices of each edge in E have at least one functional annotation, we subdivide E into subtypes using a straightforward approach:

  1. Edges in E are parsed to find the maximum and minimum scores, Sk,max and Sk,min, respectively;
  2. Edges in E are sorted into n bins, b1, ... , bn, of equal intervals between Sk,min and Sk,max;
  3. Each bin bi is used as a different subtype for which confidence will be evaluated individually using Equation (1);
  4. Given an observation, Oe,k,S, of edge e from data source k with score S, its subtype or bin will be determined by:


    Formula 1

    (1)

  5. If S ≥ Sk,min, the confidence of e based on observation Oe,k,S is estimated by the confidence of the subtype defined by the bin identified by BinIndexk(S).
  6. Since Sk,min is determined on the training data based on edges where both vertices are annotated, it is possible for S to be smaller than Sk,min. If S < Sk,min, the confidence of e based on observation Oe,k,S is taken to be 0 since there is no training data to estimate its confidence.

Note that no assumption is made on the range or nature (e.g. positive, negative or no correlation with function similarity, linear or parametric, etc.) of the pre-computed scores.

In our experiments, we use n = 20 for all data sources. This is a naïve way of selecting n, since the number of pairs in each data source can vary greatly. Pooling confidence scores from a large number of pairs into a small number of bins will definitely lead to a loss of resolution. This is especially apparent when genome-scale datasets, such as expression measurements are used. While we simplify our experiments by using a fixed value for n, the value of n should be made a tunable variable in a real-world implementation for greater flexibility to further improve predictions.

2.2 Estimating the confidence of data sources
Edges defined by different types of evidence have varying probability of sharing common function of different nature. For example, edges defined by sequence homology may have higher probability of sharing common function than that defined by microarray gene expression. Even for the same type of data like protein–protein interaction data, different experiments may have different probabilities subjected to factors such as noise, environment and the nature of the procedures used for each experiment. Correlation with function similarity not only varies with the nature of the data, but also with the nature of the function. For example, sequence similarity is more likely to indicate sharing of molecular functions rather than biological processes. Moreover, some types of evidence may embed scoring information. Edges with different scores may differ significantly in the degree of correlation with function similarity. Hence, data sources may be further subdivided into subsets based on available information such as experimental source or embedded scores.

To capture these variations, we evaluate the confidence of each data source, as well as their subsets separately for each function. The probability that a data source k transfers function f, is estimated using:


Formula 2

(2)
where Ekf is the subset of edges of data source k where each edge has either one or both of its vertices annotated with function f, and both its vertices has at least one functional annotation;

Sf(u,v) = 1 if u and v shares function f, 0 otherwise.

When |Ek,f| is small, the variance of p(k,f) is high. Hence, a pseudo count of 1 is added to the denominator. Table 1 illustrates p(k,f) computed for one GO annotation and a variety of data sources from Dataset B described in Section 2.5.


View this table:
[in this window]
[in a new window]

 
Table 1. Examples of data types and their computed confidence for the GO term GO:0006402 (mRNA catabolism)

 
2.3 Estimating the confidence of an edge in the combined graph
After the confidence of edges in the graph representing each data source is derived, these graphs can be combined into a larger, more complex graph G' that contains all edges and nodes in the component graphs. Essentially, two nodes in G' are connected if and only if they are connected in some of the component graphs. The confidence of each edge (u,v) in G' for each function f is estimated by the subtypes in which (u,v) is observed:


Formula 3

(3)
Du,v is the set of subtypes of data sources that contains edge (u,v).

2.4 Assigning the score of an annotation to a protein
Function f is assigned a score Sf(u) for protein u using a weighted averaging method, defined by:


Formula 4

(4)

Sf(u) is the score of function f for protein u;

ef(v) = 1 if protein v has function f, 0 otherwise

Nu is the set of proteins that are linked by an edge to protein u;

ru,v,f is the link confidence between protein u and protein v.

2.5 Datasets
Due to the limited scalability of some approaches, comparison between different approaches is done using two separate datasets:

2.5.1 Dataset A
The first dataset is a popular benchmark that is first used in Deng et al. (2004) and subsequently in Lanckriet et al. (2004). This dataset is available online at http://www-hto.usc.edu/msms/IntegrateFunctionPrediction/. The dataset comprises a total of 6355 yeast proteins, of which 3588 are annotated with one or more of 12 functional annotations from the most general level of functional classes from the Munich Information Center for Protein Sequences (MIPS). The 12 functional classes are shown in Table 2.


View this table:
[in this window]
[in a new window]

 
Table 2. Twelve functional classes from MIPS

 
Function is predicted for these annotated proteins by combining four sources of evidence: MIPS genetic and physical interactions, Tandem Affinity Precipitation (TAP) protein complexes, Pfam (Bateman et al., 2004) domains and gene expression correlations:
  1. Protein–protein interaction. There are a total of 2448 unique protein pairs involving 1884 proteins defined by the MIPS physical and genetic interaction datasets.
  2. Protein complexes. The protein complex information from the TAP dataset yields 30 731 unique pairs among 1354 proteins.
  3. Pfam domains. The Pfam domain dataset contains 28 616 unique protein pairs that share at least one Pfam domain.
  4. Expression correlation. A total of 1366 unique protein pairs with highly correlated (Pearson's correlation coefficient ≥0.8) expression profiles involving 585 proteins are extracted from the Spellman cell cycle microarray data (Spellman et al., 1998).

These datasets are provided to GAIN and GUMP as unweighted binary pairs. Following the procedures outlined in Xiong et al. (2006), predictions using GUMP is repeated over a range of parameters to find the best parameters for the dataset.

2.5.2 Dataset B
The second dataset is compiled using up-to-date information from public databases. Functional annotations are taken from nearly the entire set of Gene Ontology (GO) annotations (Ashburner et al., 2000). Predictions are validated separately for each of the three GO namespaces: molecular function, biological process and cellular component.

Gene Ontology (GO) annotations are arranged in a hierarchical manner. Defining the three base namespaces ‘molecular_ function’, ‘biological_process’ and ‘cellular_component as level 0, there are 19 655 terms up to 15 levels of annotation. Lower level terms are more generic while higher level terms are more specific. The three base categories, obsolete terms, as well as three other vague terms: ‘GO:0005554 molecular function unknown’, ‘GO:0000004 biological process unknown’ and ‘GO:0008372 cellular component unknown’, are excluded. Annotations with evidence code ‘IEA (Inferred from Electronic Annotation) are also excluded since they are derived using automated methods without human involvement. Annotations from Uniprot and PDB are also excluded as these have relatively few non-IEA coded annotations and may involve proteins that overlap with annotations for specific genomes.

GO annotations follows the ‘true path’ rule, i.e. a protein that is annotated with a GO term is also annotated with all its ancestor terms. A child term is a more specific subset of the parent term. A function that is well studied can have many more descendant terms than one that is less known. Hence if all GO terms are used for validation, better studied functions will be given much greater weight during performance evaluation and may result in biased conclusions. One simple way to address this problem is to consider GO terms from a particular level in the hierarchy. However, due to differences in nature of the GO terms, the same level in the ontology may not be uniformly reflective of the specificity of the terms. Moreover, some terms may not have sufficient annotations for the correspondingly computed statistical measure to be conclusive. To avoid the above problems, we adopt the concept of informative Functional Class (Chua et al., 2006; Zhou et al., 2002) to selectively identify GO terms for validation.

A GO term is defined to be informative if there is: (1) at least 30 proteins annotated with it; and (2) had no child terms with annotated with at least 30 proteins. Only informative terms are used for prediction performance validation. This ensures that terms used for validation has a reasonable number of annotations and do not have overlapping descriptions. The definition of informative GO terms also means that the most specific descendant terms that can be conclusively validated are selected.

Predictions are done separately for each namespace. There are 56, 105 and 43 informative GO terms for the namespaces ‘molecular_function’, biological_process’ and ‘cellular_component, respectively. While all GO terms are used during prediction, only informative terms are used in the computation of validation scores.

Functional association information is obtained from seven data sources:

  1. Sequence homology. Protein sequences are downloaded from the GO database (http://archive.godatabase.org). Each yeast sequence is aligned with all other sequences using BLAST. The top five hits with an E-value ≤ 1 is used to define binary relationships. This yields 9736 distinct protein pairs among 4376 proteins when BLAST is performed only against yeast sequences, and 23 282 distinct protein pairs among 19 985 proteins when BLAST is performed against all sequences.
  2. Protein–protein interactions. Interaction data for yeast proteins is obtained from a recent release of BIOGRID (Breitkreutz et al., 2003). There are a total of 50 434 distinct interaction pairs between 5 298 yeast proteins.
  3. Pfam domains. Pfam domain information of the sequences is extracted from the SwissPfam database at (http://www.sanger.ac.uk/Software/Pfam/ftp.shtml). The SwissPfam database contains precomputed Pfam domains for SwissProt and TrEMBL proteins with an E-value threshold of 0.01. A total of 129 541 unique pairs between 23 298 proteins are obtained.
  4. Pubmed abstracts. Pubmed abstracts are obtained by searching each protein's name and aliases on NCBI Entrez Pubmed (http://www.ncbi.nlm.nih.gov/entrez/). Only the first 1000 abstracts returned are used. For each protein u, the names and aliases of every other protein v from the same genome are then searched in the abstracts associated with protein u. A relationship is defined between protein u and v if v is found in these abstracts. A total of 43 678 distinct pairs between 4275 yeast proteins are obtained.
  5. Predicted interactions. Predicted interactions are obtained from the Search Tool for the Retrieval of Interacting Genes/Proteins (STRING) database at http://string.embl.de/ (Snel et al., 2000). There are a total of 145 003 distinct protein pairs involving 4424 yeast proteins.
  6. Gene expression data. Two widely used gene expression profiles are obtained from Eisen et al. (1998) and the Rosetta Compendium (Hughes et al., 2000). Gene pairs with a correlation score ≥0.7 are used.

This dataset is available publicly at our Supplementary website http://srs2.bic.nus.edu.sg/~kenny/integration. Specific details on this dataset are provided in the Supplementary Material.

2.6 Scoring functions
Integrated Weighted Averaging (IWA) takes weighted binary association as inputs. We use the following scoring functions to weight different type of data:

  1. BLAST results. The negative log E-scores between each protein pair is used as the score of that pair. For pairs with zero E-score, a score of 999 is used to avoid an infinity score.
  2. Protein–protein interactions. The FS-Weight (Chua et al., 2006) has been shown to provide a good estimate of functional similarity between interacting protein pairs (direct interactions), as well as between protein pairs that do not interact but share common interaction partners (indirect interactions). To keep our comparison simple, we only use direct interaction pairs here. Each interacting protein pair is scored using a simplified variant of the FS-Weight measure:


Formula 5

(5)
Np refers to the set that contains p and its interaction neighbors.

(3) PFam domains. The protein pairs are scored by the number of common domains they share:


Formula 6

(6)
Dk is the set of Pfam domains found in protein k.

(4) Pubmed abstracts. The relationship between u and v is scored as follows:


Formula 7

(7)
Ax is set of Pubmed abstracts that contain protein x.

(5) Gene expression profiles. Each pair of genes is scored using the Pearson's correlation coefficient between their expression profiles. Gene pairs with correlation below 0.7 are discarded. For gene expression profiles in Dataset A, gene pairs with correlation ≥0.8 are weighted with 1, while others are discarded.

2.7 Validation methods
2.7.1 Dataset A
For comparisons on Dataset A, validation is done following the experimental procedures stipulated in Deng et al. (2004). The 3588 annotated proteins are predicted using three repetitions of 5-fold cross-validation. The area under the Receiver Operating Characteristics (Gribskov and Robinson, 1996) graph is computed for each functional class and averaged over the predictions in all 15 folds. A perfect classifier for a class will have an ROC score of 1 for the class, while a random classifier will yield an ROC score close to 0.5. Validation results on the dataset using Deng et al.'s Markov Random Field and Lanckriet et al.'s SDP/SVM methods are taken from Lanckriet et al. (2004), while the other methods are run and validated using available implementations from the authors.

2.7.2 Dataset B
For each GO namespace, we perform 10-fold cross-validation on 5448 annotated yeast proteins from SGD and validate each method using only the informative GO terms (Zhou et al., 2002). The annotated proteins are randomly divided into 10 equal-sized groups. Each time the annotations for proteins from a group are hidden and predicted using annotations for proteins from the other nine folds as training data. Only annotations and functional linkage data involving yeast proteins are used in this experiment due to the memory limitations of GeneFAS on our machine. Two measures are used to measure performance here:

(1) Receiver operating characteristics

The Receiver Operating Characteristics (ROC) graph is a popular measure of the discriminative abilities of a classifier. Here, we compute the area under the ROC graph of each informative GO term. Due to the large number of terms involved, we compare the different approaches by plotting the number of informative GO terms that can be predicted with ROC scores better or equal to various ROC thresholds.

(2) Precision-recall analysis

While ROC captures the predictive power for each GO term, it does not tell us how well the prediction scores of a method reflect the confidence of predictions. A method that can separate proteins with a particular function from proteins that do not have the function will yield a high ROC score for that function. However, it may assign scores that do not reflect the confidence of a prediction uniformly across different functional terms. For example, a prediction score of 0.6 may indicate a likely true positive for one term, but the same value may also indicate a likely false positive for another. This makes it difficult for a user to interpret prediction results. To capture how well the prediction scores of a method reflect the confidence of predictions, we adopt the precision versus recall analysis used in Deng et al. (2004) and Chen and Dong (2004).

Precision and recall are defined as follows:


Formula 8

(8)

Proteins 1, ... , K are annotated proteins

ki is the number of functions correctly predicted for protein i;

mi is the number of functions predicted for protein i and

ni is the number of functions annotated for protein i.

Using varying thresholds on prediction scores, a range of precision and recalls can be plotted for each method. Only informative GO terms are used in the computation of precision and recall.

2.8 Existing methods
We compare IWA with the following approaches:

2.8.1 Markov Random Field
Deng et al. (2004) use global optimization method based on Random Markov Fields and belief propagation to compute a probability that a protein has a function given the functions of all other proteins in the interaction dataset. Similar approaches have been used for predicting protein functions from protein–protein interactions. (Deng et al., 2003; Letovsky and Kasif, 2003; Vazquez et al., 2003).

2.8.2 Fusion Kernels
Lanckriet et al. (2004) use Semi-Definite Programming (SDP) to combine heterogeneous data sources for function prediction using Support Vector Machines (SVM). A separate kernel is generated from each data source using customized techniques. SDP is then used to obtain an optimal combination of the kernels for SVM learning. From known comparisons with some other works (Deng et al., 2004; Tsuda et al., 2005), Fusion Kernels has been shown to perform favorably. However, this method is computationally complex and does not scale well to large number of annotations. Hence, we will only include it in comparisons using Dataset A.

2.8.3 GAIN
GAIN (Karaoz et al., 2004) models an input functional linkage network as a discrete-state Hopfield network in which function assignments are propagated to achieve globally consistent annotations. In our experiments, we use GAIN-1.8, which is publicly available at http://bioinformatics.cs.vt.edu/~murali/software/gain/. Following the description from Karaoz et al. (2004), gene pairs from gene expression data are weighted using Pearson Correlation. Protein pairs from other datasets are given a weight of 1. As GAIN takes a single functional linkage network as an input without differentiating between data sources, we do not use the scoring functions described in Section 2.6 for each data source.

2.8.4 GUMP
GUMP (Xiong et al., 2006) extracts feature vectors for each protein based on the functions of associated proteins and the corresponding sources of evidence. The extracted feature vectors are then trained using artificial neural networks. GUMP does not use any weighting scheme. In our experiments, we use the MATLAB implementation of GUMP that is available with the online publication at http://www.biomedcentral.com/1471-2105/7/268. Following the procedure described by the authors, experiments are repeated while varying parameter values over a given range to obtain optimal parameters.

2.8.5 GeneFAS
GeneFAS (Chen and Dong, 2004) combines information from different data sources using a probabilistic approach. For each data source, the probability that a protein pair from that data source shares a particular function is estimated. This is done for each function and data source. The probability of a protein having a function is then computed by combining the pre-computed probabilities for all associated proteins using a naïve Bayesian method. This is referred to as local prediction. Using these local predictions as weights, a customized simulated annealing method is then used to achieve global optimization. GeneFAS accepts three kinds of data types: unweighted protein–protein interactions, phylogenetic profiles and microarray data. In our experiments, gene expression datasets are provided to GeneFAS as microarray data, while other datasets are provided to GeneFAS as unweighted protein–protein interactions. The GeneFAS software is publicly available at http://digbio.missouri.edu/software/genefas/.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
3.1 Comparison using Dataset A
All the methods described in Section 3.1 are compared using Dataset A. Figure 2 shows that averaged Receiver Operating Characteristics (ROC) for the 12 MIPS functions predicted using: (1) Markov Random Field (Deng et al., 2004); (2) SVM/SDP kernel methods (Lanckriet et al., 2004); (Equation (3) GUMP; (4) GAIN; (5) GeneFAS and (6) Integrated Weighted Averaging (IWA).


Figure 2
View larger version (38K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Average ROC scores for predicting annotated yeast proteins with 13 MIPS functional classes using 7 different approaches: (1) Markov Random Field; (2) Fusion Kernels; (3) GUMP; (4) GAIN; (5) Integrated Weighted Averaging (IWA) and (6) Integrated Weighted Averaging with newer datasets (IWA*)

 
Lanckriet et al.'s SVM/SDP kernel method performs the best, followed by GeneFAS, GAIN and IWA, which performed similarly. GUMP falls considerably behind, and MRF yielded the lowest ROC scores. Given the fact that it neither makes use of optimization methods nor functional propagation, IWA performs rather well.

To illustrate the impact of including more diverse and more up-to-date information as emphasized earlier, we also made predictions using IWA using more and newer data sources, which include cross-genomic BLAST results and cross-genomic PFam information (see Section 3.2 for complete description). As we only have GO functional annotations for other genomes, these are mapped to the 12 MIPS functions using MIPS2GO mappings from MIPS. Only 7 of the 12 MIPS functions have mappings to GO. The corresponding ROC scores are presented as IWA* in Figure 3. Using these additional information, IWA yielded predictions that outperforms all other predictions made in 11 of the 12 functions.


Figure 3
View larger version (41K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. (a) The number of informative GO from: (i) molecular function (top); (ii) biological process (middle) and (iii) cellular component (bottom) that can be predicted better or equal to various thresholds using data from six heterogeneous sources with GeneFAS, GAIN, Integrated Weighted Averaging and Integrated Weighted Averaging with cross-genomic information (left). (b) Precision versus recall of predictions made using data from six heterogeneous sources with GeneFAS, GAIN, Integrated Weighted Averaging and Integrated Weighted Averaging with cross-genomic information (right).

 
3.2 Comparison using Dataset B
Despite having superior prediction performance, complex optimization techniques are less well suited to large-scale predictions involving large number of more specific annotations. This is addressed in several other works (Chen et al., 2005; Karaoz et al., 2004; Xiong et al., 2006). These approaches focused on making predictions using comprehensive annotations from the controlled vocabulary in the GO. To investigate the performance of IWA in such large-scale predictions, we compare it against GAIN and GeneFAS using Dataset B. Details on the dataset are provided in the Supplementary Material. We only compare with these two methods because these are the only methods here that provide scalable solutions for integrating generic information types. GUMP is not included in this comparison as it will take more than reasonable time to obtain optimal parameter values for such a large dataset (the neural network needs to be retrained 6 times for a scaling parameter and 11 times for the number of hidden nodes). In this comparison, GO functions are predicted for yeast proteins using more recent datasets, some of which involves cross-genome information. However, annotations from other genomes will not be used in the comparison as GeneFAS was unable to run on our machine with these included.

Figure 3 (left column) shows the number of informative GO terms that can be predicted with an ROC score above or better than various threshold using the two approaches. IWA performs the best out of the three approaches, indicating that it has better discriminative ability over the other two approaches. Given the limited size of datasets and the generality of the functions used in Dataset A, the lack of a unified weighting scheme for individual data source did not noticeably affect GAIN's performance relative to GeneFAS and IWA. However, with the bigger datasets and more specific functional terms used in Dataset B, the consequence of this limitation becomes more apparent.

Figure 3 (right column) shows the precision-recall analysis of the two approaches for each GO namespace. IWA obtains significantly higher precision over the entire recall range compared to GeneFAS and GAIN. This indicates that the prediction scores computed by IWA reflect the confidence of the predictions much better than the other two approaches across all informative GO terms in each namespace. This means that the prediction scores of IWA are more consistent between different terms, making it easier for users to interpret prediction results. Interestingly, while no propagation of functional assignments are made in IWA, the recall of its predictions is not in any observable sense inferior to the two other approaches. This indicates that the effectiveness of functional propagation could be rather limited given sufficient data.

To show that using only informative GO terms for the evaluation of prediction performance do not introduce significant bias, we also repeat the evaluation using only level-3 GO terms. The corresponding average ROC scores for the four methods, when evaluated using informative and level-3 GO terms respectively, are presented in Table 3. IWA achieved the highest average ROC scores when validation is done using both informative and level-2 GO terms. The ROC scores computed for level-3 GO terms follow a similar trend with those computed for informative GO terms, indicating that the use of informative GO terms does not introduce any bias to the conclusion while ensuring that validation results are statistically conclusive. Substantially higher ROC is achieved when cross-genomic information is used with IWA.


View this table:
[in this window]
[in a new window]

 
Table 3. Average ROC score for predictions made by GeneFAS, GAIN, Integrated Weighted Averaging and Integrated Weighted Averaging with cross-genomic information when validated using (a) informative GO terms; and (b) level-3 GO terms. IWA performs better than GAIN and GeneFAS for all three GO namespaces in both validations (highlighted in bold)

 
3.2.1 Phylogenetic profiles
Since the optimal data types for GeneFAS are protein–protein interactions, microarray data and phylogenetic profiles, we also compare the three approaches using only these data types. Protein–protein interactions and microarray data are used as described earlier. Phylogenetic profiles across 24 different species for yeast proteins are obtained from the authors of GeneFAS. The phylogenetic profiles are provided to GeneFAS without further processing. For IWA and GAIN, each pair of genes is scored using the absolute Pearson's correlation coefficient between their phylogenetic profiles. Gene pairs with correlation below 0.7 are discarded.

As GeneFAS takes an exceptionally long time to process the phylogenetic profiles, we only perform the comparison on informative GO terms from the biological process namespace. Figure 4 shows the ROC and precision versus recall graphs for each method. The results are consistent with the previous experiments, suggesting that IWA performs better that GAIN and GeneFAS.


Figure 4
View larger version (19K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. (a) The number of informative GO from biological process that can be predicted better or equal to various thresholds using data from six heterogeneous sources with GeneFAS, GAIN and Integrated Weighted Averaging (left). (b) Precision versus recall of predictions made using data from six heterogeneous sources with GeneFAS, GAIN and Integrated Weighted Averaging (right).

 
3.3 Computational time
Since the major benefit of IWA is efficiency, we would like to compare the computational efficiency of IWA, GeneFAS and GAIN. The theoretical complexity of each method is highly dependent on the topology of the input network and cannot be easily determined. Hence, we will simply compute the actual CPU time required by each method to complete the same prediction task described in Section 3.2.

The implementations of GeneFAS, GAIN and IWA are programmed in Java, C++ and Perl, respectively. Hence, GAIN may have a slight advantage since it is implemented in compiled code while the others are implemented in interpreted codes. Predictions using the three implementations are performed on the same machine, which is equipped with a single Pentium 4 CPU running at 3.0 GHz, 512 KB cache, and 4.0 GB RAM. We capture the CPU time in which each process takes by initiating each implementation through a Perl script and computing the user time taken by the child process. The corresponding time taken by each implementation prediction task is presented in Table 4.


View this table:
[in this window]
[in a new window]

 
Table 4. CPU user time taken by the implementation of GeneFAS, GAIN and Integrated Weighted Average to complete the same prediction task

 
GeneFAS and IWA involve time taken to process and weight each data source, which is reflected in Table 3 as training time. GAIN does not perform weighting for data sources and hence does not incur any training time. GeneFAS took significantly more time for training. IWA took substantially less time to make predictions (testing) compared to the other two implementations. It also took the least total time (training and testing) for the prediction task.

3.4 Using cross-genome information
To illustrate how the use of cross-genome information can help to further enhance the prediction results of IWA, we repeat the experiment in Section 3.2 using IWA without excluding information from other genomes. This includes BLAST searches, PFam domains sharing and GO annotations. The validation results using ROC and precision-recall analysis are included in Figure 3 using the label IWA*. For both validation measures, the prediction performance of IWA improved noticeably for informative GO terms from molecular function and biological process, and remained the same for cellular component. This is anticipated as the cross-sequence information available in our dataset is limited to sequence-based information, which is more reflective of molecular functions. Nonetheless, non-sequence-based information can be easily incorporated using the IWA framework, which will potentially improve prediction performance for functional terms from biological process and cellular component.

3.5 Contribution of individual data sources
Figure 5 (top) shows the percentage of known GO biological process annotations that can be suggested by different number of data sources. A significant percentage of known annotations (more than 80%) is suggested by three or more sources of data. This indicates that the various data sources overlap substantially. Figure 5 (bottom) shows the fraction of GO biological process annotations suggested by different number of data sources that correspond to known annotations, i.e. the precision of the suggested annotations. Annotations suggested by more data sources exhibit significantly higher precision. These observations exemplify the advantages of integrating heterogeneous data sources for protein function prediction.


Figure 5
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. (a) Percentage of known GO annotations for biological process that is suggested by different number of data sources (top); and (b) the fraction of suggested annotations by different number of data sources that coincide with known annotations (bottom) using seven data sources from: (1) BIOGRID; (2) PFAM; (3) PUBMED; (4) BLAST on multiple genomes (BLAST_ALL); (5) STRING; (6) expression correlations from Eisen et al.' s microarray data and (7) expression correlations from the Rosetta microarray data.

 
Figure 6 presents the precision-recall analysis predictions made by IWA using individual datasets as well as using a combination of all datasets described in Section 3.2. We observe that the predictive power of sequence homology is significantly higher for molecular functions compared to the two other GO namespaces. The gene expression information used does not provide much coverage. We also observe that combining all the data sources yields better performance than using any one alone. ROC analysis yields similar conclusions, and is omitted due to space constraint.


Figure 6
View larger version (30K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Precision versus recall of predictions made by Integrated Weighted Averaging for informative GO terms in: (a) molecular function (top); (b) biological process (middle) and (c) cellular component (bottom) using Integrated Weighted Averaging on binary associations from (1) BIOGRID; (2) PFAM; (3) PUBMED; (4) BLAST on multiple genomes (BLAST_ALL); (5) STRING; (6) expression correlations from Eisen et al.' s microarray data; (7) expression correlations from the Rosetta microarray data and (8) combination of 1–7.

 
3.6 Comparison with direct homology inference from BLAST
Another interesting observation we made from the experiments is that IWA can provide better predictions from BLAST homology search results than interpreting the search results directly. To illustrate this, we emulate a common way of function inference from BLAST results. Given an unknown protein, we perform BLAST on its sequence using an E-value cutoff of 1, and retrieve the top 5 hits. The GO terms associated with the protein in each hit are then assigned to the unknown protein using the negative log E-value of the result. Note the same information is used in Section 3.2 as an input for IWA. Two sets of BLAST searches are used: one against yeast proteins from SGD only; and the other against all proteins in the dataset. This procedure is used to predict molecular function GO terms for unannotated yeast proteins from SGD. Using the precision-recall analysis described earlier, we compare predictions made this way with predictions made using the same information with IWA. The results are presented in Figure 7. Using IWA yielded predictions with greater precision over most of the recall range. We also observe that the inclusion of cross-genome homology search substantially improves prediction performance. Similar conclusions can be drawn from predictions for biological process and cellular component.


Figure 7
View larger version (28K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 7. Precision versus recall of predictions made for informative GO terms from molecular function using: (1) function transfer from top five BLAST hits against yeast genome (BLAST_SGD TOP); (2) function transfer from top five BLAST hits against multiple genomes (BLAST_ALL TOP); (3) Integrated Weighted Averaging using binary associations from top five BLAST hits against yeast genome (BLAST_SGD); (4) Integrated Weighted Averaging using binary associations from top five BLAST hits against multiple genomes (BLAST_ALL) and (5) Integrated Weighted Averaging using binary associations from all sources (ALL SOURCES).

 
3.7 Contribution of weighting scheme
To illustrate the contribution of our weighting scheme, we repeat IWA with all available data: (1) using the complete weighting method described in Section 2; (2) without subdividing data sources into subtypes during weighting and (Equation (3) without using weighting. The corresponding precision-recall analysis of predictions made using all data sources for informative biological process GO terms is shown in Figure 8. We observe that without subdividing each data source based on pre-computed scores, precision dropped over the entire recall range. Furthermore, if weighting is completely omitted, precision falls significantly. Similar conclusions can be drawn from predictions for molecular function and cellular component. These observations exemplify the importance of applying appropriate weighting in the data fusion task.


Figure 8
View larger version (25K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 8. Precision versus recall of predictions made for informative GO terms from biological process with Integrated Weighted Averaging using: (1) complete weighting method; (2) weighting without subdividing data sources based on pre-computed scores and (3) no weighting.

 
3.8 Limitations of IWA
Like other protein function prediction methods that uses functional association between proteins (Chen and Dong, 2004; Deng et al., 2004; Karaoz et al., 2004; Troyanskaya et al., 2003; Tsuda et al., 2005; Xiong et al., 2006), IWA will not be able to make any predictions if no association is available, such as in the case of a novel genome with no known sequence or domain homology with known sequences. In these cases, ab initio function prediction approaches such as Jensen et al. (2002) may be useful. Alternatively, features such as predicted localization and post-translational information used in ab initio approaches may also be used to generate binary relationships between proteins in the novel genome and known proteins. This association information can then be used with association-based methods like the IWA. The feasibility of such an approach, however, is beyond the scope of this study.


    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We have presented here IWA, a simple yet effective framework for integrating large amount of diverse information for function prediction. While perfecting prediction techniques with increasingly complex methods may yield minor incremental improvements, creating a framework that is simple enough to scale up to both the diversity and sheer quantity of rapidly growing information is likely to create greater impact in proteomic research. Despite the simplicity of its formulation, IWA yields exceptional performance compared to state-of-the-art approaches. Moreover, it yields prediction scores that are more consistent across different functions, making them easy to interpret without further manipulation. Using our approach, we have shown that cross-genome information can be tapped to further improve prediction performance. Finally, we have also demonstrated the significance of applying suitable weighting in data fusion.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We would like to thank Dong Xu and T. M. Murali for their invaluable assistance on the GeneFAS and GAIN software, respectively. We would also like to thank the reviewers for their constructive comments. This work was supported in part by an A*STAR AGS scholarship and by an MOE AcRF Tier 1 grant.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Chris Stoeckert

Received on May 1, 2007; revised on October 12, 2007; accepted on October 12, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol (1990) 215:403–410.[CrossRef][Web of Science][Medline]

    Ashburner M, et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet (2000) 25:25–29.[CrossRef][Web of Science][Medline]

    Bateman A, et al. The Pfam protein families database. Nucleic Acids Res (2004) 32:D138–D141.[Abstract/Free Full Text]

    Breitkreutz BJ, et al. The GRID: the General Repository for Interaction Datasets. Genome Biol (2003) 4:R23.[CrossRef][Medline]

    Chen Y, Dong X. Global protein function annotation through mining genome-scale data in yeast Saccharomyces cerevisiae. Nucleic Acids Res (2004) 32:6414–6424.[Abstract/Free Full Text]

    Chua HN, et al. Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics (2006) 22:1623–1630.[Abstract/Free Full Text]

    Deng M, et al. An integrated probabilistic model for functional prediction of proteins. J. Comput. Biol (2004) 11:463–475.[CrossRef][Web of Science][Medline]

    Eisen M, et al. Cluster analysis and display of genome-wide expression patterns. Proc. Natl Acad. Sci. USA (1998) 95:14863–14868.[Abstract/Free Full Text]

    Gribskov M, Robinson NL. Use of receiver operating characteristic analysis to evaluate sequence matching. Comput. Chem (1996) 20:25–33.[CrossRef][Web of Science][Medline]

    Hughes TR, et al. Functional discovery via a compendium of expression profiles. Cell (2000) 102:109–126.[CrossRef][Web of Science][Medline]

    Jensen L, et al. Ab initio prediction of human orphan protein function from post-translational modifications and localization features. J. Mol. Biol (2002) 319:1257–1265.[CrossRef][Web of Science][Medline]

    Karaoz U, et al. Whole genome annotation using evidence integration in functional linkage networks. Proc. Natl Acad. Sci. USA (2004) 101:2888–2893.[Abstract/Free Full Text]

    Khan S, et al. GoFigure: automated gene ontology annotation. Bioinformatics (2003) 19:2484–2485.[Abstract/Free Full Text]

    Lanckriet GR, et al. Kernel-based data fusion and its application to protein function prediction in yeast. Proc. Pac. Symp. Biocomput (2004) 300–311.

    Lee I, et al. Probabilistic functional network of yeast genes. Science (2004) 306:1555–1558.[Abstract/Free Full Text]

    Letovsky S, Kasif S. Predicting protein function from protein/protein interaction data: a probabilistic approach. Bioinformatics (2003) 19(Suppl. 1):i197–i204.[Abstract]

    Martin DM, et al. GOtcha: a new method for prediction of protein function assessed by the annotation of seven genomes. BMC Bioinformatics (2004) 5:178.[CrossRef][Medline]

    Murali TM, et al. The art of gene function prediction. Nat. Biotechnol (2006) 24:1474–1475.[CrossRef][Web of Science][Medline]

    Snel B, et al. STRING: a web-server to retrieve and display the repeatedly occurring neighbourhood of a gene. Nucleic Acids Res (2000) 28:3442–4.[Abstract/Free Full Text]

    Spellman PT, et al. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell (1998) 9:3273–3297.[Abstract/Free Full Text]

    Troyanskaya OG, et al. A Bayesian framework for combining heterogeneous data sources for gene function prediction (in S. cerevisiae). Proc. Natl Acad. Sci. USA (2003) 100:8348–8353.[Abstract/Free Full Text]

    Tsuda K, et al. Fast protein classification with multiple networks. Bioinformatics (2005) 21:ii59–ii65.[Abstract]

    Vazquez A, et al. Global protein function prediction from protein–protein interaction networks. Nat. Biotechnol (2003) 21:697–700.[CrossRef][Web of Science][Medline]

    Xiong J, et al. Genome wide prediction of protein function via a generic knowledge discovery approach based on evidence integration. BMC Bioinformatics (2006) 7:268.[CrossRef][Medline]

    Zhou X, et al. Transitive functional annotation by shortest-path analysis of gene expression data. Proc. Natl Acad. Sci. USA (2002) 99:12783–12788.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
BioinformaticsHome page
K. Wabnik, T. R. Hvidsten, A. Kedzierska, J. Van Leene, G. De Jaeger, G. T. S. Beemster, J. Komorowski, and M. T. R. Kuiper
Gene expression trends and protein features effectively complement each other in gene function prediction
Bioinformatics, February 1, 2009; 25(3): 322 - 330.
[Abstract] [Full Text] [PDF]


Home page
Brief BioinformHome page
H. Michael, J. Hogan, A. Kel, O. Kel-Margoulis, F. Schacherer, N. Voss, and E. Wingender
Building a knowledge base for systems pathology
Brief Bioinform, December 10, 2008; (2008) bbn038v1.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/24/3364    most recent
btm520v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Chua, H. N.
Right arrow Articles by Wong, L.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Chua, H. N.
Right arrow Articles by Wong, L.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?