Skip Navigation


Bioinformatics Advance Access originally published online on November 6, 2008
Bioinformatics 2009 25(1):121-122; doi:10.1093/bioinformatics/btn567
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
25/1/121    most recent
btn567v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Melvin, I.
Right arrow Articles by Noble, W. S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Melvin, I.
Right arrow Articles by Noble, W. S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2008 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

RANKPROP: a web server for protein remote homology detection

Iain Melvin 1,*, Jason Weston 1, Christina Leslie 2 and William Stafford Noble 3,4,*

1NEC Laboratories of America, Princeton, NJ, 2Computational Biology Program, Sloan-Kettering Institute, Memorial Sloan-Kettering Cancer Center, New York, NY, 3Department of Genome Sciences and 4Department of Computer Science and Engineering, University of Washington, Seattle, WA, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 References
 

Summary: We present a large-scale implementation of the RANKPROP protein homology ranking algorithm in the form of an openly accessible web server. We use the NRDB40 PSI-BLAST all-versus-all protein similarity network of 1.1 million proteins to construct the graph for the RANKPROP algorithm, whereas previously, results were only reported for a database of 108 000 proteins. We also describe two algorithmic improvements to the original algorithm, including propagation from multiple homologs of the query and better normalization of ranking scores, that lead to higher accuracy and to scores with a probabilistic interpretation.

Availability: The RANKPROP web server and source code are available at http://rankprop.gs.washington.edu

Contact: iain{at}nec-labs.com; noble{at}gs.washington.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 References
 
RANKPROP (Weston et al., 2004) is a network-based inference algorithm for identifying subtle protein sequence similarities, corresponding to remote homology relationships or to structural similarities. The algorithm operates on a protein similarity network, a graph in which each node is a protein and each weighted edge connecting two proteins indicates their similarity. Such a network can be built using existing tools, such as PSI-BLAST (Altschul et al., 1997).

The key idea of the RANKPROP algorithm is to extract global information from a protein similarity network by propagating outward from a user-specified query protein. Effectively, the algorithm sums over all possible paths from the query to each target protein. Thus, after propagation, the resulting activation score for each node includes global information about that protein's relationship to the query. Ranking proteins by these scores is analogous to performing a database search using a tool such as PSI-BLAST, except that the ranking induced by RANKPROP reflects the global topology of the protein similarity network.

In (Weston et al., 2004), PSI-BLAST is used to measure sequence similarity, and the unnormalized weight for the edge from node i to node j is Wij=exp(–Sj(i)/{sigma}), where Sj(i) is the PSI-BLAST E-value assigned to protein i given query j, and the parameter {sigma} is a positive constant. Edges are only included in the network for E-values smaller than a fixed threshold. We obtain a stochastic connectivity matrix M for the protein similarity network by row-normalizing edge weights Wij to obtain transition probabilities: Mij = Wij / {sum}j Wij.

Given such a network and a query sequence q, the RANKPROP algorithm is simple to describe. First, all nodes are assigned initial activation scores that reflect each target protein's similarity to q. Like the edge weights, these scores are computed from PSI-BLAST E-values using the same equation. At each iteration of the algorithm, the activation score at a given node is replaced by the weighted sum of the scores from all of its incoming edges. The update rule includes a diffusion constant {alpha} that controls the rate of diffusion through the network. Formally, we define the initial activation scores as PFormula= exp(–Sq(i)/{sigma}). Viewing Pt as the column vector of activation levels at iteration t, the algorithm is given by PFormula= {alpha}MPFormula+PFormula if Pi!=q and PFormula = 1 otherwise, where {alpha} isin(0,1). One can show that this iterative procedure converges to a fixed point, which in practice happens in a small number of iterations. The output of the RANKPROP algorithm is a ranking of the nodes in the network according to their final activation values. Proteins that receive a high activation score are linked to the query via many strongly weighted paths and vice versa. A multidomain query protein will produce strong matches to any target protein that contains one or more of the query domains. A single domain query A may connect through a multidomain protein AB to infer a false relationship with B. However, previous work (Weston et al., 2004) has found that as long as the query sequence is connected to many other proteins, then the true homologs will be mutually reinforcing and receive a higher rank.

In this work, we extend the original RANKPROP algorithm in two ways: (1) improving accuracy by propagating simultaneously from proteins that are very closely related to the query, and (2) improving the interpretability of the scores produced by RANKPROP by empirically mapping them to probabilities. The mapped score can be interpreted as the probability that the target protein is a member of the same SCOP superfamily as the query. We also announce the availability of a free web server that allows individual queries against a protein similarity network derived from the NRDB40, comprising 1.1 million targets.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 References
 
The RANKPROP server uses the PSI-BLAST all-versus-all similarity matrix for NRDB40 provided by the PairsDB website (Heger et al., 2008). NRDB40 is a subset of the non-redundant sequence database, filtered so that no pairs exhibit >40% sequence identity.

We generalize the RANKPROP algorithm to accept a set Q of query proteins, rather than a single query protein. To use this extra information we perform propagation as usual, but we constrain the activation scores for all the query points such that they are highly ranked. In particular, we choose the set Q to be all the proteins that have a match with the initial query q with a PSI-BLAST E-value<0.001. We then constrain our algorithm to have Pj = 1–Sq(j), {forall} jisinQ. This modification is useful because, instead of propagating from a single query source node in the graph, we can propagate from several source nodes that all belong to the same family or superfamily that we are searching for.

The original RANKPROP algorithm outputs scalar values that are not directly interpretable. In the new version of the algorithm, we map each RANKPROP score to an estimate of the probability that the corresponding query and target proteins belong to the same structural superfamily. We employ the SCOP database (Murzin et al., 1995) to compute a histogram of empirical frequencies of the activation levels Pi for each protein i. More specifically, we choose bin centers vk and compute the following quantities: nk, the number of times Pi falls into bin vk, and sn, the number of times that the latter occurs and i is in the same superfamily as the query. We are interested in the value sk / nk, which can be interpreted as the probability for each activation value bin of the target being in the same superfamily as the query. We choose the bin centers v= (0,0.01,0.02,..., 0.2 , 0.3 ,..., 1), and we enforce monotonicity in the final output by setting pi/ni = pi–1/ni–1 if pi/ni<pi–1/ni–1.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 References
 
Table 1 compares our large-scale RANKPROP results with PSI-BLAST (using NRDB40 and the same blastpgp parameters as PairsDB: –j 10 –e 1 –h 0.001 –b 10000 –v 10000) and the previously published version of RANKPROP (using the SWISSPROT database, 108k proteins). RANKPROP NRDB40 is a straightforward scaling up of the previous RANKPROP algorithm to NRDB40. In addition, RANKPROP+homologs NRDB40 employs the extensions described in Section 2. Accuracy is measured following the methodology given in (Weston et al., 2004): SCOP version 1.59 is split into train and test portions, and hyperparameters are chosen by using the training set. Then, each test protein is treated as a query, and the quality of a method's protein ranking is measured by using the area under the receiver operating characteristic (ROC) curve, up to the first (ROC1) or 50th (ROC50) false positive. We report results as average ROC1 and ROC50 scores across all 3083 test proteins. Using a larger network yields improvements across all four performance metrics, and propagating from multiple queries improves the performance still further. A Wilcoxon signed rank test, corrected for multiple tests, shows that all differences in Table 1 are significant at 0.01, except for the three pairs of methods marked with asterisks.


View this table:
[in this window]
[in a new window]

 
Table 1. Ranking accuracy

 
We also evaluate the performance of RANKPROP using a combined ROC curve across all the queries in our test set, following the protocol of (Altschul et al., 1997). Figure 1 shows the combined ROC curves for RANKPROP NRDB40 (ranked by activation value), RANKPROP+homologs NRDB40 (ranked by probability) and PSI-BLAST (ranked by E-value). Compared with average per-query ROC scores, the combined ROC curve requires that scores are well calibrated from one query to the next. The figure shows that the mapping of RANKPROP scores to probabilities significantly improves the calibration, yielding better performance than PSI-BLAST for all but the first few false positives (across 3083 queries).


Figure 1
View larger version (15K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. Combined ROC curve across multiple queries. For each method, search results from 3083 queries were sorted into a single list. The figure plots, for varying thresholds in the ranked list, the fraction of all known homologs (SCOP superfamily members) that fall above the threshold, as a function of the number of non-superfamily members above the threshold.

 
The RANKPROP web server first looks for an exact match of the query sequence against the sequences in NRDB40. If such a match is found, the server will retrieve the precomputed PSI-BLAST results from the database and then apply the RANKPROP algorithm. In this case the server takes around 90 s to process a query. If the sequence is not found in the database, then the server will run PSI-BLAST first, which on average takes an additional 15 min.


    Funding
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 References
 
National Institutes of Health award (R01 GM074257).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Burkhard Rost

Received on June 26, 2008; revised on October 29, 2008; accepted on October 29, 2008

    References
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 Funding
 References
 

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.[Abstract/Free Full Text]

    Heger A, et al. Pairsdb atlas of protein sequence space. Nucleic Acids Res. (2008) 36:D276–D280.[Abstract/Free Full Text]

    Murzin AG, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. (1995) 247:536–540.[CrossRef][Web of Science][Medline]

    Weston J, et al. Protein ranking: from local to global structure in the protein similarity network. Proc. Natl Acad. Sci. (2004) 101:6559–6563.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
25/1/121    most recent
btn567v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Melvin, I.
Right arrow Articles by Noble, W. S.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Melvin, I.
Right arrow Articles by Noble, W. S.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?