Bioinformatics Advance Access originally published online on June 3, 2009
Bioinformatics 2009 25(16):2076-2077; doi:10.1093/bioinformatics/btp346
© 2009 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

PROCAIN server for remote protein sequence similarity search

Yong Wang 1, Ruslan I. Sadreyev 2 and Nick V. Grishin 2,3,*

1 Biomedical Engineering Program, University of Texas Southwestern Medical Center, 2 Howard Hughes Medical Institute and 3 Department of Biochemistry, University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX 75390-9050, USA

*To whom correspondence should be addressed.


    ABSTRACT

Sensitive and accurate detection of distant protein homology is essential for studies of protein structure, function and evolution. We recently developed PROCAIN, a method that is based on sequence profile comparison and involves the analysis of four signals: similarities of residue content at the profile positions, combined with three types of assisting information (sequence motifs, residue conservation and predicted secondary structure). Here we present the PROCAIN web server, which allows the user to submit a query sequence or multiple sequence alignment and perform the search in a profile database of choice. The output is structured similarly to that of BLAST, with the list of detected homologs sorted by E-value and followed by profile–profile alignments. The front page allows the user to adjust multiple options of input processing and output formatting, as well as search settings, including the relative weights assigned to the three types of assisting information.

Availability: http://prodata.swmed.edu/procain/

Contact: grishin@chop.swmed.edu


    1 INTRODUCTION
Protein similarity detection and sequence alignment are a significant branch of bioinformatics. They are widely used for prediction of protein structure and function and in protein evolution studies (Kinch et al., 2003). To increase the accuracy of these applications, homology detection sensitivity and alignment quality are crucial. However, despite significant research efforts, it is still difficult to accurately detect homologs with relatively low sequence similarity.

BLAST (Altschul et al., 1990) and FASTA (Pearson and Lipman, 1988) are the first generation of sequence similarity search programs. Based on sequence–sequence comparison, these programs perform very well for proteins with high sequence identity. PSI-BLAST (Altschul et al., 1997), a method based on the comparison of a sequence to a multiple sequence alignment (MSA), brings the sensitivity of similarity search to another level, since the MSA incorporates information about the query protein family. COMPASS (Sadreyev and Grishin, 2003) is based on protein MSA–MSA comparison and improves similarity search further, especially for remote homologs. The latest version of COMPASS employs an advanced statistical model (Sadreyev and Grishin, 2008) to help increase the accuracy of remote similarity detection. HHsearch (Soding, 2005), another MSA–MSA comparison method, uses the formalism of hidden Markov models (HMMs), which allows for position-specific rather than fixed affine gap penalties. HHsearch also incorporates predicted or observed secondary structure (SS) in the process of alignment construction and statistical significance estimation. These two characteristics of HHsearch contribute to more accurate remote similarity detection.

Recently we developed PROCAIN, a method for protein MSA comparison with assisting information (Wang et al., 2009). PROCAIN combines residue substitution constraints at individual sequence positions with sequence motif matching, residue conservation and SS scoring. PROCAIN also incorporates an empirical method for the estimation of statistical significance that is based on the comparison of non-homologous proteins from a calibration database (for query sequence) and a database used for search (for subject sequence). This method produces more realistic E-values and improves the ranking of hits. Benchmarked together with COMPASS 3.0 and HHsearch (version 1.5), PROCAIN shows better remote homology inference and alignment quality (Figure 1).


Fig. 1. PROCAIN's overall performance with respect to homology detection accuracy (left) and alignment quality (right). Homology detection quality is measured by the overall structure similarity of detected protein pairs. If the detected protein pair has a GDT-TS score (normalized by query length) larger than 15 (on the scale from 0 to 100), the hit is considered a true positive; otherwise it is considered a false positive. The GDT-TS cutoff was determined previously (Qi et al., 2007) based on the observed GDT-TS distributions for homologs and non-homologs. Alignment quality is measured by the average GDT-TS-like score (vertical axis on the scale from 0 to 1) based on the alignments of homologous sequences.
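
The true-positive criterion stated in the caption can be written down as a minimal sketch. The function names `label_hits` and `count_true_false` are hypothetical and are not part of PROCAIN or of the evaluation system of Qi et al.; only the cutoff of 15 and the strict "larger than" comparison are taken from the caption.

```python
from typing import List, Tuple

# Cutoff quoted in the caption (GDT-TS normalized by query length, scale 0 to 100).
GDT_TS_CUTOFF = 15.0


def label_hits(gdt_ts_scores: List[float], cutoff: float = GDT_TS_CUTOFF) -> List[bool]:
    """Return True for a true positive (score strictly larger than the cutoff)."""
    return [score > cutoff for score in gdt_ts_scores]


def count_true_false(labels: List[bool]) -> Tuple[int, int]:
    """Return (number of true positives, number of false positives)."""
    true_positives = sum(labels)
    return true_positives, len(labels) - true_positives
```

A hit whose score equals the cutoff exactly is labeled a false positive here, following the wording "larger than 15".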

 

    2 FEATURES AND USAGE
The main page of the PROCAIN web server consists of an input box and several option sets. The user can paste a protein sequence or alignment into the input box or upload it using the browse button. The user can choose a protein database to search: the server features SCOP, PDB and PFAM databases. The user can access the results interactively in the current window or choose to receive an HTML link to the results by email after the search is completed.

There are three option sets: input processing options, search options and output formatting options. Following the link provided by the name of each option leads the user to a help page with a brief explanation of the option. The input option set includes options for running PSI-BLAST and for further processing of the resulting alignment of detected homologs, such as the number of iterations, cutoff E-values, etc. Output formatting options include the upper-bound E-value at which the list of hits is truncated, the significance threshold and the maximum number of alignments to be shown. The output example button in the upper right corner of the input box shows the user a typical result page.
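
The interplay of the three output formatting options can be illustrated with a small sketch. It is not the server's code: the `Hit` record, the function `format_hits` and its parameter names are hypothetical, and the sketch assumes that hits at or below the significance threshold are simply flagged as significant.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Hit:
    name: str        # identifier of the database profile (hypothetical field)
    e_value: float   # E-value reported for the hit


def format_hits(
    hits: List[Hit],
    max_e_value: float,
    significance_threshold: float,
    max_alignments: int,
) -> Tuple[List[Tuple[str, float, bool]], List[str]]:
    """Return the truncated hit list and the names of hits whose alignments are shown."""
    # Truncate the list at the upper-bound E-value and sort by E-value, best first.
    kept = sorted(
        (h for h in hits if h.e_value <= max_e_value),
        key=lambda h: h.e_value,
    )
    # Flag each listed hit as significant or not (assumed behaviour).
    listed = [(h.name, h.e_value, h.e_value <= significance_threshold) for h in kept]
    # Only the top-ranked hits get an alignment displayed.
    shown = [h.name for h in kept[:max_alignments]]
    return listed, shown
```

For example, with an upper bound of 1.0, a significance threshold of 0.01 and at most two alignments, a hit with E-value 5.0 is dropped from the list, a hit with E-value 0.5 is listed but not flagged as significant, and only the two best hits are shown as alignments.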

The search options are probably the most important for an experienced user. These include the values of affine gap penalties (costs of gap opening and gap extension), which allow the user to adjust the coverage of the produced alignments. Increasing the gap penalties decreases coverage and vice versa. PROCAIN generally produces long alignments, with coverage 40% larger than that of COMPASS and almost 200% larger than that of HHsearch. Such longer alignments are normally favored by users attempting to predict the structure of the query protein, since they provide a more complete picture of its possible structure.
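
The relation between gap penalties and coverage can be illustrated with a minimal sketch of local alignment under affine gap costs (Gotoh-style recurrences). This is not the PROCAIN implementation: the function name `local_align_affine`, the toy score matrix and the convention that the first position of a gap costs `gap_open` and each further position costs `gap_extend` are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

NEG_INF = float("-inf")


def local_align_affine(
    scores: Sequence[Sequence[float]], gap_open: float, gap_extend: float
) -> Tuple[float, int]:
    """Return (best local score, number of aligned position pairs in that alignment).

    scores[i][j] is the score of aligning query position i with subject position j.
    Each cell stores a (score, aligned_pairs) tuple, so the pair count travels
    with the score and no traceback is needed.
    """
    n = len(scores)
    m = len(scores[0]) if n else 0
    zero = (0.0, 0)
    H = [[zero] * (m + 1) for _ in range(n + 1)]          # best alignment ending at (i, j)
    E = [[(NEG_INF, 0)] * (m + 1) for _ in range(n + 1)]  # ending with a gap in the query
    F = [[(NEG_INF, 0)] * (m + 1) for _ in range(n + 1)]  # ending with a gap in the subject
    best = zero
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Open a new gap or extend an existing one.
            E[i][j] = max(
                (H[i][j - 1][0] - gap_open, H[i][j - 1][1]),
                (E[i][j - 1][0] - gap_extend, E[i][j - 1][1]),
            )
            F[i][j] = max(
                (H[i - 1][j][0] - gap_open, H[i - 1][j][1]),
                (F[i - 1][j][0] - gap_extend, F[i - 1][j][1]),
            )
            # Align position i with position j.
            diag = (H[i - 1][j - 1][0] + scores[i - 1][j - 1], H[i - 1][j - 1][1] + 1)
            cell = max(diag, E[i][j], F[i][j])
            # Local alignment: restart whenever the running score is not positive.
            H[i][j] = cell if cell[0] > 0 else zero
            best = max(best, H[i][j])
    return best


# Toy example: the subject carries one extra position ("X") relative to the query.
query, subject = "ABCDEF", "ABCXDEF"
toy_scores = [[2 if q == s else -2 for s in subject] for q in query]

low = local_align_affine(toy_scores, gap_open=1, gap_extend=1)    # (11.0, 6)
high = local_align_affine(toy_scores, gap_open=10, gap_extend=1)  # (6.0, 3)
coverage_low = low[1] / len(query)    # 1.0: the gap is bridged
coverage_high = high[1] / len(query)  # 0.5: the alignment stops at the gap
```

With the low penalties the alignment bridges the inserted position and covers the whole query; with the high penalties opening the gap costs more than the second block of matches gains, so only half of the query is aligned.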

PROCAIN constructs alignments based on the combination of four scores: sequence similarity scores, amino-acid conservation scores, sequence motif scores and SS scores:


Formula

where s_seq is the sequence similarity score; C is the total conservation score in the two columns, normalized to the range [0, 1]; s_m is the sum of the sequence similarity scores of the current position and its previous and next neighboring positions; s_ss is the SS similarity score; and w is the weight parameter for each score. δ_m = 1 if the sequence similarity scores of the current position and its two neighboring positions are all positive, and δ_m = 0 otherwise. The resulting all-positions to all-positions scores s are used to construct the optimal local alignment of the two profiles using the Smith-Waterman algorithm (Smith and Waterman, 1981). The search options provide an opportunity to adjust this scoring function by changing the weights of the different components according to the user's experience and expectations. The default weights of the score terms are the optimal values obtained on a subset of diverse SCOP domains (Wang et al., 2009). Although the composition of the training set (47.9% for α/β, 17.6% for all α, 9.6% for all β and 8.9% for α+β) may reflect the overall composition of the SCOP database very well, it is dominated by the α/β class. This composition bias leads to different homology detection accuracy in different protein classes. PROCAIN performance in the α/β class is very similar to the overall performance, whereas the other three classes show significant differences. These differences suggest that homology detection in the all α, all β and α+β classes may benefit from individualized adjustment of weights in the scoring function. For example, decreasing the contribution of the SS score and putting more emphasis on residue similarity can potentially improve the detection quality among all α and all β proteins, where types and boundaries of SS elements are less informative for alignment construction. Thus experienced users are encouraged to adjust the weights of the three additional scores according to the properties of the query protein.
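
How the weights act on the individual terms can be illustrated with a rough sketch. The scoring formula itself is not reproduced in this text, so the way the terms are combined below (a plain weighted sum) is an assumption and may differ from the published PROCAIN scoring function; only the definitions of the motif sum and of the motif indicator follow the description above. The function names, the use of diagonal neighbors (i-1, j-1) and (i+1, j+1) as the "previous and next" positions, and the choice to set the motif term to zero at profile borders are likewise hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def motif_terms(s_seq: Matrix) -> Tuple[Matrix, List[List[int]]]:
    """Compute the motif sum and the motif indicator for every position pair.

    The indicator is 1 only if the sequence similarity scores of the current
    pair and of its two neighbors are all positive. Border pairs lack one
    neighbor and are given 0 here (an assumption).
    """
    n, m = len(s_seq), len(s_seq[0])
    s_m = [[0.0] * m for _ in range(n)]
    delta_m = [[0] * m for _ in range(n)]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            window = (s_seq[i - 1][j - 1], s_seq[i][j], s_seq[i + 1][j + 1])
            s_m[i][j] = float(sum(window))
            delta_m[i][j] = 1 if all(v > 0 for v in window) else 0
    return s_m, delta_m


def combined_score(
    s_seq: Matrix, conservation: Matrix, s_ss: Matrix,
    w_c: float, w_m: float, w_ss: float,
) -> Matrix:
    """Combine the four terms as a weighted sum (ASSUMED form, for illustration only)."""
    s_m, delta_m = motif_terms(s_seq)
    n, m = len(s_seq), len(s_seq[0])
    return [
        [
            s_seq[i][j]
            + w_c * conservation[i][j]
            + w_m * delta_m[i][j] * s_m[i][j]
            + w_ss * s_ss[i][j]
            for j in range(m)
        ]
        for i in range(n)
    ]
```

Under this assumed form, setting `w_ss` to a smaller value reduces the influence of predicted secondary structure on every position pair, which corresponds to the adjustment suggested above for all α and all β queries.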


    ACKNOWLEDGEMENTS
The authors would like to thank Ming Tang for providing technical support with setting up the server.

Funding: National Institutes of Health (grant number GM67165 to N.V.G.).

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Burkhard Rost

Received on April 22, 2009; revised on May 28, 2009; accepted on May 29, 2009

    REFERENCES

    Altschul SF, et al. Basic local alignment search tool. J. Mol. Biol. (1990) 215:403–410.

    Altschul SF, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. (1997) 25:3389–3402.

    Kinch LN, et al. CASP5 assessment of fold recognition target predictions. Proteins (2003) 53(Suppl. 6):395–409.

    Pearson WR, Lipman DJ. Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA (1988) 85:2444–2448.

    Qi Y, et al. A comprehensive system for evaluation of remote sequence similarity detection. BMC Bioinformatics (2007) 8:314.

    Sadreyev R, Grishin N. COMPASS: a tool for comparison of multiple protein alignments with assessment of statistical significance. J. Mol. Biol. (2003) 326:317–336.

    Sadreyev RI, Grishin NV. Accurate statistical model of comparison between multiple sequence alignments. Nucleic Acids Res. (2008) 36:2240–2248.

    Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol. (1981) 147:195–197.

    Soding J. Protein homology detection by HMM-HMM comparison. Bioinformatics (2005) 21:951–960.

    Wang Y, et al. PROCAIN: protein profile comparison with assisting information. Nucleic Acids Res. (2009) 37:3522–3530.

