Bioinformatics Advance Access originally published online on August 16, 2005
Bioinformatics 2005 21(19):3803-3805; doi:10.1093/bioinformatics/bti619
NdPASA: a pairwise sequence alignment server for distantly related proteins

1Department of Chemistry, Temple University, Philadelphia, PA 19122, USA
2Center for Biotechnology, Temple University, Philadelphia, PA 19122, USA
*To whom correspondence should be addressed at Department of Chemistry, Temple University, 1901 N. 13th Street, Philadelphia, PA 19122, USA
| Abstract |
|---|
Summary: NdPASA is a web server specifically designed to optimize sequence alignment between distantly related proteins. The program integrates structure information of the template sequence into a global alignment algorithm by employing neighbor-dependent propensities of amino acids as a unique parameter for alignment. NdPASA optimizes alignment by evaluating the likelihood that a residue pair in the query sequence matches a corresponding residue pair adopting a particular secondary structure in the template sequence. NdPASA is most effective in aligning homologous proteins that share a low percentage of sequence identity. The server is designed to aid homologous protein structure modeling. A PSI-BLAST search engine is implemented to help users identify the template candidates that are most appropriate for modeling the query sequences.
Availability: http://guanyin.chem.temple.edu
Contact: feng{at}temple.edu
Protein sequence alignment is an essential component of biomedical research. It is one of the standard approaches to explore potential functional activity of a newly discovered protein by identifying sequence homologues that may be evolutionarily related (Pearson and Lipman, 1988). Structural and functional information of a new protein can often be inferred from the knowledge of well-characterized homologous proteins. An accurate sequence alignment is critical to comparative protein structure prediction. While closely related protein sequences in general are relatively easy to align using the existing sequence-based methods, the success rate of these methods in finding correct alignment is significantly reduced when the sequence identity between two aligned sequences is <25%, a threshold often referred to as the twilight zone (Rost, 1999). Recent attempts to improve pairwise sequence alignment have benefited greatly from incorporating sequence-profile and structural information into the alignment algorithms (Marti-Renom et al., 2004; Wang and Feng, 2005).
NdPASA is a web server for pairwise sequence alignment of distantly related homologous proteins. It provides a user-friendly interface for a global sequence alignment algorithm that incorporates neighbor-dependent amino acid propensity. By utilizing the structural information on the template sequence, NdPASA significantly improves on standard PSI-BLAST in aligning sequence pairs with <20% sequence identity (Wang and Feng, 2005). In addition to neighbor-dependent amino acid secondary structure propensities, the algorithm also utilizes a structure-dependent gap opening and extension penalty scheme. A higher gap penalty is applied for gaps that occur within regular secondary structures than for gaps that occur in loops. The neighbor-dependent amino acid secondary structure propensities were derived from a sequence analysis of proteins that calculated the effect of the neighboring amino acid type on the propensity of residues for adopting α-helices, β-strands and loops in proteins (Crasto and Feng, 2001; Wang and Feng, 2003). The values of neighbor-dependent propensity reflect the likelihood of an amino acid pair adopting a particular secondary structure conformation. The rationale for using neighbor-dependent amino acid propensity in sequence alignment is straightforward. Methods employing a sequence-based substitution matrix often have limited success in aligning sequences that share a low percentage of sequence identity. Incorporating the neighbor-dependent amino acid propensities allowed us to estimate the probability that an amino acid pair aligns with a corresponding amino acid pair adopting a specific secondary structure in the template sequence. For example, an amino acid pair in the query sequence with a low neighbor-dependent propensity for α-helical conformation would be less likely to be aligned with an amino acid pair in an α-helix of the template sequence. NdPASA performs most effectively when the structural information of the template sequence is available.
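The helix example above can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the propensity values are invented, the state labels (`H`, `E`, `C`), the table `PAIR_PROPENSITY` and the function `propensity_term` are hypothetical names, and taking the logarithm of the propensity is our choice for turning a ratio into an additive score, not a detail confirmed by the published algorithm.

```python
import math
from typing import Dict, Tuple

# Hypothetical propensities for (residue, following residue, state).
# States: "H" = helix, "E" = strand, "C" = loop. Values are illustrative
# only; values above 1 favor the state and values below 1 disfavor it.
PAIR_PROPENSITY: Dict[Tuple[str, str, str], float] = {
    ("A", "L", "H"): 1.6,
    ("A", "L", "C"): 0.7,
    ("G", "P", "H"): 0.3,
    ("G", "P", "C"): 1.8,
}


def propensity_term(query: str, i: int, template_ss: str, j: int) -> float:
    """Log-propensity of the query pair (i, i+1) for the template state at j."""
    if i + 1 >= len(query):
        return 0.0  # the last residue has no following neighbor
    key = (query[i], query[i + 1], template_ss[j])
    # Pairs missing from the toy table are treated as neutral (propensity 1).
    return math.log(PAIR_PROPENSITY.get(key, 1.0))


# A Gly-Pro pair is penalized against a helical template position and
# rewarded against a loop position; an Ala-Leu pair behaves the other way.
helix_penalty = propensity_term("GP", 0, "HH", 0)
loop_bonus = propensity_term("GP", 0, "CC", 0)
```

With these toy values, `helix_penalty` is negative and `loop_bonus` is positive, which mirrors the statement that a pair with low helical propensity is less likely to be placed in a template helix.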
NdPASA incorporates secondary structure propensity information into the Needleman-Wunsch global alignment algorithm with affine gap penalties (Wang and Feng, 2005). A scaling factor was introduced to adjust the relative weight of the neighbor-dependent secondary structure propensity score against the amino acid substitution score. The default substitution matrix is BLOSUM62. The gap opening and extension penalties were introduced as secondary structure-dependent parameters, whose values were estimated by optimizing the alignment accuracy of 500 randomly selected homologous sequence pairs that were used as a training dataset (Wang and Feng, 2005). Because regular secondary structures are often more conserved than loop regions, the gap opening penalties for helices and strands were assigned higher values than those for loops.
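A global alignment with affine, structure-dependent gap penalties and a weighted propensity term can be sketched as follows. This is not the NdPASA implementation: it is a standard three-state (Gotoh-style) dynamic program that returns only the optimal score, and every name and number in it (`global_align_score`, `identity_score`, `OPEN`, `EXTEND`, the rule that a gap takes the state of the adjacent template residue, and the convention that the first gapped position costs the opening penalty and each further position costs the extension penalty) is an assumption made for illustration.

```python
from typing import Callable, Dict, Optional

NEG_INF = float("-inf")


def global_align_score(
    query: str,
    template: str,
    template_ss: str,
    substitution: Callable[[str, str], float],
    gap_open: Dict[str, float],
    gap_extend: Dict[str, float],
    propensity: Optional[Callable[[int, int], float]] = None,
    weight: float = 1.0,
) -> float:
    """Optimal global alignment score with state-dependent affine gaps."""
    n, m = len(query), len(template)
    # M: residues i and j aligned; X: query residue i against a gap;
    # Y: template residue j against a gap.
    M = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    X = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    Y = [[NEG_INF] * (m + 1) for _ in range(n + 1)]
    M[0][0] = 0.0

    def state(j: int) -> str:
        # Assumed rule: a gap at template column j takes the state of
        # template residue j; gaps before the first residue count as loop.
        return template_ss[j - 1] if j > 0 else "C"

    for i in range(n + 1):
        for j in range(m + 1):
            o, e = gap_open[state(j)], gap_extend[state(j)]
            if i > 0 and j > 0:
                s = substitution(query[i - 1], template[j - 1])
                if propensity is not None:
                    # Scaling factor balancing propensity and substitution.
                    s += weight * propensity(i - 1, j - 1)
                M[i][j] = s + max(M[i - 1][j - 1], X[i - 1][j - 1], Y[i - 1][j - 1])
            if i > 0:
                X[i][j] = max(M[i - 1][j] - o, Y[i - 1][j] - o, X[i - 1][j] - e)
            if j > 0:
                Y[i][j] = max(M[i][j - 1] - o, X[i][j - 1] - o, Y[i][j - 1] - e)
    return max(M[n][m], X[n][m], Y[n][m])


def identity_score(a: str, b: str) -> float:
    """Toy stand-in for a substitution matrix such as BLOSUM62."""
    return 2.0 if a == b else -1.0


# Hypothetical penalties: gaps cost more in helices/strands than in loops.
OPEN = {"H": 12.0, "E": 12.0, "C": 5.0}
EXTEND = {"H": 2.0, "E": 2.0, "C": 1.0}

in_helix = global_align_score("AD", "ACD", "HHH", identity_score, OPEN, EXTEND)
in_loop = global_align_score("AD", "ACD", "HCH", identity_score, OPEN, EXTEND)
```

In this toy case the query lacks the middle template residue, so one gap is unavoidable. The alignment scores higher when that residue sits in a loop (`in_loop`) than when it sits in a helix (`in_helix`), which is the behavior the structure-dependent penalties are meant to produce.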
A detailed analysis of the performance and the benchmarking tests of the NdPASA algorithm is presented elsewhere (Wang and Feng, 2005). Using superpositions of homologous proteins derived from the PSI-BLAST analysis and the SCOP classification of a non-redundant Protein Data Bank (PDB) database as a gold standard, we found that NdPASA improved pairwise alignment accuracy. Statistical analyses of the performance of NdPASA indicated that the introduction of sequence patterns of secondary structure derived from the neighbor-dependent sequence analysis clearly improved alignment performance for sequence pairs sharing <20% sequence identity. For sequence pairs sharing 13-21% sequence identity, NdPASA improved the accuracy of alignment over the conventional global alignment (GA) algorithm using BLOSUM62 by an average of 8.6% (Wang and Feng, 2005).
| NdPASA SERVER |
|---|
The NdPASA server is designed mainly to aid homologous protein structure modeling of query sequences. It provides a simple user interface for easy interaction. Figure 1a shows a schematic diagram of the algorithm implemented in the server. Since NdPASA alignment is most effective when the structural information of the template is available, we designed an input page with three options. In addition to entering a query sequence, the user may input either the sequence or the PDB entry-ID of a template protein. If a user-specified template sequence is entered, the NdPASA server performs a PSI-BLAST search (Altschul et al., 1997) against the PDB database and returns the PDB entries that share at least 80% sequence identity with the template. The user is asked to select one of the PDB entry-IDs as the desired template for subsequent pairwise sequence alignment with the query sequence. Once the identity of the template is determined, the NdPASA server assigns secondary structure elements to the template using DSSP (Kabsch and Sander, 1983) before applying the NdPASA algorithm to align it with the query sequence. If no sequence match is found between the input template and the proteins in the PDB, the user can submit the template sequence to the PSIPRED server for secondary structure prediction (Jones, 1999; http://bioinf.cs.ucl.ac.uk/psipred/psiform.html). NdPASA accepts the returned secondary structure assignments for the subsequent alignment. Finally, when the user has no information about the template, the NdPASA server performs a PSI-BLAST search against the non-redundant protein structure database (PDB) for sequences homologous to the query, using the BLOSUM62 matrix by default. The user also has the option to choose a different scoring matrix: PAM250, PAM300, PAM120, BLOSUM35, BLOSUM45, BLOSUM50, BLOSUM60, BLOSUM62 or BLOSUM80. In addition, the gap opening and extension penalties can be changed.
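The three input options can be summarized as a small decision procedure. The sketch below only restates the workflow described above in executable form; the `Submission` class, the `plan_steps` function, the step strings and the example entry-ID `1ABC` are hypothetical and do not correspond to the server's actual code or interface.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Submission:
    """Hypothetical container for what a user enters on the input page."""
    query: str
    template_sequence: Optional[str] = None
    template_pdb_id: Optional[str] = None


def plan_steps(sub: Submission, pdb_match_found: bool = True) -> List[str]:
    """Return the ordered server steps for a submission, as described in the text."""
    align = "align query to template with NdPASA"
    if sub.template_pdb_id is not None:
        # Template structure is known: go straight to DSSP and alignment.
        return ["assign template secondary structure with DSSP", align]
    if sub.template_sequence is not None:
        steps = ["PSI-BLAST template sequence against PDB (>= 80% identity)"]
        if pdb_match_found:
            steps += ["user selects a PDB entry",
                      "assign template secondary structure with DSSP"]
        else:
            # No structure found: fall back to predicted secondary structure.
            steps += ["predict template secondary structure with PSIPRED"]
        return steps + [align]
    # No template information: search for template candidates with the query.
    return ["PSI-BLAST query sequence against PDB",
            "user selects a template from the ranked hits",
            "assign template secondary structure with DSSP", align]


# Example: a template sequence with no close PDB match takes the PSIPRED route.
fallback = plan_steps(Submission("ACD", template_sequence="ACE"),
                      pdb_match_found=False)
```

Every path ends in the same NdPASA alignment step; the options differ only in how the template and its secondary structure are obtained.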
The default options of BLOSUM62 and the gap opening (11) and extension (1) penalties were selected because they gave the best overall NdPASA results in experimental tests aligning 2021 pairs of remotely related sequences (Wang and Feng, 2005). The returned results contain essential information that may help the user determine the most appropriate template candidate for the query sequence. All returned results are displayed with their sequence names, PDB entry-IDs, PSI-BLAST scores and percentage sequence identities, as determined by PSI-BLAST when compared with the query protein. To limit the scope of template selection, we restricted the PSI-BLAST output to the sequences with either the top 5 or the top 15 ranked scores. An optional filter was also implemented with which the user may limit the output of the PSI-BLAST search to those sequences whose sequence identity with the query lies above a defined threshold. When a template candidate is identified, the user may select the radio button next to the desired sequence and click submit for optimized pairwise sequence alignment by NdPASA. Upon receiving a command to align the query sequence against one of the templates identified by PSI-BLAST, the program fetches the template sequence from the PDB and assigns a secondary structure conformation to every residue in the template using DSSP (Kabsch and Sander, 1983). It then performs the NdPASA alignment, incorporating the structure information of the template derived from DSSP. The result of the NdPASA alignment is displayed in a pop-up window with the query and the template sequences aligned (Fig. 1b). The secondary structure information of the template sequence is also displayed for inspection. NdPASA also produces an alignment in the standard FASTA format so that the results can be easily integrated with other bioinformatics tools.
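The template selection and output steps described above can be illustrated with a short sketch. It assumes a hypothetical `Hit` record for each PSI-BLAST result, applies the top-N cut before the identity filter (the order is our assumption), and writes a gapped sequence in FASTA format with 60 residues per line; none of these names or choices are taken from the server itself.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """Hypothetical record for one PSI-BLAST result row."""
    name: str
    pdb_id: str
    score: float
    identity: float  # percent sequence identity with the query


def select_hits(hits: List[Hit], top_n: int = 5,
                min_identity: Optional[float] = None) -> List[Hit]:
    """Keep the top_n hits by score, then apply the optional identity filter."""
    ranked = sorted(hits, key=lambda h: h.score, reverse=True)[:top_n]
    if min_identity is not None:
        ranked = [h for h in ranked if h.identity >= min_identity]
    return ranked


def to_fasta(name: str, aligned: str, width: int = 60) -> str:
    """Format one aligned (gapped) sequence as a FASTA record."""
    lines = [aligned[k:k + width] for k in range(0, len(aligned), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


# Made-up hits and alignment rows, for illustration only.
hits = [Hit("protein a", "1AAA", 50.0, 30.0),
        Hit("protein b", "1BBB", 80.0, 15.0),
        Hit("protein c", "1CCC", 65.0, 22.0)]
shortlist = select_hits(hits, top_n=5, min_identity=20.0)
record = to_fasta("query", "AC-DE") + to_fasta("template", "ACWDE")
```

Writing the query and the template as two gapped records of equal length is one common way to exchange a pairwise alignment in FASTA format with other tools.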
| AVAILABILITY |
|---|
NdPASA was implemented in Java and compiled on a Linux-based workstation. The web server can be freely accessed on the World Wide Web at http://guanyin.chem.temple.edu. A brief description of the program and a detailed user guide with examples are also available at the website.
| Acknowledgments |
|---|
The authors would like to thank members of the Feng laboratory for helpful discussions. The authors also thank the American Cancer Society for financial support (PRG9926301GMC) and the Commonwealth of Pennsylvania for the appropriation.
Conflict of Interest: none declared.
| Footnotes |
|---|
Present address: Department of Genetics, Center for Bioinformatics, University of Pennsylvania, PA 19104, USA
Received on May 9, 2005; revised on July 13, 2005; accepted on August 8, 2005
| REFERENCES |
|---|
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res., 25, 3389-3402.
Crasto, C.J. and Feng, J.A. (2001) Sequence codes for extended conformation: a neighbor-dependent sequence analysis of loops in proteins. Proteins, 42, 399-413.
Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol., 292, 195-202.
Kabsch, W. and Sander, C. (1983) Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22, 2577-2637.
Marti-Renom, M.A., et al. (2004) Alignment of protein sequences by their profiles. Protein Sci., 13, 1071-1087.
Pearson, W.R. and Lipman, D.J. (1988) Improved tools for biological sequence comparison. Proc. Natl Acad. Sci. USA, 85, 2444-2448.
Rost, B. (1999) Twilight zone of protein sequence alignments. Protein Eng., 12, 85-94.
Wang, J. and Feng, J.A. (2003) Exploring the sequence patterns in the alpha-helices of proteins. Protein Eng., 16, 799-807.
Wang, J. and Feng, J.A. (2005) NdPASA: a novel pairwise protein sequence alignment algorithm that incorporates neighbor-dependent amino acid propensities. Proteins, 58, 628-637.