Bioinformatics Advance Access originally published online on November 29, 2005
Bioinformatics 2006 22(4):460-465; doi:10.1093/bioinformatics/bti805
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
SimShift: Identifying structural similarities from NMR chemical shifts
LFE Bioinformatik, Institut für Informatik, Ludwig-Maximilians-Universität München Amalienstraße 17, D-80333 München, Germany
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: An important quantity that arises in NMR spectroscopy experiments is the chemical shift. The interpretation of these data is mostly done by human experts; to our knowledge there are no algorithms that predict protein structure from chemical shift sequences alone. One approach to facilitate this process could be to compare two such sequences, where the structure of one protein has already been resolved. Our claim is that similarity of chemical shifts thereby found implies structural similarity of the respective proteins.
Results: We present an algorithm to identify structural similarities of proteins by aligning their associated chemical shift sequences. To evaluate the correctness of our predictions, we propose a benchmark set of protein pairs that have high structural similarity, but low sequence similarity (because with high sequence similarity the structural similarities could easily be detected by a sequence alignment algorithm). We compare our results with those of HHsearch and SSEA and show that our method outperforms both in >50% of all cases.
Contact: Simon.Ginzinger{at}bio.ifi.lmu.de
| 1 INTRODUCTION |
|---|
|
|
|---|
NMR spectroscopy is one of the most important methods for resolving structures on an atomic level and has been successfully applied to macromolecules such as proteins. Several problems arise on the way from the NMR experiment to the full determination of the 3D coordinates of the structure. One of them is the interpretation of the so-called chemical shifts. These are known to inherently carry structural information. It is a difficult task to determine the topology of the protein from the chemical shift data alone. This is therefore usually done by incorporating (human) expert knowledge, in combination with modelling tools and additional experimentsa time-consuming process that may take up to several months.
We present an approach (called SimShift) to identifying structural similarities among two proteins by searching for similarities in the associated chemical shift sequences. This is done by computing an alignment of the two sequences, the so-called shift-alignment. The shift-alignment algorithm will be presented in Section 3.
The justification of our approach can be seen as follows. The chemical shift for a certain nucleus is influenced by its environment. This is due to the fact that surrounding electron clouds induce a local magnetic field which adds to or subtracts from the field applied in the NMR experiment. This results in different shifts for different environments. In a protein the chemical shift of an atom is influenced by the type of amino acid it is part of and the electron clouds of atoms which are close in space due to the tertiary structure of a protein. The influence of the amino acid type may be removed by subtracting the so-called random-coil shifts, which give an average value for the chemical shift of a specific atom in a certain amino acid. These normalized chemical shift-values are called secondary shifts. For more information on chemical shifts and NMR in general we refer the reader to Levitt (2001).
We will empirically prove the claim that similarity of the shift sequences implies similarity of the respective structures. To do so, in Section 2 we define a benchmark set which we show to be hard for structure prediction. This set consists of pairs of proteins which have high structural similarity (measured by the the MaxSub-score of their best superposition) but low sequence similarity.1 In analogy to the definition of structural correctness we define sequence similarity as the MaxSub-score of the superposition of the residue pairs assigned by an amino acid sequence alignment algorithm. By performing the steps presented in Section 2 it is possible to generate hard test sets for the evaluation of protein prediction methods in general. Since the scientific community lacks a clearly defined method for deriving benchmark sets for such a task, this can be viewed as another contribution of our research.
We show in Section 4 that SimShift is capable of detecting non-trivial structural similarities. SimShift is always better than a mere alignment of secondary structure elements (SSEA), and beats HHsearch (a method that uses both primary and secondary structure information) in >50% of all cases. This shows that our alignment quality is situated in the gap between methods that make use of sequence and secondary structure information and high quality structurestructure alignments.
To our knowledge there is only one approach which aims at inferring structural information at this early stage of an NMR-experiment, namely TALOS (Cornilescu et al., 1999). The TALOS approach predicts backbone torsion angles from chemical shifts and sequence information by making use of a database of high quality X-ray structures and resonance assignments. This works roughly as follows. A sliding window of length 3 is used to partition the input sequences into triples. For each such a triple the database is then searched for similar triples (in terms of sequence and shifts) for which the torsion angles are known, and the best 10 matches are selected. On basis of their agreement the
and
angles are either calculated or the prediction is declined. If any, TALOS gives very accurate predictions, which is the case for
40% of all residues in the author's benchmark set (or 2/3 after optimization by a human expert).
SimShift is different from TALOS in the sense that it searches for similar structures rather than predicting the structure. It thus enables the construction of a crude model of the query structure, even if there is no exact match in the database. This is the reason why we are not bound to using high-resolution X-ray structures as templates, as it is the case for TALOS. Indeed, any structure for which coordinates and chemical shifts are available may be used for comparison. A comparison against TALOS reveals that the SimShift alignments result in significantly better
- and
-angle predictions for about half of the targets in our test set.
A typical application of SimShift will be as follows: having measured the NMR shift data of a protein with unknown tertiary structure, the obtained sequence is compared with a database of resolved proteins for which shift data are also available. The unresolved protein is likely to be similar in structure to a protein whose shift-values align well. So the 3D model of the aligned residues of the known protein is a good starting point for resolving the new structure. We are convinced that SimShift can significantly speed up the process of determining protein structures, as obtaining chemical shifts is a much faster procedure than resolving a protein structure completely.
| 2 SELECTION OF THE BENCHMARK SET |
|---|
|
|
|---|
The selection of the benchmark set is sketched in Figure 1. In the following it is explained in more detail.
|
2.1 Databases used
All protein entries from BMRB (Seavey et al., 1991) for which N- and C
-shifts are available are used.2 From 3175 BRMB-entries of the proteins/peptide class, 1217 contained chemical shift-values for N and C
for at least 80 residues.3 Apart from the information on chemical shifts, each BMRB-entry also contains the corresponding amino acid sequence, though their source (i.e. the protein where the sequence was taken from) is not given in a regular way, sometimes even missing. To identify protein structures corresponding to these amino acid sequences a BLAST-search (Altschul et al., 1997) against the sequences from the ASTRAL (Chandonia et al., 2004) database is conducted for each BMRB-entry. If the full BMRB sequence can be matched without gaps against an ASTRAL sequence, the corresponding ASTRAL structure is assigned to the BMRB-entry. After this selection procedure we ended up with 585 structures that contain chemical shift-values for N and C
and could be mapped to an ASTRAL-entry. As some entries in BMRB match more than one sequence in ASTRAL, one representative structure has to be chosen. This is accomplished using the AEROSPACI score (Chandonia et al., 2004) provided for each ASTRAL entry. The main contribution to the AEROSPACI score comes from the resolution of the corresponding protein. Higher scores represent better resolutions. Therefore, if more than one sequence from ASTRAL matches one BMRB sequence, the structure with the highest AEROSPACI score is chosen. In general, chemical shift data are not available for every residue in the structure. The matched structures are cut at the beginning and the end to remove overhanging ends, for which no chemical shift data are available.
To calculate the secondary structure assignment (needed in Phase 3 of the algorithm), STRIDE (Heinig and Frishman, 2004) was run on all structures in ASTRAL. Via the mapping the assignment was transfered to the corresponding residues in each BMRB-entry.
2.2 Evaluating the structural correctness of alignments
We will often need to measure the structural correctness of an alignment. The following procedure is always used:
- Extract pairs of aligned residues from the alignment.
- Extract the coordinates of the C
atoms from the tertiary structures.
- Superimpose the two point sets using the superposition algorithm from Kabsch, 1978.
- Calculate the MaxSub-score (Siew et al., 2000) using a certain distance threshold.
2.3 Defining a benchmark set
Our aim is to show the algorithm's ability to identify structural similarities in pairs of proteins where sequence similarity is low. To create a test set with these properties we first compute all possible combinations of our 585 structures from Section 2.1. Now we select pairs that fulfill the following constraints.
- Low predictability from the sequence We calculate a BLAST pairwise alignment for all pairs. If a BLAST alignment has been found, we evaluate the structural correctness of the alignment with the method from Section 2.2. We keep all pairs that either do not have a BLAST alignment or whose BLAST alignment has a MaxSub-score
0.05, where the distance threshold is 3.5 Å.
- Existence of structural similarity The pairs to be finally used should have some detectable structural similarity, despite their low sequence similarity. To identify such protein pairs we calculate CE-alignments for all remaining test pairs with the method presented by Shindyalov and Bourne (1998). The correctness of the alignment is again evaluated as described in Section 2.2. We keep those pairs with a MaxSub-score > 0.4, where the distance threshold is 5.0 Å.
The set of pairs that passed these two criteria consists of 2548 pairs which are built from 417 structures. Their average number of residues is 117, with a standard deviation of 38 (rounded). About one-fourth of these structures (124) were solved through X-ray diffraction. Their average resolution is 1.82 Å.
| 3 THE SHIFT-ALIGNMENT ALGORITHM |
|---|
|
|
|---|
For the algorithm presented in this section we will be using the shift-values of the C
- and the N-atoms (from the backbone of the protein). The algorithm takes two amino acid sequences s = s1
sn and t = t1
tm and the shift-values of the respective C
- and the N-atoms and returns a list of aligned amino acids. This is done in the three phases that are explained in subsections 3.1 to 3.4.
3.1 Phase 1: calculation of the shift-difference matrix
For the shift-values of the C
-atoms we compute the distance for each possible pairing of the amino acids and store the result in a shift-difference matrix MC
, i.e. for all 1
i
m and 1
j
n we calculate
![]() |
is the secondary shift of the C
-atom of residue i in sequence t. The shift-difference matrix MN is computed accordingly for the
-values.4 The secondary shift-values are obtained using the random coil shifts from Wishart et al. (1995). For the 15N random-coil shift in Proline (not available from Wishart et al., 1995) we take the results from Braun et al. (1994).
3.2 Phase 2: find good blocks
We now wish to find a set of blocks {bh} that represents good local alignments (without gaps) of substrings from s and t. More formally, a block b is defined as a triple (i, j, k) with 1
i
m k + 1 and 1
j
n k + 1, where the intended meaning is that ti
ti + k 1 aligns with sj
sj + k1. For simplicity of notation, for a given block b = (i, j, k) we define the block extents Xmin(b) = j, Ymin(b) = i, Xmax(b) = j + k 1 and Ymax(b) = i + k 1.
Two restrictions are placed on these blocks. Firstly, they should fulfill a minimum length criterion, so we require that k
l for some minimum length l. Secondly, all aligned amino acids should have similar shift-values for both the C
- and the N-atom, i.e. Mc
[i+p][j+p]
C
and MN[i + p][j + p]
N for all 0
p < k. We further require the extent of blocks be maximal, i.e. Mc
[Ymin(b) 1][Xmin(b) 1] >
C
or MN[Ymin(b) 1] [Xmin(b) 1] >
N and likewise for the other end of the block (Xmax(b) + 1, Ymax(b) + 1). The values
C
and
N are called cutoff-parameters. A graphical depiction of this concept can be found in Figure 2. For the rest of this section we denote by n' the number of blocks that have been found in Phase 2.
|
3.3 Phase 3: concatenation of blocks
In this step multiple local alignments are concatenated to a global alignment consisting of more than one block. To find the best global alignment, a positive score (representing the block's global correctness) is first associated with each block. To calculate this score we re-use the idea of secondary shifts; here, however, we normalize not only according to the amino acid type, but also according to the protein's secondary structure. This increases the influence of long-range interactions on the score. For normalization we calculate z-scores using the averaged ß-strand, random-coil and
-helix shifts and the according standard deviations from (Wang and Jardetzky, 2002). To calculate a specific block score, we calculate the N and C
differences between the normalized chemical shifts and sum them over all residue pairs in the block. A more formal definition of the block score is given in the following paragraph.
Defining
to be the chemical shift value that is normalized according to secondary structure and amino acid type, we set
![]() |
![]() |
The effect of this formula is that the pair scores are inverted and moved to the positive range to associate the highest score with the best pair. Using these scores we apply the following algorithm to identify the highest scoring block chain.
For a given set of blocks {bh} a block chain B is defined as a non-overlapping sequence of blocks bh1,...,bh1 where non-overlapping means that
and
for all 1
q < r. The block score is extended in a natural way to block chains by setting
. The optimal block chain is the one with maximal score. Figure 3 shows an example. Further, the block extents are extended to block chains by defining Ymin(B) to be the minimum of all Ymin's in B, and likewise for the other extents Ymax, Xmin and Xmax.
|
Algorithm 1 is used to compute the optimal block chain for a set of n' blocks. It is a straightforward adaption of the algorithm in Joseph et al. (1992).5 We note that with an efficient implementation of the set D the running time of Algorithm 1 is O(n' log n'). As phases 1 and 2 can both be implemented in O(nm), the total running time of the shift-alignment algorithm is O(nm + n' log n'). Although this term may be as high as O(nm log (nm)), it will be O(nm) even for slightly restrictive choices of the parameters l,
C
and
N, because with few blocks calculated in Phase 2 the n' log n'-term is asymptotically less than the nm-term.
3.4 Parameter optimization
The performance of the algorithm is highly dependent on the choice of the minimum length restriction l and the cutoff parameters
C
and
N. Since it is hard to predict which combination of the three values yields the best alignments, we try all combinations of the three parameters in the range l
{3,4,...,17} and
C
,
N
{2.0,2.5,3.0,...,10.0}. The average difference between the MaxSub-scores of SimShift and CE is used for ranking the parameter combinations.
Several combinations of the parameters yield equally good scores. Among those we inspected some by hand and found a clear influence of the parameters on the RMSD and the length of the alignment: lowering
C
and
N and increasing l yields shorter alignments with a better RMSD, whereas raising the cutoffs and lowering the minimal block length yields longer alignments with a higher RMSD. In the following, we use the cutoff values
N = 5.5,
C
= 3.5 and a minimum blocks length of 12, which seems to be a reasonable tradeoff between specificity and sensitivity. In practice one might start searching a database of secondary shift sequences with a target sequence using very strict parameters. If no satisfactory answer is found one might start loosening parameters. That way some similarity to another structure may be found in any case; however, one has to keep in mind that the probability of achieving a correct structural alignment declines.
| 4 RESULTS |
|---|
|
|
|---|
We compare SimShift to SSEA (Fontana et al., 2005) and HHsearch (Söding, 2005). The former is used to rule out the possibility that SimShift does a mere alignment of secondary structure elements, the latter because it is a state-of-the-art method that incorporates both sequence and secondary structure information. We use the benchmark set from Section 2 for comparison.6 In each pair of this set, one structure is used as a template, the other as a target. For SSEA and HHsearch, the secondary structure of the target is computed by PSIPRED (Jones, 1999), whereas for SimShift it is computed by the method from Wishart et al. (1992). In a second comparison we used the predictions from Wishart et al. (1992) for all methods. Since this yielded worse results for both SSEA and HHSearch, we only show the plots based on the PSIPRED predictions in the following. The secondary structure of the template is calculated by STRIDE for all methods. We only compare alignments where all three methods produce a non-empty result. Therefore the number of pairs is reduced to 1373. For the comparison a MaxSub score with a distance threshold of 5.0 Å is used.
Figure 4 shows the percentage of pairs where SimShift is better than SSEA (top line) or HHsearch (bottom line). SimShift is substantially better than SSEA, revealing that the method achieves more than a mere alignment of secondary structure elements. Regarding the comparison with HHsearch, note that in the region where the pairs do not have a global structural similarity (0.4
CE MaxSub < 0.58), SimShift is significantly better. However, with rising structural similarity, HHsearch plays off its strength.
|
To reveal the gain of quality one can achieve using chemical shifts in addition to primary and secondary structures we further investigate those cases where SimShift is better than both of the other methods. The results can be seen in Figure 5 (bottom 3 lines). It shows that the average MaxSub score of the alignments made by our method is much better than the average scores of SSEA's and HHsearch's alignments. In fact, SimShift is about 3 times better than SSEA, and about twice as good as HHsearch.
|
We further plot the average MaxSub scores of the respective CE-alignments in the graph of Figure 5 (top line). It is interesting to see that it is much lower than the average score of the whole segments on the x-axis. This emphasizes the fact that SimShift is especially useful in cases where structural similarity is not very high and the performance of other methods decreases.
4.1 Comparison with TALOS
As TALOS predicts backbone angles rather than producing an alignment, it was impossible to include it in the tests of the previous section. However, because the true structure of our targets is known, we can compare the RMSDs of the backbone angles of the best alignment produced by SimShift to the predictions made by TALOS. Of all 363 different targets from our final benchmark set, 178 have a better RMSD for both the
- and the
-angles. The average RMSD-difference for those where SimShift is better is 18.18° for
, and even 45.58° for
. This shows that SimShift can be useful for assisting the structure resolution process even in the presence of TALOS.
| 5 DISCUSSION |
|---|
|
|
|---|
We aimed at answering the question: Is it possible to create structurally correct alignments from chemical shift data alone, when sequence similarity is low? We argued that this is indeed the case. Through the comparison with other methods we also motivated that information about long-range interactions can be extracted from chemical shift data and may be used to create structurally meaningful alignments.
The shift data used here are derived from the BMRB which is known to contain high quality as well as low quality entries. We were interested in the performance of our approach on experimental data, we therefore did not include any intermediate processing steps. Additionally, because there is only a limited number of proteins with associated chemical shifts, it is not advisable to reduce this set even more by restricting oneself to confirmed high quality entries. As the performance presented here was achieved on shifts probably containing erroneous data, one can expect even more accurate alignments when using curated shift data.
What has been presented is a first step towards automating the structure determination process with NMR spectroscopy. Secondary shift alignments can be a useful tool for the spectroscopist who starts searching a database of chemical shifts before performing additional experiments. If similarities can be identified a model for the protein of interest may be created. Through the comparison with NOE maps, e.g. it is possible to validate (or invalidate) the model.
There is still some work to do towards automating structure determination. Future steps will include improving the choice of the best block chain through additional algorithmic efforts and by designing a P-value which evaluates the probability of the correctness of a secondary shift alignment. The latter will enable us to develop a prediction method from the alignment algorithm that has been presented here.
|
| Acknowledgments |
|---|
We want to thank Volker Heun for helpful discussions and suggestions. The first author also wants to thank Manfred J. Sippl and Robert Konrat for introducing him to the problem.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Anna Tramontano
1We choose the MaxSub-score as a measure of structural correctness since it is a good trade-off between the RMSD and the number of aligned residues. ![]()
2The snapshot of the BRMB-database was taken on February 22, 2005. ![]()
3This threshold was chosen to exclude BMRB-entries consisting of just one secondary structure element. ![]()
4We stress the fact that the shift-difference matrices are only of conceptual nature and need not be calculated explicitly. However, they facilitate the understanding of the algorithm. ![]()
5In line 11 of Algorithm 1 we corrected a slight mistake in the original algorithm (Joseph et al., 1992) by comparing Ymin(Bc) with Ymin(bj) instead of Ymax(bj). ![]()
6This is admissible because the cutoff parameters for SimShift were optimized against the CE-alignment (see Section 3.4), whereas here we compare to different methods. ![]()
Received on August 25, 2005; revised on November 18, 2005; accepted on November 27, 2005
| REFERENCES |
|---|
|
|
|---|
-
Altschul, S.F., et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res, . 25, 33893402
[Abstract/Free Full Text] . -
Braun, D., et al. (1994) Sequence-corrected 15N random coil chemical shifts. Am. Chem. Soc, . 116, 84668469[CrossRef].
-
Chandonia, J.M., et al. (2004) The ASTRAL Compendium in 2004. Nucleic Acids Res, . 32, (Database issue) 189192
[Abstract/Free Full Text] . -
Cornilescu, G., et al. (1999) Protein backbone angle restraints from searching a database for chemical shift and sequence homology. J. Biomol. NMR, 13, 289302[CrossRef][Web of Science][Medline].
-
Fontana, P., et al. (2005) The SSEA server for protein secondary structure alignment. Bioinformatics, 21, 393395
[Abstract/Free Full Text] . -
Heinig, M. and Frishman, D. (2004) STRIDE: a web server for secondary structure assignment from known atomic coordinates of proteins. Nucleic Acids Res, . 32, (Web Server issue) 500502[CrossRef].
-
Jones, D.T. (1999) Protein secondary structure prediction based on position-specific scoring matrices. J. Mol. Biol, . 292, 195202[CrossRef][Web of Science][Medline].
-
Joseph, D., et al. (1992) Determining DNA sequence similarity using maximum independent set algorithms for interval graphs. Proceedings of the SWAT 1992, , Helsinki, Finland , pp. 326337.
-
Kabsch, W. (1978) A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallogr, . A34, 827828[CrossRef].
-
Levitt, M.H. (2001) Spin dynamics. Basics of Nuclear Magnetic Resonance, , Wiley, Chichester, UK.
-
Seavey, B.R., et al. (1991) A relational database for sequence-specific protein NMR data. J. Biomol. NMR, 1, 217236[CrossRef][Medline].
-
Shindyalov, I.N. and Bourne, P.E. (1998) Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng, . 11, 739747
[Abstract/Free Full Text] . -
Siew, N., et al. (2000) MaxSub: an automated measure to assess the quality of protein structure predictions. Bioinformatics, 16, 776785
[Abstract/Free Full Text] . -
Söding, J. (2005) Protein homology detection by HMM-HMM comparison. Bioinformatics, 21, 951960
[Abstract/Free Full Text] . -
Wang, Y. and Jardetzky, O. (2002) Probability based protein secondary structure identification using combined NMR chemical-shift data. Protein Science, 11, 852861[CrossRef][Web of Science][Medline].
-
Wishart, D.S., et al. (1992) The chemical shift index: a fast and simple method for the assignment of protein secondary structures through NMR spectroscopy. Biochemistry, 31, 16471651[CrossRef][Medline].
-
Wishart, D.S., et al. (1995) 1H, 13C, and 15N random coil NMR chemical shifts of the common amino acids. J. Biomol. NMR, 5, 6781[CrossRef][Web of Science][Medline].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||







