Skip Navigation


Bioinformatics Advance Access originally published online on September 25, 2007
Bioinformatics 2007 23(21):2829-2835; doi:10.1093/bioinformatics/btm406
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
23/21/2829    most recent
btm406v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Sweredoski, M. J.
Right arrow Articles by Baldi, P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sweredoski, M. J.
Right arrow Articles by Baldi, P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© 2007 The Author(s)
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Minimizing the overlap problem in protein NMR: a computational framework for precision amino acid labeling

Michael J. Sweredoski 1,3, Kevin J. Donovan 2,3, Bao D. Nguyen 2, A.J. Shaka 2,3 and Pierre Baldi 1,3,*

1Department of Computer Science, 2Department of Chemistry and 3Institute for Genomics and Bioinformatics, University of California, Irvine, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 APPENDIX: EXACT SOLUTION USING...
 ACKNOWLEDGEMENTS
 REFERENCES
 

Motivation: Recent advances in cell-free protein expression systems allow specific labeling of proteins with amino acids containing stable isotopes (15N, 13 C and 2H), an important feature for protein structure determination by nuclear magnetic resonance (NMR) spectroscopy. Given this labeling ability, we present a mathematical optimization framework for designing a set of protein isotopomers, or labeling schedules, to reduce the congestion in the NMR spectra. The labeling schedules, which are derived by the optimization of a cost function, are tailored to a specific protein and NMR experiment.

Results: For 2D 15N-1H HSQC experiments, we can produce an exact solution using a dynamic programming algorithm in under 2 h on a standard desktop machine. Applying the method to a standard benchmark protein, calmodulin, we are able to reduce the number of overlaps in the 500 MHZ HSQC spectrum from 10 to 1 using four samples with a true cost function, and 10 to 4 if the cost function is derived from statistical estimates. On a set of 448 curated proteins from the BMRB database, we are able to reduce the relative percent congestion by 84.9% in their HSQC spectra using only four samples. Our method can be applied in a high-throughput manner on a proteomic scale using the server we developed. On a 100-node cluster, optimal schedules can be computed for every protein coded for in the human genome in less than a month.

Availability: A server for creating labeling schedules for 15N-1H HSQC experiments as well as results for each of the individual 448 proteins used in the test set is available at http://nmr.proteomics.ics.uci.edu.

Contact: pfbaldi{at}ics.uci.edu

Supplementary information: Supplementary data are available at Bioinformatics online.


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 APPENDIX: EXACT SOLUTION USING...
 ACKNOWLEDGEMENTS
 REFERENCES
 
Protein NMR spectra are often limited by what could be termed spectral congestion, the inability to discern individual resonance peaks with either certainty or clarity. For a 2D [15N,1H] heteronuclear single-quantum correlation (HSQC) experiment (Bodenhausen and Ruben, 1980) a large, structured protein will display a spectrum rich with peaks. This crowding arises because there is one peak expected from each NH spin pair in the molecule, and there are many such pairs. Spectral congestion, exacerbated by line broadening from the slower tumbling rates of larger proteins, can produce a spectrum that is impossible to assign.

One solution is to use 3D and 4D spectra to improve the resolution, but these high-dimensional experiments typically incur a penalty in both instrument time and absolute sensitivity. Another avenue is to employ higher magnetic field strengths B0, at 800 or 900 MHZ 1H resonance frequency. However, access to these state-of-the-art instruments is limited and their cost is a barrier to widespread adoption.

An obvious third way to tackle spectral congestion is simply to reduce the number of resonance peaks. There are at least two ways to achieve this. The first is to exploit some property of the spin systems, using a pulse sequence that selects certain ‘spin topologies’ (Levitt and Ernst, 1985) while rejecting most others. Broadly known as editing methods, it is difficult to design a set of edited spectra that (i) are each sufficiently simple; and (ii) preserve the entire information content when considered in aggregate.

A second way to reduce the number of peaks is to reduce the number of NMR-active nuclei. This method has been employed in selective 1H-methyl group labeling (Rosen et al., 1996) of otherwise perdeuterated proteins. Selective 1H-methyl group labeling of the branched chain amino acids alanine, valine, leucine and isoleucine ({gamma}2 only) is possible by overexpression in 2H2O using protonated pyruvate as the sole carbon source.

A more advanced, precise and controlled biochemical manipulation is now possible by dispensing with cell-based recombinant protein expression altogether, assembling just the relevant protein synthesis machinery itself, as a carefully formulated extract, and then conducting the protein synthesis in a test tube, supplying individual amino acids, the gene that codes for the protein of interest, and an energy source (Kudlicki et al., 2007). These in vitro coupled transcription-translation protein expression systems have become the focus of increased research attention over the last 5 years, as it has become clear that the methodology has a number of telling advantages compared to expression in Escherichia coli or other living systems (Kigawa et al., 2004; Klammt et al., 2004; Koglin et al., 2006; Morita et al., 2004; Shi et al., 2004; Staunton et al., 2006).

Recently, a particular cell-free expression system has been shown to provide fast, efficient protein expression for use in NMR experiments (Keppetipola et al., 2006). The central feature is the ability to supply to the protein synthesis reaction mixture only certain designated subsets of labeled amino acids, for instance, containing 15N, while all other amino acids are 14N isotopomers, and so are not observed in a 2D or 3D NMR experiment. The absence of facile amino acid scrambling, which can defeat attempts to label only a subset of the amino acids in cell-based systems, is one important advantage of cell-free protein expression. Such lessened cross-talk is made possible by the absence of aminotransferase activity found in whole living organisms, allowing specific labeling by amino acid type (Kigawa et al., 1995, 1999), although not yet by position.

Much progress has already been achieved with the new power of cell-free protein expression, and spectra have been simplified by focusing on only certain peaks of interest. For example, selectively labeled samples have been used to assign some resonances in a congested spectral region of a protein containing a large number of {alpha}-helical regions (Trbovic et al., 2005).

Other efforts have used selectively labeled samples for the assignment of peaks from a single residue type (Kainosho and Tsuji, 1982; Yabuki et al., 1998).

Other efforts in selective labeling have aimed at complete backbone assignment by way of a number of selectively labeled samples, employing selective labeling schemes in conjunction with a specific assignment strategy, including a standard one used by an auto-assignment program (Zimmerman et al., 1997). Otting and coworkers devised a combinatorial labeling strategy that uses five separate samples: each residue is labeled in one, two, or three of the samples, and the residue types are assigned based on the presence and absence of peaks (Ozawa et al., 2006; Wu et al., 2006). Another novel labeling strategy used samples that are partially labeled, the different intensities between residue types then enabling peak assignment based on relative peak intensity (Parker et al., 2004). This approach can be used to identify up to 16 residue types from five different samples.

Here, we present an alternative and more comprehensive approach to selective labeling. We use the term precision labeling to denote any sample labeled by amino acid type with known amino acids that are nearly 100% enriched at specified positions, jointly or separately; there must be no amino acid scrambling and the sample must not be a mixture of isotopomers. The scheme is completely general, applying to carbon-13, nitrogen-15 and/or doubly labeled samples.

Our approach is unique because it is not wedded to a specific assignment strategy nor does it rely on peak intensity for assignment; rather it can be quite generally applied to try to produce optimum spectra for any protein. We will refer to a labeling schedule as the set of instructions that describe how many samples to prepare and how to isotopically label amino acids in each sample. Samples prepared according to the optimum labeling schedule should be maximally free and clear of peak overlap and therefore of the most utility for trouble-free assignment. In addition, constraints gleaned from the labeling schedule reduce the set of residues a peak could be mapped to during the assignment process. By creating decongested spectra, our approach overcomes one of the major hurdles in resolving the structure of larger molecules at lower magnetic field strengths.


    2 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 APPENDIX: EXACT SOLUTION USING...
 ACKNOWLEDGEMENTS
 REFERENCES
 
2.1 Formalization of the scheduling problem as an optimization problem
We will refer to the problem of finding an optimal labeling as Optimal Scheduling for Protein NMR Spectra (OSPNS). We use an integer programming formulation to describe the problem of OSPNS. We assume a fixed number of samples k isin Formula k ≥ 2. Keep in mind that we will later iterate over various values of k, where k ≤ 20 to find an appropriate balance between the number of overlaps and the number of samples produced. Additionally, we let Formula be the set of all the naturally occurring amino acids. The cost function Formula


Formula 1

(1)
is optimized over the 40 · k binary variables Nb,l and Cb,l


Formula 2

(2)


Formula 3

(3)

The definition of the overlap function Formula varies with the NMR experiment so we leave the details for the following section. Finally, we must include a set of constraints that ensure all peaks are observable in at least one spectrum.


Formula 4

(4)

It should be noted that there is some degeneracy in the labeling schedules. The sample numbers could be permuted without changing the value of the cost function Formula .

While the recent advances in cell-free protein expression now allow chemists to select in principle how the amino acids are isotopically labeled, in practice the full palette of isotopic possibilities is not yet routinely available. In addition, by independently labeling the carbon and nitrogen, we are doubling the number of variables in the optimization equation, making the problem of finding a true optimal solution computationally challenging. We will define additional constraints to create two subproblems that work around these limitations. Additional constraints can be created to deal with any additional availability issues.

Subproblem A assumes that one does not need to label any carbons in the samples. We can therefore set Cb,l = 1 {forall}b isin Formula ,1 ≤ l ≤ k. We can then rewrite the constraint from Equation (4) as


Formula 5

(5)
and minimize the new cost function Formula


Formula 6

(6)
With the additional constraints in subproblem A, we create a quadratic programming problem, thus reducing the complexity from the general problem.

Subproblem B allows for the carbons in a particular amino acid class to be labeled, but only if the corresponding nitrogen is labeled as well. Subproblem B lets us perform multiple HNCA and HNCO experiments with doubly labeled amino acid samples that are currently available as off-the-shelf items. The addition of the constraint in Equation (7) to the general problem defines subproblem B.


Formula 7

(7)

2.2 Definition of the overlap function Formula
Now that we have specified each problem, we return to the definition of Formula , which is specific to each experiment. Our basic motivation is to calculate the number of possible overlapping peaks for any combination of amino acid labeling. To find the value of Formula , we must first define the following set of variables.


Formula 8

(8)


Formula

Note that TE is dependent on several factors including the magnetic field strength and protein size. We empirically measured TE by looking at various NMR spectra. Throughout this paper, N refers to the backbone nitrogens, H refers to the hydrogens bonded to the backbone nitrogens, and C (without subscripts) refers to the carbonyl carbons in the backbone.

For an HSQC experiment, we define Formula (b,c) as the number of times H'-N' cross-peaks from residues in amino acid class b overlap with H'-N' cross-peaks from residues in amino acid class c. Here, Rb is the set of indices of all the residues in amino acid class b.


Formula 9

(9)


Formula

In both definitions of Formula (b,c,d,e) for HNCO and HNCA experiments, we assume that Formula (b,c,d,e) = 0 if b = d and c = e. In addition, Rb,c consists of the set of indices of residues in amino acid class c that are immediately preceded by a residue in amino acid class b. Sb,c consists of the set of indices of residues in amino acid class c if c = b, otherwise Sb,c is the empty set.

For an HNCO experiment, we define Formula (b,c,d,e) as follows.


Formula 10

(10)


Formula

For an HNCA experiment, we define Formula (b,c,d,e) as follows.


Formula 11

(11)


Formula

2.2.1 Example sequence
Suppose we have the following amino acid sequence: GSTYHLDVVS. For an HSQC experiment, we must first calculate the various values of Rb (e.g. RG = {1}, RV = {8,9} and RS = {2,10}). The HSQC overlap function for some of various amino acid combinations are as follows.


Formula

For an HNCO experiment, we first calculate the various values of Rb,c (e.g. RG,S = {2}, RV,S = {10} and RT,T = {emptyset}) and then calculate the overlap functions. The following examples illustrate some of the possible combinations.


Formula

For an HNCA experiment, we first calculate the various values of Rb,c and Sb,c (e.g. RG,S = {2}, RV,S = {10}, RD,V = {8}, RV,V = {9}, SD,V = {emptyset} and SV,V = {9}) and then calculate the overlap functions. The following examples illustrate some of the possible combinations.


Formula

Definitions of Formula (b,c,d,e) can be made for other NMR experiments following the same logic used for HSQC, HNCO and HNCA experiments. If HNCA experiments are used for constructing a backbone assignment, it would be more beneficial to concentrate our efforts on producing a set of decongested H-N spectra. To achieve this within our framework, one would simply drop the Formula terms from Equation (11). Notice that in the Equations (9–11GoGo) the true number of overlaps are used, but an approximation can also be employed. In the following section, we develop an estimate for the overlap function.

Estimation of the overlap function Formula
We must estimate the number of overlaps between each amino acid class when presented with a new protein for which no NMR data has previously been acquired. One simple approximation would be to assume that each residue has roughly the same probability of overlap with another residue, independent of the amino acid class. However, we know that certain factors such as amino acid chemical structure and secondary protein structure affect the location of chemical shifts in the spectrum. From the corresponding statistics listed at RefDB (Zhang et al., 2003) database, we model the chemical shifts distributions with Gaussian distributions. The Gaussian models accurately describe the locations of the chemical shifts and are the standard distributions used in the RefDB and Biological Magnetic Resonance Data Bank (BMRB) (Seavey et al., 1991) databases.

Given the distributions of the chemical shifts for the designated nuclei and the assumption that the distributions are independent, we can calculate the probability that the chemical shifts will overlap. For any backbone atom E, we can model the E-chemical shift of the ith residue by a Gaussian distribution Formula where Formula can be conditioned on features including amino acid class and secondary structure of residue i. The more information we have about the location of each residues chemical shifts, the greater the accuracy of the overlap function and therefore the fewer the number of overlaps we will observe in the spectra. In our tests, µi and {sigma}i are the mean and SD of the E-chemical shifts of all residues in the RefDB that are in the same amino acid class and secondary structure of residue i.

We can then model the difference between the distribution of E-chemical shifts of residues i and j as Formula . The probability of overlap is therefore stated as follows.


Formula 12

(12)

Optimization of the cost function Formula
Optimization techniques such as simulated annealing (Kirkpatrick et al., 1983) and genetic algorithms (Holland, 1962, 1975) can find good solutions to the general problem, but there is no guarantee of optimality. The quality of the solutions for both methods depends on several parameters, including how long they are allowed to run. We attempted to optimize the cost function with off-the-shelf solutions, but we could not get satisfactory results. Finding solutions to subproblem A, which is quadratic in nature, took much longer than what would acceptable.

However, we are able to compute exact optimal solutions to subproblem A using a dynamic programming algorithm in under 2 h on a standard desktop machine. Since subproblem A involves the precision labeling of either the carbons or the nitrogens and the uniform labeling of the other, there are half as many variables over which to optimize. By taking advantage of this reduction in the size of the search space, we develop a dynamic programming algorithm that is optimal and whose run time is independent of the size of the protein. Note that even in the most general case, the run-time for evaluating the cost function is independent of the protein size and number of dimensions assuming the overlap function is precomputed. The details of the dynamic programming algorithm are given in the Appendix.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 APPENDIX: EXACT SOLUTION USING...
 ACKNOWLEDGEMENTS
 REFERENCES
 
We first demonstrate our approach using the known HSQC spectrum of calmodulin (Qian et al., 1998) to illustrate the expected gains in clarity that can be expected by taking a systematic approach to minimizing peak overlap. Calmodulin is a well-known benchmark protein in the field of protein NMR and it is large enough that the NMR spectrum on lower-field instruments would be challenging to analyze by standard methods. Using our dynamic programming algorithm for subproblem A, we are able to calculate optimal schedules for all possible number of samples in under 2 h.

We generate the labeling schedule using both the true and estimated overlap function for a HSQC experiment. The regenerated 2D spectra of the four precision-labeled samples generated with both the true and estimated overlap functions are presented in Figures 1 and 2. It should be noted that overlaps in the H-N spectrum are listed in the original assignment because the original experiment collected data in the N, H, C{alpha}, Cß and C dimensions.


Figure 1
View larger version (14K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. 15N-1H HSQC spectra of calmodulin (Qian et al., 1998) generated using our dynamic programming algorithm with the true cost function Figure 1. The red sample includes A, R, N and D. The green sample includes C, Q, E, G and H. The blue sample includes I, L, K, F and S. The purple sample includes M, T, W, Y and V. The overlaps according to our criteria are between: E6 (8.99, 121.7) and D118 (8.97, 121.7), R30 (7.94, 120.5) and K77 (7.94, 120.4), L32 (8.62, 121.0) and L105 (8.62, 120.9), T34 (8.10, 118.2) and E87 (8.11, 118.2), S38 (8.14, 119.4) and M124 (8.14, 119.5), A46 (8.27, 120.5) and L48 (8.27, 120.5), A73 (8.52, 121.3) and E82 (8.52, 121.3), M76 (7.95, 119.6) and A127 (7.94, 119.5), H107 (8.16, 119.6) and M124 (8.14, 119.5), R126 (8.42, 119.2) and V142 (8.43, 119.2).

 

Figure 2
View larger version (13K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. 15N-1H HSQC spectra of calmodulin (Qian et al., 1998) generated using our dynamic programming algorithm with an estimated cost function Figure 2. The red sample includes R, N, Q, I and F. The green sample includes D, H, L, S and Y. The blue sample includes A, C, K, M and T. The purple sample includes E, G, W and V.

 
In the Supplementary Material online, we present a graph that compares the number of overlaps as a function of the number of samples for calmodulin. The Rayleigh criteria (cross-peaks overlap if chemical shifts in each dimension are within half a peak width), with nitrogen peak widths of 0.3 p.p.m., hydrogen peak widths of 0.04 p.p.m. and carbon alpha peak widths of 0.25p.p.m., is used to identify overlaps. With calmodulin, we noticed that the schedules calculated using the estimated number of peak overlaps does not match the performance of the optimal schedules calculated using the known true peak locations. This is due to the imprecision in the estimation of the number of overlaps. The degree of imprecision in our estimates is indicated by the error bars in the comparisons of the overlap functions in Figures 3–6GoGoGo.


Figure 3
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 3. Relative percent congestion in the H spectrum for the set of 448 proteins. Schedules are optimized with a true overlap function, an estimated overlap function, an equal number of residues per sample and an equal number of amino acids per sample.

 

Figure 4
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 4. Relative percent congestion in the N spectrum for the set of 448 proteins. Schedules are optimized with a true overlap function, an estimated overlap function, an equal number of residues per sample and an equal number of amino acids per sample.

 

Figure 5
View larger version (26K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 5. Relative percent congestion in the C {alpha} spectrum for the set of 448 proteins. Schedules are optimized with a true overlap function, an estimated overlap function, an equal number of residues per sample and an equal number of amino acids per sample.

 

Figure 6
View larger version (29K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 6. Relative percent congestion in the HSQC spectra for the set of 448 proteins. Schedules are optimized with a true overlap function, an estimated overlap function, an equal number of residues per sample and an equal number of amino acids per sample.

 
To determine the number of samples to use, we must take into account both the cost of creating additional samples and the calculated percent congested (number of overlaps/max number of overlaps). In general, we noticed that 4–5 samples were typically enough for our purposes. Given explicit costs for creating samples and costs for peak overlaps in the spectra, the exact number of samples that would minimize the total cost could be derived easily.

We compare the resulting spectra from our schedules to the spectra resulting from the labeling schedule of Ozawa et al. (2006) and Wu et al. (2006). According to our overlap criteria at 500 MHZ, we observe four overlaps in the four spectra simulated with our schedule using an estimated overlap function and six overlaps in the five spectra that would be produced with the schedule of Ozawa et al. (2006) and Wu et al. (2006). In all fairness, the goal of the schedule in Wu et al. is just to allow identification of the amino acids, and not necessarily to decongest the spectrum. Additionally, there are no performance criteria cited in Ozawa et al. (2006) or Wu et al. (2006) relevant to our problem. Note that the schedule in Wu et al. is developed for the average amino acid composition of proteins and not tailored to specific proteins. It should be noted that since our method is the first to directly tackle the problem of spectral congestion, there are no fair comparisons to other methods.

With only 10 overlaps in the HSQC spectrum of calmodulin, one could argue that a computer is not needed to calculate a schedule if we already know which peaks overlap. However, for cases where there are many overlaps, or we are using an estimated overlap function with real numbers, rather than integers, the problem would be nearly impossible solve by hand.

To further address these issues we have also computed schedules for a set of 448 proteins found in the BMRB (Seavey et al., 1991). The proteins in the test set have 50 or more residues, have N, H and C {alpha} chemical shifts for 90% of its residues and have been rereferenced in the RefDB (Zhang et al., 2003). For each of these 448 proteins, we computed schedules using (a) the true overlap function; (b) an estimated function using secondary structure predicted by the SSpro software (Pollastri et al., 2002) in the SCRATCH server suite (Cheng et al., 2005) and (c) an overlap function that assumes the probability of overlap between any two residues is constant. Additionally, 10 random schedules are generated for each protein with approximately the same number of amino acid classes labeled in each sample. Because there is not an exact mapping between the proteins in the BMRB and the PDB at the residue level, we could not prepare schedules using the true secondary structure for all the proteins in our set. On a subset where a reasonable mapping could be created, the results using schedules developed with true secondary structure knowledge were nearly identical to the schedules developed with the predicted secondary structure.

After computing the optimal schedules according to the cost functions we evaluate the schedule derived from true cost function using the relative percent congested [i.e. (number of overlaps – min number of overlaps)/(max number of overlaps – min number of overlaps)]. In Figures 3–5GoGo we look at the performance of the various schedules to resolve overlaps in the H spectrum, the N spectrum and the C{alpha} spectrum averaged over the set of 448 proteins. The error bars represent the SD in the performance.

In the H spectrum, we see that there is little gain using an estimated overlap over a schedule that assumes constant probability of overlap between residues. This is to be expected because the H chemical shifts are mostly independent of amino acid class and secondary structure. In the N spectrum and C {alpha} spectrum, we see more benefit from using an estimated overlap function. We are able to achieve 17.5% relative congestion in the N spectrum and 13.2% relative congestion in the C {alpha} spectrum using only four samples. This is a significant improvement over the performance of both the random schedules (20.9 and 21.0%, respectively) and constant probability of overlap schedules (18.9 and 18.9%, respectively).

In addition to looking at the spectra of individual backbone atoms, we also look at the performance of the schedules on resolving multidimensional spectra. However, issues arise when trying to measure the performance of the labeling schedules on high-dimensional spectra. Unless selective labeling is used, the chemical shifts in the BMRB that would compose a high-dimensional spectrum, such as a HNCA experiment, must already be resolved or they would not have been listed in the database. With this in mind, we must limit our evaluations to the HSQC spectra of the set of 448 proteins. The results from the HSQC spectra as well as the results from the C {alpha} spectra provide proof that our method can work in higher dimensions.

For the HSQC spectra, we use the same evaluation methods as with the individual backbone atom spectrum and we observe roughly the same trends in Figure 6. One peculiarity of this figure is that with one sample, the proteins are ~80% relatively congested. This can be explained by the proteins in the test set that are already as decongested as they can be with respect to the HSQC spectrum. On the set of 448 proteins, we observe an average reduction of the relative percent congestion to 15.1% in four samples using our estimated overlap function. This is in comparison to the 56.1% relative congestion in the schedule developed by Wu et al. achieved in five samples.

While the results for randomly splitting the amino acids into two or three equal sized groups nearly matches that of our algorithm using the estimated overlap function, we see a more dramatic increase in performance when more than four samples are used.

Finally, our methods have been implemented in a web server located at http://nmr.proteomics.ics.uci.edu for user to produce schedules for minimally congested HSQC spectra. The user supplies the primary sequence and optionally the secondary structure. If the user does not supply the secondary structure, a secondary structure prediction made with SSpro is used. The optimized labeling schedules are then emailed back to the user.


    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 APPENDIX: EXACT SOLUTION USING...
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this article, we have provided a systematic way to design labeling schedules to decongest NMR spectra using precision-labeled samples. The method is a prime example of how to leverage the amino acid labeling that is now possible via cell-free protein expression.

By finding an optimal schedule using our computational framework, we have shown a dramatic increase in spectral resolution for not only the benchmark protein calmodulin (Qian et al., 1998), but also on a curated set of 448 proteins listed in the BMRB. Using four samples on a 500 MHZ HSQC spectra of calmodulin, the number of peak overlaps drops from 10 to 1 if the true cost function is utilized and from 10 to 4 if the cost function is estimated. In addition, our method is able to reduce the relative percent congestion by 84.9% in four samples using our estimated overlap function.

While this article has mainly focused on HSQC experiments, the approach is quite general and can be applied to any of the usual NMR experiments for backbone assignment. Future work will include adding the option to compute schedules for additional NMR experiments using our web server. Additionally, we will study how to incorporate the collected spectra into an integrated backbone assignment method.

The framework we have developed in this article provides a high-throughput method for assigning the backbone chemical shifts of proteins on a proteomic scale by allowing lower field, less expensive NMR spectrometers to run in parallel on larger structures. To help facilitate the high-throughput methods, we provide a web server that implements our method to produce decongested 15N-1H HSQC spectra. The simplified and clarified decongested spectra can be combined with the additional constraints gleaned from the labeling schedule in automated assignment programs thus streamlining the backbone assignment process. It is our hope and belief that our method for developing labeling schedules in conjunction with advances in cell-free protein expression can help usher in a new generation of high-throughput NMR spectroscopy studies.


    APPENDIX: EXACT SOLUTION USING DYNAMIC PROGRAMMING
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 APPENDIX: EXACT SOLUTION USING...
 ACKNOWLEDGEMENTS
 REFERENCES
 
The formulation of the dynamic programming algorithm requires that we augment the cost function Formula in Equation (6) with several additional variables. The new cost function Formula S,l,m operates on the subset of amino acids Formula {subseteq}Formula and range of samples specified by the integer variables l, m such that 1 ≤ l ≤ m ≤ k. We can then define Formula Formula ,l,m as


Formula 13

(13)
Note that Formula is the same as the cost function Formula in Equation (6). In addition, we can add an offset variable Formula to Formula without changing the cost so long as Formula

The following equality, which shows that the minimum of Formula Formula , l, m is equal to the minimum of two subproblems, allows us to make use of a dynamic programming algorithm.


Formula 14

(14)

Finding an optimal solution requires the computation of a cost matrix Formula , where Formula e is the eth subset of Formula and Formula . The base row of our matrix, when l = m = 1, can be computed easily as


Formula 15

(15)

Recursive computation of the second row of the matrix uses the costs from the first row.


Formula 16

(16)

We can then recursively compute successive rows of the matrix using the costs from the prior rows.


Formula 17

(17)

The sets Formula f and Formula g should be saved for each e and i for reconstruction of the optimal schedule using backtracking.

Once we have computed the entire matrix P, we must reconstruct the labeling schedule that gave us the optimal cost computed in P[k,2|Formula |]. We start with the assumption that each amino acid is only labeled once (i.e., Formula ). To construct our optimal schedule we will set Nb,i = 1 for some Formula and all b isin Formula f and set Nc,j = 1 for some Formula and all c isin Formula g where Formula f and Formula g were the sets that gave us the optimal cost for P[k,2|Formula |]. We then proceed to split the amino acids in Formula f into the two sets that gave us the optimal cost for Formula , knowing that we will label one set of the amino acids in the first half of the samples allotted and the other set in the second half of the samples allotted.

We continue our construction of the optimal schedule by recursively partitioning the amino acids and samples until we reach the base case where there is only one sample allotted for labeling a set of amino acids. At this point, we have each amino acid labeled in one sample and we have constructed a labeling schedule with an optimal cost.

The run time of this algorithm is independent of the protein length and is only dependent on the number of different amino acids, which is fixed at 20 if we only include the naturally occurring amino acids. The run time is therefore constant assuming the overlap function is precomputed.


    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 APPENDIX: EXACT SOLUTION USING...
 ACKNOWLEDGEMENTS
 REFERENCES
 
Work supported by an NIH grant (GM-66763) and a UC Discovery Grant bio05-10533 to A.J.S., and a Laurel Wilkening Faculty Innovation award, a Microsoft Faculty Research Award, an NIH Biomedical Informatics Training grant (LM-07443-01) and an NSF MRI grant (EIA-0321390) to P.B.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Burkhard Rost

Received on April 21, 2007; revised on July 6, 2007; accepted on August 6, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 METHODS
 3 RESULTS
 4 CONCLUSION
 APPENDIX: EXACT SOLUTION USING...
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Bodenhausen G, Ruben DJ. Natural abundance N-15 NMR by enhanced heteronuclear spectroscopy. Chem. Phys. Lett., ( (1980) ) 69, : 185–189.[CrossRef][ISI].

    Cheng J, et al. Scratch: a protein structure and structural feature prediction server. Nuc. Acids Res., ( (2005) ) 33, : 72–76..

    Holland J. Outline for a logical theory of adaptive systems. J. Assoc. Comput. Mach., ( (1962) ) 9, : 297–314..

    Holland J. Adaptation in Natural and Artificial Systems, ( (1975) ) Ann Arbor, MI: University of Michigan Press..

    Kainosho M, Tsuji T. Assignment of the three methionyl carbonyl carbon resonances in streptomyces subtilisin inhibitor by a carbon-13 and nitrogen- 15 double-labeling technique. A new strategy for structural studies of proteins in solution. Biochemistry, ( (1982) ) 21, : 6273–6279.[CrossRef][Medline].

    Keppetipola S, et al. From gene to HSQC in under five hours: high-throughput NMR proteomics. J. Am. Chem. Soc., ( (2006) ) 128, : 4508–4509.[CrossRef][ISI][Medline].

    Kigawa T, et al. Cell-free synthesis and amino acidselective stable isotope labeling of proteins for NMR analysis. J. Biomol. NMR, ( (1995) ) 6, : 129–134.[ISI][Medline].

    Kigawa T, et al. Cell-free production and stable-isotope labeling of milligram quantities of proteins. FEBS Lett., ( (1999) ) 442, : 15–19.[CrossRef][ISI][Medline].

    Kigawa T, et al. Preparation of escherichia coli cell extract for highly prodcutive cell-free protein expression. J. Struct. Funct. Genomics, ( (2004) ) 5, : 63–68.[CrossRef][Medline].

    Kirkpatrick S, et al. Optimization by simulated annealing. Science, ( (1983) ) 220, : 671–680.[Abstract/Free Full Text].

    Klammt C, et al. High level cell-free expression and specific labeling of integral membrane proteins. Eur. J. Biochem., ( (2004) ) 271, : 568–580.[ISI][Medline].

    Koglin A, et al. Combination of cell-free expression and NMR spectroscopy as a new approach for structural investigation of membrane proteins. Magn. Reson. Chem., ( (2006) ) 44, : S17–S23.[CrossRef][ISI][Medline].

    Kudlicki W, et al. Cell-Free Protein Expression, ( (2007) ) Austin, TX: Landes Bioscience..

    Levitt MH, Ernst RR. Multiple-quantum excitation and spin topology filtration in high-resolution NMR. J. Chem. Phys., ( (1985) ) 83, : 3297–3310.[CrossRef].

    Morita E, et al. A novel way of amino acid-specific assignment in 1H-15N HSQC spectra with a wheat germ cell-free protein synthesis system. J. Biomol. NMR, ( (2004) ) 30, : 37–45.[CrossRef][ISI][Medline].

    Ozawa K, et al. 15N-labelled proteins by cellfree protein synthesis: strategies for high-throughput nmr studies of proteins and protein-ligand complexes. FEBS J., ( (2006) ) 273, : 4154–4159.[CrossRef][Medline].

    Parker M, et al. A combinatorial selective labeling method for the assignment of backbone amide NMR resonances. J. Am. Chem. Soc., ( (2004) ) 126, : 5020–5021.[CrossRef][ISI][Medline].

    Pollastri G, et al. Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins, ( (2002) ) 47, : 228–235.[CrossRef][ISI][Medline].

    Qian H, et al. Sequential assignment of 1H, 15N, 13C resonances and secondary structure of human calmodulin-like protein determined by NMR spectroscopy. Protein Sci., ( (1998) ) 7, : 2421–2430.[Abstract].

    Rosen MK, et al. Selective methyl group protonation of perdeuterated proteins. J. Mol. Biol., ( (1996) ) 263, : 627–636.[CrossRef][ISI][Medline].

    Seavey B, et al. A relational database for sequence-specific protein NMR data. J. Biomol. NMR., ( (1991) ) 1, : 217–236.[CrossRef][Medline].

    Shi J, et al. Protein signal assignments using specific labeling and cell-free synthesis. J. Biomol. NMR., ( (2004) ) 28, : 235–247.[CrossRef][ISI][Medline].

    Staunton D, et al. Cellfree expression and selective isotope labelling in protein NMR. Magn. Reson. Chem., ( (2006) ) 44, : S2–S9.[CrossRef][ISI][Medline].

    Trbovic N, et al. Strategy for the rapid backbone assignment of membrane proteins. J. Am. Chem. Soc., ( (2005) ) 13504–13505..

    Wu P, et al. Amino-acid type identification in 15N-HSQC spectra by combinatorial selective 15N-labeling. J Biomol NMR, ( (2006) ) 34, : 13–21.[CrossRef][ISI][Medline].

    Yabuki T, et al. Dual amino acid-selective and site-directed stable-isotope labeling of the human c-Ha-Ras protein by cell-free synthesis. J. Biomol. NMR, ( (1998) ) 11, : 295–306.[CrossRef][ISI][Medline].

    Zhang H, et al. RefDB: A datbase of uniformly referenced protein chemical shifts. J. Biomol. NMR, ( (2003) ) 25, : 173–195.[CrossRef][ISI][Medline].

    Zimmerman D, et al. Automated analysis of proteinnmr assignments using methods from artificial intelegence. J. Mol. Biol., ( (1997) ) 269, : 592–610.[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrowOA All Versions of this Article:
23/21/2829    most recent
btm406v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Google Scholar
Right arrow Articles by Sweredoski, M. J.
Right arrow Articles by Baldi, P.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Sweredoski, M. J.
Right arrow Articles by Baldi, P.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?