Computational Genomics
Genetic code symmetry and efficient design of GC-constrained coding sequences
1 School of Physics, Tel Aviv University Ramat-Aviv, Tel-Aviv 69978, Israel
2 Goldyne Savad Gene Therapy Institute, Hadassah Medical Center Kiryat Hadassah, Jerusalem 91120, Israel
3 School of Computer Science, Tel Aviv University Ramat-Aviv, Tel-Aviv 69978, Israel
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Cloning of long DNA sequences (4060 bases) into phage display libraries using polymerase chain reaction (PCR) is a low efficiency process, in which PCR is used to incorporate a DNA insert, coding for a certain peptide, into the amplified sequence. The PCR efficiency in this process is strongly affected by the distribution of GC bases in the amplified sequence. As any DNA insert coding for the target peptide may be attempted, there is a flexibility in choosing part of the amplified sequence. Since the number of inserts coding for the same peptide is exponential in the peptide length, a computational problem naturally arisesthat of efficiently finding an insert, whose parameters are optimal for PCR cloning.
Results: The GC distribution requirements are formulated as a search problem. We developed an efficient, linear time one pass algorithm for this problem. Interestingly, our algorithm strongly relies on an interesting symmetry, which we observed in the standard genetic code. Most non-standard genetic codes examined possess this symmetry as well, yet some do not. We generalize the search problem and consider the case of a non-standard, or arbitrary, genetic code where this symmetry does not necessary hold. We solve the generalized problem in polynomial, but nonlinear, time.
Availability: An implementation of the proposed algorithm is available upon request from the authors.
Contact: benny{at}cs.tau.ac.il
| 1 INTRODUCTION |
|---|
|
|
|---|
Efficient cloning of long DNA sequences (4060 bases) into phage display libraries, using polymerase chain reaction (PCR), is important for constructing non-random, directional phage libraries. The PCR efficiency in this process depends, among other factors, on the base composition of the amplified sequence. In particular, it depends on the following two attributes of the base composition [see Sambrook and Russell (2001); Rychlik (1993)]: (1) The reaction efficiency increases when the GC content (namely, the proportion of GC bases in the sequence) is restricted to the range 4060%, or narrower. (2) Consecutive GC intervals in the sequence, such as GCGCGC, decrease efficiency. More generally, long consecutive intervals of the bases G and C (GC runs), such as GGGCCC, should be avoided. In the present work, we explore the problem of optimizing the sequence coding for a peptide with respect to the two parameters given above, in order to improve its amplification by PCR. We refer to constraints on these parameters as the GC constrains.
In common applications of PCR there is no control over the base composition of the sequence to be amplified. The situation is different, however, when PCR is used to introduce specifically designed short DNA inserts into the amplified sequence. These inserts usually code for peptides, which are cloned when the amplified sequence is translated. This is the case when a DNA insert is introduced into a bacteriophage plasmid, so that the expressed peptide is cloned with the phage. In this procedure, the peptide to be displayed is given, but the DNA insert itself is not. We can thus exploit the degenerate nature of the genetic code in order to design better sequences, coding for the given target peptide, see Skiena (2001).
In most cases, the number of DNA inserts coding for the target peptide (henceforth, DNA realizations of the peptide) is exponential in the peptide length. Among them, we seek a realization adhering to specific GC constrains, which optimize the efficiency of the PCR process. This is manifestly a computational problem. Beyond fairly short peptide lengths, brute force search becomes infeasible, and an efficient algorithm is required.
This paper is organized as follows. In Section 2 we introduce a symmetry in the standard genetic code, which is crucial for our efficient solution. To the best of our knowledge, this symmetry has not been noticed before. In Section 3, we formulate the computational problem. Using the mentioned symmetry, we reduce it to a string problem for which we present a linear-time solution in Section 4. In Section 5, we consider an arbitrary (not necessarily symmetric) genetic code and discuss the problem in this generalized scenario.
| 2 GENETIC CODE SYMMETRY |
|---|
|
|
|---|
The nucleotides along the double-stranded DNA are arranged in complementary pairs, with a different number of pairing hydrogen bonds in each of the two possible pairs. The nucleotides Adenosine (A) and Thymidine (T) form two hydrogen bonds with each other and are called weak nucleotides. Cytidine (C) and Guanosine (G) form three hydrogen bonds with each other and are therefore called strong nucleotides. Given a coding nucleotide sequence, which is a string
, the induced weak/strong sequence is a string
, where each A or T is replaced by w and each C or G is replaced by s. We say that
is the weak/strong signature, or simply the signature, of S. For example, the signature of the nucleotide sequence ATC GCT is the weak/strong sequence wws ssw. Replacing codons in the standard genetic code with their signatures, we observe a useful symmetry. We start with two examples:
- Consider the amino acid Valine. Its set of codons is {GTA, GTT, GTG, GTC}, and the induced set of signatures is {sww, sws}. Here, the leftmost and middle positions are fixed, as they must be s and w, respectively. However, we are free to choose either w or s for the rightmost position, which makes it variable.
- Consider the amino acid arginine. Its set of codons is {CGA, CGT, CGC, CGG, AGA, AGG}, and the set of four signatures is {ssw,sss,wsw,wss}. In all four signatures the middle position is fixed to s. On the other hand, both the left and right positions are independently variable : for any signature created by assigning w or s to the left position, and w or s to the right position, there is an arginine codon with that signature.
We can thus abbreviate the set of signatures of arginine simply as *s*, where any * can be independently replaced by either s or w. This single string abbreviation fully describes the set of arginine's signatures and is called the representation of the set of codons of arginine. Similarly, the representation of the set of codons of valine is sw*.
DEFINITION. Letbe a representation. A specification of S is a string
, obtained from S by assigning either w or s to every * in S.
For example, all the specifications of *ws are wws and sws.
DEFINITION. Letbe a set of signatures, and let
be a representation. We say that S represents A if
. That is, any signature in A is some specification of S, and any possible specification of S is an element of A.
DEFINITION. Letbe a set of codons, and let
be a representation. We say that S represents D if it represents the set of signatures of the codons in D. In this case we say that D is representable.
For example, the set {AAA, TTC, TAG} is represented by ww*. Using terms from the two examples above, representability means that all the signature positions (left, middle and right) are either fixed or independently variable. The set of Arginine's codons is represented by *s*, but this does not imply that any nucleotide in {A, T, C, G} can replace any *, but rather that each specification of *s* is the signature of at least one of Arginine's codons.
Not every set of codons is representable. Consider, for example, the set {AAA,CCC}. Generally, in a random partition of {A, T, C, G}3 to disjoint subsets, representable subsets would be a rare event.
The Symmetry Observation. Let
be the standard genetic code, namely the partition of {A, T, C, G}3 to 20+1 subsets, where D1, ... , D20 are the sets of codons of the 20 amino acids and D21=Dstop is the set of stop codons. Then D1, ... , D20 are representable, while Dstop is not.
By the symmetry observation, we can define a function that maps the set of 20 amino acids into the set of representations,
. This mapping is depicted in Table 1.
|
Note that the middle position is fixed in all the amino acids. In order to use the symmetry observation, we will need to represent whole peptides as strings over
. We thus need the following.
DEFINITION. Suppose thatrepresents the set of codons of the amino acid Ai. We say that S represents Ai. Generally, suppose that
is concatenation of representations and
is a peptide, where each Ai is an amino acid represented by S(i). We say that S is the representation of the peptide A.l
We examined the non-standard genetic codes appearing in Wheeler et al. (2000) and Benson et al. (2000), based on Osawa et al. (1992) and Jukes and Osawa (1993), for symmetry. Among the 22 codes examined, 17 are symmetric, i.e. they have the property that all sets of amino acid codons are representable. The five remaining codes are not symmetric. So, non-symmetric codes can be found in nature. In each of them, the symmetry is violated by a single amino acid. Some examples of these unrepresentable amino acids are shown in Table 2. The Yeast Mitochondrial code, described in Bonitz et al. (1980), is an interesting case: it is not symmetric due to the set of Threonine codons, which is
(8 codons). This set is clearly not representable. We shall return to the non-standard codes in our concluding discussion.
|
| 3 THE RATE & RUN PROBLEMS |
|---|
|
|
|---|
The problem of finding a realization of a given peptide, satisfying specific GC constrains, can be formulated as follows. We are given a length N peptide, an integer run length limit d > 0 and a ratio
. Our goal is to find a DNA sequence of 3N nucleotides such that- the DNA sequence codes for the given peptide,
- the DNA sequence has exactly m strong nucleotides and
- the longest run of consecutive strong nucleotides in the sequence is no longer than d.
If no such sequence exists, the algorithm should so declare. We call this problem the nucleotide rate & run (R&R) problem. Since R&R is only concerned with the weak/strong structure of realizations of the given peptide, it is desirable to reduce it to a problem about signatures.
The symmetry observation makes such a reduction possible, as each amino acid can be mapped to a unique representation preserving its weak/strong structure. To solve the R&R problem, it then suffices to consider the induced sequence of representations. Furthermore, the only degrees of freedom in the problem are the variable positions, and our problem actually reduces to deciding which w/s assignment to make at those variable positions. In other words, we are looking for a certain specification of the input representation sequence.
Let us define the R&R problem again, this time using signatures. The input is a string
, which is the representation of the peptide of length N in question, and the integers d > 0 and m
0. A solution is a string
, which is a specification of S, such that
has at most d consecutive s, and
was obtained from S by turning exactly m*' into s. We call it the signature R&R problem.
| 4 AN EFFICIENT ALGORITHM |
|---|
|
|
|---|
The R&R problem involves a global condition (overall rate of strong nucleotides) as well as a local one (maximal run length). It may seem at a first glance that the we must try all possible realizations for the given peptide. A rough estimate for the number of possibilities to check is 3N on the average, where N is the number of amino acids in the input peptide (the exact number depends on the sizes of the codons' sets). A conservative estimate shows that such 3N exhaustive algorithm is feasible up to N
20. As current phage display technologies employ N up to 30, and larger N are likely to be used in the future, an efficient and exact solution for the R&R problem is of practical importance. In this section we develop a linear time one pass algorithm, which solves the signature R&R problem while running in O(N) steps. This makes it possible to optimize DNA inserts coding for peptides and proteins of any conceivable length.
4.1 Notations
Let
. The indices i where Si equals * are called the variable indices of S, and their number is denoted variable(S). Let
(i) be an enumeration of the variable indices, namely
is strictly increasing, and
.
Next we consider assigning weak or strong values to the variable positions. When
is assigned to the variable index
(i) of S, the resulting sequence is denoted by
![]() |
obtained by assigning s to exactly k variable indices of S (and assigning w to the rest) by
, formally defined as
![]() |
having no runs longer than d will be denoted BoundedRun(S,d).
We will also be interested in the maximal number of variable positions in S that can be assigned the value s while still satisfying the run length limit, d. Such a maximum obviously exists, and we define
![]() |
realizing this maximum will be called a strong specification of S with respect to the run length limit d.
4.2 Algorithm outline
Our algorithm is based on the concept of a strong specification. Consider the input representation in an instance of the signature R&R problem. We prove that given a strong specification of this input representation, we can (1) decide whether a solution exists in O(1) steps and (2) find a solution in O(N) steps (in case one exists). We conclude by showing that a single greedy pass, done in O(N) steps, is enough to obtain a strong specification of the input representation.
4.3 Continuity
Consider a representation S and some specification of it,
, such that
for some run length limit d. Suppose
. By changing
into w, we do not create any new runs, and particularly not runs longer than d. It is therefore obvious that
. Now suppose that the specification
above is a strong specification of S with respect to the run length limit d. Then in
, Strong(S, d) * of S are changed into s. By changing s appearing in
to w at variable indices of S, we can obtain a specification of S changing any smaller number of * in S into s. The above claim states that they will all be in BoundedRun(S, d). We have thus proved the following simple yet important
LEMMA. If kStrong(S, d), then Closure(S, k)
BoundedRun(S, d) is non-empty.
This property can be thought of as continuity of the solution. It enables us to reduce the signature R&R problem to the problem of finding a strong specification of S as follows. Finding a strong specification will establish the value of Strong(S, d). Then, if k > Strong(S, d) there is no solution, while if k
Strong(S, d) a solution exists, due to the lemma (see the example below). In the latter case, any strong specification found can be efficiently changed to yield a solution to the R&R problem.
4.4 Scan procedure for finding strong specification
The procedure for finding a strong specification for a representation S with respect to a run length limit d takes a greedy approach. We look for the first variable position (from the left) to which s can be assigned while satisfying the run length limit. Each variable position that cannot be assigned s is assigned w. For the purpose of run length calculation during the intermediate stages, any variable position not yet assigned (i.e. any *) is treated as a weak (w) position. We scan the representation from left to right, assigning values to all the variable positions of S in this manner. In order to check for violation of the run length limit, we keep a record of the number of contiguous s to the left of the current position and keep scanning to determine the number of contiguous s to the right. This procedure examines each position at most twice, and hence requires O(N) steps.
4.5 Example
Before going into the correctness proof details, let us give a concrete demonstration of the R&R algorithm. Consider the peptide Arg-Val-Trp-Leu-Leu-Ile, with the constrains GC-content 8/18
0.44 (m = 8) and run length limit d = 2. First, we concatenate the representations of the amino acids in this peptide (Table 1) and obtain the representation of the peptide, *s* sw* wss* w* *w* ww*. Next, we find a strong specification by scanning this representation from left to right, assigning s wherever the run length limit is not violated. The scan steps are depicted in Table 3.
|
The strong specification, obtained at the last step, has 10 strong nucleotides and no run longer than d = 2. Since 10 > 8, by the continuity lemma, the GC constraints are satisfiable for m = 8, d = 2, so a solution exists. Any specification with two fewer assignment of s will do. We arbitrarily set the two rightmost variable indices to w. This yields sswswswsswwsswwwww. Finally, we convert it back to nucleotides. It is evident from Table 1 that for most amino acids the back conversion from specification to codon is not uniquely determined. It is possible to apply here other considerations, such as Codon Usage, when choosing the conversion. For example, the following conversion is a solution to the above problem instance: CGA-GTG-TGG-TTG-CTA-ATT.
4.6 Correctness of the scan procedure
THEOREM. The scan procedure, described above, creates a strong specification of S with respect to the run length limit d.
PROOF. By induction on the number of * in S, variable(S). The procedure, when preformed on a representation S, creates a specification of S. Let us denote this specification by(S). As the run length limit d is fixed, Strong(S) will mean Strong(S, d) throughout the proof.
If variable(S) = 0, then
(S) = S. This is a strong specification of S, as required. Let us further examine the case where variable(S) = 1. If
, then Strong(S) = 1 and hence
is a strong specification of S. On the other hand, if
, then Strong(S) = 0 and
is again a strong specification of S.
Assume that the claim is correct for any representation
satisfying
, and let
such that variable(S) = m. Suppose
(
) is the first variable index, from the left, assigned w in
(S). Then
violates the run length limit, but
does not. We partition S to three substrings around S
(
) and denote
![]() |
The run length count is reset by any w, so we have
. Since
, by the induction hypothesis,
is a strong specification of S(i) for i = 1, 2. Therefore,
(S) assigns s to
variable indices of S.
Assume, towards a contradiction, that
(S) is not a strong specification of S. Let
be a strong specification of S. We partition T around T
(
), as before, and denote
. Note that T(i)is a specification of S(i) satisfying the run length limit, for i = 1, 2. As T is a specification, either (1)
or (2)
. We now show that both (1) and (2) are not possible.
- If
, then Strong (S) variable indices of S(1) and S(2) are assigned s in T. Therefore, T(i) is a specification of S(i) assigning s to more variable indices than
for either i = 1 or i = 2. But
and
are both strong specifications, a contradiction.
- Suppose
. Since
violates the run length limit, the number of variable indices that are assigned s in T(1)
s is at most the number of variable indices in that are assigned s in
. Thus, to have assigned more s in
than in
, we must have that more s are assigned in T(2) than in
. But
is a strong specification of S(2), a contradiction.
These contradictions show that
(S), the result of the described procedure, is indeed a strong specification of S.
The PCR cloning-efficiency criterion assumed, focusing on GC-content and maximal GC run length, is obviously a simplification. The interdependence between GC-distribution and PCR efficiency must be more complex. For example, it is thermodynamically plausible that under some run length limitation, many GC runs with only a little AT space between them are worse than a few well-spaced GC runs.
| 5 MARTIAN GENETIC CODES |
|---|
|
|
|---|
We mentioned earlier non-standard genetic codes, which may not be symmetric. Can the rate & run problem be solved efficiently for a genetic code without symmetry? To illustrate this scenario, suppose that we are faced with the similar problem of efficient cloning for a Martian species, whose genetic code is different.
5.1 Scenario
The genetic code of this species still consists of the four nucleotides {A, C, T, G}, where A, T are weak nucleotides and C,G are strong ones. The amino acid codons are all of the same length,
1. This generalized genetic code is therefore a partition of {A, C, T, G}
into disjoint subsets, each corresponding to a different amino acid. Note that the number of amino acids (subsets in the partition) need not be 20.
5.2 Problem
Let us now define the analogy of the rate & run problem for this scenario. We are given a generalized genetic code, a peptide of length N, and GC constrains: an upper bound on the run length, d > 0, and a target rate of strong nucleotides
. We seek a realization of the input peptide, which satisfies these GC constrains.
5.3 Algorithm
As the input genetic code is arbitrary, the algorithm cannot assume any symmetry, and we lose the continuity of the solution, upon which the linear-time R&R algorithm relied. A slower and more complex approach is called for. We describe a dynamic programming algorithm for the problem, which runs in polynomial time in the genetic code description length (4
) plus the input peptide length (N). Note that the algorithm is not polynomial in the input peptide length alone. In general, we expect any algorithm to go over each codon in the generalized genetic code at least once.
Our algorithm is based on the efficient solution to the Unary Subset Sum and Partition problems, see Garey and Johnson (1976). Denote the input peptide sequence by A1A2, ... , AN, where the Ais are amino acids. We build a table with m + 1 rows, numbered 0, ... , m, and N columns, labelled A1, ... , AN. Every column Ai is partitioned into subcolumns, each labelled with one of the codons of Ai.
As a simple example, consider our familiar standard genetic code (
= 3), and the following problem instance: N = 3, the peptide is
, target amount of strong nucleotides m = 6 and run length limit d = 3. Table 4 is the initial table for this example problem instance.
|
A codon subcolumn in the column Ai will have an entry in row number j if there is a realization of A1, ... , Ai with a total of j strong nucleotides, in which the codon of that subcolumn codes for Ai. Part of the entry itself will be the length of the rightmost run of strong nucleotide in this realization. For example, if the rightmost nucleotide is weak, this length is 0. In the process of filling up the table, row numbers represent the number of strong nucleotides accumulated so far in the realization.
The algorithm begins by filling the column A1 of the table as follows. For each codon of A1 with a total of h strong nucleotides, ending with t
consecutive strong nucleotides (e.g. for GTG, h = 2 and t = 1), we write t in row number h. We then proceed by filling the Ai column (i
2), using the entries in the Ai1 column. The filling procedure is identical for each codon subcolumn of Ai. Denote this codon by
and suppose that it has h strong nucleotides. For any entry t in row j of any subcolumn ß of Ai1, we check if concatenating
to a run of length t violates the run length limit d. If it does not, we calculate the new rightmost run length (with
concatenated) and write it in row number h + j of
's subcolumn (as mentioned, row numbers represent the cumulative amount of strong nucleotides) along with the name of ß. As is usually the case with dynamic programming approaches, the stored name of ß acts as a back pointer, which will later enable us to trace a solution (in case one exists). If several entries belong to the same cell, an entry with smallest rightmost run length will be written.
Returning to our simple example with the standard genetic code, when the algorithm fills Table 4, Table 5 is produced.
|
The existence of the solution can be read from the filled table. The target rate k corresponds to the target amount m of strong nucleotides in the realization. If some subcolumn of the last amino acid (AN) column has an entry in the target row m, then a solution exists. For example, in the above table a solution exists only for m = 3, 4, 5. If a solution exists, we use the entries in the filled table to find one (it need not be unique). We do it by tracing back a route, which led to the target row. For each amino acid from AN back to A1, the trace codon names, written by the algorithm in this column, are the codon(s) which can code for the previous amino acid in a solution.
Finally, let us bound the time complexity for the table-filling algorithm. Each column in the table is m + 1 cells high. To fill a certain subcolumn in column Ai, the algorithm scans all cells of the Ai1 column, fewer than m·4
cells. Scanning a cell takes O(d) steps and there are fewer than N·4
subcolumns in the table altogether, so filling the entire table takes less than
. The generalized genetic code table size is C·4
, where C is the maximum description length of one codon. Since the genetic code table is part of the input, the input length is
. A rough bound on the run time for an input of length n is thus
.
| 6 DISCUSSION |
|---|
|
|
|---|
The observation of symmetry in the standard genetic code played a crucial role in our linear-time solution for the R&R problem. The probability that a random genetic code (a randomly chosen partition of
) will be symmetric is negligible. However, the genetic code is obviously not random and is believed to have been shaped by several constrains, most of which we still do not fully understand. How surprising is the genetic code symmetry? More precisely, if we assume a minimal set of basic properties, satisfied by the standard genetic code, are symmetric codes a likely coincidence?1 It is reasonable to assume that genetic codes, which underwent evolution, are robust with respect to single-nucleotide mutations. The robustness criterion is usually defined as the total number of single-nucleotide mutations, of any codon in the code, which change the amino acids the codon codes for. The standard genetic code appears to be near-optimal in this sense, as shown by Freeland et al. (2000) using computer simulations. However, robustness with respect to single-nucleotide mutations alone does not imply symmetry. An interesting example is the non-symmetric, yet robust, yeast mitochondrial genetic code (mentioned in Section 2).
Another reasonable assumption to make is n-fold degeneracy of a large portion of the code. In all known genetic codes, most codon sets have the well-known 4-fold or 2-fold degeneracy of the rightmost nucleotide. A 4-fold degenerate set has the form {xyA, xyT, xyC, xyG} and is clearly representable. Furthermore, as transitions (mutations among Purines {G, A}, or among Pyrimidines {C, T}) are much more likely than transversions (PurinePyrimidines mutations), a 2-fold degenerate set is only mutation-robust if its codons end with either the Purines or the Pyrimidines. Indeed, all 2-fold-degenerate codon sets (in both standard and non-standard genetic codes) have either the form {xyA, xyG} or {xyT, xyC} and are thus representable. It follows that under the mutation-robustness assumption, all single, 2-fold, 3-fold and 4-fold-degenerate codon sets are representable. This accounts for the representability of most of the amino acids' sets in the standard code. The exceptions are Arginine, Leucine and Serineall having a set of six codons, composed of a 4-fold-degenerate subset and a 2-fold-degenerate subset. Since each of these subsets is (separately) representable, everything depends on the correspondence between these subsets' representations.
A simple calculation shows that if we fix a 4-fold-degenerate set and randomly choose (with equal probabilities) a mutation-robust, 2-fold-degenerate subset among the remaining 60 codons in {A, T, C, G}3, the probability that the resulting six-codon union will be representable is 11/15.
Suppose that our sample space of allowed genetic codes consists of codes similar to the standard code, i.e. are mutation-robust, have 17 single, 2-fold, 3-fold or 4-fold-degenerate amino acids, and three amino acids with a codon set being a union of a 4-fold-degenerate subset and a 2-fold-degenerate subset. It follows that a random genetic code in this sample space will be symmetric with probability
0.39. It is thus possible that the symmetry of the standard genetic code is not a direct result of evolutionary forces acting on the code, but rather a likely consequence of other properties (mutation-robustness and n-fold degeneracy), which were directly imposed by evolutionary forces. However, it is interesting to note that among 18 non-standard codes (mentioned in Section 2), which possess these properties, only 1 is not symmetric (the other 4 non-symmetric codes violate the n-fold degeneracy; see examples in Table 2). This fact may suggest that symmetry has, in fact, been favoured during the evolution of genetic codes.
| Acknowledgments |
|---|
The authors like to thank Assaf Alon (Weizmann Institute of Science, Rehovot), Jonathan Gershoni, Mia Horowitz and Metsada Pasmanik-Chor (Tel Aviv University) for the illuminating discussions. Additional thanks to the anonymous referees for their excellent remarks.
| FOOTNOTES |
|---|
1We ignore the issue of representability of the stop codons set.
| REFERENCES |
|---|
|
|
|---|
Benson, D., et al. (2000) GenBank. Nucleic Acids Res, . 28, 1518
Bonitz, S., et al. (1980) Codon recognition rules in yeast mitochondria. Proc. Natl Acad. Sci. USA, 77, 31673170
Freeland, S., et al. (2000) Early fixation of an optimal genetic code. Mol. Biol. Evol, . 17, 511518
Garey, M. and Johnoson, D. Computers and IntractabilityA Guide to the theory of NP-Completeness, (1979) DW Freeman & Co., New York, NY, pp. 9091.
Jukes, H. and Osawa, S. (1993) Evolutionary changes in the genetic code. Comp. Biochem. Phys. B Biochem Mol. Biol, . 106, 489494.
Liang, A. and Heckmann, K. (1993) Blepharisma uses UAA as a termination codon. Naturwissenschaften, 80, 225226[CrossRef][ISI][Medline].
Ohama, T., et al. (1993) Non-universal decoding of the leucine codon CUG in several Candida species. Nucleic Acids Res, . 21, 40394045
Osawa, S., et al. (1992) Recent evidence for evolution of the genetic code. Microbiol. Rev, . 56, 229264
Rychlik, W. (1993) Selection of primers for polymerase chain reaction. In White, B.A. (Ed.). PCR Protocols: Current Methods and Applications, , Totowa, NJ vol. 15 Humana Press, pp. 3140.
Sambrook, J. and Russell, D.W. Molecular Cloning, (2001) 3rd edn , New York Cold Spring Harbor Laboratory Press.
Skiena, S. (2001) Designing better phages. Bioinformatics, 17, S253S261[Abstract].
Wheeler, D., et al. (2000) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res, . 28, 410.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
be a set of signatures, and let
be a representation. We say that S represents A if
. That is, any signature in A is some specification of S, and any possible specification of S is an element of A.
be a set of codons, and let
is concatenation of representations and
is a peptide, where each Ai is an amino acid represented by S(i). We say that S is the representation of the peptide A.l


BoundedRun(S, d) is non-empty.