Skip Navigation


Bioinformatics Advance Access originally published online on December 15, 2005
Bioinformatics 2006 22(4):438-444; doi:10.1093/bioinformatics/btk004
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/4/438    most recent
btk004v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Volpe, J. M.
Right arrow Articles by Kepler, T. B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Volpe, J. M.
Right arrow Articles by Kepler, T. B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

SoDA: implementation of a 3D alignment algorithm for inference of antigen receptor recombinations

Joseph M. Volpe 1, Lindsay G. Cowell 2 and Thomas B. Kepler 2,3,*

1Computational Biology and Bioinformatics program, Duke University Medical Center Box 90090 Duke University, Durham, NC 27708-0090, USA
2Department of Biostatistics and Bioinformatics, Duke University Medical Center Box 90090 Duke University, Durham, NC 27708-0090, USA
3Department of Immunology, Duke University Medical Center Box 90090 Duke University, Durham, NC 27708-0090, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 REFERENCES
 

Motivation: The antigen receptors of adaptive immunity—T-cell receptors and immunoglobulins—are encoded by genes assembled stochastically from combinatorial libraries of gene segments. Immunoglobulin genes then experience further diversification through hypermutation. Analysis of the somatic genetics of the immune response depends explicitly on inference of the details of the recombinatorial process giving rise to each of the participating antigen receptor genes. We have developed a dynamic programming algorithm to perform this reconstruction and have implemented it as web-accessible software called SoDA (Somatic Diversification Analysis).

Results: We tested SoDA against a set of 120 artificial immunoglobulin sequences generated by simulation of recombination and compared the results with two other widely used programs. SoDA inferred the correct gene segments more frequently than the other two programs. We further tested these programs using 30 human immunoglobulin genes from Genbank and here highlight instances where the recombinations inferred by the three programs differ. SoDA appears generally to find more likely recombinations.

Availability: SoDA is freely available for use via the web at the http://dulci.org/soda/

Contact: kepler{at}duke.edu


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 REFERENCES
 
The defining characteristic of adaptive immunity is the somatic diversification of its antigen receptor genes. The genes for the T-cell receptor (TCR) and immunoglobulin (Ig) are formed by a process known as V(D)J recombination, in which transcribable genes are composed by the combinatorial joining of gene segments from two or three classes, depending on the specific locus involved. The basic unit of the TCR and Ig is a heterodimer of a heavy chain and a light chain. In the Ig, these units are further complexed. Light chain genes are made by recombination of one variable (V) and one joining (J) gene segment; heavy chain genes are made by recombination of V, J and an additional segment, the diversity (D) segment between them (Tonegawa, 1983; Ichiara et al., 1988).

In addition to this combinatorial diversification, TCR and Ig genes are subject to randomness in the location of the recombination sites in all segments and the stochastic addition of non-templated (n) nucleotides in the junctions (Alt and Baltimore, 1982). Studies conducted on both TCR and Ig show that the average number of n-nucleotides in a V–D junction or D–J junction is ~7 (McMahan and Fink, 2000; Rosner et al., 2001).

Ig genes also undergo somatic hypermutation (reviewed in Papavasiliou and Schatz, 2002 and Wagner and Nueberger, 1996). The frequency of point mutations is typically enhanced in the complementary determining regions, or CDR, which encode the antigen-binding regions of the molecule. The third CDR, CDR3, spans the 3' end of the V segment through the 5' end of the J segment, thus entirely encompassing the D segment for a heavy chain rearrangement.

The problem we set out to solve here is the statistical reconstruction of the recombination events that led to any given antigen receptor gene. The solution consists in identifying each of the V, D and J gene segments used as well as the recombination sites, point mutations and n-nucleotides introduced. Because of the uncertainty in the recombination sites, the aligned, untrimmed gene segments may overlap. Therefore, alignments of the target gene to the individual gene segments cannot be treated as independent. This feature sets the V(D)J problem apart from the otherwise similar spliced alignment problem (Gelfand et al., 1996). The substantial single-nucleotide diversification that occurs throughout all junctions owing to somatic hypermutation and the possible addition of n-nucleotides adds further complication.

Several algorithms exist for deciphering Ig and TCR gene segment composition. IMGT/V-QUEST is perhaps the most complete of these tools, having the ability to analyze both Ig and TCR sequences for human, mouse, sheep and other organisms (Giudicelli et al., 2004). V-QUEST, however, is based on the BLAST algorithm; it is not as sensitive as the dynamic programming methods for sequence alignment and does not guarantee finding the best alignment of two sequences (Altschul et al., 1990).

Another approach, based on the identification of conserved motifs in the target gene, was taken by the authors of JOINSOLVER (Souto-Carneiro et al., 2004). This program was developed for the specific task of analyzing the CDR3 region of rearranged Ig heavy chain sequences to characterize D segment usage, and thus does not analyze TCR sequences, though it does analyze Ig light chains. Neither program produces a final alignment that identifies the inferred origin of each nucleotide in the target sequence, nor does either program allow for gaps when performing alignments, although insertions and deletions are known to occur during somatic hypermutation (Smith et al., 1996).

Our contribution, which we have dubbed SoDA for Somatic Diversification Analysis, is based on dynamic programming, which has become the standard approach for pairwise sequence alignment because of its simplicity and guaranteed optimality. Traditional dynamic programming algorithms for sequence alignment (Needleman and Wunsch, 1970; Smith and Waterman, 1981) rely on the equivalence of pairwise alignments and paths through a two-dimensional lattice. Our algorithm is a generalization of these approaches based on an equivalence between V(D)J alignments and paths through a three-dimensional (3D) lattice as described in detail below.


    2 APPROACH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 REFERENCES
 
The solution to the V(D)J problem consists of the identification of a set Formula of gene segments and a set Formula of modifications—recombination sites, point mutations (including insertions and deletions) and n-nucleotide insertions—that when applied to Formula produces the observed target sequence Formula, i.e. Formula, and such that the posterior probability, Formula, is maximized. The posterior probability quantifies our confidence in the inference.

Bayes' rule gives

Formula 1(1)
where I is the indicator function, equaling one when its argument is true and zero otherwise. The basic assumption here is that Formula 1 and Formula 1 are independent given the constraint that Formula 1. Practically speaking, the information Formula 1 is based on empirical frequencies and specifies the cost function for the alignment algorithm.

To date, there has been limited research into preferential usage of specific Ig or TCR gene segments over others. Livak et al., 2000 found possible preferences for certain D and J gene segments in TCR rearrangements and Marshall et al., 1996 showed preferential usage of certain V gene segments in murine Ig rearrangements. Few studies, however, characterizing segment-by-segment usage for functional human rearrangements have been published, especially for Ig heavy chain rearrangements. One study by Brezinschek et al., 1997 did find frequencies for V, D and J gene segment usage in IgM, but the study was based on serum samples from only two male donors. Thus, there is limited data on such frequencies in the population. So, in performing these analyses, at present, SoDA assumes that all V, D and J gene segments are equally probable, rendering the Formula 1 term in Equation (1) a uniform.


    3 ALGORITHM
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 REFERENCES
 
SoDA proceeds through two stages. In the first, the set of viable V, D and J segments is chosen by independent unconditional pairwise alignments between the target gene and each candidate gene segment. As stated above, the optimality of a solution based on independent unconditional alignment is not assured, but we can eliminate gene segments that score too low to participate in the optimal solution to simplify the computationally intensive second stage.

Using a standard local alignment approach, SoDA first finds an alignment score for each V gene segment in the library. Once all V gene segments are aligned and scored, they are sorted; only those above a viability cutoff are retained. The process then repeats for J gene segments. For Ig heavy chains and TCR ß and {delta} chains, candidate D segments are evaluated by alignment against that part of the target sequence that lies between the invariant cysteine encoded by V and the invariant tryptophan/phenylalanine encoded by J, the positions of which are discovered by the initial alignments to V and J.

The second stage of SoDA is the simultaneous alignment of all segments in the remaining viable sets. This stage uses a novel 3D dynamic programming alignment algorithm to make its final sequence inference, which we now describe in detail.

3.1 Two contiguous sequences (light chains)
We started by designing an algorithm to align a target sequence to two contiguous sequences that can overlap and/or have a random number of additional nucleotides in their junction, as in the case of light chain rearrangements. Using a 3D matrix, we conceived the front vertical plane to represent the alignment of the V gene segment to the target sequence, where the vertical axis i represents the V gene segment and the horizontal axis j represents the observed input sequence (Fig. 1A). The third axis k would then represent the J gene segment, and thus each horizontal plane in the matrix represents an alignment of the J segment with the target sequence. The front V plane is further separated from the matrix by an n-layer, which allows for random n-nucleotide additions (Fig. 2A). A path, then, for the alignment through the matrix would first follow an increase in the i and j axes while k remained constant, representing an alignment of the V gene segment with the target sequence. Then, at some point, the path would pass into the n-layer and traverse through it, meaning that j continues to increase while movement in the i and k dimensions remain constant, thereby representing n-nucleotides. Or, the path would jump directly from some point in the V plane to some point in one of the horizontal J planes to account for situations where there had been no n-nucleotide additions and instead an overlap of the V and J segments (Fig. 1B). From that point on, the path would be horizontal only, with i remaining constant and j and k increasing, representing an alignment of the target sequence with the J gene segment. Accounting for exonuclease activity in rearrangements requires performing this alignment in three dimensions, to allow for a path from any point in the V segment to any point in the J segment while still being able to move through the n-layer. The algorithm was written such that the dynamic programming matrix h is computed by individual vertical layers. Let s(X) be the length of a gene segment X, where X can be the V, D, J or target sequence (T). Then, we define one vertical layer of the dynamic programming matrix h as h[1 ... s(V)][1 ... s(T)][k] for all integers k, such that 1 ≤ k ≤ s(J). Note that one vertical layer constitutes one unit along the k-axis. Computing the matrix by layers is necessary since the criteria evaluated for each matrix position differs per layer for the V, n- and initial J submatrix layers. First the V layer is filled in completely, calculating the value at each position using the typical dynamic programming alignment algorithm rules:


Figure 1
View larger version (16K):
[in this window]
[in a new window]
 
Fig. 1 (A) For aligning two contiguous sequences to a target sequence, the j-axis represents the target sequence, while the other sequences are represented by the i and k axes. (B) Aligning the J sequence to the target sequence horizontally allows a traceback path to traverse from any point in V to any point in J, while still allowing for movement in the n-layer if necessary.

 

Figure 2
View larger version (18K):
[in this window]
[in a new window]
 
Fig. 2 (A) Allowing for a random number of n-nucleotides necessitates the addition of a special layer in the dynamic programming matrix. The layer is inserted between the V-to-target sequence alignment plane and the J-to-target alignment submatrix. (B) The matrix for aligning three sequences requires that the second n-layer and submatrix be appended to the bottom, and offset by 2 units in the k direction, to account for the front vertical V and n-layers.

 
{forall} i, j and k, such that 1 ≤ i ≤ s(V), 1 ≤ j ≤ s(T) and for k = 1:

Formula 1
where m is the score of a match or mismatch of the nucleotides being compared.

Once complete, the algorithm moves to the next layer back, where k = 2. This is the n-layer, which is a buffer layer between the V and J layers to accommodate the addition of n-nucleotides, so the rules for calculating h[i][j][k] here are different. Generally, vertical movement within a layer would imply gaps in the target sequence. In the context of the problem, however, an insertion or deletion that occurs within the set of n-nucleotides is simply interpreted as the existence or non-existence of an n-nucleotide. Thus, in the n-layer, vertical movement along the i-axis not defined:

{forall} i, j and k, such that 1 ≤ i ≤ s(V), 1 ≤ j ≤ s(T) and for k = 2,

Formula 1

The algorithm then proceeds to fill in the submatrix for the J layers, where k ≥ 3. Here, each calculation must evaluate not only adjacent positions, as in typical alignments, but must also consider positions in both the n- and V layers owing to the constraint of having a path that can move from any position in the J sequence to any position in the V sequence, and thus there are five maximum value options:

{forall} i, j and k, such that 1 ≤ i ≤ s(V), 1 ≤ j ≤ s(T) and 3 ≤ k ≤ s(J),

Formula 1
Options 3 and 4 above are specifically for considering the n- and V layers, respectively. Note that when k = 3, options 2 and 3 are redundant. Thus, for that layer only, there are four options.

As with standard alignment algorithms, a traceback path is created and saved as the matrix is computed. When the matrix is complete, traceback begins at the position in the J submatrix where the value is the greatest.

3.2 Three contiguous sequences (heavy chains)
Building upon the concepts for light chain rearrangements, we developed the algorithm necessary to accommodate heavy chain rearrangements, where there is a third contiguous sequence to be aligned to the target sequence. The key to the algorithm is to replicate the idea of an n-layer and a second 3D submatrix for the additional sequence, but to append both beneath the existing layers required for the alignment of the first two sequences (Fig. 3A). This is accomplished by designating the vertical axis i to represent the V sequence, an n-layer and the J sequence, so that i = s(V) + s(J) + 1. The D sequence, then, is represented along the k-axis, and the target sequence is still represented by the j-axis (Figure 2B and 3B). Arranging the sequences in this way allows for n-nucleotide additions and a path from any point in J to any point in D, while maintaining those same properties of movement for the D and V sequences. The matrix, then, is initially filled out using the same calculations as the two sequence alignment procedure, but with the additional steps of filling in the n-layer for the D–J junction and the submatrix for the J gene segment alignment to the target sequence. Thus, the algorithm switches from computing vertical layers, to computing horizontal layers. Computing this second n-layer is similar to computing the first, except that movement along the k-axis is undefined instead of the i-axis. Thus, the second n-layer is filled out using the following rules:


Figure 3
View larger version (10K):
[in this window]
[in a new window]
 
Fig. 3 (A) To account for the addition of a third sequence, a second n-layer and second submatrix are added beneath the original alignment matrix. (B) The third sequence addition requires associating two of the sequences to one of the axes. For Ig heavy chains, the J sequence is represented along the i-axis with the V sequence.

 
For i = i' + 1 and {forall} j and k, such that 1 ≤ j ≤ s(T) and for k ≥ 3,

Formula 1
where i' is the position along the i-axis at which the best alignment of D to the target sequence, relative to V, occurs. At each i, the horizontal alignment of D to the target sequence is similar, differing only by values that come from the vertical V layer. Thus, at some horizontal layer at height i, the alignment of D to the target sequence will be maximized relative to a position in V, which we define as the vertical position i' where the traceback path will pass through into the V layer (Fig. 3B). In terms of alignment to the target sequence, the V and J sequences are independent of each other. So, the computations for the second n-layer and the subsequent J submatrix begin at i' + 1 and k begins at 3, since k = 1 and k = 2 represent the vertical V and n-layers, respectively.

The J submatrix is filled out in a similar manner to the D submatrix, considering values in the n-layer and in the D submatrix horizontal layer at height i', in addition to adjacent positions within the matrix:

{forall} i, j, and k, such that i' + 2 ≤ i ≤ s(J), 1 ≤ j ≤ s(T) and 3 ≤ k ≤ s(D),

Formula 1
Note that when i = i' + 2, options 2 and 3 are redundant, and thus for the topmost horizontal layer in the J matrix, the algorithm chooses from only four options.

3.3 Traceback
Once the matrix is entirely computed, the traceback procedure begins at the cell in the J submatrix containing the maximum value. In general, traceback proceeds by moving concurrently along the i and j axes, while k remains constant, finding the alignment of the J sequence to the target sequence. Then, the traceback path moves into either the n-layer and then the D submatrix, or directly into the D submatrix. Movement within the n-layer is solely along the j-axis. At i = i', traceback proceeds through the D submatrix by moving concurrently along the j and k axes, finding the alignment between the D gene segment and the target sequence. Then, the path moves into the vertical n-layer for the V–D junction, or jumps directly into the V layer. Again, movement in the n-layer is solely along the j-axis. Once in the V layer, movement proceeds concurrently along the i and j axes until either i = 1 or j = 1, finding the alignment of the V gene segment to the target sequence. All movement within the n-layers is designated as the addition of n nucleotides in the junction between the respective gene segments. Thus, once the traceback is complete, the resulting alignment is the best possible alignment of the gene segments, with n-nucleotides and gaps, to the target sequence.

3.4 Expanding the 3D algorithm beyond three sequences
Though the 3D algorithm described here only aligns three subsequences to a target sequence, we believe that the fundamental concepts established here allow for expanding this alignment algorithm beyond three subsequences. In such a case, each additional sequence would be represented either by the k-or the i-axis, alternating by each addition. Also, a new 3D submatrix for aligning this new sequence would need to be appended in the same direction as the axis that represents the new sequence. In general, odd numbered sequences would be represented by the i-axis, while even numbered sequences would be represented by the k-axis. For example, in the case of our SoDA implementation, the third sequence, the J gene segment, is represented along the i-axis with the V gene segment and the submatrix for its alignment to the target sequence is appended below the matrix for the D gene segment, the second sequence. So, adding a fourth sequence would require that the k-axis represents the new sequence and that an additional submatrix for aligning this sequence to the target would be added in the direction of the k-axis, adjacent to the lower submatrix for the third subsequence alignment. Adding further sequences would thus create a set of submatrices arranged like a staircase, moving in the direction of the i and k axes.


    4 RESULTS AND DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 REFERENCES
 
To test SoDA, we developed a simulation program to generate artificial Ig sequences. The program simulates V(D)J recombination by randomly selecting a V, D and J gene segment, and selecting the effective recombination site for each segment from a uniform distribution extending 5 nucleotides to either side of the nominal recombination site. N-nucleotide addition is simulated by randomly choosing the number of n-nucleotides to add to each junction from the uniform distribution up to 12. Somatic hypermutation is simulated by introducing point mutations independently at each position with transition/transversion ratio fixed by empirical frequencies (Smith et al., 1996). The probability of mutation per nucleotide was varied to produce expected mutation frequencies of 2.5, 5, 10 and 15%; we generated 30 artificial sequences at each mutation rate.

We used these sequence sets to compare the inferences generated by SoDA, JOINSOLVER and V-QUEST. All three programs use the same V, D and J gene segment libraries from IMGT, except that JOINSOLVER adds five ‘irregular D’ segments to the D segment library. We chose four sets of 30 sequences as a reasonable test, given that only SoDA among the three programs can process sequences in batch. The results of our tests are shown in Table 1.


View this table:
[in this window]
[in a new window]
 
Table 1 Results of four sets of 30 simulated sequences at increasing mutation rates

 
As expected, as the mutation rate increased, SoDA's accuracy declined, though at each mutation rate, SoDA identified more rearrangements correctly than did either JOINSOLVER or V-QUEST. SoDA identified the correct VDJ combination in 28, 28, 27 and 23 cases out of 30 at 2.5, 5, 10 and 15% mutation rates, respectively. Also, for each test, SoDA correctly identified all 30 of the V segments. JOINSOLVER did not perform as well in identifying V gene segments (though this is not unexpected since JOINSOLVER was developed specifically for analyzing CDR3 regions and not V gene segments). V-QUEST identified fewer D segments correctly. All of the programs were able to identify the J gene segments correctly, and at the 15% mutation rate, JOINSOLVER and V-QUEST both performed slightly better than SoDA at this task. We then tested each of these programs on a set of 30 sequences selected randomly from a set of 650 rearranged Ig heavy chain sequences. Of these 650 sequences, 547 came from the set of sequences used by Souto-Carneiro et al., 2004 to test JOINSOLVER (Genbank accession nos Z68345-487 and Z80363-770), and the remaining 103 sequences were added to this set after searching Genbank for ‘human immunoglobulin heavy chain variable region’ (accession nos AM050894-1008). After testing the three programs using all 30 sequences, we found that the programs agreed in their V, D and J identifications on 18 of these genes. For the remaining 12 sequences, SoDA never made a V or J identification that was not supported by one of the other two programs. In two of these 12 cases, the only difference was in JOINSOLVER's V segment prediction. In one case JOINSOLVER's V choice differed from that of SoDA and V-QUEST, and the D identified by V-quest differed from that of SoDA and JOINSOLVER. In three other cases, V-QUEST simply did not identify both the D and J gene segments, though JOINSOLVER and SODA agreed on all three gene segments. There was one instance where JOINSOLVER's D choice differed from that of the other two programs, and a similar instance where V-QUEST's selection of a D segment was the only difference. For the remaining four instances, the three programs chose different D gene segments. The alignments for two of these cases are shown in Figure 4.


Figure 4
View larger version (37K):
[in this window]
[in a new window]
 
Fig. 4 For both GI numbers 1154693 and 1154688, each of the three programs, SoDA, JOINSOLVER and V-QUEST, returned the same V and J gene segment inference, but a different D segment. Above is a summary of the alignments, per program, of the D gene segment selected and its junctions with the V and J segments. The italicized n's in the key for JOINSOLVER show where we believe JOINSOVLER's inference can be improved, by changing them to Js. For GI1154693, SoDA's D gene segment identification seems to produce the most favorable alignment, having only one mutation compared with the four mutations of V-QUEST's alignment, and having a significantly lower number of n-nucleotide additions than JOINSVOLER's alignment. For GI1154688, JOINSOLVER's D gene segment identification produces the longest D segment alignment, but SoDA's choice produces the best overall alignment between the V, D and J gene segments.

 
For sequence 1154693 (GI number), SoDA's D gene choice appears arguably to provide the best fit (Fig. 5). JOINSOLVER's inference can be improved as shown where the key is italicized. Nevertheless, we believe SoDA's alignment to be the best owing to its smaller relative mutation frequency and the smaller number of n-nucleotides required. For sequence 1154688 (GI number), JOINSOLVER found the longest D gene segment alignment without mutations, but uses an unusual reading frame in the gene segment not found in the IMGT IGHD library (2/OF15-2) and uses 15 n-nucleotides in the V–D junction. V-QUEST aligns 11 bases of its D prediction (4-23*01) with 3 mutations, while SoDA's uses a common D segment (3-9*01), though it is inverted, with no mutations and only four n-nucleotides in the V–D junction. Though both SoDA and JOINSOLVER use four n-nucleotides in the D-J junction, JOINSOLVER's D alignment forces the J alignment further downstream, and therefore misses aligning the first 12 bases of the J segment to the target sequence. This example illustrates the dependence among segment alignments that motivated the development of SoDA. Although JOINSOLVER has found a D segment alignment that scores higher than that of the SoDA solution, when the respective complete alignments are considered, SoDA's use of a lower-scoring D segment is more than offset by the J segment alignment that this choice allows (Fig. 5).


Figure 5
View larger version (29K):
[in this window]
[in a new window]
 
Fig. 5 The complete output of SoDA's inference for sequence 1154693. SoDA provides a text file in this format for each sequence it analyzes.

 
In general, however, SoDA takes much longer to run, requiring typically up to 25 seconds per sequence to run on an AMD dual-core 64-bit machine compared to the few seconds that it takes for JOINSOLVER and V-QUEST to run. Neither JOINSOLVER nor V-QUEST produces such an alignment, but doing so means SoDA runs in polynomial time. Also, though JOINSOLVER was not developed to process TCR or light chain sequences, it can infer the use of tandem D segments, something that neither SoDA nor V-QUEST currently does, though this is a feature we intend to add to SoDA.


    5 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 REFERENCES
 
The antigen receptors of adaptive immunity are unique in their capacity for somatic diversification and consequently present unique challenges for bioinformatics. We have developed a software package built on a novel 3D generalization of pairwise alignment algorithms to find the most likely recombination scenario, including the gene segments and their somatic modifications, to account for any given TCR or Ig sequence. This information in turn allows a detailed analysis of the population somatic genetics of the immune response.

We tested our system, SoDA, against a set of artificial Ig sequences produced by simulating the rearrangement process, and a set of real human Ig genes chosen from Genbank. SoDA performed well, and compared favorably with the existing programs JOINSOLVER and V-QUEST. The cost of this performance enhancement is a substantially increased computational effort. Although validated here on human Ig heavy chains only, the program performs comparably on all Ig and TCR loci in humans and mice.


    Acknowledgments
 
J.M.V. would like to thank his colleagues in the Duke University Laboratory of Computational Immunology for helpful discussions during the genesis of this project and Duke University Graduate School for financial support. This research was made possible through the generous financial support of the Burroughs Wellcome Fund (Career Award 1004005 to L.G.C.) and of the National Institutes of Health (T.B.K.) through the Southeast Regional Center for Excellence in Emerging Infections and Biodefense (U54 AI057157 [GenBank] -02; B Haynes) and the Duke Center for Translational Research (5P30 AI051445 [GenBank] -01; B Haynes).


    FOOTNOTES
 
Associate Editor: Steven L. Salzberg

Received on October 18, 2005; revised on December 1, 2005; accepted on December 11, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 ALGORITHM
 4 RESULTS AND DISCUSSION
 5 CONCLUSION
 REFERENCES
 

    Alt, F.W. and Baltimore, D. (1982) Joining of immunoglobulin heavy chain gene segments: implications from a chromosome with evidence of three D–JH fusions. Proc. Natl. Acad. Sci, . 97, 4118–4122.

    Altschul, S.F., et al. (1990) Basic local alignment search tool. J. Mol. Biol, . 215, 403–410[CrossRef][Web of Science][Medline].

    Brezinschek, H., et al. (1997) Differential effects of selection and somatic hypermutation on human peripheral CD5+/IgM+ and CD5/IgM+ B cells. J. Clin. Invest, . 99, 2488–2501[Web of Science][Medline].

    Gelfand, M.S., et al. (1996) Gene recognition via spliced sequence alignment. Proc. Natl. Acad. Sci, . 93, 9061–9066[Abstract/Free Full Text].

    Giudicelli, V., et al. (2004) IMGTV-QUEST, an integrated software program for immunoglobulin and T cell receptor VJ and VDJ rearrangement analysis. Nucleic Acids Res, . 32, W435–W440[Abstract/Free Full Text].

    Ichiara, Y., et al. (1988) Organization of human immunoglobulin heavy chain diversity gene loci. EMBO J, . 7, 4141–4150[Web of Science][Medline].

    Livak, F., et al. (2000) Genetic modulation of T cell receptor gene segment usage during somatic recombination. J. Exp. Med, . 192, 1191–1196[Abstract/Free Full Text].

    Marshall, A.J., et al. (1996) Frequency of V(H)81x usage during B cell development: Initial decline in usage is independent of Ig heavy chain cell surface expression. J. Immunol, . 156, 2077–2084[Abstract].

    McMahan, C.J. and Fink, P.J. (2000) Receptor revision in peripheral T cells creates a diverse Vßrepertoire. J. Immunol, . 165, 6902–6907[Abstract/Free Full Text].

    Needleman, S.B. and Wunsch, C.D. (1970) A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol, . 48, 443–453[CrossRef][Web of Science][Medline].

    Ota, T. and Nei, M. (1994) Divergent evolution and evolution by the birth-and-death process in the immunoglobulin VH gene family. Mol. Biol. Evol, . 11, 469–482[Abstract].

    Papavasiliou, F.N. and Schatz, D.G. (2002) Somatic hypermutation of immunoglobulin genes: merging mechanisms for genetic diversity. Cell, 109, S35–S44[Medline].

    Rosner, K., et al. (2001) Third complementarity-determining region of mutated VH immunoglobulin genes contains shorter V, D, J, P, and N components than non-mutated genes. Immunology, 103, 179–187[CrossRef][Web of Science][Medline].

    Smith, D.S., et al. (1996) Di- and trinucleotide target preferences of somatic mutagenesis in normal and autoreactive b cells. J. Immunol, . 156, 2642–2652[Abstract].

    Smith, T.F and Waterman, M.S. (1981) Identification of common molecular subsequences. J. Mol. Biol, . 147, 195–197[CrossRef][Web of Science][Medline].

    Souto-Carneiro, M.M., et al. (2004) Characterization of the human Ig heavy chain antigen binding complementarity determining region 3 using a newly developed software algorithm, JOINSOVLER. J. Immunol, . 172, 6790–6802[Abstract/Free Full Text].

    Tonegawa, S. (1983) Somatic generation of antibody diversity. Nature, 302, 575–581[CrossRef][Medline].

    Wagner, S.D. and Neuberger, M.S. (1996) Somatic hypermuation of immunoglobulin genes. Annu. Rev. Immunol, . 14, 441–457[CrossRef][Web of Science][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
J. Immunol.Home page
L. Li, Q. He, A. Garland, Z. Yi, L. T. Aybar, T. B. Kepler, J. A. Frelinger, B. Wang, and R. Tisch
{beta} Cell-Specific CD4+ T Cell Clonotypes in Peripheral Blood and the Pancreatic Islets Are Distinct
J. Immunol., December 1, 2009; 183(11): 7585 - 7591.
[Abstract] [Full Text] [PDF]


Home page
J. Immunol.Home page
I. Ueda-Hayakawa, J. Mahlios, and Y. Zhuang
Id3 Restricts the Developmental Potential of {gamma}{delta} Lineage during Thymopoiesis
J. Immunol., May 1, 2009; 182(9): 5306 - 5316.
[Abstract] [Full Text] [PDF]


Home page
J. Immunol.Home page
Z. E. Parra, M. L. Baker, A. M. Lopez, J. Trujillo, J. M. Volpe, and R. D. Miller
TCR{micro} Recombination and Transcription Relative to the Conventional TCR during Postnatal Development in Opossums
J. Immunol., January 1, 2009; 182(1): 154 - 163.
[Abstract] [Full Text] [PDF]


Home page
Nucleic Acids ResHome page
X. Brochet, M.-P. Lefranc, and V. Giudicelli
IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W503 - W508.
[Abstract] [Full Text] [PDF]


Home page
Int ImmunolHome page
U. Hershberg, M. Uduman, M. J. Shlomchik, and S. H. Kleinstein
Improved methods for detecting selection by mutation analysis of Ig V region sequences
Int. Immunol., May 1, 2008; 20(5): 683 - 694.
[Abstract] [Full Text] [PDF]


Home page
BioinformaticsHome page
B. A. Gaeta, H. R. Malming, K. J.L. Jackson, M. E. Bain, P. Wilson, and A. M. Collins
iHMMune-align: hidden Markov model-based alignment and identification of germline genes in rearranged immunoglobulin gene sequences
Bioinformatics, July 1, 2007; 23(13): 1580 - 1587.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/4/438    most recent
btk004v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (5)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Volpe, J. M.
Right arrow Articles by Kepler, T. B.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Volpe, J. M.
Right arrow Articles by Kepler, T. B.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?