Skip Navigation


Bioinformatics Advance Access originally published online on April 26, 2007
Bioinformatics 2007 23(13):1580-1587; doi:10.1093/bioinformatics/btm147
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/13/1580    most recent
btm147v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Gaëta, B. A.
Right arrow Articles by Collins, A. M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Gaëta, B. A.
Right arrow Articles by Collins, A. M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2007. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

iHMMune-align: hidden Markov model-based alignment and identification of germline genes in rearranged immunoglobulin gene sequences

Bruno A. Gaëta 1,2,*, Harald R. Malming 2, Katherine J.L. Jackson 1, Michael E. Bain 2, Patrick Wilson 3 and Andrew M. Collins 1

1School of Biotechnology and Biomolecular Sciences, 2School of Computer Science and Engineering, The University of New South Wales, Sydney, NSW 2052, Australia and 3Molecular Immunogenetics Program, The Oklahoma Medical Research Foundation, Oklahoma City, OK 73104, USA

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

Motivation: Immunoglobulin heavy chain (IGH) genes in mature B lymphocytes are the result of recombination of IGHV, IGHD and IGHJ germline genes, followed by somatic mutation. The correct identification of the germline genes that make up a variable VH domain is essential to our understanding of the process of antibody diversity generation as well as to clinical investigations of some leukaemias and lymphomas.

Results: We have developed iHMMune-align, an alignment program that uses a hidden Markov model (HMM) to model the processes involved in human IGH gene rearrangement and maturation. The performance of iHMMune-align was compared to that of other immunoglobulin gene alignment utilities using both clonally related and randomly selected IGH sequences. This evaluation suggests that iHMMune-align provides a more accurate identification of component germline genes than other currently available IGH gene characterization programs.

Availability: iHMMune-align cross-platform Java executable and web interface are freely available to academic users and can be accessed at http://www.emi.unsw.edu.au/~ihmmune/

Contact: bgaeta{at}cse.unsw.edu.au


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
Antibody production is critical to our defences against microbial invaders. In order to respond to the incredible diversity of foreign antigens, antibodies (immunoglobulins) must be produced with specificities of equal diversity. The diversity of the repertoire is created by recombination, for functional immunoglobulin genes are created by the joining of a number of genes. During the early development of each B cell, a functional heavy chain variable domain is created by the essentially random recombination of three germline genes (IGHV, IGHD, IGHJ) that are each selected from a separate set. There are between 38 and 45 functional IGHV genes per haploid genome (Lefranc 2001; Li et al., 2002), 23 unique functional IGHD sequences (Lee et al., 2006) and 6 functional IGHJ sequences (Ravetch et al., 1981; Ruiz et al., 1999).

During the recombination process, the ends of the joining genes are trimmed by unknown exonucleases, and as many as 10 nt are frequently removed from the IGHD and IGHJ gene ends. Non-template encoded nucleotides (N nucleotides) are also added between the recombining genes by the enzyme terminal deoxynucleotidyl transferase (TdT) (Basu et al., 1983). This enzyme is biased towards the addition of guanine, but the process is an essentially random one that can result in the addition of as many as 25 nt between the joining genes.

During an immune response, additional diversity is generated by the process of somatic mutation, which principally occurs within the germinal centres of the lymph nodes. During this process, antigen-selected B cells accumulate mutations in their immunoglobulin genes. Such mutations are high-frequency events, occurring at an estimated rate of 10–3 mutations per base pair per generation or approximately one immunoglobulin gene mutation per B cell division. Clonal expansion, following antigen selection, therefore gives rise to a clone of diverse sequences whose antibodies are encoded by the same germline genes, but which have diverged from one another through the mutation process. In fact many sequences can be seen, after large clonal proliferations, which have accumulated 30 or more mutations within the 370 or so nucleotides that make up the rearranged V-D-J gene. Mutations are typically concentrated in the complementarity determining regions (CDRs) which encode the variable domain loops that contact the antigen.

Similarities between germline genes, the effects of exonuclease and TdT activity and the somatic mutation process together can make it difficult to identify the germline genes within a rearranged V-D-J gene. This is particularly true in the case of IGHD genes, which range in length from just 11 to 37 nt. After processing of the IGHD gene ends by exonucleases, there may sometimes be very little remaining from the germline sequence. Nevertheless there are many reasons why researchers and clinicians attempt to identify the genes involved in the immunoglobulin rearrangements.

Many studies have reported biases in antibody gene usage in conditions including rheumatoid arthritis (Huang et al., 1998) and other autoimmune diseases (Dorner and Lipsky, 2001), as well as in many leukaemias and lymphomas, such as MALT lymphoma (Yoshida et al., 2006), multiple myeloma (Kosmas et al., 1999) and chronic lymphocytic leukaemia (Tobin et al., 2002; Widhopf et al., 2004). All such studies require the accurate identification of germline genes. The identification of somatic point mutations has been critical to the study of the mutation process (Neuberger et al., 2005; Zheng et al., 2005), and this too begins with the alignment of mutated sequences against the germline gene repertoire. The analysis of mutations is also of clinical importance. For example, the enumeration of somatic point mutations is an important prognostic indicator in chronic lymphocytic leukaemia (Damle et al., 1999; Hamblin et al., 1999). Finally, studies of the processes that generate immunoglobulin diversity—particularly nucleotide removals and additions—are impossible without reliable identification of germline genes and of the ends of the processed genes.

A number of different programs have been developed for aligning immunoglobulin gene sequences and identifying their germline components. IMGT/V-QUEST+JCTA (Bleakley et al., 2006) integrates IMGT/V-QUEST (Giudicelli et al., 2004) and IMGT/JunctionAnalysis (Yousfi Monod et al., 2004) and is based on dynamic programming sequence alignment with additional alignment steps at the gene junctions. JOINSOLVER (Souto-Carneiro et al., 2004) focusses on the third CDR of heavy chains (CDR3H) which includes the IGHD gene and its junction with the IGHV and IGHJ genes. It uses specific sequence motif searches to delimit regions to be aligned (without gaps) to candidate germline genes. SoDA (Volpe et al., 2006) uses a variation on the dynamic programming sequence alignment algorithm that takes into account the variation around the IGHV–IGHD and IGHD–IGHJ junctions resulting from the competing effects of nucleotide addition and exonuclease action.

The mutation model underlying an alignment utility can have significant impact on its ability to identify the component germline genes of a rearranged sequence. Simple mutation models that do not take into account the sequence and location dependence of somatic mutation and other diversity generation processes often result in unlikely alignments with the 3' end of the variable region containing many more mismatches that would be expected based on the number of mutations observed at the 5' end. To address this issue, we have developed iHMMune-align, an application that incorporates explicit models of the various antibody generation processes in the form of probability distributions along a hidden Markov model (HMM) of the variable region of the heavy chain gene, and generates an alignment of the rearranged sequence with its most likely component germline genes. Its development was particularly designed to improve IGHD gene identification. iHMMune-align therefore presently focusses on alignment of heavy chain variable regions, though it can easily be adapted to the alignment of immunoglobulin light chains and T cell receptors. The current version is based on data gathered from human immunoglobulin gene sequence data and is therefore appropriate only for the alignment of human sequences, although the approach can also be readily adapted to other species provided sufficient data is available for these species.

In this report, we provide a detailed description of iHMMune-align, and compare its performance against that of IMGT/V-QUEST+JCTA, JOINSOLVER and SoDA. It is difficult to evaluate the accuracy of immunoglobulin gene alignment software because it is almost always impossible to be certain of the germline genes that contributed to a sequence. Somatic point mutations of one sequence can arguably make it look like another sequence. For this reason, it is particularly difficult to be certain of IGHV alleles that might be used, and to distinguish between the short, highly similar IGHD genes. To test the performance of the various utilities, we have used sets of clonally related sequences. These sets were generated from cDNA from tonsillar B cell, and each set therefore represents a clonal expansion from a single V-D-J rearrangement. Although there might be argument about the components of the original sequence, the performance of a program can be gauged by the proportion of sequences within each set for which the same IGHV, IGHD and IGHJ genes are identified. Further measures of the accuracy of the programs are provided by comparisons of alignments of an additional 662 cDNA-derived sequences. We conclude that iHMMune-align is the most accurate program for human heavy chain gene alignment, as it is based upon a HMM its performance can be expected to further improve as additional data are built into the model.


    2 APPROACH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
iHMMune-align proceeds through the following steps.

2.1 V-gene pre-alignment
The program starts with dynamic programming local alignment (Gotoh, 1990; Smith and Waterman, 1981) of the rearranged sequence with the human IGHV germline repertoire obtained from IMGT (Lefranc, 2005; Lefranc et al., 2005). This step allows the identification of the best-matching IGHV gene and also the estimation of the amount of somatic mutation based on the frequency of mismatches in the resulting alignment.

2.2 HMM construction
A HMM is built using the topology shown in Figures 1 and 2. This HMM incorporates a chain of match states modelling the IGHV gene identified in step 2.1, and parallel chains of match states representing all the possible IGHD and IGHJ germline genes (Fig. 1). The IGHV, IGHD and IGHJ sections of the HMM are joined by match states representing nucleotides resulting from N- and P- addition, and delete states representing the effect of nucleotide removal at the junctions through exonuclease action (Fig. 2).


Figure 1
View larger version (7K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 1. iHMMune-align HMM topology overview. The HMM models one IGHV gene and all the possible IGHD and IGHJ genes, together with junction states corresponding to N- and P-addition.

 

Figure 2
View larger version (29K):
[in this window]
[in a new window]
[Download PowerPoint slide]
 
Fig. 2. Detailed view of sections of the iHMMune-align HMM graph. (a) 3' end of IGHV region and adjacent N region (b) one IGHD gene model and adjacent P-addition states (c) IGHD-IGHJ N-region and one IGHJ gene model. Grey circles represent non-emitting states.

 
The HMM initially represents rearranged but unmutated sequences, with emission probabilities for the IGHV, IGHD and IGHJ gene match states set at 1 for the nucleotide observed at this position in the germline gene, and 0 for the other 3 nt. Emission probabilities for match states corresponding to nucleotides inserted by P-addition and transition probabilities at the IGHV–IGHD and IGHD–IGHJ junctions (which model exonuclease action in the V-D-J recombination process) are set based on frequencies for these events observed in a set of unmutated rearranged sequences (Jackson et al., 2004). Emission probabilities for match states corresponding to N-addition nucleotides are based on experimentally determined nucleotide addition propensities for TdT (Basu et al., 1983).

2.3 Adjustment of emission probabilities
The emission probabilities in all match states of the model are then re-calculated to model the process of somatic mutation. The base probability of mutation is extrapolated from the mutation frequency observed in the initial IGHV region pre-alignment, which provides an estimate of the number of rounds of mutation the sequence has undergone. This probability is then adjusted to take into account the position of the putative mutation, the local sequence context and the effect of antigen selection.

2.3.1 Sequence position
The probability of somatic hypermutation has been observed to decrease with distance from the 5' end of the rearranged gene, with a distribution fitting an exponential decay of the form AN + i = Ai e0.0024i where AN is the mutation propensity at position N (Rada and Milstein, 2001). iHMMune-align adjusts the probability of mutation along the sequence accordingly.

2.3.2 Local sequence context
iHMMune-align offers a choice of two models to represent the sequence dependence of somatic hypermutation. The ‘Hotspot’ model is based on the observation that sequence mutability occurs preferably at specific DNA motifs (RGYW, WRCY, WAN). iHMMune-align increases the probability of mutation at hotspots defined by these motifs, in proportion to the frequency of mutation at these motifs relative to other sequences (Martin and Scharff, 2002). When the user selects this model, iHMMune-align adjusts the probability of change away from the germline sequence by a factor of 32/6 for the second position of tetranucleotides fitting the RGYW consensus (and the third position of its reverse complement WRCY), based on the observation that these hotspots each contribute to 1/6 of observed mutations and each represent 1/32 of all possible tetranucleotides. Following the same reasoning, the probability of mutation for the central A of the WAN motif is adjusted by a factor of 8/3 and the probability of mutation of non-hotspot sequences is adjusted to take into account their lower than expected mutability.

The ‘Trinucleotide’ model assigns a mutability score to each possible trinucleotide based on observed mutation frequencies after correction for position along the sequence (Collins et al., 2004) and adjusts emission probabilities along the model in proportion to this mutability score.

2.3.3 Antigen selection
Over the course of the humoral immune response, B cells producing immunoglobulins with good affinity for antigen are encouraged to proliferate and those accumulating mutations that result in non-functional or less effective immunoglobulins are eliminated. As a result of this selection process, mutations in the CDRs that are involved in antigen binding are favoured over those in the framework regions that are responsible for the folding of the variable domain, and whose mutations are more likely to disrupt the overall structure of the antibody. To reflect this process, iHMMune-align adjusts the probability of mutation in the CDRs by a factor of 1.5, estimated from comparison of multiple immunoglobulin heavy chain (IGH) gene sequences (Collins et al., 2004).

These factors are used to calculate the probability of mutation PM at each match state of the HMM. Emission probabilities for each IGHV, IGHD and IGHJ gene state are set at 1 – PM for the germline sequence nucleotide and PM/3 for each of the other 3 nt.

2.4 Sequence alignment
The HMM is finally aligned with the rearranged, mutated sequence, using the Viterbi algorithm (Rabiner, 1989). The program outputs the alignment corresponding to the optimal path along the HMM and reports the germline genes corresponding to this optimal path. In cases where the IGHD gene has been heavily mutated or truncated by exonuclease action, its identification can be problematic and iHMMune-align reports the matching gene only when it meets an additional criterion based on number of consecutive perfect sequence matches in the IGHD gene alignment.


    3 RESULTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
3.1 Implementation
iHMMune-align was implemented in the Java language, using the BioJava libraries (www.biojava.org), and has been successfully run under Microsoft Windows, MacOS X and Linux. The initial Smith–Waterman alignment is performed using the program Jaligner (Moustafa, 2006). iHMMune-align is accessed through a graphical user interface that allows selecting up to 50 sequences to align, and changing the IGHV, IGHD and IGHJ germline gene reference repertoires. A command line version is also available from the authors on request and has been used with Perl scripts to align batches of up to 10 000 sequences. The default values used to calculate the various transition probabilities at the IGHV–IGHD and IGHD–IGHJ junctions are based on frequencies observed in a large sequence set (Jackson et al., 2004) and are suitable in most cases. However, these values can be modified by experts when aligning sequences that are known to have different characteristics—for example fetally derived sequences that have been reported to undergo less N-nucleotide addition (Benedict et al., 2000). Users can also select between the commonly accepted Hotspot mutation model and the Trinucleotide model used routinely in our own analyses. Mutation model choice did not affect alignment in our evaluations, but both models were included for expert users. Other user-modifiable parameters include the output format (HTML or spreadsheet compatible) and the criterion for IGHD gene acceptance (5-mer or 8-mer). Using the 5-mer criterion, iHMMune-align requires 5 consecutive perfect matches to accept a D-gene identification. The more stringent 8-mer criterion used by default in our laboratory requires 8 or 9 consecutive matches in a row, or 10–11 matches with one mismatch or 12 matches with 2 mismatches. These rules are based upon modelling the likelihood that randomly generated N nucleotides will match IGHD genes (Collins et al., 2004).

While not as fast as IMGT/V-QUEST+JCTA, the program requires only ~5 s to align a query sequence using a 2 GHz Intel Core Duo processor. The program requires 5 MB of disk space to install and runs best with the java heap and stack sizes set to a minimum of 512 MB each.

3.2 Evaluation using clonally related sequence sets
Ideally, evaluation of the accuracy of iHMMune-align predictions requires a benchmark set of rearranged immunoglobulin gene sequences of known germline gene composition. Since the germline gene composition cannot be known with confidence except for relatively unmutated sequences, we used sets of clonally related sequences known to be derived from the same V-D-J rearrangements. Two sets were derived from tonsillar IgD class-switched B cells (Zheng et al., 2004), and have previously been described. The two sets consisted of 57 and 99 unique sequences for which all programs being tested could produce an alignment. Sequences were aligned against the germline IGHV, IGHD and IGHJ gene repertoires using iHMMune-align, IMGT/V-QUEST+JCTA (Giudicelli et al., 2004; Yousfi Monod et al., 2004) JOINSOLVER (Souto-Carneiro et al., 2004) and SoDA (Volpe et al., 2006). For this comparison iHMMune-align alignments were performed using both the Trinucleotide and Hotspot mutation models, with both models resulting in the same alignments. Other programs were accessed through their websites, using default program parameters. At the time of testing, all programs used the same version of the IMGT germline gene repertoire.

The results of the analysis with the 57-sequences set are shown in Table 1. The four programs generally agreed on the longer IGHV and IGHJ genes (with variation with respect to predicted alleles), but differed with respect to IGHD gene alignment. Out of the four programs, iHMMune-align was the most consistent in its germline gene identification and predicted the same IGHD gene for all but one of the sequences.


View this table:
[in this window]
[in a new window]

 
Table 1. Number of alignments in agreement, and numbers of alternative alignments seen when 57 clonally related sequences were aligned using four alignment utilities

 
Results for the 99-sequences set are presented in Table 2. This set included heavily mutated sequences that can present a challenge for any alignment program. IMGT/V-QUEST+JCTA, iHMMune-align and SoDA all selected the same V-D-J gene set (IGHV4-34*01 IGHD6-6*01 IGHJ6*02) for the majority of sequences in the set, while JOINSOLVER predicted an inverted IGHD gene in the majority of its alignments. The best performing program with regard to consistent predictions was iHMMune-align, which predicted the consensus gene set for 72 out of 99 sequences. With regard to the identification of the IGHD gene, iHMMune-align did not make a prediction for 12 out of the 99 sequences, as their IGHD gene alignments did not satisfy the 8-mer criterion, and iHMMune-align predicted a non-consensus IGHD gene in only 8 out of the 87 remaining sequences.


View this table:
[in this window]
[in a new window]

 
Table 2. Number of alignments in agreement, and number of alternative alignments seen when 99 clonally related sequences were aligned using four alignment utilities

 
3.3 Evaluation using a random sequence sample
Clonally related sequences can provide an effective benchmark set as they are known to be derived from the same V-D-J rearrangement and alignment utilities can therefore be evaluated based on their ability to predict the same V-D-J combination for every sequence in the set. However, only few of these sequence sets are available. Further testing required the use of randomly sampled immunoglobulin sequences of unknown V-D-J composition.

A set of 662 rearranged immunoglobulin gene sequences that have previously been used in an evaluation of alignment software (Volpe et al., 2006) were collected from the EMBL database (Kanz et al., 2005). Germline genes were predicted for each of these sequences using each of the four alignment programs with the same parameter settings as for evaluation with clonally related sequences. Total agreement was only seen with 104 of the sequences. Sequences were removed from analysis if one or more utilities failed to identify an alignment, and Table 3 summarizes the extent of consensus that was seen between the programs. Analysis of the ‘odd program out’, where consensus was seen between three of the four programs, highlights the differing performances of the programs, but does not allow firm conclusion to be drawn regarding accuracy since the original germline rearrangements are not known and different programs may agree more on the basis of similarity of algorithm than on actual predictive accuracy.


View this table:
[in this window]
[in a new window]

 
Table 3. Level of agreement between four alignment programs IMGT/V-QUEST+JCTA (IMGT), iHMMune-align (iHMM), JOINSOLVER (JS) and SoDA, in the alignment of 662 human (IGH) sequences

 
The quality of iHMMune-align alignments is further supported by analysis of the distribution of mutations between associated IGHV, IGHD and IGHJ genes. The shortest IGHD gene alignments (mean length of 13.3 nt) were seen with JOINSOLVER and did not include mismatches. The longest were seen with IMGT/Junction Analysis, with a mean length of 17.9 nt having mean 2.1 mutations. iHMMune-align alignments had a mean length of 15.4 nt and 0.6 mutations, while SoDA alignments had a mean length of 16.1 nt and 1.0 mutations.

JOINSOLVER and SoDA both allow alignments to inverted IGHD sequences. The appropriateness of this was tested by an examination of alignment lengths. The average length of 86 inverted SoDA alignments was 8.1 nt, compared to an average 15.0 nt for the 550 other SoDA IGHD alignments. The most common inverted SoDA alignment was to one or other of the IGHD2-2 alleles. The average length of these 37 alignments was 6.5 nt, while the mean length of the 48 regular IGHD2-2 alignments was 20.5. The longest inverted alignment was 15 nt, and this included 3 mismatches. This was an alignment to IGHD3-22*01. The inverted germline IGHD3-22*01 gene aligns well against many IGHD genes in the regular orientation, including a single mismatch in 10 nt to IGHD2-21*01 and six mismatches in 21 nt to IGHD3-3*02. It should therefore not be surprising, e.g. if a mutated IGHD3-3*02 gene occasionally aligns well to an inverted IGHD3-22*01 sequence.

As a further evaluation of IGHD and IGHJ gene alignments, we considered sequences that aligned perfectly against germline IGHV genes according to each alignment program. We then examined the level of mutations in the associated IGHD and IGHJ genes. Since the probability of somatic mutation decays exponentially with position along the sequence (Rada and Milstein, 2001), it is expected that sequences with no mutations in the IGHV gene should have very few or no mutations in their IGHD and IGHJ genes. The results are presented in Table 4, which shows lower levels of mutation in the iHMMune-align output. This is probably a measure of the ability of the different programs to correctly identify the ends of the genes, for most of the apparent mutations identified by the other programs were at the gene ends. No data are presented for JOINSOLVER because its algorithm does not allow for any IGHD mutations.


View this table:
[in this window]
[in a new window]

 
Table 4. Numbers of mismatches seen in IGHD and IGHJ alignments, in sequences that aligned to germline IGHV genes with no mismatches

 

    4 DISCUSSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 
The identification and alignment of component germline genes in rearranged and mutated immunoglobulin gene and cDNA sequences is important not only for understanding the mechanisms for generating antibody diversity but also as part of many clinical investigations. We have developed iHMMune-align, an application that incorporates an explicit model of V-D-J recombination and somatic mutation processes in the form of a HMM. The use of an HMM allows modelling immunoglobulin gene-specific processes that are not represented when using standard pairwise sequence alignment techniques originally developed for aligning homologous sequences and modelling general evolutionary processes. The current version of iHMMune-align focusses on the alignment of human IGH, although the algorithm can be applied to light chains and other organisms provided sufficient training data are available.

iHMMune-align includes an initial Smith–Waterman alignment step for identifying the IGHV gene. The IGHV region is relatively long (around 300 bases) and diverse, with 50 genes and 217 alleles having been reported to be functional (Lefranc, 2001). Incorporating all possible IGHV genes in the HMM would therefore be too computationally expensive. The V-REGION is long enough to be readily identifiable by pairwise alignment, and all four utilities tested were in relatively good agreement regarding their IGHV gene assignment, although variations were observed with regard to predicted alleles. The pre-alignment of the IGHV gene also allows iHMMune-align to estimate the relative amount of somatic mutation over the sequence, which is then used to calibrate the emission probabilities of the HMM.

The evaluation of immunoglobulin gene alignment accuracy is not straightforward as no ‘gold standard’ sequence benchmark of known V-D-J composition is available. Germline gene composition can be inferred by expert visual inspection only for relatively unmutated sequences that are trivial to align by any approach, but not for more problematic mutated sequences where utilities will differ most in their predictions. Others have used simulated rearranged immunoglobulin genes as test sequences (Volpe et al., 2006), but this approach does not take into account the known (and unknown) mechanisms of immunoglobulin diversity generation and is unlikely to result in biologically significant conclusions. We propose the use of sets of clonally related sequence sets for evaluating immunoglobulin gene alignment utilities. Although the V-D-J composition of these sequences is unknown, all sequences in a set are derived from a single rearrangement. An ‘ideal’ alignment utility able to identify the original germline genes should therefore predict the same V-D-J composition for all sequences in the set. Two such sets were available to us, each displaying different characteristics in the CDR3 region (one uses a short IGHD gene, the other a longer IGHD gene). Together these sets provide a good estimate of the performance of different alignment approaches with highly mutated sequences, and complement evaluations with larger, more diverse sets of relatively unmutated sequences used in our and other studies. More comprehensive evaluation of immunoglobulin gene alignment methods will require the availability of additional sets of clonally related sequences, but in their absence, we believe the present study represents the most thorough attempt to date to develop an objective benchmark for immunoglobulin gene alignment accuracy.

A comparison of iHMMune-align with the current standard application, IMGT/V-QUEST+JCTA, and JOINSOLVER and SoDA, two relatively new alignment programs, highlights its excellent performance. iHMMune-align predicted the V-D-J composition of the two sets of clonally related sequences with higher accuracy than the other programs. Program quality was also evaluated by visual inspection of alignments, especially over the short IGHD gene and its junction with IGHV and IGHJ genes. iHMMune-align IGHD gene alignments were generally longer and contained fewer mismatches than the alignments generated by other programs. iHMMune-align models explicitly the factors known to affect somatic mutation and as a result avoids unlikely alignments that postulate a much larger number of mutations at the 3' end of the gene relative to its 5' end.

Alignment quality is also a function of the germline gene repertoire used by the program. All four programs were evaluated using the same release of the IMGT repertoire. However both JOINSOLVER and SoDA allow alignment to inverted IGHD genes, whose use is controversial (Corbett et al., 1997) and which are not included in the default repertoires used by IMGT programs and iHMMune-align. For every sequence in the test set where JOINSOLVER or SoDA predicted an inverted IGHD gene, iHMMune-align was able to generate a likely alignment with an IGHD gene in standard orientation.

One potential weakness of iHMMune-align algorithm is that it allows for nucleotide insertions or deletions at the gene junctions but not within germline genes. Some nucleotide insertions and deletions have been observed at small frequency in the somatic mutation process. However, the need to maintain the immunoglobulin protein framework means that frameshift mutations in the coding regions are eliminated during clonal expansion and antigen selection and only in-frame insertions and deletions of 3 or 6 nt are tolerated and have been reported in CDR1 and CDR2 (Wilson et al., 1998). Of the four algorithms tested, only SoDA allows for insertions and deletions within genes, but in our testing the overwhelming majority of insertions were single nucleotide insertions in the IGHV genes. Insertions were seen in 31 IGHD alignments, but again, all but one of these insertions were of a single nucleotide.

iHMMune-align is currently available as a Java application. Planned program developments include a web interface to facilitate access to the program for the casual user. The nature of the algorithm suggests that the accuracy of iHMMune-align can be improved with additional training data. We are currently focussing on improving the modelling of exonuclease removal, which should improve the reliability of short IGHD alignments.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Alfonso Valencia

Received on September 17, 2006; revised on April 11, 2007; accepted on April 11, 2007

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 RESULTS
 4 DISCUSSION
 REFERENCES
 

    Basu M, et al. Synthesis of compositionally unique DNA by terminal deoxynucleotidyl transferase. Biochem. Biophys. Res. Comm (1983) 111:1105–1112.[CrossRef][Web of Science][Medline]

    Benedict CL, et al. Terminal deoxynucleotidyl transferase and repertoire development. Immunol. Rev (2000) 175:150–157.[CrossRef][Web of Science][Medline]

    Bleakley K, et al. IMGT standardization for statistical analyses of T cell receptor junctions: the TRAV-TRAJ example. In Silico Biol (2006) 6:0051.

    Collins AM, et al. Partitioning of rearranged Ig genes by mutation analysis demonstrates D-D fusion and V gene replacement in the expressed human repertoire. J. Immunol (2004) 172:340–348.[Abstract/Free Full Text]

    Corbett SJ, et al. Sequence of the human immunoglobulin diversity (D) segment locus: a systematic analysis provides no evidence for the use of DIR segments, inverted D segments, "minor" D segments or D-D recombination. J. Mol. Biol (1997) 270:587–597.[CrossRef][Web of Science][Medline]

    Damle RN, et al. Ig V gene mutation status and CD38 expression as novel prognostic indicators in chronic lymphocytic leukemia. Blood (1999) 94:1840–1847.[Abstract/Free Full Text]

    Dorner T, Lipsky PE. Immunoglobulin variable-region gene usage in systemic autoimmune diseases. Arthritis Rheum (2001) 44:2715–2727.[CrossRef][Web of Science][Medline]

    Giudicelli V, et al. IMGT/V-QUEST, an integrated software program for immunoglobulin and T cell receptor V-J and V-D-J rearrangement analysis. Nucleic Acids Res (2004) 32:W435–W440.[Abstract/Free Full Text]

    Gotoh O. Optimal sequence alignment allowing for long gaps. Bull. Math. Biol (1990) 52:359–373.[Web of Science][Medline]

    Hamblin TJ, et al. Unmutated Ig V(H) genes are associated with a more aggressive form of chronic lymphocytic leukemia. Blood (1999) 94:1848–1854.[Abstract/Free Full Text]

    Huang SC, et al. VH usage and somatic hypermutation in peripheral blood B cells of patients with rheumatoid arthritis (RA). Clin. Exp. Immunol (1998) 112:516–527.[CrossRef][Web of Science][Medline]

    Jackson KJ, et al. Exonuclease activity and P nucleotide addition in the generation of the expressed immunoglobulin repertoire. BMC Immunol (2004) 5:19.[CrossRef][Medline]

    Kanz C, et al. The EMBL Nucleotide Sequence Database. Nucleic Acids Res (2005) 33:1.[Abstract/Free Full Text]

    Kosmas C, et al. Molecular analysis of immunoglobulin genes in multiple myeloma. Leuk. Lymphoma (1999) 33:253–265.[Web of Science][Medline]

    Lee CE, et al. Reconsidering the human immunoglobulin heavy-chain locus: 1. An evaluation of the expressed human IGHD gene repertoire. Immunogenetics (2006) 57:917–925.[CrossRef][Web of Science][Medline]

    Lefranc MP. Nomenclature of the human immunoglobulin heavy (IGH) genes. Exp. Clin. Immunogenet (2001) 18:100–116.[CrossRef][Web of Science][Medline]

    Lefranc MP. IMGT, the international ImMunoGeneTics information system(R): a standardized approach for immunogenetics and immunoinformatics. Immunome. Res (2005) 1:3.[CrossRef][Medline]

    Lefranc MP, et al. IMGT, the international ImMunoGeneTics information system. Nucleic Acids Res (2005) 33:D593–D597.[Abstract/Free Full Text]

    Li H, et al. Genetic diversity of the human immunoglobulin heavy chain VH region. Immunol. Rev (2002) 190:53–68.[CrossRef][Web of Science][Medline]

    Martin A, Scharff MD. AID and mismatch repair in antibody diversification. Nat. Rev. Immunol (2002) 2:605–614.[Web of Science][Medline]

    Moustafa A. JAligner: open source Java implementation of Smith-Waterman. (2006) http://jaligner.sourceforge.net.

    Neuberger MS, et al. Somatic hypermutation at A.T pairs: polymerase error versus dUTP incorporation. Nat. Rev. Immunol (2005) 5:171–178.[CrossRef][Web of Science][Medline]

    Rabiner LR. A tutorial on hidden markov-models and selected applications in speech recognition. Proc. IEEE (1989) 77:257–286.[CrossRef]

    Rada C, Milstein C. The intrinsic hypermutability of antibody heavy and light chain genes decays exponentially. EMBO J (2001) 20:4570–4576.[CrossRef][Web of Science][Medline]

    Ravetch JV, et al. Structure of the human immunoglobulin µ locus: characterization of embryonic and rearranged J and D genes. Cell (1981) 27:583–591.[CrossRef][Web of Science][Medline]

    Ruiz M, et al. The human immunoglobulin heavy diversity (IGHD) and joining (IGHJ) segments. Exp. Clin. Immunogenet (1999) 16:173–184.[CrossRef][Web of Science][Medline]

    Smith TF, Waterman MS. Identification of common molecular subsequences. J. Mol. Biol (1981) 147:195–197.[CrossRef][Web of Science][Medline]

    Souto-Carneiro MM, et al. Characterization of the human Ig heavy chain antigen binding complementarity determining region 3 using a newly developed software algorithm, JOINSOLVER. J. Immunol (2004) 172:6790–6802.[Abstract/Free Full Text]

    Tobin G, et al. Somatically mutated Ig V(H)3-21 genes characterize a new subset of chronic lymphocytic leukemia. Blood (2002) 99:2262–2264.[Abstract/Free Full Text]

    Volpe JM, et al. SoDA: implementation of a 3D alignment algorithm for inference of antigen receptor recombinations. Bioinformatics (2006) 22:438–444.[Abstract/Free Full Text]

    Widhopf GF, et al. Chronic lymphocytic leukemia B cells of more than 1% of patients express virtually identical immunoglobulins. Blood (2004) 104:2499–2504.[Abstract/Free Full Text]

    Wilson PC, et al. Somatic hypermutation introduces insertions and deletions into immunoglobulin V genes. J. Exp. Med (1998) 187:59–70.[Abstract/Free Full Text]

    Yoshida M, et al. Immunoglobulin V-H genes in thymic MALT lymphoma are biased toward a restricted repertoire and are frequently unmutated. J. Pathol (2006) 208:415–422.[CrossRef][Web of Science][Medline]

    Yousfi Monod M, et al. IMGT/JunctionAnalysis: the first tool for the analysis of the immunoglobulin and T cell receptor complex V-J and V-D-J JUNCTIONs. Bioinformatics (2004) 20(Suppl. 1):I379–I385.[CrossRef][Medline]

    Zheng N-Y, et al. Human immunoglobulin selection associated with class switch and possible tolerogenic origins for C delta class-switched B cells. J. Clin. Invest (2004) 113:1188–1201.[CrossRef][Web of Science][Medline]

    Zheng NY, et al. Intricate targeting of immunoglobulin somatic hypermutation maximizes the efficiency of affinity maturation. J. Exp. Med (2005) 201:1467–1478.[Abstract/Free Full Text]


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
Nucleic Acids ResHome page
X. Brochet, M.-P. Lefranc, and V. Giudicelli
IMGT/V-QUEST: the highly customized and integrated system for IG and TR standardized V-J and V-D-J sequence analysis
Nucleic Acids Res., July 1, 2008; 36(suppl_2): W503 - W508.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
23/13/1580    most recent
btm147v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (1)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Gaëta, B. A.
Right arrow Articles by Collins, A. M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Gaëta, B. A.
Right arrow Articles by Collins, A. M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?