Computational Genomics
A comparative genome approach to marker ordering


1 Laboratoire de génétique cellulaire BP 52627, 31326 Castanet Tolosan, France
2 Laboratoire de mathématiques et informatique appliquées, INRA BP 52627, 31326 Castanet Tolosan, France
3 CNRS UMR6061 Génétique et Développement, Université de Rennes 1 IFR 140, 2 Av du Pr Léon Bernard, CS 34317, 35043 Rennes, France
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Genome maps are fundamental to the study of an organism and essential in the process of genome sequencing which in turn provides the ultimate map of the genome. The increased number of genomes being sequenced offers new opportunities for the mapping of closely related organisms. We propose here an algorithmic formalization of a genome comparison approach to marker ordering.
Results: In order to integrate a comparative mapping approach in the algorithmic process of map construction and selection, we propose to extend the usual statistical model describing the experimental data, here radiation hybrids (RH) data, in a statistical framework that models additionally the evolutionary relationships between a proposed map and a reference map: an existing map of the corresponding orthologous genes or markers in a closely related organism. This has concretely the effect of exploiting, in the process of map selection, the information of marker adjacencies in the related genome when the information provided by the experimental data is not conclusive for the purpose of ordering. In order to compute efficiently the map, we proceed to a reduction of the maximum likelihood estimation to the Traveling Salesman Problem. Experiments on simulated RH datasets as well as on a real RH dataset from the canine RH project show that maps produced using the likelihood defined by the new model are significantly better than maps built using the traditional RH model.
Availability: The comparative mapping approach is available in the last version of de Givry,S. et al. [(2004) Bioinformatics, 21, 17031704, www.inra.fr/mia/T/CarthaGene], a free (the LKH part is free for academic use only) mapping software in C++, including LKH (Helsgaun,K. (2000) Eur. J. Oper. Res., 126, 106130, www.dat.ruc.dk/keld/research/LKH) for maximum likelihood computation.
Contact: thomas.faraut{at}toulouse.inra.fr
| 1 INTRODUCTION |
|---|
|
|
|---|
Since the discovery of the molecular basis of genes, the time devoted to mapping has dramatically increased, reaching its apogee with the advent of whole genome sequence projects. Although the complete sequence provides the ultimate map of a genome, the problem of constructing a map from experimental data remains an active area of research (Bø et al., 2002; Crane and Crane, 2004; Mester et al., 2003; Wu et al., 2003). Maps are key to the study of organisms that are not planned to be sequenced in the near future. In addition, the availability of detailed maps offers great advantage in the process of whole genome sequencing (Havlak et al., 2004). The production of whole genome sequences therefore does not dismiss the need for gene mapping. It suggests however alternative mapping strategies. Having in hand the exhaustive gene catalog of a completely sequenced genome, makes it possible to take advantage of the conservation of chromosome segments with a related genome of interest. This approach, also called comparative mapping, has been extensively used for many years as a guideline for the construction of maps in animals as well as in plants (Bowers et al., 2005; O'Brien and Graves, 1990). The comparative mapping strategy is also of great value in the context of whole genome sequence assembly (Havlak et al., 2004; Pop et al., 2004).
We propose here a novel approach to gene mapping, in the context of radiation hybrid (RH) mapping, provided that a closely related completely sequenced genome is available. Unlike the traditional approach, the map of the reference organism is used at the very first step of marker ordering for the construction and evaluation of the candidate maps. Although devised in the context of RH mapping, we believe that the proposed method applies equally to other mapping strategies such as genetic mapping. Sections 2 and 3 describe a new statistical model that takes into account both the experimental RH data and the order in a related organism. Section 4 deals with the algorithmic aspects of searching the space of all possible maps, trying to find the best one according to the predefined criterion, without evaluating the n!/2 possible marker orders. Finally, the interest of this approach is evaluated on both simulated and real data, showing a significant improvement in map quality over the traditional approach.
| 2 THE STATISTICAL MODEL |
|---|
|
|
|---|
Our presentation is restricted to the case of RH mapping which can be described by a simple statistical model (Boehnke et al., 1991). In order to focus our presentation on the new comparative approach for marker ordering, the RH mapping technique and the associated statistical model are described in detail in the Appendix of this paper. We implicitly develop our comparative approach principle in the particular case of haploid error-free data due to the approximation using 2-point likelihood (see below and Appendix). The comparative principle is however not closely interlinked to the 2-point likelihood approach and could be extended to other approaches of RH mapping (see Discussion).
We note A, the reference organism and B, the organism of interest. For B an RH dataset X for n markers is available. We make the assumption that there is a one-to-one correspondence between the markers in B and their orthologs in A. The complete genome sequence of A provides a map
A of these markers in A. Our aim is to build a map, identified by a marker permutation
, for the n markers of organism B. Let P(X|
,
) denote the likelihood of the data for a given order
and a set of parameters (nuisance parameters such as the retention fraction and breakage frequencies for RHs). In the traditional maximum likelihood approach, the likelihood associated with each order is the maximum over all possible values of
:
![]() | (1) |
that maximizes this likelihood. Although the situation is generally complicated by the fact that the estimation of
depends on the particular choice of
, we will consider an approximation of this likelihood, using the product of 2-point maximum likelihoods strictly equivalent to the likelihood only for haploid error-free data, which breaks this dependencies between
and
(see Appendix and Agarwala et al., 2000 for a detailed description of 2-point likelihoods and a discussion of the relevance of such an approximation).
Using this approximation, we can consider the likelihood of the data as depending solely on
:
![]() | (2) |
![]() | (3) |
In this framework, the information provided by the existing map
A for the corresponding orthologous genes in A can be incorporated by defining a non-uniform prior distribution on the possible orders for the map in B. We suppose that the probability of an order is a function of its evolutionary distance to the reference map, measured with the number of breakpoints between the proposed order
and the reference order
A. This distance, denoted as k, is the number of adjacent markers in
which are not adjacent in
A.
As the choice of a particular order
implies a unique breakpoint distance with the reference map, the previous equation can be written as
![]() | (4) |
(k), the only expression which is not yet determined is P(
|k), the likelihood of a given order for a fixed number of breakpoints. For a given breakpoint distance, we assume that all the orders are equally probable and hence follow a uniform distribution. The likelihood takes the following form:
![]() | (5) |
| 3 NUMBER OF ORDERS AT A GIVEN BREAKPOINT DISTANCE |
|---|
|
|
|---|
We describe first the case of single chromosome genomes and then extend our results to the case of multiple chromosomes. Since complete map reversals define the same order, a permutation and its complete reversal will be considered equivalent in the sequel.
3.1 Single chromosome genomes
We assume that the reference order
A is the identity permutation. Consider an arbitrary permutation
. We define a segment in this permutation as a maximal set of markers in the permutation that contains no breakpoint with
A. The single order exempt of breakpoints with
A is
A itself. With a fixed breakpoint, the two resulting segments can be ordered in three different ways as follows:
![]() |
- 0 breakpoint is created when n is inserted before or after marker n 1, at the border of a segment;
- 1 breakpoint is created when n is inserted (1) inside a segment next to marker n 1, (2) at the position of an existing breakpoint or (3) at one of the two ends (borders) of the permutation except next to n 1;
- 2 breakpoints are created when n is inserted anywhere inside in a segment, except next to n 1.
: permutations with n isolated at one of the two extremities of the permutation.
: permutations with n isolated but in a central position (anywhere except at the extremities).
: permutations with n at one of the extremities of the permutation and at the border of a segment.
: permutations with n on the border of a segment but in a central position.
and
. We have
. The following induction relations enable to compute the number of permutations sharing a fixed number of breakpoints with the identity permutation: |
|
. The other relations can be derived by a similar analysis. Setting all quantities to 0 for k < 0 and using initial values of
,
, a simple dynamic programming procedure can compute all On(k) values for n
N and k
N 1 in quadratic time.
|
3.2 Multiple chromosome genomes
Generalization to multiple chromosomes implies to distinguish obligate breakpoints created by the concatenation of markers from different chromosomes from other breakpoints. If the chromosome maps of the reference organism are arbitrarily concatenated before the numbering process, some adjacencies in this new reference map must be considered as breakpoints. Let n1, ... , nr denote the number of markers on the chromosomes 1, ... , r of the reference organism A involved in a single linkage group of the genome of interest B. In the induction process, when incorporating the first marker from a new chromosome in the permutation, i.e of the type
for
, one has to ensure that an additional breakpoint is always created. The number of permutations at a given breakpoint distance k when n spans the n1 +
+ nr markers uses the same induction relations as defined in 3.1 with the following modifications for the particular cases where
: |
|
| 4 MAXIMUM LIKELIHOOD COMPUTATION REDUCED TO SOLVING A TRAVELING SALESMAN PROBLEM |
|---|
|
|
|---|
In order to compute efficiently the maximum likelihood estimation of
under the model defined by (4) we reduce the corresponding optimization problem to the Traveling Salesman Problem (TSP). The principle of this reduction is to write the likelihood of an order as a weighted path visiting all the markers in that order. Practically, this entails constructing a distance measure on the set of markers. We consider the log-likelihood
![]() |
![]() |
is the contribution of the RH data associated with marker interval [xi, xi+1] to the likelihood of the map defined by
(see Appendix and Agarwala et al., 2000). Due to the exponential nature of On(k), the additive contribution of each interval for the breakpoint counterpart of the likelihood is obtained by a linear regression y = a + bk on the data
using the exact computation of P(
|k) given by the recurrence formula (this computation can be easily precomputed once for different number of markers and the results made available as a table) of Section 3 and a predefined parameter
for the Poisson law. Setting
![]() | (6) |
![]() |
![]() |
and
Solving the resulting TSP instances can be done in several ways using either complete methods such as branch and cut or heuristic methods. We have tried both state-of-the-art complete and/or heuristic methods available in CONCORDE (Applegate et al., 1998) and LKH (Helsgaun, 2000, www.dat.ruc.dk/keld/research/LKH). The likelihood computation has been implemented above the CARTHAGÈNE (de Givry et al., 2004, www.inra.fr/mia/T/CarthaGene) C++ and LKH ([11]) C libraries.
For the purpose of comparing the performance of the comparative 2-point and the simple 2-point approaches, all the TSP instances in the sequel are resolved using the LKH heuristic of Helsgaun, 2000.
| 5 SIMULATED RH DATASETS |
|---|
|
|
|---|
The following protocol is used to generate RH datasets and reference orders. N markers are randomly distributed according to the uniform distribution on a chromosome of size S Ray giving rise to the target map or true order. The inter-marker expected breakage frequencies
corresponding to the inter-marker distance
i,i+1 are subsequently used to generate random RH datasets for I individuals according to the haploid equal retention model (Boehnke et al., 1991) with the retention fraction r, a false positive/negative error rate perror and a proportion of missing data pmiss. Finally, a reference order is generated by applying a sequence of rearrangement events (reversal, transposition and inverted transposition) on the true order with an expected number of events, or evolutionary distance, set to E (Moret et al., 2001). Note that an inversion creates two breakpoints while the two other rearrangements produce three breakpoints so that the expected number of breakpoints is
if the three rearrangements are equiprobable. In addition, a parameter H controls the proportion of known orthologous relationships which are randomly selected among the N possible ones. Whenever a marker x has no identified ortholog, 1x|y is set to 1 in (6) mimicking therefore a breakpoint. In the experiments, we tried the following values for the generator parameters:
. Each reported experimental result is a mean over 100 randomly generated RH datasets and reference orders by following the previous protocol with a fixed value of the parameters.
In order to assess the effectiveness of our comparative mapping approach, two performance metrics were used to evaluate the accuracy of the proposed maps: (1) proportion of the correctly reconstructed maps, (2) the longest increasing subsequence (LIS). Since, in our simulations, the true order is represented by the identity permutation, the LIS of a candidate order indeed measures how accurate the candidate map is. Let
, the LIS is the largest subset
such that
. We note LIS(
) = r the size of this set. LIS computation is folklore in algorithmic and has already been used for the evaluation of mapping strategies (Bø et al., 2002).
| 6 SIMULATION RESULTS |
|---|
|
|
|---|
The robustness of our approach was studied with respect to three different factors: the influence of the evolutionary distance with the reference genome, the influence of chromosome size (or marker density) and the proportion of known orthology relationships within the dataset.
As expected, the availability of a complete map for a closely related organism significantly improves mapping efficiency, the improvement being dependent on the evolutionary distance between the two maps (Fig. 2). In these experiments, for S = 15 Ray, for example, the true order was never found by the simple 2-point RH mapping approach while the comparative 2-point RH mapping recovered the true order from 16 up to 65 times depending on the evolutionary distance. The proportion of correctly reconstructed order is however too crude for a metric: as the number of markers increases, the probability of recovering the true order decreases rapidly (see Ben-Dor and Chor, 1997 for a formal analysis of this behavior). The LIS criterion in contrast, by measuring the size of the largest subset correctly ordered in the proposed map, enables to quantify the distance to the true map. Comparison using this criteria, shown in Figure 3, confirms the benefit of the comparative mapping approach. Less than 10% of the markers were wrongly positioned when the chromosome size belongs to the interval [5, 15] Ray in the case of comparative 2-point RH mapping with a medium-size evolutionary distance E = 4. On the contrary, simple 2-point RH approach got 33% of incorrectly positioned markers at its best (S = 10 Ray).
|
|
As shown in both figures, there is a clear influence of marker density on the mapping accuracy. Indeed, the linkage between markers is respectively loose and tight for large and small chromosomes. In both extreme cases, the RH dataset is not very informative for the purpose of ordering and the reference order provides therefore a valuable information. The robustness of the comparative approach to marker densities, due to the fact that the evolutionary breakpoints are independent from the number of markers, is of great value when the objective is to produce dense maps.
In our experiments, the expected number of breakpoints between the true order and the reference order, or
, was set to 1 in the Poisson prior P
(k). This value is generally unknown for the mapping process. However, no clear improvement in terms of both criteria was observed when using for each instance the exact number of breakpoints, available in the context of simulation (data not shown).
Finally, we studied the impact of diminishing the proportion of known orthologous relationships. Figure 4 shows the results for the LIS criterion on a 10 Ray chromosome with H
[0,100]. When H = 0, the method reduces to a simple 2-point RH mapping approach. When H was >4050%, we observed a clear improvement in terms of map quality for the comparative 2-point mapping approach compared to the simple 2-point RH model. Below this threshold, the knowledge of a partial reference order can be counterproductive, especially if the evolutionary distance is high. An explanation for this negative result, in the case of E = 8 and H = 30, is the fact that the number of breakpoints was close to the number of orthologous relationships (in the experiments, E = 8 corresponds to 18.74 breakpoints for 100 markers and still 11.27 breakpoints for H = 30 orthologous markers) and the TSP reduction provided a coarse approximation because of the arbitrary weight wx,y assigned in the absence of orthologs (see Section 5).
|
| 7 EXPERIMENTS WITH A DOG RH DATASET |
|---|
|
|
|---|
In order to test the efficiency of our method on a real example, we applied this comparative approach to the construction of a RH map of a whole canine chromosome (CFA2, Fig. 5) using a set of 426 markers typed on the RHDF9000 dog RH panel (Hitte et al., 2005). The human genome sequence was used as a reference map. As the RH markers consisted essentially in gene-based fragments, the corresponding orthologous position was determined for all 426 markers using a simple reciprocal best hit principle with the human gene catalog (Kirkness et al., 2003). The 426 markers cover the entire canine chromosome 2 (87 Mb) corresponding to a marker every 200 kb on average. We constructed RH maps of CFA2 using both the simple 2-point RH method and the comparative 2-point approach. The comparative mapping approach showed a clear improvement over the simple 2-point method in that the proposed map was in better agreement with the dog genome sequence (Lindblad-Toh et al., 2005) than the map built using the simple 2-point RH mapping approach. An illustration of this improvement is given in Figure 5.
|
| 8 DISCUSSION |
|---|
|
|
|---|
As frequently pointed out (Agarwala et al., 2000; Ben-Dor and Chor, 1997; Ben-Dor et al., 2000), and illustrated in the previous sections, the major impediments to producing dense high-quality RH maps are the panel resolution power and experimental data quality and not computation. The traditional avenue to overcome this problem is the construction of framework maps: only a subset of markers is ordered with the counterpart that the proposed order is significantly better (usually in a ratio of 1000:1 of the likelihood) than all other orders with the same markers. Unfortunately, this has a cost as the remaining unplaced markers (typically 5080% of initial dataset) are then positioned into bins of confidence leading to a placement map which may encounter many discrepancies with the true order. We propose here a novel approach that defines a new objective function which takes into account the information provided by a closely related completely sequenced genome: a genome for which an exhaustive map is available. The efficiency of the method is clearly dependent on the evolutionary distance between the reference genome and the genome one wishes to map but also on the quality of orthologous relationships. The proposed objective function performs significantly better than the simple 2-point likelihood on both simulated and real data for the range of parameters typically observed for the mammalian species (relative low number of breakpoints and ability to detect orthology relationships). While the experiments are here restricted to the comparison with the simple 2-point approach of RH mapping, our purpose was to demonstrate the benefits of incorporating comparative mapping information in an existing statistical framework, principle which should be applicable to other RH mapping strategies.
We would like to emphasize that the comparative approach could also be of interest within a species, in the context of genetic mapping, when a map of the markers already exists and one wishes to incorporate this prior knowledge into the statistical model while studying additional experimental dataset corresponding to another breed or cultivar for example.
Future directions should address the case of multiple closely related sequenced genomes.
| APPENDIX |
|---|
|
|
|---|
A RH experiment can be rapidly sketched as follows: cells from the organism under study are irradiated. The radiation breaks the chromosomes at random locations into separate fragments. A random subset of the fragments is then rescued by fusing the irradiated cells with normal rodent cells, a process that produces a collection of hybrid cells. The resulting clone may contain none, one or many chromosome fragments. This clone is then tested for the presence or absence of each of the markers. This process is performed a large number of times producing a radiated hybrid panel, previously called RH dataset in Section 2.
More formally, given N markers and I hybrid cells, a panel is a collection of I vectors of identical size N, containing boolean values 0 for the absence of a marker and 1 for its presence.
The radiation breakage frequencies between two markers, estimated from their co-occurrence pattern in a panel of radiated hybrid cells (possible configuration patterns are (11), (10), (01) or (00) in vectors), provides, in a similar manner to the recombination fraction in genetic mapping, a measure of the distance separating the markers. The distance unit is the Ray, corresponding to a segment length where one break is expected. Let r denote the retention fraction and
the breakage probability between markers y and z. The conditional probabilities of the status Z of marker z, knowing the status Y of marker y, is given by the following formulas (Boehnke et al., 1991):
![]() |
![]() | (1) |
and the 2-point likelihood
![]() |
is the extended set of parameters (
, r) and nij the cardinality of the different configurations outlined above with ni. the marginal cardinality ni0 + ni1.
The maximum likelihood estimate of r is simply the ratio of the total number of 1s to the total number of 1s and 0s (the number of 1 in the panel divided by I x N). The maximum likelihood estimate of the breakage frequency
can be derived analytically from (1) (see for example Agarwala et al., 2000 for a detailed description).
The natural mathematical framework for RH mapping depicts the succession of loci on a chromosome as successive steps of a Markov chain. The likelihood of a hybrid for a given order
= (x1 ... xn) is the probability to observe the data X under the associated Markov model
![]() | (2) |
![]() | (3) |
i is the set of parameters restricted to the interval between two consecutive markers. In particular, the maximization over the parameters
on one side and the order parameter
can be conducted independently. We call L
(Xi|Xi 1) the 2-point maximum likelihoods:
![]() |
described above. The likelihood of an order
can be computed directly from these maximum likelihoods:
![]() | (4) |
for simplicity, we note
![]() |
|
|
![]() | (5) |
In general however, the correct Markov formalization implies some hidden properties (model including the diploid nature of the genome or typing errors) and equality (4) no longer holds. It has been argued that the product of 2-point maximum likelihoods provides however a good approximation of the likelihood (Ben-Dor et al., 2000; Agarwala et al., 2000).
| Acknowledgments |
|---|
Thanks to Lauranne Duquenne who wrote the extension of CARTHAGÈNE taking into account a reference order data set and Brigitte Mangin for fruitful discussions concerning the statistical model.
| FOOTNOTES |
|---|
The authors wish it to be known that, in their opinion, the first two authors should be regarded as joint First Authors. | REFERENCES |
|---|
|
|
|---|
Agarwala, R., et al. (2000) A fast and scalable radiation hybrid map construction and integration strategy. Genome Res, . 10, 350364
Applegate, D., et al. (1998) On the solution of traveling salesman problems. Proc. ICM, III, 645656.
Ben-Dor, A. and Chor, B. (1997) On constructing radiation hybrid maps. J. Comp. Biol, . 4, 517533.
Ben-Dor, A., et al. (2000) RHOradiation hybrid ordering. Genome Res, . 10, 365378
Bø, T.H., et al. (2002) A fast top-down method for constructing reliable radiation hybrid frameworks. Bioinformatics, 18, 1118
Boehnke, M., et al. (1991) Statistical methods for multipoint radiation hybrid mapping. Am. J. Hum. Genet, . 49, 11741188[ISI][Medline].
Bowers, J.E., et al. (2005) Comparative physical mapping links conservation of microsynteny to chromosome structure and recombination in grasses. Proc. Natl Acad. Sci. USA, 102, 1320613211
Crane, C.F. and Crane, Y.M. (2004) A nearest-neighboring-end algorithm for genetic mapping. Bioinformatics, 21, 15791591.
de Givry, S., et al. (2004) CARTHAGENE: multipopulation integrated genetic and radiated hybrid mapping. Bioinformatics, 21, 17031704.
Havlak, P., et al. (2004) The Atlas genome assembly system. Genome Res, . 14, 721732
Helsgaun, K. (2000) An effective implementation of the Lin-Kernighan traveling salesman heuristic. Eur. J. Oper. Res, . 126, 106130[CrossRef].
Hitte, C., et al. (2005) Facilitating genome navigation: survey sequencing and dense radiation-hybrid gene mapping. Nat. Rev. Genet, . 6, 643648[CrossRef][ISI][Medline].
Kirkness, E.F., et al. (2003) The dog genome: survey sequencing and comparative analysis. Science, 301, 18981903
Lindblad-Toh, K., et al. (2005) Genome sequence, comparative analysis and haplotype structure of the domestic dog. Nature, 438, 803819[CrossRef][Medline].
Mester, D., et al. (2003) Constructing large-scale genetic maps using an evolutionary strategy algorithm. Genetics, 165, 22692282
Moret, B.M., et al. (2001) New approaches for reconstructing phylogenies from gene order data. Bioinformatics, 17, Suppl. 1, S165S173[Abstract].
O'Brien, S.K. and Graves, J.A. (1990) Report of the committee on comparative gene mapping. Cytogenet. Cell Genet, . 55, 406433[ISI][Medline].
Pop, M., et al. (2004) Comparative genome assembly. Brief. Bioinform, . 5, 237248
Wu, J., et al. (2003) Monte Carlo simulations on marker grouping and ordering. Theor. Appl. Genet, . 107, 568573[CrossRef][ISI][Medline].
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||








and 


















