Skip Navigation


Bioinformatics Advance Access originally published online on November 25, 2004
Bioinformatics 2005 21(8):1579-1591; doi:10.1093/bioinformatics/bti164
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1579    most recent
bti164v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Crane, C. F.
Right arrow Articles by Crane, Y. M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Crane, C. F.
Right arrow Articles by Crane, Y. M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2004. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions{at}oupjournals.org

A nearest-neighboring-end algorithm for genetic mapping

Charles F. Crane 1,* and Yan M. Crane 2

1USDA–ARS and Department of Botany and Plant Pathology, Purdue University West Lafayette, IN 47907, USA
2USDA-ARS and Department of Entomology, Purdue University West Lafayette, IN 47907, USA

*To whom correspondence should be addressed.


    Abstract
 TOP
 Abstract
 1 INTRODUCTION
 2 DELETION AS A...
 3 THE NEAREST-NEIGHBORING-ENDS...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 

Motivation: High-throughput methods are beginning to make possible the genotyping of thousands of loci in thousands of individuals, which could be useful for tightly associating phenotypes to candidate loci. Current mapping algorithms cannot handle so many data without building hierarchies of framework maps.

Results: A version of Kruskal’s minimum spanning tree algorithm can solve any genetic mapping problem that can be stated as marker deletion from a set of linkage groups. These include backcross, recombinant inbred, haploid and double-cross recombinational populations, in addition to conventional deletion and radiation hybrid populations. The algorithm progressively joins linkage groups at increasing recombination fractions between terminal markers, and attempts to recognize and correct erroneous joins at peaks in recombination fraction. The algorithm is O (mn3) for m individuals and n markers, but the mean run time scales close to mn2. It is amenable to parallel processing and has recovered true map order in simulations of large backcross, recombinant inbred and deletion populations with up to 37 005 markers. Simulations were used to investigate map accuracy in response to population size, allelic dominance, segregation distortion, missing data and random typing errors. It produced accurate maps when marker distribution was sufficiently uniform, although segregation distortion could induce translocated marker orders. The algorithm was also used to map 1003 loci in the F7 ITMI population of bread wheat, Triticum aestivum L. emend Thell., where it shortened an existing standard map by 16%, but it failed to associate blocks of markers properly across gaps within linkage groups. This was because it depends upon the rankings of recombination fractions at individual markers, and is susceptible to sampling error, typing error and joint selection involving the terminal markers of nearly finished linkage groups. Therefore, the current form of the algorithm is useful mainly to improve local marker ordering in linkage groups obtained in other ways.

Availability: The source code and supplemental data are http://www.iubio.bio.indiana.edu/soft/molbio/qtl/flipper/

Contact: ccrane{at}purdue.edu


    1 INTRODUCTION
 TOP
 Abstract
 1 INTRODUCTION
 2 DELETION AS A...
 3 THE NEAREST-NEIGHBORING-ENDS...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 
Genetic mapping is a necessary step in understanding genomic organization and the relationship between genes and phenotype. A genetic map is a linear or circular representation of the order of markers and distances among markers along the chromosome(s) of an organism. The genetic mapping problem is to find the most likely correct order and spacing of markers in a genetic map, given the base chromosome number and the presence, absence, or copy number of each marker in each member of a mapping population. The genetic mapping problem pits common knowledge against common sense. The common knowledge, repeatedly cited (e.g. Wu et al., 2003; Mester et al., 2003b; Jansen et al., 2001; Ben-Dor et al., 2000), is that there are n!/2 possible orders of n markers, where complete map reversals are equivalent. It is logical to conclude from this that genetic mapping is a case of the traveling salesman problem, a classic NP-hard combinatorial optimization problem that can be solved approximately through simulated annealing (Press et al., 1992) the Held-Karp algorithm (Ben-Dor et al., 2000) or an evolutionary strategy algorithm (Mester et al., 2003b). The common sense is that genetic maps are one-dimensional, reflecting the linear order of markers along the DNA in the chromosome. It is not physically possible to reach other markers without first visiting one's nearest neighbor in that direction. This implies that a greedy algorithm can identify nearest neighbors and order markers correctly in low-order polynomial time, if the distances among markers are consistent with a single map order, i.e. if the mapping population is large enough.

High-throughput methods are making feasible the simultaneous scoring of tens of thousands of markers in thousands of individuals. For an initial example, single-nucleotide polymorphisms have been scored in zebrafish microarrays (Stickney et al., 2002) and deletion or addition of single copies of non-polymorphic mouse sequences can be reliably detected (Chung et al., 2004). While the cost remains astronomical, it is declining as microarray and microbead technologies become more flexible (Nuwaysir et al., 2002) and reproducible, and thus there is an incentive to develop a method to order many thousands of markers accurately, efficiently and automatically. Such a method is proposed here. It is based on the idea that most mapping experiments can be viewed in terms of deletion of markers from a reference genotype. The method progressively identifies and joins nearest-neighboring markers to condense out the genetic map from the set of all genotyped markers. It is completely non-parametric, depending only on rankings of recombination fractions, with no a priori assumptions about breakpoint interference or repair processes.

This paper presents a view of various types of mapping populations, interpreted as deletion populations relative to a reference genotype, in Section 2. Section 3 presents the nearest-neighboring ends algorithm. Section 4 gives results for its accuracy in response to population size, dominance, segregation distortion, missing data and random typing errors, in three specific types of mapping populations, and also its run time relative to input size. Section 4 finishes with a map of 1003 loci from hexaploid bread wheat (Triticum aestivum L. emend. Thell.). Section 5 discusses the strengths and weaknesses of the algorithm.


    2 DELETION AS A GENERAL BASIS FOR GENETIC MAPPING
 TOP
 Abstract
 1 INTRODUCTION
 2 DELETION AS A...
 3 THE NEAREST-NEIGHBORING-ENDS...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 
First, four conventional but basic definitions are needed. A locus is the position of a gene or other unique sequence of interest on the chromosome. An allele is a recognizable variant of the DNA sequence at a locus. A marker is the basic unit of a genetic map. A linkage group is an independently segregating collection of ordered markers. An allele is codominant if it can be recognized in the presence of other allele(s), and if other allele(s) can be recognized in its presence. An allele is dominant if it can be recognized to the exclusion of other alleles. A marker can be a dominant allele, a codominant allele, or a whole locus (e.g. if the dosage of non-polymorphic loci can be measured by means of competitive hybridization with a reference genotype on a microarray). Thus two codominant alleles at a single locus are two independent markers, and a dominant allele is a single marker that occurs in a single linkage group. Codominant markers are required to identify homologous linkage groups. The number of linkage groups is related to the base chromosome number x and the ploidy p of the organism, and to the design of the experiment, itemized as follows without trying to be exhaustive:

  1. Backcross (BC1). A hybrid of two inbred lines is backcrossed to one of the parental lines. The haploid complement of the non-recurrent parent is mapped, with x linkage groups, c codominant markers and d dominant markers.
  2. Inbred (F2 through F{infty}). An F1 hybrid of two inbred lines is self-pollinated to produce the F2 generation. Subsequent generations are obtained through self-pollination and single-seed descent. There are two alternative mapping strategies: map one of the parents with x linkage groups and n = c + d markers, or map the 2x linkage groups of the F1 hybrid, using 2c + d markers. These two strategies also apply to haploid and doubled haploid populations derived from a heterozygous founder, except that there are 2c + 2d markers for the case of 2x linkage groups because there are no dominant heterozygotes in the mapping population.
  3. Full-sib outcross, testcross and double cross. When two distinct heterozygous individuals are crossed, it is possible to map the heterozygous markers of each parent that do not occur in the other parent. Thus there are 4x linkage groups (the reference genotype is the sum of the parents) and as many codominant markers as are unique and segregating in each parent. An alternative is to map the distinctive markers of only one parent into 2x linkage groups.
  4. Deletion. Chromosome segments are deleted by irradiation, chemical clastogen or specific genetic interaction (Tsujimoto et al., 2001). One design monitors reduced dosage of n non-polymorphic markers over x linkage groups. Another design deletes 2c + d heterozygous markers from 2x linkage groups, as in irradiation of a hybrid. A third design deletes c + d markers from the x linkage groups of one heterozygous parent. In radiation hybrid mapping, x is set to the number of added chromosomes, and c + d markers are mapped unless there are two added homologues.

In each of the preceding situations, segregation deletes a different subset of reference-genotype markers from each individual of the mapping population. The recombination fraction between markers i and j is the number of instances in which i or j, but not both, have been deleted. Markers i and j are nearest neighbors if their recombination fraction is lower than that for i or j with any other marker. We propose, on the basis of experience with simulated data, that a data set is consistent with a single map order if the ranking of recombination fractions from marker a through intervening markers to marker b is the reverse of the ranking from b through intervening markers to a. In the next section, it is presumed that the data are indeed consistent with a single map order.


    3 THE NEAREST-NEIGHBORING-ENDS ALGORITHM
 TOP
 Abstract
 1 INTRODUCTION
 2 DELETION AS A...
 3 THE NEAREST-NEIGHBORING-ENDS...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 
(a) Input data: The input consists of dosage (1 =present) for each marker in the reference genotype, and dosage (0 =absent, 1 =present) for each marker in each individual in the mapping population. The format is orthogonal to that used by MapMaker, in that data are listed by marker within individuals, not by individual within markers. For n markers in m individuals, the algorithm builds an nx n array rf of recombination fractions, first filling the lower tridiagonal half:

Initialize rf[i][j] to 0 for i = 0 to n – 1 markers and

j = 0 to i markers.

for k = 0 to m – 1 individuals

for i = 0 to n – 1 markers

for j = 0 to i markers

rf[i][j]:= rf[i][j] + 1 if marker i is

deleted and marker j is not, or marker

j is deleted and marker i is not.

endfor

endfor

endfor

Copy the lower tridiagonal half of rf to the upper tridiagonal half of rf.

Set rf[i][i] to m for i = 0 to n – 1 markers, since a marker cannot be its own nearest neighbor.

The algorithm simultaneously fills an array cumf of cumulative deletion frequencies, which is used to locate centromeres in deletion maps and to identify segregation distortion in various maps:

for i = 0 to n – 1 markers

cumf[i]: = 0//Initialize cumf to 0 for all markers.

endfor

for k = 0 to m – 1 individuals

for i = 0 to n – 1 markers

cumf[i]:= cumf[i] + 1 if marker i is deleted

in individual k

endfor

endfor

The filling of the rf and cumf arrays is amenable to parallel processing in multiprocessor environments, since the subtotals from multiple jobs can be added together.

(b) Initialize a book-keeping array of available ends and two arrays of pointers, which together hold the map as a doubly linked list:

for i = 0 to 2 * n – 1 markers

allends[i]: = 0 //Both ends of each marker are

available for joins.

endfor

The allends list is duplicate; for marker i < n, one end is allends[i], and the other end is allends[i+n]. Since allends contains only two ends per marker, branched linkage groups are precluded.

for i = 0 to n – 1 markers

forepointers[i]:= null

aftpointers[i]:= null

endfor //Thus all markers are initially unlinked.

The position of marker i is its subscript in forepointers and aftpointers. As the map develops, the nearest neighbor on the fore side of marker i will be marker forepointers[i], and the nearest neighbor on the aft side of marker i will be aftpointers[i].

(c) Beginning at the lowest recombination fraction, identify and join nearest neighbors, consuming two ends from the book-keeping array allends per join:

for k = 0 to m – 1 //value of recombination fraction

rf[i][j]

for i = 0 to 2 * n – 1 ends

for j = 0 to 2 * n – 1 ends //Look at all pairs

of ends.

Join marker i%n to marker j%n if rf[i%n]

[j%n] = k and if allends[i] = allends [j] = 0 and if j%n

is not already in the linkage group that contains i%n.

allends[i]:= –1 //used and ineligible for

further joins

allends[j]:= –1 //used and ineligible for

further joins

Break out of the k loop if the number of available ends ≤2 * base chromosome number.

endfor

endfor

endfor

Joins of markers i and j require compatible ends, i.e. aft to fore:

aftpointers[i]:= j and forepointers[j]:= i

or fore to aft:

forepointers[i]:= j and aftpointers[j]:= i

If both ends to be joined are fore ends, or both are aft ends, it is necessary to reverse one of the linkage groups by interchanging values of forepointers and aftpointers along its length. The fore end of the linkage group is identified where forepointers[i] is null and aftpointers[i] is not null. The linkage group is traversed by proceeding from i to aftpointers[i] to aftpointers[aftpointers[i]] and so on until an aftpointers value is null. As for swaps in general, a temporary variable keeps track of where to go next in the aftpointers array. To prevent cycles, the same traversal procedure is used to determine if marker j already belongs to marker i's linkage group, and the join is not performed if j is found.

(d) Find and correct errors in the map order. At this point, the map might contain concatenated linkage groups, inversions and transpositions of blocks of markers within linkage groups, and translocations of blocks of markers between linkage groups. Four distinct strategies, termed presumptive, stringent transactional, relaxed transactional and simulated annealing of marker blocks, have been tried to correct these types of errors. None of these strategies has fully succeeded with markedly inconsistent input data, and none of them is needed with fully consistent input data, which produce an error-free map. They are useful when the input data are somewhat inconsistent with a single map order. The presumptive strategy was used for all the simulations in Section 4, and all four strategies were applied to the real example in Section 4.9.

The first step in each strategy is to ensure that there are at least the ploidy times base chromosome number of linkage groups. If there are fewer than this many, the links having the greatest recombination fraction are broken until there are enough linkage groups. The second step is to seek out and repair inversions within and at the end of linkage groups. Any block of markers bounded by high-recombination links is reversed, if doing so would shorten the map. Now the strategies diverge.

The presumptive strategy cuts linkage groups that have more markers than a user-specified limit, at the link having the greatest recombination fraction, under the presumption that any larger linkage groups are mistakenly concatenated. Cuts continue until all linkage groups are below the limiting number of markers. Then the presumptive strategy cuts all remaining links that exceed a threshold recombination fraction. Next, each of the shortest linkage groups is joined to its nearest neighbor until ploidy times base chromosome number linkage groups remain. Joins are preferentially end-to-end, but insertion into the interior of another linkage group is allowed if it would further shorten the map. Joins are also monitored to prevent the formation of cycles. Finally, for as long as it shortens the map, the marker block bounded by the lowest recombination fractions is moved to its nearest neighbors, either elsewhere in the same linkage group or in another linkage group. The presumptive strategy sometimes results in an increased overall map length, and it reduces the number of ends whenever two linkage groups are joined.

The stringent transactional strategy always creates as many free ends as it consumes, and cannot lengthen the map. It identifies putatively transposed marker blocks bounded by links of high recombination fraction and moves them as in the presumptive strategy. It can also join such a block onto a free end if they are nearest neighbors. The difference from the presumptive strategy is that when a cut-and-rejoin operation would consume more ends than it produces, the link elsewhere that has the highest recombination fraction is considered as part of the transaction. The operation is performed only if the entire transaction (cut, rejoin, and when necessary cut elsewhere) shortens the map.

The relaxed transactional strategy also maintains a constant number of free ends while it corrects transpositions. It differs from the stringent strategy in allowing a transaction to proceed if all resulting links are shorter than the longest link that was cut. Therefore, it allows the map to lengthen even though it eliminates the longest links.

The simulated annealing strategy likewise applies to the correction of transpositions. A list of eligible markers is prepared from the free ends and the links that exceed a threshold recombination fraction. Joins are proposed between random pairs of eligible markers. The break and rejoin operation is performed if it would shorten the map. If it would lengthen the map, it is performed with a probability exp((previous map length – resulting map length) / temperature) (Press et al., 1992) thereby enacting the Metropolis algorithm over some cooling schedule. Because it randomly exchanges eligible markers one pair at a time, this strategy is far less efficient than the preceding strategies. Also, it has no way to insert misplaced markers into the interior of marker blocks. Its virtue is its potential to uncover shorter maps that would not be accessible to a purely greedy strategy, when the input data are inconsistent.

(e) Apply an appropriate mapping function to the interlocus recombination fractions while traversing the map in the aftpointers array. Output the map with its interlocus distances.

3.1 Variant to accommodate missing data
The preceding algorithm builds up the array of recombination fractions on the presumption that all data are present, or that any missing values have been imputed already. Another, generally better, way to deal with missing data is to include a second n x n array (instances) to hold the number of instances in which each cell [i,j] of the recombination fraction array has available data, i.e. the number of times that marker i and marker j are both accounted for. After all the recombination fractions have been accumulated, each can be multiplied by the total number of individuals and then divided by the value for markers i and j in the instances array, to estimate what the recombination fraction would have been with all data present. Then the main loop can look for recombination-fraction values that are within 0.5 of each possible integer value, as the latter ratchets upward from 0. The rest of the algorithm operates as specified above.


    4 RESULTS
 TOP
 Abstract
 1 INTRODUCTION
 2 DELETION AS A...
 3 THE NEAREST-NEIGHBORING-ENDS...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 
Three types of mapping populations were simulated to ascertain the behavior of the algorithm at different breakpoint frequencies: (1) a deletion population in which all markers have been scored for reduction from the same dosage, (2) a disomic backcross (BC1) population derived from homozygous parents, and (3) a disomic F10 recombinant inbred population derived from homozygous parents, where m F2 individuals have been advanced to m F10 individuals by nine generations of single-seed descent. The genotypes of the simulated populations were scored, and the scores were fed into the mapper program in the same format as for real data. The simulations were used to investigate the response of map accuracy to population size, marker spacing, dominance, segregation distortion, missing data and random typing errors. Several simulations looked into the effects of base chromosome number. Simulations also addressed the relationship of run time to population size and number of markers. Finally, there is an application to one of the standard molecular maps of bread wheat, illustrating the algorithm's limits in regard to missing data, small population size and gapped distribution of markers within chromosomes.

4.1 Simulation methods and typical settings
Separate programs were written to simulate deletion, backcross and recombinant inbred populations, and additional variants of the backcross program were used for different situations with segregation distortion. The deletion and backcross programs created a true marker order and distribution as numeric values in (0, 1). For recombinant inbreds, dominant markers were placed in (0, 1); the codominant markers were placed in the interval (0, 0.5), and this distribution was replicated to the interval (0.5, 1), with the marker numbers (names) incremented by c, the number of loci having codominant alleles. Markers were either randomly distributed, or uniformly distributed and randomly ordered. For parallel processing, it was possible to build different mapping populations to the same true order and distribution, but otherwise each simulated population had its own marker order. The majority of deletion and backcross simulations were based on a monoploid karyotype of eight equally long chromosomes. This monoploid number doubled to 16 for recombinant inbreds, since both parental genomes were mapped. Marker locations were not allowed to fall precisely on chromosome ends.

For each simulator, there was a maximum allowed number of events that could do nothing or result in a breakpoint. Each chromosome pair was treated as double-stranded to ease the mathematics, but the results apply accurately to the actual four-stranded pairs if the probability of deletion is adjusted by half. For deletion populations, there typically were six potential deletion events, each with probability 0.4 of no break, 0.3 of a single break resulting in a terminal deletion, 0.27 of a double break resulting in an interstitial deletion, and 0.03 that a whole chromosome was lost, resulting in a mean of 4.97 breakpoints per mapping individual. For backcross and inbred populations, there typically were 24 such events (thus a maximum of three chiasmata per meiotic bivalent), each with probability 0.65 of no break and 0.35 of a crossover, resulting in a mean of 8.36 breakpoints per mapping individual. In the inbred simulator, the breakpoints were coordinated between the (0,0.5) and (0.5,1) intervals, and the process of meiotic breakage and gamete fusion was repeated for typically nine times to produce an F10 generation, resulting in a mean of 17.03 breakpoints per mapping individual. Initial breakpoints fell at random in physical space for deletion events and recombinational space for recombinational events. The random distribution of the second breakpoint of an interstitial deletion was controlled with a settable variable. There was no attempt to model chiasma interference or a strictly Poisson distribution of deletion breakpoints, in part because different clastogens produce different deletion regimes (Wong et al., 2000) that appear to reflect repair processes and the winding of DNA in the chromosome. For segregation distortion, the presence of 1–3 particular markers could be selected for or against, and recombination between particular markers could be selected against.

A separate program compared estimated with true maps. Performance metrics included (a) the number of marker pairs in the true order that were preserved in either forward or reversed order in the estimated map, (b) the number of markers in the same position in the true and estimated maps, disregarding complete reversal of linkage groups, (c) the number and population of linkage groups in true and estimated maps, and (d) in deletion maps, the number of centromeres with correct flanking arms. Also, the mapper itself provided a histogram of number of marker pairs versus value of recombination fraction upon traversing the map.

4.2 Demonstration of mapping accuracy with consistent data
Table 1 displays the results of simulations with various numbers of uniformly distributed markers and populations likely large enough to produce data consistent with the true map order. The algorithm found the true map order for 3308 of 3311 maps, and transposed a single pair of markers in the remaining three maps. A larger population probably would have rectified the three reversed pairs.


View this table:
[in this window]
[in a new window]
 
Table 1 Recovery of true map order with uniformly spaced markers

 
4.3 Relationship of map accuracy to population size
Separate series of runs were set up for 1005 uniformly distributed markers in deletion, backcross and F10 recombinant inbred populations, to obtain the proportion of the true map recovered versus population size (Fig. 1). Population size began at 40 individuals and incremented upward until the true order was being recovered invariably. Accuracy is shown for matching pairs and exact positions recovered. These metrics gave almost identical results for more than 150 F10 individuals, 250 BC1 individuals and 500 deletion individuals. Below these threshold population sizes, translocations, transpositions and inversions occurred in the map order and depressed the percentage of exact positions recovered. The most frequent rearrangement occurred when an interstitial deletion occurred within a bin of otherwise unresolved markers. In this case, the algorithm usually put the markers within the interstitial deletion into a bin of their own next to the bin where the markers actually resided. Although the plotted curves for either metric expectedly would reach 0 for a population of 0 individuals, and reached 1 for large populations, no arctangent function of the form y = 2x arctan(axb)/{pi} fit well enough to reliably estimate the population size that would give 99% recovery of the true order. All fits to the rational function y = xb/(ab + xb) were worse, because of its sigmoidal character.



View larger version (20K):
[in this window]
[in a new window]
 
Fig. 1 Effect of population size on map accuracy with 1005 uniformly spaced loci and 8 chromosomes under standard conditions (see text). Curves are in three pairs, F10 inbred (top, treated as 2010 markers in 16 linkage groups), BC1 (middle) and deletion (bottom). Within each pair, the thin curve is a fraction of correct neighbors, and the thick curve is fraction of the map that matched the true marker order.

 
Series of runs were also performed for 1005 randomly distributed markers in deletion, backcross and F10 recombinant inbred populations, to plot matching-pair recovery versus population size (Fig. 2). As expected, the curves were lower than those for uniformly distributed markers, and the shape was qualitatively flatter as the curve approached 1, reflecting the very close proximity of some markers.



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 2 Effect of population size on map accuracy with 1005 randomly spaced loci and 8 chromosomes under standard conditions (see text). Thin solid curve, F10 inbred population (as 2010 markers in 16 linkage groups); dotted curve, BC1; thick solid curve, deletion.

 
Of particular interest was the scaling of the population size needed to obtain a fixed percentage accurate recovery of the true marker order, as this scaling relationship determines the resource requirements of large mapping projects. It was investigated for 99% accurate recovery of marker pairs, using sets of runs with incrementing population sizes whose results bracketed 99% recovery. Because the curves in Figures 1 and 2 are nearly straight at this level, linear regression was performed of percent recovery versus population size, yielding the estimated population size needed for exactly 99% accuracy. These estimated populations were plotted against number of uniformly or randomly distributed markers (Figs 3 and 4). The relationship was close to linear for both marker distributions and all three population types, over a 12-fold range of marker number. Disconcertingly, the necessary populations were 20–22 times larger with randomly spaced markers, as measured by the slopes of regressions versus number of markers.



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 3 Population size needed to recover 99% of true marker pairs in relation to number of uniformly spaced loci under standard conditions (see text). Thin curve, deletion population; thick curve, BC1; dotted curve, F10 inbred. The dotted curve has two codominant markers per locus, grouped into 16 linkage groups, while the solid curves have one marker per locus, in 8 linkage groups.

 


View larger version (18K):
[in this window]
[in a new window]
 
Fig. 4 Population size needed to recover 99% of true marker pairs in relation to the number of randomly spaced loci. Solid regression line with squares, deletion population; long-dashed line in middle with circles, BC1; short-dashed line with triangles, F10 inbred. The inbred case has two codominant markers per locus.

 
4.4 Effect of dominant markers in inbred generations
Dominant markers prevent reliable scoring of recessive markers in inbred populations, because of the inability to distinguish homozygous from heterozygous dominant genotypes. Errors will occur in attempting to map recessive markers in the presence of heterozygotes. The problem can be sidestepped by mapping only the dominant allele, which can be accurately scored. This was confirmed by recovery of the true marker order from several simulated F2 populations with segregating dominant and codominant markers, where the dominant markers had been randomly allocated to homologous linkage groups in the true order. However, the algorithm did not attempt to merge the independently estimated maternal and paternal homologous linkage groups.

4.5 Segregation distortion
Segregation distortion was investigated with individual markers, pairs of markers and a whole class of loci (the centromeres).

4.5.1 Complete selection for or against single markers
For each case, 200 BC1 populations were simulated with 2005 uniformly distributed markers and 3000 individuals. All individuals had the selected marker, which was never deleted, or all individuals lacked the selected marker, which was always deleted. The estimated marker order and map length were unaffected in all 400 populations: 399 populations yielded the true order, and one population had two loci reversed. The only indication of segregation distortion was that the cumulative deletion frequency went to 0 with selection for the affected marker, or to the population size m with selection against it.

4.5.2 Selection against recombination between two markers
In a BC1 population, such selection could take three forms: selection against the non-recurrent allele at both markers, selection for the non-recurrent allele at both markers, or selection against recombinants between the two markers so that either parental combination was allowed. All three forms were simulated in backcross populations with 2005 uniformly distributed markers and 3000 individuals. With absolutely no recombination between the selected markers, the estimated marker order was translocated between the linkage groups or parts of the same linkage group that contained the selected markers. The selected markers were juxtaposed in 200 of 200 maps for each case, although the effect on overall map fidelity was minor if both selected markers were near telomeres in the true map; only four markers were out of position in the least affected map.

Figure 5 shows the effect of incompletely inhibited recombination between two markers. It is a plot of proportion of markers in their true position and proportion of true pairs recovered, versus probability of recombination between two specific markers whose position is random in the true order. Each data point represents 200 populations. One set had 2005 markers and 3000 individuals, and the other set had 201 markers and 300 individuals. There is a sigmoidal transition from frequent translocation of two linkage groups at no recombination, to no effect on estimated marker order above a recombination probability of 0.07 for populations of size 300, and 0.006 for populations of size 3000. Local marker ordering was also disturbed for populations of size 300 when the recombination probability was below 0.07. The intensity of segregation distortion that was necessary to provoke translocated marker order was greater for large populations, and the larger populations could be mapped accurately unless there was nearly complete distortion.



View larger version (22K):
[in this window]
[in a new window]
 
Fig. 5 Response of map accuracy to population size and inhibited recombination between two randomly chosen loci in BC1 populations with uniformly spaced markers and a base chromosome number of 8. Dotted, fraction of correct neighbors; solid, fraction of map that matched the true order. Short curves, 2005 markers in 3000 individuals; long curves, 201 markers in 300 individuals.

 
4.5.3 Selection in favor of recombination between two markers
Backcross populations were simulated with 2005 uniformly distributed markers and 3000 individuals, all of which had experienced recombination between two specified markers. There was no effect on marker order or map length if the selected markers were not linked in the true order. Selection on truly linked markers expanded the map between them, ordinarily without affecting marker order. The marker order was translocated only if the recombination fraction between the selected markers was on the order of that between unlinked markers, e.g. if the selected markers were adjacent or separated by only one other marker in the true order. In this case, the map was cleaved between the two or three affected markers, each of which went to a different linkage group, and two or three pairs of legitimate telomeres were joined to produce the base number of linkage groups. Had joins been stopped at a higher recombination fraction, the maps would have had two or three additional linkage groups but undisturbed marker order within them.

4.5.4 Deletion population stripped of monosomic individuals
In 300 simulated monoploid populations, each with 8 chromosomes, 2005 loci and 6000 individuals, the single-hit and double-hit threshold values were set so that any given hit (out of the six hits possible) had a 30% chance to produce a terminal deletion, a 30% chance to produce an interstitial deletion and a 40% chance of not deleting anything. Thus the centromeres were never deleted, and the populations were totally devoid of monosomes. The correct arms were associated in 6.67% of the linkage groups, exactly matching the 1/15 expectation for random assortment of 16 arms. Six population series, differing in size (1000, 2000 or 3000 individuals) and base chromosome number (8 or 12), were simulated to determine more quantitatively the effect of low monosome frequencies on correct arm association (Figs 6 and 7). For a base chromosome number of 8, a monosome probability of 0.011 or more per deletion event guaranteed correct association of all arms in a population of 1000 individuals. The necessary monosome probability decreased to 0.008 with 2000 individuals and 0.006 with 3000 individuals. For a base chromosome number of 12, these probabilities respectively increased to 0.016, 0.011 and 0.010. Thus population size, base chromosome number and probability of centromeric deletion in each event, interact to determine the likelihood that all monosomes occur in the mapping population.



View larger version (16K):
[in this window]
[in a new window]
 
Fig. 6 Response of arm association to probability of a monosome per deletion event for 1005 uniformly spaced markers and base number=8, at three different deletion-population sizes: 1000 (thick solid), 2000 (thin solid) and 3000 (dotted).

 


View larger version (17K):
[in this window]
[in a new window]
 
Fig. 7 Response of arm association to probability of a monosome per deletion event for 1005 uniformly spaced markers and base number=12, at three different deletion-population sizes: 1000 (thick solid), 2000 (thin solid) and 3000 (dotted).

 
There is a probability, designated incompletemono in the deletion simulator, that one or more arm segments are translocated onto other chromosomes when the centromere is lost, and thus that some ‘monosomes’ retain markers from the missing chromosome. To investigate if such retention affects the correct joining of chromosome arms, two series of deletion populations were simulated, with 1005 markers, a 0.004 probability of a monosome resulting per each of six deletion events, 1000 or 3000 individuals, and a range of incompletemono from 0 to 0.6. There were 200 populations per value of incompletemono. Linear regression yielded y = 0.838 – 0.012 x incompletemono for populations with 1000 individuals, and y = 0.939 + 0.010 x incompletemono for populations with 3000 individuals. Although the larger population increased the percentage correct association of chromosome arms, neither slope was significantly different from 0, and thus retention of ‘missing-arm’ markers did not affect arm association on the estimated map.

4.6 Effect of missing data
Randomly missing data were simulated by first preparing a complete file of scores, then replacing a percentage of them with null values. For the majority of tests, which employed the version of the program intended for complete data, a null value for any particular marker in any individual was then randomly set to 0 or 1, gated by the probability of 0 or 1 in the available data for that marker over the whole population. (The variant for missing data was applied to a limited sample of trials with missing data, discussed at the end of this section.) The mapper then ran on the amended data file. Figure 8 displays the results for a BC1 with 1005 markers, 1500 individuals and 200 populations per probability of missing data. The curves for preserved marker pairs and exactly matching marker position began to diverge at a probability of 0.04, as translocations began to appear among linkage groups. The divergence plateaued at a probability of 0.11, and maintained this value through the end of the simulation at a probability of 0.20. Simultaneously, the accuracy of local marker ordering deteriorated steadily from a probability of 0.01 to 0.20 of missing data. Figure 9 displays similar results, shifted to the right, for an F10 inbred. Local ordering began to deteriorate at a probability of 0.01, and the exact matches and matching pairs noticeably diverged from 0.09 upward, approaching saturation at 0.20. Figure 10 displays the results for a deletion population. The local deterioration was more severe at low probabilities of missing data, and translocations appeared sooner (0.05), but the number of translocations increased less rapidly and did not approach saturation within the tested interval.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 8 Effect of missing data on accuracy of a BC1 map for 1005 uniformly spaced markers, 1500 individuals and a base number of 8. Thin upper curve, fraction of correct neighbors; thick curve, fraction of map that exactly matched the true marker positions.

 


View larger version (14K):
[in this window]
[in a new window]
 
Fig. 9 Effect of missing data on accuracy of an F10 inbred map for 1005 uniformly spaced loci (2010 codominant markers) in 1200 individuals with a base chromosome number of 16. Thin upper curve, fraction of correct neighbors; thick curve, fraction of map that exactly matched the true marker positions.

 


View larger version (14K):
[in this window]
[in a new window]
 
Fig. 10 Effect of missing data on accuracy of a deletion map for 1005 uniformly spaced markers in 3000 individuals with a base chromosome number of 8. Thin upper curve, fraction of correct neighbors; thick curve, fraction of map that exactly matched the true marker positions.

 
Figure 11 demonstrates that increased BC1 population size could compensate for missing data, which were held constant at 0.08 while the population increased from 1500 to 9400 individuals. Here only one population per population size was plotted, but the matching-pairs and exact-location metrics converged with increasing population size, and accuracy was above 97% in both measures from 5700 individuals upward. However, translocations or transpositions occurred occasionally for sizes up to 9200 individuals.



View larger version (22K):
[in this window]
[in a new window]
 
Fig. 11 Restoration of map accuracy for 1005 uniformly spaced markers and 8 chromosomes with increasing BC1 population size in the presence of 8.0% missing data. Thin upper curve, fraction of correct neighbors; thick curve, fraction of map that exactly matched the true marker positions.

 
The situation was much better with the variant algorithm in Section 3.1. Table 2 presents results with varying percentages of missing observations for 100 simulated BC1 populations of 1500 individuals and 1005 markers. With 20% randomly missing observations, the variant algorithm recovered a mean of 99.2% (range: 98.19–100.00%) of the marker pairs from the true order, versus all the marker pairs in the absence of missing data. Furthermore, there were only five markers more than one place out of position versus the true order, over all 100 populations.


View this table:
[in this window]
[in a new window]
 
Table 2. Influence of randomly missing data on results with the variant algorithm in Section 3.1, given uniformly spaced markers

 
4.7 Effect of random typing errors
Typing errors were simulated much like missing data: a complete data file was generated, and then random markers were converted 0 to 1 or 1 to 0 in each individual, totaling to a settable percentage of incorrect scoring over the entire data file. The probabilities for false positive (0 to 1) and false negative (1 to 0) data were independent, but for this study they were kept equal. Figure 12 displays the sensitivity of a deletion population to random typing errors. There were 1005 uniformly spaced markers and 3000 individuals. From 0 to 10% of the data were affected, and there were 100 populations per value of typing error. Typing errors severely reduced accuracy, which was already below 99% with only a 0.005 probability of typing error. Translocations appeared at 0.015 typing error and reached a maximum at 0.04. Beyond that point, the rate of deterioration decreased, because the map was running out of correct sequence to be degraded.



View larger version (15K):
[in this window]
[in a new window]
 
Fig. 12 Severe effect of typing errors on a deletion map with 1005 uniformly spaced markers, 8 chromosomes and 3000 individuals. Thin upper curve, fraction of correct neighbors; thick curve, fraction of map that exactly matched the true marker positions.

 
4.8 Run time in relation to input size
There are two inputs that vary in size: the number of markers and the size of the mapping population. The algorithm has three main phases that can be evaluated for response to these sizes: summing recombination fractions, ordering the markers and correcting errors. The summing of recombination fractions is clearly related to the population size and to the square of the number of markers, since all two-by-two combinations of markers are evaluated for all individuals in the mapping population. The ordering phase evaluates recombination fractions for all combinations of markers for up to the size of the mapping population in the worst case, and for each of these combinations it looks through a growing number of markers to prevent cycles. Therefore, the ordering phase is O(mn3).

The asymptotic run time was never approached in practice, since the mean run time was dominated by the O(mn2) accumulation of recombination fractions. The scaling of the three phases was investigated in more detail in a series of 100 runs with 3000 BC1 mapping individuals and varying numbers of markers from n = 605 to n = 5555 (Fig. 13). For any pair of marker numbers, it was possible to evaluate the coefficients a and b in the expressions and , where c was varied manually, since there were enough local minima to thwart optimization by bisection or the downhill simplex method (Press et al., 1992). Let r = n2/n1 for n2 > n1. Then a = y1 – (y2y1)/(rc – 1) and . Values of a and b were thereby calculated for all pairs of n1 and n2 where n2 > n1, and the median values of a and b were used to calculate goodness of fit (sum of squared differences) over the 100 values of n. This led to a = –40.073, b = 0.0001081 and c = 1.983 for the accumulation phase, a = –0.68925, b = 0.005983 and c = 1.025 for the ordering phase, and a = 0.02576, b = 2.5746 x 10–11 and c = 2.52 for the correction phase, which was of the presumptive type. The negative intercepts for the first two phases indicate run time optimization on the machine, although it is likely that the ordering phase was dominated by the work performed in joining markers nearly n times, rather than looking for markers to join. The mean time for presumptive correction was essentially independent of n, since b was very close to zero. Transactional correction, whether stringent or relaxed, was also brief.



View larger version (13K):
[in this window]
[in a new window]
 
Fig. 13 Run time in relation to number of uniformly spaced markers for BC1 populations with 3000 individuals and 8 chromosomes. Thin upper curve, time to accumulate recombination fractions. Thick curve near x-axis, time to order the markers. The correction phase (with presumptive correction) is too short to appear on this scale.

 
Correction by simulated annealing was far slower. The necessary time depended upon the annealing schedule and the number of eligible ends. In mapping 2006 wheat markers (in the next section), the data acquisition took 19.25 s, the initial ordering of all markers took 10.68 s and the correction took 6472.93 s, even though only 248 actual or potential ends were eligible for breaks and joins.

4.9 A real example
The algorithm in Section 3.1 was applied to segregation for 1003 markers in 114 F7 individuals derived from synthetic W7984 and cultivar ‘Opata85’ in hexaploid (2n = 6x = 42 chromosomes) bread wheat, T.aestivum L. The published map consists of 21 linkage groups, in seven sets of three homoeologues (van Deynze et al., 1995; Nelson et al., 1995a,b,c; Marino et al., 1996; Mingeot and Jacquemin, 1999). It appears in its entirety, with group 3 amended by C. Bermudez and M.E. Sorrells, on the GrainGenes web site (http://wheat.pw.usda.gov/ggpages/maps.shtml). It was obtained semiautomatically with MapMaker 2.0, and membership of various framework markers in homoeologous groups was confirmed with nullitetrasomic and double ditelosomic cytogenetic stocks in cultivar ‘Chinese Spring’.

Although there were 114 individuals in the population, the effective size of the population was considerably smaller. The mean number of valid data points per marker was 86.5, and 366 loci (732 markers) were represented by fewer than 60 data points, the fewest being 43. The raw segregation data were downloaded from GrainGenes and reformatted to lists of markers by individual. For homoeologous group 3, data were removed for individual 61, so that its data covered the same individuals as did the other homoeologous groups. The parental alleles were treated as 2006 distinct markers that could be deleted from a monoploid genotype having 42 chromosomes. Maps were produced for the entire genome simultaneously, and also separately based on the markers in each published linkage group. All four of the correction schemes of Section 3(d) were tried in an effort to find the shortest map of 42 linkage groups that did not combine markers from both parents in the same linkage group.

The algorithm succeeded locally but failed globally. When applied to all markers together, it shortened the map, but concatenated blocks of markers from different published linkage groups, sometimes included markers from both parents in the same linkage group and left five single-marker linkage groups. The corrective phase with simulated annealing gave the shortest map, with 19550.03 breakpoints versus 23283.94 breakpoints over the entire mapping population for the published map, a decrease of 16.0% (Table S1, supplemental data). Relaxed transactional correction shortened the entire map to 19610.74 breakpoints.

When applied to the published members of individual linkage groups with correction by simulated annealing, the algorithm produced a total map length of 20160.93 breakpoints, 13.4% shorter than the published map (Table S2, supplemental data), and individual linkage groups shrank by 1.4% (6D) to 26.1% (4B). Marker order frequently differed between the two parents (Fig. 14). Although translocations are always possible, the high frequency of differing order suggests that the algorithm was making erroneous joins based on vagaries in the segregation of individual markers. When the maximum allowed gap within a linkage group equaled the 10th percentile of distances upon independent segregation, chromosomes 2D, 6B and 7B each had three linkage groups instead of the two expected, and 4A had four equally large linkage groups. The extra linkage group contained a single rogue marker for 6B and 7B, and three markers for 2D. Also, markers split unevenly into parental linkage groups for chromosomes 3A (88 markers to 70 markers), 3B (108:64), and 7A (59:37), because the largest gap within one parental linkage group exceeded the lowest recombination fraction between the ends of the linkage groups.



View larger version (23K):
[in this window]
[in a new window]
 
Fig. 14 Plot of marker locations for each linkage group of wheat synthetic W7984 versus its homolog in ‘Opata85’ upon accepting published marker assignments to linkage group. Each chromosome is designated at the lower right corner of its plot.

 
With correction by simulated annealing for all markers, the algorithm produced 104 concatenations of marker blocks from different published linkage groups. In four instances, markers were joined at zero distance (i.e. segregated identically) from different published linkage groups, possibly because of errors in maintaining or propagating the segregation data to GrainGenes: Xfba276-2A with Xfba276-5B, Xmwg837-1B.1 with Xmwg837-1D, Xbcd1124-1B with Xfba382-7A, and Xfbb024-3B with Xcdo1090-2A. In five other instances, the algorithm closely associated markers that had been published in different linkage groups and did not segregate identically, indicating possible errors in the published map: Xcdo1090-5A was 8.8 breakpoints from Xfbb222-6D; Xgwm108-3B was 15.2 breakpoints from Xfba8-7D, which in turn was 8.9 breakpoints from Xfba253-4A; and Xfbb293-3A.2 was 6.3 breakpoints from Xgwm247-3B, which was 12.7 breakpoints from Xgwm391-3A. Most or all of the remaining concatenations represent errors by the algorithm where gaps within linkage groups exceeded gaps between linkage groups. These gaps, ranging from 22 to 40 breakpoints, were below the recombination fraction (40 breakpoints) expected for the 10th percentile of independent marker association among 114 individuals (i.e. 90% of independently assorting markers would have been more than 40 breakpoints apart). They indicate a clustering of markers in recombinational space.


    5 DISCUSSION
 TOP
 Abstract
 1 INTRODUCTION
 2 DELETION AS A...
 3 THE NEAREST-NEIGHBORING-ENDS...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 
Many genetic mapping algorithms explicitly calculate a map length for each of a series of permuted marker orders (Ben-Dor et al., 2000; Wu et al., 2003). The marker order is optimized to minimize map length, for example, by means of simulated annealing. The nearest-neighboring-ends algorithm does not depend upon an entire map length, does not generate a succession of complete maps and does not build intermediary framework maps based on a subset of the markers.

The nearest-neighboring-ends algorithm is a greedy algorithm that can order tens of thousands of markers perfectly under favorable conditions, but its success under realistic conditions is much less impressive. The problem is data consistency, not complexity. The success of a greedy algorithm means that the genetic mapping problem is not NP-hard, and that it is not necessary to use methods from the Traveling Salesman Problem to solve it, at least when its data are consistent with a single map order. The success of a greedy algorithm also confirms Theorem 1 of Ben-Dor and Chor (1997) which asserts that Kruskal's algorithm will recover the correct marker order if the graph (array of recombination fractions) is consistent, which they defined as having all edges longer than any edges that they span.

The nearest-neighboring-ends algorithm is a variant of Kruskal's minimum spanning tree algorithm (Kruskal, 1956; Cormen et al., 2001) because it (a) begins with a ‘forest’ of single-marker paths, (b) always chooses a minimum-weight, not-yet-examined edge to add to the path, and (c) checks that any proposed edge does not complete a cycle, i.e. does not join vertices that already belong to the same path. It is asymptotically less efficient than the form given by Cormen et al. (2001) because it does not first sort the edge lengths and instead makes a usually limited number of passes over the entire graph. Its use of a book-keeping array with two entries per vertex, and its path-reversal mechanism to ensure compatible joins, are also not standard in Kruskal's algorithm. Ben-Dor and Chor (1997) essayed a Kruskal-like algorithm for radiation hybrid mapping of individual linkage groups with up to 500 markers, and found that it was more likely to find the correct marker order than a Prim-like algorithm. Otherwise, Kruskal's algorithm apparently has not been applied previously to genetic mapping problems.

The nearest-neighboring-ends algorithm is similar to seriation (Buetow and Chakravarti, 1987) in that it uses an n x n matrix of interlocus distances, which in their case are estimates based on lod score. Seriation uses a second n x n matrix to build n separate maps, each of which is nucleated on a different marker pair and extended by attaching nearest neighbors to each end, somewhat as in Prim's (Cormen et al., 2001) minimum spanning tree algorithm. Then a consensus map is extracted from the individual maps. Seriation is fast (Wu et al., 2003), but apparently it has not been benchmarked on the same scale of simulations as have been presented here.

The simulations of randomly spaced markers (Fig. 4) suggest that recombinant inbred designs require 6–8 times as many individuals as markers to get a reasonably reliable marker order, and backcross designs require twice that again. Such populations are 1–2 orders of magnitude larger than have generally been used to map crop plants. The critical factors are the numbers of breakpoints per population and the randomness of the marker distribution in the recombinational space. Even with large mapping populations and uniformly spaced markers, the intermarker distances estimated with the nearest-neighboring-end algorithm vary greatly. The population size required for accurate estimation of distances far exceeds that required for accurate ordering.

The effect of segregation distortion depends upon the ranking of its depression of recombination fraction with the typical recombination fraction that would have occurred without it. Large, dense maps are unaffected by moderate amounts of segregation distortion, and selection must affect more than one locus if there is to be any perturbation of marker order. Sparse, gapped maps appear to be much more sensitive to concerted segregation of selected markers, because the linkage groups are less distinct.

The linear relationship of necessary population size to number of markers (Fig. 3) contradicts Theorem 3 of Ben-Dor and Chor (1997), which claims a quadratic relationship even for uniformly spaced markers, i.e. that the population size for a given accuracy is related to the square of the number of markers. To resolve a map of n markers, there must be at least one breakpoint in each of n – 1 bins, representing the intervals between n markers in the map. If the markers are uniformly spaced, the bins are equally long, and a uniform variate has equal probability of landing in any of them. The distribution of landings for m variates in n – 1 bins has (n – 1)m terms, and a fully resolved map is possible only if each bin has a non-zero value.

The algorithm allows the parents of inbred populations to differ in marker order, so long as sufficient homology remains for recombination to occur. This can be used to investigate the frequency of transposition of individual markers, versus the frequency of chromosome structural rearrangements of groups of markers, between the typically distantly related parents of a mapping population. At the same time, order difference can reflect errors in the independent ordering of homologous linkage groups. The quality controls needed to identify and correct such errors remain to be fully developed.

A map, consisting solely of dominant markers in inbred individuals, cannot be resolved into pairs of homologous linkage groups without additional evidence, e.g. FISH with marker-labeled BACs. When codominant markers are present, the homologous linkage groups can be identified, but their precise alignment remains problematic because each marker is mapped independently, and the recombination fraction might differ between the original parents. As observed by Mester et al. (2003a) the relative order of cis-linked dominant markers in inbreds is determined easily on the basis of their mutual recombination fraction, but the ordering of trans-linked dominant markers on a single, integrated linkage group is difficult. It is not clear if a proportional spacing of dominant markers between nearest codominant markers suffices to obtain a most likely order when homologous linkage groups are combined.

The algorithm did not produce a plausible map in the wheat example, although local ordering was often better (shorter) than in the published map. The problem was not only limited population size and missing data, which themselves severely tested the ability of the algorithm to come up with a reasonable map; it was also a gapped distribution of markers in the recombinational space, coupled with unexpectedly low recombination fraction among markers in several different published linkage groups. As the algorithm progressed to higher recombination fractions between markers to be joined, it became more subject to sampling and typing errors in individual markers at the ends of the coalescing linkage groups. Some of these markers differed markedly from their nearest published neighbors in segregation pattern, such that their inclusion would lengthen the map wherever they went. Errors at individual markers had much less effect in the simulations, which had large, dense maps and sampling of marker positions from a uniform recombinational space, such that gaps within linkage groups were generally smaller than gaps between linkage groups. In the wheat example, the gap sizes between published linkage groups frequently overlapped the distribution of gap sizes within linkage groups, a situation undoubtedly made worse by the small population size. Segregation distortion might also have been involved: not strong selection at one or two markers, but weak selection over groups of markers distributed among several linkage groups, leading to concerted transmission of mutually exclusive sets of markers and reduced independence of their linkage groups. Our simulations of segregation distortion were too simplistic to evaluate this possibility.

There are two possible remedies for this situation, although both would still concatenate different linkage groups whenever any gaps within linkage groups exceeded the least gap between linkage groups in recombination fraction. The experimental remedy is to increase the population size and improve the uniformity of marker distribution, by increasing the number of markers and perhaps by using more than one type of marker to reduce the impact of preferential marker association with centromeres or telomeres (Young et al., 1999). However, it is likely that significant gaps would remain because of recombination within tandemly repeated sequences in nucleolar organizing regions and C-bandable heterochromatin (Cuadrado et al., 1995; Friebe and Gill, 1994), and development of unique markers within such blocks remains problematic. The algorithmic remedy probably is a switch at some point from an individual to a collective measure of proximity among linkage groups, such as the mean, median, or minimum recombination fraction between members of the two groups to be joined. The collective measure would apply to whole linkage groups to identify partners for the next join, and to half linkage groups to choose which ends to join. The strategy most in keeping with Kruskal's algorithm is to limit the eligible members to terminal subsets of markers, and choose the closer of them to represent the entire linkage group in any particular comparison. Such a collective measure requires additional book-keeping, including at least one separate table of collective recombination fractions, which must be updated after each join for the rows and columns of the two linkage groups involved. The effectiveness of this strategy with various interactions among gapped marker distribution, sampling error, typing error and segregation distortion remains to be investigated, as does its effect on run time.


    Acknowledgments
 
We thank Drs Rebecca Doerge, Stephen Goodwin and David Schlipalius for helpful comments and discussions as we developed the current version of the algorithm, Drs David Stelly, H. James Price and Russell Deaton for their support of earlier versions, and two anonymous reviewers for their insightful comments on the manuscript. This work was supported by USDA-ARS CRIS project 3602-22000-013-00D. Although product names are necessary to report on available data, the USDA neither guarantees nor warrants the standard of named products, nor does use of a product name imply approval of that product instead of others that might also be suitable.

Received on April 2, 2004; revised on August 31, 2004; accepted on November 17, 2004

    REFERENCES
 TOP
 Abstract
 1 INTRODUCTION
 2 DELETION AS A...
 3 THE NEAREST-NEIGHBORING-ENDS...
 4 RESULTS
 5 DISCUSSION
 REFERENCES
 

    Ben-Dor, A. and Chor, B. (1997) On constructing radiation hybrid maps. J. Comput. Biol., 4, 517–533[Web of Science][Medline].

    Ben-Dor, A., Chor, B., Pelleg, D. (2000) RHO–Radiation Hybrid Ordering. Genome Res., 10, 365–378[Abstract/Free Full Text].

    Buetow, K.H. and Chakravarti, A. (1987) Multipoint gene mapping using seriation. I. General methods. Amer. J. Hum. Genet., 41, 180–188[Medline].

    Chung, Y.-J., Jonkers, J., Kitson, H., Fiegler, H., Humphray, S., Scott, C., Hunt, S., Yu, Y., Nishijima, I., Velds, A., et al. (2004) A whole-genome mouse BAC microarray with 1-Mb resolution for analysis of DNA copy number changes by array comparative genomic hybridization. Genome Res., 14, 188–196[Abstract/Free Full Text].

    Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C. Introduction to Algorithms, (2001) 2nd edn. , The MIT Press Cambridge, MA.

    Cuadrado, A., Ceoloni, C., Jouve, N. (1995) Variation in highly repetitive DNA composition of heterochromatin in rye studied by fluorescence in situ hybridization. Genome, 38, 1061–1069.

    Friebe, B. and Gill, B.S. (1994) C-band polymorphism and structural rearrangements detected in common wheat (Triticum aestivum). Euphytica, 78, 1–5.

    Jansen, J., de Jong, A.G., van Ooijen, J.W. (2001) Constructing dense linkage maps. Theor. Appl. Genet., 102, 1113–1122[CrossRef].

    Kruskal, J.B. (1956) On the shortest spanning subtree of a graph and the traveling salesman problem. Proc. Amer. Math. Soc., 7, 48–50.

    Marino, C.L., Nelson, J.C., Lu, Y.H., Sorrells, M.E., Leroy, P., Tuleen, N.A., Lopes, C.R., Hart, G.E. (1996) Molecular genetic maps of the group 6 chromosomes of hexaploid wheat (Triticum aestivum L. em. Thell.). Genome, 39, 359–366.

    Mester, D.I., Ronin, Y.I., Hu, Y., Peng, J., Nevo, E., Korol, A.B. (2003a) Efficient multipoint mapping: making use of dominant repulsion-phase markers. Theor. Appl. Genet., 107, 1102–1112[CrossRef][Medline].

    Mester, D., Ronin, Y., Minkov, D., Nevo, E., Korol, A. (2003b) Constructing large-scale genetic maps using an evolutionary strategy algorithm. Genetics, 165, 2269–2282[Abstract/Free Full Text].

    Mingeot, D. and Jacquemin, J.M. (1999) Mapping of RFLP probes characterized for their polymorphism on wheat. Theor. Appl. Genet., 98, 1132–1137[CrossRef].

    Nelson, J.C., Sorrells, M.E., van Deynze, A.E., Lu, Y.H., Atkinson, M., Bernard, M., Leroy, P., Faris, J.D., Anderson, J.A. (1995a) Molecular mapping of wheat. Major genes and rearrangements in homeologous groups 4, 5, and 7. Genetics, 141, 721–731[Abstract].

    Nelson, J.C., van Deynze, A.E., Autrique, E., Sorrells, M.E., Lu, Y.H., Merlino, M., Atkinson, M., Leroy, P. (1995b) Molecular mapping of wheat. Homoeologous group 2. Genome, 38, 516–524.

    Nelson, J.C., van Deynze, A.E., Autrique, E., Sorrells, M.E., Lu, Y.H., Negre, S., Bernard, M., Leroy, P. (1995c) Molecular mapping of wheat. Homoeologous group 3. Genome, 38, 525–533.

    Nuwaysir, E.F., Huang, W., Albert, T.J., Singh, J., Nuwaysir, K., Pitas, A., Richmond, T., Gorski, T., Berg, J.P., Ballin, J., et al. (2002) Gene expression analysis using oligonucleotide arrays produced by maskless photolithography. Genome Res., 12, 1749–1755[Abstract/Free Full Text].

    Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P. Numerical Recipes: The Art of Scientific Computing, (1992) 2nd edn , New York Cambridge University Press, pp. 436–447.

    Stickney, H.L., Schmutz, J., Woods, I.G., Holtzer, C.C., Dickson, M.C., Kelly, P.D., Myers, R.M., Talbot, W.S. (2002) Rapid mapping of zebrafish mutations with SNPs and oligonucleotide microarrays. Genome Res., 12, 1929–1934[Abstract/Free Full Text].

    Tsujimoto, H., Yamada, T., Hasegawa, K., Usami, N., Kojima, T., Endo, T.R., Ogihara, Y., Sasakuma, T. (2001) Large-scale selection of lines with deletions in chromosome 1B in wheat and applications for fine deletion mapping. Genome, 44, 501–508[Medline].

    van Deynze, A.E., Dubcovsky, J., Gill, K.S., Nelson, J.C., Sorrells, M.E., Dvorak, J., Gill, B.S., Lagudah, E.S., McCouch, S.R., Appels, R. (1995) Molecular-genetic maps for group 1 chromosomes of Triticeae species and their relation to chromosomes in rice and oat. Genome, 38, 45–59.

    Wong, K.-K., Chang, S., Weiler, S.R., Ganesan, S., Chaudhuri, J., Zhu, C., Artandi, S.E., Rudolph, K.L., Gottlieb, G.J., Chin, L., Alt, F.W., DePinho, R.A. (2000) Telomere dysfunction impairs DNA repair and enhances sensitivity to ionizing radiation. Nat. Genet., 26, 85–88[CrossRef][Web of Science][Medline].

    Wu, J., Jenkins, J., Zhu, J., McCarty, J., Watson, C. (2003) Monte Carlo simulations on marker grouping and ordering. Theor. Appl. Genet., 107, 568–573[CrossRef][Web of Science][Medline].

    Young, W.P., Schupp, J.M., Keim, P. (1999) DNA methylation and AFLP marker distribution in the soybean genome. Theor. Appl. Genet., 99, 785–792[CrossRef].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?


This article has been cited by other articles:


Home page
GeneticsHome page
B. Kusterer, H.-P. Piepho, H. F. Utz, C. C. Schon, J. Muminovic, R. C. Meyer, T. Altmann, and A. E. Melchinger
Heterosis for Biomass-Related Traits in Arabidopsis Investigated by Quantitative Trait Loci Analysis of the Triple Testcross Design With Recombinant Inbred Lines
Genetics, November 1, 2007; 177(3): 1839 - 1850.
[Abstract] [Full Text] [PDF]


This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
21/8/1579    most recent
bti164v1
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrow Search for citing articles in:
ISI Web of Science (3)
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Crane, C. F.
Right arrow Articles by Crane, Y. M.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Crane, C. F.
Right arrow Articles by Crane, Y. M.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?