Analysis of segmental duplications via duplication distance
1Department of Computer Science and 2Center for Computational Molecular Biology, Brown University, Providence, RI, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
Motivation: Segmental duplications are common in mammalian genomes, but their evolutionary origins remain mysterious. A major difficulty in analyzing segmental duplications is that many duplications are complex mosaics of fragments of numerous other segmental duplications.
Results: We introduce a novel measure called duplication distance that describes the minimum number of duplications necessary to create a target string by repeated insertions of fragments of a source string.We derive an efficient algorithm to compute duplication distance, and we use the algorithm to analyze segmental duplications in the human genome. Our analysis reveals possible ancestral relationships between segmental duplications including numerous examples of duplications that contain multiple, nested insertions of fragments from one or more other duplications. Using duplication distance, we also identify a small number of segmental duplications that appear to have seeded many other duplications in the genome, lending support to a two-step model of segmental duplication in the genome.
Availability: Software for computing duplication distance is available upon request.
Contact: clkahn{at}cs.brown.edu; braphael{at}cs.brown.edu.
| 1 INTRODUCTION |
|---|
|
|
|---|
Segmental duplications, or low-copy repeats, are relatively long, nearly identical sequences found in many copies in a genome, but distinct from high-copy (retro) transposon sequences. Segmental duplications are common in mammalian genomes, and current estimates suggest that 5% of the human genome consists of segmental duplications >1kb in length with
90% sequence identity (Bailey and Eichler, 2006). Many of these segmental duplications are recent alterations in the human genome, occuring after the separation of Old World monkey and great ape lineages (Horvath et al., 2005), and it has been proposed that some segmental duplications contain genes under strong positive selection (Ciccarelli et al., 2005). Moreover, segmental duplications are implicated in disease-causing rearrangements and copy-number polymorphisms in human populations (Bailey and Eichler, 2006). The exact mechanism by which these large segments are duplicated and transposed to other genomic loci is not well understood. Studies of the human genome sequence revealed the surprising fact that most segmental duplications are found within complex mosaics of fragments of other duplicated sequences (Bailey et al., 2002). This complicated organization makes the determination of the evolutionary history of segmental duplications a challenging problem. One model that has been proposed to explain the observed mosaic organization (reviewed in Bailey and Eichler, 2006) is a two-step model. In the first step, primary duplications are created by the duplication and aggregation of multiple sequences at a single genomic locus. These primary duplication then seed multiple secondary duplications at other loci (Fig. 1). A similar two-step model involving extrachromosomal intermediates explains similar complex mosaics of duplications that are observed in cancer genomes (Raphael and Pevzner, 2004; Windle and Wahl, 1992).
|
The study of the duplication process in human evolution and cancer motivates the problem of defining an appropriate measure of similarity between two strings containing duplicated segments. Surprisingly, there has been little work on this problem. Duplications have been studied in the context of whole-genome duplication (Alekseyev and Pevzner, 2007; El-Mabrouk and Sankoff, 2003), computing reversal distance in the presence of gene families (Chen et al., 2005; El-Mabrouk, 2002), and in the reconciliation of gene trees and species trees in the presence of orthologous and paralogous genes (Arvestad et al., 2004). However, none of these methods are appropriate for the study of the organization of segmental duplications. Here we introduce a new measure, duplication distance, that is defined as the most parsimonious sequence of duplication events that constructs a target string through repeated insertions of substrings of a source string. This measure allows for insertions at arbitrary positions in the target string including the insertion of duplicated segments inside the previously inserted substrings. We describe a dynamic programming algorithm to compute the duplication distance between two strings in polynomial time.
We use the duplication distance to analyze segmental duplications in the human genome. We identify numerous examples of segmental duplications that are formed from nested insertions of fragments from other segmental duplications. Moreover, we identify a small subset of segmental duplications that seed many other duplications in the genome, lending support to the two-step model of segmental duplication. Finally, we identify segments containing duplicated material that are derived from multiple primary duplications.
| 2 METHODS |
|---|
|
|
|---|
We describe our novel duplication distance model of segmental duplication in Section 2.1. Then, in Section 2.2, we give a polynomial-time algorithm for computing duplication distance between signed strings.
2.1 Duplication distance
The duplication distance from a source string X to a target string Y is a measure of similarity that attempts to describe the most parsimonious duplication scenario in which segments of X are duplicated and arranged to build Y. Duplication distance, then, is an asymmetric distance measure defined between two signed strings X and Y, where the set of characters appearing in X contains the set of characters appearing in Y.
DEFINITION 2.1.
A duplicate operation,s,t,p(X), copies a substring xs...xt of a source string X and pastes it into another string Z at position p. Specifically, if X=x1...xm and Z=z1...zn, then Z
![]()
s,t,p(X)=z1...zp–1xs...xt zp...zn. (Fig. 2).
The duplication distance from a source string X to a target string Y, denoted DX(Y), is the minimum length of a sequence of duplicate operations needed to build the string Y. Note that DX(Y)
|Y|, where |Y| denotes the length of Y.
In the next section, we describe an efficient dynamic programming algorithm for computing duplication distance in the case where X is non-ambiguous, meaning it has at most one copy of any character, yielding the following theorem.
|
THEOREM 2.2
Let X and Y be signed strings such that X is non-ambiguous, and all the characters in Y are present in X. Letdenote the empty string. A minimum-length sequence
=(
s1,t1,p1(X),
s2,t2,p2(X),...,
sr,tr,pr(X)) of duplicate operations such that
s1,t1,p1(X)
s2,t2,p2(X)
...
sr,tr,pr(X)=Y can be computed in O(|Y|4) time.
2.2 Algorithm to compute duplication distance
In our presentation, we shall distinguish between substrings and subsequences of a given string; a substring is a contiguous sequence of characters within the string whereas the characters of a subsequence need not be contiguous. Note that, given a string Z, a substring of Z is also a subsequence of Z, but a subsequence need not be a substring. First, we define some terminology.
DEFINITION 2.3.
Given two subsequencesand
of a string Z, the subsequence
is inside
in Z if the beginning of
occurs before
in Z and the end of
occurs after
in Z and no characters of
occur in between successive characters of
.
For example, in the string Z=abcdef, the subsequence
=bd is inside the subsequence
=aef.
DEFINITION 2.4.
Two subsequencesand
of a string Z overlap if either (i)
and
share a position in Z, or (ii)
is not inside
but there exists some substring of
that is inside
, or vice versa.
We derive a recurrence for computing duplication distance based on the following observation. Consider a source string X, a target string Y and a sequence of duplicate operations of the form
si,ti,pi(X) that produce Y. The substrings of X that are duplicated during the construction of Y appear as mutually non-overlapping subsequences of Y. This is because if two substrings a and b are copied from X and inserted into Y in succession, then after a has been inserted into Y, b can either be inserted before a, after a or inside a in Y. This argument can be extended for any sequence of duplicate operations; as each segment is inserted into Y in succession, we maintain the invariant that the inserted segments remain mutually non-overlapping in Y.
We formalize this property in the following definition.
DEFINITION 2.5.
(FEASIBLE SET OF SUBSEQUENCES) Let X and Y be strings. A feasible set=(
1,...,
k} is a set of subsequences of Y such that: (i) For all i
j,
i and
j are non-overlapping; (ii) The union of
1,...,
k cover the positions of Y.
Every sequence of duplicate operations that builds Y from the empty string corresponds to a feasible set whose cardinality is equal to the length of the sequence of operations. Moreover, given a feasible set of subsequences of Y whose cardinality is k, we can construct the corresponding sequence of k duplicate operations that builds Y from X in which each member of the feasible set is duplicated in one operation using an insertion-sort-like algorithm in O(k2) time. Therefore, a minimum-cardinality feasible set of subsequences of Y will correspond to the minimum-length sequence of duplicate operations.
We now present an efficient dynamic programming algorithm to compute the cardinality of the smallest feasible set of subsequences of Y. It is straightforward to augment the algorithm also to compute the feasible set itself.
Consider a source string X and a target string Y, and let Ys,t denote the substring of Y beginning at index s and ending at index t. Let d(Ys,t) denote the value of DX(Ys,t), and let d(Ys,t)=0 for indices s>t. Suppose, by induction, we have computed the value of d(Yi,j) for every substring of Ys,t satisfying s
i
j
t. If we compute the minimum-cardinality feasible set of subsequences of Ys,t, there are two possibilities to consider: (i) the characters at indices s and t in Y belong to different subsequences in the feasible set, or (ii) the characters at indices s and t in Y belong to the same subsequence in the feasible set (Fig. 3).
|
In the first case, we can find the optimal partition of the string Ys,t into two substrings since we have already computed the value of d(Yi,j) for all substrings of Ys,t. In the second case, we know that one element of the minimum-cardinality feasible set is a subsequence whose first character is ys and whose last character is yt. Because we assume that X is non-ambiguous (i.e. X has at most one copy of any character), there is at most one substring of X that begins with ys and ends with yt and that corresponds to a subsequence of Ys,t. We denote this substring by Xs,t. Note that this substring may be undefined if the substring of X starting at character ys and ending at character yt is not a subsequence of Ys,t. Given the string Xs,t, there is a corresponding subsequence
of Ys,t such that
=Xs,t. If Xs,t is duplicated in one operation and
is an element of the minimum-cardinality feasible set, then d(Ys,t) will equal one plus the duplication distance from X to each of the maximal strings in between successive characters of
, defined as follows.
DEFINITION2.6.
Given a string Z and a subsequenceof Z, the internal strings of
are the maximal substrings of Z that are inside
(i.e. that appear between successive characters of
). Let Z\
denote the set of internal strings of
.
Recall that because Ys,t may contain multiple copies of some characters, there may be multiple placements of the string Xs,t in Ys,t, i.e. instances of the same character sequence occurring at different indices in Ys,t (Fig. 4). Therefore, we compute, over all placements of Xs,t in Ys,t, the sum of the value of d(Yi,j) for each internal string of a placement of Xs,t.
|
We define the set
(Xs,t) as the set of subsequences of Ys,t corresponding to placements of Xs,t, and we express the full recurrence as follows.
THEOREM2.7.
Given a string Ys,t and indices st, the value d(Ys,t) satisfies the following recurrence.
where
(1) and
(2) Note that in computing the recurrence in Equation (1), the value as,t can be computed in time linear in the size of Ys,t. The value bs,t can be computed efficiently via a second recurrence that computes the optimal placement
(Xs,t) in O(|Ys,t|2) time. We omit the details. Since the running time to compute d(Ys,t) is dominated by the time to compute bs,t, and the value bs,t is computed once for each of the O(|Y|2) substrings of Y, the total running time to compute the cardinality of the smallest feasible set of subsequences of Y is O(|Y|4) time.
Note that our algorithm also works when we allow duplicate reversal operations on signed strings, i.e. duplicate operations in which the copied substring of X is inverted before being inserted into Y. In order to allow for duplicate reversals, it suffices to allow the substring Xs,t to be defined in either orientation: that is, either ys occurs before yt in X or –yt occurs before –ys in X.
| 3 RESULTS |
|---|
|
|
|---|
We applied our duplication distance algorithm to analyze the relationships between segmental duplications in the human genome. We used data from Jiang et al. (2007) who identified 4692 short segments, called ancestral duplicons, that are repeated multiple times in the human reference genome (hg17, May 2004), and 437 larger segments, called duplication blocks, that are complex mosaics of duplicons. We enumerated the duplicons according to their location in the genome, and aligned each duplication block to all duplicons using Nucmer (Kurtz et al., 2004) thereby representing each duplication block as a signed string in the alphabet of integers between –4692 and +4692. This process yielded a total of 429 duplication blocks, as 8 blocks could not be aligned to duplicons.
3.1 Pairwise duplication distances
In order to analyze segmental duplications, we defined a normalized duplication distance. Our normalized distance relaxes some of the requirements in Theorem 2.2. In particular, to compute the normalized distance from a source string to a target string, we allow the source string to be ambiguous and we allow for characters in the target string that do not appear in the source string.
For an ordered pair (Bi, Bj) of duplication blocks, we identify all the characters that are in Bj but that do not appear in Bi, denoted {Bj}\{Bi}. We then define a non-ambiguous string Sij by deleting all but one copy of any repeated characters in Bi and concatenating the result with a string composed of the characters of {Bj}\{Bi} alternating with dummy characters. The string Sij is a non-ambiguous subsequence of Bi augmented to contain all the characters in Bj. Thus, Sij and Bj are a suitable source–target string pair between which we can compute a duplication distance.
We define the normalized duplication distance D'(i,j) between Bi and Bj by
|
|
|Bj|, so D'(i,j)
1. If D'(i,j)=1, the number of duplicate operations necessary to build Bj from the source string Sij is equal to the length of Bj, indicating that strings Sij and Bj do not share conserved subsequences of length greater than one. The normalized duplication distance allows us to compare segmental duplications regardless of their size by mitigating the effect that longer target strings will tend to have larger duplication distances than shorter target strings even when both share many subsequences with a source string. Note that because duplication distance is not a symmetric measure, D'(i,j) need not equal D'(j,i).
We found that of the 183 612 pairwise distances computed, 94 543 or
51%, had normalized duplication distances strictly less than one, demonstrating that many segmental duplications are related. The minimum normalized duplication distance value obtained was 0.5 and occurred between source duplication block chr3:75361884-75839295 and target duplication block chr1:219025923-219082537. Figure 5 shows the relationship between these two blocks and identifies several segments of the source block that were copied into the target block in the most parsimonious duplication scenario. Several of the duplicated segments from the source block are non-contiguous in the target block, and instead are inside (Definition 2.3) one another. This pattern of nested insertions was common: among the 94 543 pairwise normalized duplication distances strictly less than one, 4703 (
5%) exhibited nested insertions. Moreover, if we define the closest source block Bi* for each duplication block Bj by i*=argminiD'(i,j), then 43 out of 429 (
10%) of duplication blocks exhibited nested insertions from their closest source block.
|
|
3.2 Relationships between duplication blocks
The two-step model of duplication (Fig. 1) proposes that some duplication blocks are built by successive aggregation of duplicons, and then fragments of these seeding blocks are duplicated to create other duplication blocks. Thus, we expected that seeding blocks would have significant numbers of other duplication blocks at close duplication distance. Using the pairwise distances D' between all ordered pairs of the 429 duplication blocks, we discovered a small cadre of duplication blocks that had many close neighbors using the following approach. First, we identified the distance d0.005 at which exactly 0.5% of the pairwise distances were less or equal to than this value (i.e. 99.5% of the pairwise distances were >d0.005). Next, we identified the duplication blocks Bi which have more than 3.5% of the other 428 duplication blocks within distance d0.005, a 7-fold enrichment of close duplication blocks. Formally, we defined a duplication block Bi to be a seeding block if |{Bj|j
i and D'(i,j)
d0.005}|
0.035(428)
=14. We call the set of duplication blocks {Bj |j
i and D'(i,j)
d0.005} the neighbors of Bi. Thus, a neighbor Bj of a seeding block Bi is significantly closer to Bi—in duplication distance—than Bj is to other duplication blocks on average. Moreover, the duplication scenario in which Bj is constructed from a seeding block Bi by a series of duplicate operations is more parsimonious than the duplication scenario in which Bj is constructed via duplications of individual duplicons.
We represented the relationships between seeding blocks and their respective neighbors in a directed graph G=(V,E). The vertices V of G are the seeding blocks and the neighbors of seeding blocks, and there is a directed edge (Bi,Bj)
E if and only if Bi is a seeding block and Bj is a neighbor of Bi. The graph (Fig. 6) therefore, consists of several star subgraphs, where the center vertex of each star is a seeding block.
The graph reveals complex relationships between seeding blocks and their respective neighbors. In particular, some seeding blocks are neighbors of other seeding blocks. Moreover, some vertices in the graph have in-degree greater than one. These vertices correspond to duplication blocks that are close in duplication distance to multiple seeding blocks.
In some instances, a duplication scenario that involves duplicates from two seeding blocks in the creation of a single target string is more parsimonious (i.e. lower total duplication distance) than a duplication scenario that considers only one seeding block. For example, there exist two seeding blocks that provide parsimonious duplication scenarios for duplication block chr9:67819772-68015022 (Fig. 7). Furthermore, because segments from each of the seeding blocks are nested inside segments from the other seeding block, our analysis suggests the possibility that the target block was seeded by both source blocks contemporaneously.
|
| 4 DISCUSSION |
|---|
|
|
|---|
We introduced a novel measure called duplication distance that describes the minimum number of duplications necessary to create a target string by repeated insertions of fragments of a source string, and described an efficient algorithm for computing duplication distance.
Our analysis of human segmental duplications revealed insights into the relationships between some duplication blocks, but is only a first step in the comprehensive analysis of the evolutionary history of segmental duplications. Jiang et al. (2007) also considered the problem of determining the phylogenetic relationship between segmental duplications. They represented each duplication block by a binary vector indicating the presence or absence of ancestral duplicons and performed hierarchical clustering on these binary vectors. However, this approach ignored the order and orientation of duplicons within each duplication block, and the subsequences of duplicons that are shared between multiple duplication blocks. We expect that duplication distance will provide a more refined phylogenetic analysis. Indeed, we have seen that there are many non-trivial patterns of conserved duplicons between pairs of duplication blocks implying that deriving the ancestral relationships between duplication blocks requires an analysis of their shared subsequences—not just the amount of shared duplicon content. Further improvements in our duplication distance algorithm, such as relaxing the assumptions that the source string is non-ambiguous and the target string is initially empty are possible and will lead to improved representations of the relationships between segmental duplications. Finally, cancer genomes are known to harbor extensive duplications of genomic segments (Albertson et al., 2003) in a very complex architecture (Raphael et al., 2003, 2008; Volik et al., 2006). We expect that duplication distance will prove useful in future studies of the duplication architecture of cancer genomes, studies that are becoming possible by cancer genome sequencing projects now in progress.
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
Funding: C.L.K. is supported by a National Science Foundation Graduate Research Fellowship. B.J.R. is supported by a Career Award at the Scientific Interface from the Burroughs Wellcome Fund.
| REFERENCES |
|---|
|
|
|---|
Albertson DG, et al. Chromosome aberrations in solid tumors. Nat. Genet (2003) 34:369–376.[CrossRef][Web of Science][Medline]
Alekseyev MA, Pevzner PA. Whole genome duplications and contracted breakpoint graphs. SIAM J. Comput (2007) 36:1748–1763.[CrossRef]
Arvestad L, et al. Gene tree reconstruction and orthology analysis based on an integrated model for duplications and sequence evolution. In: RECOMB04: Proceedings of the Eighth Annual International Conference on Resaerch in Computational Molecular Biology. (2004) New York, USA: ACM. 326–335.
Bailey J, Eichler E. Primate segmental duplications: crucibles of evolution, diversity and disease. Nat. Rev. Genet (2006) 7:552–564.[Web of Science][Medline]
Bailey J, et al. Recent segmental duplications in the human genome. Science (2002) 297:1003–1007.
Chen X, et al. Assignment of orthologous genes via genome rearrangement. IEEE/ACM Trans. Comput. Biol. Bioinform (2005) 2:302–315.[CrossRef]
Ciccarelli F, et al. Complex genomic rearrangements lead to novel primate gene function. Genome Res (2005) 15:343–351. i137.
El-Mabrouk N. Reconstructing an ancestral genome using minimum segmentsduplications and reversals. J. Comput. Syst. Sci (2002) 65:442–464.[CrossRef]
El-Mabrouk N, Sankoff D. The reconstruction of doubled genomes. SIAM J. Comput (2003) 32:754–792.[CrossRef]
Horvath J, et al. Punctuated duplication seeding events during the evolution of human chromosome 2p11. Genome Res (2005) 15:914–927.
Jiang Z, et al. Ancestral reconstruction of segmental duplications reveals punctuated cores of human genome evolution. Nature Genet (2007) 39:1361–1368.[CrossRef][Web of Science][Medline]
Kurtz S, et al. Versatile and open software for comparing large genomes. Genome Biol (2004) 5:120.
Raphael BJ, et al. Reconstructing tumor genome architectures. Bioinformatics (2003) 19(Suppl. 2):II171–II1.
Raphael BJ, Pevzner PA. Reconstructing tumor amplisomes. Bioinformatics (2004) 20(Suppl. 1):273.
Raphael B, et al. Asequence-based survey of the complex structural organization of tumor genomes. Genome Biol (2008) 9:R59.[CrossRef][Medline]
Volik S, et al. Decoding the fine-scale structure of a breast cancer genome and transcriptome. Genome Res (2006) 16:394–404.
Windle BE, Wahl GM. Molecular dissection of mammalian gene amplification: new mechanistic insights revealed by analyses of very early events. Mutat. Res (1992) 276:199–224. i138.[CrossRef][Web of Science][Medline]
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||


denote the empty string. A minimum-length sequence
=(
of a string Z, the subsequence
=(




