Poisson adjacency distributions in genome comparison: multichromosomal, circular, signed and unsigned cases
Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
|
|
|---|
The number of common adjacencies of genetic markers, as a measure of the similarity of two genomes, has been widely used as indicator of evolutionary relatedness and as the basis for inferring phylogenetic relationships. Its probability distribution enables statistical tests in detecting whether significant evolutionary signal remains in the marker order. In this article, we derive the probability distributions of the number of adjacencies for a number of types of genome—signed or unsigned, circular or linear, single-chromosome or multichromosomal. Generating functions are found for singlechromosome cases, from which exact counts can be calculated. Probability approaches are adopted for multichromosomal cases, where we find the exact values for expectations and variances. In both cases, the limiting distributions are derived in term of numbers of adjacencies. For all unsigned cases, the limiting distribution is Poisson with parameter 2; for all signed cases, the limiting distribution is Poisson with parameter
.
Contact: wxu060{at}uottawa.ca
| 1 INTRODUCTION |
|---|
|
|
|---|
The linear order of markers, such as genes or other sites, on chromosomes is a characteristic structural feature of the genome shared by all individuals in a species. During evolution, various rearrangement events disrupt this order by moving or inverting segments of chromosomes. At the time of the event, the order of the markers within a segment is either conserved or inverted and the order of markers outside the segment is conserved. A breakpoint may be created between two markers that have hitherto been adjacent in the order if one is inside the segment while the other remains outside the segment affected.
The number of breakpoints b in the comparison of two genomes is the oldest and simplest metric representing the evolutionary divergence of species through chromosomal rearrangements (Nadeau and Taylor, 1984; Sankoff and Blanchette, 1998).
Chromosomes are generally circular in prokaryotes, mitochondria and chloroplasts, and a single chromosome contains the entire genome, although sometimes smaller circles, plasmids, contain some of this information. Eukaryotic nuclear genomes are partitioned among linear chromosomes, from a few to a few dozen in number.
Available data on chromosomes may or may not identify the DNA strand on which the markers are found (called signed or unsigned data, respectively).
Let a be the number of pairs of markers adjacent in two genomes with the same markers 1,2,...,n and the same number of linear chromosomes
l. Then
|
|
In this article, we study the distribution of the number of common adjacencies under the null hypothesis that the n markers are ordered completely randomly on the genomes (N.B. it suffices to randomize just one of the genomes, since relabeling markers can convert one of the genome to a canonical order, e.g. 1,2,...,n, without changing a). For multichromosomal genomes, the number of markers on each chromosome is also random. We study the cases of single or multiple chromosomes, which can be linear or circular, and signed or unsigned. The unsigned single-chromosome case is related to the dinner table problem (Robbins, 1980) and non-attacking kings problem (Abramson and Moser, 1966). G. Tesler has previously derived results for linear, single-chromosome genomes (Tesler, 2005).
In Section 2, we find the generating functions for the case of a single-chromosome genome and approximate the probability distributions for large n. In Section 3, we extend our discussion to multichromosomal genomes and directly derive the exact formulae for the expectation and variance of a as well as an approximation of the probability distribution for large n.
The remainder of this section introduces notation and recalls fundamental theorems used in this article.
1.1 Notation and terminology
In the single-chromosome unsigned case we call the genome R where the markers are ordered from 1 to n the reference genome, i.e. R=1,2,...,n. The random genome G=g1,g2,...,gn is sampled from a uniform distribution on the set of permutations of 1,...,n. For multichromosomal genomes, the beginning and ending marker of each chromosome in the reference genome are given, while in the random genome only the number of chromosomes is given. In the signed case, all the markers in the reference genome R have positive sign, while in the random genome G, a positive or negative sign is assigned at random to each marker independently.
If for some i we have |gi–gi+1|=1 in the unsigned case, or gi+1–gi=1 in the signed case, we say there is an adjacency in common between the two genomes, otherwise there is a breakpoint. In the multichromosomal case, if gi is the last term on a chromosome and/or gi+1 is the first, we identify neither an adjacency nor a breakpoint.
1.2 Generating functions
The ordinary generating function (OGF) is defined as the formal power series:
|
| (1) |
If An,m denotes the number of (random) genomes with n markers and m adjacencies in common with the reference genome, consider the bivariate generating function
|
| (2) |
mAn,m, then the probability
|
| (3) |
|
| (4) |
|
| (5) |

denotes the k-th partial derivative with respect to u.
The probability generating function
|
| (6) |
|
| (7) |
|
| (8) |
1.3 Convergence of probability distributions
A probability measure is determined by its moments if it has finite moments
k=E[xk] of all orders and the power series
has a positive radius of convergence for r.
THEOREM 1
[Theorem 30.2 in (Billingsley, 1995)] Suppose that the distribution of X is determined by its moments and the Xn have moments of all orders and that limnE[X]=E[Xr] for r=1,2,.... Then the distribution of Xn converges to the distribution of X.
THEOREM 2.
For probability distributions of Xn, if their kth factorial moment converges to µk, then their probability distributions converge to Poisson distribution with mean µ.
PROOF.
A distribution is determined by its moments ifconverges for any value of r. For a Poisson[µ] distribution,
, which converges for any µ and r. Hence, Theorem 1 applies. Since regular moments E[Xr] are just the linear combinations of factorial moments E[X(r)], the conclusion in Theorem 1 is also true for factorial moments. Finally for Poisson[µ], E[X(r)]=µr.
| 2 THE GENERATING FUNCTION APPROACH TO SINGLE-CHROMOSOME GENOMES |
|---|
|
|
|---|
In this section, we consider four different cases of genomes containing only one chromosome: unsigned linear, unsigned circular, signed linear and signed circular chromosomes. We first derive the generating functions for each case. For unsigned cases, the limiting probability distributions are derived via factorial moments while for signed cases the direct derivation of the exact distribution from generating functions is possible.
We first introduce an operation (Flajolet and Sedgewick, 2008) that we call the star operation. For any genome, from the existing adjacencies, this operation distinguishes an arbitrary set of adjacencies and labels them with stars. For a genome with m adjacencies, there are 2m different ways of picking starred adjacencies.
By using starred genomes, we can avoid complications due to overcounting certain nested configurations. We can then make use of a straightforward relation (Lemma 1) between starred genomes and genomes without stars to derive the main result in Theorem 3.
LEMMA 1.
If F(z,u) is the bivariate generating function counting the number of genomes with n elements and m adjacencies and G(z,v) is the bivariate generating function counting the number of genomes with n elements and l starred adjacencies, then the star operation corresponds to the substitution u1+v and
(9)
(10)
PROOF.
If fn,m, the coefficient of znum in F is the number of genomes with n elements and m adjacencies, the star operation on these genomes will producegenomes with l starred adjacencies, where l=0,1,...,m. We have
![]()
. Comparing G(z,v)=
n,mfn,mzn(1+v)m with F(z,u)=
n,m fn,mznum, we have the desired result.
2.1 The four generating functions
THEOREM 3.
Denote by Fu,l(z,u),Fu,c(z,u),Fs,l(z,u) and Fs,c(z,u) the generating functions correspondingly to unsigned linear, unsigned circular, signed linear and signed circular single-chromosome genomes, where the coefficients of powers of z and u count markers and adjacencies, respectively. Then we have:
(11)
(12)
(13)
(14)
PROOF.
We count the numbers for the corresponding starred configurations first and derive the generating functions for the original questions via Equation (10). Define a synteny block as a block of markers numbered successively either in a increasing or in a decreasing order (for the signed case, there is a minus sign before decreasing ordered markers). For a synteny block of size s, there are s–1 adjacencies. The starred synteny block is just the synteny block where each adjacency is starred. The generating function S(z,v) counting the number of starred synteny blocks is. The starred genomes are the compositions of starred synteny blocks and free markers—markers that not involved in any starred adjacencies. Call these starred synteny blocks and free markers free components. Now we derive the expressions for G(z,v) using the fact that starred genomes are permutations (signed/unsigned, linear/circular) of these free components.
- Unsigned linear genomes. If there are n free components, there are n! unsigned linear permutations of them. Each free component can be a free marker or a starred synteny block, so the generating function for free components is
![]()
. Then we have
So
- Unsigned circular genomes. Given n free components, there are (n–1)! circular permutations. So that:
- Signed linear genomes. Given n free components there are n! linear permutations. Each free component is either a free marker, which may take two different signs, or a starred synteny block. So the generating function for free components is
. We have:
- Signed circular genomes. Similarly we have:
2.2 From factorial moments to limiting probability distributions for unsigned genomes
After expanding the generating functions F(z,u), the coefficient of the term znum is just the number of permutations with n elements and m adjacencies. While this approach is easily followed for signed genomes, it leads to complicated multiple summations for the unsigned cases. Next we derive the probability distribution for unsigned cases by means of factorial moments.
THEOREM 4.
X is the random variable counting the number of adjacencies for unsigned linear or circular single-chromosome genomes. Its k-th factorial moment iswhile its probability distributions
(15) The limiting distribution is Poisson[2].
(16)
PROOF.
Set P(z,u)=1+uz–z, Q(z,u)=1–uz+z.Then
The kth derivative of PnQ–n can be expanded as
Then we have
where n(i) stands for n(n–1)...(n–i+1) and n(0)=1.
Since P(z,u)|u=1=1 and Q(z,u)|u=1=1, we have
and
(17) Let Xu be the number of common adjacencies for unsigned single-chromosome genomes, either linear or circular.
- Unsigned linear case:
Then for the ith factorial moment, we can calculate:
(18)
(19)
- Unsigned circular case:
(20)
(21)
From Theorem 2, we conclude that the limiting distribution for the number of adjacencies is Poisson[2].
The probability generating function:
(22)
(23)
2.3 Derivation of distributions for signed cases
The relatively simpler generating functions for signed genomes enable the direct derivation of probability distributions. We have
THEOREM 5.
For signed linear or circular single-chromosome genomes, the probability distributions of the number of adjacencies are:
(24) and their limiting distributions are Poisson[
(25) ].
PROOF.
From the generating functions F(z,u) we get the corresponding probability generating functions Pn(u), which give us the probability distribution immediately.
(26) From P[Xs=k]=[uk]P
(27) (u) we have
(28)
(29)
| 3 THE PROBABILITY APPROACH FOR MULTICHROMOSOMAL GENOMES |
|---|
|
|
|---|
For multichromosomal genomes, the variation in the number of chromosomes, shape (linear or circular) and length (the number of markers) of each chromosome complicate the exact calculation. However, some dominant tendencies emerge when the number of markers is much larger than the number of linear chromosomes. We use a probabilistic approach to characterize these tendencies.
Since the methods for unsigned and signed genomes are essentially the same, we treat the two cases at the same time. For either case, suppose there are n markers,
l linear chromosomes and
c circular chromosomes in the reference genome and n genes, 
linear chromosomes and 
circular chromosomes in the random genome.
Let
i be the event that marker gi and gi+1 form an adjacency, in the form of either (gi,gi+1) or (gi+1,gi). (In the signed case, (gi,gi+1) or (–gi–1,–gi).)
Denote
as the set of adjacencies in the reference genome, i.e. markers i and i+1 where i is not the end of a chromosome. Clearly |
|=n–
l.
Let 
u (
for signed cases) be the indicator random variable for the event
i, i.e. 
(or 
) counts 1 when
i occurs, 0 otherwise. In the random genome, let pu (or ps) be the probability of event
i, where i takes any value from set
.
LEMMA 2.
In the random genome, the probability of the eventi, where i
, is
for the unsigned case and
for the signed case.
PROOF.
In the random genome, marker gi can be located at the end of some linear chromosome, with probability. When this happens, for unsigned genomes, there is only one possible position for gi+1 to form an adjacency with gi, which gives the probability
. For signed genomes,
i happens when gi, gi+1 is located at the left end of the chromosome or–gi–1, –gi is located at the right end of the chromosome. Either of the two cases gives the probability
.
Gene gi can also be placed in the interior of chromosomes with probability
. For unsigned genomes, two possible positions are available for gi+1 to form an adjacency with gi, with total probability
. While for signed genomes, one possible position is available depending on the sign of gi, with probability
.
(30)
THEOREM 6.
The expected number of adjacencies is
(31)
PROOF.
Let Xu (Xs) be the number of adjacencies for unsigned genomes (signed genomes), which is just the summation of's (
's) for all i in
.
The expectations are easily derived:
(32)
(33)
THEOREM 7.
The variance of the number of adjacencies is
(34)
PROOF.
V[X]=E[X2]–E2[X], and we first calculate the non-centered second moment E[X2], which can be expressed as the following summation.In the last expression, there are (n–
(35) l)(n–
l–3)+2
l, 2(n–2
l) and n–
l summands in the three summations correspondingly. For the present version of this article, we will not detail the case-by-case calculation of these summands, which results in the quantities in the statement of this theorem.
THEOREM 8.
The limiting probability distribution is Poisson[2] for unsigned genomes and Poisson[] for signed genomes.
PROOF
We first prove kth factorial moment converges to 2k for unsigned genomes and 2–k for signed genomes. Then by Theorem 2, we get the above conclusion.Since
i is the indicator random variable of value 0 or 1, the k-th factorial moment can be written as the following summation, where the k index runs over all k–tuples on the set
and no two indices take on the same value.
(36) Since the value of conditional expection depends on indices i1,i2,...,ik, the summation on the right hand side of (36) is split into two summations:
(37) Denote E[
(38) i1
i2...
ik|gi1,gi2,...,gik] as the conditional expectation for given indices {ij: 1
j
k}, when the il-th element on the random genome is gil for all 1
l
k. Then the unconditional expectation can be expressed as:
The set {gi1,gi2,...,gik: i1,i2,...,ik
(39) } can be split into:
Then the expection E'=E[
i1
i2...
ik] on the right hand side of (37) becomes:
(40) Since
(41) 1 asymptotically occurs with probability 1, as n goes to
, then
, which takes the maximum value among all conditional expectations as:
Since the number of summands in
(42) 1 is at least n(n–3)(n–6)...s(n–3k+3)=nk+O(nk–1), the number of summands in
2 is in the order O(nk–1) and the unconditional expectation in
2 is no larger than E[
i1
i2...
ik|
1], then
(43) So
(44) 2 is at least one order of magnitude smaller than
2.
From the convergence of the factorial moments, we have the convergence of the probability distribution: the limiting probability distribution of number of adjacencies is Poisson[2] for unsigned genomes and Poisson[
(45) ] for signed genomes.
REMARK 1.
Using the methods of Theorem 6, we can calculate the covariance betweeni and
j as Cov(
i,
j)=E[
i
j]–E[
i]E[
j]:
Since
(46) and
, the covariances are at least one order of magnitude smaller than the variance. The
i's can be treated as independent identical random variables under a mild approximation, which leads to Poisson distributions.
| 4 CONCLUSION |
|---|
|
|
|---|
In this article, we used a combinatorial approach to find the generating functions counting the number of genomes with given numbers of markers and adjacencies for genomes with only one chromosome. We used probabilistic methods to calculate the exact values for random expectations and variances of the number of adjacencies for genomes with any number of linear and circular chromosomes. The overall conclusion is that the limiting probability distribution is Poisson[2] for the unsigned case and Poisson[
] for the signed case.
Based on the limiting Poisson distribution, we can devise a statistical test for whether the two genomes contain a significant evolutionary signal, when the number of markers is not too small. For unsigned genomes with number of adjacencies a, the P-value is calculated by
. For signed genomes with number of adjacencies a, the P-value is calculated by
. Based on Table 1, when the unsigned distance is >5 or the signed adjacency distance is >2, a statistical test with a critical region of 5% will reject the null hypothesis of randomness and accept that there is a significant evolutionary signal between the two genomes involved.
|
| ACKNOWLEDGEMENTS |
|---|
|
|
|---|
We thank Glenn Tesler for discussing his previous work on this topic with us, and Daniel Panario for his suggestions and encouragement. Research supported in part by a grant to D.S. from the Natural Sciences and Engineering Research Council of Canada (NSERC). D.S. holds the Canada Research Chair in Mathematical Genomics.
Conflict of Interest: none declared.
| REFERENCES |
|---|
|
|
|---|
Abramson M, Moser W. Combinations, successions and the n-kings problem. Math. Mag (1966) 39:269–273.
Billingsley P. Probability and Measure, 3rd edn. (1995) New York: Wiley InterScience.
Flajolet P, Sedgewick R. Analytic Combinatorics. (2008) Cambridge University Press.
Nadeau JH, Taylor BA. Lengths of chromosomal segments conserved since divergence of man and mouse. Proceedings of the National Academy of Sciences (USA) (1984) 81:814–818.
Robbins DP. The probability that neighbors remain neighbors after random rearrangements. Am. Math. Mon (1980) 87:122–124.[CrossRef][Web of Science]
Sankoff D, Blanchette M. Multiple genome rearrangement and breakpoint phylogeny. J. Comp. Biol (1998) 5:555–570.
Tesler G. Decomposition of permutations into rising and falling subsequences. (2005) unpublished manuscript, 2005.
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
]=E[Xr] for r=1,2,.... Then the distribution of Xn converges to the distribution of X.
1+v and




(u) we have




's (
's) for all i in 



j


1 asymptotically occurs with probability 1, as n goes to
, then 
1 is at least n(n–3)(n–6)...s(n–3k+3)=nk+O(nk–1), the number of summands in 

