Skip Navigation

Bioinformatics 2008 24(16):i146-i152; doi:10.1093/bioinformatics/btn295
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Xu, W.
Right arrow Articles by Sankoff, D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Xu, W.
Right arrow Articles by Sankoff, D.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2008. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Poisson adjacency distributions in genome comparison: multichromosomal, circular, signed and unsigned cases

Wei Xu *, Benoît Alain and David Sankoff

Department of Mathematics and Statistics, University of Ottawa, Ottawa, Ontario, Canada

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE GENERATING FUNCTION...
 3 THE PROBABILITY APPROACH...
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

The number of common adjacencies of genetic markers, as a measure of the similarity of two genomes, has been widely used as indicator of evolutionary relatedness and as the basis for inferring phylogenetic relationships. Its probability distribution enables statistical tests in detecting whether significant evolutionary signal remains in the marker order. In this article, we derive the probability distributions of the number of adjacencies for a number of types of genome—signed or unsigned, circular or linear, single-chromosome or multichromosomal. Generating functions are found for singlechromosome cases, from which exact counts can be calculated. Probability approaches are adopted for multichromosomal cases, where we find the exact values for expectations and variances. In both cases, the limiting distributions are derived in term of numbers of adjacencies. For all unsigned cases, the limiting distribution is Poisson with parameter 2; for all signed cases, the limiting distribution is Poisson with parameter 1/2.

Contact: wxu060{at}uottawa.ca


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE GENERATING FUNCTION...
 3 THE PROBABILITY APPROACH...
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
The linear order of markers, such as genes or other sites, on chromosomes is a characteristic structural feature of the genome shared by all individuals in a species. During evolution, various rearrangement events disrupt this order by moving or inverting segments of chromosomes. At the time of the event, the order of the markers within a segment is either conserved or inverted and the order of markers outside the segment is conserved. A breakpoint may be created between two markers that have hitherto been adjacent in the order if one is inside the segment while the other remains outside the segment affected.

The number of breakpoints b in the comparison of two genomes is the oldest and simplest metric representing the evolutionary divergence of species through chromosomal rearrangements (Nadeau and Taylor, 1984; Sankoff and Blanchette, 1998).

Chromosomes are generally circular in prokaryotes, mitochondria and chloroplasts, and a single chromosome contains the entire genome, although sometimes smaller circles, plasmids, contain some of this information. Eukaryotic nuclear genomes are partitioned among linear chromosomes, from a few to a few dozen in number.

Available data on chromosomes may or may not identify the DNA strand on which the markers are found (called signed or unsigned data, respectively).

Let a be the number of pairs of markers adjacent in two genomes with the same markers 1,2,...,n and the same number of linear chromosomes {chi}l. Then


Formula

As evolutionary rearrangements continue to disrupt adjacencies, b increases and a decreases. Now, even pairs of randomly constructed genomes may have some adjacencies, and pairs of genomes clearly related at the DNA sequence level may have highly scrambled marker order. Then, for any given pair of genomes the question arises of whether a is significantly larger than the random case. To answer this, i.e. to test a for statistical significance, we need its distribution under the null hypothesis of randomness.

In this article, we study the distribution of the number of common adjacencies under the null hypothesis that the n markers are ordered completely randomly on the genomes (N.B. it suffices to randomize just one of the genomes, since relabeling markers can convert one of the genome to a canonical order, e.g. 1,2,...,n, without changing a). For multichromosomal genomes, the number of markers on each chromosome is also random. We study the cases of single or multiple chromosomes, which can be linear or circular, and signed or unsigned. The unsigned single-chromosome case is related to the dinner table problem (Robbins, 1980) and non-attacking kings problem (Abramson and Moser, 1966). G. Tesler has previously derived results for linear, single-chromosome genomes (Tesler, 2005).

In Section 2, we find the generating functions for the case of a single-chromosome genome and approximate the probability distributions for large n. In Section 3, we extend our discussion to multichromosomal genomes and directly derive the exact formulae for the expectation and variance of a as well as an approximation of the probability distribution for large n.

The remainder of this section introduces notation and recalls fundamental theorems used in this article.

1.1 Notation and terminology
In the single-chromosome unsigned case we call the genome R where the markers are ordered from 1 to n the reference genome, i.e. R=1,2,...,n. The random genome G=g1,g2,...,gn is sampled from a uniform distribution on the set of permutations of 1,...,n. For multichromosomal genomes, the beginning and ending marker of each chromosome in the reference genome are given, while in the random genome only the number of chromosomes is given. In the signed case, all the markers in the reference genome R have positive sign, while in the random genome G, a positive or negative sign is assigned at random to each marker independently.

If for some i we have |gigi+1|=1 in the unsigned case, or gi+1gi=1 in the signed case, we say there is an adjacency in common between the two genomes, otherwise there is a breakpoint. In the multichromosomal case, if gi is the last term on a chromosome and/or gi+1 is the first, we identify neither an adjacency nor a breakpoint.

1.2 Generating functions
The ordinary generating function (OGF) is defined as the formal power series:


Formula 1

(1)
where [zn] denotes An the coefficient of zn in F.

If An,m denotes the number of (random) genomes with n markers and m adjacencies in common with the reference genome, consider the bivariate generating function


Formula 2

(2)
For a given n, if X is the random variable counting the number of adjacencies in a random genome and An={sum}mAn,m, then the probability


Formula 3

(3)
Let X(k)=X(X–1)...s(Xk+1). Then the kth factorial moment is


Formula 4

(4)


Formula 5

(5)
where {partial}Formula denotes the k-th partial derivative with respect to u.

The probability generating function


Formula 6

(6)


Formula 7

(7)


Formula 8

(8)
Substituting u by et gives the familiar moment generating function.

1.3 Convergence of probability distributions
A probability measure is determined by its moments if it has finite moments {alpha}k=E[xk] of all orders and the power series Formula has a positive radius of convergence for r.

THEOREM 1
[Theorem 30.2 in (Billingsley, 1995)] Suppose that the distribution of X is determined by its moments and the Xn have moments of all orders and that limnE[XFormula]=E[Xr] for r=1,2,.... Then the distribution of Xn converges to the distribution of X.

THEOREM 2.
For probability distributions of Xn, if their kth factorial moment converges to µk, then their probability distributions converge to Poisson distribution with mean µ.

PROOF.
A distribution is determined by its moments if Formula converges for any value of r. For a Poisson[µ] distribution, Formula , which converges for any µ and r. Hence, Theorem 1 applies. Since regular moments E[Xr] are just the linear combinations of factorial moments E[X(r)], the conclusion in Theorem 1 is also true for factorial moments. Finally for Poisson[µ], E[X(r)]=µr{blacksquare}


    2 THE GENERATING FUNCTION APPROACH TO SINGLE-CHROMOSOME GENOMES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE GENERATING FUNCTION...
 3 THE PROBABILITY APPROACH...
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this section, we consider four different cases of genomes containing only one chromosome: unsigned linear, unsigned circular, signed linear and signed circular chromosomes. We first derive the generating functions for each case. For unsigned cases, the limiting probability distributions are derived via factorial moments while for signed cases the direct derivation of the exact distribution from generating functions is possible.

We first introduce an operation (Flajolet and Sedgewick, 2008) that we call the star operation. For any genome, from the existing adjacencies, this operation distinguishes an arbitrary set of adjacencies and labels them with stars. For a genome with m adjacencies, there are 2m different ways of picking starred adjacencies.

By using starred genomes, we can avoid complications due to overcounting certain nested configurations. We can then make use of a straightforward relation (Lemma 1) between starred genomes and genomes without stars to derive the main result in Theorem 3.

LEMMA 1.
If F(z,u) is the bivariate generating function counting the number of genomes with n elements and m adjacencies and G(z,v) is the bivariate generating function counting the number of genomes with n elements and l starred adjacencies, then the star operation corresponds to the substitution u->1+v and


Formula 9

(9)


Formula 10

(10)

PROOF.
If fn,m, the coefficient of znum in F is the number of genomes with n elements and m adjacencies, the star operation on these genomes will produce Formula genomes with l starred adjacencies, where l=0,1,...,m. We have Formula Formula . Comparing G(z,v)={sum}n,mfn,mzn(1+v)m with F(z,u)={sum}n,m fn,mznum, we have the desired result. {blacksquare}

2.1 The four generating functions

THEOREM 3.
Denote by Fu,l(z,u),Fu,c(z,u),Fs,l(z,u) and Fs,c(z,u) the generating functions correspondingly to unsigned linear, unsigned circular, signed linear and signed circular single-chromosome genomes, where the coefficients of powers of z and u count markers and adjacencies, respectively. Then we have:


Formula 11

(11)


Formula 12

(12)


Formula 13

(13)


Formula 14

(14)

PROOF.
We count the numbers for the corresponding starred configurations first and derive the generating functions for the original questions via Equation (10). Define a synteny block as a block of markers numbered successively either in a increasing or in a decreasing order (for the signed case, there is a minus sign before decreasing ordered markers). For a synteny block of size s, there are s–1 adjacencies. The starred synteny block is just the synteny block where each adjacency is starred. The generating function S(z,v) counting the number of starred synteny blocks is Formula . The starred genomes are the compositions of starred synteny blocks and free markers—markers that not involved in any starred adjacencies. Call these starred synteny blocks and free markers free components. Now we derive the expressions for G(z,v) using the fact that starred genomes are permutations (signed/unsigned, linear/circular) of these free components.
  1. Unsigned linear genomes. If there are n free components, there are n! unsigned linear permutations of them. Each free component can be a free marker or a starred synteny block, so the generating function for free components is Formula Formula . Then we have


    Formula

    So


    Formula


  2. Unsigned circular genomes. Given n free components, there are (n–1)! circular permutations. So that:


    Formula



    Formula


  3. Signed linear genomes. Given n free components there are n! linear permutations. Each free component is either a free marker, which may take two different signs, or a starred synteny block. So the generating function for free components is Formula . We have:


    Formula



    Formula


  4. Signed circular genomes. Similarly we have:


    Formula



    Formula

     {blacksquare}

2.2 From factorial moments to limiting probability distributions for unsigned genomes
After expanding the generating functions F(z,u), the coefficient of the term znum is just the number of permutations with n elements and m adjacencies. While this approach is easily followed for signed genomes, it leads to complicated multiple summations for the unsigned cases. Next we derive the probability distribution for unsigned cases by means of factorial moments.

THEOREM 4.
X is the random variable counting the number of adjacencies for unsigned linear or circular single-chromosome genomes. Its k-th factorial moment is


Formula 15

(15)
while its probability distributions


Formula 16

(16)
The limiting distribution is Poisson[2].

PROOF.
Set P(z,u)=1+uzz, Q(z,u)=1–uz+z.

Then


Formula



Formula

The kth derivative of PnQn can be expanded as


Formula

Then we have


Formula



Formula

where n(i) stands for n(n–1)...(ni+1) and n(0)=1.

Since P(z,u)|u=1=1 and Q(z,u)|u=1=1, we have


Formula

and


Formula 17

(17)

  1. Unsigned linear case:


    Formula 18

    (18)
    Then for the ith factorial moment, we can calculate:


    Formula 19

    (19)

  2. Unsigned circular case:


    Formula 20

    (20)


    Formula 21

    (21)

Let Xu be the number of common adjacencies for unsigned single-chromosome genomes, either linear or circular.


Formula

From Theorem 2, we conclude that the limiting distribution for the number of adjacencies is Poisson[2].

The probability generating function:


Formula 22

(22)


Formula 23

(23)
 {blacksquare}

2.3 Derivation of distributions for signed cases
The relatively simpler generating functions for signed genomes enable the direct derivation of probability distributions. We have

THEOREM 5.
For signed linear or circular single-chromosome genomes, the probability distributions of the number of adjacencies are:


Formula 24

(24)


Formula 25

(25)
and their limiting distributions are Poisson[1/2].

PROOF.
From the generating functions F(z,u) we get the corresponding probability generating functions Pn(u), which give us the probability distribution immediately.


Formula 26

(26)


Formula 27

(27)
From P[Xs=k]=[uk]PFormula(u) we have


Formula 28

(28)


Formula 29

(29)


    3 THE PROBABILITY APPROACH FOR MULTICHROMOSOMAL GENOMES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE GENERATING FUNCTION...
 3 THE PROBABILITY APPROACH...
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
For multichromosomal genomes, the variation in the number of chromosomes, shape (linear or circular) and length (the number of markers) of each chromosome complicate the exact calculation. However, some dominant tendencies emerge when the number of markers is much larger than the number of linear chromosomes. We use a probabilistic approach to characterize these tendencies.

Since the methods for unsigned and signed genomes are essentially the same, we treat the two cases at the same time. For either case, suppose there are n markers, {chi}l linear chromosomes and {chi}c circular chromosomes in the reference genome and n genes, {chi}Formula linear chromosomes and {chi}Formula circular chromosomes in the random genome.

Let {gamma}i be the event that marker gi and gi+1 form an adjacency, in the form of either (gi,gi+1) or (gi+1,gi). (In the signed case, (gi,gi+1) or (–gi–1,–gi).)

Denote {Lambda} as the set of adjacencies in the reference genome, i.e. markers i and i+1 where i is not the end of a chromosome. Clearly |{Lambda}|=n{chi}l.

Let {Gamma}Formulau ({Gamma}Formula for signed cases) be the indicator random variable for the event {gamma}i, i.e. {Gamma}Formula (or {Gamma}Formula) counts 1 when {gamma}i occurs, 0 otherwise. In the random genome, let pu (or ps) be the probability of event {gamma}i, where i takes any value from set {Lambda}.

LEMMA 2.
In the random genome, the probability of the event {gamma}i, where iisin{Lambda}, is Formula for the unsigned case and Formula for the signed case.

PROOF.
In the random genome, marker gi can be located at the end of some linear chromosome, with probability Formula . When this happens, for unsigned genomes, there is only one possible position for gi+1 to form an adjacency with gi, which gives the probability Formula . For signed genomes, {gamma}i happens when gi, gi+1 is located at the left end of the chromosome or–gi–1, gi is located at the right end of the chromosome. Either of the two cases gives the probability Formula .

Gene gi can also be placed in the interior of chromosomes with probability Formula . For unsigned genomes, two possible positions are available for gi+1 to form an adjacency with gi, with total probability Formula . While for signed genomes, one possible position is available depending on the sign of gi, with probability Formula .

Summing up we have,


Formula 30

(30)
 {blacksquare}

THEOREM 6.
The expected number of adjacencies is


Formula 31

(31)

PROOF.
Let Xu (Xs) be the number of adjacencies for unsigned genomes (signed genomes), which is just the summation of {Gamma}Formula's ({Gamma}Formula's) for all i in {Lambda}.


Formula

The expectations are easily derived:


Formula 32

(32)


Formula 33

(33)
 {blacksquare}

THEOREM 7.
The variance of the number of adjacencies is


Formula 34

(34)

PROOF.
V[X]=E[X2]–E2[X], and we first calculate the non-centered second moment E[X2], which can be expressed as the following summation.


Formula 35

(35)
In the last expression, there are (n{chi}l)(n{chi}l–3)+2{chi}l, 2(n–2{chi}l) and n{chi}l summands in the three summations correspondingly. For the present version of this article, we will not detail the case-by-case calculation of these summands, which results in the quantities in the statement of this theorem.

THEOREM 8.
The limiting probability distribution is Poisson[2] for unsigned genomes and Poisson[1/2] for signed genomes.

PROOF
We first prove kth factorial moment converges to 2k for unsigned genomes and 2k for signed genomes. Then by Theorem 2, we get the above conclusion.

Since {Gamma}i is the indicator random variable of value 0 or 1, the k-th factorial moment can be written as the following summation, where the k index runs over all k–tuples on the set {Lambda} and no two indices take on the same value.


Formula 36

(36)

Since the value of conditional expection depends on indices i1,i2,...,ik, the summation on the right hand side of (36) is split into two summations:


Formula 37

(37)


Formula 38

(38)
Denote E[{Gamma}i1{Gamma}i2...{Gamma}ik|gi1,gi2,...,gik] as the conditional expectation for given indices {ij: 1≤j≤k}, when the il-th element on the random genome is gil for all 1≤l≤k. Then the unconditional expectation can be expressed as:


Formula 39

(39)
The set {gi1,gi2,...,gik: i1,i2,...,ikisin{Lambda}} can be split into:


Formula



Formula

Then the expection E'=E[{Gamma}i1{Gamma}i2...{Gamma}ik] on the right hand side of (37) becomes:


Formula 40

(40)


Formula 41

(41)
Since A1 asymptotically occurs with probability 1, as n goes to {infty}, then Formula , which takes the maximum value among all conditional expectations as:


Formula 42

(42)
Since the number of summands in {Sigma}1 is at least n(n–3)(n–6)...s(n–3k+3)=nk+O(nk–1), the number of summands in {Sigma}2 is in the order O(nk–1) and the unconditional expectation in {Sigma}2 is no larger than E[{Gamma}i1{Gamma}i2...{Gamma}ik|A1], then


Formula 43

(43)


Formula 44

(44)
So {Sigma}2 is at least one order of magnitude smaller than {Sigma}2.


Formula 45

(45)
From the convergence of the factorial moments, we have the convergence of the probability distribution: the limiting probability distribution of number of adjacencies is Poisson[2] for unsigned genomes and Poisson[1/2] for signed genomes. {blacksquare}

REMARK 1.
Using the methods of Theorem 6, we can calculate the covariance between {Gamma}i and {Gamma}j as Cov({Gamma}i,{Gamma}j)=E[{Gamma}i{Gamma}j]–E[{Gamma}i]E[{Gamma}j]:


Formula 46

(46)
Since Formula and Formula , the covariances are at least one order of magnitude smaller than the variance. The {Gamma}i's can be treated as independent identical random variables under a mild approximation, which leads to Poisson distributions.


    4 CONCLUSION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE GENERATING FUNCTION...
 3 THE PROBABILITY APPROACH...
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
In this article, we used a combinatorial approach to find the generating functions counting the number of genomes with given numbers of markers and adjacencies for genomes with only one chromosome. We used probabilistic methods to calculate the exact values for random expectations and variances of the number of adjacencies for genomes with any number of linear and circular chromosomes. The overall conclusion is that the limiting probability distribution is Poisson[2] for the unsigned case and Poisson[1/2] for the signed case.

Based on the limiting Poisson distribution, we can devise a statistical test for whether the two genomes contain a significant evolutionary signal, when the number of markers is not too small. For unsigned genomes with number of adjacencies a, the P-value is calculated by Formula . For signed genomes with number of adjacencies a, the P-value is calculated by Formula Formula . Based on Table 1, when the unsigned distance is >5 or the signed adjacency distance is >2, a statistical test with a critical region of 5% will reject the null hypothesis of randomness and accept that there is a significant evolutionary signal between the two genomes involved.


View this table:
[in this window]
[in a new window]

 
Table 1. P-values for given number of adjacencies when n is large

 

    ACKNOWLEDGEMENTS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE GENERATING FUNCTION...
 3 THE PROBABILITY APPROACH...
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 
We thank Glenn Tesler for discussing his previous work on this topic with us, and Daniel Panario for his suggestions and encouragement. Research supported in part by a grant to D.S. from the Natural Sciences and Engineering Research Council of Canada (NSERC). D.S. holds the Canada Research Chair in Mathematical Genomics.

Conflict of Interest: none declared.


    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 THE GENERATING FUNCTION...
 3 THE PROBABILITY APPROACH...
 4 CONCLUSION
 ACKNOWLEDGEMENTS
 REFERENCES
 

    Abramson M, Moser W. Combinations, successions and the n-kings problem. Math. Mag (1966) 39:269–273.

    Billingsley P. Probability and Measure, 3rd edn. (1995) New York: Wiley InterScience.

    Flajolet P, Sedgewick R. Analytic Combinatorics. (2008) Cambridge University Press.

    Nadeau JH, Taylor BA. Lengths of chromosomal segments conserved since divergence of man and mouse. Proceedings of the National Academy of Sciences (USA) (1984) 81:814–818.[Abstract/Free Full Text]

    Robbins DP. The probability that neighbors remain neighbors after random rearrangements. Am. Math. Mon (1980) 87:122–124.[CrossRef][Web of Science]

    Sankoff D, Blanchette M. Multiple genome rearrangement and breakpoint phylogeny. J. Comp. Biol (1998) 5:555–570.

    Tesler G. Decomposition of permutations into rising and falling subsequences. (2005) unpublished manuscript, 2005.


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow Comments: Submit a response
Right arrow Alert me when this article is cited
Right arrow Alert me when Comments are posted
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Xu, W.
Right arrow Articles by Sankoff, D.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Xu, W.
Right arrow Articles by Sankoff, D.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?