Bioinformatics Advance Access originally published online on February 21, 2006
Bioinformatics 2006 22(10):1225-1231; doi:10.1093/bioinformatics/btl064
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences

Faisal Ababneh 1, Lars S. Jermiin 2,3,4,*, Chunsheng Ma 5 and John Robinson 1

1 School of Mathematics and Statistics, University of Sydney NSW 2006, Australia
2 School of Biological Sciences, University of Sydney NSW 2006, Australia
3 Sydney University Biological Informatics and Technology Centre, University of Sydney NSW 2006, Australia
4 Unité de Biologie Moléculaire de Gène chez les Extrêmophiles, Institut Pasteur 75724 Paris Cedex, France
5 Department of Mathematics and Statistics, Wichita State University Wichita, KS 67260-0033, USA

*To whom correspondence should be addressed.


    ABSTRACT

Motivation: Most phylogenetic methods assume that the sequences of nucleotides or amino acids have evolved under stationary, reversible and homogeneous conditions. When these assumptions are violated by the data, there is an increased probability of errors in the phylogenetic estimates. Methods to examine aligned sequences for these violations are available, but they are rarely used, possibly because they are not widely known or because they are poorly understood.

Results: We describe and compare the available tests for symmetry of k-dimensional contingency tables from homologous sequences, and develop two new tests to evaluate different aspects of the evolutionary processes. For any pair of sequences, we consider a partition of the test for symmetry into a test for marginal symmetry and a test for internal symmetry. The proposed tests can be used to identify appropriate models for estimation of evolutionary relationships under a Markovian model. Simulations under more or less complex evolutionary conditions were done to display the performance of the tests. Finally, the tests were applied to an alignment of small-subunit ribosomal RNA sequences of five species of bacteria to outline the evolutionary processes under which they evolved.

Availability: Programs written in R to do the tests on nucleotides are available from http://www.maths.usyd.edu.au/u/johnr/testsym/

Contact: lars.jermiin@usyd.edu.au


    1 INTRODUCTION
Alignments of nucleotide sequences are often analyzed using substitution models of varying complexity, from the simplest Markov model (Jukes and Cantor, 1969) to the most general time-reversible Markov model (Lanave et al., 1984), which assumes stationarity, homogeneity and reversibility. Here stationarity implies that the marginal probabilities of the four nucleotides remain constant throughout the tree; homogeneity implies that the instantaneous rate matrix is constant over an edge, which may be termed local homogeneity, or constant over the tree, which may be termed global homogeneity; and reversibility implies that the process is stationary and permits us to ignore the direction of evolution. More detailed definitions are available in Jayaswal et al. (2005) and Ababneh et al. (2006). The most general Markovian model, which does not use these constraints, is that of Barry and Hartigan (1987). These models generally consider unrooted trees and assume independently and identically distributed sites, although the models only require independence conditional on values at a root, which can be taken as an internal node. Some models of intermediate complexity are described in Yang and Roberts (1995) and Foster (2004), who considered non-homogeneous models on rooted trees with $\Gamma$-distributed rate-heterogeneity across independent sites.

In choosing a suitable substitution model to analyze their phylogenetic data, many researchers have chosen to employ an approach implemented in a program called ModelTest (Posada and Crandall, 1998). In so doing, they implicitly assumed that the sequences evolved under stationary, homogeneous and reversible conditions, even though this might not have been so. When model mis-specification involves using a time-reversible model to analyze sequences generated under more general conditions, the probability of errors in the phylogenetic estimate is increased (Jermiin et al., 2004); worryingly, it also is possible to infer the correct phylogeny irrespective of the fact that the phylogenetic signal has been lost through multiple substitutions at the same sites (Ho and Jermiin, 2004).

Several methods have been used to assess whether phylogenetic data can be assumed to have evolved under stationary conditions, but some of these, i.e. the commonly used ones, are flawed (reviewed in Jermiin et al., 2004). Here we describe and compare appropriate methods to determine whether phylogenetic data are consistent with evolution under stationary, homogeneous and reversible conditions.

Suppose we have k matched observations of n independently and identically distributed variables taking values in r categories. An example of such data would be an alignment of k = 5 sequences of n = 2000 nucleotides (implying that r = 4) or amino acids (implying that r = 20); other examples are discussed in, for instance, Agresti (1990, Chapters 10 and 11). Data of this nature can be summarized in k-dimensional tables with $r^k$ categories. Hypotheses of interest concern symmetry in these tables. In the particular cases of homologous nucleotide or amino acid sequences, tests of symmetry or marginal symmetry can be used to consider goodness of fit of the Markov models used to describe evolutionary processes. The importance of applying these tests prior to phylogenetic analysis of aligned sequence data has long been recognized (Tavaré, 1986; Lanave and Pesole, 1993; Rzhetsky and Nei, 1995; Waddell and Steel, 1997; Waddell et al., 1999), but the tests have not yet been widely adopted by the scientific community.

In the simple case where k = 2, matched-pairs tests can be used to test for symmetry and marginal symmetry. We will show that Bowker's (1948) $\chi^2$ test statistic for symmetry can be partitioned into two independent components, one component being Stuart's (1955) $\chi^2$ test statistic for marginal symmetry, and the other component being a $\chi^2$ test statistic for internal symmetry. This partition was formally proposed by O'Neill (1975). There are similar tests available in the case of multiplicative models (discussed, e.g. in Chapter 10 of Agresti, 1990), in which tests for symmetry are asymptotically equivalent to Bowker's (1948) test, and tests for quasi-symmetry and marginal homogeneity are related to the tests for internal symmetry and marginal homogeneity discussed here. However, it is not clear that these are asymptotically equivalent.

In the more complex cases where k > 2, a test of marginal symmetry has been formulated for analyses of nucleotide sequences by Rzhetsky and Nei (1995). We will derive a combined test for marginal symmetry of all sequences, essentially equivalent to that proposed by Rzhetsky and Nei (1995), and relate this test to tests for all pairs.

Finally, we consider a Markov model for evolution and discuss the use of these tests in deciding on appropriate topologies for a set of data assumed to be generated under the model. We obtain results by simulation that illustrate the use of the tests and we apply the tests to bacterial data that have been discussed previously, e.g. in Galtier and Gouy (1995) and Foster (2004). We finish with a discussion of the merits and limitations of the methods.


    2 METHODS
2.1 Decomposition of Bowker's (1948) test statistic
Consider an r × r contingency table with the ij-th cell containing the frequency $n_{ij}$. We will derive an orthogonal decomposition of the test statistic of Bowker (1948) for testing symmetry in terms of that of Stuart (1955) for testing marginal symmetry. The null hypothesis for symmetry is

$$H_{0B}: f_{ij} = f_{ji} \quad \text{for all } i < j,$$
where $f_{ij}$ is the probability that a randomly chosen variable belongs to the ij-th category (Bowker, 1948), and the null hypothesis for marginal symmetry is

$$H_{0S}: f_{i\cdot} = f_{\cdot i}, \quad i = 1, \ldots, r,$$
where $f_{i\cdot}$ is the sum of $f_{ij}$ over j and $f_{\cdot i}$ is the sum of $f_{ji}$ over j (Stuart, 1955). The two hypotheses are obviously the same in a 2 × 2 contingency table. In general, however, symmetry implies marginal symmetry, whereas the converse does not necessarily hold.

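The test statistic of Bowker (1948) for symmetry is given by

$$S_B^2 = \sum_{i<j} \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}},$$
or alternatively,

$$S_B^2 = m^T B^{-1} m,$$
where

$$m = (n_{12} - n_{21}, \ldots, n_{1r} - n_{r1}, n_{23} - n_{32}, \ldots, n_{r-1,r} - n_{r,r-1})^T$$
and B is a diagonal matrix with elements $n_{12} + n_{21}, \ldots, n_{1r} + n_{r1}, n_{23} + n_{32}, \ldots, n_{r-1,r} + n_{r,r-1}$.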
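As a minimal sketch of the computation just described, the Python function below evaluates Bowker's statistic by summing $(n_{ij} - n_{ji})^2 / (n_{ij} + n_{ji})$ over all pairs $i < j$ of a square table of counts. The function name `bowker_statistic`, the example table and the convention of skipping pairs with $n_{ij} + n_{ji} = 0$ (reducing the degrees of freedom accordingly) are assumptions made for illustration; they are not taken from the R programs mentioned in the abstract.

```python
def bowker_statistic(table):
    """Return (statistic, degrees of freedom) of Bowker's test of symmetry.

    `table` is a square list of lists of counts, e.g. a 4 x 4 divergence
    matrix for two aligned nucleotide sequences.
    """
    r = len(table)
    statistic, df = 0.0, 0
    for i in range(r):
        for j in range(i + 1, r):
            total = table[i][j] + table[j][i]
            if total == 0:
                # Assumption: an empty pair of cells carries no information,
                # so it is skipped and does not count towards df.
                continue
            statistic += (table[i][j] - table[j][i]) ** 2 / total
            df += 1
    return statistic, df


# Hypothetical divergence matrix (rows: sequence 1, columns: sequence 2).
N = [[20, 5, 3, 2],
     [1, 25, 4, 6],
     [3, 2, 15, 7],
     [4, 2, 3, 30]]
stat, df = bowker_statistic(N)
print(round(stat, 3), df)
```

When every off-diagonal pair has a positive total, the degrees of freedom equal r(r − 1)/2, which is 6 for nucleotides.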

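The test statistic of Stuart (1955) for marginal symmetry is

$$S_S^2 = d^T V^{-1} d,$$
where $d = (n_{1\cdot} - n_{\cdot 1}, \ldots, n_{r-1\cdot} - n_{\cdot r-1})^T$ and V is the (r − 1) × (r − 1) matrix with the elements

$$v_{ij} = \begin{cases} n_{i\cdot} + n_{\cdot i} - 2n_{ii}, & i = j, \\ -(n_{ij} + n_{ji}), & i \neq j. \end{cases}$$
Here V is the estimated covariance matrix of d under the assumption of marginal symmetry. To derive an alternative expression of $S_S^2$ in terms of m, notice that d can be written as

$$d = Cm, \tag{1}$$
where C is an $(r - 1) \times r(r - 1)/2$ matrix, uniquely defined by (1), and of the following form for the case r = 4,

$$C = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 \\ -1 & 0 & 0 & 1 & 1 & 0 \\ 0 & -1 & 0 & -1 & 0 & 1 \end{pmatrix}.$$
As a result, Stuart's test statistic can be expressed as

$$S_S^2 = m^T C^T V^{-1} C m.$$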
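The relation between the marginal differences d and the off-diagonal differences m can be checked numerically. The sketch below builds, for any r, a matrix C with one column per pair $i < j$ (ordered as (1,2), ..., (1,r), (2,3), ..., (r−1,r)), an entry +1 in row i and an entry −1 in row j when j < r, and then forms $d = Cm$ and $V = CBC^T$. The function names and the example table are illustrative assumptions.

```python
def contrast_matrix(r):
    """Return (pairs, C): the ordered pairs i < j and the (r-1) x r(r-1)/2
    matrix C that maps off-diagonal differences m to marginal differences d."""
    pairs = [(i, j) for i in range(r) for j in range(i + 1, r)]
    C = [[0] * len(pairs) for _ in range(r - 1)]
    for col, (i, j) in enumerate(pairs):
        C[i][col] = 1        # n_ij - n_ji adds to (row total i) - (column total i)
        if j < r - 1:        # the last category is dropped from d
            C[j][col] = -1   # and subtracts from (row total j) - (column total j)
    return pairs, C


def stuart_ingredients(table):
    """Return (d, V) with d = C m and V = C B C^T for a square table of counts."""
    r = len(table)
    pairs, C = contrast_matrix(r)
    m = [table[i][j] - table[j][i] for i, j in pairs]
    b = [table[i][j] + table[j][i] for i, j in pairs]   # diagonal of B
    cols = range(len(pairs))
    d = [sum(C[a][c] * m[c] for c in cols) for a in range(r - 1)]
    V = [[sum(C[a][c] * b[c] * C[e][c] for c in cols) for e in range(r - 1)]
         for a in range(r - 1)]
    return d, V


# Hypothetical divergence matrix (rows: sequence 1, columns: sequence 2).
N = [[20, 5, 3, 2],
     [1, 25, 4, 6],
     [3, 2, 15, 7],
     [4, 2, 3, 30]]
pairs, C = contrast_matrix(4)
d, V = stuart_ingredients(N)
print(C)
print(d, V)
```

For this table the entries of d agree with the differences between row totals and column totals, and the diagonal of V agrees with $n_{i\cdot} + n_{\cdot i} - 2n_{ii}$.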

Note that, conditional on the elements of B, the elements of $B^{-1/2}m$ are asymptotically independent standard normal variables, under the assumption of symmetry, implying that this is also the unconditional distribution. Accordingly, Bowker's statistic

$$S_B^2 = m^T B^{-1} m$$
is distributed asymptotically as $\chi^2_{r(r-1)/2}$, whereas $B^{1/2} C^T V^{-1} C B^{1/2}$ is a projection matrix of rank r − 1, as can be seen directly by verifying that $V = CBC^T$. Consequently, writing $S_S^2$ for Stuart's statistic,

$$S_B^2 = S_S^2 + S_I^2, \quad \text{where} \quad S_I^2 = m^T (B^{-1} - C^T V^{-1} C) m,$$
which leads immediately to the following theorem.

THEOREM 1
Under the hypothesis of symmetry, $H_{0B}$, $S_S^2$ and $S_I^2$ are asymptotically distributed as independent $\chi^2$ variables with r − 1 and (r − 1)(r − 2)/2 degrees of freedom, respectively. In addition, $S_S^2$ is asymptotically distributed as a $\chi^2$ variable with r − 1 degrees of freedom, under the null hypothesis of marginal symmetry $H_{0S}$.
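The partition in Theorem 1 suggests a simple computational recipe: compute Bowker's statistic, compute Stuart's statistic, and take their difference as the statistic for internal symmetry, referring each to a $\chi^2$ distribution with r(r − 1)/2, r − 1 and (r − 1)(r − 2)/2 degrees of freedom, respectively. The following is a sketch under stated assumptions rather than the authors' R implementation: it uses only the Python standard library, assumes r ≥ 3 and that every off-diagonal pair total $n_{ij} + n_{ji}$ is positive, and the helper names (`chi2_sf`, `solve`, `matched_pairs_tests`) are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    # Start from df = 2 (even df) or df = 1 (odd df), then add two df at a time.
    a, q = (1.0, math.exp(-z)) if df % 2 == 0 else (0.5, math.erfc(math.sqrt(z)))
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for k in range(c, n + 1):
                M[i][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


def matched_pairs_tests(table):
    """Return {test name: (statistic, df, p-value)} for a square table of counts."""
    r = len(table)
    rows = [sum(row) for row in table]
    cols = [sum(row[j] for row in table) for j in range(r)]
    s_b = sum((table[i][j] - table[j][i]) ** 2 / (table[i][j] + table[j][i])
              for i in range(r) for j in range(i + 1, r))
    d = [rows[i] - cols[i] for i in range(r - 1)]
    V = [[rows[i] + cols[i] - 2 * table[i][i] if i == j
          else -(table[i][j] + table[j][i]) for j in range(r - 1)]
         for i in range(r - 1)]
    s_s = sum(di * xi for di, xi in zip(d, solve(V, d)))   # d^T V^{-1} d
    s_i = s_b - s_s
    df_b, df_s = r * (r - 1) // 2, r - 1
    df_i = df_b - df_s
    return {"symmetry": (s_b, df_b, chi2_sf(s_b, df_b)),
            "marginal symmetry": (s_s, df_s, chi2_sf(s_s, df_s)),
            "internal symmetry": (s_i, df_i, chi2_sf(s_i, df_i))}


# Hypothetical divergence matrix (rows: sequence 1, columns: sequence 2).
N = [[20, 5, 3, 2],
     [1, 25, 4, 6],
     [3, 2, 15, 7],
     [4, 2, 3, 30]]
results = matched_pairs_tests(N)
for name, (stat, df, p) in results.items():
    print(name, round(stat, 3), df, round(p, 3))
```

The closed-form tail probability avoids a dependency on a statistics library; with such a library available, its chi-square survival function could be substituted.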

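It is worth noting that the test statistic for internal symmetry is

$$S_I^2 = m^T B^{-1} K^T (K B^{-1} K^T)^{-1} K B^{-1} m,$$
where K is a matrix of full row rank (r − 1)(r − 2)/2 satisfying $CK^T = 0$. Accordingly, we must consider contrasts $KB^{-1}m$ to help interpret internal symmetry. In the case r = 4, we could take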

Formula 1
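To make the role of the contrasts concrete, the sketch below picks one matrix K whose rows are orthogonal to the rows of the r = 4 matrix C (with columns ordered as the pairs (1,2), (1,3), (1,4), (2,3), (2,4), (3,4)) and evaluates the quadratic form $u^T (K B^{-1} K^T)^{-1} u$ with $u = K B^{-1} m$. This particular K is an assumption chosen for illustration and is not necessarily the matrix used by the authors; any full-rank K with $CK^T = 0$ gives the same statistic. Each row of this K combines the three pairwise asymmetries within one triple of categories.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(M[i][c]))
        M[c], M[p] = M[p], M[c]
        for i in range(c + 1, n):
            f = M[i][c] / M[c][c]
            for k in range(c, n + 1):
                M[i][k] -= f * M[c][k]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][k] * x[k] for k in range(i + 1, n))) / M[i][i]
    return x


# C for r = 4: maps the six off-diagonal differences m to the three
# marginal differences d.
C = [[1, 1, 1, 0, 0, 0],
     [-1, 0, 0, 1, 1, 0],
     [0, -1, 0, -1, 0, 1]]

# Hypothetical K (an assumption, not taken from the paper): its rows span
# the null space of C, one row per triple {1,2,3}, {1,2,4}, {1,3,4}.
K = [[1, -1, 0, 1, 0, 0],
     [1, 0, -1, 0, 1, 0],
     [0, 1, -1, 0, 0, 1]]

CKt = [[sum(c * k for c, k in zip(c_row, k_row)) for k_row in K] for c_row in C]


def internal_symmetry_statistic(table, K):
    """u^T (K B^{-1} K^T)^{-1} u with u = K B^{-1} m, for a 4 x 4 table."""
    r = len(table)
    pairs = [(i, j) for i in range(r) for j in range(i + 1, r)]
    m = [table[i][j] - table[j][i] for i, j in pairs]
    b = [table[i][j] + table[j][i] for i, j in pairs]   # must all be positive
    w = [mi / bi for mi, bi in zip(m, b)]               # B^{-1} m
    u = [sum(k * wi for k, wi in zip(k_row, w)) for k_row in K]
    W = [[sum(k1 * k2 / bi for k1, k2, bi in zip(r1, r2, b)) for r2 in K]
         for r1 in K]                                   # K B^{-1} K^T
    return sum(ui * yi for ui, yi in zip(u, solve(W, u)))


# Hypothetical divergence matrix (rows: sequence 1, columns: sequence 2).
N = [[20, 5, 3, 2],
     [1, 25, 4, 6],
     [3, 2, 15, 7],
     [4, 2, 3, 30]]
s_i = internal_symmetry_statistic(N, K)
print(CKt, round(s_i, 4))
```

For this table the value agrees with the difference between Bowker's and Stuart's statistics, as the partition requires.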

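A test statistic for marginal symmetry closely related to Stuart's (1955) test statistic was presented by Bhapkar (1966) as

$$S_{Bh}^2 = d^T G^{-1} d,$$
where G is simply the estimated covariance matrix of d,

$$G = V - \frac{1}{n} d d^T.$$
Noting that

$$G^{-1} = V^{-1} + \frac{V^{-1} d d^T V^{-1}}{n - d^T V^{-1} d},$$
it can be seen, as was shown by Ireland et al. (1969), that $S_{Bh}^2 = S_S^2 / (1 - S_S^2/n)$, where $S_S^2 = d^T V^{-1} d$ is Stuart's statistic and n is the total number of observations.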
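The relation between the two statistics can be verified numerically. The sketch below, restricted to a 3 × 3 table so that the 2 × 2 matrices can be inverted explicitly, computes Stuart's statistic from V and Bhapkar's statistic from $G = V - dd^T/n$, and compares the latter with $S/(1 - S/n)$, where S is Stuart's statistic. The function name and the example table are assumptions made for illustration.

```python
def stuart_and_bhapkar_3x3(table):
    """Return (Stuart, Bhapkar) statistics for a 3 x 3 table of counts."""
    n = sum(sum(row) for row in table)
    rows = [sum(row) for row in table]
    cols = [sum(row[j] for row in table) for j in range(3)]
    # Marginal differences for the first two categories.
    d = [rows[0] - cols[0], rows[1] - cols[1]]
    off = -(table[0][1] + table[1][0])
    # Covariance of d estimated under marginal symmetry (Stuart).
    V = [[rows[0] + cols[0] - 2 * table[0][0], off],
         [off, rows[1] + cols[1] - 2 * table[1][1]]]
    # Covariance of d estimated without that restriction (Bhapkar).
    G = [[V[a][b] - d[a] * d[b] / n for b in range(2)] for a in range(2)]

    def quad_form_inverse(M, v):
        """v^T M^{-1} v for a 2 x 2 matrix M, using the explicit inverse."""
        det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
        return (M[1][1] * v[0] ** 2
                - (M[0][1] + M[1][0]) * v[0] * v[1]
                + M[0][0] * v[1] ** 2) / det

    return quad_form_inverse(V, d), quad_form_inverse(G, d)


# Hypothetical 3 x 3 table of matched counts (78 observations).
table = [[20, 5, 3],
         [1, 25, 4],
         [3, 2, 15]]
stuart, bhapkar = stuart_and_bhapkar_3x3(table)
n = sum(sum(row) for row in table)
print(round(stuart, 4), round(bhapkar, 4), round(stuart / (1 - stuart / n), 4))
```

Because $1 - S/n$ is smaller than 1, Bhapkar's statistic is never smaller than Stuart's, and the two converge as n grows.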

2.2 Tests with more than two matched observations
The simplest extension to k matched observations is to obtain tests for all pairs of observations as in the last section. Of course, as k increases this leads to problems of multiple comparisons, so we need to interpret the p-values with some care. This simple approach enables us, however, to find observations that match on the basis of symmetry, marginal symmetry and internal symmetry. The p-values can be set out in a two-way table for all pairs, giving a useful method of grouping the observations, even though there are multiple comparison problems. This will be illustrated for nucleotide sequences later.
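The pairwise approach can be sketched as follows: build the divergence matrix for every pair of aligned sequences and collect the p-values in a two-way table. The example below does this for Bowker's test only, using the Python standard library. The helper names, the toy alignment, the decision to skip sites containing characters outside A, C, G, T, and the handling of empty cell pairs are all assumptions made for illustration.

```python
import math
from itertools import combinations


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    z = x / 2.0
    a, q = (1.0, math.exp(-z)) if df % 2 == 0 else (0.5, math.erfc(math.sqrt(z)))
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def divergence_matrix(seq1, seq2, alphabet="ACGT"):
    """Count how often each pair of characters occurs at the same site."""
    index = {c: i for i, c in enumerate(alphabet)}
    N = [[0] * len(alphabet) for _ in alphabet]
    for a, b in zip(seq1, seq2):
        if a in index and b in index:   # assumption: skip gaps and ambiguity codes
            N[index[a]][index[b]] += 1
    return N


def pairwise_bowker_pvalues(alignment):
    """Map each pair of sequence names to the p-value of Bowker's test."""
    pvalues = {}
    for (name1, s1), (name2, s2) in combinations(alignment.items(), 2):
        N = divergence_matrix(s1, s2)
        stat, df = 0.0, 0
        for i in range(len(N)):
            for j in range(i + 1, len(N)):
                total = N[i][j] + N[j][i]
                if total:               # assumption: empty pairs are skipped
                    stat += (N[i][j] - N[j][i]) ** 2 / total
                    df += 1
        pvalues[(name1, name2)] = chi2_sf(stat, df) if df else 1.0
    return pvalues


# Toy alignment; far too short for the asymptotic approximation to be reliable.
alignment = {"s1": "ACGTACGT", "s2": "ACGTACGA", "s3": "ACGTACGT"}
table = pairwise_bowker_pvalues(alignment)
for pair, p in table.items():
    print(pair, round(p, 4))
```

With k sequences the table holds k(k − 1)/2 p-values, so a multiple-comparison adjustment may be applied before interpreting individual entries.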

We may also wish to have an overall test for marginal symmetry. Denote by Formula 1 the probability of an observation belonging to the ij-th category of the j-th variable, j = 1, ... , k, ij = 1, ... , r. Write

Formula 1
Clearly, Formula 1 are the marginal probabilities of the j-th variable. We will use similar notation for an observed table. For instance, Formula 1 represents the observed frequency or count in the $(i_1, \ldots, i_k)$-th cell of an $r^k$ table, and Formula 1 represents the total number of observed counts in the i-th category of the j-th dimension. The null hypothesis is

Formula 1
Such a test was proposed by Rzhetsky and Nei (1995) for the analysis of nucleotide sequences. We will derive an asymptotically equivalent test here and relate it to the tests for pairs given in the previous section.

Consider the case k = 3, which will have obvious extensions to any k. Let Formula 1, where Formula 1 is the number of sites in the j-th sequence for which the variable takes a value 1, ..., r. We can then write the expectation and covariance matrix of n as

Formula 1
and

Formula 1
where

Formula 1
where $\sum^*$ denotes summation over $\ell \neq j$ or $j'$. Let

Formula 1
and

Formula 1
where $I_h$ is an h × h identity matrix and $0_{h,k}$ denotes an h × k matrix of zeros. Now put d = HLn; multiplication by L compares sequences 1 and 2 and sequences 1 and 3, while multiplication by H selects the first r − 1 values, thus giving exactly 2(r − 1) contrasts, which have a covariance matrix of full rank. This generalizes d of Equation (1).

We consider the hypothesis

Formula 1
Under H0S,

Formula 1
and

Formula 1
V can be estimated by Formula 1, which can be obtained by replacing fj by Formula 1 and Formula 1 by Formula 1, where Formula 1 is the observed r x r matrix of observations for each pair of sequences j and j'. Then, to test H0S we can use the statistic

$$T_S = d^T \hat{V}^{-1} d.$$
Under $H_{0S}$, this is asymptotically distributed as a $\chi^2_{2(r-1)}$ variate. Equivalently, we can use the computationally simpler,

Formula 1
where

Formula 1
for Formula 1 and 1r is a vector of length r with all elements 1.

The method developed here is described in general terms but can be used to analyze molecular sequence data. The method differs slightly from that of Rzhetsky and Nei (1995) by estimating the covariance matrix under H0S instead of estimating it under the general model. For the case k = 2, Rzhetsky and Nei's (1995) test statistic is that of Bhapkar (1966), while the test statistic considered here is just that of Stuart (1955).


    3 RESULTS
3.1 Analyses of simulated sequence data
Consider a nucleotide sequence of n sites, where each site evolves independently according to the same Markov process, X, which takes values in discrete space {1, 2, 3, 4} at any point in continuous time. The value of X at time t is denoted by X(t). Assuming that the sites change independently according to the same Markov process and the conditional probabilities of change remain constant over time, we can describe the substitution process X(t), t ≥ 0, by the transition function

$$P_{ij}(t) = P(X(t) = j \mid X(0) = i),$$
where $P_{ij}(t)$ is the probability that the nucleotide is j at time t, given that it was i at time 0. Let $R_{ij}$ be the instantaneous transition rate from nucleotide i to nucleotide j and let R be the matrix of these rates. Then, representing $P_{ij}(t)$ in matrix notation as P(t),

$$P(t) = e^{Rt}. \tag{2}$$
Here R is a time-independent rate matrix satisfying $R_{ij} \geq 0$, $i \neq j$, $R_{ii} = -\sum_{j \neq i} R_{ij}$, so R1 = 0, where $1^T = (1, 1, 1, 1)$ and $0^T = (0, 0, 0, 0)$ and $\pi^T R = 0^T$, where $\pi^T = (\pi_1, \pi_2, \pi_3, \pi_4)$ is the stationary distribution.
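The transition probabilities can be obtained from a rate matrix as the matrix exponential of Rt. The sketch below approximates it with a Taylor series combined with scaling and squaring, using only the Python standard library; it is intended for small matrices and moderate values of t, not as a general-purpose routine. The Jukes–Cantor-type rate matrix used in the example (all off-diagonal rates equal to 1/3) is an assumption chosen because its transition probabilities are known in closed form.

```python
import math


def mat_mul(A, B):
    """Product of two square matrices given as lists of lists."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def transition_matrix(R, t, squarings=10, terms=12):
    """Approximate exp(R t) by scaling and squaring a truncated Taylor series."""
    n = len(R)
    scale = t / 2 ** squarings
    A = [[R[i][j] * scale for j in range(n)] for i in range(n)]
    P = [[float(i == j) for j in range(n)] for i in range(n)]   # identity
    term = [row[:] for row in P]
    for k in range(1, terms + 1):
        # term becomes A^k / k!
        term = [[x / k for x in row] for row in mat_mul(term, A)]
        P = [[P[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        P = mat_mul(P, P)   # undo the scaling: exp(A)^(2^squarings)
    return P


# Assumed example: Jukes-Cantor-type rate matrix with rows summing to zero.
R = [[-1.0 if i == j else 1.0 / 3.0 for j in range(4)] for i in range(4)]
P = transition_matrix(R, 1.0)
same = 0.25 + 0.75 * math.exp(-4.0 / 3.0)   # closed form for P_ii(1)
diff = 0.25 - 0.25 * math.exp(-4.0 / 3.0)   # closed form for P_ij(1), i != j
print(round(P[0][0], 6), round(same, 6), round(P[0][1], 6), round(diff, 6))
```

Each row of the result sums to 1, and the uniform distribution is left unchanged by this P(t), consistent with it being the stationary distribution of the assumed R.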

Now suppose there are k matched nucleotide sequences of length n derived from a common ancestor. At each nucleotide site consider the Markov processes giving rise to X1(t), ... , Xk(t) at time t. We can generalize the single nucleotide case to this situation by noting that we have X1(0) = ··· = Xk(0) at time t = 0. If the two edges of the tree starting at this ancestral node are of lengths t1 and t2 and split the taxa into groups X1(t1) = ··· = Xm(t1) and Xm+1(t2) = ··· = Xk(t2), then the joint probability of the processes at these nodes is

$$F(t) = P_A(t_1)^T F(0) P_B(t_2), \tag{3}$$
where A and B here denote the two groups and $F(0) = \mathrm{diag}(f_{01}, \ldots, f_{04})$ for $f_{0i} = P(X_1(0) = i)$, i = 1, 2, 3, 4. Note that we have a homogeneous process between any time points, here between the ancestral or root node at t = 0 and the nodes at $t_1$ or $t_2$, but we permit different transition probabilities for the processes in the two groups. This can be repeated at the time point $t_1$, using in place of F(0) the diagonal matrix of the conditional probabilities at the node corresponding to A, given the values taken at the node corresponding to B. Multiplication by the marginal probabilities at the node corresponding to B gives a $4^3$ array of joint probabilities of the nodes deriving from A and the node corresponding to B. This is continued until all groups have just one member. This permits us to generate the entire $4^k$ array of probabilities

Formula 3
The whole process can be represented by a rooted tree with nodes at the values at each split of the groups. The order in which the groups split gives the topology of the tree and the times give the lengths of edges between nodes.

The approach outlined above can be used to generate matched nucleotide sequences under controlled conditions (for more details, see Ababneh et al., 2006). Sequences were generated in this manner to illustrate the performance of the tests for marginal symmetry and internal symmetry.

EXAMPLE 1
Consider two matched sequences generated under the model (2) and (3) with the same time-independent rate matrix

Formula 3
which implies that $\pi^T = (0.25, 0.25, 0.25, 0.25)$ is the stationary distribution of the process, but with $f_0 = (0.2, 0.2, 0.2, 0.4)^T$, and with $t_1 = t_2 = 1$. If we simulate the evolution of two nucleotide sequences of length 1000 using the methods of Ababneh et al. (2006), then we can obtain the divergence matrix $N = (n_{ij})$, where $n_{ij}$ is the number of times the pair of nucleotides (i, j) occurs at the same site in the two sequences, and apply the tests. Doing so 1000 times, we obtained p-values for all tests that were uniformly distributed on (0, 1), as expected, illustrating that under homogeneity (i.e. $R_A = R_B$) the tests are unable to detect that the sequences had evolved under non-stationary conditions.

EXAMPLE 2
Consider the simplest case of non-homogeneity. If $R_2 = \rho R_1$, $t_1 = t_2 = 1$, but $f_0 \neq \pi$, as in the previous example, then we might expect the test for marginal symmetry to indicate lack of symmetry and the test for internal symmetry to give no evidence of an effect. This indeed occurred: a simulation with parameters as in the first example, but with $R_A = R$ and $R_B = \rho R$, gave uniform p-values for the test for internal symmetry, whereas 60 and 90% of the p-values for the test for marginal symmetry were <0.05 for $\rho$ equal to 3 and 5, respectively. This shows that the test for marginal symmetry can detect lack of stationarity when sequences have evolved under non-homogeneous conditions (e.g. when $R_2 = \rho R_1$).
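The contrast between the first two examples can be illustrated by computing the joint distribution of a pair of sequences directly. In the sketch below, the joint probability of observing nucleotide i in one sequence and j in the other is taken as the sum over root states k of $f_{0k} P_A(t_1)_{ki} P_B(t_2)_{kj}$. Because the rate matrix of Example 1 is not reproduced here, a Jukes–Cantor-type matrix with uniform stationary distribution is assumed in its place; the function names, the value 1/3 for the off-diagonal rates and the random seed are likewise assumptions.

```python
import math
import random


def jc_transition(rate, t):
    """P(t) for a Jukes-Cantor-type process with all off-diagonal rates `rate`."""
    e = math.exp(-4.0 * rate * t)
    return [[0.25 + 0.75 * e if i == j else 0.25 - 0.25 * e for j in range(4)]
            for i in range(4)]


def joint_distribution(f0, PA, PB):
    """F[i][j] = sum over root states k of f0[k] * PA[k][i] * PB[k][j]."""
    return [[sum(f0[k] * PA[k][i] * PB[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def simulate_divergence_matrix(F, n, seed=0):
    """Draw n independent sites from F and count the observed pairs."""
    rng = random.Random(seed)
    cells = [(i, j) for i in range(4) for j in range(4)]
    weights = [F[i][j] for i, j in cells]
    N = [[0] * 4 for _ in range(4)]
    for i, j in rng.choices(cells, weights=weights, k=n):
        N[i][j] += 1
    return N


f0 = [0.2, 0.2, 0.2, 0.4]      # root distribution, not the stationary one
rate, rho = 1.0 / 3.0, 3.0     # assumed rate; rho as in Example 2

# Same process on both edges: F is symmetric despite non-stationarity.
F_same = joint_distribution(f0, jc_transition(rate, 1.0), jc_transition(rate, 1.0))
# Second edge evolves rho times faster: the two margins differ.
F_diff = joint_distribution(f0, jc_transition(rate, 1.0),
                            jc_transition(rho * rate, 1.0))

row_margin = [sum(F_diff[i]) for i in range(4)]
col_margin = [sum(F_diff[i][j] for i in range(4)) for j in range(4)]
N = simulate_divergence_matrix(F_diff, 1000)
print([round(x, 4) for x in row_margin], [round(x, 4) for x in col_margin])
```

With equal processes on the two edges the joint distribution is exactly symmetric, so no test of symmetry can reveal the non-stationarity; with unequal rates the faster lineage is closer to the stationary composition and the margins separate.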

EXAMPLE 3
Consider a model under which the test for marginal symmetry is not significant but the test for internal symmetry shows significant differences. For simplicity we will consider the general time-reversible model for which $\Pi R$ is symmetric, where $\Pi = \mathrm{diag}(\pi)$ and $t_1 = t_2 = 1$. Consider the spectral decomposition

Formula 3
where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_4)$. By taking different $u_j$ and unequal values of $\lambda_1, \ldots, \lambda_4$ for A and B (i.e. the two sequences), while keeping stationarity ($f_0 = \pi_A = \pi_B$), we achieve a suitable model. We took $f_0 = (0.25, 0.25, 0.25, 0.25)^T$, $\lambda_1 = 0$, $\lambda_2 = 5$, $\lambda_3 = 3$, $\lambda_4 = 2$ and Formula 3 Formula 3, Formula 3, Formula 3 and we obtained $U_B$ by interchanging the second and fourth columns of $U_A$. Using this and so obtaining F(t) from (3), and then getting 1000 simulations of matched sequences of length 1000, we obtained p-values for the test for marginal symmetry, which were uniform, as expected; on the other hand, we obtained 14% of p-values < 0.05 for the test for internal symmetry. We then increased $\lambda_2$ to 10, 15 and 20, and obtained 55, 68 and 78% p-values < 0.05, respectively. This illustrates that the test for internal symmetry measures divergence from symmetry in addition to that which might be due to a lack of marginal symmetry.

EXAMPLE 4
Consider five homologous sequences generated by simulation under a model incorporating non-stationarity and non-homogeneity. We consider a model for a tree corresponding to the ‘merge’ matrix used in the hierarchical clustering algorithms given in the statistical packages Splus and R:

Formula 3
Here the numbers –1, ... , –5 refer to the leaves of the tree and the numbers 1, 2, 3, 4 refer to the rows of the matrix and to internal nodes corresponding to common ancestors of the nodes in the row, with 4 being a root node. We use the ‘heights’: 0.1, 0.5, 0.8, 1.0, corresponding to the internal nodes represented by the rows of the ‘merge’ matrix. The rooted tree can be described equivalently in the Newick format described on page 590 in Felsenstein (2004) as

Formula 3
We used $f_0^T = (0.1, 0.1, 0.1, 0.7)$ and $\pi^T = (0.25, 0.25, 0.25, 0.25)$, and for all edges leading from the root to leaves –4, –5 we used the time-independent rate matrix employed in the first example (denoted R); for the remaining edges leading from the root to leaves –1, –2, –3 we used a time-independent rate matrix equal to $\rho R$. These are simple choices that will give results qualitatively like those of the bacterial data considered in the next section.
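The correspondence between a 'merge' matrix with 'heights' and a rooted tree in Newick format can be sketched in a few lines. In the convention described above, negative entries denote leaves, positive entries denote earlier rows (internal nodes), and the last row is the root; the length of an edge is the height of the parent minus the height of the child, with leaves at height 0. The merge matrix below is a hypothetical example that is merely compatible with the description (leaves –1, –2, –3 on one side of the root and –4, –5 on the other); it is not necessarily the matrix used in the paper, and the leaf labels `s1` to `s5` are likewise assumptions.

```python
def merge_to_newick(merge, heights):
    """Convert an hclust-style merge matrix and node heights to a Newick string."""

    def build(node):
        """Return (Newick fragment, height) for a leaf (< 0) or a row (> 0)."""
        if node < 0:
            return "s%d" % -node, 0.0
        height = heights[node - 1]
        parts = []
        for child in merge[node - 1]:
            label, child_height = build(child)
            parts.append("%s:%g" % (label, height - child_height))
        return "(" + ",".join(parts) + ")", height

    root = len(merge)          # the last row is the root
    return build(root)[0] + ";"


# Hypothetical merge matrix: rows 1-4 are internal nodes, row 4 is the root.
merge = [(-1, -2),   # node 1 joins leaves 1 and 2
         (-3, 1),    # node 2 joins leaf 3 and node 1
         (-4, -5),   # node 3 joins leaves 4 and 5
         (2, 3)]     # node 4 (root) joins nodes 2 and 3
heights = [0.1, 0.5, 0.8, 1.0]
newick = merge_to_newick(merge, heights)
print(newick)
```

Every leaf ends up at distance 1.0 from the root, the height of the last row, so the tree produced in this way is ultrametric.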

We performed the overall test, using the statistic $T_S$ of Section 2.2 as well as the matched-pairs test of homogeneity, on all pairs of sequences. We generated 1000 alignments of 1000 nucleotides from this model using first $\rho = 1$, in which case all p-values were uniformly distributed, as expected, with the mean and standard deviation of $T_S$ being 11.6 and 4.3, respectively, compared with the mean and standard deviation of a $\chi^2_{12}$ variate of 12 and 4.9. These results imply that the test of marginal symmetry is unable to detect that the sequences have evolved under non-stationary conditions, when the evolutionary processes otherwise are homogeneous.

We repeated the experiment with $\rho = 10$ and obtained the results in Table 1, which shows the percentage of p-values from the test for marginal symmetry that were <0.05. The values of $T_S$ for this case led to a mean and standard deviation of 25.0 and 10.2, respectively. Clearly, the tests on all pairs gave greater information.


Table 1 Percentage of p-values of Stuart's (1955) test <0.05 (based on simulated data)

 
Here we have allowed the stationary probabilities to remain equal throughout the tree, except for the root. If we had also permitted the values of $\pi$ to differ for the edges leading to –1, –2, –3 from those leading to –4, –5, then more dramatic results would have been obtained using the test for marginal symmetry.

3.2 Analyses of bacterial sequence data
Galtier and Gouy (1995) inferred a bacterial phylogeny using the small-subunit ribosomal RNA sequences from Aquifex pyrophilus, Thermotoga maritima, Thermus thermophilus, Deinococcus radiodurans and a fifth species chosen from the following genera: Chlamydia, Spirochaeta, Bacteroides, Agrobacterium, Escherichia, Fusobacterium, Clostridium, Anabaena, Micrococcus and Bacillus. They used a Markov model that assumes that $\pi_A = \pi_T$ and $\pi_C = \pi_G$ whereas $\pi_C + \pi_G$ was allowed to vary across the tree; hence, they used a non-stationary and non-homogeneous model to infer the bacterial phylogeny.

To illustrate the use of the matched-pairs tests of homogeneity, we used essentially the same data, except that the fifth species was represented by Bacillus subtilis. The alignment consisted of 1238 sites from five species. The overall test for marginal symmetry based on Ts from Section 2.2 was applied, giving an observed value 103.4, indicating a significantly large deviation from marginal symmetry (p ≤ 0.0001). More information was obtained using the pairwise tests of symmetry, marginal symmetry and internal symmetry, which gave the p-values shown in Table 2. It is clear that all divergence matrices for the set Aquifex, Thermus and Thermotoga were symmetric, as was the divergence matrix for Bacillus and Deinococcus, and that all other divergence matrices were highly asymmetric. Further, there is no indication of internal asymmetry.


Table 2 Tests for bacterial data (A: Aquifex, B: Bacillus, D: Deinococcus, Ts: Thermus, Ta: Thermotoga)

 
The simplest model for this outcome of the tests must satisfy
  • lack of stationarity;
  • all terminal edges leading to Aquifex, Thermus and Thermotoga have the same rate matrix $R_1$;
  • all terminal edges leading to Bacillus and Deinococcus have the same rate matrix $R_2$;
  • $R_1 \neq R_2$.

We can present the fourth condition here in a simpler form if we take $R_1 = S\Pi_1$ and $R_2 = \rho S\Pi_2$, where either $\rho = 1$ and $\Pi_1 \neq \Pi_2$ or $\rho \neq 1$ and $\Pi_1 \neq \Pi_2$, and $\Pi_1$ and $\Pi_2$ are diagonal matrices with the stationary distributions of $R_1$ and $R_2$, respectively, on their diagonals. Consideration of this simple form using the same S for both $R_1$ and $R_2$ has support because the test for internal symmetry was not significant, although such an assumption is not strictly justified. We might also take S to be symmetric, making the process on each edge reversible under stationarity, although the tests do not give information on this.

Given the knowledge obtained above, an educated decision can be made about the substitution model. The results in Table 2 imply that the general time-reversible model (Lanave et al., 1984) is not sufficiently complex to account for differences in the data, and that a more general Markov model is needed. Accordingly, we chose to use the BH algorithm (Jayaswal et al., 2005), which implements the general Markov model of Barry and Hartigan (1987). Assuming that the sites are independently and identically distributed, we estimated the likelihood for all trees, including the most likely tree (Fig. 1). The tree groups the sequences in a manner where the GC-rich and GC-poor sequences are interspersed.


Fig. 1 The most likely phylogeny inferred using the general Markov model (logL = –4289.511). The edges are drawn to scale and the GC content of the sequences is included.

 
If we had ignored the implications of the results in Table 2, then we might have chosen to analyze the data using the general time-reversible model. We did so for the sake of argument. Assuming independent and identically distributed sites and using the general time-reversible model implemented in PAUP* (Swofford, 2001), we found that the most likely tree groups the sequences in a manner that is consistent with a division based solely on their GC content (Fig. 2). Given this, there is good reason to question whether the tree represents the phylogeny or a confounded estimate of the phylogeny.


Fig. 2 The most likely phylogeny inferred using the general time-reversible Markov model (logL = –4401.417). The edges are drawn to scale and the GC content of the sequences is included.

 
The difference in log likelihood between the most likely tree inferred using the general Markov model (Fig. 1) and the most likely tree inferred using the general time-reversible Markov model (Fig. 2) is large ($\Delta \log L = 111.906$). In order to determine the relative contribution of the trees and the models, we obtained the log likelihood for the second tree under the general Markov model, and found the difference in log likelihood between the two trees to be small ($\Delta \log L = 6.709$). Consequently, we can conclude that 94% of the large difference in log likelihood is due to our choice of a more appropriate Markov model to approximate the evolutionary processes across the whole tree. This choice was made on the basis of the matched-pairs tests outlined in the previous section.
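The arithmetic behind the 94% figure can be retraced in a few lines, using only the log likelihoods reported in the captions of Figures 1 and 2 and the difference of 6.709 quoted above; the variable names are illustrative.

```python
# Log likelihoods reported for the two most likely trees.
logl_general_markov = -4289.511   # general Markov model, tree of Fig. 1
logl_gtr = -4401.417              # general time-reversible model, tree of Fig. 2

# Total improvement when moving from the GTR analysis to the general Markov one.
total_gain = logl_general_markov - logl_gtr

# Part of the improvement attributable to the change of tree alone: the
# difference between the two trees when both are evaluated under the
# general Markov model (value quoted in the text).
tree_gain = 6.709

# The remainder is attributable to the change of model.
model_gain = total_gain - tree_gain
model_share = model_gain / total_gain

print(round(total_gain, 3), round(model_gain, 3), round(100 * model_share))
```

The model accounts for about 105.2 of the 111.9 log-likelihood units, which rounds to 94%.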


    4 DISCUSSION
The tests presented in this paper assume that the individual observations are independently and identically distributed. It is possible to weaken this condition in two ways. First, it is not necessary to assume that the sequences are independent samples of nucleotides. Instead we can assume that, given the sequence at the root, the Markov processes operating along divergent lineages are independent. If they are not operating independently, then the tests will not be appropriate, since the test statistics will not then have the expected asymptotic distributions. Second, we may consider a model in which some sites are invariant, in the sense that the value of the nucleotide taken at the root must remain unchanged along the descendant lineages. In this case we simply change the values of $n_{1,\ldots,1}, \ldots, n_{4,\ldots,4}$, which does not affect any of the test statistics considered here (the statistics for symmetry, marginal symmetry and internal symmetry, and $T_S$), although it does make asymptotically negligible changes to the statistics of Bhapkar (1966) and Rzhetsky and Nei (1995). More generally, if the sites evolved independently under the same stationary condition but under different homogeneous models (i.e. rate heterogeneity across sites), then the joint distribution would be a mixture of the joint distributions at each site, but would retain the symmetries of the probabilities at each site. If this were the case, then the tests presented here would retain the properties of testing for symmetry. If there is dependence between the processes at, for example, adjacent sites in protein- and RNA-coding DNA, then the test will not be valid. We note that consistency with the hypotheses of symmetry does not imply stationarity and homogeneity, but only that the data are consistent with such hypotheses. For example, if the stationary distributions at different sites differ and at each site the substitution process is stationary, then the hypotheses of symmetry will still hold, so the tests have no power to detect such differences.

The problems of lack of independence of evolution and rate heterogeneity across sites may be mitigated by partitioning the data on the basis of additional information (e.g. considering codon sites), so it is recommended that sequence data be partitioned into appropriate bins before conducting these tests.
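As one possible way of binning the data before testing, the sketch below splits an alignment of protein-coding sequences into three sub-alignments, one per codon position, so that the tests can be applied to each bin separately. It assumes that every sequence is in frame and starts at the first position of a codon; the function name and the toy alignment are illustrative.

```python
def partition_by_codon_position(alignment):
    """Split each aligned coding sequence into its three codon positions.

    `alignment` maps sequence names to aligned, in-frame sequences.
    Returns a list of three alignments (first, second, third positions).
    """
    return [{name: seq[position::3] for name, seq in alignment.items()}
            for position in range(3)]


# Toy alignment of two codons per sequence.
alignment = {"s1": "ATGGCC", "s2": "ATGGCA"}
first, second, third = partition_by_codon_position(alignment)
print(first, second, third)
```

Each bin can then be summarized in its own divergence matrices and tested on its own, at the cost of fewer sites per test.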

In the context of phylogenetic inference, the purpose of the tests considered here is to aid in the choice of substitution model. If a model incorporating stationarity and homogeneity throughout a tree is appropriate, then the hypothesis of symmetry will hold, so small p-values from our tests may be used to indicate that these assumptions do not hold and so may invalidate the use of standard methods based on these assumptions. The paired tests give further information about the subtrees associated with each pair, thus helping in the choice of the simplest substitution model for the edges joining the two leaf nodes consistent with the results of the tests. If two sequences have the same composition, then the hypothesis of marginal symmetry holds. Failure of the hypothesis of internal symmetry implies violation of the assumption of homogeneity of the Markov processes operating along the two edges joining the two leaves. This was shown in Example 4 of Section 3.

All tests proposed in this paper are approximately pivotal; i.e. the distribution of the test statistics depends neither on any model specifying the tree topology nor on the parameters of the substitution models. This contrasts with the two approaches described by Foster (2004), one of which involved a $\chi^2$ test statistic used in PAUP* (Swofford, 2001), which does not account for covariance due to the phylogeny of the sequences. This test would be appropriate if the sequences were not matched. Foster (2004) noted that this statistic does not have the standard asymptotic distribution and proposed a method where, under a hypothesized model given by a tree and a set of parameters for a Markov process producing the observed data, maximum likelihood estimates of parameters are obtained and used to simulate datasets, for each of which the test statistic is calculated. The observed statistic is then compared with the simulated distribution. This is essentially a parametric bootstrap approach to test the hypothesized model. The hypothesis of Foster (2004) is not as general as those of symmetry, as it involves a hypothesized model, but in practice the interpretation of the results will be similar to those of the tests of symmetry proposed here. Foster's (2004) second approach gives tests of correct size. Simulations have shown that the tests based on the statistic proposed here and that proposed by Foster (2004) have similar power.
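The final step of a parametric bootstrap, comparing the observed statistic with the simulated distribution, can be sketched as follows. This is not Foster's (2004) implementation. The function names are hypothetical, and the simulation step is a placeholder that draws values from a chi-square distribution with 6 degrees of freedom instead of simulating sequences on a tree under a fitted model.

```python
import random


def monte_carlo_p_value(observed_stat, simulated_stats):
    """Proportion of simulated statistics at least as large as the observed one.

    Adding 1 to numerator and denominator counts the observed value itself,
    a common convention that avoids reporting a p-value of exactly zero.
    """
    exceed = sum(1 for s in simulated_stats if s >= observed_stat)
    return (exceed + 1) / (len(simulated_stats) + 1)


def simulate_statistic(rng):
    """Placeholder for: simulate a dataset under the fitted model, then
    compute the test statistic on it. Here a chi-square variate with 6
    degrees of freedom is drawn (gamma with shape 3 and scale 2)."""
    return rng.gammavariate(3.0, 2.0)


rng = random.Random(2006)  # fixed seed so the sketch is reproducible
simulated = [simulate_statistic(rng) for _ in range(999)]
print(monte_carlo_p_value(15.8, simulated))
```

In a real analysis, `simulate_statistic` would be replaced by simulation of an alignment under the maximum likelihood estimates, followed by calculation of the chosen statistic on each simulated dataset.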

Foster's (2004) second approach is based on the idea of posterior predictive assessment, given in Gelman et al. (1996), and in a phylogenetic context by Huelsenbeck et al. (2001) and Bollback (2002). This Bayesian method uses simulation from the posterior distribution of parameters, including a set of trees and models for generation of simulated data on these trees, given the data. In most cases, the simulation is achieved indirectly using the Markov Chain Monte Carlo technique. The tests depend on a test statistic and a particular hypothesized model. In the case where the test statistic is pivotal, such as the test statistics proposed in this paper, the tests using posterior predictive assessment would be equivalent to those based on the pivotal statistic, as noted in Section 2.4 of Gelman et al. (1996).

We note that the tests considered in the preceding two paragraphs are given in a more general context and can be used to test more complex hypotheses than those involving stationarity and homogeneity, and so have a greater generality than the tests for symmetry discussed in this paper. However, the tests of symmetry, marginal symmetry and internal symmetry can be made on data prior to a phylogenetic analysis in order to aid in choice of substitution models, whereas the other tests discussed above are performed after analysis of a particular substitution model to test its adequacy.


    Acknowledgments
 
The hospitality of Patrick Forterre and Institut Pasteur is gratefully acknowledged by L.S.J. The authors also wish to thank Keith A Crandall, Simon YW Ho, Vivek Jayaswal and the reviewers for their constructive comments. F.A. was supported by a postgraduate scholarship from the Al-Hussain Bin Talal University in Jordan. This research was partly funded by a Discovery Grant (DP0453173) from the Australian Research Council.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Keith A Crandall

Received on November 23, 2005; revised on February 15, 2006; accepted on February 19, 2006

    REFERENCES

    Ababneh, F., et al. (2006) Generation of the exact distribution and simulation of matched nucleotide sequences on a phylogenetic tree. J. Math. Model. Algor. (in press).

    Agresti, A. (1990) Categorical Data Analysis. Wiley Series in Probability and Mathematical Statistics, New York.

    Barry, D. and Hartigan, J.A. (1987) Statistical analysis of hominoid molecular evolution. Stat. Sci., 2, 191–210.

    Bhapkar, V.P. (1966) A note on the equivalence of two test criteria for hypotheses in categorical data. J. Am. Stat. Assoc., 61, 228–235.

    Bollback, J.P. (2002) Bayesian model adequacy and choice in phylogenetics. Mol. Biol. Evol., 19, 1171–1180.

    Bowker, A.H. (1948) A test for symmetry in contingency tables. J. Am. Stat. Assoc., 43, 572–574.

    Felsenstein, J. (2004) Inferring Phylogenies. Sinauer Associates, Sunderland.

    Foster, P.G. (2004) Modeling compositional heterogeneity. Syst. Biol., 53, 485–495.

    Galtier, N. and Gouy, M. (1995) Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA, 92, 11317–11321.

    Gelman, A., et al. (1996) Posterior predictive assessment of model fitness via realized discrepancies. Stat. Sinica, 6, 733–807.

    Ho, S.Y.W. and Jermiin, L.S. (2004) Tracing the decay of the historical signal in biological sequence data. Syst. Biol., 53, 623–637.

    Huelsenbeck, J.P., et al. (2001) Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294, 2310–2314.

    Ireland, C., et al. (1969) Symmetry and marginal homogeneity of an r x r contingency table. J. Am. Stat. Assoc., 64, 1323–1341.

    Jayaswal, V., et al. (2005) Estimation of phylogeny using a general Markov model. Evol. Bioinf. Online, 1, 62–80.

    Jermiin, L.S., et al. (2004) The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol., 53, 638–643.

    Jukes, T.H. and Cantor, C.R. (1969) Evolution of protein molecules. In Munro, C.R. (Ed.), Mammalian Protein Metabolism. Academic Press, New York, pp. 21–132.

    Lanave, C., et al. (1984) A new method for calculating evolutionary substitution rates. J. Mol. Evol., 20, 86–93.

    Lanave, C. and Pesole, G. (1993) Stationary MARKOV processes in the evolution of biological macromolecules. Binary, 5, 191–195.

    O'Neill, M.E. (1975) Problems in Dependence. PhD Thesis, University of Sydney.

    Posada, D. and Crandall, K.A. (1998) MODELTEST: testing the model of DNA substitution. Bioinformatics, 14, 817–818.

    Rzhetsky, A. and Nei, M. (1995) Tests of applicability of several substitution models for DNA sequence data. Mol. Biol. Evol., 12, 131–151.

    Stuart, A. (1955) A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42, 412–416.

    Swofford, D.L. (2001) PAUP*: Phylogenetic Analysis Using Parsimony (* and other methods). Version 4. Sinauer Associates, Massachusetts.

    Tavaré, S. (1986) Some probabilistic and statistical problems on the analysis of DNA sequences. Lect. Math. Life Sci., 17, 57–86.

    Waddell, P.J. and Steel, M.A. (1997) General time reversible distances with unequal rates across sites: Mixing Γ and inverse Gaussian distributions with invariant sites. Mol. Phylogenet. Evol., 8, 398–414.

    Waddell, P.J., et al. (1999) Using novel phylogenetic methods to evaluate mammalian mtDNA, including amino acid-invariant sites-LogDet plus site stripping, to detect internal conflicts in the data, with special reference to the positions of hedgehog, armadillo, and elephant. Syst. Biol., 48, 31–53.

    Yang, Z. and Roberts, D. (1995) On the use of nucleotide sequences to infer early branches in the tree of life. Mol. Biol. Evol., 12, 451–458.



