Bioinformatics Advance Access originally published online on February 21, 2006
Bioinformatics 2006 22(10):1225-1231; doi:10.1093/bioinformatics/btl064
Matched-pairs tests of homogeneity with applications to homologous nucleotide sequences
1 School of Mathematics and Statistics, University of Sydney NSW 2006, Australia
2 School of Biological Sciences, University of Sydney NSW 2006, Australia
3 Sydney University Biological Informatics and Technology Centre, University of Sydney NSW 2006, Australia
4 Unité de Biologie Moléculaire de Gène chez les Extrêmophiles, Institut Pasteur 75724 Paris Cedex, France
5 Department of Mathematics and Statistics, Wichita State University Wichita, KS 67260-0033, USA
*To whom correspondence should be addressed.
| ABSTRACT |
|---|
Motivation: Most phylogenetic methods assume that the sequences of nucleotides or amino acids have evolved under stationary, reversible and homogeneous conditions. When these assumptions are violated by the data, there is an increased probability of errors in the phylogenetic estimates. Methods to examine aligned sequences for these violations are available, but they are rarely used, possibly because they are not widely known or because they are poorly understood.
Results: We describe and compare the available tests for symmetry of k-dimensional contingency tables from homologous sequences, and develop two new tests to evaluate different aspects of the evolutionary processes. For any pair of sequences, we consider a partition of the test for symmetry into a test for marginal symmetry and a test for internal symmetry. The proposed tests can be used to identify appropriate models for estimation of evolutionary relationships under a Markovian model. Simulations under more or less complex evolutionary conditions were done to display the performance of the tests. Finally, the tests were applied to an alignment of small-subunit ribosomal RNA sequences of five species of bacteria to outline the evolutionary processes under which they evolved.
Availability: Programs written in R to do the tests on nucleotides are available from http://www.maths.usyd.edu.au/u/johnr/testsym/
Contact: lars.jermiin{at}usyd.edu.au
| 1 INTRODUCTION |
|---|
Alignments of nucleotide sequences are often analyzed using substitution models of varying complexity, from the simplest Markov model (Jukes and Cantor, 1969) to the most general time-reversible Markov model (Lanave et al., 1984), which assumes stationarity, homogeneity and reversibility. Here stationarity implies that the marginal probabilities of the four nucleotides remain constant throughout the tree; homogeneity implies that the instantaneous rate matrix is constant over an edge, which may be termed local homogeneity, or constant over the tree, which may be termed global homogeneity; and reversibility implies that the process is stationary and permits us to ignore the direction of evolution (more detailed definitions are available in Jayaswal et al., 2005 and Ababneh et al., 2006). The most general Markovian model, which does not use these constraints, is that of Barry and Hartigan (1987). These models generally consider unrooted trees and assume independently and identically distributed sites, although the models only require independence conditional on values at a root, which can be taken as an internal node. Some models of intermediate complexity are described in Yang and Roberts (1995) and Foster (2004), who considered non-homogeneous models on rooted trees with $\Gamma$-distributed rate-heterogeneity across independent sites. In choosing a suitable substitution model to analyze their phylogenetic data, many researchers have chosen to employ an approach implemented in a program called ModelTest (Posada and Crandall, 1998). In so doing, they implicitly assumed that the sequences evolved under stationary, homogeneous and reversible conditions, even though this might not have been so. When model mis-specification involves using a time-reversible model to analyze sequences generated under more general conditions, the probability of errors in the phylogenetic estimate is increased (Jermiin et al., 2004); worryingly, it is also possible to infer the correct phylogeny irrespective of the fact that the phylogenetic signal has been lost through multiple substitutions at the same sites (Ho and Jermiin, 2004).
Several methods have been used to assess whether phylogenetic data can be assumed to have evolved under stationary conditions, but some of these, i.e. the commonly used ones, are flawed (reviewed in Jermiin et al., 2004). Here we describe and compare appropriate methods to determine whether phylogenetic data are consistent with evolution under stationary, homogeneous and reversible conditions.
Suppose we have k matched observations of n independently and identically distributed variables taking values in r categories. An example of such data would be an alignment of k = 5 sequences of n = 2000 nucleotides (implying that r = 4) or amino acids (implying that r = 20); other examples are discussed in, for instance, Agresti (1990, Chapters 10 and 11). Data of this nature can be summarized in k-dimensional tables with $r^k$ categories. Hypotheses of interest concern symmetry in these tables. In the particular cases of homologous nucleotide or amino acid sequences, tests of symmetry or marginal symmetry can be used to consider goodness of fit of the Markov models used to describe evolutionary processes. The importance of applying these seldom-used tests prior to phylogenetic analysis of aligned sequence data has long been recognized (Tavaré, 1986; Lanave and Pesole, 1993; Rzhetsky and Nei, 1995; Waddell and Steel, 1997; Waddell et al., 1999), but the tests have not yet been adopted by the wider scientific community.
In the simple case where k = 2, matched-pairs tests can be used to test for symmetry and marginal symmetry. We will show that Bowker's (1948) $\chi^2$ test statistic for symmetry can be partitioned into two independent components, one component being Stuart's (1955) $\chi^2$ test statistic for marginal symmetry, and the other component being a $\chi^2$ test statistic for internal symmetry. This partition was formally proposed by O'Neill (1975). There are similar tests available in the case of multiplicative models (discussed, e.g. in Chapter 10 of Agresti, 1990), in which tests for symmetry are asymptotically equivalent to Bowker's (1948) test, and tests for quasi-symmetry and marginal homogeneity are related to the tests for internal symmetry and marginal homogeneity discussed here. However, it is not clear that these are asymptotically equivalent.
In the more complex cases where k > 2, a test of marginal symmetry has been formulated for analyses of nucleotide sequences by Rzhetsky and Nei (1995). We will derive a combined test for marginal symmetry of all sequences, essentially equivalent to that proposed by Rzhetsky and Nei (1995), and relate this test to tests for all pairs.
Finally, we consider a Markov model for evolution and discuss the use of these tests in deciding on appropriate topologies for a set of data assumed to be generated under the model. We obtain results by simulation that illustrate the use of the tests and we apply the tests to bacterial data that have been discussed previously, e.g. in Galtier and Gouy (1995) and Foster (2004). We finish with a discussion of the merits and limitations of the methods.
| 2 METHODS |
|---|
2.1 Decomposition of Bowker's (1948) test statistic
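Consider an r x r contingency table with the ij-th cell containing the frequency $n_{ij}$, and let $p_{ij}$ denote the corresponding cell probability. We will derive an orthogonal decomposition of the test statistic of Bowker (1948) for testing symmetry in terms of that of Stuart (1955) for testing marginal symmetry. The null hypothesis for symmetry is
$$H_{0B}: p_{ij} = p_{ji}, \quad 1 \le i < j \le r,$$
and the null hypothesis for marginal symmetry is
$$H_{0S}: p_{i\cdot} = p_{\cdot i}, \quad i = 1, \ldots, r.$$
The test statistic of Bowker (1948) for symmetry is given by
$$S_B^2 = \sum_{i<j} \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}} = m^T B^{-1} m,$$
where $m$ is the vector of length $r(r-1)/2$ with elements $n_{ij} - n_{ji}$, $i < j$, and $B$ is the diagonal matrix with the corresponding elements $n_{ij} + n_{ji}$ on its diagonal.
The test statistic of Stuart (1955) for marginal symmetry is
$$S_S^2 = d^T V^{-1} d,$$
where $d$ is the vector of length $r-1$ with elements $d_i = n_{i\cdot} - n_{\cdot i}$, and $V$ is the $(r-1) \times (r-1)$ matrix with elements $v_{ii} = n_{i\cdot} + n_{\cdot i} - 2n_{ii}$ and $v_{ij} = -(n_{ij} + n_{ji})$, $i \ne j$.
To express $S_S^2$ in terms of m, notice that d can be written as
$$d = Cm, \qquad (1)$$
where $C$ is an $(r-1) \times r(r-1)/2$ matrix, uniquely defined by (1), and of the following form for the case r = 4, with the elements of m ordered as (12, 13, 14, 23, 24, 34),
$$C = \begin{pmatrix} 1 & 1 & 1 & 0 & 0 & 0 \\ -1 & 0 & 0 & 1 & 1 & 0 \\ 0 & -1 & 0 & -1 & 0 & 1 \end{pmatrix}.$$
Note that, conditional on the elements of B, the elements of $B^{-1/2}m$ are asymptotically independent standard normal variables, under the assumption of symmetry, implying that this is also the unconditional distribution. Accordingly,
$$S_B^2 = (B^{-1/2}m)^T (B^{-1/2}m)$$
is asymptotically distributed as a $\chi^2$ variable with $r(r-1)/2$ degrees of freedom, whereas
$$S_S^2 = (B^{-1/2}m)^T P \, (B^{-1/2}m),$$
where $P = B^{1/2} C^T (C B C^T)^{-1} C B^{1/2}$ is a projection matrix of rank $r - 1$, as can be seen directly by verifying that $V = CBC^T$. Consequently,
$$S_I^2 = S_B^2 - S_S^2 = (B^{-1/2}m)^T (I - P) (B^{-1/2}m).$$
THEOREM 1
Under the hypothesis of symmetry, $H_{0B}$, $S_S^2$ and $S_I^2$ are asymptotically distributed as independent $\chi^2$ variables with $r - 1$ and $(r - 1)(r - 2)/2$ degrees of freedom, respectively. In addition, $S_S^2$ is asymptotically distributed as a $\chi^2$ variable with $r - 1$ degrees of freedom, under the null hypothesis of marginal symmetry $H_{0S}$.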
It is worth noting that the test statistic for internal symmetry is
![]() |
![]() |
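A test statistic for marginal symmetry closely related to Stuart's (1955) test statistic was presented by Bhapkar (1966) as
$$S_{Bh}^2 = d^T \hat{V}^{-1} d, \qquad \hat{V} = V - \frac{1}{n} d d^T,$$
where $d$ and $V$ are the vector of marginal differences and the covariance matrix used in Stuart's statistic $S_S^2 = d^T V^{-1} d$, and $n$ is the total number of observations. The two statistics are related by
$$S_{Bh}^2 = \frac{S_S^2}{1 - S_S^2/n}$$
(Ireland et al., 1969), so they are asymptotically equivalent under the null hypothesis of marginal symmetry.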
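The decomposition in this section can be illustrated with a short computation. The sketch below is not the authors' R program; it assumes the standard forms of the statistics (Bowker's sum over off-diagonal pairs, and Stuart's quadratic form $d^T V^{-1} d$ built from the first r − 1 categories), and the function names `solve` and `matched_pairs_statistics` are hypothetical. It returns each statistic with its nominal degrees of freedom; converting these to p-values requires a $\chi^2$ distribution routine, which is left out here.

```python
from itertools import combinations


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda i: abs(aug[i][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("V is singular; Stuart's statistic is undefined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for i in range(col + 1, n):
            factor = aug[i][col] / aug[col][col]
            for j in range(col, n + 1):
                aug[i][j] -= factor * aug[col][j]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def matched_pairs_statistics(table):
    """Return {name: (statistic, nominal degrees of freedom)} for an r x r table."""
    r = len(table)

    # Bowker: sum over i < j of (n_ij - n_ji)^2 / (n_ij + n_ji).
    # Pairs with n_ij + n_ji = 0 contribute nothing and are skipped
    # (assumption: the nominal degrees of freedom are still reported).
    bowker = 0.0
    for i, j in combinations(range(r), 2):
        total = table[i][j] + table[j][i]
        if total > 0:
            bowker += (table[i][j] - table[j][i]) ** 2 / total

    # Stuart: d^T V^{-1} d, using the first r - 1 categories.
    row = [sum(table[i]) for i in range(r)]
    col = [sum(table[i][j] for i in range(r)) for j in range(r)]
    d = [row[i] - col[i] for i in range(r - 1)]
    v = [[row[i] + col[i] - 2 * table[i][i] if i == j
          else -(table[i][j] + table[j][i])
          for j in range(r - 1)]
         for i in range(r - 1)]
    stuart = sum(di * xi for di, xi in zip(d, solve(v, d)))

    # Internal symmetry: the remainder of Bowker's statistic.
    return {
        "symmetry": (bowker, r * (r - 1) // 2),
        "marginal": (stuart, r - 1),
        "internal": (bowker - stuart, (r - 1) * (r - 2) // 2),
    }
```

For a 2 x 2 table the two statistics coincide (both reduce to McNemar's statistic), so the internal-symmetry component is zero with zero degrees of freedom; the partition only carries extra information when r > 2.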
2.2 Tests with more than two matched observations
The simplest extension to k matched observations is to obtain tests for all pairs of observations as in the last section. Of course, as k increases this leads to problems of multiple comparisons, so we need to interpret the p-values with some care. This simple approach enables us, however, to find observations that match on the basis of symmetry, marginal symmetry and internal symmetry. The p-values can be set out in a two-way table for all pairs, giving a useful method of grouping the observations, even though there are multiple comparison problems. This will be illustrated for nucleotide sequences later.
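The pairwise approach can be sketched as follows. This is a minimal illustration under stated assumptions: the alignment is a dictionary of equal-length strings, sites containing anything other than A, C, G or T in either sequence are skipped, and the helper names are hypothetical. The Bonferroni threshold is one conventional, conservative way of handling the multiple-comparison problem mentioned above; the text itself does not prescribe a correction.

```python
from itertools import combinations

NUCLEOTIDES = "ACGT"


def divergence_matrix(seq_a, seq_b, alphabet=NUCLEOTIDES):
    """Count how often each ordered pair of states occurs at the same site."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    index = {state: i for i, state in enumerate(alphabet)}
    r = len(alphabet)
    table = [[0] * r for _ in range(r)]
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        # Assumption: sites with gaps or ambiguity codes are ignored.
        if a in index and b in index:
            table[index[a]][index[b]] += 1
    return table


def pairwise_tables(alignment):
    """Return one divergence matrix per unordered pair of sequence names."""
    return {
        (first, second): divergence_matrix(alignment[first], alignment[second])
        for first, second in combinations(sorted(alignment), 2)
    }


def bonferroni_threshold(k, alpha=0.05):
    """Per-comparison significance level when all k(k-1)/2 pairs are tested."""
    return alpha / (k * (k - 1) // 2)
```

Each table returned by `pairwise_tables` can then be passed to whichever matched-pairs test is of interest, and the resulting p-values arranged in a two-way table indexed by sequence name.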
We may also wish to have an overall test for marginal symmetry. Denote by $\pi_{i_1 \ldots i_k}$ the probability of an observation belonging to the $i_j$-th category of the j-th variable, j = 1, ... , k, $i_j$ = 1, ... , r. Write
$$\pi_{ji} = \sum_{i_1, \ldots, i_k \,:\, i_j = i} \pi_{i_1 \ldots i_k}, \quad i = 1, \ldots, r,$$
so that $\pi_{j1}, \ldots, \pi_{jr}$ are the marginal probabilities of the j-th variable. We will use similar notation for an observed table. For instance, $n_{i_1 \ldots i_k}$ represents the observed frequency or count in the $i_1, \ldots, i_k$-th cell of an $r^k$ table, and $n_{ji}$ represents the total number of observed counts in the i-th category of the j-th dimension. The null hypothesis is
$$H_{0S}: \pi_{1i} = \pi_{2i} = \cdots = \pi_{ki}, \quad i = 1, \ldots, r.$$
Consider the case k = 3, which will have obvious extensions to any k. Let
, where
is the number of sites in the j-th sequence for which the variable takes a value 1, ..., r. We can then write the expectation and covariance matrix of n as
![]() |
![]() |
![]() |
* denotes summation over
j or j'. Let
![]() |
![]() |
We consider the hypothesis
![]() |
![]() |
![]() |
, which can be obtained by replacing fj by
and
by
, where
is the observed r x r matrix of observations for each pair of sequences j and j'. Then, to test H0S we can use the statistic
![]() |
variate. Equivalently, we can use the computationally simpler,
![]() |
![]() |
and 1r is a vector of length r with all elements 1. The method developed here is described in general terms but can be used to analyze molecular sequence data. The method differs slightly from that of Rzhetsky and Nei (1995) by estimating the covariance matrix under H0S instead of estimating it under the general model. For the case k = 2, Rzhetsky and Nei's (1995) test statistic is that of Bhapkar (1966), while the test statistic considered here is just that of Stuart (1955).
| 3 RESULTS |
|---|
3.1 Analyses of simulated sequence data
Consider a nucleotide sequence of n sites, where each site evolves independently according to the same Markov process, X, which takes values in the discrete space {1, 2, 3, 4} at any point in continuous time. The value of X at time t is denoted by X(t). Assuming that the conditional probabilities of change remain constant over time, we can describe the substitution process X(t), $t \ge 0$, by the transition function
$$P(t) = (p_{ij}(t)), \quad p_{ij}(t) = \Pr\{X(t) = j \mid X(0) = i\},$$
$$P(t) = e^{Rt}, \qquad (2)$$
where $R = (r_{ij})$ is the instantaneous rate matrix, with $r_{ij} \ge 0$, $i \ne j$, and $r_{ii} = -\sum_{j \ne i} r_{ij}$, so $R1 = 0$, where $1^T = (1, 1, 1, 1)$ and $0^T = (0, 0, 0, 0)$, and $\pi^T R = 0^T$, where $\pi^T = (\pi_1, \pi_2, \pi_3, \pi_4)$ is the stationary distribution.
Now suppose there are k matched nucleotide sequences of length n derived from a common ancestor. At each nucleotide site consider the Markov processes giving rise to $X_1(t), \ldots, X_k(t)$ at time t. We can generalize the single nucleotide case to this situation by noting that we have $X_1(0) = \cdots = X_k(0)$ at time t = 0. If the two edges of the tree starting at this ancestral node are of lengths $t_1$ and $t_2$ and split the taxa into groups $X_1(t_1) = \cdots = X_m(t_1)$ and $X_{m+1}(t_2) = \cdots = X_k(t_2)$, then the joint probability of the processes at these nodes is
$$F(t) = P_1(t_1)^T F(0) P_2(t_2), \qquad (3)$$
where $t = (t_1, t_2)$, $F(0) = \mathrm{diag}(f_0)$, $f_0$ is the distribution of the nucleotides at the ancestral node, and $P_1$ and $P_2$ are the transition functions for the two edges.
The approach outlined above can be used to generate matched nucleotide sequences under controlled conditions (for more details, see Ababneh et al., 2006). Sequences were generated in this manner to illustrate the performance of the tests for marginal symmetry and internal symmetry.
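A minimal sketch of this kind of simulation for two sequences is given below. It is not the procedure of Ababneh et al. (2006); it assumes that the transition matrix on an edge is the matrix exponential of the rate matrix times the edge length, and that the joint distribution of the two leaves is $P_1^T \mathrm{diag}(f_0) P_2$. The function names, the scaling-and-squaring approximation of the exponential, and the use of integer states 0 to 3 are all illustrative choices.

```python
import random


def mat_mul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transition_matrix(rate, t, squarings=10, terms=12):
    """Approximate exp(rate * t) by a Taylor series with scaling and squaring."""
    n = len(rate)
    scale = t / (2 ** squarings)
    small = [[rate[i][j] * scale for j in range(n)] for i in range(n)]
    result = [[float(i == j) for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms + 1):
        # term becomes (small^k) / k!
        term = [[x / k for x in row] for row in mat_mul(term, small)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    for _ in range(squarings):
        result = mat_mul(result, result)
    return result


def joint_distribution(f0, p1, p2):
    """Joint probabilities of the states at two leaves: P1^T diag(f0) P2."""
    n = len(f0)
    return [[sum(f0[s] * p1[s][i] * p2[s][j] for s in range(n))
             for j in range(n)]
            for i in range(n)]


def simulate_pair(f0, p1, p2, n_sites, rng):
    """Draw the root state at each site, then evolve it along both edges."""
    states = range(len(f0))
    seq_a, seq_b = [], []
    for _ in range(n_sites):
        root = rng.choices(states, weights=f0)[0]
        seq_a.append(rng.choices(states, weights=p1[root])[0])
        seq_b.append(rng.choices(states, weights=p2[root])[0])
    return seq_a, seq_b
```

Note that when the same transition matrix is used on both edges, `joint_distribution` returns a symmetric matrix whatever the value of `f0`, which is why tests of symmetry cannot detect non-stationarity when the process is otherwise homogeneous.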
EXAMPLE 1
Consider two matched sequences generated under the model (2) and (3) with the same time-independent rate matrix, $R$, on both edges, for which $\pi^T = (0.25, 0.25, 0.25, 0.25)$ is the stationary distribution of the process, but with $f_0 = (0.2, 0.2, 0.2, 0.4)^T$, and with $t_1 = t_2 = 1$. If we simulate the evolution of two nucleotide sequences of length 1000 using the methods of Ababneh et al. (2006), then we can obtain the divergence matrix $N = (n_{ij})$, where $n_{ij}$ is the number of times the pair of nucleotides (i, j) occurs at the same site in the two sequences, and apply the tests. Doing so 1000 times, we obtained p-values for all tests that were uniformly distributed on (0, 1), as expected, illustrating that under homogeneity (i.e. $R_A = R_B$) the tests are unable to detect that the sequences had evolved under non-stationary conditions.
EXAMPLE 2
Consider the simplest case of non-homogeneity. If $R_2 = \alpha R_1$, $t_1 = t_2 = 1$, but $f_0 \ne \pi$, as in the previous example, then we might expect the test for marginal symmetry to indicate lack of symmetry and the test for internal symmetry to give no evidence of an effect. This indeed occurred: a simulation taking parameters as in the first example, with $R_A = R$ and $R_B = \alpha R$, gave uniform p-values for the test for internal symmetry but 60 and 90% of p-values <0.05 in the test for marginal symmetry, for $\alpha$ equal to 3 and 5, respectively. This shows that the test for marginal symmetry can detect lack of stationarity when sequences have evolved under non-homogeneous conditions (e.g. when $R_2 = \alpha R_1$).
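The reason the marginal test gains power here can be seen from the expected nucleotide frequencies of the two sequences. The rate matrix used in the examples is not reproduced in this text, so the sketch below substitutes the Jukes-Cantor rate matrix as a stand-in (an assumption; it shares the uniform stationary distribution), for which the transition probabilities have a closed form. The multiplier is called `alpha` here, and the function names are hypothetical.

```python
import math


def jc_transition(t, rate=1.0):
    """Closed-form Jukes-Cantor transition matrix.

    Assumed rate matrix: off-diagonal entries rate / 3, diagonal entries -rate.
    """
    decay = math.exp(-4.0 * rate * t / 3.0)
    same = 0.25 + 0.75 * decay
    diff = 0.25 - 0.25 * decay
    return [[same if i == j else diff for j in range(4)] for i in range(4)]


def expected_marginals(f0, alpha, t1=1.0, t2=1.0):
    """Expected nucleotide frequencies at the two leaves.

    Sequence A evolves under R for time t1; sequence B under alpha * R for t2.
    """
    p_a = jc_transition(t1, rate=1.0)
    p_b = jc_transition(t2, rate=alpha)
    marg_a = [sum(f0[s] * p_a[s][j] for s in range(4)) for j in range(4)]
    marg_b = [sum(f0[s] * p_b[s][j] for s in range(4)) for j in range(4)]
    return marg_a, marg_b
```

With `f0 = [0.2, 0.2, 0.2, 0.4]`, setting `alpha = 1` gives identical expected frequencies for the two sequences, whereas `alpha = 3` or `alpha = 5` moves the second sequence closer to the uniform stationary distribution than the first, which is the marginal asymmetry the test detects.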
EXAMPLE 3
Consider a model under which the test for marginal symmetry is not significant but the test for internal symmetry shows significant differences. For simplicity we will consider the general time-reversible model, for which $\Pi R$ is symmetric, where $\Pi = \mathrm{diag}(\pi)$, and $t_1 = t_2 = 1$. Consider the spectral decomposition of $R$, expressed through a matrix $U$ with columns $u_j$ and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_4)$. By taking different $u_j$ and unequal values of $\lambda_1, \ldots, \lambda_4$ for A and B (i.e. the two sequences), while keeping stationarity ($f_0 = \pi_A = \pi_B$), we achieve a suitable model. We took $f_0 = (0.25, 0.25, 0.25, 0.25)^T$, $\lambda_1 = 0$, $\lambda_2 = 5$, $\lambda_3 = 3$, $\lambda_4 = 2$ and a fixed matrix $U_A$, and we obtained $U_B$ by interchanging the second and fourth columns of $U_A$. Using this and so obtaining F(t) from (3), and then getting 1000 simulations of matched sequences of length 1000, we obtained p-values for the test for marginal symmetry, which were uniform, as expected; on the other hand, we obtained 14% of p-values < 0.05 for the test for internal symmetry. We then increased $\lambda_2$ to 10, 15 and 20, and obtained 55, 68 and 78% p-values < 0.05, respectively. This illustrates that the test for internal symmetry measures divergence from symmetry in addition to any divergence attributable to lack of marginal symmetry.
EXAMPLE 4
Consider five homologous sequences generated by simulation under a model incorporating non-stationarity and non-homogeneity. We consider a model for a tree corresponding to the merge matrix used in the hierarchical clustering algorithms given in the statistical packages Splus and R. Here the numbers 1, ... , 5 refer to the leaves of the tree and the numbers 1, 2, 3, 4 refer to the rows of the matrix and to internal nodes corresponding to common ancestors of the nodes in the row, with 4 being a root node. We use the heights 0.1, 0.5, 0.8 and 1.0, corresponding to the internal nodes represented by the rows of the merge matrix. The rooted tree can be described equivalently in the Newick format described on page 590 in Felsenstein (2004).
We used $f_0^T = (0.1, 0.1, 0.1, 0.7)$ and $\pi^T = (0.25, 0.25, 0.25, 0.25)$, and for all edges leading from the root to leaves 4, 5 we used the time-independent rate matrix employed in the first example (denoted $R$); for the remaining edges leading from the root to leaves 1, 2, 3 we used a time-independent rate matrix equal to $\alpha R$. These are simple choices that will give results qualitatively like those of the bacterial data considered in the next section.
We performed the overall test, using the statistic $T_s$ of Section 2.2 as well as the matched-pairs test of homogeneity, on all pairs of sequences. We generated 1000 alignments of 1000 nucleotides from this model using first $\alpha = 1$, in which case all p-values were uniformly distributed, as expected, with the mean and standard deviation of $T_s$ being 11.6 and 4.3, respectively, compared with the mean and standard deviation of a $\chi^2_{12}$ variate of 12 and 4.9. These results imply that the test of marginal symmetry is unable to detect that the sequences have evolved under non-stationary conditions, when the evolutionary processes otherwise are homogeneous.
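The reference values quoted for the $\chi^2$ variate can be checked directly. The short derivation below assumes that the overall statistic has $\nu = (k-1)(r-1)$ degrees of freedom, which is consistent with the quoted values for k = 5 sequences and r = 4 nucleotides but is not stated explicitly in the surviving text.

```latex
\nu = (k-1)(r-1) = (5-1)(4-1) = 12, \qquad
\mathrm{E}\!\left[\chi^2_{\nu}\right] = \nu = 12, \qquad
\mathrm{sd}\!\left[\chi^2_{\nu}\right] = \sqrt{2\nu} = \sqrt{24} \approx 4.9 .
```

The simulated mean of 11.6 and standard deviation of 4.3 are close to these values, as expected under the null hypothesis.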
We repeated the experiment with $\alpha = 10$ and obtained the results in Table 1, which shows the percentage of p-values from the test for marginal symmetry that were <0.05. The values of $T_s$ for this case led to a mean and standard deviation of 25.0 and 10.2, respectively. Clearly, the tests on all pairs gave greater information.
Here we have allowed the stationary probabilities to remain equal throughout the tree, except for the root. If we had also permitted the values of $\pi$ to differ for the edges leading to 1, 2, 3 from those leading to 4, 5, then more dramatic results would have been obtained using the test for marginal symmetry.
3.2 Analyses of bacterial sequence data
Galtier and Gouy (1995) inferred a bacterial phylogeny using the small-subunit ribosomal RNA sequences from Aquifex pyrophilus, Thermotoga maritima, Thermus thermophilus, Deinococcus radiodurans and a fifth species chosen from the following genera: Chlamydia, Spirochaeta, Bacteroides, Agrobacterium, Escherichia, Fusobacterium, Clostridium, Anabaena, Micrococcus and Bacillus. They used a Markov model that assumes that $\pi_A = \pi_T$ and $\pi_C = \pi_G$, whereas $\pi_C + \pi_G$ was allowed to vary across the tree; hence, they used a non-stationary and non-homogeneous model to infer the bacterial phylogeny.
To illustrate the use of the matched-pairs tests of homogeneity, we used essentially the same data, except that the fifth species was represented by Bacillus subtilis. The alignment consisted of 1238 sites from five species. The overall test for marginal symmetry based on $T_s$ from Section 2.2 was applied, giving an observed value of 103.4, indicating a significantly large deviation from marginal symmetry (p < 0.0001). More information was obtained using the pairwise tests of symmetry, marginal symmetry and internal symmetry, which gave the p-values shown in Table 2. It is clear that all divergence matrices for the set Aquifex, Thermus and Thermotoga were symmetric, as was the divergence matrix for Bacillus and Deinococcus, and that all other divergence matrices were highly asymmetric. Further, there is no indication of internal asymmetry.
The simplest model for this outcome of the tests must satisfy
- lack of stationarity;
- all terminal edges leading to Aquifex, Thermus and Thermotoga have the same rate matrix $R_1$;
- all terminal edges leading to Bacillus and Deinococcus have the same rate matrix $R_2$;
- $R_1 \ne R_2$.
We can present the fourth condition here in a simpler form if we take $R_1 = S\Pi_1$ and $R_2 = \alpha S\Pi_2$, where either $\alpha = 1$ and $\Pi_1 \ne \Pi_2$, or $\alpha \ne 1$, and $\Pi_1$ and $\Pi_2$ are diagonal matrices with the stationary distributions of $R_1$ and $R_2$, respectively, on their diagonals. Consideration of this simple form using the same S for both $R_1$ and $R_2$ has support because the test for internal symmetry was not significant, although such an assumption is not strictly justified. We might also take S to be symmetric, making the process on each edge reversible under stationarity, although the tests do not give information on this.
Given the knowledge obtained above, an educated decision can be made about the substitution model. The results in Table 2 imply that the general time-reversible model (Lanave et al., 1984) is not sufficiently complex to account for differences in the data, and that a more general Markov model is needed. Accordingly, we chose to use the BH algorithm (Jayaswal et al., 2005), which implements the general Markov model of Barry and Hartigan (1987). Assuming that the sites are independently and identically distributed, we estimated the likelihood for all trees, including the most likely tree (Fig. 1). The tree groups the sequences in a manner where the GC-rich and GC-poor sequences are interspersed.
If we had ignored the implications of the results in Table 2, then we might have chosen to analyze the data using the general time-reversible model. We did so for the sake of argument. Assuming independent and identically distributed sites and using the general time-reversible model implemented in PAUP* (Swofford, 2001), we found that the most likely tree groups the sequences in a manner that is consistent with a division based solely on their GC content (Fig. 2). Given this, there is good reason to question whether the tree represents the phylogeny or a confounded estimate of the phylogeny.
The difference in log likelihood between the most likely tree inferred using the general Markov model (Fig. 1) and the most likely tree inferred using the general time-reversible Markov model (Fig. 2) is large (logL = 111.906). In order to determine the relative contribution of the trees and the models, we obtained the log likelihood for the second tree under the general Markov model, and found the difference in log likelihood between the two trees to be small (logL = 6.709). Consequently, we can conclude that 94% of the large difference in log likelihood is due to our choice of a more appropriate Markov model to approximate the evolutionary processes across the whole tree. This choice was made on the basis of the matched-pairs tests outlined in the previous section.
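The 94% figure follows from the two log-likelihood differences quoted above: the part of the total difference that is not explained by the change of tree is attributed to the change of model.

```latex
\frac{111.906 - 6.709}{111.906} = \frac{105.197}{111.906} \approx 0.940 ,
\qquad
\frac{6.709}{111.906} \approx 0.060 .
```

That is, about 6% of the difference is attributable to the tree and about 94% to the substitution model.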
| 4 DISCUSSION |
|---|
The tests presented in this paper assume that the individual observations are independently and identically distributed. It is possible to weaken this condition in two ways. First, it is not necessary to assume that the sequences are independent samples of nucleotides. Instead we can assume that, given the sequence at the root, the Markov processes operating along divergent lineages are independent. If they are not operating independently, then the tests will not be appropriate, since the test statistics will not then have the expected asymptotic distributions. Second, we may consider a model in which some sites are invariant, in the sense that the value of the nucleotide taken at the root must remain unchanged along the descendant lineages. In this case we simply change the values of $n_{1 \ldots 1}, \ldots, n_{4 \ldots 4}$, which does not affect any of the test statistics considered here ($S_B^2$, $S_S^2$, $S_I^2$ and $T_s$), although it does make asymptotically negligible changes to the statistics of Bhapkar (1966) and Rzhetsky and Nei (1995). More generally, if the sites evolved independently under the same stationary condition but under different homogeneous models (i.e. rate heterogeneity across sites), then the joint distribution would be a mixture of the joint distributions at each site, but would retain the symmetries of the probabilities at each site. If this were the case, then the tests presented here would retain the properties of testing for symmetry. If there is dependence between the processes at, for example, adjacent sites in protein- and RNA-coding DNA, then the tests will not be valid. We note that consistency with the hypotheses of symmetry does not imply stationarity and homogeneity, but only that the data are consistent with such hypotheses. For example, if the stationary distributions at different sites differ and at each site the substitution process is stationary, then the hypotheses of symmetry will still hold, so the tests have no power to detect such differences. The problems of lack of independence of evolution and rate heterogeneity across sites may be mitigated by partitioning the data on the basis of additional information (e.g. considering codon sites), so it is recommended that sequence data be partitioned into appropriate bins before conducting these tests.
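Partitioning by codon position, as recommended above, can be sketched as follows. The sketch assumes a protein-coding alignment held as a dictionary of equal-length strings, a known reading frame, and gaps that do not shift the frame; the function name and the `start` parameter are hypothetical.

```python
def partition_by_codon_position(alignment, start=0):
    """Split an in-frame coding alignment into three bins, one per codon position.

    `start` is the 0-based column of the first base of the first codon
    (assumption: the reading frame is known and the same for all sequences).
    """
    bins = {1: {}, 2: {}, 3: {}}
    for name, sequence in alignment.items():
        coding = sequence[start:]
        for position in (1, 2, 3):
            # Every third column, beginning at the given codon position.
            bins[position][name] = coding[position - 1::3]
    return bins
```

Each of the three resulting sub-alignments can then be tested separately, so that sites evolving under similar constraints are analyzed together.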
In the context of phylogenetic inference, the purpose of the tests considered here is to aid in the choice of substitution model. If a model incorporating stationarity and homogeneity throughout a tree is appropriate, then the hypothesis of symmetry will hold, so small p-values from our tests may be used to indicate that these assumptions do not hold and so may invalidate the use of standard methods based on these assumptions. The paired tests give further information about the subtrees associated with each pair, thus helping in the choice of the simplest substitution model for the edges joining the two leaf nodes consistent with the results of the tests. If two sequences have the same composition, then the hypothesis of marginal symmetry holds. Failure of the hypothesis of internal symmetry implies violation of the assumption of homogeneity of the Markov processes operating along the two edges joining the two leaves. This was shown in Example 4 of Section 3.
All tests proposed in this paper are approximately pivotal; i.e. the distribution of the test statistics is neither dependent on any model specifying the tree topology nor on parameters of the substitution models. This contrasts with the two approaches described by Foster (2004), one of which involved a $\chi^2$ test statistic used in PAUP* (Swofford, 2001), which does not account for covariance due to the phylogeny of the sequences. This test would be appropriate if the sequences were not matched. Foster (2004) noted that this statistic does not have the standard asymptotic distribution and proposed a method where, under a hypothesized model given by a tree and a set of parameters for a Markov process producing the observed data, maximum likelihood estimates of parameters are obtained and used to simulate datasets, for each of which the test statistic is calculated. The observed statistic is compared with the simulated distribution. This is essentially a parametric bootstrap approach to test the hypothesized model. The hypothesis of Foster (2004) is not as general as those of symmetry, as it involves a hypothesized model, but in practice the interpretation of the results will be similar to those of the tests of symmetry proposed here. Foster's (2004) second approach gives tests of correct size. Simulations have shown that the tests based on the statistic proposed here and that proposed by Foster (2004) have similar power.
Foster's (2004) second approach is based on the idea of posterior predictive assessment, given in Gelman et al. (1996), and in a phylogenetic context by Huelsenbeck et al. (2001) and Bollback (2002). This Bayesian method uses simulation from the posterior distribution of parameters, including a set of trees and models for generation of simulated data on these trees, given the data. In most cases, the simulation is achieved indirectly using the Markov Chain Monte Carlo technique. The tests depend on a test statistic and a particular hypothesized model. In the case where the test statistic is pivotal, such as the test statistics proposed in this paper, the tests using posterior predictive assessment would be equivalent to those based on the pivotal statistic, as noted in Section 2.4 of Gelman et al. (1996).
We note that the tests considered in the preceding two paragraphs are given in a more general context and can be used to test more complex hypotheses than those involving stationarity and homogeneity, and so have a greater generality than the tests for symmetry discussed in this paper. However, the tests of symmetry, marginal symmetry and internal symmetry can be made on data prior to a phylogenetic analysis in order to aid in choice of substitution models, whereas the other tests discussed above are performed after analysis of a particular substitution model to test its adequacy.
| Acknowledgments |
|---|
The hospitality of Patrick Forterre and Institut Pasteur is gratefully acknowledged by L.S.J. The authors also wish to thank Keith A Crandall, Simon YW Ho, Vivek Jayaswal and the reviewers for their constructive comments. F.A. was supported by a postgraduate scholarship from the Al-Hussain Bin Talal University in Jordan. This research was partly funded by a Discovery Grant (DP0453173) from the Australian Research Council.
Conflict of Interest: none declared.
| FOOTNOTES |
|---|
Associate Editor: Keith A Crandall
Received on November 23, 2005; revised on February 15, 2006; accepted on February 19, 2006
| REFERENCES |
|---|
Ababneh, F., et al. (2006) Generation of the exact distribution and simulation of matched nucleotide sequences on a phylogenetic tree. J. Math. Model. Algor. (in press).
Agresti, A. (1990) Categorical Data Analysis. Wiley Series in Probability and Mathematical Statistics, New York.
Barry, D. and Hartigan, J.A. (1987) Statistical analysis of hominoid molecular evolution. Stat. Sci., 2, 191–210.
Bhapkar, V.P. (1966) A note on the equivalence of two test criteria for hypotheses in categorical data. J. Am. Stat. Assoc., 61, 228–235.
Bollback, J.P. (2002) Bayesian model adequacy and choice in phylogenetics. Mol. Biol. Evol., 19, 1171–1180.
Bowker, A.H. (1948) A test for symmetry in contingency tables. J. Am. Stat. Assoc., 43, 572–574.
Felsenstein, J. (2004) Inferring Phylogenies. Sinauer Associates, Sunderland.
Foster, P.G. (2004) Modeling compositional heterogeneity. Syst. Biol., 53, 485–495.
Galtier, N. and Gouy, M. (1995) Inferring phylogenies from DNA sequences of unequal base compositions. Proc. Natl. Acad. Sci. USA, 92, 11317–11321.
Gelman, A., et al. (1996) Posterior predictive assessment of model fitness via realized discrepancies. Stat. Sinica, 6, 733–807.
Ho, S.Y.W. and Jermiin, L.S. (2004) Tracing the decay of the historical signal in biological sequence data. Syst. Biol., 53, 623–637.
Huelsenbeck, J.P., et al. (2001) Bayesian inference of phylogeny and its impact on evolutionary biology. Science, 294, 2310–2314.
Ireland, C., et al. (1969) Symmetry and marginal homogeneity of an r x r contingency table. J. Am. Stat. Assoc., 64, 1323–1341.
Jayaswal, V., et al. (2005) Estimation of phylogeny using a general Markov model. Evol. Bioinf. Online, 1, 62–80.
Jermiin, L.S., et al. (2004) The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Syst. Biol., 53, 638–643.
Jukes, T.H. and Cantor, C.R. (1969) Evolution of protein molecules. In Munro, H.N. (Ed.), Mammalian Protein Metabolism. Academic Press, New York, pp. 21–132.
Lanave, C., et al. (1984) A new method for calculating evolutionary substitution rates. J. Mol. Evol., 20, 86–93.
Lanave, C. and Pesole, G. (1993) Stationary MARKOV processes in the evolution of biological macromolecules. Binary, 5, 191–195.
O'Neill, M.E. (1975) Problems in Dependence. PhD Thesis, University of Sydney.
Posada, D. and Crandall, K.A. (1998) MODELTEST: testing the model of DNA substitution. Bioinformatics, 14, 817–818.
Rzhetsky, A. and Nei, M. (1995) Tests of applicability of several substitution models for DNA sequence data. Mol. Biol. Evol., 12, 131–151.
Stuart, A. (1955) A test for homogeneity of the marginal distributions in a two-way classification. Biometrika, 42, 412–416.
Swofford, D.L. (2001) PAUP*: Phylogenetic Analysis Using Parsimony (* and other methods). Version 4. Sinauer Associates, Sunderland, Massachusetts.
Tavaré, S. (1986) Some probabilistic and statistical problems on the analysis of DNA sequences. Lect. Math. Life Sci., 17, 57–86.
Waddell, P.J. and Steel, M.A. (1997) General time reversible distances with unequal rates across sites: Mixing $\Gamma$ and inverse Gaussian distributions with invariant sites. Mol. Phylogenet. Evol., 8, 398–414.
Waddell, P.J., et al. (1999) Using novel phylogenetic methods to evaluate mammalian mtDNA, including amino acid-invariant sites-LogDet plus site stripping, to detect internal conflicts in the data, with special reference to the positions of hedgehog, armadillo, and elephant. Syst. Biol., 48, 31–53.
Yang, Z. and Roberts, D. (1995) On the use of nucleotide sequences to infer early branches in the tree of life. Mol. Biol. Evol., 12, 451–458.