Bioinformatics Advance Access originally published online on January 5, 2006
Bioinformatics 2006 22(6):708-715; doi:10.1093/bioinformatics/btk001
© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org
A non-linear optimization procedure to estimate distances and instantaneous substitution rate matrices under the GTR model
Daniele Catanzaro 1,
Raffaele Pesenti 2 and
Michel C. Milinkovitch 1,*
1Laboratory of Evolutionary Genetics, Institute for Molecular Biology and Medicine (IBMM), Université Libre de Bruxelles CP300, Rue Jeener et Brachet 12, B-6041, Gosselies, Belgium
2DINFO, Dipartimento di Ingegneria Informatica, University of Palermo Viale delle Scienze I-90128 Palermo, Italy
*To whom correspondence should be addressed.
 |
ABSTRACT
|
|---|
Motivation: The general-time-reversible (GTR) model is one of the most popular models of nucleotide substitution because it constitutes a good trade-off between mathematical tractability and biological reality. However, when it is applied for inferring evolutionary distances and/or instantaneous rate matrices, the GTR model seems more prone to inapplicability than more restrictive time-reversible models. Although it has been previously noted that the causes for intractability are caused by the impossibility of computing the logarithm of a matrix characterised by negative eigenvalues, the issue has not been investigated further.
Results: Here, we formally characterize the mathematical conditions, and discuss their biological interpretation, which lead to the inapplicability of the GTR model. We investigate the relations between, on one hand, the occurrence of negative eigenvalues and, on the other hand, both sequence length and sequence divergence. We then propose a possible re-formulation of previous procedures in terms of a non-linear optimization problem. We analytically investigate the effect of our approach on the estimated evolutionary distances and transition probability matrix. Finally, we provide an analysis on the goodness of the solution we propose. A numerical example is discussed.
Contact: mcmilink{at}ulb.ac.be
 |
1 INTRODUCTION
|
|---|
Currently, the GTR model of DNA sequence evolution (
Barry and Hartigan, 1987;
Felsenstein, 1984;
Lanave et al., 1984;
Lio and Goldman, 1998;
Rodriguez et al., 1990;
Tavare, 1987;
Zharkikh, 1994) is probably
one of the best available trade-off between mathematical tractability
and biological reality (
Felsenstein, 2004, pp. 210211;
Li, 1997, pp. 8186;
Page and Holmes, 1998, pp. 152156;
Swofford et al., 1996, pp. 433434;
Yang, 1994. The GTR
model describes DNA sequence evolution in terms of transition
probabilities,
pij(
t), from one nucleotide to another, and assumes
that instantaneous substitution rate matrix,
R, remains constant
over time. This stationary homogeneous Markov process can be
expressed in a matrix form using Kolmogorov differential equation
(
Lio and Goldman, 1998):
 | (1) |
where
P(
t) = {
pij(
t)} is usually referred as the transition probability
matrix and
R = {
rij} as the instantanueous substitution rate
matrix (
Felsenstein, 2004;
Lanave et al., 1984;
Lio and Goldman, 1998;
Yang, 1994). The solution of
Equation (1) is the following exponential
matrix:
 | (2) |
R is a real matrix with four non-positive eigenvalues [of which one is equal to zero (see, e.g. Lanave et al., 1984)], non-diagonal elements that must be non-negative and diagonal elements that must be the opposite of the sum of the non-diagonal elements (from the corresponding row). In turn, these conditions, together with Equation (2), imply that, for any value of t, P(t) is a real positive matrix characterised by four positive eigenvalues (of which one is equal to 1).
The GTR model also assumes reversibility: the net rate from nucleotides j to nucleotide i is equal to the net rate from i to j (Yang, 1994), i.e.
 | (3) |
From
(2) and (3) it follows (
Rodriguez et al., 1990)
 | (4) |
where

is the diagonal matrix
whose elements are the respective nucleotides equilibrium frequencies.
Equation (4) can be rewritten (e.g. when considering a pair
of aligned sequences separated by a time

), as
 | (5) |
F# is called the symmetrized form (Waddell and Steel, 1997) of the divergence matrix (Rodriguez et al., 1990) of the observed pair of sequences. Estimating
and/or R [e.g. to compute
] from aligned sequences is at the core of many methods used for phylogeny inference [e.g. maximum likelihood, distance matrix methods, invariants (Felsenstein, 2004)].
On the basis of conditions (1)(5), Rodriguez et al. (1990) showed that the evolutionary distance
between two aligned sequences and the corresponding instantaneous substitution rate matrix R can be obtained by
 | (6) |
 | (7) |
where
log(·) is the logarithmic matrix function defined for
a square matrix with positive eigenvalues, and evaluated via
diagonalization:
 | (8) |
where

and

are, respectively, the eigenvector matrix of
P and the
diagonal matrix of the eigenvalues of
P.
Rodriguez noted that this strategy is inapplicable in some cases: some conditions of inapplicability are related to the signs of the eigenvalues of P. More specifically, when at least one of the four eigenvalues of P is non-positive, the logarithm matrix function is not defined and computation of Equations (6) and (7) is not possible.
Here, starting from the seminal work of Lanave et al. (1984), Rodriguez et al. (1990), and Waddell and Steel (1997), (1) we formally characterize the mathematical and biological conditions under which the GTR model is not applicable; (2) we present sufficient criteria to a priori reject incongruent estimations of P; (3) we extend the estimation procedures proposed by Rodriguez et al. (1990) and Waddell and Steel (1997) in the form of a non-linear optimization problem that makes the GTR model always applicable; (4) we discuss the properties and the goodness of solutions obtained with our approach; (5) we suggest a procedure, different from the one proposed by Waddell and Steel (1997), to overcome the inapplicability of distance matrix methods when the entries of the distance matrix are undefined and (6) we provide a numerical example to clarify the procedure we propose.
 |
2 APPROACH
|
|---|
Here, we characterize, from both a mathematical and a biological
point of view, the GTR model and identify necessary and sufficient
conditions that
P must satisfy to allow the estimation of
R using
Equation (7).
2.1 On mathematical assumptions of the GTR model
As indicated earlier, the GTR model assumes that R is a constant matrix with a non-positive spectrum (i.e. all eigenvalues must be non-positive). In addition, let us note that, even if we relax the assumption of a constant instantaneous rate matrix R, the corresponding net transition matrix P(t) must still be characterized by positive eigenvalues. Indeed, if (1) we decompose R as the product of
and B, where B is the symmetric matrix of rates [see (Felsenstein, 2004), p. 207] and (2) we allow R to be time variant (this translates into B being time variant as
is assumed time invariant), we have

| (9) |
where the average instantaneous rate matrix

| (10) |
must still be characterized by a non-positive
spectrum as it is the product of a positive-definite matrix

times the sum of (an infinite number of) negative semi-definite
matrices
B(

). Hence, for any time

, if we observe that at least one eigenvalue of

is not positive,
P cannot be considered as a
net transition matrix of a Markovian process described by
Equation (2).
If not stated otherwise, we consider throughout the present paper that the matrices P and F# are computed from the observed pair of aligned sequences using the procedure described in Waddell and Steel (1997).
2.2 On the congruency between P and the GTR model
Here, we determine the conditions under which the net transition matrix P is congruent with the GTR model in general, and with Equation (2) in particular.
Because P =
1 F# and both
and F# are symmetric, P is characterized by positive eigenvalues if and only if both
and F# have positive eigenvalues (i.e. are positive-definite). We note also that
is a diagonal matrix with positive diagonal entries
i, hence, is characterized by positive eigenvalues, if all four types of nucleotides (A,T,C,G) are observed in the two aligned DNA sequences S
and Sß (obviously, negative entries
i would be biologically meaningless) (Keilson, 1979). Consequently, we can conclude that P is characterized by positive eigenvalues, hence is congruent with (2), if and only if F# is characterized by positive eigenvalues (i.e. is positive-definite).
As F# is symmetric, it is positive definite if and only if the Sylvester's criterion is satisfied [(Brinkhuis and Tikhomirov, 2005), p. 409] i.e.
 | (11) |
where

is a minor (a submatrix) of
F# made of its first
k rows and
k columns. Conditions (11)
are explicitly reported in
Table 1, assuming that
F# is written
as follows:
 | (12) |
2.3 Biological interpretation of Sylvester's conditions
When
k = 1 the criterion indicates that, for each state (A,T,C,G),
the two aligned sequences must exhibit the same state at minimum
one site. Note that this condition makes a five-state GTR models
[with gap as a fifth state; (
McGuire et al., 2001)]
inapplicable unless one forces any pair of sequence to always
share a gap state at minimum one character, a
condition that is not biologically meaningful. For
k > 1,
Sylvester's conditions impose constraints on (subsets of) nucleotide
transition probabilities. For example, when
k = 2, the criterion
indicates that for any pair of states 1 and 2, the geometric
mean between the frequency of homologous sites with identical
state 1 and the frequency of homologous sites with identical
state 2 must be greater than the frequency of homologous sites
exhibiting different states (1 and 2). A direct consequences
is that the risk of not satisfying that condition (hence, the
probability of obtaining negative eigenvalues of
F#) increases
with an increasing number of observed substitutions between
the two aligned sequences.
Unfortunately, given the analytical complexity of the expressions when k
3, biological interpretation of the Sylvester's conditions is much less straightforward than in the previous cases. Roughly speaking, we interpret that Sylvester's conditions when k
3 impose a relative uniformity among the different possible nucleotide transitions. To illustrate our point, let us consider, given sequences S
and Sß, that k = 3 and let us assume that f23 = 0. Then, the Sylvester's condition becomes
. If we further assume f22
f33 > 0 (i.e. f22/f33
1), then Sylvester's condition becomes
. The latter inequality implies that if f23 = 0, then it cannot occur that both f12 and f13 reach their maximum values allowed when k = 2.
2.4 Sufficient conditions to exclude incongruent estimations of F#
Here, we show how one can use the Sylvester's criterion to determine some conditions that are sufficient for a priori identification of estimated F# matrices characterized by non-positive eigenvalues. In particular, given two generic sequences S
and Sß, we prove that, for any pair of character states (nucleotides), F# cannot be estimated through the procedures proposed by Waddell and Steel (1997) (because it would be characterized by non-positive eigenvalues) when the ratio between, on one hand, the number of sites exhibiting different nucleotides in S
and Sß and, on the other hand, the length of the sequences (l12), exceeds a given threshold.
Let us consider first the Sylvester's criterion for k = 2. Let us define
 | (13) |
 | (14) |
Accordingly, if one estimates F# as in Waddell and Steel (1997), l12 =
1 +
2 is, trivially, a constant that corresponds to the overall number of sites with a nucleotide of type 1 or 2 in the two sequences under consideration. Then, Sylvester's criterion for k = 2 can be translated into
 | (15) |
From (15), we obtain
 | (16) |
where the second inequality derives from
the fact that
 | (17) |
and
such value is obtained when
 | (18) |
Hence,
we can claim that
F#, as estimated using procedure from (
Waddell and Steel, 1997),
is surely not positive definite when 2
f12 
(
l12/2) that, given
(13) and (14) implies
 | (19) |
In plain words, condition (19) indicates that, when two aligned sequences S
and Sß are characterized by a number of homologous sites with different states greater than the number of homologous sites with identical states, this condition is sufficient to affirm that the procedures suggested by Waddell and Steel (1997) cannot be used.
Following the same line of reasoning, we can use Sylvester's third and fourth conditions to prove that an F# estimated following Waddell and Steel (1997) is surely not positive definite when at least one of the following conditions is met:
 | (20) |
 | (21) |
Note that conditions (20) and (21) are not
worth checking as they can be directly derived from condition
(19) by summation.
2.5 Some notes about the Logdet distance
Similarly to GTR-corrected distances, some conditions can lead to the inapplicability of the logdet correction (Lake, 1994; Lockahart et al., 1994; Steel, 1994). By referring to the original logdet distance formulation given by Steel (1994), we interpret that the logdet correction becomes uncomputable when the determinant of the matrix F is negative (its natural logarithm would be undefined). Unfortunately, the Sylvester's criterion cannot be applied in this case because F is not symmetric. Hence, we cannot readily extend to the logdet distance the analysis described above for GTR corrections. This issue is out of the scope of the present work and warrants additional analysis.
 |
3 METHODS
|
|---|
Here, we generalize
Rodriguez et al.'s (1990) procedure [extended
in
Waddell and Steel (1997)] as a non-linear optimization problem
such that it returns estimates of
F# and
P that (1) are congruent
with
Equation (2), and (2) optimize a measure relevant to the
evolution of the sequence
S
and
Sß. Our procedure
returns the same results as
Rodriguez et al.'s (1990) when the
latter yields
F# and
P matrices characterized by a positive
spectrum.
3.1 An alternative way to estimate P
Let us consider the generic aligned homologous sequences S
and Sß, each of length L. Let us define
as the set of the four different bases (A,C,G,T) and
as the number of nucleotides of state j of S
that underwent substitution into state i of Sß. Analogously, let us define
.
We then estimate matrices P and F# by determining the net transition probabilities that maximize the probability dP(S
, Sß) of observing the two sequences, given P.
 | (22) |
This is equivalent to maximizing

. We chose this measure (among many
others possible) because it is also maximized when using
Rodriguez et al.'s (1990) procedure. When the likelihood of observing the sequences pair
is maximal,
dP(
S
,
Sß) is minimal and vice versa.
Consider now the following non-linear optimization problem:
 | (23) |
 | (24) |
 | (25) |
 | (26) |
 | (27) |
 | (28) |
 | (29) |
Constraint (24) imposes to search
for a net transition probability matrix that can be written
as the product of

and
F#: if one would only search for a positive
semi-definite net transition matrix such that its rows sum to
one, one could obtain (as solution of the optimization problem)
a matrix
P that respects all the constraints but that might
not be written as the product of two symmetric matrices. This
would cause problems when computing
R using
Rodriguez et al's (1990) procedure, because

would be unknown. Constraint (25) imposes
to search for a symmetric and positive semi-definite matrix
F#, i.e. a matrix characterised by non-negative eigenvalues.
It is implemented by imposing conditions (11). Constraints (26)
and (27) impose the condition of normalization on the rows of
the solution and on the nucleotide frequencies. Constraints
(28) and (29) trivially impose that the elements
fij of the
solution of the problem are non-negative and that the variables
j are included in given sets
Qj of feasible values for nucleotide
equilibrium frequencies.
The solution of problem (24)(29) depends on which values are assumed included in the set Qj. One reasonable choice is to directly use the average of the observed frequencies of state j in the two homologous sequences. One alternative would be that the set Qj includes all possible values between the observed frequencies of nucleotide j in each of the two homologous sequences. Using the first option (in accordance with Rodriguez et al's (1990) approach),
j can be computed using
 | (30) |
This choice is a reasonable tradeoff
between generalisation of the model and the efficiency of solving
it algorithmically. Indeed, when

is assigned, both the objective
function and the set of feasible solutions are convex, such
that the solution of the problem (23) is unique (
Papadimitriou and Steiglitz, 1998,
p. 15). Conversely, if neither
F# nor

are assigned, the set
of feasible solutions is not convex, such that the optimal solution
might not be unique anymore. Hence, when

is assigned the optimization
problem (23)(29) becomes:
 | (31) |
 | (32) |
 | (33) |
 | (34) |
 | (35) |
The
above problem can be solved by applying any standard non-linear
optimization technique [see, e.g. (
Bertsekas, 1999)].
Note that the optimal solution
to problem (31)(35) might lie either inside the set of feasible solutions defined by the constrains or along its boundary. In the former case, F# and
are each characterized by a positive spectrum, and these correspond, respectively, to the symmetrized form of the divergency matrix and to the net transition matrix obtained by applying the procedures from Rodriguez et al. (1990) and Waddell and Steel (1997). In the latter case (optimal solution along the boundary of the set of feasible solutions), F# and
have positive spectra but at least one eigenvalue is equal to 0. The presence of null eigenvalue(s) implies that
In this latter case, note that, despite
we can still estimate R using
 | (36) |
where

with
I,
V and

that, respectively are: the identical
matrix; the matrix of the eigenvectors of

; and the diagonal matrix of the eigenvalues
of matrix

, i.e.,

. Roughly speaking,

differs from

only by changing the eigenvalues
i of the latter into

in the former.
Let us finally observe that, if three eigenvalues of
are equal to 0, then
 | (37) |
where
1 is a matrix whose elements
are all equal to 1. Actually, as shown by (
Lanave et al., 1984),
the matrix
R can be decomposed into:
 | (38) |
where

and

are vectors forming an orthogonal base. Then, using (2), the transition probabilities
pij(
t) take the following form
 | (39) |
By applying the spectral decomposition theorem
to

, we obtain:
 | (40) |
where

and

are the left and right eigenvectors. From (39) and (40) follows
(37):
 | (41) |
 | (42) |
Cases in which just one or two eigenvalues of
are equal to zero, are particularly interesting. Indeed, if
, all the eigenvalues different from 1 should be equal to 0. Let us observe that using our model, or Rodriguez et al.'s (1990) model, one estimates continuous probability values on the basis of a finite dataset made of discrete characters (roughly speaking, we count how many sites have been subjected to substitution). Hence, the distribution of potential optimal solutions
(and, consequently, of times
) of the problem (31)(35) is not continuous. Therefore, in cases with just one or two null eigenvalues, increasing the length of the sequences to infinity would generate slightly different proportions of nucleotide substitution and a different net transition matrix, close to
, but with all eigenvalues different from zero (although one or two of them very small). As a consequence, if
obtained from an observed (finite) dataset presents only one or two null eigenvalues, we conclude that
should be considered a great, but not an infinite, value.
3.2 Consequences on evolutionary distances
Application of the non-linear optimization techniques described above has major consequences on evolutionary distances. The spectrum of the net transition matrix can be considered as a measure of the level of divergence between the two analyzed sequences: the difference between the greater and the smaller eigenvalues of Rt tends to zero as the evolutionary distance between the two observed sequences becomes smaller. Conversely, the difference between the greater and the smaller eigenvalues of Rt tends to infinite when the evolutionary distance between the two sequences increases.
Hence, the use of distance matrix methods remains restricted to cases characterized by a positive definite matrix F#, i.e. for pairs of sequences characterized by relatively small divergences. As, from a biological point of view, undefined (i.e. infinite) evolutionary distances are meaningless, a practical solution to overcome such a problem would be to replace
with an appropriate
, e.g. choosing
, where
2 is the smaller non-null eigenvalue of
. The rationale behind such a choice is that it is generally accepted (e.g. in experimental physics) what follows. Consider two exponential processes whose asymptotic values are zero (as it occurs in our case) and such that the second process decays five times as fast as the first one. By the time the first process has assumed a value that is 1/e times its initial value, the second process has practically reached its steady state, i.e. its value is less than one hundredth of its initial value.
Furthermore, this solution has the advantage of assigning different distances for different pairs of sequences initially characterized by undefined distances (i.e. the procedure we propose provides estimates of distances) and, therefore, it might generally lead to better results than when using a fixed arbitrary value (Waddell and Steel, 1997). Note however that the approach we propose (1) introduces arbitrariness in computing the evolutionary distances (it is arbitrary to use
rather than, e.g.
) and (2) would not be applicable if also the second eigenvalue of P is negative (because
would be characterized by three null eigenvalues). However this case should be very rarely encountered (i.e. it corresponds to a very high divergency between the two sequences) and should be preceded by other problems such as ambiguities in alignment.
3.3 Goodness of solutions
Let us suggest that the net transition matrix
can be considered valid (reasonable) from a biological point of view if, by random sampling of a population of sequences evolving according to P(t) = eRt, there is a relatively high probability to pick up a pair of sequences (Sr, Ss) whose
is greater than or equal to
. By indicating with
the number of nucleotides of type i in Sx that have undergone a substitution into the type j in Sy, with N nucleotides in each sequence, such a probability
can be written as follows:
 | (43) |
where
 | (44) |
 | (45) |
 | (46) |
 | (47) |
 | (48) |
 | (49) |
and
 | (50) |
Condition (44) imposes to consider only the
indices (and therefore all the corresponding pairs of sequences)
such that the observed numbers of substitutions

and

are smaller than the respective

and

under the same transition probabilities
pij. Conditions (45) and (46) impose that the
sums of the substitutions

and

are equal to the length
N. Finally, condition (47) and (48) trivially imposes that the
number of substitutions cannot be negative.
The overall probability [cf. Equation (43)] can be interpreted as the goodness of the stochastic model to predict the likelihood of observing the actual sequences. From this point of view, P(
) is better than
because the former implies a greater
than the latter does. In fact, for P(
), condition (44) of
becomes
 | (51) |
and
 | (52) |
Constraint
(51) is weaker than constraint (44), therefore the set

(hence, the overall probability

) is greater under
P(

)
than under

.
Notice that if one aims to obtain the transition probability matrix that maximizes
, then one should consider the matrix
, i.e. the matrix characterized by all events being equiprobable (e.g. all entries = 1/4). In this situation, the set
becomes
 | (53) |
 | (54) |
 | (55) |
 | (56) |
 | (57) |
Condition (44) in this case becomes
 | (58) |
Constraint (58) is weaker than
constraints (51), (48) and (54). Hence, when

is assumed,

; i.e. the probability to obtain the observed pair of sequences
is maximum. However, one should realize that, when using

,

is always equal to 1 for any (observed or not observed) pair
of sequences, i.e. all sequences are equiprobable. Our interpretation
for this phenomenon is that the

model does not tell us anything about the substitution process.
Therefore, we suggest that, although the goodness of

and of
P(

) are higher than that
of

, the information content (regarding the substitution process of the specific pair of
observed sequences) of

is higher (depending on its rank) than those of

and
P(

).
In summary,
represents the asymptotic value of the transition probability matrix relative to any pair of sequences; P(
) represents the asymptotic value of the transition probability matrix relative to any pair of sequences characterized by a distribution
, and
is the limit transition probability matrix that best describes a substitution process [modeled by Equation (2)] for the two observed sequences characterized by a distribution
. In other words we consider that
is the most reliable matrix when the evolution of a pair of sequences is both assumed to follow (2) and characterized by negative eigenvalues of the net transition matrix P.
 |
4 DISCUSSION: A NUMERICAL EXAMPLE
|
|---|
In this section we show a numerical example of the non-linear
optimization problem described above. This example is deliberately
made of short and divergent sequences both to insure the occurrence
of negative eigenvalues and allow the reader to easily (manually)
verify each step of the procedure. Although generally not mentioned
in publications, negative eigenvalues with real data do occur
(e.g. the comparison human versus cow for the
cytochrome b gene third positions yield the following eigenvalues:
(1., 0.423996, 0.137941, 0.111731). Whether negative
eigenvalues are widespread with real data warrants further investigation.
Let us assume we want to compute the distance matrix from the following alignment:
 | (59) |
Let
us consider the first two sequences. If we use
Rodriguez et al.'s (1990) procedure as proposed in
Waddell and Steel (1997), we obtain
the following symmetrized divergency matrix
 | (60) |
Accordingly, the matrix

is
 | (61) |
with
 | (62) |
Finally, the net transition matrix
P is
 | (63) |
This matrix is characterized
by the following eigenvalues:
 | (64) |
The occurrence of a negative eigenvalue makes
Rodriguez et al.'s (1990) procedure inapplicable. Let us also
observe that, in our example, all pairwise evolutionary distances
are undefined, as each pairwise comparison leads to a
P matrix
characterized by negative eigenvalues. Following
Waddell and Steel (1997),
the distance matrix therefore is
 | (65) |
i.e. no distance matrix method can be used.
On the other hand, solving the non-linear optimization problem for the specific pair of sequences following the method we propose above, yields the net transition matrix
 | (66) |
with eigenvalues
 | (67) |
Note that

satisfies the reversibility condition. The respective matrices
P(

) and

are
 | (68) |
and
 | (69) |
As the evolutionary distance between the two first sequences is undefined, we propose (see Section 3.2) using
, such that
 | (70) |
By
iterating the above procedure for all pairwise sequence comparisons,
all entries of the distance matrix can be computed and any distance
matrix method can be applied.
Let us now observe that, when computing, from the two first sequences, the substitution rate matrix R by using (36), we obtain
 | (71) |
This matrix, with four negative non-diagonal elements, is not compatible with the description of a continuous Markov process and this result is not imputable to the limit (36). Indeed, an instantaneous substitution rate matrix of a Markov process must satisfy the so-called conservative hypothesis, requiring that (1) non-diagonal entries are non-negative and (2) the diagonal negative elements, row by row, are equal to the opposite of the sum of the non-diagonal elements.
The only explicit hypothesis of the estimation procedures proposed by (Rodriguez et al., 1990; Waddell and Steel, 1997) is reversibility as defined by Equations (4) and (5). This hypothesis is sufficient to compute
in all cases for which P is congruent with (2), but is not sufficient to guarantee that the instantaneous substitution rate matrix is characterized by non-negative non-diagonal entries.
Therefore, imposing (11) when performing the non-linear optimization problem guarantees that
is congruent with (2), but does not guarantee that
a transition probability matrix of a markovian process. In such a circumstance, we propose three possible strategies.
First, one could accept
and R (eventually with negative non-diagonal entries) as they are congruent with (2), but one then accepts the risk of loosing the conservative continuous markov chain hypothesis and, row per row, one uses the questionable interpretation that the eventual negative instantaneous substitution rates are decreasing substitutions in favor of the positive ones.
Second, one could accept
, and decompose the instantaneous substitution rate matrix into R =
B, change the sign of all negative non-diagonal entries of B; in this case, the conservative continuous markov chains hypothesis is guaranteed, but the optimality of R is lost.
Third, one could incorporate the conservative continuous markov chains hypothesis in the set of constraints of the non-linear optimization problem, i.e. we further add the constraints
and, given (7), rii < 0 and rij > 0. In this case we obtain:
 | (72) |
and the corresponding evolutionary distance
between the two first sequences is
 | (73) |
Unfortunately such an approach is very computationally intensive and provides no guarantee that a global optimal solution can be determined.
Now, let us indicate with dP(
)(S
, Sß) and
the likelihood of observing the actual pair of sequences given P(
) and
, respectively. We obtain
 | (74) |
 | (75) |
 | (76) |
This result confirms that

is the matrix that maximizes the
likelihood of observing the specific pair of sequences. Let
us compute now the overall probability to pick up a pair of
sequences (
Sr,
Ss) that yields a likelihood greater than or
equal to

when

,
P(

) or

are considered. To estimate such probabilities,
we generated a set of 100 sequences for each substitution matrix.
We then compute the proportion of (
S
,
Sß) pairs [among
all pairs (ancestral sequence, generated sequence)] that exhibit
a
dP(
S
,
Sß) greater than

,
dP(
)(
S
,
Sß) and

. The process was repeated 100 times for each
transition probability matrix

,
P(

) and

and the three populations of sequences are characterized by the following
means and variances:
 | (77) |
 | (78) |
 | (79) |
As discussed in Section 3.3,
 |
5 CONCLUSIONS AND PERSPECTIVES
|
|---|
In this paper we formally characterized the mathematical conditions
that lead to the inapplicability of the published estimation
procedures (
Rodriguez et al., 1990;
Waddell and Steel, 1997;
Yang and Kumar, 1996) to compute evolutionary distances and
instantaneous rate matrices under the GTR model of nucleotide
substitution. We provided criteria to accept or reject such
estimations and suggested biological interpretations of these
conditions. Furthermore, we extended existing procedures (
Rodriguez et al., 1990;
Waddell and Steel, 1997,
Yang and Kumar, 1996) by reformulating
them in terms of a non-linear optimization problem. We analytically
investigated the effect of our approach on estimated evolutionary
distances, the transition probability matrix, and the instantaneous
substitution rate matrix. Our formulation yields the best net
transition matrix

that (1) is congruent with (2), (2) maximizes the proposed measure
(22) and (3) best describes the substitution process for the
specific sequence pair.
For overcoming the problem of undefined evolutionary distances, we propose a procedure that is more generally applicable and more biologically meaningful than the alternative strategy proposed by Waddell and Steel (1997).
In Section (4) we have shown that the estimation procedures proposed by Rodriguez et al., 1990; and by Waddell and Steel (1997) might lead in general to an estimated R that does not respect the conservative continuous markov chains hypothesis. This phenomenon also affects the non-linear formulation we proposed: conditions (11) are necessary and sufficient to guarantee that
is congruent with (2), but are insufficient to guarantee the conservative continuous markov chains hypothesis. We suggested three possible ways to overcome such a problem.
Finally, we observed that
has a low probability
. This situation seems contradictory: the matrix that maximizes the likelihood of the observed data [Equation (22)] also minimizes the probability of generating a pair of sequences with likelihood smaller than or equal to that of the observed data (43). However, we argue that a pair of sequences requiring the computation of
can not be considered as a random draw: they generate negative eigenvalues of P. This brings us to the perspective of questioning the validity of the GTR model. First, the strength of the argument of rejecting the GTR model as a biologically valid model of nucleotide substitution depends on the frequency with which negative eigenvalues of P are generated with real data sequence pairs (and this points warrants further investigation). Second, there are fundamental limitations to the GTR model: it does not separate the mutation process from other factors influencing the substitution process. To investigate whether this would bring about the rejection of the GTR model and motivate the development of alternative, more complex, models, one needs to identify the biological conditions under which sets of observed sequences would be characterized by very low probabilities [Equation (43)] of being generated by the GTR model [described by Equation (2)]. Finally, another interesting issue would be to characterize the amount of time
for which, given
, the probability of obtaining negative eigenvalues of
is high. The answer to this question would return the range of applicability of the GTR model. The reformulation of the GTR model as non-linear optimization problem described above is implemented in version 2 of the phylogeny inference program MetaPIGA (Lemmon and Milinkovitch, 2002), available at www.ulb.ac.be/sciences/ueg/html_files/MetaPIGA.html
 |
Acknowledgments
|
|---|
We would like to thank Joseph Felsenstein and L. Gatto for helpful
general discussions regarding the GTR model, as well as C. Lanave
and Mike A. Steel for their invaluable comments on a previous
version of this manuscript.
Conflict of Interest: none declared.
 |
FOOTNOTES
|
|---|
Associate Editor: Joaquin Dopazo
Received on November 4, 2005; revised on November 29, 2005; accepted on December 9, 2005
 |
REFERENCES
|
|---|
Barry, D. and Hartigan, J.A. (1987) Statistical analysis of hominoid molecular evolution. Stat. Sci, . 2, 191207.
Bertsekas, D.P. Nonlinear Programming, (1999) 2nd edn Hardcover.
Brinkhuis, J. and Tikhomirov, V. Optimization: insights and applications, (2005) Princeton University Press.
Felsenstein, J. (1984) Distance methods for inferring phylogenies: a justification. Evolution, 38, 1624[CrossRef].
Felsenstein, J. (2004) Inferring Phylogenies. , Sunderland, UK Sinauer Associates.
Keilson, J. Markov Chain ModelsRarity and Exponentiality, (1979) , New York Springer-Verlag.
Lake, J.A. (1994) Reconstructing evolutionary trees from dna and protein sequences: Paralinear distances. Proc. Natl Acad. Sci. USA, 91, 14551459[Abstract/Free Full Text].
Lanave, C., et al. (1984) A new method for calculating evolutionary substitution rates. J. Mol. Evol, . 20, 8693[CrossRef][ISI][Medline].
Lemmon, A.R. and Milinkovitch, M.C. (2002) The metapopulation genetic algorithm: An efficient solution for the problem of large phylogeny estimation. Proc. Natl Acad. Sci. USA, 99, 1051610521[Abstract/Free Full Text].
Lio, P. and Goldman, N. (1998) Models of molecular evolution and phylogeny. Genome Res, . 8, 12331244[Abstract/Free Full Text].
Li, W.H. Molecular Evolution, (1997) , Sunderland, UK Sinauer Associates.
Lockahart, P.J., et al. (1994) Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol, . 11, 605612[ISI].
McGuire, G. (2001) Models of sequence evolution for DNA sequences containing gaps. J. Mol. Evol, . 18, (4) 481490.
Page, R.D.M. and Holmes, E.C. Molecular Evolution: A Phylogenetic Approach, (1998) , Oxford, UK Blackwell Science.
Papadimitriou, C. and Steiglitz, K. Combinatorial Optimization, Algorithm and Complexity, (1998) , Mineola, NY, USA Dover Publications.
Rodriguez, F., et al. (1990) The general stochastic model of nucleotide substitution. J. Theor. Biol, . 142, 485501[ISI][Medline].
Steel, M.A. (1994) Recovering a tree from the markov the leaf colourations it generates under a markov model. Appl. Math. Lett, . 7, 1323.
Swofford, D.L., Olsen, G.J., Waddell, P.J., Hillis, D.M. (1996) Phylogenetic inference. In Hillis, D.M., Moritz, C., Mable, B.K. (Eds.). molecular systematics, , Sunderland, UK chapter 11 Sinauer & Associates, pp. 407514.
Tavare, S. (1987) Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci, . 17, 5786.
Waddell, P.J. and Steel, M.A. (1997) General time reversible distances with unequal rates across sites: Mixing gamma and inverse gaussian distributions with invariant sites. Mol. Phylogenet. Evol, . 8, 398414[CrossRef][ISI][Medline].
Yang, Z. and Kumar, S. (1996) Approximate methods for estimating the pattern of nucleotide substitution and the variation of substitution rate among sites. Mol. Biol. Evol, . 13, 650659[Abstract].
Yang, Z. (1994) Estimating the pattern of nucleotide substitution. J. Mol. Evol, . 39, 105111[ISI][Medline].
Zharkikh, A. (1994) Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol, . 39, 315329[CrossRef][ISI][Medline].

CiteULike
Connotea
Del.icio.us What's this?