Skip Navigation


Bioinformatics Advance Access originally published online on January 5, 2006
Bioinformatics 2006 22(6):708-715; doi:10.1093/bioinformatics/btk001
This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/6/708    most recent
btk001v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Catanzaro, D.
Right arrow Articles by Milinkovitch, M. C.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Catanzaro, D.
Right arrow Articles by Milinkovitch, M. C.
Social Bookmarking
 Add to CiteULike   Add to Connotea   Add to Del.icio.us  
What's this?

© The Author 2006. Published by Oxford University Press. All rights reserved. For Permissions, please email: journals.permissions@oxfordjournals.org

A non-linear optimization procedure to estimate distances and instantaneous substitution rate matrices under the GTR model

Daniele Catanzaro 1, Raffaele Pesenti 2 and Michel C. Milinkovitch 1,*

1Laboratory of Evolutionary Genetics, Institute for Molecular Biology and Medicine (IBMM), Université Libre de Bruxelles CP300, Rue Jeener et Brachet 12, B-6041, Gosselies, Belgium
2DINFO, Dipartimento di Ingegneria Informatica, University of Palermo Viale delle Scienze I-90128 Palermo, Italy

*To whom correspondence should be addressed.


    ABSTRACT
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 DISCUSSION: A NUMERICAL...
 5 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 

Motivation: The general-time-reversible (GTR) model is one of the most popular models of nucleotide substitution because it constitutes a good trade-off between mathematical tractability and biological reality. However, when it is applied for inferring evolutionary distances and/or instantaneous rate matrices, the GTR model seems more prone to inapplicability than more restrictive time-reversible models. Although it has been previously noted that the causes for intractability are caused by the impossibility of computing the logarithm of a matrix characterised by negative eigenvalues, the issue has not been investigated further.

Results: Here, we formally characterize the mathematical conditions, and discuss their biological interpretation, which lead to the inapplicability of the GTR model. We investigate the relations between, on one hand, the occurrence of negative eigenvalues and, on the other hand, both sequence length and sequence divergence. We then propose a possible re-formulation of previous procedures in terms of a non-linear optimization problem. We analytically investigate the effect of our approach on the estimated evolutionary distances and transition probability matrix. Finally, we provide an analysis on the goodness of the solution we propose. A numerical example is discussed.

Contact: mcmilink{at}ulb.ac.be


    1 INTRODUCTION
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 DISCUSSION: A NUMERICAL...
 5 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 
Currently, the GTR model of DNA sequence evolution (Barry and Hartigan, 1987; Felsenstein, 1984; Lanave et al., 1984; Lio and Goldman, 1998; Rodriguez et al., 1990; Tavare, 1987; Zharkikh, 1994) is probably one of the best available trade-off between mathematical tractability and biological reality (Felsenstein, 2004, pp. 210–211; Li, 1997, pp. 81–86; Page and Holmes, 1998, pp. 152–156; Swofford et al., 1996, pp. 433–434; Yang, 1994. The GTR model describes DNA sequence evolution in terms of transition probabilities, pij(t), from one nucleotide to another, and assumes that instantaneous substitution rate matrix, R, remains constant over time. This stationary homogeneous Markov process can be expressed in a matrix form using Kolmogorov differential equation (Lio and Goldman, 1998):

Formula 1(1)
where P(t) = {pij(t)} is usually referred as the transition probability matrix and R = {rij} as the instantanueous substitution rate matrix (Felsenstein, 2004; Lanave et al., 1984; Lio and Goldman, 1998; Yang, 1994). The solution of Equation (1) is the following exponential matrix:

Formula 2(2)

R is a real matrix with four non-positive eigenvalues [of which one is equal to zero (see, e.g. Lanave et al., 1984)], non-diagonal elements that must be non-negative and diagonal elements that must be the opposite of the sum of the non-diagonal elements (from the corresponding row). In turn, these conditions, together with Equation (2), imply that, for any value of t, P(t) is a real positive matrix characterised by four positive eigenvalues (of which one is equal to 1).

The GTR model also assumes reversibility: the net rate from nucleotides j to nucleotide i is equal to the net rate from i to j (Yang, 1994), i.e.

Formula 3(3)
From (2) and (3) it follows (Rodriguez et al., 1990)

Formula 4(4)
where {Pi} is the diagonal matrix whose elements are the respective nucleotides equilibrium frequencies. Equation (4) can be rewritten (e.g. when considering a pair of aligned sequences separated by a time Formula 4), as

Formula 5(5)

F# is called the symmetrized form (Waddell and Steel, 1997) of the divergence matrix (Rodriguez et al., 1990) of the observed pair of sequences. Estimating Formula 5 and/or R [e.g. to compute Formula 5] from aligned sequences is at the core of many methods used for phylogeny inference [e.g. maximum likelihood, distance matrix methods, invariants (Felsenstein, 2004)].

On the basis of conditions (1)–(5), Rodriguez et al. (1990) showed that the evolutionary distance Formula 5 between two aligned sequences and the corresponding instantaneous substitution rate matrix R can be obtained by

Formula 6(6)

Formula 7(7)
where log(·) is the logarithmic matrix function defined for a square matrix with positive eigenvalues, and evaluated via diagonalization:

Formula 8(8)
where {Omega} and {Lambda} are, respectively, the eigenvector matrix of P and the diagonal matrix of the eigenvalues of P.

Rodriguez noted that this strategy is inapplicable in some cases: some conditions of inapplicability are related to the signs of the eigenvalues of P. More specifically, when at least one of the four eigenvalues of P is non-positive, the logarithm matrix function is not defined and computation of Equations (6) and (7) is not possible.

Here, starting from the seminal work of Lanave et al. (1984), Rodriguez et al. (1990), and Waddell and Steel (1997), (1) we formally characterize the mathematical and biological conditions under which the GTR model is not applicable; (2) we present sufficient criteria to a priori reject incongruent estimations of P; (3) we extend the estimation procedures proposed by Rodriguez et al. (1990) and Waddell and Steel (1997) in the form of a non-linear optimization problem that makes the GTR model always applicable; (4) we discuss the properties and the goodness of solutions obtained with our approach; (5) we suggest a procedure, different from the one proposed by Waddell and Steel (1997), to overcome the inapplicability of distance matrix methods when the entries of the distance matrix are undefined and (6) we provide a numerical example to clarify the procedure we propose.


    2 APPROACH
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 DISCUSSION: A NUMERICAL...
 5 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 
Here, we characterize, from both a mathematical and a biological point of view, the GTR model and identify necessary and sufficient conditions that P must satisfy to allow the estimation of R using Equation (7).

2.1 On mathematical assumptions of the GTR model
As indicated earlier, the GTR model assumes that R is a constant matrix with a non-positive spectrum (i.e. all eigenvalues must be non-positive). In addition, let us note that, even if we relax the assumption of a constant instantaneous rate matrix R, the corresponding net transition matrix P(t) must still be characterized by positive eigenvalues. Indeed, if (1) we decompose R as the product of {Pi} and B, where B is the symmetric matrix of rates [see (Felsenstein, 2004), p. 207] and (2) we allow R to be time variant (this translates into B being time variant as {Pi} is assumed time invariant), we have


Formula 9

(9)
where the average instantaneous rate matrix


Formula 10

(10)
must still be characterized by a non-positive spectrum as it is the product of a positive-definite matrix {Pi} times the sum of (an infinite number of) negative semi-definite matrices B({tau}). Hence, for any time Formula 8, if we observe that at least one eigenvalue of Formula 8 is not positive, P cannot be considered as a net transition matrix of a Markovian process described by Equation (2).

If not stated otherwise, we consider throughout the present paper that the matrices P and F# are computed from the observed pair of aligned sequences using the procedure described in Waddell and Steel (1997).

2.2 On the congruency between P and the GTR model
Here, we determine the conditions under which the net transition matrix P is congruent with the GTR model in general, and with Equation (2) in particular.

Because P = {Pi}–1 F# and both {Pi} and F# are symmetric, P is characterized by positive eigenvalues if and only if both {Pi} and F# have positive eigenvalues (i.e. are positive-definite). We note also that {Pi} is a diagonal matrix with positive diagonal entries {pi}i, hence, is characterized by positive eigenvalues, if all four types of nucleotides (A,T,C,G) are observed in the two aligned DNA sequences S{alpha} and Sß (obviously, negative entries {pi}i would be biologically meaningless) (Keilson, 1979). Consequently, we can conclude that P is characterized by positive eigenvalues, hence is congruent with (2), if and only if F# is characterized by positive eigenvalues (i.e. is positive-definite).

As F# is symmetric, it is positive definite if and only if the Sylvester's criterion is satisfied [(Brinkhuis and Tikhomirov, 2005), p. 409] i.e.

Formula 11(11)
where Formula 11 is a minor (a submatrix) of F# made of its first k rows and k columns. Conditions (11) are explicitly reported in Table 1, assuming that F# is written as follows:

Formula 12(12)


View this table:
[in this window]
[in a new window]
 
Table 1 The table explicitly shows the k-conditions of the Sylvester's criterion (11)

 
2.3 Biological interpretation of Sylvester's conditions
When k = 1 the criterion indicates that, for each state (A,T,C,G), the two aligned sequences must exhibit the same state at minimum one site. Note that this condition makes a five-state GTR models [with ‘gap’ as a fifth state; (McGuire et al., 2001)] inapplicable unless one forces any pair of sequence to always share a ‘gap’ state at minimum one character, a condition that is not biologically meaningful. For k > 1, Sylvester's conditions impose constraints on (subsets of) nucleotide transition probabilities. For example, when k = 2, the criterion indicates that for any pair of states 1 and 2, the geometric mean between the frequency of homologous sites with identical state 1 and the frequency of homologous sites with identical state 2 must be greater than the frequency of homologous sites exhibiting different states (1 and 2). A direct consequences is that the risk of not satisfying that condition (hence, the probability of obtaining negative eigenvalues of F#) increases with an increasing number of observed substitutions between the two aligned sequences.

Unfortunately, given the analytical complexity of the expressions when k ≥ 3, biological interpretation of the Sylvester's conditions is much less straightforward than in the previous cases. Roughly speaking, we interpret that Sylvester's conditions when k ≥ 3 impose a ‘relative’ uniformity among the different possible nucleotide transitions. To illustrate our point, let us consider, given sequences S{alpha} and Sß, that k = 3 and let us assume that f23 = 0. Then, the Sylvester's condition becomes Formula 12. If we further assume f22 ≥ f33 > 0 (i.e. f22/f33 ≥ 1), then Sylvester's condition becomes Formula 12. The latter inequality implies that if f23 = 0, then it cannot occur that both f12 and f13 reach their maximum values allowed when k = 2.

2.4 Sufficient conditions to exclude incongruent estimations of F#
Here, we show how one can use the Sylvester's criterion to determine some conditions that are sufficient for a priori identification of estimated F# matrices characterized by non-positive eigenvalues. In particular, given two generic sequences S{alpha} and Sß, we prove that, for any pair of character states (nucleotides), F# cannot be estimated through the procedures proposed by Waddell and Steel (1997) (because it would be characterized by non-positive eigenvalues) when the ratio between, on one hand, the number of sites exhibiting different nucleotides in S{alpha} and Sß and, on the other hand, the length of the sequences (l12), exceeds a given threshold.

Let us consider first the Sylvester's criterion for k = 2. Let us define

Formula 13(13)

Formula 14(14)

Accordingly, if one estimates F# as in Waddell and Steel (1997), l12 = {chi}1 + {chi}2 is, trivially, a constant that corresponds to the overall number of sites with a nucleotide of type 1 or 2 in the two sequences under consideration. Then, Sylvester's criterion for k = 2 can be translated into

Formula 15(15)
From (15), we obtain

Formula 16(16)
where the second inequality derives from the fact that

Formula 17(17)
and such value is obtained when

Formula 18(18)
Hence, we can claim that F#, as estimated using procedure from (Waddell and Steel, 1997), is surely not positive definite when 2f12 ≥ (l12/2) that, given (13) and (14) implies

Formula 19(19)

In plain words, condition (19) indicates that, when two aligned sequences S{alpha} and Sß are characterized by a number of homologous sites with different states greater than the number of homologous sites with identical states, this condition is sufficient to affirm that the procedures suggested by Waddell and Steel (1997) cannot be used.

Following the same line of reasoning, we can use Sylvester's third and fourth conditions to prove that an F# estimated following Waddell and Steel (1997) is surely not positive definite when at least one of the following conditions is met:

Formula 20(20)

Formula 21(21)
Note that conditions (20) and (21) are not worth checking as they can be directly derived from condition (19) by summation.

2.5 Some notes about the Logdet distance
Similarly to GTR-corrected distances, some conditions can lead to the inapplicability of the logdet correction (Lake, 1994; Lockahart et al., 1994; Steel, 1994). By referring to the original logdet distance formulation given by Steel (1994), we interpret that the logdet correction becomes uncomputable when the determinant of the matrix F is negative (its natural logarithm would be undefined). Unfortunately, the Sylvester's criterion cannot be applied in this case because F is not symmetric. Hence, we cannot readily extend to the logdet distance the analysis described above for GTR corrections. This issue is out of the scope of the present work and warrants additional analysis.


    3 METHODS
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 DISCUSSION: A NUMERICAL...
 5 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 
Here, we generalize Rodriguez et al.'s (1990) procedure [extended in Waddell and Steel (1997)] as a non-linear optimization problem such that it returns estimates of F# and P that (1) are congruent with Equation (2), and (2) optimize a measure relevant to the evolution of the sequence S{alpha} and Sß. Our procedure returns the same results as Rodriguez et al.'s (1990) when the latter yields F# and P matrices characterized by a positive spectrum.

3.1 An alternative way to estimate P
Let us consider the generic aligned homologous sequences S{alpha} and Sß, each of length L. Let us define {Gamma} as the set of the four different bases (A,C,G,T) and Formula 21 as the number of nucleotides of state j of S{alpha} that underwent substitution into state i of Sß. Analogously, let us define Formula 21.

We then estimate matrices P and F# by determining the net transition probabilities that maximize the probability dP(S{alpha}, Sß) of observing the two sequences, given P.

Formula 22(22)
This is equivalent to maximizing Formula 22. We chose this measure (among many others possible) because it is also maximized when using Rodriguez et al.'s (1990) procedure. When the likelihood of observing the sequences pair is maximal, dP(S{alpha}, Sß) is minimal and vice versa.

Consider now the following non-linear optimization problem:

Formula 23(23)

Formula 24(24)

Formula 25(25)

Formula 26(26)

Formula 27(27)

Formula 28(28)

Formula 29(29)
Constraint (24) imposes to search for a net transition probability matrix that can be written as the product of {Pi} and F#: if one would only search for a positive semi-definite net transition matrix such that its rows sum to one, one could obtain (as solution of the optimization problem) a matrix P that respects all the constraints but that might not be written as the product of two symmetric matrices. This would cause problems when computing R using Rodriguez et al's (1990) procedure, because {Pi} would be unknown. Constraint (25) imposes to search for a symmetric and positive semi-definite matrix F#, i.e. a matrix characterised by non-negative eigenvalues. It is implemented by imposing conditions (11). Constraints (26) and (27) impose the condition of normalization on the rows of the solution and on the nucleotide frequencies. Constraints (28) and (29) trivially impose that the elements fij of the solution of the problem are non-negative and that the variables {pi}j are included in given sets Qj of feasible values for nucleotide equilibrium frequencies.

The solution of problem (24)–(29) depends on which values are assumed included in the set Qj. One reasonable choice is to directly use the average of the observed frequencies of state j in the two homologous sequences. One alternative would be that the set Qj includes all possible values between the observed frequencies of nucleotide j in each of the two homologous sequences. Using the first option (in accordance with Rodriguez et al's (1990) approach), {pi}j can be computed using

Formula 30(30)
This choice is a reasonable trade–off between generalisation of the model and the efficiency of solving it algorithmically. Indeed, when {Pi} is assigned, both the objective function and the set of feasible solutions are convex, such that the solution of the problem (23) is unique (Papadimitriou and Steiglitz, 1998, p. 15). Conversely, if neither F# nor {Pi} are assigned, the set of feasible solutions is not convex, such that the optimal solution might not be unique anymore. Hence, when {Pi} is assigned the optimization problem (23)–(29) becomes:

Formula 31(31)

Formula 32(32)

Formula 33(33)

Formula 34(34)

Formula 35(35)
The above problem can be solved by applying any standard non-linear optimization technique [see, e.g. (Bertsekas, 1999)].

Note that the optimal solution Formula 35 to problem (31)–(35) might lie either inside the set of feasible solutions defined by the constrains or along its boundary. In the former case, F# and Formula 35 are each characterized by a positive spectrum, and these correspond, respectively, to the symmetrized form of the divergency matrix and to the net transition matrix obtained by applying the procedures from Rodriguez et al. (1990) and Waddell and Steel (1997). In the latter case (optimal solution along the boundary of the set of feasible solutions), F# and Formula 35 have positive spectra but at least one eigenvalue is equal to 0. The presence of null eigenvalue(s) implies that Formula 35

In this latter case, note that, despite Formula 35 we can still estimate R using

Formula 36(36)
where Formula 36 with I, V and {Lambda} that, respectively are: the identical matrix; the matrix of the eigenvectors of Formula 36; and the diagonal matrix of the eigenvalues of matrix Formula 36, i.e., Formula 36. Roughly speaking, Formula 36 differs from Formula 36 only by changing the eigenvalues {lambda}i of the latter into Formula 36 in the former.

Let us finally observe that, if three eigenvalues of Formula 36 are equal to 0, then

Formula 37(37)
where 1 is a matrix whose elements are all equal to 1. Actually, as shown by (Lanave et al., 1984), the matrix R can be decomposed into:

Formula 38(38)
where Formula 38 and Formula 38 are vectors forming an orthogonal base. Then, using (2), the transition probabilities pij(t) take the following form

Formula 39(39)
By applying the spectral decomposition theorem to Formula 39, we obtain:

Formula 40(40)
where Formula 40 and Formula 40 are the left and right eigenvectors. From (39) and (40) follows (37):

Formula 41(41)

Formula 42(42)

Cases in which just one or two eigenvalues of Formula 42 are equal to zero, are particularly interesting. Indeed, if Formula 42, all the eigenvalues different from 1 should be equal to 0. Let us observe that using our model, or Rodriguez et al.'s (1990) model, one estimates continuous probability values on the basis of a finite dataset made of discrete characters (roughly speaking, we count how many sites have been subjected to substitution). Hence, the distribution of potential optimal solutions Formula 42 (and, consequently, of times Formula 42) of the problem (31)–(35) is not continuous. Therefore, in cases with just one or two null eigenvalues, increasing the length of the sequences to infinity would generate slightly different proportions of nucleotide substitution and a different net transition matrix, close to Formula 42, but with all eigenvalues different from zero (although one or two of them very small). As a consequence, if Formula 42 obtained from an observed (finite) dataset presents only one or two null eigenvalues, we conclude that Formula 42 should be considered a great, but not an infinite, value.

3.2 Consequences on evolutionary distances
Application of the non-linear optimization techniques described above has major consequences on evolutionary distances. The spectrum of the net transition matrix can be considered as a measure of the level of divergence between the two analyzed sequences: the difference between the greater and the smaller eigenvalues of Rt tends to zero as the evolutionary distance between the two observed sequences becomes smaller. Conversely, the difference between the greater and the smaller eigenvalues of Rt tends to infinite when the evolutionary distance between the two sequences increases.

Hence, the use of distance matrix methods remains restricted to cases characterized by a positive definite matrix F#, i.e. for pairs of sequences characterized by relatively small divergences. As, from a biological point of view, undefined (i.e. infinite) evolutionary distances are meaningless, a practical solution to overcome such a problem would be to replace Formula 42 with an appropriate Formula 42, e.g. choosing Formula 42, where {lambda}2 is the smaller non-null eigenvalue of Formula 42. The rationale behind such a choice is that it is generally accepted (e.g. in experimental physics) what follows. Consider two exponential processes whose asymptotic values are zero (as it occurs in our case) and such that the second process decays five times as fast as the first one. By the time the first process has assumed a value that is 1/e times its initial value, the second process has practically reached its steady state, i.e. its value is less than one hundredth of its initial value.

Furthermore, this solution has the advantage of assigning different distances for different pairs of sequences initially characterized by undefined distances (i.e. the procedure we propose provides estimates of distances) and, therefore, it might generally lead to better results than when using a fixed arbitrary value (Waddell and Steel, 1997). Note however that the approach we propose (1) introduces arbitrariness in computing the evolutionary distances (it is arbitrary to use Formula 42 rather than, e.g. Formula 42) and (2) would not be applicable if also the second eigenvalue of P is negative (because Formula 42 would be characterized by three null eigenvalues). However this case should be very rarely encountered (i.e. it corresponds to a very high divergency between the two sequences) and should be preceded by other problems such as ambiguities in alignment.

3.3 Goodness of solutions
Let us suggest that the net transition matrix Formula 42 can be considered valid (reasonable) from a biological point of view if, by random sampling of a population of sequences evolving according to P(t) = eRt, there is a relatively high probability to pick up a pair of sequences (Sr, Ss) whose Formula 42 is greater than or equal to Formula 42. By indicating with Formula 42 the number of nucleotides of type i in Sx that have undergone a substitution into the type j in Sy, with N nucleotides in each sequence, such a probability Formula 42 can be written as follows:

Formula 43(43)
where

Formula 44(44)

Formula 45(45)

Formula 46(46)

Formula 47(47)

Formula 48(48)

Formula 49(49)
and

Formula 50(50)
Condition (44) imposes to consider only the indices (and therefore all the corresponding pairs of sequences) such that the observed numbers of substitutions Formula 50 and Formula 50 are smaller than the respective Formula 50 and Formula 50 under the same transition probabilities pij. Conditions (45) and (46) impose that the sums of the substitutions Formula 50 and Formula 50 are equal to the length N. Finally, condition (47) and (48) trivially imposes that the number of substitutions cannot be negative.

The overall probability [cf. Equation (43)] can be interpreted as the goodness of the stochastic model to predict the likelihood of observing the actual sequences. From this point of view, P({infty}) is better than Formula 50 because the former implies a greater Formula 50 than the latter does. In fact, for P({infty}), condition (44) of Formula 50 becomes

Formula 51(51)
and

Formula 52(52)
Constraint (51) is weaker than constraint (44), therefore the set Formula 52 (hence, the overall probability Formula 52) is greater under P({infty}) than under Formula 52.

Notice that if one aims to obtain the transition probability matrix that maximizes Formula 52, then one should consider the matrix Formula 52, i.e. the matrix characterized by all events being equiprobable (e.g. all entries = 1/4). In this situation, the set Formula 52 becomes

Formula 53(53)

Formula 54(54)

Formula 55(55)

Formula 56(56)

Formula 57(57)
Condition (44) in this case becomes

Formula 58(58)
Constraint (58) is weaker than constraints (51), (48) and (54). Hence, when Formula 58 is assumed, Formula 58; i.e. the probability to obtain the observed pair of sequences is maximum. However, one should realize that, when using Formula 58, Formula 58 is always equal to 1 for any (observed or not observed) pair of sequences, i.e. all sequences are equiprobable. Our interpretation for this phenomenon is that the Formula 58 model does not tell us anything about the substitution process. Therefore, we suggest that, although the goodness of Formula 58 and of P({infty}) are higher than that of Formula 58, the information content (regarding the substitution process of the specific pair of observed sequences) of Formula 58 is higher (depending on its rank) than those of Formula 58 and P({infty}).

In summary, Formula 58 represents the asymptotic value of the transition probability matrix relative to any pair of sequences; P({infty}) represents the asymptotic value of the transition probability matrix relative to any pair of sequences characterized by a distribution {Pi}, and Formula 58 is the limit transition probability matrix that best describes a substitution process [modeled by Equation (2)] for the two observed sequences characterized by a distribution {Pi}. In other words we consider that Formula 58 is the most reliable matrix when the evolution of a pair of sequences is both assumed to follow (2) and characterized by negative eigenvalues of the net transition matrix P.


    4 DISCUSSION: A NUMERICAL EXAMPLE
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 DISCUSSION: A NUMERICAL...
 5 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 
In this section we show a numerical example of the non-linear optimization problem described above. This example is deliberately made of short and divergent sequences both to insure the occurrence of negative eigenvalues and allow the reader to easily (manually) verify each step of the procedure. Although generally not mentioned in publications, negative eigenvalues with real data do occur (e.g. the comparison ‘human versus cow’ for the cytochrome b gene third positions yield the following eigenvalues: (1., 0.423996, –0.137941, 0.111731). Whether negative eigenvalues are widespread with real data warrants further investigation.

Let us assume we want to compute the distance matrix from the following alignment:

Formula 59(59)
Let us consider the first two sequences. If we use Rodriguez et al.'s (1990) procedure as proposed in Waddell and Steel (1997), we obtain the following symmetrized divergency matrix

Formula 60(60)
Accordingly, the matrix {Pi} is

Formula 61(61)
with

Formula 62(62)
Finally, the net transition matrix P is

Formula 63(63)
This matrix is characterized by the following eigenvalues:

Formula 64(64)
The occurrence of a negative eigenvalue makes Rodriguez et al.'s (1990) procedure inapplicable. Let us also observe that, in our example, all pairwise evolutionary distances are undefined, as each pairwise comparison leads to a P matrix characterized by negative eigenvalues. Following Waddell and Steel (1997), the distance matrix therefore is

Formula 65(65)
i.e. no distance matrix method can be used.

On the other hand, solving the non-linear optimization problem for the specific pair of sequences following the method we propose above, yields the net transition matrix

Formula 66(66)
with eigenvalues

Formula 67(67)
Note that Formula 67 satisfies the reversibility condition. The respective matrices P({infty}) and Formula 67 are

Formula 68(68)
and

Formula 69(69)

As the evolutionary distance between the two first sequences is undefined, we propose (see Section 3.2) using Formula 69, such that

Formula 70(70)
By iterating the above procedure for all pairwise sequence comparisons, all entries of the distance matrix can be computed and any distance matrix method can be applied.

Let us now observe that, when computing, from the two first sequences, the substitution rate matrix R by using (36), we obtain

Formula 71(71)

This matrix, with four negative non-diagonal elements, is not compatible with the description of a continuous Markov process and this result is not imputable to the limit (36). Indeed, an instantaneous substitution rate matrix of a Markov process must satisfy the so-called conservative hypothesis, requiring that (1) non-diagonal entries are non-negative and (2) the diagonal negative elements, row by row, are equal to the opposite of the sum of the non-diagonal elements.

The only explicit hypothesis of the estimation procedures proposed by (Rodriguez et al., 1990; Waddell and Steel, 1997) is reversibility as defined by Equations (4) and (5). This hypothesis is sufficient to compute Formula 71 in all cases for which P is congruent with (2), but is not sufficient to guarantee that the instantaneous substitution rate matrix is characterized by non-negative non-diagonal entries.

Therefore, imposing (11) when performing the non-linear optimization problem guarantees that Formula 71 is congruent with (2), but does not guarantee that Formula 71 a transition probability matrix of a markovian process. In such a circumstance, we propose three possible strategies.

First, one could accept Formula 71 and R (eventually with negative non-diagonal entries) as they are congruent with (2), but one then accepts the risk of loosing the conservative continuous markov chain hypothesis and, row per row, one uses the questionable interpretation that the eventual negative instantaneous substitution rates are decreasing substitutions in favor of the positive ones.

Second, one could accept Formula 71, and decompose the instantaneous substitution rate matrix into R = {Pi}B, change the sign of all negative non-diagonal entries of B; in this case, the conservative continuous markov chains hypothesis is guaranteed, but the optimality of R is lost.

Third, one could incorporate the conservative continuous markov chains hypothesis in the set of constraints of the non-linear optimization problem, i.e. we further add the constraints Formula 71 and, given (7), rii < 0 and rij > 0. In this case we obtain:

Formula 72(72)
and the corresponding evolutionary distance between the two first sequences is

Formula 73(73)

Unfortunately such an approach is very computationally intensive and provides no guarantee that a global optimal solution can be determined.

Now, let us indicate with dP({infty})(S{alpha}, Sß) and Formula 73 the likelihood of observing the actual pair of sequences given P({infty}) and Formula 73, respectively. We obtain

Formula 74(74)

Formula 75(75)

Formula 76(76)
This result confirms that Formula 76 is the matrix that maximizes the likelihood of observing the specific pair of sequences. Let us compute now the overall probability to pick up a pair of sequences (Sr, Ss) that yields a likelihood greater than or equal to Formula 76 when Formula 76, P({infty}) or Formula 76 are considered. To estimate such probabilities, we generated a set of 100 sequences for each substitution matrix. We then compute the proportion of (S{alpha}, Sß) pairs [among all pairs (ancestral sequence, generated sequence)] that exhibit a dP(S{alpha}, Sß) greater than Formula 76, dP({infty})(S{alpha}, Sß) and Formula 76. The process was repeated 100 times for each transition probability matrix Formula 76, P({infty}) and Formula 76 and the three populations of sequences are characterized by the following means and variances:

Formula 77(77)

Formula 78(78)

Formula 79(79)
As discussed in Section 3.3, Formula 79


    5 CONCLUSIONS AND PERSPECTIVES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 DISCUSSION: A NUMERICAL...
 5 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 
In this paper we formally characterized the mathematical conditions that lead to the inapplicability of the published estimation procedures (Rodriguez et al., 1990; Waddell and Steel, 1997; Yang and Kumar, 1996) to compute evolutionary distances and instantaneous rate matrices under the GTR model of nucleotide substitution. We provided criteria to accept or reject such estimations and suggested biological interpretations of these conditions. Furthermore, we extended existing procedures (Rodriguez et al., 1990; Waddell and Steel, 1997, Yang and Kumar, 1996) by reformulating them in terms of a non-linear optimization problem. We analytically investigated the effect of our approach on estimated evolutionary distances, the transition probability matrix, and the instantaneous substitution rate matrix. Our formulation yields the best net transition matrix Formula 79 that (1) is congruent with (2), (2) maximizes the proposed measure (22) and (3) best describes the substitution process for the specific sequence pair.

For overcoming the problem of undefined evolutionary distances, we propose a procedure that is more generally applicable and more biologically meaningful than the alternative strategy proposed by Waddell and Steel (1997).

In Section (4) we have shown that the estimation procedures proposed by Rodriguez et al., 1990; and by Waddell and Steel (1997) might lead in general to an estimated R that does not respect the conservative continuous markov chains hypothesis. This phenomenon also affects the non-linear formulation we proposed: conditions (11) are necessary and sufficient to guarantee that Formula 79 is congruent with (2), but are insufficient to guarantee the conservative continuous markov chains hypothesis. We suggested three possible ways to overcome such a problem.

Finally, we observed that Formula 79 has a low probability Formula 79. This situation seems contradictory: the matrix that maximizes the likelihood of the observed data [Equation (22)] also minimizes the probability of generating a pair of sequences with likelihood smaller than or equal to that of the observed data (43). However, we argue that a pair of sequences requiring the computation of Formula 79 can not be considered as a random draw: they generate negative eigenvalues of P. This brings us to the perspective of questioning the validity of the GTR model. First, the strength of the argument of rejecting the GTR model as a biologically valid model of nucleotide substitution depends on the frequency with which negative eigenvalues of P are generated with real data sequence pairs (and this points warrants further investigation). Second, there are fundamental limitations to the GTR model: it does not separate the mutation process from other factors influencing the substitution process. To investigate whether this would bring about the rejection of the GTR model and motivate the development of alternative, more complex, models, one needs to identify the biological conditions under which sets of observed sequences would be characterized by very low probabilities [Equation (43)] of being generated by the GTR model [described by Equation (2)]. Finally, another interesting issue would be to characterize the amount of time Formula 79 for which, given Formula 79, the probability of obtaining negative eigenvalues of Formula 79 is high. The answer to this question would return the range of applicability of the GTR model. The reformulation of the GTR model as non-linear optimization problem described above is implemented in version 2 of the phylogeny inference program MetaPIGA (Lemmon and Milinkovitch, 2002), available at www.ulb.ac.be/sciences/ueg/html_files/MetaPIGA.html


    Acknowledgments
 
We would like to thank Joseph Felsenstein and L. Gatto for helpful general discussions regarding the GTR model, as well as C. Lanave and Mike A. Steel for their invaluable comments on a previous version of this manuscript.

Conflict of Interest: none declared.


    FOOTNOTES
 
Associate Editor: Joaquin Dopazo

Received on November 4, 2005; revised on November 29, 2005; accepted on December 9, 2005

    REFERENCES
 TOP
 ABSTRACT
 1 INTRODUCTION
 2 APPROACH
 3 METHODS
 4 DISCUSSION: A NUMERICAL...
 5 CONCLUSIONS AND PERSPECTIVES
 REFERENCES
 

    Barry, D. and Hartigan, J.A. (1987) Statistical analysis of hominoid molecular evolution. Stat. Sci, . 2, 191–207.

    Bertsekas, D.P. Nonlinear Programming, (1999) 2nd edn Hardcover.

    Brinkhuis, J. and Tikhomirov, V. Optimization: insights and applications, (2005) Princeton University Press.

    Felsenstein, J. (1984) Distance methods for inferring phylogenies: a justification. Evolution, 38, 16–24[CrossRef].

    Felsenstein, J. (2004) Inferring Phylogenies. , Sunderland, UK Sinauer Associates.

    Keilson, J. Markov Chain Models–Rarity and Exponentiality, (1979) , New York Springer-Verlag.

    Lake, J.A. (1994) Reconstructing evolutionary trees from dna and protein sequences: Paralinear distances. Proc. Natl Acad. Sci. USA, 91, 1455–1459[Abstract/Free Full Text].

    Lanave, C., et al. (1984) A new method for calculating evolutionary substitution rates. J. Mol. Evol, . 20, 86–93[CrossRef][ISI][Medline].

    Lemmon, A.R. and Milinkovitch, M.C. (2002) The metapopulation genetic algorithm: An efficient solution for the problem of large phylogeny estimation. Proc. Natl Acad. Sci. USA, 99, 10516–10521[Abstract/Free Full Text].

    Lio, P. and Goldman, N. (1998) Models of molecular evolution and phylogeny. Genome Res, . 8, 1233–1244[Abstract/Free Full Text].

    Li, W.H. Molecular Evolution, (1997) , Sunderland, UK Sinauer Associates.

    Lockahart, P.J., et al. (1994) Recovering evolutionary trees under a more realistic model of sequence evolution. Mol. Biol. Evol, . 11, 605–612[ISI].

    McGuire, G. (2001) Models of sequence evolution for DNA sequences containing gaps. J. Mol. Evol, . 18, (4) 481–490.

    Page, R.D.M. and Holmes, E.C. Molecular Evolution: A Phylogenetic Approach, (1998) , Oxford, UK Blackwell Science.

    Papadimitriou, C. and Steiglitz, K. Combinatorial Optimization, Algorithm and Complexity, (1998) , Mineola, NY, USA Dover Publications.

    Rodriguez, F., et al. (1990) The general stochastic model of nucleotide substitution. J. Theor. Biol, . 142, 485–501[ISI][Medline].

    Steel, M.A. (1994) Recovering a tree from the markov the leaf colourations it generates under a markov model. Appl. Math. Lett, . 7, 13–23.

    Swofford, D.L., Olsen, G.J., Waddell, P.J., Hillis, D.M. (1996) Phylogenetic inference. In Hillis, D.M., Moritz, C., Mable, B.K. (Eds.). molecular systematics, , Sunderland, UK chapter 11 Sinauer & Associates, pp. 407–514.

    Tavare, S. (1987) Some probabilistic and statistical problems in the analysis of DNA sequences. Lect. Math. Life Sci, . 17, 57–86.

    Waddell, P.J. and Steel, M.A. (1997) General time reversible distances with unequal rates across sites: Mixing gamma and inverse gaussian distributions with invariant sites. Mol. Phylogenet. Evol, . 8, 398–414[CrossRef][ISI][Medline].

    Yang, Z. and Kumar, S. (1996) Approximate methods for estimating the pattern of nucleotide substitution and the variation of substitution rate among sites. Mol. Biol. Evol, . 13, 650–659[Abstract].

    Yang, Z. (1994) Estimating the pattern of nucleotide substitution. J. Mol. Evol, . 39, 105–111[ISI][Medline].

    Zharkikh, A. (1994) Estimation of evolutionary distances between nucleotide sequences. J. Mol. Evol, . 39, 315–329[CrossRef][ISI][Medline].


Add to CiteULike CiteULike   Add to Connotea Connotea   Add to Del.icio.us Del.icio.us    What's this?



This Article
Right arrow Abstract Freely available
Right arrow FREE Full Text (Print PDF) Freely available
Right arrow All Versions of this Article:
22/6/708    most recent
btk001v1
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Add to My Personal Archive
Right arrow Download to citation manager
Right arrowRequest Permissions
Google Scholar
Right arrow Articles by Catanzaro, D.