Computational Advances in Maximum Likelihood Methods for Molecular Phylogeny

Abstract

We have developed a generalization of Kimura’s Markov chain model for base substitution at a single nucleotide site. This generalized model incorporates more flexible transition rates and consequently allows irreversible as well as reversible chains. Because the model embodies just the right amount of symmetry, it permits explicit calculation of finite-time transition probabilities and equilibrium distributions. The model also meshes well with maximum likelihood methods for phylogenetic analysis. Quick calculation of likelihoods and their derivatives can be carried out by adapting Baum’s forward and backward algorithms from the theory of hidden Markov chains. Analysis of HIV sequence data illustrates the speed of the algorithms on trees with many contemporary taxa. Analysis of some of Lake’s data on the origin of the eukaryotic nucleus contrasts the reversible and irreversible versions of the model.

The object of phylogenetic analysis is to infer the correct evolutionary relationships among three or more contemporary species from amino acid or nucleotide sequence data on a representative member of each species. As more nucleotide sequence data accumulate from a multitude of contemporary species and as computers become ever faster, phylogenetic methods should evolve accordingly. In our opinion, properly implemented maximum likelihood methods are the best vehicles for statistical inference. First, maximum likelihood methods use all of the available data. Second, they can incorporate realistic evolutionary models. (They can also incorporate naive models.) Third, they permit estimation of parameters and their associated standard errors. Fourth, and finally, they allow comparison of different evolutionary trees through their maximum likelihood statistics. Competing methods of analysis, such as maximum parsimony, involve much less computation but neglect relevant portions of the data. They also ordinarily do not yield reliable parameter estimates.

In this paper we present a generalization of Kimura’s model for base substitution at a single nucleotide site. Our generalization strikes a good compromise between parameter richness and analytic tractability. By including just enough symmetry, we can extend Kimura’s arguments for deducing finite-time transition probabilities from his two-parameter model to our eight-parameter model. We also describe an algorithm that simultaneously calculates the likelihood and its derivatives quickly. These calculations can be implemented on a parallel computer for extremely fast analysis of many evolutionary trees. We demonstrate the feasibility of the model and the new likelihood algorithms by analyzing HIV sequence data and RNA sequence data bearing on the origin of the eukaryotic nucleus.

The model adopted in this paper specifies the expected branch length separating a mother and a daughter node of an evolutionary tree as the average number of substitutions per nucleotide position (site) between the nodes. This expectation is a function of the conditional probability p_ij(t) = Pr(Z_t = j | Z_0 = i) that the daughter node occupies state j at time t given that the mother node occupies state i at time 0. Here i and j are any of the four possible nucleotides, namely, the purines A (adenine) and G (guanine) and the pyrimidines C (cytosine) and T (thymine). The probabilities p_ij(t) are entries of the finite-time transition matrix P(t) characterizing the Markov chain Z_t of successive nucleotides occurring along the branch at the given site. Under the Markov chain model, the probability of the observed nucleotides at the site depends on (1) the rooted tree connecting the contemporary taxa, (2) the distribution of nucleotides at the root, and (3) the matrix function P(t). Phylogenetic reconstruction using maximum likelihood (Felsenstein 1981) attempts to infer the correct tree by maximizing the likelihood of the observed data with respect to entries of the various matrices P(t) or parameters determining these entries. The tree yielding the largest maximum likelihood is declared to be the best tree.

For the purposes of this paper, we make the simplifying assumption that the nucleotide patterns at different sites are independently and identically distributed. Many models for the 4 × 4 matrix P(t) have been proposed. Barry and Hartigan (1987) place no restrictions on P(t). Because all row sums of P(t) equal 1, this entails 12 parameters per branch. In practice, this many parameters cannot be reliably estimated, even though, in some cases, the tree topology and branch lengths can be. Beginning with Kimura (1980) and Felsenstein (1981), various researchers have constructed models that capture the changes occurring at each site as a continuous-time Markov chain involving the four states A, G, C, and T. These time-homogeneous models are nicely summarized by Yang (1994). If in this context ω_ij is the infinitesimal transition rate of moving from state i to state j ≠ i and if ω_ii = −Σ_{j≠i} ω_ij, then P(t) can be expressed as the matrix exponential P(t) = e^{tΩ} (equation 1) of the infinitesimal generator (or infinitesimal transition matrix) Ω = (ω_ij). Equation 1 permits, in principle, estimation of the off-diagonal entries of Ω and a branch time t_b for each branch b of an evolutionary tree. Because the branch times are confounded with the infinitesimal transition rates, it is necessary to set one arbitrary branch time equal to 1 and to determine all other branch times relative to this reference branch time.
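
As a quick illustration of equation 1 (and not the authors' LINNEAUS code), the following sketch builds an arbitrary generator Ω, computes P(t) with an off-the-shelf matrix exponential, and checks that each row of P(t) is a probability distribution. The rate values are hypothetical.

```python
# Sketch (not the authors' code): build an arbitrary 4x4 infinitesimal
# generator Omega with zero row sums and compute P(t) = exp(t*Omega).
import numpy as np
from scipy.linalg import expm

STATES = ["A", "G", "C", "T"]

# Hypothetical off-diagonal rates chosen only for illustration.
omega = np.array([
    [0.0, 0.6, 0.2, 0.2],
    [0.5, 0.0, 0.2, 0.2],
    [0.1, 0.1, 0.0, 0.7],
    [0.1, 0.1, 0.8, 0.0],
])
np.fill_diagonal(omega, -omega.sum(axis=1))   # omega_ii = -sum_{j != i} omega_ij

t = 0.3                                       # branch time relative to the reference branch
P = expm(t * omega)                           # finite-time transition matrix P(t)

assert np.allclose(P.sum(axis=1), 1.0)        # each row of P(t) is a probability distribution
print(np.round(P, 4))
```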

In practice, numerical evaluation of the matrix exponential in equation 1 is a major computational bottleneck hindering maximum likelihood estimation (Yang 1994). Thus, models such as Kimura's (1980) that yield analytic expressions for the entries of P(t) are extremely valuable. Kimura's highly symmetric two-parameter model incorporates a common infinitesimal rate α for all evolutionary transitions (purine to purine or pyrimidine to pyrimidine) and a common infinitesimal rate λ for all evolutionary transversions (purine to pyrimidine or pyrimidine to purine). Under this model, the underlying Markov chain along any branch is reversible.
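
For comparison, Kimura's two-parameter model needs no numerical exponentiation at all. The sketch below, with hypothetical values of the transition rate α and transversion rate λ, evaluates the standard closed-form probabilities and verifies them against the matrix exponential; it is an illustration, not the authors' implementation.

```python
# Sketch: Kimura's two-parameter model has closed-form transition probabilities,
# so no matrix exponentiation is needed.  alpha = transition rate, lam = transversion
# rate (per target base), as in the text; the numerical values are arbitrary.
import numpy as np
from scipy.linalg import expm

alpha, lam, t = 0.4, 0.1, 0.8

def k80_probs(alpha, lam, t):
    same = 0.25 + 0.25 * np.exp(-4 * lam * t) + 0.5 * np.exp(-2 * (alpha + lam) * t)
    transition = 0.25 + 0.25 * np.exp(-4 * lam * t) - 0.5 * np.exp(-2 * (alpha + lam) * t)
    transversion = 0.25 - 0.25 * np.exp(-4 * lam * t)   # each of the two transversions
    return same, transition, transversion

# Cross-check against the matrix exponential of the two-parameter generator (order A, G, C, T).
Q = np.array([
    [0.0,   alpha, lam,   lam],
    [alpha, 0.0,   lam,   lam],
    [lam,   lam,   0.0,   alpha],
    [lam,   lam,   alpha, 0.0],
])
np.fill_diagonal(Q, -Q.sum(axis=1))
P = expm(t * Q)

same, transition, transversion = k80_probs(alpha, lam, t)
assert np.allclose([P[0, 0], P[0, 1], P[0, 2]], [same, transition, transversion])
```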

The Generalized Kimura Model

The infinitesimal generator Λ under our generalization of Kimura's model is given by Formula. This model obviously makes fewer biological assumptions than Kimura's model. In essence, we impose the constraint that the infinitesimal rate for a transversion depends only on the destination state and not on the initial state. In contrast, Lake's model (Lake 1987) of evolutionary parsimony imposes the opposite constraint that the infinitesimal rate for a transversion depends only on the initial state and not on the destination state. Oddly enough, Lake's model does not lead to analytic expressions for the entries of P(t), whereas ours does.
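
Because the display of Λ is not reproduced here, the following sketch shows one way to assemble a generator obeying the stated constraint, namely that every transversion rate depends only on the destination base. The parameter names (a_AG, ..., b_A, ...) are ours and hypothetical, not the paper's notation.

```python
# Sketch of a generator consistent with the stated constraint: the rate of every
# transversion depends only on the destination base.  Parameter names (a_AG, ...,
# b_A, ...) are hypothetical; the paper's own notation is not reproduced here.
import numpy as np

def generalized_kimura_generator(a_AG, a_GA, a_CT, a_TC, b_A, b_G, b_C, b_T):
    """Eight-parameter generator: four transition rates plus one transversion
    rate per destination base (order A, G, C, T)."""
    Q = np.array([
        [0.0,  a_AG, b_C,  b_T],    # from A: transition to G, transversions to C, T
        [a_GA, 0.0,  b_C,  b_T],    # from G
        [b_A,  b_G,  0.0,  a_CT],   # from C: transversions to A, G, transition to T
        [b_A,  b_G,  a_TC, 0.0],    # from T
    ])
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

Q = generalized_kimura_generator(0.6, 0.5, 0.7, 0.4, 0.15, 0.25, 0.1, 0.2)
print(np.round(Q, 3))
```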

Kimura's two-parameter model is a special case of the model determined by Λ. In view of Kolmogorov's circulation criterion (Kelly 1979), it is easy to verify that Λ leads to a reversible Markov chain if and only if the conditions in Formula hold. When these reversibility conditions hold, Λ parametrizes the special case of Tavaré's (1986) general reversible model discussed by Tamura and Nei (1993).
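
The reversibility conditions themselves are not reproduced above, but reversibility can always be checked numerically through detailed balance, π_i λ_ij = π_j λ_ji, at the stationary distribution. The sketch below does this for the hypothetical generator constructed earlier; it is an illustration, not the paper's criterion in closed form.

```python
# Sketch: test reversibility numerically via detailed balance, pi_i q_ij = pi_j q_ji,
# using the stationary distribution obtained from pi Q = 0.
import numpy as np

def stationary(Q):
    # Solve pi Q = 0 with sum(pi) = 1 via a bordered linear system.
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])
    b = np.zeros(n + 1); b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def is_reversible(Q, tol=1e-10):
    pi = stationary(Q)
    flow = pi[:, None] * Q          # flow[i, j] = pi_i * q_ij
    return np.allclose(flow, flow.T, atol=tol)

# Same hypothetical generator with destination-only transversion rates as above.
off = np.array([
    [0.0,  0.6,  0.1, 0.2],
    [0.5,  0.0,  0.1, 0.2],
    [0.15, 0.25, 0.0, 0.7],
    [0.15, 0.25, 0.4, 0.0],
])
Q = off - np.diag(off.sum(axis=1))
print(is_reversible(Q))            # generally False unless the rates happen to
                                   # satisfy the reversibility conditions
```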

Other special cases of the infinitesimal generator Λ appear in the dnaml program of Felsenstein's computer package PHYLIP (Kishino and Hasegawa 1989; Felsenstein and Churchill 1996) and in the nucml program of the MOLPHY package (Hasegawa et al. 1985; Adachi et al. 1996). PHYLIP depends on two parameters k and u and the root probabilities (π_A, π_G, π_C, π_T). These five independent parameters define the entries of Λ as shown in Formula; equivalently, the entries can be written as in Formula. In practice, PHYLIP either requires the user to specify the components of the stationary distribution π or estimates these by the average nucleotide frequencies observed in the existing organisms. The two remaining parameters u and k are subject to the constraints u > 0, k + π_A + π_G > 0, and k + π_C + π_T > 0. Note that the PHYLIP model satisfies the reversibility criteria given above.
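
The displayed PHYLIP rate formulas are likewise not reproduced above. As a rough guide only, the sketch below encodes an F84-style parameterization that is consistent with the stated parameters u and k, the root probabilities, and the constraints u > 0, k + π_A + π_G > 0, and k + π_C + π_T > 0; the exact correspondence to the paper's display is our assumption.

```python
# Sketch (assumption, not the paper's display): an F84-style parameterization in
# which transversions i -> j occur at rate u*pi_j and transitions at rate
# u*pi_j*(1 + k/PI_j), where PI_j is the total frequency of j's class
# (purines or pyrimidines).  This matches the stated constraints
# u > 0, k + pi_A + pi_G > 0, and k + pi_C + pi_T > 0.
import numpy as np

STATES = ["A", "G", "C", "T"]
PURINES = {"A", "G"}

def f84_generator(u, k, pi):
    pi = dict(zip(STATES, pi))
    pur = pi["A"] + pi["G"]
    pyr = pi["C"] + pi["T"]
    Q = np.zeros((4, 4))
    for i, si in enumerate(STATES):
        for j, sj in enumerate(STATES):
            if i == j:
                continue
            same_class = (si in PURINES) == (sj in PURINES)
            if same_class:                       # transition
                cls = pur if sj in PURINES else pyr
                Q[i, j] = u * pi[sj] * (1.0 + k / cls)
            else:                                # transversion
                Q[i, j] = u * pi[sj]
    np.fill_diagonal(Q, -Q.sum(axis=1))
    return Q

Q = f84_generator(u=1.0, k=2.0, pi=[0.3, 0.2, 0.2, 0.3])
print(np.round(Q, 3))
```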

If branch lengths are unconstrained and equilibrium has been reached, then phylogenies with underlying reversible Markov chains cannot be rooted (Felsenstein 1981). This pulley principle of Felsenstein fails if any of the three necessary conditions mentioned is violated. Thus, in principle, methods based on the irreversible matrix Λ can single out a best rooted tree when sufficient data are available. We will comment further on this point in our application to Lake's data.

In calculating the finite-time transition matrix P(t) for the infinitesimal generator Λ, we pattern our arguments after Kimura (1980), who exploits symmetry and derives two coupled linear differential equations for the entries of P(t). In our calculations, it is convenient to let Y denote either pyrimidine and R either purine. We define q_AY(t) to be the probability that the chain is in either of the two pyrimidines C or T at time t given that it starts in A at time 0. We likewise define q_YA(t) to be the probability that the chain is in A at time t given that it starts in either pyrimidine at time 0, and q_YY(t) to be the probability that the chain is in either pyrimidine at time t given that it starts in either pyrimidine at time 0. These probabilities all make sense owing to the symmetry assumptions incorporated in Λ. Similar probabilities can be defined for the purines instead of the pyrimidines.

The probability q_AY(t + h) satisfies the Chapman–Kolmogorov relation Formula. Here p_AA(t) and p_AG(t) are entries of the matrix P(t). For small h, equation 6 becomes Formula. Rearranging terms and taking the limit as h → 0 in the resulting difference quotient yields the forward differential equation Formula, where Formula. In view of the initial condition q_AY(0) = 0, equation 7 has solution Formula. We next solve for p_AG(t) by writing the forward approximation Formula. Repetition of our previous arguments then gives the forward differential equation Formula, where Formula. Substituting the solution for q_AY(t) into this equation and noting the initial condition p_AG(0) = 0, we find after straightforward calculus that Formula. From the identity p_AG(t) + p_AA(t) + q_AY(t) = 1, we finally harvest the entry p_AA(t) of P(t).

In summary, we have calculated the entries p_AG(t) and p_AA(t) and the sum p_AY(t) = p_AC(t) + p_AT(t) from the top row [p_AA(t), p_AG(t), p_AC(t), p_AT(t)] of P(t). To fill out the top row, we briefly indicate how to calculate similar expressions for q_RR(t) and q_RC(t) = p_AC(t), where R stands for either purine. The forward differential equations for q_RR(t) and q_RC(t) turn out to be Formula and Formula, where c_1 is defined as above and Formula. The solution of the first equation satisfying the initial condition q_RR(0) = 1 is Formula, and the solution of the second equation satisfying the initial condition q_RC(0) = 0 is Formula (13).

This completes the top row of P(t).

The remaining rows of P(t) can be similarly calculated and are provided in the Appendix. In the limit as t → ∞, the top row of P(t), along with all other rows, tends to the equilibrium distribution π with components given by Formula (14). It is straightforward to check that π satisfies the balance condition πΛ = 0 characterizing equilibrium. When reversibility holds, the equilibrium distribution (equation 14) reduces to the equilibrium distribution of the PHYLIP model as listed in equation 5.
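
A simple numerical check of these limiting statements, using the same hypothetical generator as before: solve πΛ = 0 with the components of π summing to 1, then confirm that every row of P(t) approaches π for large t. This is an illustration, not the closed-form equation 14.

```python
# Sketch: the equilibrium distribution pi satisfies pi Q = 0, and every row of
# P(t) = exp(t*Q) tends to pi as t grows.  Q is the hypothetical example used earlier.
import numpy as np
from scipy.linalg import expm

off = np.array([
    [0.0,  0.6,  0.1, 0.2],
    [0.5,  0.0,  0.1, 0.2],
    [0.15, 0.25, 0.0, 0.7],
    [0.15, 0.25, 0.4, 0.0],
])
Q = off - np.diag(off.sum(axis=1))

# Solve pi Q = 0 subject to sum(pi) = 1.
A = np.vstack([Q.T, np.ones(4)])
b = np.array([0.0, 0.0, 0.0, 0.0, 1.0])
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

assert np.allclose(pi @ Q, 0.0, atol=1e-8)          # balance condition pi Q = 0
P_long = expm(100.0 * Q)
assert np.allclose(P_long, np.tile(pi, (4, 1)), atol=1e-8)   # all rows tend to pi
print(np.round(pi, 4))
```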

Likelihood Computation

Maximum likelihood methods are notoriously slow compared with competing methods such as maximum parsimony (Yang 1994). However, every advance in computing speed makes maximum likelihood methods more attractive and the excuses for preferring other methods less compelling. Of course, fast hardware is only one of the reasons for the improved position of maximum likelihood. Better algorithms also play a role. In this section we sketch an algorithm for the fast computation of the derivatives of the likelihood of an evolutionary tree. This algorithm extends the peeling method already in place for likelihood computation (Felsenstein 1981; Waterman 1996) and builds on an extension (Lange et al. 1995) of Baum's algorithms from the theory of hidden Markov chains (Baum 1972; Devijver 1985). While derivative-free methods of optimization are viable, quick optimization almost always depends on the use of derivatives, computed either exactly or numerically. The decisive advantage of derivative-based methods is already clear in timed comparisons of dnaml with its sister program fastDNAml (Olsen 1994); dnaml employs a derivative-free method, whereas fastDNAml employs Newton's method.

Before introducing the peeling algorithm and describing how to modify it to compute derivatives, it is helpful to introduce several definitions. Imagine an evolutionary tree drawn with its root at the top and its tips at the bottom. Every internal node n of the tree, including the root, generates three subtrees rooted at n. One of these is simply the subtree descending from n and ultimately reaching a subset of descendant tips. This subtree can be divided into a left subtree and a right subtree by taking either the left or right branch descending from n. There is also a complementary subtree starting at the original root and containing all of the branches not included in the subtree rooted at n. Figure 1 illustrates the four subtrees at node 5 of a four-taxon tree. Subfigure a is the whole tree, c is the subtree rooted at 5, d is the left subtree, e is the right subtree, and b is the complementary subtree.

Figure 1.

A four-taxon tree with subtrees at node 5.

In the peeling algorithm, several probabilities enter into the calculation of the likelihood P of the tree at a single site. These are (1) the conditional probability T_m→n(i,j) that the daughter node n of a branch m → n is in state j given that the mother node m is in state i; (2) the conditional probability D_n(i) of the subtree rooted at node n given node n is in state i; (3) the conditional probability L_n(i) of the left subtree rooted at node n given node n is in state i; (4) the conditional probability R_n(i) of the right subtree rooted at node n given node n is in state i; and (5) the joint probability U_n(i) of node n being in state i and the complementary subtree showing its observed bases at its tips. In definition (1), note that T_m→n(i,j) is nothing more than an entry of the finite-time transition matrix P(t) pertaining to the branch m → n. The letters D and U in D_n(i) and U_n(i) of definitions (2) and (5) are chosen to indicate downward and upward directions in the tree.

In the modified peeling algorithm, we make two passes through the tree. In the first pass, we traverse the tree in a postorder (upward) fashion and compute and store the conditional probabilities [L_n(i), R_n(i), D_n(i)] as we visit each interior node n. When we visit the root node at the end of the first pass, we recover the likelihood P of the tree. In the second pass, we traverse the tree in a preorder (downward) fashion and compute the derivatives of the likelihood with respect to each of the model parameters, using the stored quantities [L_n(i), R_n(i), D_n(i)] as we visit node n.

More specifically, in the postorder phase of the algorithm we employ the recurrence relations L_m(i) = Σ_j T_m→l(i,j) D_l(j) and R_m(i) = Σ_j T_m→r(i,j) D_r(j) to compute the conditional probabilities L_m(i) and R_m(i), where l is the left daughter of m and r is the right daughter of m. When l and r are tips with observed states s_l and s_r, the above recurrences reduce to L_m(i) = T_m→l(i, s_l) and R_m(i) = T_m→r(i, s_r). Given the conditional probabilities L_m(i) and R_m(i), we compute D_m(i) = L_m(i) R_m(i). Derivation of these more or less obvious recurrences is left to the reader. Clearly, the computations on node m can be carried out only after m's two daughter nodes have been processed. A postorder traversal obeys this convention. Once we reach the root, we recover the likelihood via P = Σ_i π_i D_root(i), where π_i is the prior probability of the root being in state i.
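
The following sketch carries out the postorder pass for a single site on a four-taxon tree like that of Figure 1a. The tree, branch times, rate matrix, and observed bases are all hypothetical, and the code is illustrative rather than the LINNEAUS implementation.

```python
# Sketch of the postorder (pruning) pass for a four-taxon tree like Fig. 1a:
# tips 1-4, interior nodes 5 and 6, root 7, with branches 7->5, 7->6, 5->1,
# 5->2, 6->3, 6->4.  The numbering, branch times, and rate matrix are hypothetical;
# only the recurrences follow the text.
import numpy as np
from scipy.linalg import expm

STATES = "AGCT"
off = np.array([[0.0, 0.6, 0.1, 0.2], [0.5, 0.0, 0.1, 0.2],
                [0.15, 0.25, 0.0, 0.7], [0.15, 0.25, 0.4, 0.0]])
Q = off - np.diag(off.sum(axis=1))

def T(t):
    """Branch transition matrix P(t) = exp(tQ)."""
    return expm(t * Q)

def tip_D(base):                                # D_n(j) = 1{j = s_n} at a tip
    d = np.zeros(4); d[STATES.index(base)] = 1.0
    return d

def interior_D(t_left, D_left, t_right, D_right):
    L = T(t_left) @ D_left                      # L_m(i) = sum_j T_{m->l}(i,j) D_l(j)
    R = T(t_right) @ D_right                    # R_m(i) = sum_j T_{m->r}(i,j) D_r(j)
    return L * R                                # D_m(i) = L_m(i) R_m(i)

# Observed bases at tips 1-4 for a single site (hypothetical).
D1, D2, D3, D4 = map(tip_D, "AACG")
D5 = interior_D(0.2, D1, 0.3, D2)               # node 5 has daughters 1 and 2
D6 = interior_D(0.1, D3, 0.4, D4)               # node 6 has daughters 3 and 4
D7 = interior_D(0.2, D5, 0.2, D6)               # root 7 has daughters 5 and 6

pi = np.full(4, 0.25)                           # root distribution (hypothetical)
likelihood = pi @ D7                            # P = sum_i pi_i D_root(i)
print(likelihood)
```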

In the preorder phase of the algorithm, we exploit the product rule of differentiation. Any given model parameter θ may impinge on the root distribution π_i and on one or more of the branch transition probabilities T_m→n(i,j). The product rule implies that the partial derivative of P with respect to θ decomposes as (∂/∂θ)P = Σ_i (∂/∂θ)π_i D_root(i) + Σ_{m→n} (∂/∂θ)P_m→n, in notation that will be clear momentarily. The branch contribution (∂/∂θ)P_m→n can be computed by replacing the function T_m→n(i,j) by its partial derivative (∂/∂θ)T_m→n(i,j) in the likelihood representations P = Σ_i Σ_j U_m(i) R_m(i) T_m→n(i,j) D_n(j) involving a left branch and P = Σ_i Σ_j U_m(i) L_m(i) T_m→n(i,j) D_n(j) involving a right branch, respectively. In view of this decomposition, we collect the first contribution Σ_i (∂/∂θ)π_i D_root(i) when we commence the preorder traversal at the root.

Readers can rightfully object that the recipes for computing (∂/∂θ)P_m→n invoke the undefined vectors U_m(i) and, when n is a tip, D_n(j). The second vector, D_n(j), can be defined as D_n(j) = 1{j = s_n}, where s_n is the base at the tip. Specification of U_m(i) is more problematic. By definition, U_root(i) = π_i. In general, if m → n is a left branch, then U_n(j) = Σ_i U_m(i) R_m(i) T_m→n(i,j), and if m → n is a right branch, then U_n(j) = Σ_i U_m(i) L_m(i) T_m→n(i,j). These recurrences require that mother nodes be visited before daughter nodes. A preorder traversal of a phylogenetic tree achieves this node visiting strategy.

As an example of the two-pass algorithm, consider the four-taxon tree shown in Figure 1a. Passes one and two consist of the steps listed in Tables 1 and 2. The partial derivatives (∂/∂θ)π_i and (∂/∂θ)T_m→n(i,j) are calculated exactly and stored in a table prior to the start of pass two. When node n is visited during pass two, U_n(j) is computed first, and then (∂/∂θ)P_m→n is computed for each parameter θ in the model. Because the (∂/∂θ)T_m→n(i,j) are already computed and stored in a table, computing (∂/∂θ)P_m→n for a right branch requires only that the terms U_m(i), L_m(i), (∂/∂θ)T_m→n(i,j), and D_n(j) be multiplied and summed over i and j. For a left branch, we substitute R_m(i) for L_m(i) in this calculation. When all of the nodes have been visited, the partial derivatives of P are recovered via the decomposition above.
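
The table of exact derivatives (∂/∂θ)T_m→n(i,j) is easy to validate numerically. For a branch-time parameter t, the derivative of the transition matrix is ΛP(t), and the sketch below (an illustration, not the two-pass code itself) checks this against a central finite difference.

```python
# Sketch: the second pass needs exact derivatives of the branch transition
# matrices.  For a branch-time parameter t the derivative is d/dt exp(tQ) = Q exp(tQ);
# the check below compares it with a central finite difference.  Q is the same
# hypothetical generator used earlier.
import numpy as np
from scipy.linalg import expm

off = np.array([[0.0, 0.6, 0.1, 0.2], [0.5, 0.0, 0.1, 0.2],
                [0.15, 0.25, 0.0, 0.7], [0.15, 0.25, 0.4, 0.0]])
Q = off - np.diag(off.sum(axis=1))

t, h = 0.3, 1e-6
exact = Q @ expm(t * Q)                          # exact derivative of P(t) w.r.t. t
numeric = (expm((t + h) * Q) - expm((t - h) * Q)) / (2 * h)
assert np.allclose(exact, numeric, atol=1e-7)
```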

Table 1.

Likelihood Computation for Fig. 1a

Table 2.

Computation of Derivatives for Fig. 1a

Likelihood Maximization and Simulated Annealing

We have implemented the extension of Kimura's model described above and the likelihood algorithms sketched in the previous section in a C++ program named LINNEAUS. Maximum likelihood estimation is performed within LINNEAUS by a C++ version of the optimization program SEARCH (Lange 1997). SEARCH minimizes a nonlinear function by the method of recursive quadratic programming, with quasi-Newton updates of the approximate second differential (Hessian) of the objective function based on successive values of the first differential (gradient) of the objective function (Polak 1997). It is possible to implement Newton's method using SEARCH, but the second derivatives in our model are cumbersome to compute, and LINNEAUS avoids explicit calculation of them. Exploiting the ability of SEARCH to perform numerical differentiation, we estimate that the modified peeling algorithm described in the previous section approximately doubles the speed of LINNEAUS (data not shown). These timing trials have the beneficial side effect of validating our code for the derivatives of P(t). Finally, SEARCH permits upper and lower bounds and linear equality constraints to be imposed on parameters. Linear constraints are convenient because the reversibility conditions given earlier can be linearized by taking logarithms. Reparameterizing with log parameters has the added virtue of eliminating nonnegativity constraints.

SEARCH needs initial parameter values and an initial approximate Hessian. If maximum likelihood estimation is performed for a sequence of similar trees, then the parameter estimates from one tree can be used as initial parameter values for the next tree. The identity matrix serves as the default initial Hessian in SEARCH. Within LINNEAUS, the subprogram SEARCH minimizes the negative loglikelihood −L = −Σ_i L_i, where L_i is the loglikelihood of the data at site i of m sites. The sum of cross products of scores H = Σ_i dL_i (dL_i)^t provides an alternative to the identity matrix in approximating the Hessian −d²L. In this regard, note that H is non-negative definite and has the same expectation as −d²L relative to the data (Rao 1973).
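
A small sketch of the score-based Hessian approximation: given the per-site score vectors dL_i, form H as the sum of their outer products and confirm that it is non-negative definite. The scores below are random placeholders standing in for the ones produced by the peeling algorithm.

```python
# Sketch: given per-site score vectors (gradients of the per-site loglikelihoods
# L_i with respect to the parameter vector), the outer-product approximation to
# the Hessian of -L is H = sum_i dL_i dL_i^t.  The scores below are random
# placeholders standing in for the ones the peeling algorithm produces.
import numpy as np

rng = np.random.default_rng(0)
m, p = 1092, 8                         # sites and parameters (illustrative sizes)
scores = rng.normal(size=(m, p))       # row i = gradient of L_i w.r.t. the parameters

H = scores.T @ scores                  # sum of cross products of scores
eigvals = np.linalg.eigvalsh(H)
assert np.all(eigvals >= -1e-9)        # H is non-negative definite
print(np.round(eigvals, 2))
```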

Maximum likelihood estimation for a single tree must be fast enough to accommodate comparison of large numbers of different trees through their maximum likelihoods. With n taxa, there are (2n − 3)!/[2^(n−2)(n − 2)!] rooted trees and (2n − 5)!/[2^(n−3)(n − 3)!] unrooted trees. These numbers are manageable for n ≤ 7, but exhaustive consideration of all trees is out of the question for n > 10, and we must turn to heuristic methods of searching tree space. For example, Felsenstein (1981) suggests building an n-taxon tree by starting with any two-taxon subtree and successively adding one taxon at a time. At each stage the new taxon is attached to the existing branch that gives the subtree with the largest maximum likelihood. Because the initial pair of taxa is arbitrary, Felsenstein further recommends repeating the construction using different orders of taxon addition. Unfortunately, even further refinements of this greedy strategy still examine only a small fraction of tree space (Olsen 1994; Yang 1994).
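
These counts are easy to reproduce. The sketch below evaluates the two formulas for a few values of n.

```python
# Sketch: number of distinct rooted and unrooted bifurcating trees on n taxa,
# using the factorial formulas quoted in the text.
from math import factorial

def rooted_trees(n):
    return factorial(2 * n - 3) // (2 ** (n - 2) * factorial(n - 2))

def unrooted_trees(n):
    return factorial(2 * n - 5) // (2 ** (n - 3) * factorial(n - 3))

for n in (4, 7, 10):
    print(n, rooted_trees(n), unrooted_trees(n))   # e.g. n = 4 gives 15 rooted, 3 unrooted
```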

Because the combinatorial optimization method of simulated annealing permits a wider search, Lundy (1985) suggested its use in phylogenetic reconstruction. For the sake of brevity, we omit a description of simulated annealing and refer readers to the references (Kirkpatrick 1983; Press 1988) for motivation and most details of implementation. The unique elements in the current setting are the energy function, the proposal mechanism, and the initial temperature. It is natural to define the energy of a tree to be its negative maximum loglikelihood. In the proposal stage, we randomly select any node other than the root node, delete the subtree rooted at this node, and reattach the subtree to a randomly selected branch among the remaining branches. The new tree is accepted with the Metropolis probability min{1, exp(−[f(new tree) − f(current tree)]/t_cur)}, where f is the energy function (the negative maximum loglikelihood) and t_cur is the current temperature. To obtain a good starting temperature, we randomly generate 10 trees and set the initial temperature equal to a constant times the largest absolute difference between the corresponding energies. The constant 2 has proved satisfactory for up to 10 taxa. Because simulated annealing gives no guarantee of convergence to the best tree under realistic cooling schedules, we typically conduct several independent runs.
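
A skeletal version of the annealing loop appears below. The proposal and energy functions are placeholders for, respectively, the subtree prune-and-regraft move and the negative maximum loglikelihood described above; the toy usage at the end merely exercises the acceptance rule.

```python
# Sketch of the annealing loop's acceptance step.  `propose` and `energy` are
# placeholders: in the paper the proposal prunes a random subtree and regrafts it
# on a random branch, and the energy is the tree's negative maximum loglikelihood.
import math
import random

def anneal(tree, energy, propose, t_init, cooling=0.95, steps=1000):
    current, f_cur = tree, energy(tree)
    temp = t_init
    for _ in range(steps):
        candidate = propose(current)
        f_new = energy(candidate)
        # Metropolis probability min{1, exp(-(f_new - f_cur)/temp)}.
        if f_new <= f_cur or random.random() < math.exp(-(f_new - f_cur) / temp):
            current, f_cur = candidate, f_new
        temp *= cooling                       # geometric cooling schedule
    return current, f_cur

# Toy usage: "trees" are integers and the energy is a simple quadratic.
best, f = anneal(tree=50, energy=lambda x: (x - 7) ** 2,
                 propose=lambda x: x + random.choice([-1, 1]), t_init=100.0)
print(best, f)
```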

Program Performance

One obvious way of improving program speed is to exploit parallel processing. The fastDNAml program of Olsen et al. (1994) achieves superior performance by splitting likelihood computation for a phylogenetic tree into several processes that take advantage of massively parallel and clustered computing architectures. In writing LINNEAUS in Microsoft Visual C++ 5.0, we chose the most natural synchronous multithreading approach to program parallelization. LINNEAUS simply assigns each available processor to a different site of an aligned sequence. We achieve a further gain in efficiency by amalgamating those sites with a common pattern of observations and multiplying the loglikelihood and score computed for the pattern by the number of participating sites. Note that most sites in an alignment show complete agreement over all taxa; the exceptions (or segregating sites) generate most of the patterns. Under a symmetric multiprocessing operating system such as Windows NT 4.0, our current code detects the number of processors available at runtime and assigns a different observational pattern to each processor. When a processor finishes its thread, that is, computing the corresponding loglikelihood and score, it is reassigned to a new pattern unless none remains. Because most of the time in likelihood maximization is spent in these threads, our computing times per likelihood maximization are nearly inversely proportional to the number of available processors. Linear scalability of this sort is one of the hallmarks of good parallel code.
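
The pattern amalgamation step is simple to illustrate: collapse the alignment columns into unique patterns with multiplicities, then weight each pattern's loglikelihood and score by its count. The alignment below is a toy example, not the HIV data.

```python
# Sketch: collapse the columns of an alignment into unique site patterns with
# multiplicities, so each pattern's loglikelihood and score are computed once
# and then weighted by the pattern count.
from collections import Counter

alignment = {           # taxon -> aligned sequence (toy data)
    "taxon1": "AACGTACGT",
    "taxon2": "AACGTACGA",
    "taxon3": "AATGTACGT",
    "taxon4": "AACGTACGT",
}

columns = zip(*alignment.values())                 # one tuple of bases per site
pattern_counts = Counter(columns)

for pattern, count in pattern_counts.items():
    print("".join(pattern), count)
# total loglikelihood = sum over patterns of count * loglikelihood(pattern)
```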

To test the performance of LINNEAUS, we examined 12 full-length HIV genome sequences available from the Los Alamos HIV web site (Myers 1996). These taxa were obtained from 10 individuals and also include one clade B consensus sequence. The size (∼10,000 nucleotide sites per sequence) and the number of segregating sites (∼15% of the sites per sequence) represent a challenging computational problem. We selected 20 groups of 4, 5, and 6 taxa each and 7 groups of 10 taxa each from the original 12 taxa. Table 3 lists the average time, plus or minus one standard deviation, necessary to maximize the likelihood of one tree for the different tree sizes. The ALR Revolution 2× desktop computer used in these timing studies has two 300-MHz Intel Pentium II processors, each with a 512-KB L2 cache, and 512 MB of RAM. We have replicated the linear scalability results evident in Table 3 on an older four-processor machine. These four-processor results are comparable to the two-processor results and are omitted. We have also partially ported our code to a 14-processor Sun Ultra Enterprise 5000 with 1 GB of RAM running under the Solaris operating system. Our preliminary results on this high-end computer demonstrate that in excess of 100 likelihoods can be maximized per second for 10-taxon trees.

Table 3.

Per Tree Timing Results for Maximum Likelihood

Data Example

In this section, we apply our generalization of Kimura's model to a portion of the four-taxon data used by Lake (1988) to infer the closest prokaryotic relative of the eukaryotes. Lake aligned ribosomal RNA from a eukaryote (Artemia salina) and from three prokaryotes, the eocyte (Desulfurococcus mobilis), the halobacterium (Halococcus morrhuae), and the eubacterium (Bacillus subtilis). Here we examine a subset of 1092 aligned nucleotide quartets from Lake's data. Adopting his designation, let E designate the topology with A. salina and D. mobilis as sister taxa, let F designate the topology with A. salina and H. morrhuae as sister taxa, and let G designate the topology with A. salina and B. subtilis as sister taxa (see Fig. 2).

Figure 2.

The E, F, and G topologies represent all possible topologies for the four-taxon case.

Table 4 gives parameter estimates and maximum loglikelihoods (base e) for the most probable trees under four different models. Model 1 is the equilibrium version of the F84 model described in Kishino and Hasegawa (1989). Model 2 is our model with the reversibility restrictions and equilibrium at the root imposed. Both models 1 and 2 satisfy the pulley principle and therefore cannot be used to root trees. Model 3 coincides with model 2 except that the equilibrium assumption is relaxed. This entails adding three additional parameters to determine the root distribution. Model 4 coincides with model 3 except that the reversibility assumption is relaxed. Simulated data that we omit show that models 3 and 4 can be used to root trees.

Table 4.

Parameter Estimates for the Best Trees Under Four Sets of Modeling Assumptions

Under all four models, the best tree has the E topology. For models 1 and 2, the maximum loglikelihoods of the next best topology, the F topology in both cases, are −4614.2 and −4605.2, respectively. Tree 5 in Figure 3 depicts the best rooted tree for both models 3 and 4. Unfortunately, we have little confidence in this rooting. In the case of model 4, the two remaining trees in the E topology have maximum loglikelihoods of −4537.3 and −4537.4, respectively. Similarly inconclusive results hold for model 3. Under models 3 and 4, the best maximum loglikelihoods for trees with the F or G topology (−4564.1 and −4544.2, respectively) are substantially worse than the maximum loglikelihoods listed in Table 4. LINNEAUS reached the maximum likelihood estimates for all four models listed in Table 4 in just 11 iterations.

Figure 3.

The five rooted trees corresponding to the unrooted tree with the E topology. These rooted trees are obtained by placing the root in each of the five branches of the unrooted tree.

Models 1, 2, and 3 are nested within model 4. Within the context of the best topology, or, where applicable, the best rooted tree, we can therefore conduct likelihood ratio tests of the three submodels versus model 4. The χ² statistics given at the bottom of Table 4 overwhelmingly reject each of the submodels. Clearly, the assumption that the four nucleotides are at equilibrium is a poor one for these data. This impression is reinforced by the empirical frequencies shown in Table 5. Relaxing the reversibility restrictions also substantially improves model fit. Finally, the fact that we estimate the length of branch 2 in model 3 at the somewhat artificial lower bound of 0.001 is another indication of the superiority of model 4.

Table 5.

Observed Base Proportions for the Four Contemporary Taxa

Because all four models represent restrictions of the multinomial model with 4^4 − 1 = 255 degrees of freedom (Navidi et al. 1991), goodness of fit can be roughly assessed by comparing the maximum loglikelihood (−4536.2) under the best tree of model 4 with the maximum loglikelihood (−4361.3) under the unrestricted multinomial model. The resulting χ² test statistic of 349.8 on 238 degrees of freedom has asymptotic P value < 0.0001. This apparent lack of fit may spring from the failure of any of three different assumptions: (1) the transversion symmetries in our matrix Λ, (2) the applicability of large-sample approximations to sparse data, and (3) the assumption that the various sites act independently and identically. Although it is impossible to make a firm judgment in the current case, it is worth noting that the phenomenon of heterogeneity of substitution rates across sites is well documented (Gojobori 1982; Yang 1993). In particular, Rzhetsky (1995) shows substantial variation in the substitution rates between the stem and loop regions of 16S-like ribosomal RNA genes from higher eukaryotes.
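
The goodness-of-fit computation can be reproduced directly from the quoted loglikelihoods, as the following sketch shows.

```python
# Sketch: likelihood-ratio goodness-of-fit test of model 4 against the saturated
# multinomial model, using the loglikelihoods quoted in the text.
from scipy.stats import chi2

loglik_model4 = -4536.2
loglik_multinomial = -4361.3
df = 238                                   # 255 multinomial df minus model 4's parameters

statistic = 2 * (loglik_multinomial - loglik_model4)   # = 349.8
p_value = chi2.sf(statistic, df)
print(round(statistic, 1), p_value)        # statistic 349.8, P well below 0.0001
```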

Conclusions

Maximum likelihood is at an obvious computational disadvantage in molecular phylogeny. To a lesser extent, it is also handicapped by the simplicity of the models it traditionally invokes. The thrust of the current paper is that advances in model construction and algorithm design can go hand in hand. Although our generalization of Kimura's base-substitution model involves a more complicated infinitesimal generator, the new generator Λ retains just enough symmetry to permit exact calculation of the matrix exponential e^{tΛ} and its corresponding equilibrium distribution π. Furthermore, this more flexible model can be coupled with better algorithms for maximum likelihood estimation and search of tree space.

A key ingredient in the rapid optimization of any function is the ability to evaluate it and its derivatives quickly. Our new algorithm for likelihood differentiation should be viewed in this light. Combining this algorithm with a good, general-purpose optimization subprogram and parallelizing the resulting code leads to very fast maximum likelihood estimation per tree. In fact, we now consider our likelihood optimization methods fast enough to harness simulated annealing in a search of tree space for the best tree. While simulated annealing is slower than competing methods of phylogenetic inference, it is slower precisely because it analyzes a greater number of trees and so tends to find the best tree with greater probability.

Our infinitesimal generator Λ, which is irreversible in general, opens up the possibility of rooting trees. As Felsenstein’s celebrated pulley principle shows, rooting is impossible in a reversible model when the root frequencies are at equilibrium and no constraints are imposed on branch lengths. Abandoning any of these three assumptions makes it possible in principle to root a tree. For instance, if constraints are imposed on branch lengths by assuming a molecular clock, it is possible to root a tree. However, the practical difficulty of rooting in the presence of limited data has led many researchers to dismiss the possibility altogether. Yang (1994) raises the additional numerical concern that the eigenvalues of the infinitesimal generator may be complex when reversibility fails. It is interesting to note that all eigenvalues of our matrix Λ are real.
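
This observation is easy to probe numerically. The sketch below, using the same hypothetical destination-dependent transversion rates as earlier, confirms that the eigenvalues of such a generator come out real; it is a numerical spot check, not a proof.

```python
# Sketch: numerical check that a generator with destination-only transversion
# rates has purely real eigenvalues, in line with the remark above.  The rate
# values are hypothetical.
import numpy as np

a_AG, a_GA, a_CT, a_TC = 0.6, 0.5, 0.7, 0.4      # transition rates
b_A, b_G, b_C, b_T = 0.15, 0.25, 0.1, 0.2        # transversion rates by destination

off = np.array([
    [0.0,  a_AG, b_C,  b_T],
    [a_GA, 0.0,  b_C,  b_T],
    [b_A,  b_G,  0.0,  a_CT],
    [b_A,  b_G,  a_TC, 0.0],
])
Lam = off - np.diag(off.sum(axis=1))
eigenvalues = np.linalg.eigvals(Lam)
print(np.round(eigenvalues, 6))
assert np.allclose(eigenvalues.imag, 0.0, atol=1e-10)   # all eigenvalues are real
```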

Our experience only partially bears out the pessimism about rooting. Although topologically equivalent trees usually have similar maximum likelihoods and the root position within the best tree is usually poorly determined, we do see substantial increases in loglikelihoods when passing from simple reversible models to our more complex irreversible model. In our opinion, the massive amounts of data coming out of the various genome sequencing projects will fundamentally change the course of phylogenetic inference. Instead of aligned sequences of 10^3 sites, we may well have aligned sequences of 10^6–10^8 sites. Of course, this quantitative shift merely emphasizes the necessity of further gains in computing efficiency if maximum likelihood is to be the phylogenetic inference engine of choice.

In any event, the results presented in this article are meant to demonstrate the feasibility of major improvements in computing speed in maximum likelihood methods. Researchers interested in a free copy of LINNEAUS should contact one of the authors.

Acknowledgments

We thank Professor James A. Lake for providing the aligned DNA sequences for our data example. The authors would also like to thank the reviewers and the editor for their helpful comments and clarifications. Research was supported in part by U.S. Public Health Service grants GM53275 (to K.L.) and CA16042 (to J.S.S.) and by National Institute of General Medical Sciences (NIGMS) Systems and Integrative Biology Training grant GM08185-11 (to E.E.S.).

The publication costs of this article were defrayed in part by payment of page charges. This article must therefore be hereby marked “advertisement” in accordance with 18 USC section 1734 solely to indicate this fact.

Appendix

If the constants c_1 through c_7 are defined as in the text, and if Formula, then the 16 entries of P(t) are given by Formula. The infinitesimal generator Λ has eigenvalues 0, −c_1, −c_3, and −c_6 and corresponding right eigenvectors given by Formula. These can be checked by straightforward substitution.

Footnotes

  • 4 Corresponding author.

  • E-MAIL klange@umich.edu; FAX (650) 855-6115.

  • Received December 10, 1997.
  • Accepted January 30, 1998.

REFERENCES
