Globally optimal parameter estimates for nonlinear diffusions
arXiv [math.ST]
The Annals of Statistics © Institute of Mathematical Statistics, 2010
By Aleksandar Mijatović and Paul Schneider
Imperial College London and Warwick Business School
This paper studies an approximation method for the log-likelihood function of a nonlinear diffusion process using the bridge of the diffusion. The main result (Theorem 1) shows that this approximation converges uniformly to the unknown likelihood function and can therefore be used efficiently with any algorithm for sampling from the law of the bridge. We also introduce an expected maximum likelihood (EML) algorithm for inferring the parameters of discretely observed diffusion processes. The approach is applicable to a subclass of nonlinear SDEs with constant volatility and drift that is linear in the model parameters. In this setting, globally optimal parameters are obtained in a single step by solving a linear system. Simulation studies to test the EML algorithm show that it performs well when compared with algorithms based on the exact maximum likelihood as well as closed-form likelihood expansions.
1. Introduction.
In the natural and social sciences, diffusion processes are widely used as models for random phenomena that evolve continuously in time. They are popular because they arise as solutions of stochastic differential equations, which are natural probabilistic generalizations of the deterministic models described by ordinary differential equations. It is well known that if the data are recorded at discrete times, parametric inference for diffusions using maximum likelihood estimates is difficult, primarily because it is usually impossible to find the corresponding likelihood function [see Sørensen (2004) for a review of methods of inference in the diffusion setting]. In this paper, we are concerned with the estimation of the parameters in the stochastic differential equation (SDE)

$$dX_t = \mu(X_t,\theta)\,dt + dW_t \quad\text{where}\quad \mu(\cdot,\cdot):\mathbb{R}\times\mathbb{R}^{N+1}\to\mathbb{R},\qquad(1.1)$$

Received February 2008; revised April 2009.
AMS 2000 subject classifications.
Key words and phrases.
Nonlinear diffusion, maximum likelihood, EM algorithm, estimation, global optimization.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2010, Vol. 38, No. 1, 215–245. This reprint differs from the original in pagination and typographic detail.
is an arbitrary continuous (possibly nonlinear) function and $W_t$ is a standard Brownian motion. In order to guarantee the existence of a nonexplosive solution of (1.1), we need to assume that for each parameter value $\theta\in\mathbb{R}^{N+1}$, the function $\mu(\cdot,\theta):\mathbb{R}\to\mathbb{R}$ is locally Lipschitz with linear growth [see Kloeden and Platen (1999), Chapter 4]. The task is to infer the vector of coefficients $\theta\in\Theta\subseteq\mathbb{R}^{N+1}$ in the drift $\mu(\cdot,\theta)$ from $K+1$ observed realizations $x_0,\dots,x_K$ of the diffusion $X_t$, where $\Theta$ is some compact subset of the parameter space. When the exact likelihood function is available, the parameters can be determined by maximizing the joint likelihood of the observations. The true likelihood function is, however, available only in very few cases. A variety of approximations exist and are well documented in the literature [see Aït-Sahalia (2002), Jensen and Poulsen (2002), Hurn, Jeisman and Lindsay (2007) and Schneider (2006) for empirical comparisons of the available methods]. A general method for parameter inference based on the EM algorithm is to maximize an approximate likelihood function, which can be defined if one can simulate the bridge of the diffusion in (1.1) (see Section 2 for the precise definition of this approximation). Recently, an exact simulation approach for diffusion bridges was developed in Beskos et al. (2006) and an efficient algorithm for sampling from bridges of ergodic diffusions was proposed in Bladt and Sørensen (2007). Either of these simulation methods can be used to define the approximate likelihood function mentioned above. The main theoretical contribution of the present paper is Theorem 1, which states that, under some additional regularity conditions on the drift $\mu(\cdot,\theta)$ in (1.1), the approximation for the likelihood function obtained by simulating the bridge of the diffusion in (1.1) is justified because it approximates the true likelihood uniformly.
For the precise statement of the result, see Section 2. In this paper, we also propose a new algorithm for the inference of parameters when diffusion (1.1) takes a simpler form, as given in (3.1). Our method circumvents the use of numerical optimization to determine the parameters for diffusion models of the form (3.1). The estimation algorithm transforms the original problem into a related inference problem that has a unique global solution $\theta^*\in\mathbb{R}^{N+1}$, which is obtained in a single step by solving a linear system of dimension $(N+1)\times(N+1)$ given in (3.5). By Theorem 1, the related inference problem approximates the original inference problem uniformly on compact subsets of the parameter space. We also show that the approximations of the expectations that feature in linear system (3.5) converge uniformly on bounded subsets of the parameter space as the time interval between consecutive data points goes to zero (i.e., the number of observations $K+1$ goes to infinity). This property is implied by Theorem 3. Diffusions that are not of the form (1.1) can often be transformed into the required structure by a well-known change-of-variable method [see (3.6) at the end of Section 3]. The constant diffusion coefficient requirement in (1.1) is therefore not as restrictive as it may seem at first glance. Many of the widely used diffusion processes with state-dependent volatility fall into this class. The square root process and the flexible diffusion used in Aït-Sahalia (1996) and Jones (2003b) [see (4.3) and (4.4) for the SDEs describing the models] can be dealt with in this fashion. The likelihood functions of both processes, conditional upon S&P 100 implied volatility index data, are analyzed using the EML algorithm in Section 4.2.
In the case of the square root process, direct maximum likelihood estimation is also performed and it is shown that the parameter values obtained agree with the ones found using an algorithm based on the EML procedure. In this paper, we consider only one-dimensional diffusion processes, even though the EML algorithm can be applied to the multidimensional case without introducing additional computational complexity when the underlying process is reducible [see Aït-Sahalia (2008) for the precise definition]. However, extending the result of Theorem 1 to higher dimensions is a much harder problem. The paper is organized as follows. Section 2 describes how to approximate the likelihood function and states our main theoretical result (Theorem 1). Section 3 defines and derives the EML algorithm for diffusions given by (3.1). Section 4 consists of two subsections: in Section 4.1, a comparison of the EML algorithm with exact maximum likelihood estimation and the analytic likelihood approximation method from Aït-Sahalia (2002) is performed; Section 4.2 estimates the square root and flexible diffusion processes conditional on implied volatility data. Section 5 concludes the paper. Appendices A and B contain the proofs of Theorems 1 and 3.
2. The main result.
Let $x_0,\dots,x_K$ denote $K+1$ realizations of the diffusion $X_t$ given in (1.1), observed at times $t_0,\dots,t_K$. To avoid notational complexity, we assume evenly spaced time intervals between consecutive data points: $\Delta = t_k - t_{k-1}$ for all $k\in\{1,\dots,K\}$. Since we are assuming that the drift $\mu(\cdot,\theta):\mathbb{R}\to\mathbb{R}$ is locally Lipschitz and has linear growth, SDE (1.1) has a weakly unique solution for any starting point $x_0$ in its domain and any parameter value $\theta\in\Theta\subseteq\mathbb{R}^{N+1}$. The starting point of our approach is the EM algorithm, which we now briefly review [see Dempster, Laird and Rubin (1977) or McLachlan and Krishnan (1997) for the general theory]. We begin by considering two consecutive data points and then apply our analysis to the entire data set. Between two consecutive observations $x_{k-1}$ and $x_k$ at times $t_{k-1}$ and $t_k$, respectively, we introduce $M-1$ auxiliary data points $u_1,\dots,u_{M-1}$ and define $u_0 := x_{k-1}$, $u_M := x_k$. Note that the length of the time interval between $u_{m-1}$ and $u_m$, for any $m\in\{1,\dots,M\}$, equals $\delta := \Delta/M$. Given two observed realizations, $u_0$ and $u_M$, the task is to find the parameter $\theta = (a_0,\dots,a_N)$ such that the value $\pi(u_M\mid u_0,\theta)$ of the conditional transition density of the diffusion $X_t$ is maximized. Consider the joint likelihood $\pi(u_M,\dots,u_1\mid u_0,\theta)$ of the variable $u_M$ and the latent auxiliary variables $u_1,\dots,u_{M-1}$ conditional on $u_0$. The Markov property of the diffusion $X_t$ implies the following representation:

$$\pi(u_M,\dots,u_1\mid u_0,\theta) = \prod_{m=1}^{M}\pi(u_m\mid u_{m-1},\theta).\qquad(2.1)$$

In order to formulate the EM algorithm in our setting, we need to introduce the following notation. Let the random variables $U_m := X_{t_{k-1}+m\delta}$ for all $m\in\{1,\dots,M-1\}$ correspond to the auxiliary states between the consecutive observations and let the random vector $U := (U_1,\dots,U_{M-1})$ be the auxiliary state vector.
The joint distribution of $U$ is given by the law of the bridge of the diffusion $X_t$, which starts at time $t_{k-1}$ at the level $u_0$ and finishes at the level $u_M$ at time $t_k$, denoted by $\mathbb{Q}_\theta$ (or, more accurately, $\mathbb{Q}^{\Delta,u_0,u_M}_\theta$). The subscript $\theta$ signifies the dependence of this probability law on the parameters in the model. The EM algorithm starts with some feasible value $\theta_0$ of the parameter $\theta$ and repeats the following two steps:

E-step: determine the conditional expectation $\theta\mapsto E^{\mathbb{Q}_{\theta_n}}[\log\pi(U,u_M\mid u_0,\theta)]$;
M-step: maximize this expression with respect to $\theta$ to obtain $\theta_{n+1}$.

The function in the E-step of the algorithm is known as the complete likelihood. The important observation is that the expectation defining the complete likelihood is taken with respect to the distribution of the vector $U$, which is given by the law $\mathbb{Q}_{\theta_n}$ of the diffusion bridge. With each iteration, the value $\pi(u_M\mid u_0,\theta_n)$ is increased and the algorithm is therefore guaranteed to converge to a stationary point which, in some pathological cases, is not a local maximum [see McLachlan and Krishnan (1997) for convergence properties of the EM algorithm]. It is thus key to understand the behavior of the complete likelihood $\theta\mapsto E^{\mathbb{Q}_{\theta_n}}[\log\pi(U,u_M\mid u_0,\theta)]$ for any fixed parameter value $\theta_n$. There are two problems associated with the E-step of the algorithm in our setting. The first is that the joint density $\pi(u_M,\dots,u_1\mid u_0,\theta)$ for the law of the process $X_t$ given by SDE (1.1) cannot be obtained in closed form. The second problem is that the law of the bridge of the diffusion $X_t$ (i.e., the process $X_t$ conditional upon $X_{t_{k-1}} = u_0$ and $X_{t_k} = u_M$) which arises in the expectation is also unknown. Using an Euler scheme approximation for the solution of SDE (1.1), together with Markov property (2.1), one can obtain an approximation for the joint likelihood function $\pi(u_M,\dots
,u_1\mid u_0,\theta)$ in the following way. Over any short time period of length $\delta$, the Euler scheme approximates the solution $X_{t+\delta}$ of SDE (1.1) at time $t+\delta$, conditional upon the level $X_t$, by the normal random variable $X_t + \mu(X_t,\theta)\delta + W_\delta$ with mean $X_t + \mu(X_t,\theta)\delta$ and variance $\delta$. Over a longer time period $\Delta$, a succession of such normal variables is used to approximate the original process [see Kloeden and Platen (1999), Section 10.2, for the general theory and convergence properties of Euler schemes for SDEs]. Each transition density $\pi(u_m\mid u_{m-1},\theta)$, $m\in\{1,\dots,M\}$, in (2.1) can therefore be approximated by the normal density $\phi(u_m; u_{m-1}+\mu(u_{m-1},\theta)\delta, \delta)$ with mean $u_{m-1}+\mu(u_{m-1},\theta)\delta$ and variance $\delta = \Delta/M$ defined above. Identity (2.1) implies that an approximation for the joint likelihood $\pi(u_M,\dots,u_1\mid u_0,\theta)$ is given by the product $\prod_{m=1}^M \phi(u_m; u_{m-1}+\mu(u_{m-1},\theta)\delta, \delta)$. This approximation is useful because it depends explicitly [through the known drift function $\mu(\cdot,\theta)$] on the model parameter $\theta$ and has been used in Pedersen (1995) and Brandt and Santa-Clara (2002) to obtain an approximation for the transition density. The method, known as simulated maximum likelihood (SML), is based on the following convergence result, which holds under global Lipschitz and linear growth conditions [see Theorem 2 in Pedersen (1995)]:

$$\pi(u_M\mid u_0,\theta) = \lim_{M\to\infty}\int_{\mathbb{R}^{M-1}} \prod_{m=1}^{M}\phi(u_m; u_{m-1}+\mu(u_{m-1},\theta)\delta,\delta)\,du_1\cdots du_{M-1}.\qquad(2.2)$$

The SML algorithm uses this relationship to obtain an approximation for the likelihood function directly.
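To make relation (2.2) concrete, the following sketch implements the Pedersen-style Monte Carlo estimate of the transition density for a scalar diffusion with unit volatility; the function names, the drift and the parameter values are illustrative assumptions, not code from the paper.

```python
import numpy as np

def sml_transition_density(x0, xM, theta, mu, dt, M=16, n_paths=20000, rng=None):
    """Monte Carlo estimate of pi(xM | x0, theta) via (2.2): simulate M-1
    Euler steps from x0, then average the Gaussian density of the final
    Euler transition evaluated at xM."""
    rng = np.random.default_rng(0) if rng is None else rng
    delta = dt / M
    u = np.full(n_paths, float(x0))
    for _ in range(M - 1):                     # Euler scheme for the first M-1 steps
        u = u + mu(u, theta) * delta + np.sqrt(delta) * rng.standard_normal(n_paths)
    # Gaussian density of the last Euler step u -> xM, averaged over paths
    mean = u + mu(u, theta) * delta
    phi = np.exp(-(xM - mean) ** 2 / (2 * delta)) / np.sqrt(2 * np.pi * delta)
    return phi.mean()

# illustrative Ornstein-Uhlenbeck drift, as in model A of Section 4
mu_ou = lambda x, th: th[0] - th[1] * x
est = sml_transition_density(0.5, 0.6, (1.0, 2.0), mu_ou, dt=1 / 12)
```

For the Ornstein–Uhlenbeck case the estimate can be checked against the known Gaussian transition density, which is what makes this model a convenient base case later in the paper.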
As we will now show, it is possible to circumvent the difficult issue of the computation of high-dimensional integrals and obtain optimal parameter values without having to compute approximations of transition densities. In the same spirit as in Jones (1998), Eraker (2001), Elerian, Chib and Shephard (2001) and Roberts and Stramer (2001), the auxiliary data points introduced at the beginning of this section are there to exploit the convergence of the discrete-time approximation to the diffusion $X_t$. As shown above, the main problem is to find the maximum of the complete likelihood $\theta\mapsto E^{\mathbb{Q}_\theta}[\log\pi(U,u_M\mid u_0,\theta)]$ for any value $\theta$ in the parameter space. Instead of doing this, we solve an “approximate” problem, where the function $\pi(u_M,\dots,u_1\mid u_0,\theta)$ under the expectation is replaced by the density $\prod_{m=1}^M\phi(u_m;u_{m-1}+\mu(u_{m-1},\theta)\delta,\delta)$ of the Euler scheme approximation. The approximate likelihood function can then be obtained as soon as we can simulate the trajectories of the diffusion bridge [using, e.g., the algorithm in Beskos et al. (2006) or Bladt and Sørensen (2007)] governed by the law $\mathbb{Q}_\theta$. In Section 3, we will show that the approximate problem has a unique maximum that can be obtained as a solution of a linear system of size $(N+1)\times(N+1)$, where $N+1$ is the dimension of the parameter $\theta$, if the underlying diffusion takes the form (3.1). A natural question that arises at this point concerns the quality of the approximation of the complete likelihood by the sequence of functions $(\theta\mapsto E^{\mathbb{Q}_\theta}[\sum_{m=1}^M \log\phi(U_m; U_{m-1}+\frac{\Delta}{M}\mu(U_{m-1},\theta), \frac{\Delta}{M})])_{M\in\mathbb{N}}$. Numerical experiments in Section 4 suggest that this approximation works well. Under some additional regularity hypotheses on the drift $\mu(\cdot,\theta)$, this intuitive claim is justified by the following theorem.

Theorem 1.
Suppose that, in addition to the local Lipschitz condition with linear growth, we also assume that the function $\mu:\mathbb{R}\times\mathbb{R}^{N+1}\to\mathbb{R}$ in (1.1) is twice differentiable in the state variable with bounded second derivative. Let $\theta$ be a fixed value in the parameter space. The following equality then holds:

$$\lim_{M\to\infty} E^{\mathbb{Q}_\theta}\Biggl[\sum_{m=1}^{M}\biggl(\log\pi(U_m\mid U_{m-1},\theta) - \log\phi\Bigl(U_m; U_{m-1}+\frac{\Delta}{M}\mu(U_{m-1},\theta), \frac{\Delta}{M}\Bigr)\biggr)\Biggr] = 0$$

for all $\theta$ in the parameter space, where $\phi(y;x,\delta)$ is the normal density function with mean $x$ and variance $\delta$. Furthermore, the limit is uniform in $\theta$ on each compact subset of the parameter space.

Note that the nature of Theorem 1 is fundamentally different from that of result (2.2) above, proved in Pedersen (1995), because the expectation in the theorem is taken with respect to the law of the bridge of the diffusion $X_t$, rather than the law of the diffusion $X_t$ itself (i.e., conditional upon $X_{t_k} = u_M = x_{t_k}$ and $X_{t_{k-1}} = u_0 = x_{t_{k-1}}$). The proof of Theorem 1 is found in Appendix A. It is based on the fact that in the one-dimensional case, there exists an explicit formula for the transition density of the diffusion in terms of the Brownian bridge [see Rogers (1985)]. Because this is a special property of the one-dimensional case, the proof does not easily generalize to the multidimensional setting. The main contribution of Theorem 1 is that it provides the theoretical basis for using the approximate complete likelihood described above with any method capable of simulating the diffusion bridge of the process defined by (1.1), including the algorithms in Beskos et al. (2006) and Bladt and Sørensen (2007), provided the regularity conditions on the drift are met. In Section 3, we use Theorem 1 to justify a key step in a new estimation algorithm for discretely observed diffusions.
3. Expected maximum likelihood (EML) algorithm.
In this section, we are concerned with the estimation of the parameters in the SDE

$$dX_t = \mu(X_t,\theta)\,dt + dW_t \quad\text{where}\quad \mu(x,\theta) := g(x) + \sum_{i=0}^{N} a_i f_i(x),\qquad(3.1)$$

driven by the standard Brownian motion $W_t$. The drift $\mu(\cdot,\theta):\mathbb{R}\to\mathbb{R}$ is given by an arbitrary family of independent, possibly nonlinear, Lipschitz functions $g, f_i:\mathbb{R}\to\mathbb{R}$ with linear growth. The task is to infer the vector of coefficients $\theta := (a_0,\dots,a_N)\in\Theta\subseteq\mathbb{R}^{N+1}$ in the drift $\mu(\cdot,\theta)$ from $K+1$ observed realizations, $x_0,\dots,x_K$, of the diffusion $X_t$. As we shall see, the EML algorithm consists of solving a linear system of size $(N+1)\times(N+1)$ given in (3.5) and converges to the global maximum in a single step. Having constructed the approximation to the complete likelihood in Section 2, we now turn to the initial estimation problem. By the M-step of the EM algorithm, our task is to maximize the approximate complete likelihood function $\theta\mapsto E^{\mathbb{Q}_\theta}[\sum_{m=1}^M \log\phi(U_m; U_{m-1}+\delta\mu(U_{m-1},\theta), \delta)]$ for any fixed value of the model parameter $\theta$. The following obvious proposition is crucial to all that follows.

Proposition 2.
The complete likelihood $\theta\mapsto E^{\mathbb{Q}_\theta}[\sum_{m=1}^M \log\phi(U_m; U_{m-1}+\delta\mu(U_{m-1},\theta), \delta)]$ is a nondegenerate quadratic form with a unique global maximum.

It is clear that the complete likelihood in Proposition 2 is a nondegenerate quadratic form in $\theta$, bounded above by a constant, which implies that all of its eigenvalues must be negative. Therefore, there exists a unique global maximum. The following simple calculation yields the globally optimal parameter value $\theta^\star = (a_0^\star,\dots,a_N^\star)$, which exists by Proposition 2. By setting the partial derivative with respect to each coordinate $a_j$, $j\in\{0,\dots,N\}$, of $\theta$ in the function given in Proposition 2 to zero, we obtain the linear system $A\theta^\top = b$, where $\theta = (a_0,\dots,a_N)$, $A = \delta\sum_{m=1}^M A_m$, $b = \sum_{m=1}^M b_m$ and, for any $m\in\{1,\dots,M\}$,

$$A_m := \begin{pmatrix} E^{\mathbb{Q}_\theta}[f_0(U_{m-1})f_0(U_{m-1})] & \cdots & E^{\mathbb{Q}_\theta}[f_N(U_{m-1})f_0(U_{m-1})]\\ \vdots & \ddots & \vdots\\ E^{\mathbb{Q}_\theta}[f_0(U_{m-1})f_N(U_{m-1})] & \cdots & E^{\mathbb{Q}_\theta}[f_N(U_{m-1})f_N(U_{m-1})] \end{pmatrix},\qquad(3.2)$$

$$b_m^\top := \bigl(E^{\mathbb{Q}_\theta}[(U_m - U_{m-1} - g(U_{m-1})\delta)f_0(U_{m-1})], \dots, E^{\mathbb{Q}_\theta}[(U_m - U_{m-1} - g(U_{m-1})\delta)f_N(U_{m-1})]\bigr).\qquad(3.3)$$

Since there exists a unique global maximum of the approximate complete likelihood, the inverse $A^{-1}$ must also exist and the unique optimal parameter value is given by $\theta^\star = (A^{-1}b)^\top$. For $K+1$ observations of the process $X_t$, the globally optimal parameter value $\theta^\star$ is obtained in the same way. The only difference is that matrix (3.2) and vector (3.3) are computed using $M\cdot K$, rather than $M$, auxiliary and observed realizations [see (3.5)].
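For readers who want to see the first-order conditions written out, the following display (a sketch in the notation above, with additive constants independent of $\theta$ dropped) shows how setting the gradient of the approximate complete likelihood to zero produces the system $A\theta^\top = b$:

```latex
% Up to an additive constant in theta, the approximate complete likelihood is
\begin{aligned}
\ell(\theta) &= -\frac{1}{2\delta}\sum_{m=1}^{M}
  \mathbb{E}^{\mathbb{Q}_\theta}\Bigl[\bigl(U_m - U_{m-1} - g(U_{m-1})\delta
  - \textstyle\sum_{i=0}^{N} a_i f_i(U_{m-1})\,\delta\bigr)^{2}\Bigr],\\
\frac{\partial \ell}{\partial a_j} = 0
 \;&\Longleftrightarrow\;
 \sum_{i=0}^{N} a_i\,\delta\sum_{m=1}^{M}
   \mathbb{E}^{\mathbb{Q}_\theta}[f_i(U_{m-1})f_j(U_{m-1})]
 = \sum_{m=1}^{M}\mathbb{E}^{\mathbb{Q}_\theta}
   [(U_m-U_{m-1}-g(U_{m-1})\delta)f_j(U_{m-1})],
\end{aligned}
```

for each $j\in\{0,\dots,N\}$, which is exactly $A\theta^\top = b$ with $A = \delta\sum_m A_m$ as in (3.2) and $b = \sum_m b_m$ as in (3.3).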
The globally optimal value $\theta^\star$ of the parameter vector solves the linear system whose coefficients are yet to be determined. Computing the expectations $E^{\mathbb{Q}_\theta}[\,\cdot\,]$ in closed form is impossible because it requires the unknown transition density $\pi(u_{M-1},\dots,u_1\mid u_M,u_0,\theta)$ of the bridge of the diffusion $X_t$. The key idea that circumvents this problem is to replace the law of the bridge of $X_t$ with the law of the corresponding Brownian bridge in all of the coefficients of matrix (3.2) and vector (3.3). The crucial additional benefit of this substitution is that it removes the dependence of the coefficients of the linear system on the parameter $\theta$, which implies that the EM procedure terminates after only one iteration. Therefore, by Proposition 2, the EML algorithm is guaranteed to converge to the globally optimal parameter value $\theta^\star$ in a single step. A recent parameter estimation algorithm for general one-dimensional diffusion models given in Beskos et al. (2006), based on a sophisticated sampling method known as retrospective sampling, also employs the EM approach. The EM algorithm is also used in Bladt and Sørensen (2007), where the time-reversal symmetry of ergodic diffusions is exploited to sample from the corresponding bridge. Unlike the EML algorithm, both of those settings require iterations of E-steps and M-steps in order to obtain the stationary value of the model parameter. We now need to consider the quality of the weak approximation of the law of the bridge of the diffusion $X_t$ (i.e., a process $X_t$ conditioned to start at $X_0 = x$ and finish at $X_\Delta = y$, where $\Delta$ is the length of the time interval between consecutive observations in the data) by the law of the Brownian bridge (i.e., a Brownian motion $W_t$ conditioned to start at $W_0 = x$ and finish at $W_\Delta = y$).
This question is of importance because it tells us how far the coefficients of the linear system given by the matrix (3.2) and the target vector (3.3) are from the ones used in the EML algorithm (3.5). It is intuitively clear that when $\Delta$ goes to zero, the Brownian bridge approximation must improve in quality. Since the law of the diffusion bridge is absolutely continuous with respect to the law of the Brownian bridge, it is possible to bound the approximation error explicitly in terms of $\Delta$ and the model parameter $\theta$.

Theorem 3.
Assume that the functions $g, f_i:\mathbb{R}\to\mathbb{R}$, $i\in\{0,\dots,N\}$, in the drift of SDE (3.1) satisfy the linear growth condition and are twice differentiable with bounded second derivatives, and let $G:\mathbb{R}^M\to\mathbb{R}$ be a polynomially bounded measurable function for some integer $M\in\mathbb{N}$. Let $\mathbb{Q}^{\Delta,x,y}_\theta$ denote the law of the bridge, starting at $x$ and finishing at $y$, of the diffusion $X_t$ that solves SDE (3.1) with the parameter value $\theta = (a_0,\dots,a_N)\in\mathbb{R}^{N+1}$ and let $W_t$ denote the standard Brownian motion. The measure $\mathbb{Q}^{\Delta,x,y}_\theta$ is then absolutely continuous with respect to the law of the corresponding Brownian bridge $\mathbb{W}^{\Delta,x,y}$ and the Radon–Nikodym derivative is given by

$$\frac{d\mathbb{Q}^{\Delta,x,y}_\theta}{d\mathbb{W}^{\Delta,x,y}} = \frac{L^\Delta_\theta}{E^{\mathbb{W}^{\Delta,x,y}}[L^\Delta_\theta]} \quad\text{where}\quad L^\Delta_\theta := \exp\biggl(-\frac{1}{2}\int_0^\Delta \bigl(\mu(W_s,\theta)^2 + \mu'(W_s,\theta)\bigr)\,ds\biggr).$$

(a) The following inequality holds for all $x, y$ in the domain of $X_t$ and all times $0 < t_1 < \cdots < t_M < \Delta$:

$$\bigl|E^{\mathbb{Q}^{\Delta,x,y}_\theta}[G(X_{t_1},\dots,X_{t_M})] - E^{\mathbb{W}^{\Delta,x,y}}[G(W_{t_1},\dots,W_{t_M})]\bigr| \le E^{\mathbb{W}^{\Delta,x,y}}\biggl[\biggl(\frac{L^\Delta_\theta}{E^{\mathbb{W}^{\Delta,x,y}}[L^\Delta_\theta]} - 1\biggr)^{\!2}\biggr]^{1/2}\|G\|_2,$$

where $\|G\|_2 := E^{\mathbb{W}^{\Delta,x,y}}[G(W_{t_1},\dots,W_{t_M})^2]^{1/2}$ denotes the $L^2$-norm of the random variable $G(W_{t_1},\dots,W_{t_M})$.

(b) Let $S(\theta) := \sup_{z\in\mathbb{R}}\{\mu(z,\theta)^2 + \mu'(z,\theta)\}$ and $I(\theta) := \inf_{z\in\mathbb{R}}\{\mu(z,\theta)^2 + \mu'(z,\theta)\}$ be the maximum and minimum of the integrand in $L^\Delta_\theta$, respectively. The following inequality then holds:

$$\bigl|E^{\mathbb{Q}^{\Delta,x,y}_\theta}[G(X_{t_1},\dots,X_{t_M})] - E^{\mathbb{W}^{\Delta,x,y}}[G(W_{t_1},\dots,W_{t_M})]\bigr| \le \biggl(\exp\biggl(\frac{\Delta}{2}\bigl(S(\theta) - I(\theta)\bigr)\biggr) - 1\biggr)\|G\|_2.$$

The absolute continuity of the measures $\mathbb{Q}^{\Delta,x,y}_\theta$ and $\mathbb{W}^{\Delta,x,y}$ is well known and the form of the Radon–Nikodym derivative in Theorem 3 follows from Lemma 1 in Beskos et al. (2006) and expressions (A.1) and (A.2) in Appendix A.
The inequality in part (a) of the theorem bounds the error arising from the approximation of the law $\mathbb{Q}^{\Delta,x,y}_\theta$ by the measure $\mathbb{W}^{\Delta,x,y}$ in terms of the variance of the Radon–Nikodym derivative and the $L^2$-norm of the integrand. Since the latter is independent of the model parameter $\theta$, this inequality provides a way of bounding the error for a general integrand $G$ in terms of the second moment of the Radon–Nikodym derivative. In practice, the second moment can always be estimated by simulation, thus yielding a model-specific bound on the error of the coefficients in linear system (3.5) used in the EML algorithm. Figure 4 contains the graphs of the densities of the Radon–Nikodym derivative for the nonlinear SDE in (4.4) used in Section 4. A cursory inspection of the scale of the domains of these densities shows how tight the bound in part (a) of Theorem 3 really is, even for relatively large time steps $\Delta$.
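The second-moment factor in part (a) can indeed be estimated by straightforward simulation. The sketch below does this for an illustrative Ornstein–Uhlenbeck drift; all names and parameter values are assumptions for the example, and the bridge paths live on a discrete grid, so the time integral in $L^\Delta_\theta$ is itself approximated by a Riemann sum.

```python
import numpy as np

def rn_second_moment(mu, dmu, theta, x, y, Delta, M=64, n_paths=4000, rng=None):
    """Monte Carlo estimate of E[(L/E[L] - 1)^2]^(1/2) under the Brownian
    bridge law W^{Delta,x,y}, with
        L = exp(-0.5 * int_0^Delta (mu(W_s)^2 + mu'(W_s)) ds)
    approximated by a left Riemann sum on an M-step grid."""
    rng = np.random.default_rng(0) if rng is None else rng
    dt = Delta / M
    t = np.linspace(0.0, Delta, M + 1)
    # Brownian motion paths started at 0, then pinned into a bridge x -> y
    W = np.concatenate(
        [np.zeros((n_paths, 1)),
         np.cumsum(np.sqrt(dt) * rng.standard_normal((n_paths, M)), axis=1)],
        axis=1)
    B = x + W - (t / Delta) * (W[:, -1:] - (y - x))   # B_0 = x, B_Delta = y
    integrand = mu(B, theta) ** 2 + dmu(B, theta)
    logL = -0.5 * integrand[:, :-1].sum(axis=1) * dt  # left Riemann sum
    L = np.exp(logL - logL.max())                     # rescaling cancels in L/mean(L)
    ratio = L / L.mean()
    return np.sqrt(np.mean((ratio - 1.0) ** 2))

# illustrative OU drift mu(z) = a0 - a1*z and its state derivative
mu  = lambda z, th: th[0] - th[1] * z
dmu = lambda z, th: -th[1] * np.ones_like(z)
bound_factor = rn_second_moment(mu, dmu, (1.0, 2.0), x=0.5, y=0.6, Delta=1 / 12)
```

For this nearly linear drift and a one-month time step the factor is tiny, consistent with the tightness of the bound discussed above.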
It is intuitively clear that the Brownian bridge approximation works well for short time intervals $\Delta$ and less well as the time step grows. This view is supported by the inequality in part (b) of Theorem 3, which is a consequence of the bound in part (a). Furthermore, (b) implies that the approximation of the law $\mathbb{Q}^{\Delta,x,y}_\theta$ by $\mathbb{W}^{\Delta,x,y}$ is a good one, even for larger time steps $\Delta$, provided that the drift $\mu(\cdot,\theta)$ does not vary much as a function of the state. This implies that the method of approximation proposed in the EML algorithm would work well in the case of the diffusion with a periodic drift used in Example 1 in Beskos et al. (2006), even for large time steps $\Delta$. Also, note that the norm of the random variable $G(W_{t_1},\dots,W_{t_M})$ in the Hilbert space $L^2(\mathbb{W}^{\Delta,x,y})$ is finite for a polynomially bounded function $G$ because the law $\mathbb{W}^{\Delta,x,y}$ of the Brownian bridge is Gaussian with bounded variance. The proof of Theorem 3 can be found in Appendix B. Having replaced the law of the diffusion bridge $\mathbb{Q}^{\Delta,x,y}_\theta$ by the law of the Brownian bridge $\mathbb{W}^{\Delta,x,y}$, which is independent of the parameter $\theta$, we are left with the task of calculating the expectations in the coefficients of matrix (3.2) and vector (3.3). A numerical integration approach would be feasible because we have an explicit formula for the normal density of the marginals of the probability measure $\mathbb{W}^{\Delta,x,y}$. However, because of the numerous two-dimensional integrals in (3.3), the problem does not lend itself well to this approach. An alternative approach is to simulate the paths of the Brownian bridge and use Monte Carlo methods to obtain the relevant expectations.
This can be done by using the modified Brownian bridge sampler defined in Durham and Gallant (2002) and Chib and Shephard (2002), given by the following recursive formula:

$$u_{m+1} = u_m + \frac{u_M - u_m}{M - m} + \sqrt{\frac{M - m - 1}{M - m}}\,\delta^{1/2} Z_m,\qquad(3.4)$$

where $\delta = \Delta/M$ is the length of the time interval between consecutive auxiliary states, $u_0 = x$, $u_M = y$ are the initial and final points of the Brownian bridge and $Z_m\sim N(0,1)$ are independent random variables for all $m\in\{0,\dots,M-1\}$. It is proved in Stramer and Yan (2007b) (see Proposition 1) that the joint density of the modified Brownian bridge equals the joint law of the Brownian bridge at the discretization times, which implies that scheme (3.4) introduces no discretization bias and is therefore preferable to the Euler approximation. Since the parameter $\theta$ does not appear in the evolution equation (3.4) of the modified Brownian bridge, the EML algorithm can be described as follows. Let $x_0,\dots,x_K$ be the $K+1$ observations at times $t_0,\dots,t_K$ of the diffusion $X_t$ given by SDE (3.1) and let $\Delta = t_k - t_{k-1}$ for all $k\in\{1,\dots,K\}$. Introduce $M-1$ auxiliary data points $u_{mk}$, $m\in\{1,\dots,M-1\}$, between the observed data points $x_{k-1}$ and $x_k$ such that $u_{0k} = x_{k-1}$ and $u_{Mk} = x_k$ for all $k\in\{1,\dots,K\}$. Let $S$ be the number of simulations used. The EML algorithm then consists of the following simple steps.

Step 1. For each $k\in\{1,\dots,K\}$ and each $s\in\{1,\dots,S\}$, generate a Brownian bridge path $(u^{(s)}_{mk})_{m=0,\dots,M}$ using (3.4).

Step 2. Find the unique solution of the linear system $A\theta^\top = b$, where

$$A_{ij} := \frac{\Delta}{M}\sum_{k=1}^{K}\sum_{m=1}^{M}\sum_{s=1}^{S} f_i(u^{(s)}_{m-1,k}) f_j(u^{(s)}_{m-1,k}) \quad\text{and}\quad b_i := \sum_{k=1}^{K}\sum_{m=1}^{M}\sum_{s=1}^{S} \bigl(u^{(s)}_{mk} - u^{(s)}_{m-1,k} - g(u^{(s)}_{m-1,k})\delta\bigr) f_i(u^{(s)}_{m-1,k}),\qquad(3.5)$$

with $i, j\in\{0,\dots,N\}$, to obtain the globally optimal parameter value $\theta^\star = (a_0^\star,\dots,a_N^\star)$.

An appealing feature of the EML algorithm described above is that it circumvents the iterative process that is ubiquitous in the general EM framework. The invertibility of matrix $A$ is, by Proposition 2, equivalent to the nondegeneracy of the complete likelihood function, which is implied by the linear independence of the functions $f_i$ in the drift (3.1). Note that if the auxiliary state variables $u_{1k},\dots,u_{M-1,k}$, for all $k\in\{1,\dots,K\}$, are not introduced, then we can remove the expectation operators in (3.2) and (3.3) or, equivalently, the sums over $s$ and $m$ in Step 2 of the above algorithm. In this case, the EML algorithm reduces to classic linear regression. Under a conjugate normal prior, the estimates of the drift parameters then coincide with the posterior mean in a Bayesian analysis of the coefficients. We conclude this section with a brief comment about diffusion models with state-dependent diffusion functions. A scalar diffusion

$$dX_t = \mu(X_t,\theta)\,dt + \sigma(X_t,\vartheta)\,dW_t$$

can always be transformed, using the change of variable $Y = \gamma(X,\vartheta) = \int^X \frac{du}{\sigma(u,\vartheta)}$, which depends on the diffusion parameter vector $\vartheta$, into

$$dY_t = \mu_Y(Y_t,\theta,\vartheta)\,dt + dW_t \quad\text{where}\quad \mu_Y(y,\theta,\vartheta) = \frac{\mu(\gamma^{-1}(y,\vartheta),\theta)}{\sigma(\gamma^{-1}(y,\vartheta),\vartheta)} - \frac{1}{2}\frac{\partial\sigma}{\partial x}(\gamma^{-1}(y,\vartheta),\vartheta).\qquad(3.6)$$

Note that if the original drift $\mu(\cdot,\theta)$ is affine in the parameter $\theta$, then so is the transformed drift $\mu_Y(\cdot,\theta,\vartheta)$. Therefore, an application of the EML algorithm is feasible for any fixed value of the diffusion parameter $\vartheta$.
In Section 4, we will discuss how to apply the EML algorithm to diffusions of this kind [see the models given by (4.3) and (4.4)].
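To illustrate the change of variable (3.6), consider a square-root model $dX_t = (a_0 + a_1 X_t)\,dt + \sigma\sqrt{X_t}\,dW_t$ (an illustrative parametrization, not necessarily the one used in (4.3); all function names below are ours). The sketch computes the transform in closed form and checks formula (3.6) numerically.

```python
import numpy as np

# Lamperti transform for dX = (a0 + a1*X) dt + sigma*sqrt(X) dW:
# gamma(x) = int dx / (sigma*sqrt(x)) = 2*sqrt(x)/sigma, so Y = gamma(X) has
# unit diffusion, dY = mu_Y(Y) dt + dW, with drift (still affine in (a0, a1)):
#   mu_Y(y) = (2*a0/sigma**2 - 1/2)/y + (a1/2)*y.

def gamma(x, sigma):
    return 2.0 * np.sqrt(x) / sigma

def gamma_inv(y, sigma):
    return (sigma * y / 2.0) ** 2

def mu_Y(y, a0, a1, sigma):
    """Transformed drift, simplified by hand from (3.6)."""
    return (2.0 * a0 / sigma ** 2 - 0.5) / y + 0.5 * a1 * y

def mu_Y_via_36(y, a0, a1, sigma):
    """Direct evaluation of (3.6): mu(gamma^{-1})/sigma(gamma^{-1}) - sigma'/2,
    where sigma'(x) = d/dx [sigma*sqrt(x)] = sigma/(2*sqrt(x))."""
    x = gamma_inv(y, sigma)
    return (a0 + a1 * x) / (sigma * np.sqrt(x)) - 0.5 * sigma / (2.0 * np.sqrt(x))
```

For any fixed $\sigma$, the transformed drift is of form (3.1) with $g \equiv 0$ and basis functions proportional to $1/y$ and $y$, so the EML linear system applies directly, exactly as claimed above.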
4. Applications.
There are at least two potential applications for the EML algorithm. The first is the classical parameter estimation problem for diffusion models. The advantage of the EML approach is that the resulting parameter estimates are globally optimal and the bias introduced through the Euler approximation is, by Theorem 1, arbitrarily small. The second application is based on the fact that the EML algorithm is computationally very fast. The speed of the algorithm enables one to easily explore the dependence of the likelihood function on the diffusion parameter $\vartheta$ [see (3.6)], along with the dependence of the globally optimal drift parameter $\theta$ as a function of $\vartheta$ [see examples (4.3) and (4.4)]. Both of these applications will be illustrated in the present section.

4.1. Base cases.
To test the EML algorithm for potential biases arising in the Euler and the Brownian bridge approximations, we start by establishing two base cases. The first case is an Ornstein–Uhlenbeck diffusion [see (4.1)], where the true transition density, and therefore the likelihood function, is known. The second case is a diffusion with a nonlinear drift [see (4.2)], where we employ the closed-form likelihood expansion from Aït-Sahalia (2002) as a benchmark, because this method is known to produce very accurate approximations of the true transition density [see, e.g., Schneider (2006), Hurn, Jeisman and Lindsay (2007), Jensen and Poulsen (2002)]. To put the EML algorithm to the test, we simulate 1000 data sets from each of the two models [the Ornstein–Uhlenbeck process in (4.1) and the nonlinear diffusion in (4.2)] with $K+1 = 500$ observations in each data set. As mentioned above, in the first case, we perform an exact ML estimation on each of the data sets using the exact transition density in the likelihood function; in the second case, we first apply the closed-form likelihood expansion from Aït-Sahalia (2002) to obtain an approximation to the likelihood function, which is then used in quasi-ML estimation. The two models are given by

$$dX_t = (a_0 - a_1 X_t)\,dt + dW_t, \qquad\text{model A},\qquad(4.1)$$
$$dX_t = (a_0 + a_1 X_t + a_2 X_t^3)\,dt + dW_t, \qquad\text{model B},\qquad(4.2)$$

with parameter values $a_0 = 10$ and $a_1 = 2$ for model A, and $a_0 = 1$ together with negative values of $a_1$ and $a_2$ for model B. Time $t$ is measured in years and the consecutive observations in the generated data sets are one month apart. In other words, $\Delta = 1/12$, and we choose an auxiliary state variable for each day of the month, that is, $M-1 = 30$. The EML procedure is run once with $S = 1000$ and once with $S = 200$ simulated paths. Figure 1 displays the comparison of the EML procedure using 1000 simulations with the direct ML estimation for model A. The biases and standard deviations of the parameter estimates are shown in Table 1.
Empirical distribution of estimates from model A. This figure shows the empirical distribution of the maximum likelihood estimator and the EML estimator over 1000 estimations for the Ornstein–Uhlenbeck model (4.1). Plotted are the maximum likelihood estimator, obtained by maximizing the true transition density using a gradient solver, and the EML estimator, obtained by using formula (3.5).

Table 1
Base cases A and B: bias and standard deviation. This table shows the estimation results for the parameters from equations (4.1) and (4.2) on 1000 data sets generated using the true transition density for model A and a very fine Euler approximation for model B (100 auxiliary data points). The benchmark estimations for model A are performed using exact maximum likelihood. The benchmark estimations for model B are obtained using closed-form likelihood expansions. The column "Bias" shows the mean bias of the estimated parameters; bias is defined as θ̂^(i) − θ_0 for the ith estimation. The column "Std. Dev." shows the standard deviation of the parameter estimates. For the EML estimation, 30 auxiliary data points were used.

ML/Aït-Sahalia | EML S = 1000 | EML S = 200
θ | Bias | Std. Dev. | Bias | Std. Dev. | Bias | Std. Dev.
[Table body: bias and standard-deviation rows for a_0 and a_1 of model A and for a_0, a_1 and a_2 of model B; the numeric entries are not legible in this extraction.]
Explicit gradients are used in the likelihood search in the case of both the exact likelihood function and the closed-form likelihood expansion. A striking observation is the higher bias of the ML estimator relative to the EML estimator. The closed-form likelihood expansion and EML display similar biases. No noticeable difference can be seen between choosing 200 or 1000 simulations to approximate the expectations in the EML algorithm. In Stramer and Yan (2007a), the authors suggest that, in the related problem of Monte Carlo estimation of the transition densities of diffusions, the optimal number of simulations S is of the order M^2, which, in the two cases discussed here, amounts to approximately 900 simulations. Also, note that the EML procedure takes about a second to produce the optimal parameter values for each of the data sets described in this subsection. The hardware used to perform these experiments was a PC with a 64-bit Xeon 2.8 GHz processor.

4.2. Exploring the likelihood function.
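This section repeatedly uses the SML transition-density estimator with the modified Brownian bridge importance sampler of Durham and Gallant (2002). A minimal sketch for a unit-diffusion SDE is given below (illustrative values; in the driftless case the sampler coincides with the exact Brownian bridge, so the estimator reproduces the Gaussian transition density exactly, which makes the sketch easy to check).

```python
import numpy as np

rng = np.random.default_rng(2)

def sml_density(x0, xT, T, M=10, S=5_000, mu=lambda x: 0.0):
    """Simulated ML estimate of the transition density p(xT | x0) of
    dX = mu(X) dt + dW, using a modified Brownian bridge sampler over
    M subintervals and S importance-sampling paths."""
    d = T / M
    logw = np.zeros(S)
    x = np.full(S, x0)
    for j in range(M - 1):
        rem = T - j * d                      # time remaining until xT
        m = x + (xT - x) * d / rem           # sampler mean (pulled toward xT)
        v = d * (rem - d) / rem              # sampler variance
        xn = m + np.sqrt(v) * rng.standard_normal(S)
        # log Euler transition density minus log sampler density
        logw += -(xn - x - mu(x) * d) ** 2 / (2 * d) - 0.5 * np.log(2 * np.pi * d)
        logw -= -(xn - m) ** 2 / (2 * v) - 0.5 * np.log(2 * np.pi * v)
        x = xn
    # last subinterval: the endpoint xT is fixed, only the Euler density enters
    logw += -(xT - x - mu(x) * d) ** 2 / (2 * d) - 0.5 * np.log(2 * np.pi * d)
    return np.mean(np.exp(logw))

# For zero drift the estimator is exact: compare with the N(x0, T) density.
est = sml_density(0.0, 0.5, 1.0)
exact = np.exp(-0.5**2 / 2) / np.sqrt(2 * np.pi)
print(est, exact)        # the two values coincide in the driftless case
```

For a nonzero drift `mu`, the same routine gives an unbiased Monte Carlo estimate of the Euler-scheme transition density, and the sampled bridge paths can be reused across parameter values, which is the computational advantage exploited below.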
The empirical features of the dynamics of equity indices such as the S&P 100 include time-varying volatility, a level effect for the volatility of the variance of returns [see Jones (2003a)] and evidence for jumps [Andersen, Benzoni and Lund (2002)]. We are now going to investigate how a diffusion model, specified by a nonlinear SDE, fits the implied volatility data. In this section, we study the relation between the diffusion and drift parameters for each of the processes

dV_t = κ(θ − V_t) dt + σ√V_t dW_t,   model I, (4.3)
dV_t = (a_0 + a_1 V_t + a_2 V_t^2) dt + √(σ_1 V_t + σ_2 V_t^2) dW_t,   model II, (4.4)

conditional on S&P 100 implied volatility data given by a time series of the volatility index VXO. Empirical studies in Jones (2003a) and Aït-Sahalia and Kimmel (2007) have rejected the square root process (4.3) as a specification for the variance dynamics of the S&P 100. Nevertheless, the relation between the parameters of the square root process, conditional upon real data, can be investigated using the EML algorithm. The second model is a nonlinear diffusion (4.4), introduced by Aït-Sahalia (1996), and is potentially flexible enough to accommodate the rich dynamics exhibited by the S&P 100 implied volatility index data. We start by transforming the SDEs in (4.3) and (4.4) into a form with a unit diffusion coefficient using formula (3.6). For model I, we apply the transformation y(x) = 2√x/σ, which yields

dY_t = (b_0/Y_t + b_1 Y_t) dt + dW_t.  (4.5)

Model II requires the change of variable y(x) = (2/√σ_2) log(√(σ_2 x) + √(σ_1 + σ_2 x)), which transforms it into dY_t = μ(Y_t) dt + dW_t with

μ(y) = (a_0 + a_1 x + a_2 x^2)/√(σ_1 x + σ_2 x^2) − (σ_1 + 2σ_2 x)/(4√(σ_1 x + σ_2 x^2)),

where x = x(y) denotes the inverse of the transformation. In the case of model I, we also perform maximum likelihood estimation with the true transition density of the square root variance (4.5).
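The unit-diffusion reduction for model I can be sanity-checked numerically. The generic formula from the transformation is μ_Y(y) = μ_V(x)/σ_V(x) − σ_V′(x)/2 evaluated at x = (σy/2)², and a short substitution gives the closed form b_0/y + b_1 y with b_0 = 2κθ/σ² − 1/2 and b_1 = −κ/2 (these expressions for b_0 and b_1 are derived here for illustration, not quoted from the paper). The two should agree to roundoff:

```python
import numpy as np

# Model I: dV = kappa*(theta - V) dt + sigma*sqrt(V) dW (illustrative values).
kappa, theta, sigma = 5.0, 0.1, 0.3

def mu_V(x):             # drift of V
    return kappa * (theta - x)

def sig_V(x):            # diffusion coefficient of V
    return sigma * np.sqrt(x)

def dsig_V(x):           # derivative of the diffusion coefficient
    return sigma / (2.0 * np.sqrt(x))

# Generic unit-diffusion drift: mu_Y(y) = mu_V(x)/sig_V(x) - dsig_V(x)/2,
# evaluated at x = x(y), the inverse of y(x) = 2*sqrt(x)/sigma.
def mu_Y_generic(y):
    x = (sigma * y / 2.0) ** 2
    return mu_V(x) / sig_V(x) - 0.5 * dsig_V(x)

# Closed form b0/y + b1*y obtained by substituting x = (sigma*y/2)^2.
b0 = 2.0 * kappa * theta / sigma**2 - 0.5
b1 = -kappa / 2.0
def mu_Y_closed(y):
    return b0 / y + b1 * y

ys = np.linspace(0.5, 5.0, 10)
print(np.max(np.abs(mu_Y_generic(ys) - mu_Y_closed(ys))))  # roundoff level
```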
The resulting parameters are θ = 0.…, κ = 5.… and σ = 0.… . For a fixed value of σ, we can transform the time series for the volatility index using the change of coordinates y(x) = 2√x/σ and perform the estimation of the parameters in (4.5) using EML. This operation takes about one second on a personal computer (64-bit Xeon 2.8 GHz, running Linux) for the given data set. By repeating this process for each value of σ on a finite grid in the interval [0.…, 0.…], we can regard the model (4.5) as having a single parameter σ and apply the SML algorithm [as described in Durham and Gallant (2002)] or some analytic likelihood approximation to find the likelihood function for σ [see Figure 2(a)]. This application of EML has therefore reduced the dimension of the parameter space from 3 to 1. Note that, in this approach, SML affords an additional computational advantage over other likelihood approximation methods because one can reuse the modified Brownian bridge paths that were generated for EML as an importance sampler when computing the likelihood function. Using the time series of implied volatilities, an analogous estimation can be performed for the model given by SDE (4.4).

Fig. 2. Model I. (a) displays the maximum likelihood function (up to a proportionality factor) of specification (4.3) as a function of σ on the x-axis, conditional on S&P 100 implied volatility data. For a given σ, the optimal values of θ⋆ and κ⋆ are computed using EML. The likelihood function is computed using the SML algorithm with the Brownian bridge importance sampler [see Durham and Gallant (2002)]. (b) displays globally optimal values θ⋆ and κ⋆ as functions of σ.

The noncentral chi-square density is given in terms of special functions that are difficult to handle numerically for nonsymbolic computational tools with finite precision, such as Fortran, C/C++ and MATLAB. We perform this estimation using Mathematica.

The analysis of the likelihood functions, together with the optimal drift parameters for the processes (4.3) and (4.4), provides some interesting insights into the process specification, as well as the estimation method. The first observation is that the ML estimates obtained from the true transition density of the square root process (4.3) agree with the EML estimates. This is clear from Figures 2(a) and 2(b) at the point σ = 0.…, where the globally optimal drift parameters and the likelihood function are smooth in the parameters of the state-dependent volatility function.
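The profiling loop just described can be sketched end to end on simulated data (illustrative parameter values; the plain Euler quasi-likelihood stands in for the paper's SML/EML machinery): for each σ on a grid, transform the data by y = 2√x/σ, fit the drift parameters of (4.5) by the one-step linear solve, and record the resulting Gaussian quasi-likelihood, including the Jacobian term from the change of variables.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate a square-root (model I) path with a fine Euler scheme, then subsample.
kappa, theta, sigma = 5.0, 0.1, 0.3
h, n_fine, thin = 1e-3, 200_000, 10
dW = rng.normal(0.0, np.sqrt(h), n_fine)
v = np.empty(n_fine + 1)
v[0] = theta
for k in range(n_fine):
    v[k + 1] = abs(v[k] + kappa * (theta - v[k]) * h + sigma * np.sqrt(v[k]) * dW[k])
x = v[::thin]                       # observations with spacing delta
delta = h * thin

def profile_loglik(s):
    """Quasi-log-likelihood of the data at diffusion parameter s, with the
    drift parameters (b0, b1) of dY = (b0/Y + b1*Y) dt + dW profiled out
    by a linear solve (Euler-scheme normal equations)."""
    y = 2.0 * np.sqrt(x) / s        # candidate unit-diffusion coordinates
    dy = np.diff(y)
    f = np.column_stack([1.0 / y[:-1], y[:-1]])
    b = np.linalg.solve(f.T @ f * delta, f.T @ dy)   # globally optimal drift
    resid = dy - (f @ b) * delta
    # Gaussian transition in y, plus log-Jacobian |dy/dx| = 1/(s*sqrt(x)).
    ll = -np.sum(resid**2) / (2.0 * delta) - 0.5 * len(dy) * np.log(2 * np.pi * delta)
    ll -= np.sum(np.log(s * np.sqrt(x[1:])))
    return ll

grid = np.linspace(0.15, 0.6, 91)
s_hat = grid[np.argmax([profile_loglik(s) for s in grid])]
print(s_hat)                        # close to the true sigma = 0.3
```

The profiled likelihood is maximized near the true σ, mirroring the dimension reduction from 3 parameters to 1 described above; on real data one would replace the Euler quasi-likelihood with the SML estimate and the observed VXO series.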
Preliminary EML estimates also suggest that the Euler approximation on a daily level is not sufficient for a nonlinear diffusion like the one in (4.4). Even a discretization of 5 subintervals per day appears too coarse. The estimates stabilize between 10 and 30 subintervals. This is in line with the findings in Roberts and Stramer (2001), Figure 4, in a similar setting. The likelihood function for model (4.4) is extremely flat in the diffusion parameters close to the optimal region (see Figure 3), so care should be taken with the estimation. Finally, a likelihood-ratio test applied to the two variance models reveals that the specification (4.4) is preferable to the square root specification.
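The likelihood-ratio comparison of the two nested variance specifications can be sketched as follows. If the square-root model is taken to correspond to the restrictions a_2 = 0 and σ_2 = 0 in (4.4) (an assumption made here for illustration), the test has 2 degrees of freedom; the log-likelihood inputs below are placeholders, not values from the estimation above, and 5.991 is the standard 95% quantile of the χ² distribution with 2 degrees of freedom.

```python
# Likelihood-ratio test of a restricted model against a nesting full model.
# Chi-square 95% critical values for small degrees of freedom (standard tables).
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815}

def lr_test(loglik_restricted, loglik_full, df):
    """Return the LR statistic 2*(ll_full - ll_restricted) and whether the
    restricted model is rejected at the 5% level."""
    stat = 2.0 * (loglik_full - loglik_restricted)
    return stat, stat > CHI2_95[df]

# Placeholder log-likelihood values, for illustration only:
stat, reject = lr_test(loglik_restricted=1510.2, loglik_full=1523.7, df=2)
print(round(stat, 1), reject)   # 27.0 True
```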
5. Conclusion.
This paper is concerned with an approximation procedure for the likelihood function of a nonlinear diffusion, given a discrete set of observations. The method can be used with any algorithm for sampling from the law of the diffusion bridge [e.g., Beskos et al. (2006) or Bladt and Sørensen (2007)] and is shown to converge uniformly on compact subsets of the parameter space (see Theorem 1).

We also develop a new expected maximum likelihood (EML) algorithm for the estimation of parameters governing a nonlinear diffusion process. It provides globally optimal parameter values when the drift is affine in the coefficients and the diffusion function is constant. For diffusions with a state-dependent volatility function, our method is used to express the likelihood as a function of the volatility parameters only, thereby significantly reducing the dimension of the parameter space for a gradient-based solver. The framework is easy to implement and is guaranteed to solve the expectation maximization problem in a single iteration. It uses auxiliary data points and is based on two observations: first, that the complete likelihood (i.e., the joint likelihood of the observed and auxiliary data points) of the Euler scheme approximates uniformly the complete likelihood of the diffusion as the time interval between the consecutive auxiliary data points goes to zero; second, that the law of the Brownian bridge approximates well the law of the diffusion bridge. Global optimality (Proposition 2), theoretical bounds (Theorem 3) and asymptotic results (Theorem 1) are established, quantifying the quality of the approximations. Additional numerical experiments suggest that the method works very well for multivariate nonlinear diffusions, even for large time intervals between observed data points.

It is beneficial for the stability of the method to keep the random numbers fixed in all of the expectations arising in (3.2) and (3.3). This principle is shown to guarantee the convergence of the MCEM algorithm in Papaspiliopoulos and Sermaidis (2007).

Fig. 3. Model II. The likelihood of model (4.4) as a function of σ_1 (on the x-axis) and σ_2 (on the y-axis), conditional on S&P 100 implied volatility data. For given σ_1 and σ_2, the values a_0⋆, a_1⋆ and a_2⋆ are computed using EML. The likelihood function for model II given in (4.4) is computed using the SML algorithm with the Brownian bridge importance sampler [see Durham and Gallant (2002)].

Fig. 4. Simulated densities of the Radon–Nikodym derivative dQ_θ^{∆,x,y}/dW^{∆,x,y} of the law of the bridge of the diffusion X_t with respect to the law of the Brownian bridge (see Theorem 3 for the precise definition). The diffusion X_t used in this example is given by the nonlinear SDE (4.4). The parameter value θ is implied by the VXO data. The coordinates of θ are approximately given by a_0 = 0.…, a_1 = −…, a_2 = −… and the diffusion coefficients are σ_1 = 0.… and σ_2 = 3.… The graphs correspond to time-horizons ∆ equal to two weeks and one month, and a fixed starting point x = 0.… for the diffusion bridge and the Brownian bridge. Three different end points y of the respective bridges are chosen for each time-horizon. Note that the density in all of the cases is concentrated in a small neighborhood of 1, thus making the bound in (a) of Theorem 3 very tight, even for time intervals ∆ as long as one month.

A topic for further research is the possible extension of the EML framework to the estimation of jump-diffusions.
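The bridge construction underlying such extensions rests on the Markov factorization p(x_{τ_1} | x_{τ_0}, x_{τ_2}) = p(x_{τ_1} | x_{τ_0}) p(x_{τ_2} | x_{τ_1}) / p(x_{τ_2} | x_{τ_0}). As a sketch, this identity can be checked against the known Gaussian bridge density in the Brownian case (standard formulas; the numerical values are illustrative):

```python
import math

def norm_pdf(z, mean, var):
    return math.exp(-(z - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

# Transition density of standard Brownian motion.
def p(x1, x0, dt):
    return norm_pdf(x1, x0, dt)

# Markov factorization of the bridge density at an intermediate time t1:
x0, x1, x2 = 0.0, 0.3, -0.2
t0, t1, t2 = 0.0, 0.4, 1.0
bridge_factorized = p(x1, x0, t1 - t0) * p(x2, x1, t2 - t1) / p(x2, x0, t2 - t0)

# Known closed form: the Brownian bridge from (t0, x0) to (t2, x2) at time t1
# is Gaussian with mean x0 + (x2 - x0)*(t1 - t0)/(t2 - t0) and
# variance (t1 - t0)*(t2 - t1)/(t2 - t0).
mean = x0 + (x2 - x0) * (t1 - t0) / (t2 - t0)
var = (t1 - t0) * (t2 - t1) / (t2 - t0)
bridge_exact = norm_pdf(x1, mean, var)

print(abs(bridge_factorized - bridge_exact))  # roundoff level
```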
Instead of using the law of the Brownian bridge, a semi-nonparametric density [see Gallant and Tauchen (2009, 2006)] could be used to approximate the conditional density p(x_{τ_1} | x_{τ_0}, x_{τ_2}) = p(x_{τ_1} | x_{τ_0}) p(x_{τ_2} | x_{τ_1}) / p(x_{τ_2} | x_{τ_0}), 0 ≤ τ_0 < τ_1 < τ_2, of the corresponding bridge process with jumps. The complete likelihood function would then be obtained by conditioning on a high-frequency discretization (i.e., using many auxiliary state variables) of the jump-diffusion, which would identify well the jump parameters.

APPENDIX A: PROOF OF THEOREM 1

Recall that we have a diffusion X_t which is a solution of the SDE dX_t = μ(X_t, θ) dt + dW_t, where μ(·, θ) : R → R is a bounded Lipschitz function with bounded first and second derivatives. Without loss of generality, we can assume that the parameter space Θ is a compact subset of R^{N+1}. For s > t, let π_{t,s}(x | x_0, θ) denote the probability density function of X_s conditional on X_t = x_0. It is well known that such a density exists [see (7) and (8) in Rogers (1985)] and, by Girsanov's theorem, can be expressed as

π_{t,t+δ}(x_1 | x_0, θ) = (1/√(2πδ)) e^{−(x_1 − x_0)^2/(2δ)} e^{∫_{x_0}^{x_1} μ(u,θ) du} Φ_θ(δ, x_0, x_1),  (A.1)

where

Φ_θ(δ, x_0, x_1) := E[ exp( −(δ/2) ∫_0^1 g_θ(x_0 + u(x_1 − x_0) + √δ W_u^0) du ) ].  (A.2)

Here, W^0 denotes the Brownian bridge W_u^0 := W_u − uW_1, for u ∈ [0, 1], and the function g_θ is given by g_θ(u) := μ′(u, θ) + μ(u, θ)^2. Our task is to prove that the sum

∑_{k=1}^{∆/δ−1} E_{Q_θ}[ log( π_{t,t+δ}(X_{δ(k+1)} | X_{δk}, θ) / φ(X_{δ(k+1)}; X_{δk} + δμ(X_{δk}, θ), δ) ) ]  (A.3)

converges to zero uniformly for all θ in the compact parameter space Θ, where Q_θ is the law of the bridge of the diffusion X_t, for t ∈ [0, ∆], which starts at X_0 = x_0 and finishes at X_∆ = x_∆ for any fixed pair of real numbers x_0, x_∆ ∈ R.
The function φ(y; x, δ) in this expression is the normal density function with mean x and variance δ.

The first step in the proof of Theorem 1 is to understand the integrand

log( π_{t,t+δ}(y | x, θ) / φ(y; x + δμ(x, θ), δ) ) = (δ/2) μ(x, θ)^2 − (y − x) μ(x, θ) + ∫_x^y μ(u, θ) du + log(Φ_θ(δ, x, y)).  (A.4)

We start with the following claim.
Claim 1.
The integral of the drift can be expressed as ∫_x^y μ(u, θ) du = (y − x) μ(x, θ) + ((y − x)^2/2) μ′(x, θ) + (y − x)^3 h_θ(x, y), where h_θ(x, y) is a bounded measurable function which is linear in θ.

Let μ(u, θ) = μ(x, θ) + (u − x) μ′(x, θ) + ((u − x)^2/2) μ′′(ξ_{x,u}, θ) be the Taylor approximation of order 2 of the drift μ(·, θ). The point ξ_{x,u} lies in the interval (x, u) ⊂ R [or in (u, x), if x is larger than u]. For any fixed y in R, we can integrate this representation of μ(·, θ) to obtain the representation of the integral which is given in Claim 1. We need to check that the function h_θ(x, y) := (1/(2(y − x)^3)) ∫_x^y (u − x)^2 b_θ(u, x) du, for x ≠ y, is bounded and measurable. Here, b_θ(u, x) is given by the quotient b_θ(u, x) := 2[μ(u, θ) − μ(x, θ) − (u − x) μ′(x, θ)]/(u − x)^2 if u ≠ x, and is zero otherwise. Note that the function b_θ is bounded, since μ′′(·, θ) is bounded and linear in θ. Since the set Θ is compact, the estimate |h_θ(x, y)| ≤ (C/(y − x)^3) ∫_x^y (u − x)^2 du = C/3 holds for all y > x and some constant C. A similar bound holds for y < x. It follows from the definition of b_θ that it is measurable on R × R, since it is continuous outside the zero measure set {(x, x) : x ∈ R}. Fubini's theorem implies that h_θ(x, y) must therefore also be measurable. This proves Claim 1.

The next task is to relate the asymptotic behavior of the function log(Φ_θ(δ, x, y)) to the drift μ(·, θ). This will be achieved in Claims 2 and 3.

Claim 2.