Approximate Maximum Likelihood for Complex Structural Models
Veronika Czellar*, David T. Frazier†, and Eric Renault‡
June 19, 2020
Abstract
Indirect Inference (I-I) is a popular technique for estimating complex parametric models whose likelihood function is intractable; however, the statistical efficiency of I-I estimation is questionable. While the efficient method of moments, Gallant and Tauchen (1996), promises efficiency, the price to pay for this efficiency is a loss of parsimony and thereby a potential lack of robustness to model misspecification. This stands in contrast to simpler I-I estimation strategies, which are known to display less sensitivity to model misspecification precisely due to their focus on specific elements of the underlying structural model. In this research, we propose a new simulation-based approach that maintains the parsimony of I-I estimation, which is often critical in empirical applications, but can also deliver estimators that are nearly as efficient as maximum likelihood. This new approach is based on a constrained approximation to the structural model, which ensures identification and can deliver nearly efficient estimators. We demonstrate this approach through several examples, and show that it can deliver estimators that are nearly as efficient as maximum likelihood, when the latter is feasible, but can be employed in many situations where maximum likelihood is infeasible.
Keywords: Equality Restrictions; Constrained Inference; Indirect Inference; Generalized Tobit; Markov-Switching Multifractal Models.
* Department of Data Science, Economics and Finance, EDHEC Business School, France.
† Department of Econometrics and Business Statistics, Monash University, Melbourne, Australia. Corresponding author: [email protected].
‡ Department of Economics, University of Warwick and Department of Econometrics and Business Statistics, Monash University.

1 Introduction

Indirect inference (hereafter, I-I), as proposed by Smith (1993) and Gourieroux et al. (1993), is a simulation-based estimation method often used when the underlying likelihood for the model of interest is computationally challenging, or intractable. The key idea underpinning I-I is that, regardless of how complicated the structural model, it is often feasible to simulate artificial data from this fully parametric model. As a result, statistics based on the observed data and data simulated from the model can be compared, with the resulting difference minimized in a given norm to produce an estimator of the structural parameters.

The implementation of I-I is most often carried out using an auxiliary model that represents an incorrect, but tractable, version of the structural model under analysis. User-friendly estimators for the parameters of this auxiliary model provide the statistics, based on the observed and simulated data respectively, that are used to conduct inference on the underlying structural parameters. However, by definition the information encapsulated in the auxiliary parameter estimates is less than the information carried in the likelihood for the structural parameters. As such, in any implementation of I-I there is a fundamental trade-off between the statistical efficiency of the resulting estimators and their computational feasibility.

The main contribution of this paper is to propose an alternative to I-I that produces structural parameter estimates that, albeit also simulation-based, are arguably closer to reaching the Cramer-Rao efficiency bound for the parametric structural model.
The new method proposed herein, dubbed "Approximate Maximum Likelihood" (hereafter, AML), maintains the standard philosophy of I-I: one can resort to a possibly biased approximation of the structural model, insofar as matching statistics calculated from this approximation, using both simulated and observed data, will allow us to erase the misspecification bias. In contrast to standard I-I, instead of matching estimators of auxiliary parameters, we directly match a proxy/approximation to the score vector of the intractable log-likelihood. These proxies are indexed by the vector of structural parameters, for which a preliminary plug-in estimator (based on observed data) must be used.

However, as we later demonstrate, the dependence of this approach on the preliminary plug-in estimator differs from standard I-I estimation: as far as the asymptotic distribution of our AML estimator is concerned, the asymptotic distribution of the preliminary estimator is immaterial, and only its probability limit (a pseudo-true value possibly different from the true unknown value) will impact the information conveyed by the approximate score. This is in stark contrast to I-I estimation, where the key feature in determining the asymptotic efficiency of I-I is the efficiency of the auxiliary parameter estimates. As such, since it is only the probability limit of the plug-in estimator that matters, our new AML approach cannot be directly placed in the standard I-I framework.

While this new approach is based on matching types of scores, it should not be confused with the score-based version of I-I proposed by Gallant and Tauchen (1996). As shown by Gourieroux, Monfort and Renault (1993) (see "The Third Version of the Indirect Estimator" in their Appendix 1), Gallant and Tauchen's (1996) estimator is actually tantamount to matching estimators of auxiliary parameters.
In particular, when fishing for efficiency, Gallant and Tauchen (1996) (see the proof of their Theorem 2) ultimately import the efficiency of the estimator of auxiliary parameters to reach the Cramer-Rao efficiency bound for the structural parameters, with this efficiency claim ultimately requiring that the auxiliary model "smoothly embeds" the structural model.

In short, the "efficient method of moments" of Gallant and Tauchen (1996) must resort to a semi-nonparametric score generator as an auxiliary model. Thanks to its steadily increasing dimension, the score of this auxiliary model may asymptotically span the score of the structural model, and thereby deliver efficient estimators of the resulting structural parameters. However, the price to pay for this efficiency is a highly-parametrized auxiliary model that may be ill-behaved (due to its non-parsimonious nature) when there are deviations from the underlying model structure, i.e., when the structural model may be partly misspecified. This is in contrast to standard I-I estimation, which has been shown to be somewhat robust to deviations from the underlying modelling assumptions (see, e.g., Dridi et al., 2007), precisely because it is based on calibrating a limited number of structural parameters. Our new method remains true to this parsimony principle since we match proxies for the actual score vector, whose dimension is the same as that of the structural parameters.

In our AML approach, (approximate) efficiency of the structural parameter estimates does not rest upon high-dimensional inference or the near-efficiency of auxiliary parameter estimates, but on the conjunction of two properties.
• First, the efficiency gap between our estimates and the MLE is tightly related to the difference between the asymptotic value of our plug-in estimator for the structural parameters (i.e., the pseudo-true value that will asymptotically feature in our proxy/approximation for the true limiting score function) and the true unknown value of the structural parameters.

• Second, the Cramer-Rao efficiency bound can be (nearly) reached if the information identity is (nearly) maintained. More precisely, the question is to assess the difference between the curvature of the log-likelihood at the true value of the structural parameters (as measured by the slope of the expected score vector as a function of the structural parameters) and the slope of the score vector when the structural parameters enter the score through data simulated at a specific parameter value. Satisfaction of the information identity in this context requires a type of multiplicative separability of the score vector, which we later demonstrate is satisfied for exponential models.

The motivation for our AML approach is the observation that there are many cases of interest where the intractability of the assumed model, and its likelihood, is entirely due to a sub-vector of structural parameters. Examples include, for instance, dynamic discrete choice models with ARMA errors (Robinson, 1982, Gourieroux et al., 1985, Poirier and Ruud, 1988), spatial discrete choice models (see, e.g., Pinkse and Slade, 1998), and many dynamic equilibrium models. In such models, a few well-chosen restrictions would allow us to alleviate the intractability of the likelihood due to the presence of certain latent variables.

More generally, many complex economic models are such that imposing a (potentially false) constraint on the structural model yields a simpler auxiliary model with a computationally tractable likelihood.
This is precisely the reason why score/LM tests are popular in econometrics: estimation and testing "under the null" is feasible even in very complicated models. Unfortunately, imposition of this constraint, and subsequent optimization of the constrained log-likelihood, will not deliver consistent estimates of the structural parameters if the constraint is not satisfied at the truth.

As recently pointed out by Calvet and Czellar (2015), imposing potentially false equality constraints on a given structural model can be an attractive method for obtaining simple and rich auxiliary models for the purposes of I-I. For instance, in the context of a long-run risk model (Bansal and Yaron, 2004), Calvet and Czellar (2015) demonstrate that imposing specific equality constraints on certain parameters produces a simple auxiliary model for use in I-I (with a computationally tractable likelihood function) that closely resembles the structural model. The fact that this resulting auxiliary model may not deliver consistent estimates of the true structural parameters is immaterial insofar as matching a simulation-based approximation against the observation-based version will allow us to erase the misspecification bias. The benefits of such an approach are two-fold: one, by using constraints to define the auxiliary model, we sketch a systematic strategy for the choice of an auxiliary model; two, this auxiliary model closely matches the structural model and is therefore very useful for issues of robustness and efficiency.

However, while highly useful, the suggestion of Calvet and Czellar (2015) is incomplete, and does not allow for consistent estimation of the structural parameters on its own. That is, since we impose a number of constraints on the auxiliary model, by definition the auxiliary model cannot consistently estimate all the structural parameters, except in the unlikely case where the constraints are satisfied at the true value of the structural parameters.
To circumvent this issue, Calvet and Czellar (2015) propose to augment the statistics obtained from the auxiliary model with additional statistics so that, when considered jointly, this new vector can identify the structural parameters when estimated by I-I.

Motivated by the above ideas and the approach to handling constraints within I-I proposed in Calzolari et al. (2004) and Frazier and Renault (2019), we propose a novel inference approach based on constraining the structural model parameters to create a simple, but highly informative, proxy for the score vector that can be used to estimate the structural parameters. However, unlike the strategy put forward by Calvet and Czellar (2015), our approach provides an automatic, and nearly efficient, method to identify the structural parameters.

In addition, we demonstrate that this AML strategy can be based on a proxy for the score vector which entails additional layers of approximation beyond simply plugging in a (wrongly) constrained estimate of the structural parameters. For example, in the context of stable probability distributions, the likelihood function is known in closed form only at certain specific values of the parameters; for instance, a unit shape parameter (a = 1) and a zero value of the asymmetry parameter (b = 0) yield a Cauchy likelihood. However, even then the partial derivatives of the likelihood function with respect to a and b are not available in closed form. In such settings, our AML strategy can be implemented by invoking an additional layer of approximation and replacing the directions of our score vector proxy that cannot be obtained in closed form with a finite-difference approximation. Approximating certain directions of the score vector by finite differences is obviously even more useful when some structural parameters are only defined on the integers.
We demonstrate our methodology in such cases using the example of Markov-switching multifractal (MSM) volatility processes, Calvet and Fisher (2004, 2008), which are especially well-suited to capture volatility dynamics through an unknown, but finite, number of multiplicative components.

While we apply our AML methodology within the confines of an MSM volatility model, we note here that the use of MSM models is not exclusive to the analysis of volatility. Indeed, Chen, Diebold and Schorfheide (2013) propose a novel Markov-switching multifractal duration (MSMD) model to analyze inter-trade duration data in financial markets, and demonstrate its superiority over competing duration models. While we exemplify the AML procedure within an MSM volatility model, we note here that AML can be equivalently applied to the MSMD model of Chen et al. (2013) using precisely the same approach detailed in this paper.

The remainder of the paper is organized as follows. In Section 2, we give the general setup, discuss several interesting examples where equality constraints on the structural model yield a tractable score vector that can be used for inference through score matching, and discuss our AML estimation strategy. We also demonstrate that, in contrast to standard I-I, the choice of an auxiliary estimator is immaterial, beyond the pseudo-true value of the structural parameters that it defines.

In Section 3, we provide the asymptotic theory of AML. Further, we demonstrate that, in the case of an exponential model, a sufficient (but not necessary) condition for AML estimators to achieve the Cramer-Rao efficiency bound is that the pseudo-true value used in AML coincides with the true one.
Section 4 provides Monte Carlo evidence on the finite-sample performance of AML in two leading examples: one based on false equality constraints, and one where we are required to define some of the pseudo-score components using a finite-difference approximation, with the latter example containing an empirical application to financial returns data using a multifractal stochastic volatility model. Monte Carlo evidence on the application to stable distributions is provided in Appendix D. Section 5 concludes with suggestions for future research on extensions of I-I where not only do the two vectors to be matched depend on the observed data, as in this paper, but even the simulator itself may depend on the observed data. Mathematical details for the proofs of the main results and developments of the theoretical examples are provided in Appendices A, B and C.

2 General Setup

Following Gourieroux, et al. (1993) (hereafter, GMR), our goal is inference on the unknown parameters of a dynamic structural model that has a nonlinear state space representation. The structural model is specified through a transition, or state, equation and a measurement equation. The transition equation is of the following form:

u_t = φ(u_{t-1}, ε_t, θ), θ ∈ Θ ⊂ R^p,

where φ is a known function, (u_t, ε_t)_{t=1}^T are latent processes and ε_t is a strong white noise process with a known distribution; and the measurement equation satisfies

y_t = r(y_{t-1}, x_t, u_t, ε_t, θ), θ ∈ Θ ⊂ R^p,

where r is a known function and (x_t, y_t)_{t=1}^T are observed processes. In the two equations, the known functions φ and r are indexed by a p-dimensional vector of unknown parameters θ ∈ Θ. We assume that (x_t)_{t≤T} is a homogeneous Markov process of order 1, and is independent of the process (ε_t)_{t≤T} (and of (u_t)_{t≤T}). Then the process (x_t) is exogenous and the process (x_t, y_t)_{t≤T} is stationary.
It is worth recalling that, by standard arguments, the fact that the Markov process is of order 1 and the fact that the probability distribution of the white noise ε_t is known are not restrictive assumptions.

Under the above conditions, assuming absolute continuity with respect to some dominating measure, for a given initial condition z_0 = (y_0, u_0), it should be possible to write down the joint conditional probability density function

l*{(y_t)_{1≤t≤T}, (u_t)_{1≤t≤T} | (x_t)_{1≤t≤T}, z_0; θ}.   (1)

The density of the observed sequence (y_t)_{t≤T}, conditional on (x_t)_{t≤T}, is obtained by integrating out the latent variables (u_t)_{1≤t≤T} from the density (1) and can generally be stated as

l{(y_t)_{1≤t≤T} | (x_t)_{1≤t≤T}; θ} = Π_{1≤t≤T} l{y_t | (y_τ)_{1≤τ≤t-1}, x_t, z_0; θ},   (2)

where the last equality comes from the Markovianity and exogeneity of the process (x_t). This density function allows us to construct the log-likelihood function

L_T(θ) = (1/T) Σ_{1≤t≤T} log( l{y_t | (y_τ)_{1≤τ≤t-1}, x_t, z_0; θ} ).   (3)

A maintained assumption in this paper is that the log-likelihood asymptotically identifies some true unknown value, θ^0, of the unknown parameters θ, which is the unique maximizer of the population criterion:

θ^0 = arg max_{θ∈Θ} L_∞(θ), where L_∞(θ) = plim_{T→∞} L_T(θ).

It is important to realize that, more often than not, this assumption is neither testable nor associated with a feasible estimator of θ^0. The likelihood function in equation (2) does not have an analytically tractable form: it is constructed from the latent likelihood in (1) through an integration step that is infeasible to carry out, namely integration with respect to the T variables (u_t)_{t≤T}, with T going to infinity.
Even though direct inference on θ via L_T(θ) may be infeasible, it is well known that inference can be carried out using simulation-based filtering and inference approaches. Under the assumed model, it is possible to simulate values of y_1, ..., y_T, for a given initial condition z_0 = (y_0, u_0) and a given value θ of the parameters, conditionally on the observed path of the exogenous variables x_1, ..., x_T. This is done by independently drawing simulated values ε̃_1, ..., ε̃_T from the assumed distribution of the strong white noise (ε_t) (the simulated values are also independent of the realized values ε_1, ..., ε_T that underpin the observations) and by computing ỹ_t(θ, z_0), for t = 0, 1, ..., T, with ỹ_0(θ, z_0) = y_0 and where

ỹ_t(θ, z_0) = r[ỹ_{t-1}(θ, z_0), x_t, ũ_t(θ, u_0), ε̃_t, θ],
ũ_t(θ, u_0) = φ[ũ_{t-1}(θ, u_0), ε̃_t, θ].

While simulation is the most prevalent mechanism for inference in such settings, we note that in many cases inference could be based directly on L_T(θ) if we were instead to consider sub-models defined by restricting the parameters θ to lie in a given set Θ^c ⊂ Θ. Indeed, it will often be the case that the sub-models can be chosen by imposing θ ∈ Θ^c so that we obtain a convenient factorization of the probability density function, which ensures that integrating out the T latent variables, (u_t)_{t≤T}, no longer requires solving a T-dimensional integral; consequently, inference (over the sub-models) could be based directly on the log-likelihood function (3). However, in general the sub-models specified by this constraint will not be correctly specified and the resulting estimates will be asymptotically biased for the parameter of interest θ.
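To fix ideas, the simulation step just described can be sketched in a few lines of code. The transition φ and measurement r are passed in as callables; the Gaussian draws and all names below are our own illustrative choices (the paper only requires the noise distribution to be known), not part of the paper:

```python
import numpy as np

def simulate_path(theta, z0, x, phi, r, rng):
    """Draw one simulated path y_tilde_1..T conditional on the exogenous path x."""
    y0, u0 = z0
    T = len(x)
    y = np.empty(T + 1)
    u = np.empty(T + 1)
    y[0], u[0] = y0, u0
    eps = rng.standard_normal(T)   # fresh draws, independent of the observed data
    for t in range(1, T + 1):
        u[t] = phi(u[t - 1], eps[t - 1], theta)                 # transition equation
        y[t] = r(y[t - 1], x[t - 1], u[t], eps[t - 1], theta)   # measurement equation
    return y[1:]
```

For instance, with the toy choices φ(u, ε, θ) = θ_1 u + ε and r(y, x, u, ε, θ) = θ_2 x + u, repeated calls with independent draws produce the simulated paths ỹ(θ, z_0) that enter the matching step.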
However, as we will later see, following the intuition of I-I, this misspecification bias can be corrected by matching these estimators against a simulated counterpart.

The following section demonstrates that there are many interesting cases where restricting the parameters θ to lie in some set Θ^c ⊂ Θ results in log-likelihood functions that are easily tractable. Clearly, such examples exclude cases where the integration, or filtering, can be performed analytically, such as when the Kalman filter applies (as in linear Gaussian state space models) or in certain qualitative Markov-switching models. The focus of this paper is nonlinear state space models, where these simplifications are not generally applicable.

2.2 Illustrative Examples

2.2.1 Example 1: Autoregressive Discrete Choice Models
We observe the sample {y_t, x_t}_{t=1}^T generated from

y_t = 1 if y*_t > 0, y_t = 0 if y*_t ≤ 0,
y*_t = x_t'θ_1 + u_t,
u_t = θ_2 u_{t-1} + ν_t,

where x_t is a vector of explanatory variables, ν_t is a Gaussian white noise, the AR(1) process (u_t)_{t≤T} is stationary (−1 < θ_2 < 1), and θ = (θ_1', θ_2)'. Following the standard normalization practice for a Probit error term, we set ν_t ∼ N(0, 1). The serial dependence in u_t means that the data density can only be stated as a T-dimensional integral. Let A_t = [0, +∞) if y_t = 1 and A_t = (−∞, 0) if y_t = 0; then

l{(y_t)_{t≤T} | (x_t)_{t≤T}; θ} = ∫_{A_1} ··· ∫_{A_T} l*{(y*_t)_{t≤T} | (x_t)_{t≤T}, z_0; θ} dy*_1 ··· dy*_T,

l*{(y*_t)_{t≤T} | (x_t)_{t≤T}, z_0; θ} = (2π)^{−T/2} R(θ_2)^{−1/2} exp( −u_1(θ)² / (2R(θ_2)) ) Π_{t=2}^T exp( −[u_t(θ) − θ_2 u_{t-1}(θ)]² / 2 ),

where R(θ_2) = 1/(1 − θ_2²) and u_t(θ) = y*_t − x_t'θ_1. However, note that if one were to impose the constraint θ_2 = 0 in l*{(y*_t)_{t≤T} | (x_t)_{t≤T}, z_0; θ}, the integral that defines this density can be factorized into a product of T univariate integrals, which ultimately yields the usual Probit likelihood function. As such, a convenient parametric sub-model is given by

l{(y_t)_{t≤T} | (x_t)_{t≤T}; θ}, θ ∈ Θ^c = {θ ∈ Θ : θ = (θ_1', 0)'}.

A similar construction can also be applied, albeit with different notations, to spatially correlated Probit models instead of the autoregressive Probit model.
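As a minimal illustration of the constrained sub-model, imposing θ_2 = 0 reduces the T-dimensional integral to the ordinary Probit likelihood, whose average log-likelihood can be coded directly (the function and variable names below are ours):

```python
import math

def probit_loglik(y, X, theta1):
    """Average log-likelihood of the Probit sub-model obtained under theta_2 = 0."""
    # Standard normal CDF via the error function.
    Phi = lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
    ll = 0.0
    for yt, xt in zip(y, X):
        p = Phi(sum(xi * ti for xi, ti in zip(xt, theta1)))  # Pr[y_t = 1 | x_t]
        ll += math.log(p) if yt == 1 else math.log(1.0 - p)
    return ll / len(y)
```

Maximizing this criterion over θ_1 yields the constrained estimator, generally inconsistent when θ_2 ≠ 0, whose score serves as the matching statistic in the AML step.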
2.2.2 Example 2: GARCH-like Stochastic Volatility Model
Observed log-returns are assumed to evolve according to

r_{t+1} = μ + ε_{t+1}, E[ε_{t+1} | I_t] = 0,

where the error term ε_{t+1} is a martingale difference sequence (hereafter, mds). We are interested in the dynamics of the volatility process

σ_t² = E[ε²_{t+1} | I_t].

As usual, the observed counterpart of the volatility dynamics is given by the dynamics of the squared return process. We assume that ε_t² is a weak ARMA(p, p):

ε²_{t+1} − ω − Σ_{j=1}^p γ_j ε²_{t+1−j} = ξ_{t+1} − Σ_{j=1}^p β_j ξ_{t+1−j},   (4)

where ξ_{t+1} is a weak white noise that defines the innovation process of ε_t². In other words, the ARMA representation (4) is causal and invertible.

It is known (see, e.g., Meddahi and Renault (2004)) that ε_t is a (semi-strong) GARCH(p, q) with q ≤ p if and only if ξ_t is a mds. Inspired by Franses et al. (2008), albeit with a different model, we want to relax this restriction on the white noise ξ_{t+1}, so as to define a family of stochastic volatility models which contains the GARCH(p, q) with q ≤ p as a particular case but, beyond this particular case, belongs to the realm of nonlinear state space models.
For this purpose, it is worth focusing on the difference between the innovation process ξ_{t+1} and the mds ν_{t+1} = ε²_{t+1} − σ_t².

By definition (see equation (4)), the difference (ξ_{t+1} − ε²_{t+1}) is I_t-measurable, so that we are allowed to introduce the notation

ξ_{t+1} − ν_{t+1} = η_t = σ_t² − k_t, so that ξ_{t+1} − ε²_{t+1} = −σ_t² + η_t = −k_t,

which allows us to rewrite the volatility dynamics in equation (4) as

ε²_{t+1} − ω − Σ_{j=1}^p γ_j ε²_{t+1−j} = ε²_{t+1} − k_t − Σ_{j=1}^p β_j [ε²_{t+1−j} − k_{t−j}],

so that

k_t = ω + Σ_{j=1}^p α_j ε²_{t+1−j} + Σ_{j=1}^p β_j k_{t−j},   (5)
α_j = γ_j − β_j.   (6)

In other words, we see that, without any additional assumption, the ARMA(p, p) representation for ε²_{t+1} in equation (4) can be characterized by the GARCH-like equation (5) with

σ_t² = k_t + η_t, η_t = ξ_{t+1} − ν_{t+1}, ν_{t+1} = ε²_{t+1} − σ_t².   (7)

Note that, since ν_{t+1} is a mds, we deduce from (7) that η_t = E[η_t | I_t] = E[ξ_{t+1} | I_t], and thus

E[ξ_{t+1} | I_t] = 0 ⟺ σ_t² = k_t ⟺ σ_t² = ω + Σ_{j=1}^p α_j ε²_{t+1−j} + Σ_{j=1}^p β_j σ²_{t−j}.

That is, we again find that the GARCH case is tantamount to the mds property of the noise process ξ_{t+1}, which implies that the process η_t is identically zero.

Now, beyond the GARCH case, it is worth asking whether a non-zero process η_t is just a white noise or encapsulates some additional dynamic features of the conditional variance. It is then natural to consider the following model for η_t:

η_t = ρ η_{t−1} + ϖ χ_t, |ρ| < 1,   (8)

where χ_t is i.i.d. with a known distribution with zero mean. Such a model for η_t leads to a nonlinear state space model with the measurement equation

r_{t+1} = μ + [ω + Σ_{j=1}^p α_j ε²_{t+1−j} + Σ_{j=1}^p β_j (σ²_{t−j} − η_{t−j}) + η_t]^{1/2} u_{t+1},

for u_{t+1} and χ_t i.i.d.
with known distributions, and where the transition equation is given by (8).

Similar to the general case treated in equation (2), the likelihood function of this model is only expressible as a T-dimensional integral (due to the dynamics in (8)). However, as we have already seen in the autoregressive Probit example, Example 1, imposing the constraint ρ = 0 in this state space model means that the T-dimensional integral can be factorized into the product of T univariate integrals. As a consequence, stable numerical procedures can be used to compute these univariate integrals and the resulting likelihood can then be maximized. More precisely, since

σ_t² = k[{r_τ}_{τ≤t}] + η_t, k[{r_τ}_{τ≤t}] = k_t = ω + Σ_{j=1}^p α_j ε²_{t+1−j} + Σ_{j=1}^p β_j k_{t−j},

k_t can be computed recursively as a function of the past observed returns {r_τ}_{τ≤t}, as is standard in GARCH models. Therefore, when ρ = 0, the overall likelihood is the product of the increments l[r_{t+1} | {r_τ}_{τ≤t}; θ], where for t ≥ 1,

l[r_{t+1} | {r_τ}_{τ≤t}; θ] = ∫_{−∞}^{+∞} [k[{r_τ}_{τ≤t}] + η_t]^{−1/2} f_u( (r_{t+1} − μ) / [k[{r_τ}_{τ≤t}] + η_t]^{1/2} ) (1/ϖ) f_χ(η_t/ϖ) dη_t,

where f_u(.) (resp. f_χ(.)) denotes the probability density function of the standardized log-return u_{t+1} (resp. of the noise χ_t).

2.2.3 Example 3: Generalized Tobit Model
Amemiya (1985) defines the generalized Tobit model of Type 2 by the following observation scheme for the outcome variable y_i:

y_i = y*_2i if y*_1i ≥ 0, y_i = 0 if y*_1i < 0,   (9)

with

y*_2i = x_i'θ_1 + σ ε_2i,   (10)

where x_i is a vector of exogenous explanatory variables, (θ_1', σ)' a vector of unknown parameters, and ε_2i a standardized Gaussian error, ε_2i ∼ N(0, 1). A complete specification of the likelihood function requires specifying the conditional probability of missingness in the data:

Pr[y*_1i < 0 | y*_2i, z_i, θ_2, θ_3],

where z_i is a vector of exogenous explanatory variables and (θ_2', θ_3)' is a vector of unknown parameters. The parameter θ_2 governs the relationship between z_i and the missingness mechanism, and the parameter θ_3 characterizes the dependence between the two latent endogenous variables y*_1i and y*_2i. Then, if I_1 (resp. I_0) stands for the subset of indices for which y*_1i ≥ 0 (resp. y*_1i < 0), we have

l{(y_i)_{1≤i≤T} | (x_i, z_i)_{1≤i≤T}; θ} = Π_{i∈I_1} (1/σ) φ((y_i − x_i'θ_1)/σ) Pr[y*_1i ≥ 0 | y_i, z_i, θ_2, θ_3] × Π_{i∈I_0} Pr[y*_1i < 0 | z_i, θ],

with

Pr[y*_1i < 0 | z_i, θ] = ∫ Pr[y*_1i < 0 | y*_2i, z_i, θ_2, θ_3] (1/σ) φ((y*_2i − x_i'θ_1)/σ) dy*_2i,

where the function φ(.) stands for the probability density function of the standard normal distribution and θ = (θ_1', θ_2', θ_3, σ)', with θ_1 ∈ R^{p_1}, θ_2 ∈ R^{p_2}, θ_3 ∈ R, σ > 0. Estimation of θ may be challenging because the likelihood function involves an integral that may need to be computed numerically. However, imposing the (possibly false) equality constraint θ_3 = 0 implies that y*_1i and y*_2i are conditionally independent, given z_i, and the likelihood function under the constraint θ_3 = 0 becomes

l{(y_i)_{1≤i≤T} | (x_i, z_i)_{1≤i≤T}; θ} = Π_{i∈I_1} (1/σ) φ((y_i − x_i'θ_1)/σ) Pr[y*_1i ≥ 0 | z_i, θ_2, 0] × Π_{i∈I_0} Pr[y*_1i < 0 | z_i, θ_2, 0].

Amemiya (1985) notes that this "special case of independence" makes the likelihood function almost as simple as a standard Tobit when the probability distribution of y*_1i given z_i is also Gaussian. However, with reference to an empirical paper (Dudley and Montmarquette (1976), about foreign aid from the United States to a particular country), Amemiya (1985) notes that "it makes their model computationally advantageous. However, it seems unrealistic to assume that the potential amount of aid, y*_2, is independent of the variable that determines whether or not aid is given, y*_1".
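A sketch of the constrained (θ_3 = 0) likelihood in code. The marginal law of the selection index is left unspecified above, so, purely for concreteness, we assume a Gaussian index with Pr[y*_1i ≥ 0 | z_i] = Φ(z_i'θ_2); censored observations are encoded as None, and all names are ours:

```python
import math

def tobit2_constrained_loglik(y, X, Z, theta1, theta2, sigma):
    """Average log-likelihood of the Type 2 Tobit sub-model under theta_3 = 0."""
    Phi = lambda v: 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))  # standard normal CDF
    ll = 0.0
    for yi, xi, zi in zip(y, X, Z):
        zdot = sum(a * b for a, b in zip(zi, theta2))
        if yi is None:
            # y*_1i < 0: the outcome is unobserved, only the selection term enters.
            ll += math.log(1.0 - Phi(zdot))
        else:
            # Observed outcome: Gaussian density term plus the selection probability.
            resid = (yi - sum(a * b for a, b in zip(xi, theta1))) / sigma
            ll += (-0.5 * math.log(2.0 * math.pi * sigma**2)
                   - 0.5 * resid**2 + math.log(Phi(zdot)))
    return ll / len(y)
```

Under the constraint the two products in the likelihood separate, so the criterion splits into a standard censored-regression part and a binary-selection part, which is exactly what makes the sub-model "computationally advantageous".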
More generally, Amemiya (1985) considers that the joint conditional distribution of (y*_1i, y*_2i)' given (x_i, z_i) is Gaussian, with θ_3 standing for the correlation coefficient between y*_1i and y*_2i.

However, an alternative, and often computationally more convenient, choice is to assume that the conditional probability distribution of y*_1i given (y*_2i, x_i, z_i) is logistic, which yields

Pr[y*_1i ≥ 0 | y*_2i, z_i, x_i, θ_2, θ_3] = [1 + exp(−z_i'θ_2 − θ_3 y*_2i)]^{−1}.   (11)

In this case, imposing the (potentially false) equality constraint θ_3 = 0 leads to a "computationally advantageous" model whose log-likelihood function, when evaluated at θ = (θ_1', θ_2', 0, σ)', has a particularly simple form:

L_T[(θ_1', θ_2', 0, σ)'] = (1/T) Σ_{i∈I_1} { −(1/2) log(2πσ²) − (1/(2σ²)) (y_i − x_i'θ_1)² − log(1 + e^{−z_i'θ_2}) } − (1/T) Σ_{i∈I_0} log(1 + e^{z_i'θ_2}).

2.2.4 Example 4: Markov-Switching Multifractal (MSM) Model
Similarly to Example 2, consider that observed asset returns evolve according to
\[ r_{t+1} = \mu + \varepsilon_{t+1}, \qquad E[\varepsilon_{t+1} \mid I_t] = 0, \]
where the error process $\varepsilon_t$ is assumed to follow $\varepsilon_{t+1} = \sigma_t u_{t+1}$, $E[u_{t+1}^2 \mid I_t] = 1$, with $\sigma_t$ denoting the volatility process. Our goal remains the analysis of the volatility process; however, in this example we use the Binomial MSM model proposed in Calvet and Fisher (2001, 2004, 2008) and consider that the squared volatility process is defined as the product of several volatility components:
\[ \sigma_t^2 = \bar\sigma^2 \prod_{k=1}^{\bar k} M_{k,t}. \]
The components $M_{k,t}$ are unobservable (i.e., latent) variables that are often referred to as multipliers or volatility components, and the overall number of components, $\bar k$, is unknown.

We will assume that the standardized return $u_{t+1}$ is i.i.d. with probability density function $f_u(\cdot)$. The latent state variables $M_{k,t}$, $k = 1, \dots, \bar k$, are assumed to be stationary Markov processes with common marginal distribution, denoted by $M$. Given a value $M_{k,t}$ for the $k$th component at time $t$, the next-period multiplier is assumed to evolve according to
\[ M_{k,t+1} = \begin{cases} \sim M & \text{with probability } \gamma_k, \\ M_{k,t} & \text{with probability } 1 - \gamma_k, \end{cases} \]
where the notation $\sim M$ stands for "drawn from the distribution $M$" and the state vector is initialized from the stationary distribution $\pi^0$, with $\pi^0_j = \Pr[M_t = m^j] = 1/d$ for all $j = 1, \dots, d$, where $d = 2^{\bar k}$. The switching events (with transition probabilities $\gamma_k$, $k = 1, \dots, \bar k$) and new draws from $M$ are assumed to be independent across $k$ and $t$. To ensure a non-negative and stationary volatility process ($E(\sigma_t^2) = \bar\sigma^2$), we assume $E(M) = 1$, with $M$ taking the two values $m_0 \in (1, 2)$ and $2 - m_0$ such that
\[ \Pr[M = m_0] = \Pr[M = 2 - m_0] = \tfrac{1}{2}. \]
Then the state vector $M_t = (M_{1,t}, \dots, M_{\bar k,t})'$ can take $d$ possible values $m^j$, $j = 1, \dots, d$, so that at each date the squared volatility process takes $d$ possible values $\bar\sigma^2 g(m^j)$, where
\[ g\left[(M_{1,t}, \dots, M_{\bar k,t})'\right] = \prod_{k=1}^{\bar k} M_{k,t}. \]
The transition probabilities $\gamma_k$, $k = 1, \dots, \bar k$, are specified such that the first components (small $k$) are the most persistent:
\[ \gamma_k = \bar\gamma\, b^{\,k - \bar k}, \qquad \bar\gamma \in (0, 1), \; b > 1, \; k = 1, \dots, \bar k, \]
and a possibly higher "volatility of volatility" can be accommodated by increasing $\bar k$.

For this model, the structural parameter vector is $\theta = (m_0, \bar\gamma, b, \bar\sigma, \bar k)'$ and the log-likelihood associated with observed returns $(r_{t+1})_{t \leq T}$ is given by
\[ L_T(\theta) = \frac{1}{T}\sum_{t=1}^T \log\left( \sum_{j=1}^d \frac{1}{\bar\sigma\sqrt{g(m^j)}}\, f_u\!\left( \frac{r_{t+1} - \mu}{\bar\sigma\sqrt{g(m^j)}} \right) \Pr[M_t = m^j \mid r_\tau, \tau \leq t] \right), \tag{12} \]
where the conditional probabilities $\pi_{jt} = \Pr[M_t = m^j \mid r_\tau, \tau \leq t]$ are computed recursively. By Bayes' rule, the probability $\pi_{jt}$ can be expressed as a function of the previous probabilities $\pi_{t-1} = (\pi_{1t-1}, \dots, \pi_{dt-1})'$:
\[ \pi_{jt} \propto \sum_{i=1}^d \frac{1}{\bar\sigma\sqrt{g(m^i)}}\, f_u\!\left( \frac{r_t - \mu}{\bar\sigma\sqrt{g(m^i)}} \right) \pi_{it-1}\, a_{i,j}, \qquad a_{i,j} = \Pr[M_t = m^j \mid M_{t-1} = m^i] = \prod_{k=1}^{\bar k}\left[ (1 - \gamma_k)\mathbb{1}\{m^i_k = m^j_k\} + \gamma_k \Pr[M = m^j_k] \right]. \]
Hence, unlike continuous stochastic volatility models, such as in Example 2, the Markov-switching multifractal model has a closed-form likelihood, precisely because filtering techniques à la Hamilton can be applied. However, the price to pay for a volatility process with a discrete state space is that, for the sake of goodness of fit, it often takes a state space with many elements, which implies a large number of multipliers $\bar k$.
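The discrete filter behind (12) can be sketched directly. The code below is a minimal illustration, not the authors' implementation: it assumes a Gaussian $f_u$, and the draw probability for each support point of the binomial $M$ is $1/2$, so the fresh-draw term in $a_{i,j}$ becomes $\gamma_k/2$; all function and variable names are illustrative.

```python
import numpy as np
from itertools import product

def msm_loglik(r, m0, gamma_bar, b, sigma_bar, kbar, mu=0.0):
    """Average log-likelihood (12) of the Binomial MSM model via the
    discrete filter; the states are the d = 2**kbar multiplier vectors."""
    d = 2 ** kbar
    # All multiplier vectors m^j, j = 1..d (each component is m0 or 2 - m0).
    states = np.array(list(product([m0, 2.0 - m0], repeat=kbar)))  # (d, kbar)
    g = states.prod(axis=1)                       # g(m^j)
    vol = sigma_bar * np.sqrt(g)                  # sigma_bar * sqrt(g(m^j))
    gam = gamma_bar * b ** (np.arange(1, kbar + 1) - kbar)  # gamma_k
    # Transition matrix a[i, j] = prod_k [(1 - gam_k) 1{m^i_k = m^j_k} + gam_k/2].
    A = np.ones((d, d))
    for k in range(kbar):
        same = states[:, None, k] == states[None, :, k]
        A *= np.where(same, 1.0 - gam[k] + gam[k] / 2.0, gam[k] / 2.0)
    pi = np.full(d, 1.0 / d)                      # stationary initial distribution
    loglik = 0.0
    for rt in r:
        # Gaussian f_u mixed over the current filtered probabilities.
        dens = np.exp(-0.5 * ((rt - mu) / vol) ** 2) / (np.sqrt(2 * np.pi) * vol)
        loglik += np.log(pi @ dens)
        pi = (pi * dens) @ A                      # Bayes update, then propagate
        pi /= pi.sum()                            # normalize (the "proportional to")
    return loglik / len(r), pi
```

Each filtering step multiplies a $d$-vector by a $d \times d$ matrix, which is the source of the $O(4^{\bar k} T)$ cost discussed next.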
Calvet and Fisher (2004) document that for exchange rate data, the multifractal model "works better for larger values of $\bar k$" and choose to focus on the case $\bar k = 10$ for all currencies.

While the log-likelihood is available in closed form, a single evaluation requires $O(4^{\bar k} T)$ computations, where $O(\cdot)$ denotes the order of the evaluation. Therefore, if the upper bound on the parameter space for $\bar k$ is too large, estimation via maximum likelihood becomes prohibitively expensive.

Given the potentially prohibitive computational requirements associated with a large value of $\bar k$, it is worth revisiting the likelihood function under the false equality constraint $\bar k = 2$, which is the smallest possible value of $\bar k$ that allows identification of all the other parameters. Under the constraint $\bar k = 2$, a single likelihood evaluation requires only $16 \cdot T$, i.e., $4^2 \cdot T$, computations. Therefore, such a constraint can easily be imposed, and the resulting estimation procedure implemented, to alleviate the computational burden associated with searching over the entire parameter space for $\bar k$.

2.2.5 Example 5: Stable Distribution
Consider i.i.d. observations $y_1, \dots, y_T$ generated from a stable distribution with stability parameter $a \in (0, 2]$, skewness parameter $b \in [-1, 1]$, scale parameter $c > 0$ and location parameter $\mu \in \mathbb{R}$. The structural parameter vector is given by
\[ \theta = (a, b, c, \mu)'. \tag{13} \]
The practical problem for maximum likelihood inference in this context does not come from a non-linear state space in which the likelihood function would involve integrals over state variables. Rather, it is known that the log-likelihood function $L_T(\theta)$ is not available in general, except for some specific values of the parameters $a$ and $b$. As such, maximum likelihood inference can only be implemented through the time-consuming task of numerically inverting the characteristic function, which is known in closed form, to obtain the resulting (numerical approximation to the) stable density.

However, for $a = 1$ and $b = 0$, the stable distribution coincides with the Cauchy distribution, which has a closed-form log-likelihood function $L_T(1, 0, c, \mu)$. Moreover, the stable model also allows one to simulate sample paths, for instance with the method of Chambers, Mallows and Stuck (1976). This again paves the way for an AML strategy.

The common feature of all the previously discussed examples is that for all values of $\theta$ in some subset $\Theta^c \subset \Theta$, obtained by imposing some (possibly false) equality constraints, the log-likelihood function $L_T(\theta)$ in (3) is available in closed form (up to the evaluation of univariate integrals). Moreover, we can also show that for all five examples considered in Section 2.2, considering $\theta \in \Theta^c$ allows us to compute, in closed form, a pseudo-score vector
\[ \Delta_\theta L_T(\theta); \qquad \theta \in \Theta^c, \tag{14} \]
that can be used as the basis for inference on the unknown $\theta$.

The notation $\Delta_\theta L_T(\theta)$ is used since certain components of the pseudo-score vector may not be computed as exact partial derivatives. Of course, such an approximation will be required when some components of $\theta$ are integers, such as $\bar k$ in the multifractal case (Example 4).
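One simple way to form such a pseudo-derivative is to mix ordinary numerical differentiation for the continuous coordinates with a unit difference for the integer coordinates (for $\bar k$, this reproduces the $L_T(\zeta, 3) - L_T(\zeta, 2)$ component used in Example 4 below). A minimal sketch, where `loglik`, the step sizes, and the coordinate labels are illustrative assumptions:

```python
import numpy as np

def pseudo_score(loglik, theta, int_idx, h=1e-5):
    """Pseudo-score Delta_theta L_T: central differences for continuous
    coordinates, a unit forward difference for integer-valued coordinates."""
    theta = np.asarray(theta, dtype=float)
    grad = np.empty_like(theta)
    for i in range(theta.size):
        up, down = theta.copy(), theta.copy()
        if i in int_idx:
            up[i] += 1.0
            grad[i] = loglik(up) - loglik(theta)          # e.g. L(., k+1) - L(., k)
        else:
            up[i] += h
            down[i] -= h
            grad[i] = (loglik(up) - loglik(down)) / (2 * h)
    return grad
```

The resulting vector has the same dimension $p$ as $\theta$, in line with the discussion of (14).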
Moreover, this approximation will also be relevant in the case of stable distributions (Example 5), where genuine partial derivatives with respect to the parameters $a$ and $b$ cannot always be computed.

Importantly, we note that the pseudo-score vector in (14) is of the same dimension as the unknown parameters, i.e., it is a $p$-dimensional vector. That is, the partial derivatives for the pseudo-score are computed with respect to all components of $\theta$, including those dimensions whose values are fixed when $\theta \in \Theta^c$. In the following, we demonstrate that, in the examples considered above, constraining $\theta \in \Theta^c$ allows us to compute the pseudo-score in closed form, at least up to the evaluation of univariate integrals.

Example 1: (
Autoregressive Discrete Choice Models) The dynamic Probit model is a striking example of the fact that, while the complete likelihood function $l\{(y_t)_{t \leq T} \mid (x_t)_{t \leq T}; \theta\}$ can only be stated as a $T$-dimensional integral, the sub-model defined by $\theta_2 = 0$ is much simpler, since it coincides with the usual Probit likelihood. Not only does the (possibly false) equality constraint $\theta_2 = 0$ lead to a closed-form likelihood, but the results of Gourieroux et al. (1985) demonstrate that the partial derivatives of the likelihood function are also available in closed form. Under the restriction $\theta_2 = 0$, for
\[ \tilde u_t(\theta_1, 0) = \frac{\varphi(x_t'\theta_1)}{\Phi(x_t'\theta_1)\left[1 - \Phi(x_t'\theta_1)\right]}\left[ y_t - \Phi(x_t'\theta_1) \right], \]
where $\varphi$ (resp. $\Phi$) denotes the probability density function (resp. the cumulative distribution function) of the standard normal, the computations in Gourieroux et al. (1985) yield
\[ \frac{\partial L_T(\theta_1, 0)}{\partial \theta_1} = \frac{1}{T}\sum_{t=1}^T x_t\, \tilde u_t(\theta_1, 0), \qquad \left.\frac{\partial L_T(\theta_1, \theta_2)}{\partial \theta_2}\right|_{\theta_2 = 0} = \frac{1}{T}\sum_{t=2}^T \tilde u_{t-1}(\theta_1, 0)\, \tilde u_t(\theta_1, 0), \]
where $\tilde u_t(\theta_1, 0)$ is the generalized residual under the restriction $\theta_2 = 0$. Gourieroux et al. (1987) show that $\tilde u_t(\theta_1, 0)$ can be interpreted as the conditional expectation of the error term $u_t$ given $y_t$ when the true value of $\theta$ is $(\theta_1', 0)'$.

Example 2: (GARCH-like Stochastic Volatility Model) In the case of an
ARCH(1)-like stochastic volatility model, observed returns are assumed to evolve according to
\[ r_{t+1} = \mu + \varepsilon_{t+1}, \quad \varepsilon_{t+1} = \sigma_t u_{t+1}, \quad k_t = \omega + \alpha\varepsilon_t^2, \quad \sigma_t^2 = k_t + \eta_t, \quad \eta_t = \rho\eta_{t-1} + \varpi\chi_t, \]
and we now demonstrate that the derivatives of the log-likelihood are also available in closed form. We treat the case of an ARCH(1)-like model for the sake of expositional simplicity, and note that the result extends to other members of this class at the cost of lengthier derivations. Furthermore, we assume that the standardized asset (log) return $u_{t+1}$ is Gaussian white noise. For this model, the structural parameter vector is given by $\theta = (\zeta', \rho)'$, $\zeta = (\mu, \omega, \alpha, \varpi)'$, and the likelihood function (calculated from observed returns $(r_{t+1})_{t \leq T}$) is
\[ l[\{r_{t+1}\}_{t=1}^T \mid \theta] = \int_{-\infty}^{+\infty} \cdots \int_{-\infty}^{+\infty} l^*[\{r_{t+1}, \eta_t\}_{t=1}^T \mid \theta]\, d\eta_1 \cdots d\eta_T, \]
where $l^*[\{r_{t+1}, \eta_t\}_{t=1}^T \mid \theta]$ is the latent likelihood:
\[ l^*[\{r_{t+1}, \eta_t\}_{t=1}^T \mid \theta] = \prod_{t=1}^T \frac{1}{\sqrt{2\pi}\sqrt{\omega + \alpha\varepsilon_t^2 + \eta_t}} \exp\left( -\frac{1}{2}\left[ \frac{r_{t+1} - \mu}{\sqrt{\omega + \alpha\varepsilon_t^2 + \eta_t}} \right]^2 \right) f_\eta[\eta_1, \dots, \eta_T \mid \eta_0, \varpi, \rho], \]
\[ f_\eta[\eta_1, \dots, \eta_T \mid \eta_0, \varpi, \rho] = \prod_{t=1}^T \frac{1}{\varpi} f_\chi\!\left( \frac{\eta_t - \rho\eta_{t-1}}{\varpi} \right). \]
As already announced, imposing the equality constraint $\rho = 0$ greatly simplifies the computation of the observed likelihood and corresponding score vector. The main reason for this is the implied additive structure of the latent and observed log-likelihood functions, which can be written
\[ L_T^*(\zeta, 0) = \frac{1}{T}\sum_{t=1}^T \log\left( l^*[r_{t+1}, \eta_t \mid r_\tau, \tau \leq t; (\zeta, 0)] \right), \qquad L_T(\zeta, 0) = \frac{1}{T}\sum_{t=1}^T \log\left( l[r_{t+1} \mid r_\tau, \tau \leq t; (\zeta, 0)] \right), \]
\[ l[r_{t+1} \mid r_\tau, \tau \leq t; (\zeta, 0)] = \int_{-\infty}^{+\infty} l^*[r_{t+1}, \eta_t \mid r_\tau, \tau \leq t; (\zeta, 0)]\, d\eta_t. \]
This additive structure is very convenient, not only for its computational advantages, but also because it allows us to resort to a formula provided by Gourieroux et al. (1987) to compute the observed score vector from the latent score. While this formula was established by Gourieroux et al. (1987) (as a generalization of Louis (1982)) for i.i.d. data, the algebra proving it carries over directly, so that we can write
\[ \frac{\partial \log\left( l[r_{t+1} \mid r_\tau, \tau \leq t; (\zeta, 0)] \right)}{\partial \zeta} = E\left[ \frac{\partial \log\left( l^*[r_{t+1}, \eta_t \mid r_\tau, \tau \leq t; (\zeta, 0)] \right)}{\partial \zeta} \,\middle|\, \{r_\tau\}_{\tau \leq t+1} \right]. \tag{15} \]
Hence, we can compute
\[ \frac{\partial L_T(\zeta, 0)}{\partial \zeta} = \frac{1}{T}\sum_{t=1}^T E\left[ \frac{\partial \log\left( l^*[r_{t+1}, \eta_t \mid r_\tau, \tau \leq t; (\zeta, 0)] \right)}{\partial \zeta} \,\middle|\, \{r_\tau\}_{\tau \leq t+1} \right]. \tag{16} \]
Two remarks are in order. First, in contrast with Gourieroux et al. (1987), due to dynamic conditional information, (16) does not give the observed score as the conditional expectation of the latent score given the observed data. However, we will see below that it allows a recursive extension of the concept of generalized residual. Second, it is worth keeping in mind that formulas (15) and (16) are written by assuming that $(\zeta, 0)$ is the true unknown value of the structural parameters, which defines the probability distribution used in the computation of the conditional expectations. Since, in our case, the constraint $\rho = 0$ is likely to be a false equality constraint, the application of (15) and (16) will only provide us with proxies of the true score, which we dub pseudo-scores.

Thanks to equation (15), we can compute the pseudo-score in closed form. We summarize this in the following result, whose derivation is given in Appendix B.

Result 1: For $k \in \{-1, 1, 2\}$, let $[1/(\sigma_t^2)^k]_{F,t} = E[1/(\sigma_t^2)^k \mid r_\tau, \tau \leq t+1]$ denote the filtered function of volatility, computed under the assumed model (and under the parameter restriction $\rho = 0$), with the same filtering notation $[\,\cdot\,]_{F,t}$ applied below to other functions of $\sigma_t^2$. Then, a closed-form pseudo-score can be obtained with the corresponding components
\[ \frac{\partial L_T(\zeta, 0)}{\partial \mu} = \frac{1}{T}\sum_{t=1}^T \left[\frac{1}{\sigma_t^2}\right]_{F,t} (r_{t+1} - \mu), \]
\[ \frac{\partial L_T(\zeta, 0)}{\partial \omega} = \frac{1}{2T}\sum_{t=1}^T \left[\frac{1}{(\sigma_t^2)^2}\right]_{F,t} (r_{t+1} - \mu)^2 - \frac{1}{2T}\sum_{t=1}^T \left[\frac{1}{\sigma_t^2}\right]_{F,t}, \]
\[ \frac{\partial L_T(\zeta, 0)}{\partial \alpha} = \frac{1}{2T}\sum_{t=1}^T \left[\frac{1}{(\sigma_t^2)^2}\right]_{F,t} (r_{t+1} - \mu)^2 \varepsilon_t^2 - \frac{1}{2T}\sum_{t=1}^T \left[\frac{1}{\sigma_t^2}\right]_{F,t} \varepsilon_t^2, \]
\[ \frac{\partial L_T(\zeta, 0)}{\partial \varpi} = -\frac{1}{\varpi} + \frac{1}{\varpi^3 T}\sum_{t=1}^T \left[ \left( \sigma_t^2 - \omega - \alpha\varepsilon_t^2 \right)^2 \right]_{F,t}. \]
In addition, a pseudo-score for $\rho$, i.e., $\partial L_T(\zeta, 0)/\partial\rho$, can be based on the approximation
\[ \frac{1}{\varpi^2 T}\sum_{t=2}^T \left( \left[\sigma_t^2\right]_{F,t} - \omega - \alpha\varepsilon_t^2 \right)\left( \left[\sigma_{t-1}^2\right]_{F,t-1} - \omega - \alpha\varepsilon_{t-1}^2 \right). \qquad \Box \]

Example 3: (
Generalized Tobit Model) Recall that the log-likelihood for the generalized Tobit model is given by
\[ L_T(\theta) = \frac{1}{T}\sum_{i \in I_1} \log\left[ \frac{1}{\sigma}\varphi\!\left( \frac{y_i - x_i'\theta_1}{\sigma} \right) \Pr[y_{2i}^* \geq 0 \mid y_i, z_i, \theta_2, \theta_3] \right] + \frac{1}{T}\sum_{i \in I_0} \log\left[ \Pr[y_{2i}^* < 0 \mid z_i, \theta] \right] = L_{1,T}(\theta) + L_{2,T}(\theta), \]
where
\[ \Pr[y_{2i}^* < 0 \mid z_i, \theta] = \int \Pr[y_{2i}^* < 0 \mid y_{1i}^*, z_i, \theta_2, \theta_3]\, \frac{1}{\sigma}\varphi\!\left( \frac{y_{1i}^* - x_i'\theta_1}{\sigma} \right) dy_{1i}^*, \qquad \Pr[y_{2i}^* < 0 \mid y_{1i}^*, z_i, \theta_2, \theta_3] = \left[ 1 + \exp\left( z_i'\theta_2 + \theta_3 y_{1i}^* \right) \right]^{-1}. \]
As was noted previously, under the restriction $\theta_3 = 0$, the above log-likelihood has a simple closed form. The score of this likelihood under the restriction $\theta_3 = 0$ can also be obtained in closed form. First, we can compute
\[ \frac{\partial L_{1,T}(\theta_1, \theta_2, 0, \sigma)}{\partial\theta_1} = \frac{1}{T}\sum_{i \in I_1} x_i \left[ \frac{y_i - x_i'\theta_1}{\sigma^2} \right], \qquad \frac{\partial L_{1,T}(\theta_1, \theta_2, 0, \sigma)}{\partial\theta_2} = \frac{1}{T}\sum_{i \in I_1} z_i \left[ 1 + e^{z_i'\theta_2} \right]^{-1}, \]
\[ \frac{\partial L_{1,T}(\theta_1, \theta_2, 0, \sigma)}{\partial\theta_3} = \frac{1}{T}\sum_{i \in I_1} y_i \left[ 1 + e^{z_i'\theta_2} \right]^{-1}, \qquad \frac{\partial L_{1,T}(\theta_1, \theta_2, 0, \sigma)}{\partial\sigma} = \frac{1}{T}\sum_{i \in I_1} \left[ -\frac{1}{\sigma} + \frac{(y_i - x_i'\theta_1)^2}{\sigma^3} \right], \]
\[ \frac{\partial L_{2,T}(\theta_1, \theta_2, 0, \sigma)}{\partial\theta_1} = 0, \qquad \frac{\partial L_{2,T}(\theta_1, \theta_2, 0, \sigma)}{\partial\sigma} = 0, \]
\[ \frac{\partial L_{2,T}(\theta_1, \theta_2, 0, \sigma)}{\partial\theta_2} = -\frac{1}{T}\sum_{i \in I_0} z_i \left[ 1 + e^{-z_i'\theta_2} \right]^{-1}, \qquad \frac{\partial L_{2,T}(\theta_1, \theta_2, 0, \sigma)}{\partial\theta_3} = -\frac{1}{T}\sum_{i \in I_0} x_i'\theta_1 \left[ 1 + e^{-z_i'\theta_2} \right]^{-1}. \]
The pseudo-score is then given by the above derivatives, computed under the restriction $\theta_3 = 0$; in particular, for the $\theta_3$-component,
\[ \Delta_{\theta_3} L_T(\theta) = \frac{\partial L_{1,T}(\theta_1, \theta_2, 0, \sigma)}{\partial\theta_3} + \frac{\partial L_{2,T}(\theta_1, \theta_2, 0, \sigma)}{\partial\theta_3}. \]

Example 4: (
Markov-Switching Multifractal (MSM) Model) For this model, the structural parameter vector is given by
\[ \theta = (\zeta', \bar k)', \qquad \zeta = (m_0, \bar\gamma, b, \bar\sigma)'. \]
As already announced, if we consider this model under the false equality constraint $\bar k = 2$, the log-likelihood associated with observed data $\{r_{t+1}\}_{t=1}^T$ is given by
\[ L_T(\zeta, 2) = \frac{1}{T}\sum_{t=1}^T \log\left( \sum_{j=1}^4 \frac{1}{\bar\sigma\sqrt{g(m^j)}}\, f_u\!\left( \frac{r_{t+1} - \mu}{\bar\sigma\sqrt{g(m^j)}} \right) \Pr[M_t = m^j \mid r_\tau, \tau \leq t] \right). \]
We can then define a pseudo-score vector by
\[ \Delta_\theta L_T(\zeta, 2) = \left( \frac{\partial L_T(\zeta, 2)}{\partial\zeta'},\; L_T(\zeta, 3) - L_T(\zeta, 2) \right)'. \]
Note that the filtered probabilities $\Pr[M_t = m^j \mid r_\tau, \tau \leq t]$ depend on all structural parameters, as explained above, in particular through the two transition probabilities $\gamma_1 = \bar\gamma b^{-1}$ and $\gamma_2 = \bar\gamma$.

In the previous section, we have exemplified the computation of pseudo-score vectors $\Delta_\theta L_T(\theta)$, $\theta \in \Theta^c$, where
\[ L_T(\theta) = \frac{1}{T}\sum_{t=2}^T \log\left( l\{y_t \mid (y_\tau)_{1 \leq \tau \leq t-1}, x_t, z_1; \theta\} \right), \]
from which we can compute estimators of the unknown $\theta \in \Theta$. While feasible, these estimators do not, in general, deliver a consistent estimator of $\theta$. We now demonstrate how these pseudo-scores can be used to conduct inference on $\theta$. Throughout the remainder, we maintain the following assumptions on the parameters and on $\Delta_\theta L_T(\theta)$.

Assumption A1 (False Equality Constraints): The parameter space can be partitioned as $\Theta = \Theta_1 \times \Theta_2$, $\Theta_1 \subset \mathbb{R}^{p_1}$, $\Theta_2 \subset \mathbb{R}^{p_2}$, $p = p_1 + p_2$, and the constrained parameter set takes the form $\Theta^c = \Theta_1 \times \{\beta_2^0\}$, where $\beta_2^0 \in \Theta_2$ is the (possibly false) value at which the last $p_2$ components of the parameter vector are fixed.

Assumption A2 (Hessian Matrix): Uniformly on the interior of $\Theta_1$, for some $(p \times p_1)$-dimensional matrix $K$,
\[ \operatorname*{plim}_{T\to\infty} \frac{\partial \Delta_\theta L_T\left[ (\beta_1', \beta_2^{0\prime})' \right]}{\partial\beta_1'} = -K\left[ (\beta_1', \beta_2^{0\prime})' \right], \]
where $-K[(\beta_1', \beta_2^{0\prime})']$ has full column rank.

Consider the log-likelihood function computed for a simulated path $\{\tilde y_t^{(h)}(\theta, z_1)\}_{t=1}^T$ (for $h = 1, \dots, H$) and at a value $\beta$ of the structural parameters:
\[ L_T^{(h)}(\theta, \beta) = \frac{1}{T}\sum_{t=2}^T \log\left( l\left\{ \tilde y_t^{(h)}(\theta) \,\middle|\, \left( \tilde y_\tau^{(h)}(\theta) \right)_{1 \leq \tau \leq t-1}, x_t; \beta \right\} \right). \tag{17} \]
Associated to $L_T^{(h)}(\theta, \beta)$ is the simulated pseudo-score vector $\Delta_\beta L_T^{(h)}(\theta, \beta)$, $\beta \in \Theta^c$, where the (pseudo) derivative $\Delta_\beta$ is computed with respect to the vector $\beta \in \Theta^c$ of parameters in (17), and not with respect to the set of structural parameters, $\theta \in \Theta$, used to simulate $\tilde y_t^{(h)}(\theta)$. As is standard, we require regularity on the behavior of the Hessian matrix associated with $\Delta_\beta L_T^{(h)}(\theta, \beta)$.

Assumption A3 (Cross-Derivative): For all $\beta \in \Theta^c$, the mapping $\theta \longmapsto \Delta_\beta L_T^{(h)}(\theta, \beta)$ is continuously differentiable on the interior of $\Theta$ and
\[ \operatorname*{plim}_{T\to\infty} \frac{\partial \Delta_\beta L_T^{(h)}(\theta, \beta)}{\partial\theta'} = -J(\theta, \beta), \]
for $J(\theta, \beta)$ a $(p \times p)$-dimensional matrix, with $J(\theta, \beta)$ non-singular.

(Footnote: For the sake of notational simplicity, we have not made explicit the dependence of the likelihood function on the initial value $z_1$ of the simulated data. Since we are confining ourselves to standard settings, the dependence of $L_T^{(h)}$ on $z_1$ will be immaterial asymptotically.)

Our estimator of $\theta$ will be based on matching a pseudo-score at a preliminary estimator $\hat\beta_T$ ($\hat\beta_T \in \Theta^c$) of $\beta$. We emphasize here that $\hat\beta_T$ is a preliminary estimator of $\beta$, and not of $\theta$, since it is constrained by the possibly misspecified constraint $\beta \in \Theta^c$, meaning that it cannot, in general, be a consistent estimator for $\theta$. We will only maintain that $\hat\beta_T$ is a $\sqrt T$-consistent estimator of some pseudo-true value $\beta^*$:
\[ \hat\beta_T = \left( \hat\beta_{1,T}',\, \beta_2^{0\prime} \right)', \qquad \beta^* = \left( \beta_1^{*\prime},\, \beta_2^{0\prime} \right)'. \]
We can now define our pseudo-score matching estimator of $\theta$ as follows.

Definition 1: The Approximate Maximum Likelihood (AML) estimator $\hat\theta_{T,H}$ of $\theta$ is defined as the solution to the following equation:
\[ \Delta_\beta L_T\left( \hat\beta_T \right) = \frac{1}{H}\sum_{h=1}^H \Delta_\beta L_T^{(h)}\left( \hat\theta_{T,H}, \hat\beta_T \right). \tag{18} \]
The AML estimator, (18), is defined as the solution of $p$ nonlinear equations in $p$ unknown parameters, so that we may expect existence of a solution $\theta = \hat\theta_{T,H}$.
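The mechanics of Definition 1 can be illustrated in a deliberately simple setting. The toy structural model, the Gaussian auxiliary model, and all names below are illustrative assumptions, not one of the paper's examples: the auxiliary pseudo-score is the mean equation of a $N(\beta, 1)$ model, so matching (18) reduces to matching observed and simulated means, which we solve by bisection with common random numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy structural model (illustrative): y = exp(theta * z), z ~ N(0, 1).
def simulate(theta, z):
    return np.exp(theta * z)

# Constrained auxiliary model y ~ N(beta, 1): its pseudo-score at beta is
# Delta_beta L_T(beta) = mean(y) - beta.
def pseudo_score(y, beta):
    return y.mean() - beta

theta0, T, H = 0.5, 20000, 5
y_obs = simulate(theta0, rng.standard_normal(T))
beta_hat = y_obs.mean()                  # preliminary auxiliary estimator

# AML: solve (18) in theta, holding the H simulation draws fixed
# (common random numbers) so the matching criterion is smooth in theta.
z_sim = rng.standard_normal((H, T))
def match(theta):
    sim = np.mean([pseudo_score(simulate(theta, z), beta_hat) for z in z_sim])
    return pseudo_score(y_obs, beta_hat) - sim    # zero at the AML solution

lo, hi = 0.01, 2.0                        # match() is decreasing in theta here
for _ in range(60):
    mid_pt = 0.5 * (lo + hi)
    lo, hi = (lo, mid_pt) if match(mid_pt) < 0 else (mid_pt, hi)
theta_aml = 0.5 * (lo + hi)
```

Because the system is just identified, the root of the score difference and the minimizer of any positive-definite quadratic form in it coincide, in line with the discussion that follows.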
However, in practice it will be safer to minimize a squared norm of the difference between the two terms in (18). The fact that the system (18) is just identified tells us that, asymptotically, the behavior of the minimizer should not depend on the weighting matrix used in the squared norm, insofar as (18) asymptotically defines a unique solution which, hopefully, coincides with the true unknown value $\theta^0$. Establishing this will be the purpose of the main identification assumption (given in Section 3). We can already state the following general result.

Proposition 1: If $\sqrt T(\hat\beta_T - \beta^*) = O_P(1)$, then under Assumptions A1, A2, the AML estimator $\hat\theta_{T,H}$ satisfies
\[ \operatorname*{plim}_{T\to\infty}\left\{ \sqrt T\, \Delta_\beta L_T\left( \beta^* \right) - \frac{1}{H}\sum_{h=1}^H \sqrt T\, \Delta_\beta L_T^{(h)}\left( \hat\theta_{T,H}, \beta^* \right) \right\} = 0. \]
Under Assumption A3 and other well-suited identification and regularity conditions (see Section 3 for precise details),
\[ \sqrt T\left( \hat\theta_{T,H} - \theta^0 \right) \to_d \mathcal{N}\left( 0, \Omega(H) \right), \qquad \Omega(H) = \left( 1 + \frac{1}{H} \right)\left[ J\left( \theta^0, \beta^* \right) \right]^{-1}\left[ I\left( \theta^0, \beta^* \right) \right]\left[ J\left( \theta^0, \beta^* \right)' \right]^{-1}, \]
with
\[ I(\theta^0, \beta^*) = \lim_{T\to\infty} \operatorname{Var}\left\{ \sqrt T\, \Delta_\beta L_T(\beta^*) - E\left[ \sqrt T\, \Delta_\beta L_T(\beta^*) \,\middle|\, \{x_t\}_{t=1}^T \right] \right\}. \qquad \Box \]

An important message of Proposition 1 is that the probability distribution of the AML estimator $\hat\theta_{T,H}$ depends on the choice of the estimator $\hat\beta_T$ only through the pseudo-true value $\beta^*$. In other words, the AML estimator defined by (18) is asymptotically equivalent to the unfeasible estimator $\breve\theta_{T,H}(\beta^*)$ of $\theta$ that solves
\[ \Delta_\beta L_T\left( \beta^* \right) = \frac{1}{H}\sum_{h=1}^H \Delta_\beta L_T^{(h)}\left( \theta, \beta^* \right). \]

2.5 Comparison with I-I Approaches

The pseudo-score considered by Gallant and Tauchen (1996) (GT hereafter) is not, in general, a proxy of the structural score in which the parameter vector $\beta$ is of the same dimension as the structural parameter vector $\theta$.
On the contrary, GT consider an auxiliary model with likelihood function
\[ Q_T(\beta) = \frac{1}{T}\sum_{1 \leq t \leq T} \log\left( q\{y_t \mid (y_\tau)_{1 \leq \tau \leq t-1}, x_t, z_1; \beta\} \right), \qquad \beta \in B \subset \mathbb{R}^q. \]
The function $q\{y_t \mid (y_\tau)_{1 \leq \tau \leq t-1}, x_t, z_1; \cdot\}$ is not, in general, the true transition density of the process $\{y_t\}_{t=1}^T$. It is a pseudo-likelihood in the sense of Gourieroux et al. (1984), which is precisely the reason for using the notations $q\{\cdot \mid \cdot\}$ and $Q_T(\cdot)$ instead of $l\{\cdot \mid \cdot\}$ and $L_T(\cdot)$. The pseudo maximum likelihood estimator $\hat\beta_T$ then satisfies
\[ \frac{\partial Q_T}{\partial\beta}\left( \hat\beta_T \right) = 0. \]
Using $\hat\beta_T$, GT define an I-I estimator $\hat\theta_{T,H}$ of $\theta$ as the solution of the following program:
\[ \min_\theta \left\| \frac{1}{H}\sum_{h=1}^H \Delta_\beta Q_T^{(h)}\left( \theta, \hat\beta_T \right) \right\|_{W_T}, \tag{19} \]
for $W_T$ a positive-definite matrix, and where $\|x\|^2_{W_T} = x'W_T x$. While GT only consider the case $H = \infty$, the above definition is indeed the extension of GT proposed by GMR. In GMR, the authors demonstrate that the estimator $\hat\theta_{T,H}$ described above is asymptotically equivalent to the standard I-I estimator based on matching estimators of $\beta$, which implicitly requires $q \geq p$. Since $\Delta_\beta Q_T(\hat\beta_T) = 0$ by the first-order conditions, the GT estimator $\hat\theta_{T,H}$ can be equivalently viewed as the solution of
\[ \min_\theta \left\| \Delta_\beta Q_T\left( \hat\beta_T \right) - \frac{1}{H}\sum_{h=1}^H \Delta_\beta Q_T^{(h)}\left( \theta, \hat\beta_T \right) \right\|_{W_T}. \]
Therefore, if the pseudo-likelihood $Q_T(\cdot)$ coincided with the true likelihood $L_T(\cdot)$, and $\hat\beta_T$ were not subject to false equality constraints, the GT I-I estimator would exactly coincide with our AML estimator.
However, it is worth keeping in mind that our philosophy for AML is precisely the opposite: we are explicitly concerned with cases where, by the nature of the constraints we employ,
\[ \Delta_\beta Q_T\left( \hat\beta_T \right) \neq 0. \]
A consequence of this difference in estimation philosophy is that GT underpin the accuracy of the I-I estimator $\hat\theta_{T,H}$ with the asymptotic distribution of the auxiliary estimator $\hat\beta_T$. This point of view can be seen via a Taylor expansion of the first-order conditions
\[ \frac{\partial}{\partial\theta}\left[ \frac{1}{H}\sum_{h=1}^H \Delta_\beta Q_T^{(h)}\left( \hat\theta_{T,H}, \hat\beta_T \right)' \right] W_T \frac{\sqrt T}{H}\sum_{h=1}^H \Delta_\beta Q_T^{(h)}\left( \hat\theta_{T,H}, \hat\beta_T \right) = 0. \]
Using Assumptions A2 and A3 (and with an abuse of notation, as if $L_T = Q_T$), we see that
\[ o_P(1) = J\left( \theta^0, \beta^* \right)' W_T \frac{\sqrt T}{H}\sum_{h=1}^H \Delta_\beta Q_T^{(h)}\left( \theta^0, \beta^* \right) + J\left( \theta^0, \beta^* \right)' W_T K\left( \beta^* \right)\sqrt T\left( \hat\beta_T - \beta^* \right) + J\left( \theta^0, \beta^* \right)' W_T J\left( \theta^0, \beta^* \right)\sqrt T\left( \hat\theta_{T,H} - \theta^0 \right). \]
GMR (see the part of their Appendix 1 entitled "The Third Version of the Indirect Estimator") show that the above Taylor expansion allows us to view $\sqrt T(\hat\theta_{T,H} - \theta^0)$ as an asymptotically linear function of the difference between $\hat\beta_T$ and a similar estimator computed on simulated data. For this reason, the asymptotic distribution of $\sqrt T(\hat\theta_{T,H} - \theta^0)$ is directly determined by the asymptotic distribution of $\sqrt T(\hat\beta_T - \beta^*)$, which is in sharp contrast to the result of Proposition 1 for the AML estimator.

Consider that the false equality constraints under which AML is implemented can be written in the implicit form
\[ g(\theta) = 0, \]
for some given function $g: \Theta \to \mathbb{R}^{d_g}$, with $d_g < p$. Recall that the log-likelihood function $L_T(\theta)$ is assumed to be tractable for the set of parameters satisfying this constraint.
It is then possible to estimate the parameters from the Lagrangian function
\[ \mathcal{L}_T(\beta, \lambda) = L_T(\beta) + g(\beta)'\lambda, \]
where $\lambda \in \mathbb{R}^{d_g}$ is the vector of Lagrange multipliers. The estimator $\hat\zeta_T = (\hat\beta_T', \hat\lambda_T')'$ can then be defined from the first-order conditions
\[ 0 = \frac{\partial \mathcal{L}_T(\hat\beta_T, \hat\lambda_T)}{\partial\beta} = \Delta_\beta L_T\left( \hat\beta_T \right) + \frac{\partial g(\hat\beta_T)'}{\partial\beta}\hat\lambda_T, \qquad 0 = g(\hat\beta_T). \]
From these conditions, Calzolari et al. (2004) argue that I-I score matching should be corrected by the information contained in the Lagrange multipliers. In other words, they propose that $\hat\theta_{T,H}$ solve
\[ \frac{1}{H}\sum_{h=1}^H \Delta_\beta L_T^{(h)}\left( \hat\theta_{T,H}, \hat\beta_T \right) + \frac{\partial g(\hat\beta_T)'}{\partial\beta}\hat\lambda_T = 0, \tag{20} \]
which is equivalent to solving
\[ \frac{1}{H}\sum_{h=1}^H \Delta_\beta L_T^{(h)}\left( \hat\theta_{T,H}, \hat\beta_T \right) - \Delta_\beta L_T\left( \hat\beta_T \right) = 0, \]
and thus coincides with our AML estimator.

(Footnote: It is worth knowing that Calzolari et al. (2004) also contemplate the I-I estimator defined by (20) in the case of inequality constraints on the auxiliary parameters, so that $\hat\lambda_T$ is a vector of Kuhn-Tucker multipliers.)

Although we do not use $\hat\lambda_T$ to encapsulate the information about the violation of the constraints (information that should be added to the information brought by the constrained estimator $\hat\beta_T$), it still makes sense to imagine that the full score vector accounts for this missing information. This will be confirmed by our general analysis in the next subsections.

In addition, it is worth noting that even though our AML approach is similar to the I-I estimators proposed in Calzolari et al. (2004), it stems from a completely different point of view. We have defined an auxiliary model with parameter vector $\beta$ as a version of the structural model that has been simplified. In contrast to Calzolari et al. (2004), we never contemplate simplifying the auxiliary model, which in their case has already been chosen to be a simple approximation to the structural model.
The examples in Section 2.2 demonstrate that there are important cases where imposing a simplifying constraint of the form
\[ \theta = h(\gamma), \qquad \gamma \in \Gamma \subset \mathbb{R}^d, \; d < p, \]
results in an auxiliary model that is a computationally feasible version of the structural model of interest. As explained in Calvet and Czellar (2015): "Since [under the constraints] the auxiliary and structural models are then closely related, the resulting indirect inference estimator is expected to have good accuracy properties."

Calvet and Czellar (2015) propose to use estimators of the auxiliary parameters based on the observed data, say $\hat\gamma_T$, and the simulated data, say $\tilde\gamma_T(\theta)$, to estimate the structural parameters. However, while $\hat\gamma_T$ and $\tilde\gamma_T(\theta)$ can often be obtained relatively easily, it is important to realize that these auxiliary parameters cannot generally identify the structural parameters $\theta$, except in the unlikely case that the constraints $\{\exists \gamma \in \Gamma,\ \theta = h(\gamma)\}$ are satisfied at $\theta^0$ (the true value of the structural parameters).

To circumvent this identification issue, Calvet and Czellar (2015) propose to add additional auxiliary statistics, of dimension at least as large as $p - d$, within the I-I procedure.
Denote the statistics based on the observed data by $\hat\eta_T$ and on the simulated data by $\tilde\eta_T(\theta)$. Calvet and Czellar (2015) then propose to estimate $\theta$ from the following program: for $\hat\beta_T := (\hat\gamma_T', \hat\eta_T')'$ and $\tilde\beta_T(\theta) := (\tilde\gamma_T(\theta)', \tilde\eta_T(\theta)')'$, an estimator of $\theta$ can be obtained by
\[ \min_{\theta\in\Theta} \left( \hat\beta_T - \tilde\beta_T(\theta) \right)' W \left( \hat\beta_T - \tilde\beta_T(\theta) \right), \tag{21} \]
where $W$ is a positive-definite weighting matrix of conformable dimension.

In a sense, the approach of Calvet and Czellar (2015) follows the idea of estimation under the null that is commonly encountered in testing situations in econometrics; namely, we estimate a simpler version of the model, formed as a constrained version of the model we assume has actually generated the data, and then construct statistics about this simpler model in order to conduct inference on the general model. First, note that the object of inference remains $\theta \in \Theta$, and not only $\theta \in \Theta^c = \{\theta \in \Theta;\ \exists \gamma \in \Gamma,\ \theta = h(\gamma)\}$.

(Footnote, cont.: In this case, the argument for considering the recentered score vector (20) instead of a score vector (19) à la Gallant and Tauchen (1996) is no longer to correct for a misspecification bias but to hedge against possible non-normality of estimators constrained by inequality restrictions. It can then be shown (see also Frazier and Renault (2019) for a detailed asymptotic theory in the case of parameters near the boundary of the parameter space) that taking the difference of the two score vectors, as in (18), will restore asymptotic normality even though each of them is not asymptotically normal, due to the fact that the inequality-constrained estimator $\hat\beta_T$ is not asymptotically normal.)

Second, since the Calvet and Czellar (2015) approach directly imposes the constraints in explicit form within the structural model, they obtain what they consider an "unconstrained" auxiliary model. The result is that this approach will generate simple auxiliary estimators of $\beta$.
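The parameter-matching program (21) can be sketched schematically. The toy structural model, the choice of auxiliary estimator, and the extra statistic below are all illustrative assumptions made purely to show the mechanics of matching $\hat\beta_T$ and $\tilde\beta_T(\theta)$ under a weighting matrix $W$:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate(theta, z):
    # Toy structural model (illustrative): y = theta1 + theta2 * z, z ~ N(0, 1).
    return theta[0] + theta[1] * z

def aux_stats(y):
    # beta_hat = (gamma_hat, eta_hat): the auxiliary estimator from a simplified
    # model (here, just the sample mean) plus an extra statistic (sample std)
    # appended to restore identification of the full theta.
    return np.array([y.mean(), y.std()])

theta0, T = np.array([0.3, 1.5]), 4000
y_obs = simulate(theta0, rng.standard_normal(T))
b_obs = aux_stats(y_obs)

z_sim = rng.standard_normal(T)           # common random numbers across theta
W = np.eye(2)                            # identity weighting matrix

def objective(theta):
    diff = b_obs - aux_stats(simulate(theta, z_sim))
    return diff @ W @ diff

# Crude grid search over (theta1, theta2); a real implementation would use
# a numerical optimizer.
vals = [(objective((a, s)), a, s)
        for a in np.linspace(-1.0, 1.0, 41)
        for s in np.linspace(0.5, 2.5, 41)]
_, t1_hat, t2_hat = min(vals)
```

The sketch also makes the arbitrariness of the extra statistic visible: any $\hat\eta_T$ that restores identification would do, which is the point taken up next.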
However, the downside is that, since the impact of the constraints has been disregarded, the approach cannot identify the entire vector of structural parameters without resorting to ad hoc statistics. While the addition of $\hat\eta_T$ to the auxiliary estimators may result in a vector of statistics that can identify $\theta$, the precise choice of $\hat\eta_T$ in any given example is somewhat arbitrary and likely sub-optimal.

Third, for the sake of efficient inference, one should realize that, by definition, the estimator of the simplified structural model (indexed by a lower-dimensional parameter), while convenient, overlooks relevant information. In the following section, we demonstrate that AML can, in a sense, account for this information loss and thus approach the efficiency of maximum likelihood estimation without giving up the convenient simplification of our structural model.

In this section, we describe the asymptotic distribution of the AML estimator $\hat\theta_{T,H}$, which is the solution, in $\theta$, to
\[ \Delta_\beta L_T\left( \hat\beta_T \right) = \frac{1}{H}\sum_{h=1}^H \Delta_\beta L_T^{(h)}\left( \theta, \hat\beta_T \right), \]
where $\hat\beta_T$ is a consistent estimator of a pseudo-true value $\beta^* \in \Theta^c \subset \Theta$. The asymptotic theory for this estimator is not completely standard since, for each $h = 1, \dots, H$, $L_T^{(h)}(\theta, \hat\beta_T)$ is a sample mean of $T$ terms, each of them depending on $\hat\beta_T$; hence it is a double array. As explained in Section 2, in particular the result of Proposition 1, we focus on situations where the asymptotic distribution of the AML estimator $\hat\theta_{T,H}$ depends on the estimator $\hat\beta_T$ only through its probability limit $\beta^*$. Therefore, to simplify the exposition, we first focus on the unfeasible AML (hereafter, UAML) estimator $\breve\theta_{T,H}(\beta^*)$, defined as the solution, in $\theta$, to
\[ \Delta_\beta L_T\left( \beta^* \right) = \frac{1}{H}\sum_{h=1}^H \Delta_\beta L_T^{(h)}\left( \theta, \beta^* \right). \]
Since $\Delta_\beta L_T(\beta^*)$ is a pseudo-score, and may include components that cannot be represented as partial derivatives of $L_T(\cdot)$, we follow van der Vaart (1998) (Chapter 5) and refer to $\breve\theta_{T,H}(\beta^*)$ as a Z-estimator of $\theta$. Moreover, it is worth recalling that we do not accommodate here the case where one component of the structural parameter vector is an integer. This case could be handled by extending the range of the integer parameter to the complete set of non-negative real numbers, which is feasible via a piecewise-linear extension.

3.1 Consistency

For a given pseudo-true value $\beta^*$, consistency of $\breve\theta_{T,H}(\beta^*)$ for $\theta^0$ follows by applying Theorem 5.9 in van der Vaart (1998), which requires the following regularity condition.

Assumption B1 (Identification given $\beta^*$): For any $h = 1, \dots, H$, $\Delta_\beta L_T^{(h)}(\theta, \beta^*)$ converges in probability (as $T \to \infty$), uniformly on $\theta \in \Theta$, towards a function $M(\theta, \beta^*)$ such that, for every $\varepsilon > 0$,
\[ \inf_{\theta \in \Theta:\, d(\theta, \theta^0) \geq \varepsilon} \left\| M\left( \theta, \beta^* \right) - M\left( \theta^0, \beta^* \right) \right\| > 0. \]
From the i.i.d. nature of the simulations and the definition of the simulated log-likelihood $L_T^{(h)}(\theta, \beta)$ in (17), it is not restrictive to assume that $M(\theta, \beta^*)$ does not depend on $h$. Similarly, $\Delta_\beta L_T(\beta^*)$ converges towards $M(\theta^0, \beta^*)$. Under Assumption B1, we can state the following result.

Proposition 2: Under Assumption B1, the UAML estimator $\breve\theta_{T,H}(\beta^*)$ is a consistent estimator of the true unknown value $\theta^0$: $\operatorname{plim}_{T\to\infty} \breve\theta_{T,H}(\beta^*) = \theta^0$. $\Box$

We now illustrate the identification condition in Assumption B1 in two examples, and demonstrate that this condition is similar to the identification condition required by ML. For the purpose of these illustrations, we only consider that Assumption B1 enforces
\[ M\left( \theta, \beta^* \right) - M\left( \theta^0, \beta^* \right) \neq 0, \qquad \forall \theta \neq \theta^0. \]
That is, we temporarily overlook the fact that a well-separated minimum of $\|M(\theta, \beta^*) - M(\theta^0, \beta^*)\|$ generally requires additional regularity, e.g., continuity of the function $M(\cdot, \beta^*)$ and compactness of $\Theta$.

Example: Well-specified Models. Assume that $\Delta_\beta L_T^{(h)}(\theta, \beta)$ is the score vector of a well-specified parametric model for which $\beta = \theta^0$ is the true unknown value of the parameters, i.e.,
\[ \Delta_\beta L_T^{(h)}(\theta, \beta) = \frac{1}{T}\sum_{t=1}^T \frac{\partial \log\left[ l\{\tilde y_t^{(h)}(\theta) \mid \{\tilde y_\tau^{(h)}(\theta)\}_{1 \leq \tau \leq t-1}, x_t; \beta\} \right]}{\partial\beta}. \]
Under standard regularity conditions,
\[ M(\theta, \beta) = E_\theta\left\{ \frac{\partial \log\left[ l\{y_t \mid \{y_\tau\}_{1 \leq \tau \leq t-1}, x_t; \beta\} \right]}{\partial\beta} \right\}, \]
where $E_\theta$ denotes the expectation computed under the probability distribution of the process $\{y_t\}_{t=1}^T$ at the parameter value $\theta$. The standard identification condition for maximum likelihood is then
\[ M(\theta, \beta) = 0 \iff \theta = \beta. \]
In particular, $M(\theta, \beta^*) - M(\theta^0, \beta^*) \neq 0$ for all $\theta \neq \theta^0 = \beta^*$. In other words, the identification condition in Assumption B1 for the UAML estimator is tantamount to the identification condition for maximum likelihood. $\Box$

Example: Exponential Models. Assume that, conditionally on $\{x_t\}_{t=1}^T$, the variables $y_t$ are independent, for $t = 1, \dots, T$, and the conditional distribution of $y_t$ only depends on the exogenous variable $x_t$ with the same index. Further, assume that this distribution has a density $l\{y_t \mid x_t; \theta\}$ of the exponential form
\[ l\{y_t \mid x_t; \theta\} = \exp\left[ c(x_t, \theta) + h(y_t, x_t) + a(x_t, \theta)'T(y_t) \right], \]
where $c(\cdot, \cdot)$ and $h(\cdot, \cdot)$ are given functions, and $a(x_t, \theta)$ and $T(y_t)$ are $r$-dimensional vectors, all known up to the unknown $\theta$. The extension to dynamic models, in which the conditioning variables would also include lagged values of the process $y_t$, could also be considered at the cost of additional notation.
From

∂ log[ l{y_t | x_t; θ} ] / ∂θ = ∂c(x_t, θ)/∂θ + [∂a(x_t, θ)′/∂θ] T(y_t),

and since the conditional score vector has, by definition, a zero conditional expectation, we deduce that

∂L_T(θ)/∂θ = (1/T) Σ_{t=1}^T [∂a′(x_t, θ)/∂θ] { T(y_t) − E_θ[T(y_t) | x_t] }.

Following Theorem 1 in Gourieroux et al. (1987), write

E_θ[T(y_t) | x_t] = m(x_t, θ),  Var_θ[T(y_t) | x_t] = Ω(x_t, θ),

which implies that

∂a(x_t, θ)′/∂θ = [∂m′(x_t, θ)/∂θ] Ω⁻¹(x_t, θ).

Therefore, the maximum likelihood estimator ˆθ_T is defined as the solution to

∂L_T(θ)/∂θ = (1/T) Σ_{t=1}^T [∂m′(x_t, θ)/∂θ] Ω⁻¹(x_t, θ) { T(y_t) − m(x_t, θ) } = 0.  (22)

The first-order conditions (22) show that maximum likelihood is the GMM estimator with optimal instruments for the conditional moment restrictions

E_θ[ T(y_t) − m(x_t, θ) | x_t ] = 0.

Under the assumptions for standard asymptotic theory of efficient GMM (Hansen, 1982), i.e., for all θ ∈ Θ the conditional variance Ω(x_t, θ) of the moment conditions is non-singular and the Jacobian matrix E[∂m′(x_t, θ)/∂θ | x_t] is full row rank, the identification condition for consistency of maximum likelihood is that

E { [∂m′(x_t, θ)/∂θ] Ω⁻¹(x_t, θ) [ T(y_t) − m(x_t, θ) ] } = 0 ⟹ θ = θ⁰.

We summarize the relationship between the ML identification above and the corresponding version for UAML in the following result, the details of which can be found in Appendix C.

Result 2: In the exponential model, the identification condition in Assumption B1 can be restated as

E { [∂m′(x_t, β⁰)/∂θ] Ω⁻¹(x_t, β⁰) [ m(x_t, θ) − m(x_t, θ⁰) ] } = 0 ⟹ θ = θ⁰.  (23)

Two cases are of primary interest to demonstrate that the identification condition for UAML is tantamount to the ML identification condition.

Case 1: The model is a linear regression.
For some known multivariate function κ(x_t) of x_t,

m(x_t, θ) = κ(x_t)′ θ.

The identification condition (23) is then equivalent to

E[ κ(x_t) Ω⁻¹(x_t, β⁰) κ(x_t)′ ] (θ − θ⁰) = 0 ⟹ θ = θ⁰.

Moreover, if E[κ(x_t) Ω⁻¹(x_t, β⁰) κ(x_t)′] is full rank at β⁰ = θ⁰, it is full rank for any β⁰ ∈ Θ.

Case 2: The model is unconditional. In this case, a necessary identification condition is given by

E_θ[T(y)] = E_{θ⁰}[T(y)] ⟺ θ = θ⁰.

In this case, the AML identification condition (23) can be equivalently stated as

[∂m′(β⁰)/∂θ] Ω⁻¹(β⁰) { E_θ[T(y)] − E_{θ⁰}[T(y)] } = 0 ⟹ θ = θ⁰.

The matrix ∂m(β⁰)′/∂θ is full row rank, irrespective of the value of β⁰, so that if Ω(β) is non-singular for any β ∈ Θ, the above identification condition is implied by the identification condition E_θ[T(y)] = E_{θ⁰}[T(y)] ⟺ θ = θ⁰. It is also possible to extend the above analysis to the case of latent exponential models. For the sake of brevity, the details of this extension are given in Appendix C.2. □

We now return to the general case and address consistency of AML based on a first-step consistent estimator of β⁰. For this purpose, we must slightly reinforce Assumption B1.

Assumption B1′: The estimator ˆβ_T satisfies √T(ˆβ_T − β⁰) = O_P(1). Assumption B1 is fulfilled, and, for any h = 1, ..., H and any real number γ > 0,

sup_{θ ∈ Θ} sup_{‖ˆβ_T − β⁰‖ ≤ γ/√T} ‖ ∆_β L_T^(h)(θ, ˆβ_T) − M(θ, β⁰) ‖ = o_P(1).

Proposition 3: Under Assumption B1′, the AML estimator ˆθ_T,H is a consistent estimator of the true unknown value θ⁰: plim_{T→∞} ˆθ_T,H = θ⁰. □

3.2 Asymptotic Normality and Efficiency

Asymptotic normality has already been demonstrated in Proposition 1; see Section 2.4. Ensuring the argument is rigorous only requires slightly reinforcing Assumption A3.
Assumption B2: For any h = 1, ..., H and any real number γ > 0,

sup_{θ ∈ Θ} sup_{‖ˆβ_T − β⁰‖ ≤ γ/√T} ‖ ∂∆_β L_T^(h)(θ, ˆβ_T)/∂θ′ + J(θ, β⁰) ‖ = o_P(1).

Proposition 4: Under Assumptions A1, A2, A3 and Assumptions B1′, B2, the AML estimator ˆθ_T,H and the UAML estimator ˘θ_T,H(β⁰) are asymptotically normal with zero mean and asymptotic variance

Ω(H) = (1 + 1/H) [J(θ⁰, β⁰)]⁻¹ [I(θ⁰, β⁰)] [J(θ⁰, β⁰)]⁻¹. □

A natural question to ask is how close the asymptotic variance matrix Ω(∞) = lim_{H→∞} Ω(H) is to the Cramer-Rao efficiency bound. It is important to realize that an efficiency loss can only occur if β⁰ ≠ θ⁰ or if the pseudo-score vector ∆_β L_T(θ⁰) is not the true score vector. More precisely, we prove the following result in Appendix A.

Proposition 5: Under the assumptions of Proposition 4, if

∆_β L_T(θ⁰) = (1/T) Σ_{t=2}^T ∂ log( l{ y_t | (y_τ)_{1 ≤ τ ≤ t−1}, x_t, z₁; θ⁰ } ) / ∂θ = (1/T) Σ_{t=2}^T S{ y_t | (y_τ)_{1 ≤ τ ≤ t−1}, x_t, z₁; θ⁰ },

and if H → ∞, then the asymptotic variance of the UAML estimator ˘θ_T,H(β⁰) (and that of the AML estimator ˆθ_T,H) achieves the Cramer-Rao efficiency bound.
□

However, it is important to note that even if ∆_β L_T(β⁰) = (1/T) Σ_{t=2}^T S{ y_t | (y_τ)_{1 ≤ τ ≤ t−1}, x_t, z₁; β⁰ }, i.e., even if ∆_β L_T(β⁰) is accurately computed at the pseudo-true value β⁰, the matrix

I(θ⁰, β⁰) = lim_{T→∞} Var{ √T ∆_β L_T(β⁰) − E[ √T ∆_β L_T(β⁰) | {x_t}_{t=1}^T ] }

will coincide with the Fisher information matrix only if

lim_{T→∞} Var{ E[ √T ∆_β L_T(β⁰) | {x_t}_{t=1}^T ] } = 0.

This property is unlikely to be fulfilled in the case of a conditional model when β⁰ ≠ θ⁰. However, it is automatically fulfilled in a model that is not conditional. Moreover, it is possible to analytically calculate the proximity between the asymptotic variances of AML and genuine maximum likelihood in the previously considered case of exponential models.

Example: Exponential Models, Continued

From the first-order conditions (22), the simulated pseudo-score can be stated as

∆_β L_T^(h)(θ, β) = (1/T) Σ_{t=1}^T [∂m′(x_t, β)/∂θ] Ω⁻¹(x_t, β) { T[ỹ_t^(h)(θ)] − m(x_t, β) }.

Recalling the definition of the UAML estimator, we see that ˘θ_T(β⁰) := lim_{H→∞} ˘θ_T,H(β⁰) is defined as the solution, in θ, to

(1/T) Σ_{t=1}^T [∂m′(x_t, β⁰)/∂θ] Ω⁻¹(x_t, β⁰) { T(y_t) − m(x_t, θ) } = 0,

where we recall that E_θ[T(y_t) | x_t] = m(x_t, θ) = lim_{H→∞} Σ_{h=1}^H T[ỹ_t^(h)(θ)]/H.

Comparing the above equation with (22), the only reason why UAML may be less efficient than ML is that the evaluation of the "optimal instruments" is carried out at a pseudo-true value of the structural parameters (i.e., β⁰ ≠ θ⁰). It is worth revisiting the implications of this in the two cases considered in Result 2.

Case 1: The model is a linear regression.
For some known multivariate function κ(x_t) of x_t, m(x_t, θ) = κ(x_t)′θ. The equation defining the UAML estimator is then

(1/T) Σ_{t=1}^T κ(x_t) Ω⁻¹(x_t, β⁰) { T(y_t) − κ(x_t)′θ } = 0.

From the above, we see that the presence of conditional heteroskedasticity or cross-correlation of a parametric nature can result in a loss of efficiency for UAML. However, if Ω(x_t, β) = σ² Id, UAML is asymptotically equivalent to maximum likelihood.

Case 2: The model is unconditional. The equation defining the UAML estimator is then given by

[∂m′(β⁰)/∂θ] Ω⁻¹(β⁰) (1/T) Σ_{t=1}^T { T(y_t) − m(θ) } = 0.

In this case, the only possible loss of efficiency will occur if the moment conditions that identify θ are overidentified, i.e., when r = dim(T) ≥ p, so that the selection matrix [∂m′(β⁰)/∂θ] Ω⁻¹(β⁰) is optimal only at β⁰ = θ⁰. An efficiency loss will then occur if, when evaluated at β⁰ ≠ θ⁰, the vector space spanned by the rows of the selection matrix does not coincide with the space spanned by the rows when β⁰ = θ⁰.

In this section, we apply AML to two of the examples considered in Section 2.2. First, we analyze the repeated sampling behavior of AML in the confines of the generalized Tobit model, with a pseudo-score computed under the false inequality constraint discussed in Section 2.2.3. Next, we evaluate the performance of AML relative to ML in the MSM model, described in Section 2.2.4, and use AML to estimate the MSM model on daily S&P500 returns. The empirical results suggest a large value of k for this data, which ensures that ML cannot be feasibly implemented.

4.1 Example 1: Generalized Tobit Model

We illustrate the performance of AML in the generalized Tobit-type model via a Monte Carlo study.
We generate 1,000 replications from the structural model in equations (9)-(10) (jointly with the logistic distribution specification for y*_i, as in equation (11)) for two different sample sizes, T = 1,000 and T = 10,000. The true parameter values are θ₁ = (0.·, 0.·)′, θ₂ = (0.·, 0.·)′ and θ₃ = 1, and the scale parameter for the model is σ = 0.5. The explanatory variables are given by x_i = x̃_i = (1, x_i)′, with x_i generated i.i.d. from the uniform distribution on [0, ·], and AML uses H = 10 simulated samples.

For each Monte Carlo replication, we calculate the constrained auxiliary estimators and the AML estimator. We compare the resulting estimates graphically in Figures 1 and 2. For each of the parameters, the left boxplot represents the auxiliary estimator over the replications, and the right boxplot the AML estimator. The true parameter values are reported as horizontal lines.

The results demonstrate that while the restricted model is easy to estimate, it ultimately provides biased estimators for several components of θ and for σ (as well as for the component that is fixed at a value of zero). In contrast, AML delivers point estimators that are well-centred over the true values.

Table 1 compares the AML and auxiliary estimators across the two sample sizes in terms of bias (Bias), mean squared error (MSE), and Monte Carlo coverage (COV). The results demonstrate that AML delivers estimators with relatively small biases and good Monte Carlo coverage.

Monte Carlo coverage is calculated as the average number of times, across the Monte Carlo trials, that θ⁰_j, i.e., the true value of the j-th parameter, is contained in the univariate confidence interval ˆθ_ij ± 1.96 ˆσ_j, where ˆσ_j is the standard deviation of the j-th parameter estimates over the Monte Carlo replications and ˆθ_ij is the estimate of the j-th parameter in the i-th Monte Carlo trial.
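The coverage calculation described above can be sketched as follows (a minimal illustration: the normal draws stand in for the estimates across replications, and all names are placeholders, not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(1)

# Placeholder: R Monte Carlo replications of an estimator of a scalar
# parameter with true value theta0; fake normal draws stand in for the
# AML estimates across replications.
R, theta0 = 1_000, 0.5
theta_hat = theta0 + 0.1 * rng.standard_normal(R)

# Monte Carlo coverage as defined above: the fraction of trials in which
# theta0 lies in theta_hat_i +/- 1.96 * sigma_hat, where sigma_hat is
# the standard deviation of the estimates across all trials.
sigma_hat = theta_hat.std(ddof=1)
covered = np.abs(theta_hat - theta0) <= 1.96 * sigma_hat
coverage = covered.mean()
print(coverage)  # near the nominal 0.95 level
```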
Figure 1: Each boxplot reports the auxiliary (left boxplots) and AML (right boxplots) parameter estimates for the generalized Tobit model at T = 1,000 across the Monte Carlo replications. The true parameter values, reported as horizontal lines, are those given above, with σ = 0.5.

Figure 2: Each boxplot reports the auxiliary (left boxplots) and AML (right boxplots) parameter estimates for the generalized Tobit model at T = 10,000 across the Monte Carlo replications. The true parameter values are the same as in Figure 1 and are reported as horizontal lines.

Table 1: Bias (Bias), mean squared error (MSE) and Monte Carlo coverage (COV) of the auxiliary and AML parameter estimates of the generalized Tobit model for T = 1,000 and T = 10,000.

4.2 Example 2: Markov-Switching Multifractal (MSM) Model

In this sub-section, we explore the behavior of AML and, when feasible, compare AML and ML. As discussed in Section 2.2.4, the structural parameters in the MSM model are θ = (ζ′, k)′, where the parameters ζ = (m₀, γ̄, b, σ)′ govern the behavior of the individual volatility processes, and where k denotes the (unknown) number of volatility components. The likelihood of the MSM model, L_T(ζ, k), is given in equation (12), and can be optimized so long as small values of k are considered. Indeed, for fixed ζ, computation of the likelihood is only feasible for values of k that are not too large: a single evaluation of the log-likelihood for a sample of size T requires O(2^k T) computations, and ML estimation becomes infeasible if the true value of k is large.

However, under the constraint k = 2, the likelihood L_T(ζ, k) requires only O(2² T) computations. This suggests the following constrained estimator for the purpose of AML:

ˆβ_T = arg max_{β ∈ Θ} L_T(ζ, k),  s.t. k = 2.  (24)

The likelihood L_T(ζ, k) is not differentiable in k, since k ∈ {1, 2, . . .
}, and so for the k component of the AML pseudo-score we use the difference approximation L_T(ζ, 3) − L_T(ζ, 2):

∆_β L_T(ζ, 2) = ( ∂L_T(ζ, 2)/∂ζ′ , L_T(ζ, 3) − L_T(ζ, 2) )′.  (25)

(The more computationally convenient constraint k = 1 cannot readily be used, as the parameter b vanishes from the log-likelihood function when k = 1.)

The derivative ∂L_T(ζ, 2)/∂ζ′ can be reliably obtained using numerical differentiation. To implement AML in this example, we consider H i.i.d. samples simulated from the MSM model. From these simulated samples, the AML estimator is obtained by minimizing, in the Euclidean norm, the difference between the average simulated pseudo-score Σ_{h=1}^H ∆_β L_T^(h)(θ, ˆβ_T)/H and ∆_β L_T(ˆβ_T).

Monte Carlo

We first consider data generated from the MSM model with µ = 0 and a relatively small value of k, so that ML is computationally feasible. This allows us to compare AML and ML, and directly assess the efficiency loss of AML relative to ML. To this end, we generate 1,000 synthetic data sets from the MSM model in Section 2.2.4 with T = 5,000 observations, and with parameter values m₀ = 1.·, γ̄ = 0.·, b = 4, σ = 0.01 and k = 4.

Numerical implementation of AML and ML requires optimization over the integer parameter space for k, while optimization for the ζ components can proceed via standard approaches. For both methods, optimization over the ζ components is carried out using a quasi-Newton approach, with finite differences used to estimate the derivatives. For the k component, the likelihood is optimized across a small grid of integer values, while AML considers a much larger grid of values.

The ability of AML to consider large values of k is possible because the computational cost required to evaluate the AML criterion function does not increase with k: it requires O(HT) computations for any value of k.
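A sketch of the pseudo-score construction in (25), together with the piecewise-linear extension of the k component used in the implementation, is given below (the quadratic `loglik` is an illustrative stand-in for L_T(ζ, k), not the MSM likelihood):

```python
import numpy as np

def loglik(zeta, k):
    # Illustrative stand-in for L_T(zeta, k); the true MSM criterion
    # costs O(2**k * T) per evaluation.
    return -0.5 * (zeta - 1.0) ** 2 - 0.1 * (k - 4) ** 2

def pseudo_score(zeta, eps=1e-6):
    # Equation (25): a numerical derivative in zeta at the constrained
    # value k = 2, and a forward difference L(zeta, 3) - L(zeta, 2)
    # for the non-differentiable integer component k.
    d_zeta = (loglik(zeta + eps, 2) - loglik(zeta - eps, 2)) / (2 * eps)
    d_k = loglik(zeta, 3) - loglik(zeta, 2)
    return np.array([d_zeta, d_k])

def loglik_extended(zeta, k_real):
    # Piecewise-linear extension of k to the real line; the final
    # estimate of k is the integer closest to the optimized value.
    k_lo = int(np.floor(k_real))
    w = k_real - k_lo
    return (1 - w) * loglik(zeta, k_lo) + w * loglik(zeta, k_lo + 1)

print(pseudo_score(1.0))          # k-component is positive below the peak
print(loglik_extended(1.0, 2.5))  # interpolates between k = 2 and k = 3
```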
In this Monte Carlo exercise, AML is implemented using H = 100 pseudo-samples, as a large value of H smooths the criterion function and increases the accuracy of numerical differentiation methods.

Figure 3 displays the results of this Monte Carlo experiment. For each sub-figure, the left boxplot contains the ML estimates and the right boxplot the associated AML estimates. The true parameter values are reported as horizontal lines. AML provides estimators that are well-centred over the true values of the structural parameters with, as expected, a larger variance than the ML estimator in some cases.

Table 2 compares the bias (Bias), mean squared error (MSE) and Monte Carlo coverage (COV) of the estimators. In addition, for each replication we calculate the efficiency loss of AML with respect to ML via the average relative standard error, denoted by SE(ML)/SE(AML) in Table 2. Using this measure, numbers below unity indicate that, on average, the ML estimator is more efficient than the AML estimator. The results in Table 2 suggest that the two estimators are comparable in terms of bias and MSE for m₀, γ̄ and b, with ML yielding more accurate estimators for k and σ. Analyzing the efficiency of the two estimators, we see that, according to the SE(ML)/SE(AML) measure, AML is nearly as efficient as ML for m₀, γ̄ and b, but less so for σ and k. The latter is not entirely unexpected, as imposing the invalid restriction k = 2 within the pseudo-score should lead to some efficiency loss (with respect to ML). However, this example also demonstrates that imposing this restriction only leads to a minor loss in accuracy for estimating m₀, γ̄ and b.

Technically, we implement AML by extending the grid of values over which k is optimized to the entire real line. This is done by considering a piecewise linear extension of the pseudo-score for the k component, and by taking the closest integer to the resulting optimized value.
An alternative to the finite differences considered herein would be to use the simulation-based differentiation approach in Frazier et al. (2019).

                    m₀          γ̄           b           σ           k
ML   Bias      -0.0014244   0.0134517   0.1367587   0.0000088  -0.0120000
     MSE        0.0004834   0.0121796   1.0688123   0.0000003   0.0900000
     COV        0.9380000   0.9520000   0.9560000   0.9490000   0.9130000
AML  Bias      -0.0036913   0.0280103   0.0653309   0.0002228  -0.0878051
     MSE        0.0005691   0.0142423   0.9924541   0.0000009   0.1727888
     COV        0.9510000   0.9430000   0.9440000   0.9310000   0.9150000
SE(ML)/SE(AML)  0.9309007   0.9442337   1.0308558   0.5860551   0.7377811

Table 2: Accuracy measures for ML and AML parameter estimates of the MSM model for T = 5,000, m₀ = 1.·, γ̄ = 0.·, b = 5, σ = 0.01 and k = 4. In ML estimation, k only takes values in a small grid of integers.

Figure 3: Each boxplot reports the ML (left boxplots) and AML (right boxplots) parameter estimates for the MSM model with sample size T = 5,000 across the Monte Carlo replications. The true parameter values are m₀ = 1.·, γ̄ = 0.·, b = 5, σ = 0.01 and k = 4, and are reported as horizontal lines.

While ML has an edge in terms of accuracy, due to computational cost ML is infeasible if the true value of k is large. To illustrate this point, we compare the time, in log seconds, required to evaluate the log-likelihood function and the AML criterion function for various values of k, for a sample size of T = 5,000 and H = 100 simulated samples. We repeat the same exercise for the log-likelihood function for k = 6, 7, . . . , 14, with linear extrapolation used for values of k ≥ 15. Figure 4 compares the mean computation times. For small k, evaluation of the likelihood is faster than the AML criterion, given the large number of simulated paths used in the AML criterion. However, when k becomes even moderately large, AML is clearly superior in terms of computational cost. For values of k > 9, AML is particularly attractive in terms of computation time.
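The operation counts behind this comparison can be sketched directly (unit costs are arbitrary; only the scaling matters):

```python
T, H = 5_000, 100

def ml_ops(k):
    # One likelihood evaluation: O(2**k * T) operations.
    return 2 ** k * T

def aml_ops(k):
    # One AML criterion evaluation: O(H * T), independent of k.
    return H * T

# The ratio of ML to AML cost doubles with each extra volatility
# component, so ML is cheaper for small k but quickly becomes infeasible.
for k in (6, 10, 14, 21):
    print(k, ml_ops(k) / aml_ops(k))
```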
At a value of k = 21, a single evaluation of the log-likelihood would require 5459.2 days (approximately 15 years), whereas an evaluation of the AML criterion requires only 1.45 seconds.

Figure 4: Computation times, in log seconds, of the likelihood function (continuous line) and the AML criterion function (dash-dotted line) using H = 100. The averages presented are taken over twenty data sets simulated from the MSM model with T = 5,000, m₀ = 1.·, γ̄ = 0.·, b = 4, σ = 0.01 and k = 6, 7, . . . , 21. The small dotted line indicates the extrapolated computation time for ML estimation for k ≥ 15.

We next consider a design in which ML estimation is infeasible. We choose k = 18, with the other parameter values taken from the empirical example conducted later (see Table 4 in the following subsection). Figure 5 displays the estimation results over 1,000 Monte Carlo replications from the DGP associated with T = 23,202 (as in the empirical dataset in the following subsection), where the parameter values are m₀ = 1.·, γ̄ = 0.·, b = 1.·, σ = 0.· and k = 18. For each sample, we calculate the constrained auxiliary estimator and the AML estimator using H = 100 pseudo-samples. For each sub-figure, the left boxplot contains the constrained auxiliary estimates and the right boxplot the associated AML estimates. The true parameter values are reported with horizontal lines. While the restricted model is easy to estimate, it provides estimators that are significantly biased for all parameters except σ. AML corrects the resulting bias for all structural parameters and delivers estimators that are, on average, centred over the true values. Analyzing the other accuracy measures given in Table 3, we see that AML generally yields estimators with low bias and Monte Carlo coverage close to the nominal level.

Figure 5: Each boxplot reports the auxiliary (left boxplots) and AML (right boxplots) parameter estimates for the MSM model with sample size T = 23,202 across the Monte Carlo replications. The true parameter values are m₀ = 1.·, γ̄ = 0.·, b = 1.·, σ = 0.· and k = 18, and are reported with horizontal lines.
                 m₀         γ̄          b            σ          k
Auxiliary Bias   0.363348  -0.061777   12.002244    0.000943   -
          MSE    0.133257   0.003867   174.209123   0.000014   -
          COV    0.000000   0.000000   0.480000     0.939000   -
AML       Bias  -0.001502   0.012439   0.025719     0.000033   -1.558178
          MSE    0.000303   0.002176   0.022416     0.000009   11.391885
          COV    0.936000   0.955000   0.937000     0.945000   0.897000

Table 3: Accuracy measures for auxiliary and AML parameter estimates of the MSM model with T = 23,202, m₀ = 1.·, γ̄ = 0.·, b = 1.·, σ = 0.· and k = 18.

Application: S&P500 Returns

We now estimate the Binomial MSM model (with µ = 0) on demeaned daily S&P500 (simple) returns between January 3, 1928 and May 15, 2020. The sample size is T = 23,202. Table 4 reports the AML estimates together with ML estimates for k ranging from k = 1 up to k = 10. The estimated value of k obtained by AML is far larger than the feasible values associated with ML. Moreover, except for m₀, the remaining estimated parameters are also significantly different, with the estimated values of γ̄ and b being markedly different across the two approaches. The standard errors for ML are calculated using the asymptotic formula, while those for AML are calculated using a parametric bootstrap based on 1,000 simulated data sets from the assumed DGP.

In order to compare the goodness-of-fit of the eleven models enumerated in Table 4, for each model we provide one-day-ahead forecasts at each in-sample date t = 1, . . . , T using a particle filter of size N = 10^·. For a given model, at each date t, the particle filter provides N simulated values from the approximate distribution of r_t | {r₁, . . . , r_{t−1}}: r_t^(1), . . . , r_t^(N). At each date t = 1, . . . , T, we calculate the α = 1% and α = 5% value-at-risk forecasts, defined by

VaR_{α,t} = −q_α( r_t^(1), . . . , r_t^(N) ),

where q_α(·) denotes the α-th sample quantile, and report the failure rate of VaR_{α,t}:

ˆp_α = (1/T) Σ_{t=1}^T 1{ r_t < −VaR_{α,t} }.

The closer ˆp_α is to α, the better the forecasts. The left panel of Table 6 reports ˆp_α for α = 0.01 and α = 0.
05 for each model specification, along with asymptotic standard errors in parentheses. AML provides the only model specification for which both failure rates are not significantly different from their nominal levels.

(Returns downloaded from finance.yahoo.com on May 15, 2020.)

We also compare the α = 5% expected shortfall forecasts:

ES_{α,t} = Σ_{i=1}^N r_t^(i) 1{ r_t^(i) < −VaR_{α,t} } / Σ_{i=1}^N 1{ r_t^(i) < −VaR_{α,t} }.

To this end, we collect the empirical returns satisfying r_t < −VaR^{(k=10)}_{0.05,t}, under the model with k = 10, and, for each value of k in Table 4, we regress these returns on ES_{α,t}, calculated under the corresponding value of k in Table 4. Regression intercepts, slopes, R² values and p-values of the Wald test associated with the joint hypothesis (intercept, slope)′ = (0, 1)′ are reported in the right panel of Table 6. The k = 18 specification provides the best expected shortfall forecasts, as measured by the magnitude of the corresponding p-values.

Figure 6: Daily S&P500 returns between January 3, 1928 and May 15, 2020.

Table 4: The table reports the ML estimates (ML) and AML estimates (AML) for the demeaned empirical S&P500 returns (left panel). Asymptotic standard errors for the ML estimates are reported in parentheses below each value. The AML standard errors are obtained using a parametric bootstrap based on 1,000 simulated samples (of length T = 23,202).
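The VaR and expected-shortfall forecasts above can be computed from the particle draws along the following lines (the particle matrix and returns here are Gaussian placeholders, purely for illustration; this is not the paper's particle filter):

```python
import numpy as np

rng = np.random.default_rng(2)

T, N, alpha = 1_000, 10_000, 0.05
# particles[t] holds N draws approximating the predictive
# distribution of r_t given r_1, ..., r_{t-1}.
particles = 0.01 * rng.standard_normal((T, N))
returns = 0.01 * rng.standard_normal(T)  # observed returns (placeholder)

# VaR_{alpha,t} = -q_alpha(particles[t]); the failure rate is the
# fraction of dates on which the realized return falls below -VaR.
var_t = -np.quantile(particles, alpha, axis=1)
p_hat = (returns < -var_t).mean()

# Expected shortfall: average of the particles lying below -VaR_{alpha,t}.
es_t = np.array([p[p < -v].mean() for p, v in zip(particles, var_t)])
print(p_hat)  # near alpha when the forecasting model is well specified
```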
Table 6: VaR failure rates (left panel) and ES₀.₀₅ regressions (right panel) for the models in Table 4, computed using a particle filter of N = 10^· particles. In the left panel, failure rates of the 1% and 5% value-at-risk are reported with asymptotic standard errors in parentheses. In the right panel, for each k, the empirical returns satisfying {r_t | r_t < −VaR^{(k=10)}_{0.05,t}} are regressed on {ES^{(k)}_{0.05,t} | r_t < −VaR^{(k=10)}_{0.05,t}}, where VaR^{(k)}_{0.05,t} corresponds to the 5% value-at-risk at date t forecasted with k, and ES^{(k)}_{0.05,t} corresponds to the 5% expected shortfall at date t forecasted with k. For each regression, the intercepts and slopes are reported with standard errors in parentheses, along with the R² values and the p-values of the Wald test H₀: (intercept, slope) = (0, 1).

In this paper, we provide an alternative to indirect inference (hereafter, I-I) estimation that simultaneously allows us to circumvent the intractability of maximum likelihood estimation (as with standard I-I), but which, in contrast to naive I-I, respects the goal of obtaining asymptotically efficient inference in the context of a fully parametric model. Although close in spirit to I-I, the approximate maximum likelihood (hereafter, AML) method developed in this paper does not belong to the realm of I-I for two reasons. First, the asymptotic distribution of the AML estimator depends only on the probability limit of the estimated auxiliary parameters, and not on their asymptotic distribution. Second, while the AML estimator is obtained by matching two sample moments, one computed on observed data and one computed on simulated data, both sample moments depend on the observed data through the value of the preliminary estimator of the auxiliary parameters. Interestingly, the sampling uncertainty carried by this preliminary estimator has no impact on the asymptotic distribution of the AML estimator, because it is erased through the matching procedure.

The message of our paper is threefold.
First, we demonstrate that the idea of matching proxies of the score of the structural model is productive for reaching near-efficient inference on the structural parameters. We show theoretically that, at least for exponential models or transformations of them, the efficiency loss should be manageable, since it is mainly due to the effect of a misspecification bias created by our simplification of the structural model.

Second, there are many non-linear time series models, popular in financial econometrics and dynamic/nonlinear microeconometrics, for which a natural simplification of the structural model yields a convenient proxy for the score of the structural model. Since the misspecification bias created by this simplification is only due to imposing some possibly false equality constraints, or to numerical approximations for certain elements of the gradient vector, one may reasonably hope that the resulting efficiency loss is minimal. While our general results (and theoretical examples) suggest that this finding is valid in many settings, including dynamic discrete choice and stochastic volatility models, we provide numerical evidence in three specific examples: the generalized Tobit model, Markov-switching multifractal models and stable distributions. The numerical results largely confirm our intuitions.
Our method can alleviate the computational cost of maximum likelihood associated with complex models, at the cost of a limited loss in efficiency. Moreover, we confirm that, even in finite samples, the Wald confidence intervals associated with AML estimators display excellent coverage since, thanks to matching the misspecification bias, the preliminary estimators have no impact on the central tendency of the AML estimator.

A third and even more general message is that the matching principle put forward by I-I estimation can be extended to situations where the two empirical moments to match, one based on observed data and one based on simulated data, may both depend on the observed data through a convenient summary of them. While we have used this idea to aim for (nearly) efficient inference, Gospodinov et al. (2017) employ a similar approach to hedge against misspecification bias due to the use of a misspecified simulator. Even though they have not derived the asymptotic distribution theory in their case, the two methods are essentially similar and could be nested within a general asymptotic theory where both the moments to match and the simulator depend on observed data.

References

[1] Amemiya, Takeshi. Advanced Econometrics. Harvard University Press, 1985.
[2] Bansal, Ravi, and Amir Yaron. "Risks for the long run: A potential resolution of asset pricing puzzles." The Journal of Finance 59, no. 4 (2004): 1481-1509.
[3] Behrens, S., and Melissinos, A.C. Univ. of Rochester Preprint UR-776 (1981).
[4] Calvet, Laurent E., and Veronika Czellar. "Through the looking glass: Indirect inference via simple equilibria." Journal of Econometrics 185, no. 2 (2015): 343-358.
[5] Calvet, Laurent, and Adlai Fisher. "Forecasting multifractal volatility." Journal of Econometrics 105, no. 1 (2001): 27-58.
[6] Calvet, Laurent E., and Adlai J. Fisher. "How to forecast long-run volatility: Regime switching and the estimation of multifractal processes." Journal of Financial Econometrics 2, no.
1 (2004): 49-83.
[7] Calvet, Laurent E., and Adlai Fisher. Multifractal Volatility: Theory, Forecasting, and Pricing. Academic Press (2008).
[8] Calzolari, Giorgio, Gabriele Fiorentini, and Enrique Sentana. "Constrained indirect estimation." The Review of Economic Studies 71, no. 4 (2004): 945-973.
[9] Chambers, John M., Colin L. Mallows, and B. W. Stuck. "A method for simulating stable random variables." Journal of the American Statistical Association 71, no. 354 (1976): 340-344.
[10] Chen, Fei, Francis X. Diebold, and Frank Schorfheide. "A Markov-switching multifractal inter-trade duration model, with application to US equities." Journal of Econometrics 177, no. 2 (2013): 320-342.
[11] Dridi, Ramdan, Alain Guay, and Eric Renault. "Indirect inference and calibration of dynamic stochastic general equilibrium models." Journal of Econometrics 136, no. 2 (2007): 397-430.
[12] Dudley, Leonard, and Claude Montmarquette. "A model of the supply of bilateral foreign aid." The American Economic Review 66, no. 1 (1976): 132-142.
[13] Franses, Philip Hans, Marco Van Der Leij, and Richard Paap. "A simple test for GARCH against a stochastic volatility model." Journal of Financial Econometrics 6, no. 3 (2008): 291-306.
[14] Frazier, David T., and Eric Renault. "Indirect inference with(out) constraints." Quantitative Economics 11 (2020): 113-159.
[15] Frazier, David T., Tatsushi Oka, and Dan Zhu. "Indirect inference with a non-smooth criterion function." Journal of Econometrics 212, no. 2 (2019): 623-645.
[16] Gallant, A. Ronald, and George Tauchen. "Which moments to match." Econometric Theory 12 (1996): 657-681.
[17] Gospodinov, Nikolay, Ivana Komunjer, and Serena Ng. "Simulated minimum distance estimation of dynamic models with errors-in-variables." Journal of Econometrics 200, no. 2 (2017): 181-193.
[18] Gourieroux, Christian, Alain Monfort, and Alain Trognon.
"Pseudo maximum likelihood methods: Theory." Econometrica: Journal of the Econometric Society (1984): 681-700.
[19] Gourieroux, Christian, Alain Monfort, and Alain Trognon. "A general approach to serial correlation." Econometric Theory 1, no. 3 (1985): 315-340.
[20] Gourieroux, Christian, and Alain Monfort. Simulation-Based Econometric Methods. OUP (1996).
[21] Gourieroux, Christian, Alain Monfort, and Eric Renault. "Indirect inference." Journal of Applied Econometrics 8 (1993): S85-S118.
[22] Gourieroux, Christian, Alain Monfort, Eric Renault, and Alain Trognon. "Generalised residuals." Journal of Econometrics 34, no. 1 (1987): 5-32.
[23] Hansen, Lars Peter. "Large sample properties of generalized method of moments estimators." Econometrica: Journal of the Econometric Society (1982): 1029-1054.
[24] Koutrouvelis, Ioannis A. "An iterative procedure for the estimation of the parameters of stable laws." Communications in Statistics - Simulation and Computation 10, no. 1 (1981): 17-28.
[25] Louis, Thomas A. "Finding the observed information matrix when using the EM algorithm." Journal of the Royal Statistical Society: Series B (Methodological) 44, no. 2 (1982): 226-233.
[26] McCulloch, J. Huston. "Simple consistent estimators of stable distribution parameters." Communications in Statistics - Simulation and Computation 15, no. 4 (1986): 1109-1136.
[27] Meddahi, Nour, and Eric Renault. "Temporal aggregation of volatility models." Journal of Econometrics 119, no. 2 (2004): 355-379.
[28] Pinkse, Joris, and Margaret E. Slade. "Contracting in space: An application of spatial statistics to discrete-choice models." Journal of Econometrics 85, no. 1 (1998): 125-154.
[29] Poirier, Dale J., and Paul A. Ruud. "Probit with dependent observations." The Review of Economic Studies 55, no. 4 (1988): 593-614.
[30] Robinson, Peter M.
"On the asymptotic properties of estimators of models containing limited dependent variables." Econometrica (1982): 27-41.

[31] Smith, Anthony A. "Estimating nonlinear time series models using simulated vector autoregressions." Journal of Applied Econometrics 8, no. S1 (1993): S63-S84.

[32] van der Vaart, Aad W. Asymptotic Statistics. Vol. 3. Cambridge University Press, 1998.

A Proofs of Main Results

A.1 Proof of Proposition 1

With standard abuse of notation, a Taylor expansion gives:
\[
\sqrt{T}\Delta_{\beta}L_{T}\big(\hat{\beta}_{T}\big)=\sqrt{T}\Delta_{\beta}L_{T}\big(\beta^{0}\big)-K\big[\tilde{\beta}_{T}\big]\sqrt{T}\big[\hat{\beta}_{T}-\beta^{0}\big],
\]
\[
\frac{1}{H}\sum_{h=1}^{H}\sqrt{T}\Delta_{\beta}L_{T}^{(h)}\big(\theta,\hat{\beta}_{T}\big)=\frac{1}{H}\sum_{h=1}^{H}\sqrt{T}\Delta_{\beta}L_{T}^{(h)}\big(\theta,\beta^{0}\big)-\Big\{\frac{1}{H}\sum_{h=1}^{H}K\big(\tilde{\beta}_{T}^{(h)}(\theta)\big)\Big\}\sqrt{T}\big[\hat{\beta}_{T}-\beta^{0}\big],
\]
where \tilde{\beta}_{T} and \tilde{\beta}_{T}^{(h)}(\theta), h=1,...,H, all lie in the interval [\beta^{0},\hat{\beta}_{T}]. Hence:
\[
\sqrt{T}\Delta_{\beta}L_{T}\big(\beta^{0}\big)-\frac{1}{H}\sum_{h=1}^{H}\sqrt{T}\Delta_{\beta}L_{T}^{(h)}\big(\theta,\beta^{0}\big)=\Big\{K\big[\tilde{\beta}_{T}\big]-\frac{1}{H}\sum_{h=1}^{H}K\big(\tilde{\beta}_{T}^{(h)}(\theta)\big)\Big\}\sqrt{T}\big[\hat{\beta}_{T}-\beta^{0}\big].
\]
Thanks to Assumptions A1 and A2, and the fact that \sqrt{T}[\hat{\beta}_{T}-\beta^{0}]=O_{P}(1), this implies that our AML estimator is such that:
\[
\sqrt{T}\Delta_{\beta}L_{T}\big(\beta^{0}\big)-\frac{1}{H}\sum_{h=1}^{H}\sqrt{T}\Delta_{\beta}L_{T}^{(h)}\big(\hat{\theta}_{T},\beta^{0}\big)=o_{P}(1).
\]
Under Assumption A3, an additional Taylor expansion gives:
\[
\sqrt{T}\Delta_{\beta}L_{T}\big(\beta^{0}\big)-\frac{1}{H}\sum_{h=1}^{H}\sqrt{T}\Delta_{\beta}L_{T}^{(h)}\big(\theta^{0},\beta^{0}\big)+o_{P}(1)=-\Big(\frac{1}{H}\sum_{h=1}^{H}J\big(\tilde{\theta}_{T}^{(h)},\beta^{0}\big)\Big)\sqrt{T}\big(\hat{\theta}_{T}-\theta^{0}\big),
\]
where \tilde{\theta}_{T}^{(h)}, h=1,...,H, all lie in the interval [\theta^{0},\hat{\theta}_{T}].
Hence:
\[
\sqrt{T}\big(\hat{\theta}_{T}-\theta^{0}\big)=-\big[J\big(\theta^{0},\beta^{0}\big)\big]^{-1}\Big\{\sqrt{T}\Delta_{\beta}L_{T}\big(\beta^{0}\big)-\frac{1}{H}\sum_{h=1}^{H}\sqrt{T}\Delta_{\beta}L_{T}^{(h)}\big(\theta^{0},\beta^{0}\big)\Big\}+o_{P}(1).
\]
We know from Gourieroux, Monfort and Renault (1993) (see their Proposition 3 and its proof) that:
\[
\Big\{\sqrt{T}\Delta_{\beta}L_{T}\big(\beta^{0}\big)-\frac{1}{H}\sum_{h=1}^{H}\sqrt{T}\Delta_{\beta}L_{T}^{(h)}\big(\theta^{0},\beta^{0}\big)\Big\}\overset{d}{\to}\mathcal{N}\Big(0,\Big(1+\frac{1}{H}\Big)I\big(\theta^{0},\beta^{0}\big)\Big),
\]
which completes the proof of Proposition 1. \(\square\)

A.2 Proof of Proposition 5

By virtue of Proposition 4, we only need to prove that the asymptotic variance \Omega(H) of the UAML estimator \breve{\theta}_{T,H}(\theta^{0}) coincides with the Cramer-Rao efficiency bound when H\to\infty. When H\to\infty, this estimator, denoted \breve{\theta}_{T}, can be seen as the solution in \theta of the system of equations:
\[
\Delta_{\beta}L_{T}\big(\theta^{0}\big)=E_{\theta}\big[\Delta_{\beta}L_{T}\big(\theta^{0}\big)\,\big|\,\{x_{t}\}_{t=1}^{T}\big].
\]
If we define m_{T}(\beta,\theta)=\Delta_{\beta}L_{T}(\beta)-E_{\theta}[\Delta_{\beta}L_{T}(\beta)\,|\,\{x_{t}\}_{t=1}^{T}], we have, by definition,
\[
0=\sqrt{T}m_{T}\big(\theta^{0},\breve{\theta}_{T}\big)=\sqrt{T}m_{T}\big(\theta^{0},\theta^{0}\big)+\frac{\partial m_{T}(\theta^{0},\theta^{0})}{\partial\theta'}\sqrt{T}\big(\breve{\theta}_{T}-\theta^{0}\big)+o_{P}(1).
\]
Recall the definition of \Delta_{\beta}L_{T}(\theta^{0}); writing l_{t}(\theta)=l\{y_{t}\,|\,\{y_{\tau}\}_{\tau=1}^{t-1},x_{t},z,\theta\} and S_{t}(\theta)=\partial\log l_{t}(\theta)/\partial\theta for the conditional density and score of observation t, we have:
\[
\Delta_{\beta}L_{T}\big(\theta^{0}\big)=\frac{1}{T}\sum_{t=1}^{T}\frac{\partial\log l_{t}(\theta^{0})}{\partial\theta}=\frac{1}{T}\sum_{t=1}^{T}S_{t}(\theta^{0}),\qquad(26)
\]
and note that, by virtue of (26),
\[
\sqrt{T}m_{T}\big(\theta^{0},\theta^{0}\big)=\sqrt{T}\Delta_{\beta}L_{T}\big(\theta^{0}\big)=\frac{1}{\sqrt{T}}\sum_{t=1}^{T}S_{t}(\theta^{0})
\]
converges in distribution to a \mathcal{N}(0,I^{0}) random variable, where I^{0}=I(\theta^{0},\theta^{0}) is the Fisher information matrix. Moreover,
\[
\operatorname*{plim}_{T\to\infty}\frac{\partial m_{T}(\theta^{0},\theta^{0})}{\partial\theta'}=-\operatorname*{plim}_{T\to\infty}\frac{1}{T}\sum_{t=1}^{T}\frac{\partial}{\partial\theta'}E_{\theta}\big\{S_{t}(\theta^{0})\big\}\Big|_{\theta=\theta^{0}},
\]
and we have
\[
\frac{\partial}{\partial\theta'}E_{\theta}\big\{S_{t}(\theta^{0})\big\}=\frac{\partial}{\partial\theta'}\int S_{t}(\theta^{0})\,l_{t}(\theta)\,d\nu\big(y_{t}\,\big|\,\{y_{\tau}\}_{\tau=1}^{t-1},x_{t}\big)=\int S_{t}(\theta^{0})\,S_{t}'(\theta)\,l_{t}(\theta)\,d\nu\big(y_{t}\,\big|\,\{y_{\tau}\}_{\tau=1}^{t-1},x_{t}\big),
\]
where \nu denotes some dominating measure. Therefore,
\[
\operatorname*{plim}_{T\to\infty}\frac{\partial m_{T}(\theta^{0},\theta^{0})}{\partial\theta'}=-E\big[S_{t}(\theta^{0})\,S_{t}'(\theta^{0})\,\big|\,\{x_{\tau}\}_{\tau=1}^{t}\big]=-I^{0},
\]
the negative of the Fisher information matrix. Consequently,
\[
\sqrt{T}\big(\breve{\theta}_{T}-\theta^{0}\big)=\big(I^{0}\big)^{-1}\sqrt{T}m_{T}\big(\theta^{0},\theta^{0}\big)+o_{P}(1)\overset{d}{\to}\mathcal{N}\big(0,\big(I^{0}\big)^{-1}\big).
\]
B GARCH-like Stochastic Volatility Models: Pseudo-Score

In this section, we give the necessary details required to obtain Result 1 in Section 2.3. To this end, we first compute the latent score, and then use it to interpret the observable score in terms of generalized residuals. We first decompose the latent log-likelihood as follows:
\[
L_{T}^{*}(\zeta,0)=L_{1,T}^{*}(\mu,\omega,\alpha)+L_{2,T}^{*}(\varpi),
\]
\[
L_{1,T}^{*}(\mu,\omega,\alpha)=\frac{1}{T}\sum_{t=1}^{T}\Big\{-\frac{1}{2}\big[\log(2\pi)+\log\big(\omega+\alpha\varepsilon_{t}^{2}+\eta_{t}\big)\big]\Big\}-\frac{1}{2T}\sum_{t=1}^{T}\frac{(r_{t+1}-\mu)^{2}}{\omega+\alpha\varepsilon_{t}^{2}+\eta_{t}},
\]
\[
L_{2,T}^{*}(\varpi)=-\log(\varpi)+\frac{1}{T}\sum_{t=1}^{T}\log f_{\chi}\Big(\frac{\eta_{t}}{\varpi}\Big).
\]
Computations very similar to the case of the Gaussian QMLE of ARCH models give, with \sigma_{t}^{2}=\omega+\alpha\varepsilon_{t}^{2}+\eta_{t}:
\[
\frac{\partial L_{T}^{*}(\zeta,0)}{\partial\mu}=\frac{1}{T}\sum_{t=1}^{T}\frac{r_{t+1}-\mu}{\sigma_{t}^{2}},
\]
\[
\frac{\partial L_{T}^{*}(\zeta,0)}{\partial\omega}=-\frac{1}{2T}\sum_{t=1}^{T}\frac{1}{\sigma_{t}^{2}}+\frac{1}{2T}\sum_{t=1}^{T}\frac{(r_{t+1}-\mu)^{2}}{\sigma_{t}^{4}},
\]
\[
\frac{\partial L_{T}^{*}(\zeta,0)}{\partial\alpha}=-\frac{1}{2T}\sum_{t=1}^{T}\frac{\varepsilon_{t}^{2}}{\sigma_{t}^{2}}+\frac{1}{2T}\sum_{t=1}^{T}\frac{(r_{t+1}-\mu)^{2}}{\sigma_{t}^{4}}\varepsilon_{t}^{2},
\]
while
\[
\frac{\partial L_{T}^{*}(\zeta,0)}{\partial\varpi}=-\frac{1}{\varpi}-\frac{1}{T\varpi^{2}}\sum_{t=1}^{T}\frac{f_{\chi}'\big(\eta_{t}/\varpi\big)}{f_{\chi}\big(\eta_{t}/\varpi\big)}\eta_{t},
\]
where f_{\chi}' is the derivative of the probability density function f_{\chi}. Note that, for the sake of non-negativity of the variance, we expect the probability distribution of \chi_{t} to have a lower-bounded support, like for instance a demeaned log-normal distribution. However, it is reasonable to treat \chi_{t} as a Gaussian variable if the correction term is small enough that a Gaussian approximation is accurate. We would then get a proxy of the latent score by:
\[
\frac{\partial\tilde{L}_{T}^{*}(\zeta,0)}{\partial\varpi}=-\frac{1}{\varpi}+\frac{1}{\varpi^{3}T}\sum_{t=1}^{T}\eta_{t}^{2}=-\frac{1}{\varpi}+\frac{1}{\varpi^{3}T}\sum_{t=1}^{T}\big[\sigma_{t}^{2}-\omega-\alpha\varepsilon_{t}^{2}\big]^{2}.
\]
The message from (16) is that we go from the latent score vector to the observable one by replacing all functions of the latent volatility by their optimal filters.
Let us define these filters:
\[
\big[\sigma_{t}^{2}\big]_{F,t}=E\big[\sigma_{t}^{2}\,\big|\,r_{\tau},\tau\le t\big],\qquad(27)
\]
\[
\Big[\frac{1}{\sigma_{t}^{2}}\Big]_{F,t}=E\Big[\frac{1}{\sigma_{t}^{2}}\,\Big|\,r_{\tau},\tau\le t\Big],\qquad\Big[\frac{1}{\sigma_{t}^{4}}\Big]_{F,t}=E\Big[\frac{1}{\sigma_{t}^{4}}\,\Big|\,r_{\tau},\tau\le t\Big].
\]
The pseudo-score components are then:
\[
\frac{\partial\tilde{L}_{T}(\zeta,0)}{\partial\mu}=\frac{1}{T}\sum_{t=1}^{T}\Big[\frac{1}{\sigma_{t}^{2}}\Big]_{F,t}(r_{t+1}-\mu),
\]
\[
\frac{\partial\tilde{L}_{T}(\zeta,0)}{\partial\omega}=-\frac{1}{2T}\sum_{t=1}^{T}\Big[\frac{1}{\sigma_{t}^{2}}\Big]_{F,t}+\frac{1}{2T}\sum_{t=1}^{T}\Big[\frac{1}{\sigma_{t}^{4}}\Big]_{F,t}(r_{t+1}-\mu)^{2},
\]
\[
\frac{\partial\tilde{L}_{T}(\zeta,0)}{\partial\alpha}=-\frac{1}{2T}\sum_{t=1}^{T}\Big[\frac{1}{\sigma_{t}^{2}}\Big]_{F,t}\varepsilon_{t}^{2}+\frac{1}{2T}\sum_{t=1}^{T}\Big[\frac{1}{\sigma_{t}^{4}}\Big]_{F,t}(r_{t+1}-\mu)^{2}\varepsilon_{t}^{2},
\]
\[
\frac{\partial\tilde{L}_{T}(\zeta,0)}{\partial\varpi}=-\frac{1}{\varpi}+\frac{1}{\varpi^{3}T}\sum_{t=1}^{T}\Big(\big[\sigma_{t}^{2}\big]_{F,t}-\omega-\alpha\varepsilon_{t}^{2}\Big)^{2}.
\]
We recall that we denote these pseudo-score components with the notation \tilde{L} to stress that they are only approximations: they have been computed with the filtering formulas (27), which are themselves only approximations since they proceed as if \rho=0. The filtered values (27) allow us to compute "generalized residuals" similar to the ones computed in the dynamic Probit example. However, by contrast with that example, we do not in general have closed-form formulas for these filters. Any filtering strategy may be worth applying in this context. At the least, a very simple one is to use the ARCH(1) approximation as a convenient filter, meaning that in all filtering formulas we replace the latent quantity \sigma_{t}^{2} by the observed one \hat{\sigma}_{t}^{2} (thereby erasing the conditional expectation operator) that comes from fitting an ARCH(1) model to our data set \{r_{t+1}\}_{t=1}^{T}.

We now address the computation of the partial derivative \partial\tilde{L}_{T}(\zeta,0)/\partial\rho of the observed log-likelihood with respect to the parameter \rho. Using the definition of the latent likelihood, see Section 2.2.2, we can write:
\[
L_{T}(\theta)=\frac{1}{T}\log\Bigg(\int\cdots\int G_{T}(\mu,\omega,\alpha)\prod_{t=1}^{T}\frac{1}{\varpi}f_{\chi}\Big(\frac{\eta_{t}-\rho\eta_{t-1}}{\varpi}\Big)\,d\eta_{1}\ldots d\eta_{T}\Bigg),
\]
where
\[
G_{T}(\mu,\omega,\alpha)=\prod_{t=1}^{T}\frac{1}{\sqrt{2\pi}\,\big[\omega+\alpha\varepsilon_{t}^{2}+\eta_{t}\big]^{1/2}}\exp\Big(-\frac{(r_{t+1}-\mu)^{2}}{2\big[\omega+\alpha\varepsilon_{t}^{2}+\eta_{t}\big]}\Big).
\]
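As an aside, the ARCH(1) filtering shortcut described above lends itself to a compact implementation. The fragment below is purely illustrative (the function names and the OLS-based ARCH(1) fit are our own simplifications, not the authors' code): it builds a crude variance proxy \hat{\sigma}_{t}^{2} by regressing squared demeaned returns on their own lag, and plugs it into, e.g., the \mu-component of the pseudo-score.

```python
import numpy as np

def arch1_filter_proxy(r):
    """Crude ARCH(1) filter: regress squared demeaned returns on their own lag
    by OLS, then return sigma2_hat[t] = w_hat + a_hat * eps_t^2 as a stand-in
    for the latent sigma_t^2 in the filtering formulas."""
    eps2 = (r - r.mean()) ** 2
    X = np.column_stack([np.ones(len(eps2) - 1), eps2[:-1]])
    (w_hat, a_hat), *_ = np.linalg.lstsq(X, eps2[1:], rcond=None)
    return np.clip(w_hat + a_hat * eps2, 1e-8, None)  # enforce positive variances

def pseudo_score_mu(r, mu, sigma2_hat):
    # mu-component of the pseudo-score, with [1/sigma_t^2]_{F,t} replaced
    # by the observable proxy 1/sigma2_hat[t]
    return np.mean((r[1:] - mu) / sigma2_hat[:-1])
```

The same substitution applies verbatim to the \omega-, \alpha- and \varpi-components.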
Then,
\[
\frac{\partial L_{T}(\theta)}{\partial\rho}=\big[T\,l_{T}(\theta)\big]^{-1}\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty}G_{T}(\mu,\omega,\alpha)\frac{1}{\varpi^{T}}\frac{\partial}{\partial\rho}\prod_{t=1}^{T}f_{\chi}\Big(\frac{\eta_{t}-\rho\eta_{t-1}}{\varpi}\Big)\,d\eta_{1}\ldots d\eta_{T},
\]
\[
l_{T}(\theta)=\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty}G_{T}(\mu,\omega,\alpha)\frac{1}{\varpi^{T}}\prod_{t=1}^{T}f_{\chi}\Big(\frac{\eta_{t}-\rho\eta_{t-1}}{\varpi}\Big)\,d\eta_{1}\ldots d\eta_{T}.
\]
With an innovation process \chi_{t} that is standard Gaussian, this leads (by computing the derivative of the product as a sum of products with one term differentiated in each) to:
\[
l_{T}(\zeta,0)\frac{\partial L_{T}(\zeta,0)}{\partial\rho}\qquad(28)
\]
\[
=\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty}G_{T}(\mu,\omega,\alpha)\Bigg[\prod_{t=1}^{T}\frac{1}{\varpi}f_{\chi}\Big(\frac{\eta_{t}}{\varpi}\Big)\Bigg]\frac{\gamma_{\eta,T}}{\varpi^{2}}\,d\eta_{1}\ldots d\eta_{T}=\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty}l^{*}\big[\{r_{t+1},\eta_{t}\}_{t=1}^{T}\,\big|\,(\zeta,0)\big]\frac{\gamma_{\eta,T}}{\varpi^{2}}\,d\eta_{1}\ldots d\eta_{T},
\]
where \gamma_{\eta,T} is the sample autocovariance of order 1 of the latent process:
\[
\gamma_{\eta,T}=\frac{1}{T}\sum_{t=1}^{T}\eta_{t}\eta_{t-1}.
\]
We note that
\[
l_{T}(\zeta,0)=\int_{-\infty}^{+\infty}\cdots\int_{-\infty}^{+\infty}l^{*}\big[\{r_{t+1},\eta_{t}\}_{t=1}^{T}\,\big|\,(\zeta,0)\big]\,d\eta_{1}\ldots d\eta_{T}=l^{*}\big[\{r_{t+1}\}_{t=1}^{T}\,\big|\,(\zeta,0)\big],
\]
so that
\[
\frac{\partial\tilde{L}_{T}(\zeta,0)}{\partial\rho}=\frac{1}{\varpi^{2}}E\Big[\gamma_{\eta,T}\,\Big|\,\{r_{t+1}\}_{t=1}^{T}\Big].\qquad(29)
\]
Again, the computation of the observed score component is germane to the computation of generalized residuals. However, it is worth noting that (29) is a smoothing formula instead of a filtering formula. The pseudo-score \partial\tilde{L}_{T}(\zeta,0)/\partial\rho can then be based on the approximation:
\[
\frac{1}{\varpi^{2}T}\sum_{t=2}^{T}\Big(\big[\sigma_{t}^{2}\big]_{F,t}-\omega-\alpha\varepsilon_{t}^{2}\Big)\Big(\big[\sigma_{t-1}^{2}\big]_{F,t-1}-\omega-\alpha\varepsilon_{t-1}^{2}\Big).
\]

C Details for Examples in Section 3

In this section, we give the details required to obtain Result 2 in Section 3. In addition, we also extend this example to consider latent exponential models.

C.1 Example: Exponential Models

For the sake of exposition, we assume that, conditionally on \{x_{t}\}_{t=1}^{T}, the variables y_{t}, t=1,...,T, are independent and the conditional distribution of y_{t} only depends on the exogenous variable x_{t} with the same index.
This distribution has a density l\{y_{t}\,|\,x_{t};\theta\} that is assumed to be exponential:
\[
l\{y_{t}\,|\,x_{t};\theta\}=\exp\big[c(x_{t},\theta)+h(y_{t},x_{t})+a'(x_{t},\theta)T(y_{t})\big],
\]
where c(\cdot,\cdot) and h(\cdot,\cdot) are given numerical functions, and a(x_{t},\theta) and T(y_{t}) are r-dimensional vectors. Note that the extension to dynamic models, in which the conditioning variables would also include some lagged values of the process y_{t}, would be easy to devise. From:
\[
\frac{\partial\log[l\{y_{t}\,|\,x_{t};\theta\}]}{\partial\theta}=\frac{\partial c(x_{t},\theta)}{\partial\theta}+\frac{\partial a'(x_{t},\theta)}{\partial\theta}T(y_{t}),
\]
we deduce, since the conditional score vector has by definition a zero conditional expectation, that:
\[
\frac{\partial L_{T}(\theta)}{\partial\theta}=\frac{1}{T}\sum_{t=1}^{T}\frac{\partial a'(x_{t},\theta)}{\partial\theta}\big\{T(y_{t})-E_{\theta}[T(y_{t})\,|\,x_{t}]\big\}.
\]
Following Theorem 1 in Gourieroux et al. (1987),
\[
E_{\theta}[T(y_{t})\,|\,x_{t}]=m(x_{t},\theta),\quad Var_{\theta}[T(y_{t})\,|\,x_{t}]=\Omega(x_{t},\theta)\;\Longrightarrow\;\frac{\partial a'(x_{t},\theta)}{\partial\theta}=\frac{\partial m'(x_{t},\theta)}{\partial\theta}\Omega^{-1}(x_{t},\theta).
\]
Therefore, the maximum likelihood estimator \hat{\theta}_{T} is defined as the solution of:
\[
\frac{\partial L_{T}(\theta)}{\partial\theta}=\frac{1}{T}\sum_{t=1}^{T}\frac{\partial m'(x_{t},\theta)}{\partial\theta}\Omega^{-1}(x_{t},\theta)\big\{T(y_{t})-m(x_{t},\theta)\big\}=0.\qquad(30)
\]
We actually generalize the remark of van der Vaart (1998), Section 4.2, noting that "the maximum likelihood estimators are moment estimators" based on the (conditional) expectation of the sufficient statistic T(y). The first-order conditions (30) show that maximum likelihood is the GMM estimator with optimal instruments for the conditional moment restrictions:
\[
E_{\theta}\big[T(y_{t})-m(x_{t},\theta)\,\big|\,x_{t}\big]=0.
\]
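To see (30) in action, consider a Poisson regression, a textbook exponential-family case with T(y)=y, m(x,\theta)=\exp(x\theta) and \Omega(x,\theta)=\exp(x\theta). The sketch below (our own illustrative code, not part of the paper) checks numerically that the direct log-likelihood score coincides with the optimal-instrument moment condition in (30):

```python
import numpy as np

rng = np.random.default_rng(1)
T = 2000
x = rng.normal(size=T)
theta0 = 0.3
y = rng.poisson(np.exp(x * theta0))

def score_direct(theta):
    # Derivative of the average Poisson log-likelihood
    # (1/T) sum_t [y_t x_t theta - exp(x_t theta) - log y_t!];
    # the log y! term is theta-free and drops out.
    return np.mean(x * y - x * np.exp(x * theta))

def score_gmm(theta):
    # Right-hand side of (30): dm/dtheta * Omega^{-1} * (T(y_t) - m(x_t, theta)),
    # with T(y) = y, m(x, theta) = exp(x theta) and Omega(x, theta) = exp(x theta).
    m = np.exp(x * theta)
    dm = x * m
    return np.mean(dm / m * (y - m))
```

Since (dm/d\theta)\,\Omega^{-1}=x, the optimal instrument is the regressor itself, and both functions return the same value for every \theta.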
Note that we implicitly maintain the assumptions for standard asymptotic theory of efficient GMM (Hansen, 1982): for all \theta\in\Theta, the conditional variance \Omega(x_{t},\theta) of the moment conditions is non-singular and the Jacobian matrix E[\partial m'(x_{t},\theta)/\partial\theta\,|\,x_{t}] has full row rank. The identification condition for consistency of maximum likelihood is then that:
\[
E\Big\{\frac{\partial m'(x_{t},\theta)}{\partial\theta}\Omega^{-1}(x_{t},\theta)\big\{T(y_{t})-m(x_{t},\theta)\big\}\Big\}=0\;\Longrightarrow\;\theta=\theta^{0}.
\]
In terms of GMM, this means that the optimal instruments are assumed to identify the true unknown value \theta^{0} of the parameter vector \theta, by contrast with the cases put forward by Dominguez and Lobato (2004). By the Law of Iterated Expectations, this can be rewritten:
\[
E\Big\{\frac{\partial m'(x_{t},\theta)}{\partial\theta}\Omega^{-1}(x_{t},\theta)\big\{m(x_{t},\theta^{0})-m(x_{t},\theta)\big\}\Big\}=0\;\Longrightarrow\;\theta=\theta^{0},
\]
or equivalently (by symmetry):
\[
E\Big\{\frac{\partial m'(x_{t},\theta)}{\partial\theta}\Omega^{-1}(x_{t},\theta^{0})\big\{m(x_{t},\theta)-m(x_{t},\theta^{0})\big\}\Big\}=0\;\Longrightarrow\;\theta=\theta^{0}.\qquad(31)
\]
By extension of (30), we have:
\[
\Delta_{\beta}L_{T}^{(h)}(\theta,\beta)=\frac{1}{T}\sum_{t=1}^{T}\frac{\partial m'(x_{t},\beta)}{\partial\theta}\Omega^{-1}(x_{t},\beta)\Big\{T\big[\tilde{y}_{t}^{(h)}(\theta)\big]-m(x_{t},\beta)\Big\},\qquad(32)
\]
so that:
\[
M(\theta,\beta)=E\Big\{\frac{\partial m'(x_{t},\beta)}{\partial\theta}\Omega^{-1}(x_{t},\beta)\Big\{T\big[\tilde{y}_{t}^{(h)}(\theta)\big]-m(x_{t},\beta)\Big\}\Big\},
\]
\[
M(\theta,\beta)-M(\theta^{0},\beta)=E\Big\{\frac{\partial m'(x_{t},\beta)}{\partial\theta}\Omega^{-1}(x_{t},\beta)\Big\{T\big[\tilde{y}_{t}^{(h)}(\theta)\big]-T\big[\tilde{y}_{t}^{(h)}(\theta^{0})\big]\Big\}\Big\}.
\]
By the Law of Iterated Expectations:
\[
M(\theta,\beta)-M(\theta^{0},\beta)=E\Big\{\frac{\partial m'(x_{t},\beta)}{\partial\theta}\Omega^{-1}(x_{t},\beta)\big\{m(x_{t},\theta)-m(x_{t},\theta^{0})\big\}\Big\},
\]
so that the identification Assumption B1 amounts to:
\[
E\Big\{\frac{\partial m'(x_{t},\beta)}{\partial\theta}\Omega^{-1}(x_{t},\beta)\big\{m(x_{t},\theta)-m(x_{t},\theta^{0})\big\}\Big\}=0\;\Longrightarrow\;\theta=\theta^{0}.\qquad(33)
\]
When \beta=\theta^{0}, we are back to the well-specified example and (33) is obviously identical to the identification condition (31) for consistency of maximum likelihood. Moreover, the identification assumption (33) for consistency of the UAML estimator \breve{\theta}_{T,H}(\beta) is clearly likely to be implied by the standard condition (31) for consistency of maximum likelihood, at least in two particular cases:

1. The model is a linear regression model w.r.t. some known multivariate function \kappa(x_{t}) of x_{t}:
\[
m(x_{t},\theta)=\kappa'(x_{t})\theta.
\]
In this case, the identification condition (33) is akin to:
\[
E\big[\kappa(x_{t})\Omega^{-1}(x_{t},\beta)\kappa'(x_{t})\big](\theta-\theta^{0})=0\;\Longrightarrow\;\theta=\theta^{0}.
\]
Obviously, when the matrix E[\kappa(x_{t})\Omega^{-1}(x_{t},\beta)\kappa'(x_{t})] is positive definite for \beta=\theta^{0}, it is positive definite for any possible value of \beta.

2. The model is not conditional. In this case, a necessary condition for identification is:
\[
E_{\theta}\{T(y)\}=E_{\theta^{0}}\{T(y)\}\;\Longleftrightarrow\;\theta=\theta^{0}.\qquad(34)
\]
This is basically the case considered by van der Vaart (1998) when noting that "the maximum likelihood estimators are moment estimators" based on the expectation of the sufficient statistic T(y). This identification condition should be maintained when picking p linearly independent equations out of the possibly overidentified equations (34).
More precisely, the identification condition for UAML, written as:
\[
\frac{\partial m'(\beta)}{\partial\theta}\Omega^{-1}(\beta)\big\{E_{\theta}\{T(y)\}-E_{\theta^{0}}\{T(y)\}\big\}=0\;\Longrightarrow\;\theta=\theta^{0},
\]
should generically be implied by (34) since, irrespective of the value of \beta, the matrix \partial m'(\beta)/\partial\theta has full row rank. More generally, one may expect that the identification condition (33), when fulfilled for \beta=\theta^{0}, should more often than not be fulfilled for any value of \beta.

C.2 Example: Latent Exponential Model

We now extend the exponential model example to incorporate a sequence of latent variables \{y_{t}^{*}\}_{t=1}^{T} such that, conditionally on \{x_{t}\}_{t=1}^{T}, the variables y_{t}^{*} are independent, for all t=1,...,T, and the conditional distribution of y_{t}^{*} only depends on the exogenous variable x_{t} with the same index. This distribution has a density l\{y_{t}^{*}\,|\,x_{t};\theta\}, with respect to the dominating measure \nu(dy_{t}^{*}), that is assumed to be exponential:
\[
l\{y_{t}^{*}\,|\,x_{t};\theta\}=\exp\big[c(x_{t},\theta)+h(y_{t}^{*},x_{t})+a'(x_{t},\theta)T(y_{t}^{*})\big].
\]
Let g be a known vector function that defines the observed endogenous variable y_{t} as:
\[
y_{t}=g(y_{t}^{*},x_{t}).
\]
Then, conditionally on \{x_{t}\}_{t=1}^{T}, the variables y_{t}, t=1,...,T, are independent and the conditional distribution of y_{t} only depends on the exogenous variable x_{t} with the same index. This conditional distribution has a density l\{y_{t}\,|\,x_{t};\theta\}, with respect to the measure \nu_{g}(dy_{t}), which is the transformation by g of the original dominating measure \nu(dy_{t}^{*}) used to define the latent density l\{y_{t}^{*}\,|\,x_{t};\theta\}. The observable log-likelihood can then be stated as:
\[
L_{T}(\theta)=\frac{1}{T}\sum_{t=1}^{T}\log[l\{y_{t}\,|\,x_{t};\theta\}].
\]
In general, the observable density is not of an exponential form; see Gourieroux et al.
(1987) for the particular case where y_{t}=g(y_{t}^{*}) and for examples of Probit, bivariate Probit, Tobit, generalized Tobit, disequilibrium, and Gompit models. As already mentioned in Section 2.3, Gourieroux et al. (1987), extending a result of Louis (1982), give a method to compute the observable score as a conditional expectation of the latent score:
\[
\frac{\partial L_{T}(\theta)}{\partial\theta}=\frac{1}{T}\sum_{t=1}^{T}E_{\theta}\Big[\frac{\partial\log[l\{y_{t}^{*}\,|\,x_{t};\theta\}]}{\partial\theta}\,\Big|\,y_{t},x_{t}\Big].
\]
Then, by applying (30) we get:
\[
\frac{\partial L_{T}(\theta)}{\partial\theta}=\frac{1}{T}\sum_{t=1}^{T}\frac{\partial m'(x_{t},\theta)}{\partial\theta}\Omega^{-1}(x_{t},\theta)\big\{E_{\theta}[T(y_{t}^{*})\,|\,y_{t},x_{t}]-m(x_{t},\theta)\big\}.\qquad(35)
\]
As exemplified by Gourieroux et al. (1987) for many limited dependent variable models, we can define and compute a generalized error as:
\[
u(y_{t},x_{t},\theta)=\tilde{T}(y_{t},x_{t},\theta)-m(x_{t},\theta),\qquad\tilde{T}(y_{t},x_{t},\theta)=E_{\theta}[T(y_{t}^{*})\,|\,y_{t},x_{t}].
\]
Then, the maximum likelihood estimator \hat{\theta}_{T} is defined as the solution of:
\[
\frac{\partial L_{T}(\theta)}{\partial\theta}=\frac{1}{T}\sum_{t=1}^{T}\frac{\partial m'(x_{t},\theta)}{\partial\theta}\Omega^{-1}(x_{t},\theta)\,u(y_{t},x_{t},\theta)=0.\qquad(36)
\]
Hence, the identification condition for consistency of maximum likelihood can be written:
\[
E\Big[\frac{\partial m'(x_{t},\theta)}{\partial\theta}\Omega^{-1}(x_{t},\theta)\,u(y_{t},x_{t},\theta)\Big]=0\;\Longleftrightarrow\;\theta=\theta^{0}.\qquad(37)
\]
We also note that MLE is no longer a moment estimator with optimal instruments (confirming that the model is no longer exponential), since:
\[
Var\big[u(y_{t},x_{t},\theta^{0})\,\big|\,x_{t}\big]=Var\big[E_{\theta^{0}}[T(y_{t}^{*})\,|\,y_{t},x_{t}]\,\big|\,x_{t}\big]\neq\Omega(x_{t},\theta^{0})=Var[T(y_{t}^{*})\,|\,x_{t}].
\]
More generally, by extension of (36) we have:
\[
\Delta_{\beta}L_{T}^{(h)}(\theta,\beta)=\frac{1}{T}\sum_{t=1}^{T}\frac{\partial m'(x_{t},\beta)}{\partial\theta}\Omega^{-1}(x_{t},\beta)\,u\big[\tilde{y}_{t}^{(h)}(\theta),x_{t},\beta\big].
\]
Hence,
\[
M(\theta,\beta)=E\Big\{\frac{\partial m'(x_{t},\beta)}{\partial\theta}\Omega^{-1}(x_{t},\beta)\,u\big[\tilde{y}_{t}^{(h)}(\theta),x_{t},\beta\big]\Big\},
\]
so that:
\[
M(\theta,\beta)-M(\theta^{0},\beta)=E\Big\{\frac{\partial m'(x_{t},\beta)}{\partial\theta}\Omega^{-1}(x_{t},\beta)\Big[u\big[\tilde{y}_{t}^{(h)}(\theta),x_{t},\beta\big]-u\big[\tilde{y}_{t}^{(h)}(\theta^{0}),x_{t},\beta\big]\Big]\Big\}.
\]
When \beta=\theta^{0}, we are back to the well-specified example, and we note that, by definition:
\[
E\big\{u\big[\tilde{y}_{t}^{(h)}(\theta^{0}),x_{t},\theta^{0}\big]\,\big|\,x_{t}\big\}=0\;\Longrightarrow\;E\big\{h(x_{t})\,u\big[\tilde{y}_{t}^{(h)}(\theta^{0}),x_{t},\theta^{0}\big]\big\}=0\ \text{for any function }h(\cdot)
\]
\[
\Longrightarrow\;M(\theta,\theta^{0})-M(\theta^{0},\theta^{0})=E\Big\{\frac{\partial m'(x_{t},\theta^{0})}{\partial\theta}\Omega^{-1}(x_{t},\theta^{0})\,u\big[\tilde{y}_{t}^{(h)}(\theta),x_{t},\theta^{0}\big]\Big\},
\]
so that the identification condition M(\theta,\beta)-M(\theta^{0},\beta)=0\Longleftrightarrow\theta=\theta^{0} can be written:
\[
E\Big\{\frac{\partial m'(x_{t},\theta^{0})}{\partial\theta}\Omega^{-1}(x_{t},\theta^{0})\,u\big[\tilde{y}_{t}^{(h)}(\theta),x_{t},\theta^{0}\big]\Big\}=0\;\Longleftrightarrow\;\theta=\theta^{0}.\qquad(38)
\]
By commuting the roles of \theta and \theta^{0}, this is clearly tantamount to the identification condition (37) for maximum likelihood. In the general case, the identification condition B1(\beta) for UAML can be written:
\[
E\Big\{\frac{\partial m'(x_{t},\beta)}{\partial\theta}\Omega^{-1}(x_{t},\beta)\Big[u\big[\tilde{y}_{t}^{(h)}(\theta),x_{t},\beta\big]-u\big[\tilde{y}_{t}^{(h)}(\theta^{0}),x_{t},\beta\big]\Big]\Big\}=0\;\Longleftrightarrow\;\theta=\theta^{0},
\]
or equivalently:
\[
E\Big\{\frac{\partial m'(x_{t},\beta)}{\partial\theta}\Omega^{-1}(x_{t},\beta)\big[\tilde{m}(x_{t},\theta,\beta)-\tilde{m}(x_{t},\theta^{0},\beta)\big]\Big\}=0\;\Longleftrightarrow\;\theta=\theta^{0},
\]
where \tilde{m}(x_{t},\theta,\beta)=E\big[u\big(\tilde{y}_{t}^{(h)}(\theta),x_{t},\beta\big)\,\big|\,x_{t}\big].
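For concreteness, the generalized error u and its conditional expectation \tilde{m} can be computed by simulation in simple latent models. The fragment below is an illustrative sketch (our own code, with hypothetical helper names), assuming a Tobit observation rule y=g(y^{*})=\max(y^{*},0) with latent Gaussian y^{*}=x\theta+\varepsilon, so that \tilde{T}(y,x,\theta)=E_{\theta}[y^{*}\,|\,y,x] has a closed form via the truncated-normal mean:

```python
import numpy as np
from math import erf, sqrt, pi

# Standard normal pdf/cdf (vectorized, stdlib-only).
Phi = np.vectorize(lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0))))
phi = lambda z: np.exp(-0.5 * z**2) / sqrt(2.0 * pi)

def T_tilde(y, x, theta, sigma=1.0):
    """Tobit building block: E_theta[y* | y, x]. If y > 0 the latent variable
    is observed (y* = y); if y = 0 it is the mean of a normal truncated to
    (-inf, 0]."""
    mu = x * theta
    alpha = -mu / sigma
    trunc_mean = mu - sigma * phi(alpha) / Phi(alpha)  # E[y* | y* <= 0, x]
    return np.where(y > 0, y, trunc_mean)

def u_generalized(y, x, theta, sigma=1.0):
    # u(y, x, theta) = T_tilde(y, x, theta) - m(x, theta), with m(x, theta) = x*theta
    return T_tilde(y, x, theta, sigma) - x * theta

def m_tilde(x, theta, beta, n_sim=10_000, rng=None):
    """tilde-m(x, theta, beta): E[u(y(theta), x, beta) | x], approximated by
    simulating y under theta and averaging u evaluated at beta."""
    rng = rng or np.random.default_rng(0)
    y_star = x * theta + rng.standard_normal(n_sim)
    y = np.maximum(y_star, 0.0)
    return u_generalized(y, x, beta).mean()
```

By construction, \tilde{m}(x,\theta,\theta)\approx 0 for every x, while \tilde{m}(x,\theta,\beta) is generally nonzero for \beta\neq\theta, which is exactly the structure the identification condition above exploits.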
By comparison with (38), we see that while both generalized errors u[\tilde{y}_{t}^{(h)}(\theta),x_{t},\beta] and u[\tilde{y}_{t}^{(h)}(\theta^{0}),x_{t},\beta] will in general have a non-zero conditional expectation given x_{t} (when \beta\notin\{\theta,\theta^{0}\}), identification means that, when \theta\neq\theta^{0}, their difference cannot be orthogonal to the p specific functions of x_{t} that define the rows of the selection matrix:
\[
\frac{\partial m'(x_{t},\beta)}{\partial\theta}\Omega^{-1}(x_{t},\beta).
\]
This condition is similar to the identification condition (33) for UAML in the exponential model example, except that, due to the transformation y_{t}=g(y_{t}^{*},x_{t}), the conditional expectation given x_{t} along simulated paths still depends on \beta. In the particular case of a latent model defined by a univariate linear and homoskedastic regression equation:
\[
m(x_{t},\theta)=x_{t}'\theta,\qquad\Omega(x_{t},\theta)=\sigma^{2},
\]
the identification condition in Assumption B1 for UAML becomes:
\[
E\big\{x_{t}\big[\tilde{m}(x_{t},\theta,\beta)-\tilde{m}(x_{t},\theta^{0},\beta)\big]\big\}=0\;\Longleftrightarrow\;\theta=\theta^{0}.
\]
For instance, in the case of a Probit model (\sigma^{2}=1):
\[
E\Big\{x_{t}\frac{\varphi(x_{t}'\beta)}{\Phi(x_{t}'\beta)\big[1-\Phi(x_{t}'\beta)\big]}\big[\Phi(x_{t}'\theta)-\Phi(x_{t}'\theta^{0})\big]\Big\}=0\;\Longleftrightarrow\;\theta=\theta^{0},
\]
which we can compare to the standard identification condition for a Probit model:
\[
E\Big\{x_{t}\frac{\varphi(x_{t}'\theta)}{\Phi(x_{t}'\theta)\big[1-\Phi(x_{t}'\theta)\big]}\big[\Phi(x_{t}'\theta)-\Phi(x_{t}'\theta^{0})\big]\Big\}=0\;\Longleftrightarrow\;\theta=\theta^{0}.
\]
These conditions appear to be quite reasonable.

D Example 5: Stable Distribution

Consider i.i.d. observations y_{1},...,y_{T} generated from a stable distribution with stability parameter a\in(0,2], skewness parameter b\in[-1,1], scale c>0 and location \mu\in\mathbb{R}. The structural parameter vector is given by \theta=(a,b,\zeta')', \zeta=(c,\mu)'.
We consider this model under the false equality constraint:
\[
(a,b)'=(1,0)',
\]
leaving free only the location \mu and scale c, which gives the (Cauchy) log-likelihood:
\[
L_{T}(1,0,\zeta)=-\log[\pi c]-\frac{1}{T}\sum_{t=1}^{T}\log\Big[1+\Big(\frac{y_{t}-\mu}{c}\Big)^{2}\Big].
\]
We can define the pseudo-score vector as:
\[
\Delta_{\theta}L_{T}(1,0,\zeta)=\Big(\frac{\partial L_{T}(1,0,\zeta)}{\partial\zeta'},\;L_{T}(2,0,\zeta)-L_{T}(1,0,\zeta),\;\tilde{L}_{T}(1,1,\zeta)-L_{T}(1,0,\zeta)\Big)'.
\]
Note that the finite difference [L_{T}(2,0,\zeta)-L_{T}(1,0,\zeta)] is a convenient approximation of the partial derivative \partial L_{T}(1,0,\zeta)/\partial a, since the log-likelihood function L_{T}(2,0,\zeta) is computed as the likelihood for i.i.d. draws from a Normal distribution with mean \mu and variance 2c^{2}. Second, the finite difference [\tilde{L}_{T}(1,1,\zeta)-L_{T}(1,0,\zeta)] is a convenient approximation of the partial derivative \partial L_{T}(1,0,\zeta)/\partial b, since the log-likelihood function L_{T}(1,1,\zeta) could be computed as the likelihood for i.i.d. draws from a Landau distribution with location parameter \mu and scale parameter c:
\[
L_{T}(1,1,\zeta)=\frac{1}{T}\sum_{t=1}^{T}\log(f(y_{t})),\quad\text{where}\quad f(y)=\frac{1}{\pi c}\int_{0}^{\infty}e^{-x}\cos\Big[x\Big(\frac{y-\mu}{c}\Big)+\frac{2x}{\pi}\log\Big(\frac{x}{c}\Big)\Big]dx.
\]
To speed up the computation, we use the following approximation to f(y), given by Behrens and Melissinos (1981):
\[
f(y)\simeq\frac{1}{\sqrt{2\pi}\,c}\exp\Big\{-\frac{y-\mu}{2c}-\frac{1}{2}\exp\Big[-\frac{y-\mu}{c}\Big]\Big\}.
\]

D.1 Monte Carlo

We now compare the behavior of AML using the above pseudo-score, and H=10 simulations, against two alternative approaches: one based on sample quantiles, due to McCulloch (1986), and one based on an auxiliary regression model, due to Koutrouvelis (1981). To this end, we generate 1,000 synthetic datasets from the alpha-stable model, each of size T=10^{4}, with \theta^{0}=(1.8,-0.1,0.1,0)'. We display the resulting estimators across the replications in Figure 7.
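The constrained Cauchy likelihood, the Moyal-type closed form used to approximate the Landau density, and the resulting finite-difference pseudo-score can be sketched as follows. This is our own illustrative Python, not the paper's implementation; `pseudo_score` stacks the exact (\mu, c) Cauchy score with the two finite differences described above:

```python
import numpy as np

SQRT2PI = np.sqrt(2.0 * np.pi)

def loglik_cauchy(y, mu, c):
    # L_T(1,0,zeta): average log-likelihood of i.i.d. Cauchy(mu, c) draws
    return np.mean(-np.log(np.pi * c) - np.log1p(((y - mu) / c) ** 2))

def loglik_gauss(y, mu, c):
    # L_T(2,0,zeta): Gaussian with mean mu and variance 2c^2 (the a = 2 boundary)
    return np.mean(-0.5 * np.log(4.0 * np.pi * c**2) - (y - mu) ** 2 / (4.0 * c**2))

def loglik_landau_moyal(y, mu, c):
    # ~L_T(1,1,zeta): Moyal-type approximation to the Landau density
    # (exp(-z) can overflow far in the left tail, which has negligible
    # probability under the Landau law itself)
    z = (y - mu) / c
    return np.mean(-np.log(SQRT2PI * c) - 0.5 * z - 0.5 * np.exp(-z))

def pseudo_score(y, mu, c):
    """Pseudo-score under the false constraint (a, b)' = (1, 0)': exact Cauchy
    score in (mu, c), finite differences in the a and b directions."""
    z = (y - mu) / c
    d_mu = np.mean(2.0 * z / (c * (1.0 + z**2)))      # d/d mu of Cauchy loglik
    d_c = np.mean((z**2 - 1.0) / (c * (1.0 + z**2)))  # d/d c of Cauchy loglik
    d_a = loglik_gauss(y, mu, c) - loglik_cauchy(y, mu, c)
    d_b = loglik_landau_moyal(y, mu, c) - loglik_cauchy(y, mu, c)
    return np.array([d_mu, d_c, d_a, d_b])
```

At the true (\mu, c) of Cauchy-generated data, the first two components are mean-zero sample averages, while the finite-difference components pick up departures in the a and b directions.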
Analyzing the results, we see that the three procedures perform similarly for the scale parameter, but display different behavior for the stability, skewness and location parameters, although all estimators seem quite reliable and are well-centred over the true values. Table 7 records the Monte Carlo bias (Bias), root mean squared error (RMSE), and Monte Carlo coverage (COV), based on individual 95% Wald intervals, across the replications. The results demonstrate that the methods all yield accurate estimators of the corresponding true values. However, we note that the simpler methods do outperform AML in terms of bias and RMSE, but display worse coverage than AML in almost all cases.

Similar results were obtained whether or not the Behrens-Melissinos approximation was employed. Given the similarity of the results, and the drastic speed difference, the approximation approach is the more reasonable one to apply in practice. We also remark that while ML estimation is feasible in the alpha-stable model for small numbers of observations, given the sample size considered herein, obtaining the MLE proved to be computationally infeasible.

Figure 7: Boxplots of estimators across 1,000 Monte Carlo replications from the stable distribution. The true values used to generate the data are \theta^{0}=(a,b,c,\mu)'=(1.8,-0.1,0.1,0)'. AML: approximate maximum likelihood estimator; Kout: Koutrouvelis (1981) regression approach; McC: McCulloch (1986) quantile approach.

Table 7: Summary accuracy measures for stable example.
Acronyms are as described in Figure 7, while Aux refers to the auxiliary estimator estimated under the restriction (a,b)=(1,0). Bias entries are scaled by 10^3 and RMSE entries by 10^2.

                         a                                         b
          AML       Aux        Kout      McC        AML       Aux       Kout      McC
  Mean    1.8190    1.0000     1.7994    1.8031     -0.0948   0.0000    -0.0966   -0.1039
  Bias    19.0315   -800.0000  -0.6072   3.1120     5.1846    100.0000  3.4381    -3.8520
  RMSE    9.6607    80.0000    1.4561    2.9785     13.6959   10.0000   6.5433    7.9542
  COV     0.9600    0.0000     0.9410    0.9540     0.9650    0.0000    0.9540    0.9440
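Synthetic draws from a general stable law, as used in the Monte Carlo above, are commonly generated with the Chambers-Mallows-Stuck representation (Chambers, Mallows and Stuck, 1976, reference [9]). The sketch below is our own illustrative implementation of that standard representation, not the authors' code:

```python
import numpy as np

def rstable(a, b, c, mu, size, rng=None):
    """Chambers-Mallows-Stuck draws from a stable law with stability a,
    skewness b, scale c and location mu. (For a == 1 with b != 0,
    parametrization conventions differ by a deterministic location shift.)"""
    rng = rng or np.random.default_rng()
    V = rng.uniform(-np.pi / 2, np.pi / 2, size)  # uniform angle
    W = rng.exponential(1.0, size)                # unit exponential
    if a == 1.0:
        X = (2.0 / np.pi) * ((np.pi / 2 + b * V) * np.tan(V)
             - b * np.log((np.pi / 2) * W * np.cos(V) / (np.pi / 2 + b * V)))
    else:
        B = np.arctan(b * np.tan(np.pi * a / 2)) / a
        S = (1.0 + b**2 * np.tan(np.pi * a / 2) ** 2) ** (1.0 / (2.0 * a))
        X = (S * np.sin(a * (V + B)) / np.cos(V) ** (1.0 / a)
             * (np.cos(V - a * (V + B)) / W) ** ((1.0 - a) / a))
    return mu + c * X
```

Two useful sanity checks: for a=2 the output reduces to a \mathcal{N}(\mu,2c^{2}) draw, and for (a,b)=(1,0) to a Cauchy(\mu,c) draw, consistent with the constrained likelihood used in the pseudo-score above.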