Consistency of maximum likelihood estimation for some dynamical systems
Institute of Mathematical Statistics, 2015
CONSISTENCY OF MAXIMUM LIKELIHOOD ESTIMATION FOR SOME DYNAMICAL SYSTEMS
By Kevin McGoff, Sayan Mukherjee, Andrew Nobel and Natesh Pillai
Duke University, Duke University, University of North Carolina and Harvard University
We consider the asymptotic consistency of maximum likelihood parameter estimation for dynamical systems observed with noise. Under suitable conditions on the dynamical systems and the observations, we show that maximum likelihood parameter estimation is consistent. Our proof involves ideas from both information theory and dynamical systems. Furthermore, we show how some well-studied properties of dynamical systems imply the general statistical properties related to maximum likelihood estimation. Finally, we exhibit classical families of dynamical systems for which maximum likelihood estimation is consistent. Examples include shifts of finite type with Gibbs measures and Axiom A attractors with SRB measures.
1. Introduction.
Maximum likelihood estimation is a common, well-studied and powerful technique for statistical estimation. In the context of a statistical model with an unknown parameter, the maximum likelihood estimate of the unknown parameter is, by definition, any parameter value under which the observed data is most likely; such parameter values are said to maximize the likelihood function with respect to the observed data. In classical statistical models, one typically thinks of the unknown parameter as a real number or possibly a finite-dimensional vector of real numbers. Here we consider maximum likelihood estimation for statistical models in
Received June 2013; revised July 2014.
Supported by NSF Grant DMS-10-45153.
Supported by NIH (Systems Biology) 5P50-GM081883, AFOSR FA9550-10-1-0436, NSF CCF-1049290 and NSF DMS-12-09155.
Supported in part by NSF Grants DMS-09-07177 and DMS-13-10002.
Supported by NSF Grant DMS-11-07070.
AMS 2000 subject classifications.
Primary 37A50, 37A25, 62B10, 62F12, 62M09; secondary 37D20, 60F10, 62M05, 62M10, 94A17.
Key words and phrases.
Dynamical systems, hidden Markov models, maximum likelihood estimation, strong consistency.
This is an electronic reprint of the original article published by the Institute of Mathematical Statistics in The Annals of Statistics, 2015, Vol. 43, No. 1, 1–29. This reprint differs from the original in pagination and typographic detail.
which each parameter value corresponds to a stochastic system observed with noise.

Hidden Markov models (HMMs) provide a natural setting in which to study both stochastic systems with observational noise and maximum likelihood estimation. In this setting, one has a parametrized family of stochastic processes that are assumed to be Markov, and one attempts to perform inference about the underlying parameters from noisy observations of the process. There has been a substantial amount of work on statistical inference for HMMs, and we do not attempt a complete survey of that area here. In the 1960s, Baum and Petrie [5, 37] studied consistency of maximum likelihood estimation for finite state HMMs. Since that time, several other authors have shown that maximum likelihood estimation is consistent for HMMs under increasingly general conditions [13, 16, 18, 29–31], culminating with the work of Douc et al. [15], which currently provides the most general conditions on HMMs under which maximum likelihood estimation has been shown to be consistent.

We focus here on the consistency of maximum likelihood estimation for parametrized families of deterministic systems observed with noise. Inference methods for deterministic systems from noisy observations are of interest in a variety of scientific areas; for a few examples, see [19, 20, 28, 38–40, 46, 49]. For the purpose of this article, the terms deterministic system and dynamical system refer to a map T : X → X. The set X is referred to as the state space, and the transformation T governs the evolution of states over one (discrete) time increment. Our main interest here lies in families of dynamical systems observed with noise. More precisely, we consider a state space X and a parameter space Θ, and to each θ in Θ, we associate a dynamical system T_θ : X → X. Note that the state space X does not depend on θ.
For each θ in Θ, we assume that the system is started at equilibrium from a T_θ-invariant measure µ_θ. See Section 2 for precise definitions. We are particularly interested in situations in which the family of dynamical systems is observed via noisy measurements (or observations). We consider a general observation model specified by a family of probability densities {g_θ(·|x) : θ ∈ Θ, x ∈ X}, where g_θ(·|x) prescribes the distribution of an observation given that the state of the dynamical system is x and the state of nature is θ. Under some additional conditions (see Section 3), our first main result states that maximum likelihood estimation is a consistent method of estimation of the parameter θ.

We have chosen to state the conditions of our main consistency result in terms of statistical properties of the family of dynamical systems and the observations. However, these particular statistical properties have not been directly studied in the dynamical systems literature. In the interest of applying our general result to specific systems, we also establish several connections between well-studied properties of dynamical systems and the statistical properties relevant to maximum likelihood estimation. Finally, we apply these results to some examples, including shifts of finite type with Gibbs measures and Axiom A attractors with SRB (Sinai–Ruelle–Bowen) measures. It is widely accepted in the field of ergodic theory and dynamical systems that these classes of systems have "good" statistical properties, and our results may be viewed as a precise confirmation of this view.

1.1. Previous work.
There has been a substantial amount of work on statistical inference for HMMs, and a complete survey of that area is beyond the scope of this work. The asymptotic consistency of maximum likelihood estimation for HMMs has been studied at least since the work of Baum and Petrie [5, 37] under the assumption that both the hidden state space X and the observation space Y are finite sets. Leroux extended this result to the setting where Y is a general space and X is a finite set [29]. Several other authors have shown that maximum likelihood estimation is consistent for HMMs under increasingly general conditions [13, 16, 18, 30, 31], culminating with the work of Douc et al. [15], which currently provides the most general conditions for HMMs under which maximum likelihood estimation has been shown to be consistent.

Let us now discuss the results of Douc et al. [15] in greater detail. Consider parametrized families of HMMs in which both the hidden state space X and the observation space Y are complete, separable metric spaces. The main result of [15] shows that under several conditions, maximum likelihood estimation is a consistent method of estimation of the unknown parameter. These conditions involve some requirements on the transition kernel of the hidden Markov chain, as well as basic integrability conditions on the observations. The proof of that result relies on information-theoretic arguments, in combination with the application of some mixing conditions that follow from the assumptions on the transition kernel. To prove our consistency result, we take a similar information-theoretic approach, but instead of placing explicit restrictions on the transition kernel, we identify and study mixing conditions suitable for dynamical systems.
See Remarks 2.4 and 3.3 for further discussion of our results in the context of HMMs.

Other directions of study regarding inference for HMMs include the behavior of MLE for misspecified HMMs [14], asymptotic normality for parameter estimates [8, 23], the dynamics of Bayesian updating [44] and starting the hidden process away from equilibrium [15]. Extending these results to dynamical systems is of potential interest.

The topic of statistical inference for dynamical systems has been widely studied in a variety of fields. Early interest from the statistical point of view is reflected in the following surveys: [6, 12, 21, 22]. For a recent review of this area with many references, see [33]. There has been significant methodological work in the area of statistical inference for dynamical systems (for a few recent examples, see [19, 20, 38, 46, 49]), but in this section we attempt to describe some of the more theoretical work in this area. The relevant theoretical work to date falls (very) roughly into three classes:

• state estimation (also known as denoising or filtering) for dynamical systems with observational noise;
• prediction for dynamical systems with observational noise;
• system reconstruction from dynamical systems without noise.

Let us now mention some representative works from these lines of research. In the setting of dynamical systems with observational noise, Lalley introduced several ideas regarding state estimation in [25]. These ideas were subsequently generalized and developed in [26, 27]. Key results from this line of study include both positive and negative results on the consistency of denoising a dynamical system under additive observational noise. In short, the magnitude of the support of the noise seems to determine whether consistent denoising is possible.
In related work, Judd [24] demonstrated that MLE can fail (in a particular sense) in state estimation when noise is large. It is perhaps interesting to note that there are examples of Axiom A systems with Gaussian observational noise for which state estimation cannot be consistent (by results of [26, 27]) and yet MLE provides consistent parameter estimation (by Theorem 5.7).

Steinwart and Anghel considered the problem of consistency in prediction accuracy for dynamical systems with observational noise [45]. They were able to show that support vector machines are consistent in terms of prediction accuracy under some conditions on the decay of correlations of the dynamical system.

The work of Adams and Nobel uses ideas from regression to study reconstruction of measure-preserving dynamical systems [1, 34, 35] without noise. These results show that certain types of inference are possible under fairly mild ergodicity assumptions. A sample result from this line of work is that a measure-preserving transformation may be consistently reconstructed from a typical trajectory observed without noise, assuming that the transformation preserves a measure that is absolutely continuous (with Radon–Nikodym derivative bounded away from 0 and infinity) with respect to a known reference measure.

1.2. Organization.
In Section 2, we give some necessary background on dynamical systems observed with noise. Section 3 contains a statement and discussion of our main result (Theorem 3.1), which asserts that under some general statistical conditions, maximum likelihood parameter estimation is consistent for families of dynamical systems observed with noise. The purpose of Section 4 is to establish connections between well-studied properties
of dynamical systems and the (statistical) conditions appearing in Theorem 3.1. Section 5 gives several examples of widely studied families of dynamical systems to which we apply Theorem 3.1 and therefore establish consistency of maximum likelihood estimation. The proofs of our main results appear in Section 6, and we conclude with some final remarks in Section 7.
2. Setting and notation.
Recall that our primary objects of study are parametrized families of dynamical systems. In this section we introduce these objects in some detail. First let us recall some terminology regarding dynamical systems and ergodic theory. We use X to denote a state space, which we assume to be a complete separable metric space endowed with its Borel σ-algebra X. Then a measurable dynamical system on X is defined by a measurable map T : X → X, which governs the evolution of states over one (discrete) time increment. For a probability measure µ on the measurable space (X, X), we say that T preserves µ (or µ is T-invariant) if µ(T^{-1}E) = µ(E) for each set E in X. We refer to the quadruple (X, X, T, µ) as a measure-preserving system. To generate a trajectory (X_k) from such a measure-preserving system, one chooses X_0 according to µ and sets X_k = T^k(X_0) for k ≥
0. Note that (X_k) is then a stationary X-valued stochastic process. Finally, the measure-preserving system (X, X, T, µ) is said to be ergodic if T^{-1}E = E implies µ(E) ∈ {0, 1}. See the books [36, 48] for an introduction to measure-preserving systems and ergodic theory.

Let us now introduce the setting of parametrized families of dynamical systems. We denote the parameter space by Θ, which is assumed to be a compact metric space endowed with its Borel σ-algebra. Fix a state space X and its Borel σ-algebra X as above. To each parameter θ in Θ, we associate a measurable transformation T_θ : X → X, which prescribes the dynamics corresponding to the parameter θ. Finally, we need to specify some initial conditions. In this article, we consider the case that the system is started from equilibrium. More precisely, we associate to each θ in Θ a T_θ-invariant Borel probability measure µ_θ on (X, X). Thus, to each θ in Θ, we associate a measure-preserving system (X, X, T_θ, µ_θ), and we refer to the collection (X, X, T_θ, µ_θ)_{θ∈Θ} as a parametrized family of dynamical systems. For ease of notation, we will refer to (T_θ, µ_θ)_{θ∈Θ} as a family of dynamical systems on (X, X), instead of referring to the family of quadruples (X, X, T_θ, µ_θ)_{θ∈Θ}.

We would like to study the situation that such a family of dynamical systems is observed via noisy measurements. Here we describe the specifics of our observation model. We suppose that we have a complete, separable metric space Y, endowed with its σ-algebra Y, which serves as our observation space. We also assume that we have a family of Borel probability densities {g_θ(·|x) : θ ∈ Θ, x ∈ X} with respect to a fixed reference measure ν on Y. The density g_θ(·|x) prescribes the distribution of our observation given that the state of the dynamical system is x and the state of nature is θ.
Finally, we assume that the noise involved in successive observations is conditionally independent given θ and the underlying trajectory of the dynamical system. Thus our full model consists of a parametrized family of dynamical systems (T_θ, µ_θ)_{θ∈Θ} on a measurable space (X, X) with corresponding observation densities {g_θ(·|x) : θ ∈ Θ, x ∈ X}.

In general, we would like to estimate the parameter θ from our observations. Maximum likelihood estimation provides a basic method for performing such estimation. Our first main result states that maximum likelihood estimation is a consistent estimator of θ under some general conditions on the family of systems and the noise. In order to state these results precisely, we now introduce the likelihood for our model. For the sake of notation, it will be convenient to denote finite sequences (x_i, ..., x_j) with the notation x_i^j.

As we have assumed that our observations are conditionally independent given θ and a trajectory (X_k), we have that for θ ∈ Θ and y_0^n ∈ Y^{n+1}, the likelihood of observing y_0^n given θ and (X_k) is

p_θ(y_0^n | X_0^n) = ∏_{j=0}^n g_θ(y_j | X_j).

Since X_k = T_θ^k(X_0) given θ and X_0, the conditional likelihood of y_0^n given θ and X_0 = x is

p_θ(y_0^n | x) = ∏_{j=0}^n g_θ(y_j | T_θ^j(x)).

Since our model also assumes that X_0 is distributed according to µ_θ, we have that for θ ∈ Θ and y_0^n ∈ Y^{n+1}, the marginal likelihood of observing y_0^n given θ is

p_θ(y_0^n) = ∫ p_θ(y_0^n | x) dµ_θ(x).    (2.1)

We denote by ν^n the product measure on Y^{n+1} with marginals equal to ν. Let P_θ be the probability measure on X × Y^N such that for Borel sets A ⊂ X and B ⊂ Y^{n+1}, it holds that

P_θ(A × B) = ∫∫ 1_A(x) 1_B(y_0^n) p_θ(y_0^n | x) dν^n(y_0^n) dµ_θ(x),

which is well defined by Kolmogorov's consistency theorem.
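To make the marginal likelihood (2.1) concrete, the following is a minimal numerical sketch using a hypothetical toy model (not taken from the paper): the circle rotation T_θ(x) = x + θ mod 1, whose invariant measure µ_θ is Lebesgue measure on [0, 1), observed through Gaussian noise with an illustrative scale SIGMA. The integral in (2.1) is approximated by Monte Carlo over draws X_0 ~ µ_θ, with a log-mean-exp to keep the product of densities numerically stable.

```python
import numpy as np

SIGMA = 0.25  # observation noise scale (illustrative choice)

def T(theta, x):
    """Toy dynamics: rotation by theta on the circle [0, 1)."""
    return (x + theta) % 1.0

def log_g(y, x):
    """Log observation density g_theta(y|x): Gaussian noise around the state."""
    return -0.5 * ((y - x) / SIGMA) ** 2 - 0.5 * np.log(2 * np.pi) - np.log(SIGMA)

def log_marginal_likelihood(theta, ys, n_mc=2000, seed=0):
    """Monte Carlo estimate of log p_theta(y_0^n) as in (2.1):
    average the conditional likelihood p_theta(y_0^n | x) over x ~ mu_theta."""
    rng = np.random.default_rng(seed)
    x = rng.random(n_mc)            # X_0 ~ mu_theta (Lebesgue on [0, 1))
    log_p = np.zeros(n_mc)
    for y in ys:
        log_p += log_g(y, x)        # multiply in g_theta(y_j | T_theta^j x)
        x = T(theta, x)             # advance the deterministic dynamics
    m = log_p.max()                 # log-mean-exp for numerical stability
    return m + np.log(np.mean(np.exp(log_p - m)))
```

For irrational θ the rotation is ergodic with respect to Lebesgue measure, so this toy family fits the equilibrium setup above; the accuracy of the estimate is controlled by n_mc.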
Let E_θ denote expectation with respect to P_θ, and let P_θ^Y be the marginal of P_θ on Y^N.

Before we define consistency, let us first consider the issue of identifiability. Our notion of identifiability is captured by the following equivalence relation.

Definition 2.1.
Define an equivalence relation on Θ as follows: let θ ∼ θ′ if P_θ^Y = P_{θ′}^Y. Denote by [θ] the equivalence class of θ with respect to this equivalence relation.

In a strong theoretical sense, if θ′ is in [θ], then the systems corresponding to the parameter values θ′ and θ cannot be distinguished from each other based on observations of the system.

Now we fix a distinguished element θ* in Θ. Here and in the rest of the paper, we assume that θ* is the "true" parameter; that is, the data are generated from the measure P_{θ*}^Y. Hence, one may think of [θ*] as the set of parameters that cannot be distinguished from the true parameter.

Definition 2.2.
An approximate maximum likelihood estimator (MLE) is a sequence of measurable functions θ̂_n : Y^{n+1} → Θ such that

(1/n) log p_{θ̂_n(Y_0^n)}(Y_0^n) ≥ sup_θ (1/n) log p_θ(Y_0^n) − o_{a.s.}(1),    (2.2)

where o_{a.s.}(1) denotes a process that tends to zero P_{θ*}-a.s. as n tends to infinity.

Remark 2.1.
Several notions in this article, including the definition of approximate MLE above, involve taking suprema over θ in Θ. In many situations of interest to us, X and Θ are compact, and all relevant functions are continuous in these arguments. In such cases, we have sufficient regularity to guarantee that suprema over θ in Θ are measurable. However, in the general situation, such suprema are not guaranteed to be measurable, and one must take some care. As all our measurable spaces are Polish (complete, separable metric spaces), such functions are always universally measurable ([7], Proposition 7.47). Similarly, a Borel-measurable (approximate) maximum likelihood estimator need not exist, but the Polish assumption ensures the existence of universally measurable maximum likelihood estimators ([7], Proposition 7.50). Thus all probabilities and expectations may be unambiguously extended to such quantities.

Remark 2.2.
In this work, we do not consider specific schemes for constructing an approximate MLE. Based on the existing results regarding denoising and system reconstruction (e.g., [1, 25–27, 34, 35], which are briefly discussed in Section 1.1), explicit construction of an approximate MLE may be possible under suitable conditions. Although the description and study of such constructive methods could be interesting, it is outside of the scope of this work.
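Although no constructive scheme is proposed here, a brute-force sketch illustrates what one might look like in the simplest case (a hypothetical toy model, not from the paper: the circle rotation T_θ(x) = x + θ mod 1 with uniform invariant measure and Gaussian noise, with the compact Θ discretized by a finite grid). Under continuity of the likelihood in θ, refining the grid with n would yield an approximate MLE in the sense of (2.2).

```python
import numpy as np

SIGMA = 0.1  # observation noise scale (illustrative choice)

def log_marginal_likelihood(theta, ys, n_mc=4000, seed=0):
    """Monte Carlo estimate of log p_theta(y_0^n) for the toy model
    T_theta(x) = x + theta (mod 1), mu_theta = Lebesgue, Gaussian noise."""
    rng = np.random.default_rng(seed)
    x = rng.random(n_mc)                       # X_0 ~ mu_theta
    log_p = np.zeros(n_mc)
    for y in ys:
        log_p += -0.5 * ((y - x) / SIGMA) ** 2  # unnormalized Gaussian log density
        x = (x + theta) % 1.0                   # deterministic dynamics
    m = log_p.max()
    return m + np.log(np.mean(np.exp(log_p - m)))

def grid_mle(ys, grid):
    """Maximize the (estimated) marginal log-likelihood over a finite grid in Theta."""
    scores = [log_marginal_likelihood(th, ys) for th in grid]
    return grid[int(np.argmax(scores))]
```

With a few hundred observations, the grid maximizer typically recovers the data-generating parameter of this toy family up to the grid spacing.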
Remark 2.3.
In principle, one could consider inference based on the conditional likelihood p_θ(·|x) in place of the marginal likelihood p_θ(·). However, we do not pursue this direction in this work. For nonlinear dynamical systems, even the conditional likelihood p_θ(·|x) may depend very sensitively on x; see [6], for example. Thus optimizing over x is essentially no more "tractable" than marginalizing the likelihood via an invariant measure.

Remark 2.4.
The framework of this paper may be translated into the language of Markov chains as follows. For each θ ∈ Θ, we define a (degenerate) Markov transition kernel Q_θ by

Q_θ(x, A) = δ_{T_θ(x)}(A).

In other words, for each θ ∈ Θ, x ∈ X and Borel set A ⊂ X, the probability that X_{k+1} ∈ A conditioned on X_k = x is Q_θ(x, A) = δ_{T_θ(x)}(A), where δ_x is defined to be a point mass at x.

In all previous work on consistency of maximum likelihood estimation for HMMs (including [13, 15, 16, 18, 30, 31]), there have been significant assumptions placed on the Markovian structure of the hidden chain. For example, the central hypothesis appearing in [15] requires that there is a σ-finite measure λ on X such that for some L ≥
0, the L-step transition kernel Q_θ^L(x, ·) is absolutely continuous with respect to λ with bounded Radon–Nikodym derivative. If X is uncountable, then the degeneracy of Q_θ, which arises directly from the fact that we are considering deterministic systems, makes the existence of such a dominating measure impossible. In short, it is precisely the determinism in our hidden processes that prevents previous theorems for HMMs from applying to dynamical systems.

Nonetheless, there is a special case of systems that we consider in Section 5.1 that overlaps with the systems considered in the HMM literature. If X is a shift of finite type, T_θ is the shift map σ : X → X for all θ, µ_θ is a (1-step) Markov measure for all θ, and g_θ(·|x) depends only on θ and the zero coordinate x_0, then both the present work and the results in [15] apply to this setting and guarantee consistency of any approximate MLE under additional assumptions on the noise.
3. Consistency of MLE.
In this section, we show that under suitable conditions, any approximate MLE is consistent for families of dynamical systems observed with noise. To make this statement precise, we make the following definition of consistency.
Definition 3.1.
An approximate MLE (θ̂_n)_n is consistent at θ* if θ̂_n(Y_0^n) converges to [θ*], P_{θ*}-a.s. as n tends to infinity.

For the sake of notation, define the function γ : Θ × Y → R_+ by

γ_θ(y) = sup_{x∈X} g_θ(y|x).

Also, for x >
0, let log⁺ x = max(0, log x).

Consider the following conditions on a family of dynamical systems observed with noise:

(S1) Ergodicity. The system (T_{θ*}, µ_{θ*}) on (X, X) is ergodic.

(S2) Logarithmic integrability at θ*. It holds that E_{θ*}[log⁺ γ_{θ*}(Y_0)] < ∞ and

E_{θ*}[ | log ∫ g_{θ*}(Y_0 | x) dµ_{θ*}(x) | ] < ∞.

(S3) Logarithmic integrability away from θ*. For each θ′ ∉ [θ*], there exists a neighborhood U of θ′ such that

E_{θ*}[ sup_{θ∈U} log⁺ γ_θ(Y_0) ] < ∞.

(S4) Upper semi-continuity of the likelihood. For each θ′ ∉ [θ*] and n ≥
0, the function θ ↦ p_θ(Y_0^n) is upper semi-continuous at θ′, P_{θ*}-a.s.

(S5) Mixing condition. There exists ℓ ≥ 0 such that for each m ≥
0, there exists a measurable function C_m : Θ × Y^{m+1} → R_+ such that if t ≥ 0 and w_0, ..., w_t ∈ Y^{m+1}, then

∫ ∏_{j=0}^t p_θ(w_j | T_θ^{j(m+ℓ)} x) dµ_θ(x) ≤ ∏_{j=0}^t C_m(θ, w_j) ∏_{j=0}^t p_θ(w_j).

Furthermore, for each θ′ ∉ [θ*], there exists a neighborhood U of θ′ such that

sup_m E_{θ*}[ sup_{θ∈U} log C_m(θ, Y_0^m) ] < ∞.

(S6) Exponential identifiability. For each θ ∉ [θ*], there exists a sequence of measurable sets A_n ⊂ Y^{n+1} such that

lim inf_n P_{θ*}^Y(A_n) > 0  and  lim sup_n (1/n) log P_θ^Y(A_n) < 0.
The following theorem is our main general result.
Theorem 3.1.
Suppose that (T_θ, µ_θ)_{θ∈Θ} is a parametrized family of dynamical systems on (X, X) with corresponding observation densities (g_θ)_{θ∈Θ}. If conditions (S1)–(S6) hold, then any approximate MLE is consistent at θ*.

The proof of Theorem 3.1 is given in Section 6. In the following remark, we discuss conditions (S1)–(S6).
Remark 3.2.
Conditions (S1)–(S3) involve basic irreducibility and integrability conditions, and similar conditions have appeared in previous work on consistency of maximum likelihood estimation for HMMs; see, for example, [15, 29]. Taken together, conditions (S1) and (S2) ensure the almost sure existence and finiteness of the entropy rate for the process (Y_n),

h(θ*) = lim_n (1/n) log p_{θ*}(Y_0^n).

Condition (S3) serves as a basic integrability condition in the proof of Theorem 3.1, in which one must essentially show that for θ ∉ [θ*],

lim sup_n (1/n) log p_θ(Y_0^n) < h(θ*).

Conditions (S4)–(S6) are more interesting from the point of view of dynamical systems, and we discuss them in greater detail below.

The upper semi-continuity of the likelihood (S4) is closely related to the continuity of the map θ ↦ µ_θ. In general, the continuous dependence of µ_θ on θ places nontrivial restrictions on a family of dynamical systems. This property (continuity of θ ↦ µ_θ) is often called "statistical stability" in the dynamical systems and ergodic theory literature, and it has been studied for some families of systems; for example, see [2, 17, 42, 47] and references therein. In Section 4.1, we show how statistical stability of the family of dynamical systems may be used to establish the upper semi-continuity of the likelihood (S4).

The mixing condition (S5) involves control of the correlations of the observation densities along trajectories of the underlying dynamical system. Although the general topic of decay of correlations has been widely studied in dynamical systems (see [3] for an overview), condition (S5) is not implied by the particular decay of correlations properties that are typically studied for dynamical systems. Nonetheless, we show in Section 4.2 how some well-studied mixing properties of dynamical systems imply the mixing condition (S5).

Finally, condition (S6) involves the exponential identifiability of the true parameter θ*.
We show in Section 4.3 how large deviations for a family of dynamical systems may be used to establish exponential identifiability (S6). Large deviations estimates for dynamical systems have been studied in [41, 50], and our main goal in Section 4.3 is to connect such results to exponential identifiability (S6).

Remark 3.3.
Suppose one has a family of bivariate stochastic processes {(X_k^θ, Y_k^θ) : θ ∈ Θ}, where (X_k^θ) is interpreted as a hidden process and (Y_k^θ) as an observation process. If the observations have conditional densities with respect to a common measure given (X_k^θ) and θ, then it makes sense to ask whether maximum likelihood estimation is a consistent method of inference for the parameter θ.

It is well known that the setting of stationary stochastic processes may be translated into the deterministic setting of dynamical systems, which may be carried out as follows. Let {(X_k^θ) : θ ∈ Θ} be a family of stationary stochastic processes on a measurable space (X, X). Consider the product space X̂ = X^Z with corresponding σ-algebra X̂. Each process (X_k^θ) corresponds to a probability measure µ_θ on (X̂, X̂) with the property that µ_θ is invariant under the left-shift map T : X̂ → X̂ given by x = (x_i)_i ↦ T(x) = (x_{i+1})_i. With this translation, Theorem 3.1 shows that maximum likelihood estimation is consistent for families of hidden stochastic processes (X_k^θ) observed with noise, whenever the corresponding family of dynamical systems (T, µ_θ) on (X̂, X̂) with observation densities satisfies conditions (S1)–(S6).

With the above translation, Theorem 3.1 applies to some families of processes allowing infinite-range dependence in both the hidden process (X_k^θ) and the observation process (Y_k^θ). From this point of view, Theorem 3.1 highlights the fact that maximum likelihood estimation is consistent for dependent processes observed with noise as long as they satisfy some general conditions: ergodicity, logarithmic integrability of observations, continuous dependence on the parameters and some mixing of the observation process. It is interesting to note that the existing work on consistency of maximum likelihood estimation for HMMs [11, 13, 15, 16, 18, 29–31] makes assumptions of precisely this sort in the specific context of Markov chains.
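The translation above can be illustrated in a few lines (a toy sketch in which finite truncations of sequences stand in for points of the product space; all names are illustrative). The hidden process is recovered by reading off the zero coordinate of the iterates of the left shift.

```python
def left_shift(x):
    """The left-shift map T on (truncated) sequences: T(x)_i = x_{i+1}."""
    return x[1:]

def coordinate_process(x, n):
    """Read off X_k = (T^k x)_0 for k = 0, ..., n-1."""
    out = []
    for _ in range(n):
        out.append(x[0])
        x = left_shift(x)
    return out

# A truncated sample path of some stationary process:
path = (4, 7, 1, 1, 8, 2, 5)
print(coordinate_process(path, 3))   # prints [4, 7, 1]
```

If the path is distributed according to a shift-invariant measure µ_θ, the coordinate process is exactly the stationary process (X_k^θ) described above.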
4. Statistical properties of dynamical systems.
In our main consistency result (Theorem 3.1), we establish the consistency of any approximate MLE under conditions (S1)–(S6). We have chosen to formulate our result in these terms because they reflect general statistical properties of dynamical systems observed with noise that are relevant to parameter inference. However, these conditions have not been explicitly studied in the dynamical systems literature, despite the fact that much effort has been devoted to understanding certain statistical aspects of dynamical systems. In this section, we make connections between the general statistical conditions appearing in Theorem 3.1 and some well-studied properties of dynamical systems. Section 4.1 shows how the notion of statistical stability may be used to verify the upper semi-continuity of the likelihood (S4). Section 4.2 connects well-known mixing properties of some measure-preserving dynamical systems to the mixing property (S5). In Section 4.3, we show how large deviations for dynamical systems may be used to deduce the exponential identifiability condition (S6). Proofs of statements in this section, as well as additional discussion, appear in Supplementary Appendix A [32].

4.1.
Statistical stability and continuity of p_θ. As discussed in Remark 3.2, the upper semi-continuity condition (S4) places nontrivial restrictions on the family of dynamical systems under consideration. In this section, we establish sufficient conditions for (S4) to hold. The continuous dependence of µ_θ on θ is a property called statistical stability in the dynamical systems literature [2, 17, 42, 47]. Let us state this property precisely. Let M(X) denote the space of Borel probability measures on X. Endow M(X) with the topology of weak convergence: µ_n converges to µ if ∫ f dµ_n converges to ∫ f dµ as n tends to infinity, for each continuous, bounded function f : X → R. The family of dynamical systems (T_θ, µ_θ)_{θ∈Θ} on (X, X) is said to have statistical stability if the map θ ↦ µ_θ is continuous with respect to the weak topology on M(X).

The following proposition shows that under some continuity and compactness assumptions, statistical stability of the family of dynamical systems implies upper semi-continuity of the likelihood (S4).

Proposition 4.1.
Suppose that X and Θ are compact, and the maps T : Θ × X → X and g : Θ × X × Y → R_+ are continuous. If the family (T_θ, µ_θ)_{θ∈Θ} has statistical stability, then upper semi-continuity of the likelihood (S4) holds.

The proof of Proposition 4.1 appears in Supplementary Appendix A.1 [32].

4.2. Mixing.
In this section, we focus on the mixing condition (S5). Recall that (S5) involves a nontrivial restriction on the correlations of the observation densities g_θ along trajectories of the underlying dynamical system. Although mixing conditions have been widely studied in the dynamics literature, the particular type of condition appearing in (S5) appears not to have been investigated. Nonetheless, we show that a well-studied mixing property for dynamical systems implies the statistical mixing property (S5).

In order to study mixing for dynamical systems, one typically places restrictions on the type of events or observations that one considers (by considering certain functionals of the process). For example, in some situations a substantial amount of work has been devoted to finding particular partitions of the state space with respect to which the system possesses good mixing properties; an example of such partitions is given by the well-known Markov partitions [9]. If a system has good mixing properties with respect to a particular partition, and if that partition possesses certain (topological) regularity properties, then it is often possible to show that the system also has good mixing properties for related function classes, such as Lipschitz or Hölder continuous observables. For variations of this approach to mixing in dynamical systems, see the vast literature on decay of correlations; for an introduction, see the survey [3].

In this section, we follow the above approach to study the mixing condition (S5) for dynamical systems observed with noise. First, we define a mixing property for families of dynamical systems with respect to a partition (M1). Second, we define a regularity property for partitions (M2).
Third, we define a topological regularity property for a family of observation densities (M3). Finally, in the main result of this section (Proposition 4.2), we show how these three properties together imply the mixing condition (S5).

Here and in the rest of this section, we consider only invertible transformations. It is certainly possible to modify the definitions slightly to handle the noninvertible case, but we omit such modifications.

We will need to consider finite partitions of X. The join of two partitions C_1 and C_2 is defined to be the common refinement of C_1 and C_2, and it is denoted C_1 ∨ C_2. Note that for any measurable transformation T : X → X, if C is a partition, then so is T^{-1}C = {T^{-1}A : A ∈ C}. For a fixed partition C and i ≤ j, let C_i^j = ⋁_{k=i}^{j} T_θ^{-k} C. Notice that C_i^j depends on θ through T_θ, although we suppress this dependence in our notation. Now consider the following alternative conditions, which may be used in place of condition (S5):

(M1) Mixing condition with respect to the partition C. There exist L : Θ → R_+ and ℓ ≥ 1 such that for all θ ∈ Θ, m, n ≥ 0, A ∈ C_0^m and B ∈ C_0^n, it holds that

    μ_θ(A ∩ T_θ^{-(m+ℓ)} B) ≤ L_θ μ_θ(A) μ_θ(B).

Furthermore, for each θ′ ∉ [θ_0] there exists a neighborhood U of θ′ such that sup_{θ∈U} L_θ < ∞.

(M2) Regularity of the partition C. There exists β ∈ (0, 1) such that for all θ ∈ Θ and m, n ≥ 0, if A ∈ C_{-m}^{n} and x, z ∈ A, then d(x, z) ≤ β^{min(m,n)}.

(M3) Regularity of observations. There exists a function K : Θ × Y → R_+ such that for all y ∈ Y and x, z ∈ X,

    g_θ(y | x) ≤ g_θ(y | z) exp(K(θ, y) d(x, z)).

Furthermore, for each θ′ ∉ [θ_0], there exists a neighborhood U of θ′ such that

    E_{θ_0}[ sup_{θ∈U} K(θ, Y_0) ] < ∞.

Let us now state the main proposition of this section, whose proof is deferred to Supplementary Appendix A.2 [32].
Proposition 4.2.
Suppose (T_θ, μ_θ)_{θ∈Θ} is a family of dynamical systems on (X, 𝒳) with corresponding observation densities (g_θ)_{θ∈Θ}. If there exists a partition C of X such that conditions (M1) and (M2) are satisfied, and if the observation regularity condition (M3) is satisfied, then mixing property (S5) holds.

4.3. Exponential identifiability.
In this section, we study exponential identifiability condition (S6). We show how large deviations for dynamical systems may be used in combination with some regularity of the observation densities to establish exponential identifiability (S6).

Let X_1 and X_2 be metric spaces with metrics d_1 and d_2, respectively. Recall that a function f : X_1 → X_2 is said to be Hölder continuous if there exist α > 0 and C > 0 such that for all x, z in X_1, it holds that d_2(f(x), f(z)) ≤ C d_1(x, z)^α. If (T, μ) is a dynamical system on (X, 𝒳) such that T : X → X is Hölder continuous, then we refer to (T, μ) as a Hölder continuous dynamical system. For many dynamical systems, the class of Hölder continuous functions f : X → R provides a natural class of observables whose statistical properties are fairly well understood and which satisfy some large deviations estimates [41, 50].

Consider the following conditions, which we later show are sufficient to guarantee exponential identifiability (S6):

(L1) Large deviations.
For each θ ∉ [θ_0], for each Hölder continuous function f : X → R, and for each δ > 0, it holds that

    limsup_n (1/n) log μ_θ( | (1/n) Σ_{k=0}^{n-1} f(T_θ^k(x)) − ∫ f dμ_θ | > δ ) < 0.

(L2) Regularity of observations.
There exist α > 0 and K : Θ × Y → R_+ such that for each x and z in X, it holds that

    g_θ(y | x) ≤ g_θ(y | z) exp(K(θ, y) d(x, z)^α).

Furthermore, for each θ ∈ Θ and C > 0, it holds that

    sup_x ∫ exp(C K(θ, y)) g_θ(y | x) dν(y) < ∞.

The following proposition relates large deviations for dynamical systems to the exponential identifiability condition (S6).
Proposition 4.3.
Suppose that (T_θ, μ_θ)_{θ∈Θ} is a family of Hölder continuous dynamical systems on (X, 𝒳) with corresponding observation densities (g_θ)_{θ∈Θ}. Further suppose that the large deviations property (L1) and the observation regularity property (L2) are satisfied. Then the exponential identifiability condition (S6) holds.

The proof of Proposition 4.3 appears in Supplementary Appendix A.3 [32].
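As a quick numerical illustration of the kind of estimate required in (L1), the sketch below simulates Birkhoff averages of an observable along a simple ergodic two-state Markov chain (a hypothetical toy stand-in for (T_θ, μ_θ); the chain, observable, and sample sizes are our own illustrative choices, not taken from the paper) and checks that the empirical probability of a large deviation of the time average shrinks as the trajectory length grows.

```python
import random

# Toy ergodic two-state Markov chain (hypothetical stand-in for (T_theta, mu_theta)).
P = {0: [(0, 0.7), (1, 0.3)], 1: [(0, 0.4), (1, 0.6)]}
f = {0: 0.0, 1: 1.0}      # observable; the stationary mean of f is 3/7
MEAN = 3.0 / 7.0

def step(x, rng):
    """One step of the chain from state x."""
    u = rng.random()
    acc = 0.0
    for nxt, p in P[x]:
        acc += p
        if u < acc:
            return nxt
    return P[x][-1][0]

def deviation_prob(n, delta, trials, rng):
    """Monte Carlo estimate of mu(|S_n f / n - mean| > delta)."""
    bad = 0
    for _ in range(trials):
        x, s = 0, 0.0
        for _ in range(n):
            s += f[x]
            x = step(x, rng)
        if abs(s / n - MEAN) > delta:
            bad += 1
    return bad / trials

rng = random.Random(0)
p_short = deviation_prob(20, 0.15, 4000, rng)
p_long = deviation_prob(160, 0.15, 4000, rng)
print(p_short, p_long)  # the large-deviation probability should drop sharply with n
```

Condition (L1) asks for more than this decay alone (an exponential rate, uniformly over Hölder observables), but the simulation conveys the shape of the estimate.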
5. Examples.
In this section we present some classical families of dynamical systems for which maximum likelihood estimation is consistent. We begin in Section 5.1 by considering symbolic dynamical systems called shifts of finite type. The state space for such systems consists of (bi-)infinite sequences of symbols from a finite set, and the transformation on the state space is always given by the “left-shift” map, which just shifts each point one coordinate to the left. Such systems are considered models of “chaotic” dynamical systems that may be defined by a finite amount of combinatorial information. In this setting Gibbs measures form a natural class of invariant measures, which have been studied due to their connections to statistical physics. These measures play a central role in a topic called the thermodynamic formalism, which is well described in the books [10, 43]. Note that kth order finite state Markov chains form a special case of Gibbs measures. The main result of this section is Theorem 5.1, which states that under sufficient regularity conditions, any approximate maximum likelihood estimator is consistent for families of Gibbs measures on a shift of finite type. The crucial assumptions for this theorem involve continuous dependence of the Gibbs measures on θ and sufficiently regular dependence of g_θ(y | x) on x. Additional proofs and discussion for this section appear in the Supplementary Appendix B [32].

Having established consistency of maximum likelihood estimation for families of Gibbs measures on a shift of finite type, we deduce in Section 5.2 that maximum likelihood estimation is consistent for families of Axiom A attractors observed with noise. Axiom A systems are well studied differentiable dynamical systems on manifolds that, like shifts of finite type, exhibit “chaotic” behavior; for a thorough treatment of Axiom A systems, see the book [10]. In related statistical work, Lalley [25] considered the problem of denoising the trajectories of Axiom A systems.
For these systems, there is a natural class of measures, known as SRB (Sinai–Ruelle–Bowen) measures. See the article [52] for an introduction to these measures with discussion of their interpretation and importance. With the construction of Markov partitions [9, 10], one may view an Axiom A attractor with its SRB measure as a factor of a shift of finite type with a Gibbs measure. Using this natural factor structure, we establish the consistency of any approximate maximum likelihood estimator for Axiom A systems. Proofs and discussion of these topics appear in the Supplementary Appendix C [32].

5.1. Gibbs measures.
In this section, we consider the setting of symbolic dynamics, shifts of finite type and Gibbs measures. We prove that any approximate maximum likelihood estimator is consistent for these systems (Theorem 5.1) under some general assumptions on the observations. Finally, we consider two examples of observations in greater detail. In the first example, we consider “discrete” observations, corresponding to a “noisy channel.” In the second example, we consider making real-valued observations with Gaussian observational noise. For a brief introduction to shifts of finite type and Gibbs measures that contains everything needed in this work, see the Supplementary Appendix B [32]. For a complete introduction to shifts of finite type and Gibbs measures, see [10].

Let us now consider some families of measure-preserving systems on SFTs. Let A be an alphabet, and let M be a binary matrix with dimensions |A| × |A|. Let X = X_M be the associated SFT, and let 𝒳 be the Borel σ-algebra on X. For α > 0, let f : Θ → C^α(X) be a continuous map, and let μ_θ be the Gibbs measure associated to the potential function f_θ. In this setting, we refer to (μ_θ)_{θ∈Θ} as a continuously parametrized family of Gibbs measures on (X, 𝒳).

Theorem 5.1.
Suppose X = X_M is a mixing shift of finite type and (μ_θ)_{θ∈Θ} is a continuously parametrized family of Gibbs measures on (X, 𝒳). If the family of observation densities (g_θ)_{θ∈Θ} satisfies the integrability conditions (S2) and (S3) and the regularity conditions (M3) and (L2), then any approximate maximum likelihood estimator is consistent.

The proof of Theorem 5.1 is based on an appeal to Theorem 3.1. However, in order to verify the hypotheses of Theorem 3.1, we combine the results of Section 4 with some well-known properties of Gibbs measures. This proof appears in the Supplementary Appendix B [32].
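Theorem 5.1 requires the SFT to be mixing, which for X_M is equivalent to the transition matrix M being primitive: some power of M has all entries positive. As a small illustrative sketch (the matrices below, including the golden-mean shift, are standard textbook examples rather than ones taken from this paper), this check can be automated:

```python
def is_primitive(M, max_power=None):
    """Check whether a 0-1 transition matrix M is primitive, i.e., some power
    of M is entrywise positive; this characterizes mixing SFTs."""
    n = len(M)
    if max_power is None:
        # Wielandt's bound: a primitive matrix is positive by power n^2 - 2n + 2.
        max_power = n * n - 2 * n + 2
    P = [row[:] for row in M]
    for _ in range(max_power):
        if all(all(e > 0 for e in row) for row in P):
            return True
        # boolean matrix product P = P * M
        P = [[int(any(P[i][k] and M[k][j] for k in range(n)))
              for j in range(n)] for i in range(n)]
    return all(all(e > 0 for e in row) for row in P)

golden_mean = [[1, 1],
               [1, 0]]   # forbids the word "11"; this SFT is mixing
period_two = [[0, 1],
              [1, 0]]    # irreducible but periodic, hence not mixing

print(is_primitive(golden_mean), is_primitive(period_two))  # -> True False
```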
Remark 5.2.
There is an analogous theory of “one-sided” symbolic dynamics and Gibbs measures, in which A^Z is replaced by A^N and appropriate modifications are made in the definitions. The two-sided case deals with invertible dynamical systems, whereas the one-sided case handles noninvertible systems. We have stated Theorem 5.1 in the invertible setting, although it applies as well in the noninvertible setting, with the obvious modifications.

Example 5.3.
In this example, we consider families of dynamical systems (T_θ, μ_θ) on (X, 𝒳), where X is a mixing shift of finite type, T_θ = σ|_X, and μ_θ is a continuous family of Gibbs measures on X (as in Theorem 5.1). Here we consider the particular observation model in which our observations of X are passed through a discrete, memoryless, noisy channel. Suppose that Y is a finite set, ν is counting measure on Y and for each symbol a in A and parameter θ in Θ, we have a probability distribution π_θ(·|a) on Y. We consider the case that our observation densities g_θ satisfy g_θ(·|x) = π_θ(·|x_0). This situation is covered by Theorem 5.1, since the following conditions may be easily verified: observation integrability (S2) and (S3) and observation regularity (M3) and (L2).

Example 5.4.
In this example, we once again consider families of dynamical systems (T_θ, μ_θ) on (X, 𝒳), such that X is a mixing shift of finite type, T_θ = σ|_X and μ_θ is a continuous family of Gibbs measures on X (as in Theorem 5.1). Here we consider the particular observation model in which we make real-valued, parameter-dependent measurements of the system, which are corrupted by Gaussian noise with parameter-dependent variance. More precisely, let us assume that Y = R, and there exist a Lipschitz continuous φ : Θ × X → R and a continuous s : Θ → (0, ∞) such that

    g_θ(y | x) = (1 / (s(θ) √(2π))) exp( −(φ_θ(x) − y)^2 / (2 s(θ)^2) ).

We now proceed to verify conditions (S2), (S3), (M3) and (L2). First, by compactness and continuity, there exist C_1, C_2, C_3 > 0 such that for all θ in Θ, y in Y and x in X, it holds that

(5.1)    C_1^{-1} exp(−C_2 y^2) ≤ g_θ(y | x) ≤ C_1 exp(−C_3 y^2).

From (5.1), one easily obtains the observation integrability conditions (S2) and (S3). Furthermore, there exist C_4, C_5 > 0 such that for all x, z ∈ X, it holds that

(5.2)    g_θ(y | x) / g_θ(y | z)
         = exp( −(1 / (2 s(θ)^2)) [(φ_θ(x) − y)^2 − (φ_θ(z) − y)^2] )
         = exp( −(1 / (2 s(θ)^2)) [(φ_θ(x) − φ_θ(z))(φ_θ(x) + φ_θ(z)) + 2y(φ_θ(z) − φ_θ(x))] )
         ≤ exp( (C_4 + C_5 |y|) |φ_θ(x) − φ_θ(z)| ).

Let φ be Lipschitz continuous with constant C_6, and let K(θ, y) = C_6 (C_4 + C_5 |y|). With this choice of K and (5.2), one may easily verify the observation regularity conditions (M3) and (L2).
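To see the observation models of Examples 5.3 and 5.4 in action, the sketch below simulates the discrete-channel model of Example 5.3: the golden-mean SFT with its Parry (Markov) measure, observed through a binary symmetric channel with flip probability θ, with the exact log-likelihood evaluated on a grid of channel parameters via the standard scaled HMM forward recursion. Every concrete choice here (the chain, the channel, the grid, the sample size) is a hypothetical toy for illustration; this is not an algorithm from the paper.

```python
import math
import random

# Golden-mean SFT with its Parry (Markov) measure: a chain on {0, 1}
# in which the word "11" is forbidden; phi is the golden ratio.
phi = (1 + 5 ** 0.5) / 2
P = [[1 / phi, 1 / phi ** 2],
     [1.0, 0.0]]                                        # transition probabilities
pi0 = [phi ** 2 / (1 + phi ** 2), 1 / (1 + phi ** 2)]   # stationary law

def simulate(n, theta, rng):
    """Sample n noisy observations: each hidden bit is flipped w.p. theta."""
    x = 0 if rng.random() < pi0[0] else 1
    ys = []
    for _ in range(n):
        ys.append(x if rng.random() > theta else 1 - x)
        x = 0 if rng.random() < P[x][0] else 1
    return ys

def log_likelihood(ys, theta):
    """(1/n) log p_theta(y_0^{n-1}) via the scaled HMM forward recursion."""
    b = lambda x, y: 1 - theta if x == y else theta     # channel density
    alpha = [pi0[x] * b(x, ys[0]) for x in (0, 1)]
    ll = 0.0
    for y in ys[1:]:
        s = alpha[0] + alpha[1]
        ll += math.log(s)
        alpha = [a / s for a in alpha]
        alpha = [(alpha[0] * P[0][x] + alpha[1] * P[1][x]) * b(x, y)
                 for x in (0, 1)]
    ll += math.log(alpha[0] + alpha[1])
    return ll / len(ys)

rng = random.Random(7)
ys = simulate(4000, 0.2, rng)
grid = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40]
theta_hat = max(grid, key=lambda t: log_likelihood(ys, t))
print(theta_hat)  # the grid maximizer should land near the true value 0.2
```

The grid maximizer is exactly an approximate maximum likelihood estimator in the sense of Theorem 5.1, and with moderate sample sizes it already concentrates near the true channel parameter.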
Remark 5.5.
Similar calculations to those in Example 5.4 imply that any approximate maximum likelihood estimator is also consistent if the observational noise is “double-exponential” [i.e., g_θ(y | x) ∝ e^{−|y − x|}]. Indeed, these calculations should hold for most members of the exponential family, although we do not pursue them here.

5.2. Axiom A systems.
In this section, we show how the previous results may be applied to some smooth (differentiable) families of dynamical systems. These results follow easily from the results in Section 5.1, using the work of Bowen and others (see [9, 10] and references therein) in constructing Markov partitions for these systems. With Markov partitions, Axiom A systems may be viewed as factors of shifts of finite type with Gibbs measures. For a brief introduction to Axiom A systems that contains the details necessary for this work, see the Supplementary Appendix C [32].

The basic fact that allows us to transfer our results from shifts of finite type to Axiom A systems is that consistency of maximum likelihood estimation is preserved under taking appropriate factors. Let us now make this statement precise. Suppose that (T_θ, μ_θ)_{θ∈Θ} is a family of dynamical systems on (X, 𝒳) with observation densities (g_θ)_{θ∈Θ}. Further, suppose that there are continuous maps π : Θ × X̃ → X and T̃ : Θ × X̃ → X̃ such that:

(i) for each θ, we have that π_θ ∘ T̃_θ = T_θ ∘ π_θ;
(ii) for each θ, there is a unique probability measure μ̃_θ on X̃ such that μ̃_θ ∘ π_θ^{-1} = μ_θ;
(iii) for each θ, the map π_θ is injective μ̃_θ-a.s.

For x in X̃ and θ in Θ, define g̃_θ(·|x) = g_θ(·|π_θ(x)). Then (T̃_θ, μ̃_θ)_{θ∈Θ} is a family of dynamical systems on (X̃, 𝒳̃) with observation densities (g̃_θ)_{θ∈Θ}. In this situation, we say that (T_θ, μ_θ, g_θ)_{θ∈Θ} is an isomorphic factor of (T̃_θ, μ̃_θ, g̃_θ)_{θ∈Θ}, and π is the factor map. The following proposition addresses the consistency of maximum likelihood estimation for isomorphic factors. Its proof is straightforward and omitted.

Proposition 5.6.
Suppose that (T_θ, μ_θ, g_θ)_{θ∈Θ} is an isomorphic factor of (T̃_θ, μ̃_θ, g̃_θ)_{θ∈Θ}. Then maximum likelihood estimation is consistent for (T_θ, μ_θ, g_θ)_{θ∈Θ} if and only if maximum likelihood estimation is consistent for (T̃_θ, μ̃_θ, g̃_θ)_{θ∈Θ}.

For the sake of brevity, we defer precise definitions for Axiom A systems to Supplementary Appendix C [32]. We consider families of Axiom A systems as follows. Suppose that f : Θ × X → X is a parametrized family of diffeomorphisms such that:

(i) θ ↦ f_θ is Hölder continuous;
(ii) there exists α > 1 such that for each θ, the map f_θ is C^α;
(iii) for each θ, Ω(f_θ) is an Axiom A attractor and the restriction f_θ|_{Ω(f_θ)} is topologically mixing;
(iv) for each θ, the measure μ_θ is the unique SRB measure corresponding to f_θ ([10], Theorem 4.1).

If these conditions are satisfied, then we say that (f_θ, μ_θ)_{θ∈Θ} is a parametrized family of Axiom A systems on (X, 𝒳).

Theorem 5.7.
Suppose that (f_θ, μ_θ)_{θ∈Θ} is a parametrized family of Axiom A systems on (X, 𝒳). Further, suppose that (g_θ)_{θ∈Θ} is a family of observation densities satisfying the following conditions: observation integrability (S2) and (S3) and observation regularity (M3) and (L2). Then maximum likelihood estimation is consistent.

The proof of Theorem 5.7 appears in the Supplementary Appendix C [32].
6. Proof of the main result.
Propositions 6.1–6.5 are used in the proof of Theorem 3.1, which is given at the end of the present section.
Proposition 6.1.
Suppose that condition (S1) (ergodicity) holds. Then the process (Y_k) is ergodic under P^Y_{θ_0}.

Proof.
Let m > 0, and let A and B be Borel subsets of Y^{m+1}. To obtain the ergodicity of {Y_k}_k, it suffices to show that (see [36])

(6.1)    lim_n (1/n) Σ_{k=0}^{n} P^Y_{θ_0}(Y_0^m ∈ A, Y_k^{k+m} ∈ B) = P^Y_{θ_0}(Y_0^m ∈ A) P^Y_{θ_0}(Y_0^m ∈ B).

For x ∈ X, define

    η_A(x) = ∫ 1_A(y_0^m) p_{θ_0}(y_0^m | x) dν^m(y_0^m),

and define η_B(x) similarly. For k > m, by the conditional independence of Y_0^m and Y_k^{k+m} given X_0 = x, we have that

    P^Y_{θ_0}(Y_0^m ∈ A, Y_k^{k+m} ∈ B)
    = ∫ ∫ 1_A(y_0^m) 1_B(y_k^{k+m}) p_{θ_0}(y_0^{k+m} | x) dν^{k+m}(y_0^{k+m}) dμ_{θ_0}(x)
    = ∫ ( ∫ 1_A(y_0^m) p_{θ_0}(y_0^m | x) dν^m(y_0^m) × ∫ 1_B(y_k^{k+m}) p_{θ_0}(y_k^{k+m} | T_{θ_0}^k(x)) dν^m(y_k^{k+m}) ) dμ_{θ_0}(x)
    = ∫ η_A(x) η_B(T_{θ_0}^k(x)) dμ_{θ_0}(x),

where we have used Fubini's theorem. Since m is fixed, we have that

    lim_n (1/n) Σ_{k=0}^{n} P^Y_{θ_0}(Y_0^m ∈ A, Y_k^{k+m} ∈ B)
    = lim_n ( (1/n) Σ_{k=0}^{m} P^Y_{θ_0}(Y_0^m ∈ A, Y_k^{k+m} ∈ B) + (1/n) Σ_{k=m+1}^{n} ∫ η_A(x) η_B(T_{θ_0}^k(x)) dμ_{θ_0}(x) )
    = lim_n (1/n) Σ_{k=m+1}^{n} ∫ η_A(x) η_B(T_{θ_0}^k(x)) dμ_{θ_0}(x).

Since (T_{θ_0}, μ_{θ_0}) is ergodic, an alternative characterization of ergodicity (see [36]) gives that

    lim_n (1/n) Σ_{k=0}^{n} P^Y_{θ_0}(Y_0^m ∈ A, Y_k^{k+m} ∈ B)
    = lim_n (1/n) Σ_{k=m+1}^{n} ∫ η_A(x) η_B(T_{θ_0}^k(x)) dμ_{θ_0}(x)
    = ∫ η_A(x) dμ_{θ_0}(x) ∫ η_B(x) dμ_{θ_0}(x)
    = P^Y_{θ_0}(Y_0^m ∈ A) P^Y_{θ_0}(Y_0^m ∈ B).

Thus we have verified equation (6.1), and the proof is complete. □
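The Cesàro convergence (6.1) can be observed concretely in a finite-state example. The sketch below (a hypothetical two-state hidden Markov chain of our own choosing, not a system from the paper) computes the joint laws P(Y_0 = a, Y_k = b) exactly via matrix powers and checks that their Cesàro averages approach the product of the marginals.

```python
# Two-state chain with emission probabilities; all numerical choices hypothetical.
P = [[0.9, 0.1],
     [0.2, 0.8]]                 # transition matrix
B = [[0.75, 0.25],
     [0.3, 0.7]]                 # B[x][y] = P(Y = y | X = x)
pi = [2.0 / 3.0, 1.0 / 3.0]      # stationary distribution: pi = pi P

def matmul(A, C):
    return [[sum(A[i][k] * C[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def joint(k, a, b):
    """P(Y_0 = a, Y_k = b) computed exactly from the k-step transition law."""
    Pk = [[1.0, 0.0], [0.0, 1.0]]
    for _ in range(k):
        Pk = matmul(Pk, P)
    return sum(pi[x0] * B[x0][a] * Pk[x0][xk] * B[xk][b]
               for x0 in range(2) for xk in range(2))

marg = lambda a: sum(pi[x] * B[x][a] for x in range(2))
a, b = 0, 1
target = marg(a) * marg(b)
cesaro = lambda n: sum(joint(k, a, b) for k in range(n)) / n
err_small = abs(cesaro(5) - target)
err_large = abs(cesaro(200) - target)
print(err_small, err_large)  # the Cesaro-average error shrinks as n grows
```

Here the decorrelation is even geometric (the second eigenvalue of P is 0.7), which is stronger than the Cesàro convergence that ergodicity alone provides.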
For the following propositions, recall our notation that γ_θ(y) = sup_x g_θ(y | x).
Suppose that conditions (S1) and (S2) hold. Then there exists h(θ_0) ∈ (−∞, ∞) such that

    h(θ_0) = lim_n E_{θ_0}( (1/n) log p_{θ_0}(Y_0^n) ).

Moreover, the following equality holds P_{θ_0}-a.s.:

    h(θ_0) = lim_n (1/n) log p_{θ_0}(Y_0^n).

Proof.
The proposition is a direct application of Barron's generalized Shannon–McMillan–Breiman theorem [4]. Here we simply check that the hypotheses of that theorem hold in our setting. Since condition (S1) (ergodicity) holds, Proposition 6.1 gives that (Y_k) is stationary and ergodic under P_{θ_0}. By definition, Y_0^n has density p_{θ_0}(Y_0^n) with respect to the σ-finite measure ν^n. The measure ν^n is a product of the measure ν taken n + 1 times. As such, the sequence {ν^n} clearly satisfies Barron's condition that this sequence is “Markov with stationary transitions.” Define D_n = E_{θ_0}(log p_{θ_0}(Y_0^{n+1})) − E_{θ_0}(log p_{θ_0}(Y_0^n)). Let us show that for n > 0, we have that

(6.2)    E_{θ_0}( |log p_{θ_0}(Y_0^n)| ) < ∞,

which clearly implies that −∞ < D_n < ∞. Once (6.2) is established, we will have verified all of the hypotheses of Barron's generalized Shannon–McMillan–Breiman theorem, and the proof of the proposition will be complete.

Observe that the first part of the integrability condition (S2) gives that

(6.3)    E_{θ_0}[log^+ p_{θ_0}(Y_0^n)] ≤ (n + 1) E_{θ_0}[log^+ γ_{θ_0}(Y_0)] < ∞.

Then the second part of the integrability condition (S2) implies that

(6.4)    E_{θ_0}[log p_{θ_0}(Y_0^n)]
         = E_{θ_0}[ log( p_{θ_0}(Y_0^n) / Π_{k=0}^{n} ∫ g_{θ_0}(Y_k | x) dμ_{θ_0}(x) ) ]
           + E_{θ_0}[ Σ_{k=0}^{n} log ∫ g_{θ_0}(Y_k | x) dμ_{θ_0}(x) ]
         ≥ −(n + 1) E_{θ_0}[ | log ∫ g_{θ_0}(Y_0 | x) dμ_{θ_0}(x) | ] > −∞,

where we have used that relative entropy is nonnegative. By (6.3) and (6.4), we conclude that (6.2) holds, which completes the proof. □

The following proposition is used in the proof of Theorem 3.1 to give an almost sure bound for the normalized log-likelihoods in terms of quantities involving only expectations.
Proposition 6.3.
Suppose that conditions (S1), (S3) and (S5) hold. Let ℓ be as in condition (S5). Then for θ′ ∉ [θ_0], there exists a neighborhood U of θ′ such that for each m > 0, the following inequality holds P_{θ_0}-a.s.:

    limsup_{n→∞} sup_{θ∈U} (1/n) log p_θ(Y_0^n)
    ≤ (1/(m + ℓ)) E_{θ_0}( sup_{θ∈U} log p_θ(Y_0^m) )
      + (ℓ/(m + ℓ)) E_{θ_0}( sup_{θ∈U} log^+ γ_θ(Y_0) )
      + (1/(m + ℓ)) E_{θ_0}( sup_{θ∈U} log C_m(θ, Y_0^m) ).
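The proof below parses the index set [0, n] into an offset block, alternating long and short blocks, and a remainder. As an illustrative sketch of this bookkeeping only (the function and variable names are ours, not the paper's), the construction can be written out and checked to be a partition:

```python
def parse_blocks(n, m, ell, s):
    """Partition {0, ..., n} into an offset B_s, long blocks I_s(j) of length m,
    short blocks J_s(j) of length ell, and a remainder E_s, following the
    block structure used in the proof of Proposition 6.3.
    Requires n > 2*(m + ell) and 0 <= s < m + ell."""
    assert n > 2 * (m + ell) and 0 <= s < m + ell
    k = (n - s) // (m + ell)                 # number of full (m + ell)-blocks
    B = list(range(0, s))
    I = [list(range(s + (m + ell) * (j - 1), s + (m + ell) * (j - 1) + m))
         for j in range(1, k + 1)]
    J = [list(range(s + (m + ell) * (j - 1) + m, s + (m + ell) * j))
         for j in range(1, k + 1)]
    E = list(range(s + k * (m + ell), n + 1))  # closed interval [s + k(m+ell), n]
    return B, I, J, E

# Sanity check: the pieces form a partition of {0, ..., n}.
n, m, ell, s = 100, 7, 3, 4
B, I, J, E = parse_blocks(n, m, ell, s)
flat = B + [i for blk in I for i in blk] + [i for blk in J for i in blk] + E
assert sorted(flat) == list(range(n + 1))
print(len(I), len(I[0]), len(J[0]))  # -> 9 7 3
```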
Informally, in the proof of Proposition 6.3, we use the mixing property from condition (S5) to parse a sequence of observations into alternating sequences of “large blocks” and “small blocks,” and then the ergodicity and integrability conditions finish the proof. More specifically, we break up the sequence of observations Y_0^n into alternating blocks of length m and ℓ, where ℓ is given by condition (S5).

Proof.
Let θ′ ∉ [θ_0]. Fix a neighborhood U of θ′ so that the conclusions of both condition (S3) and condition (S5) hold. Let m > 0, and let ℓ be as in condition (S5). We consider sequences of observations of length n, where n is a large integer. These sequences of observations will be parsed into alternating blocks of lengths m and ℓ, respectively, starting from an offset of size s and possibly ending with a remainder sequence. For the sake of notation, we use interval notation to denote intervals of integers. For n > 2(m + ℓ) and s in [0, m + ℓ), let R = R(s, m, ℓ, n) ∈ [0, m + ℓ) and k = k(s, m, ℓ, n) ≥ 1 be such that n = s + k(m + ℓ) + R. Then we partition [0, n] as follows:

    B_s = [0, s),
    I_s(j) = [s + (m + ℓ)(j − 1), s + (m + ℓ)(j − 1) + m)    for 1 ≤ j ≤ k,
    J_s(j) = [s + (m + ℓ)(j − 1) + m, s + (m + ℓ)j)          for 1 ≤ j ≤ k,
    E_s = [s + k(m + ℓ), n].

Given a sequence Y_0^n of observations, we define the following subsequences of Y_0^n according to the above partitions of [0, n]:

    b_s = Y|B_s,
    w_s(j) = Y|I_s(j)    for 1 ≤ j ≤ k,
    v_s(j) = Y|J_s(j)    for 1 ≤ j ≤ k,
    e_s = Y|E_s.

For a sequence y_0^t in Y^{t+1}, define

    γ_θ(y_0^t) = Π_{j=0}^{t} γ_θ(y_j) = Π_{j=0}^{t} sup_x g_θ(y_j | x).

Then for θ in U, it follows from condition (S5) that

    p_θ(Y_0^n) ≤ γ_θ(b_s) γ_θ(e_s) · Π_{j=1}^{k} γ_θ(v_s(j)) · Π_{j=1}^{k} C_m(θ, w_s(j)) · Π_{j=1}^{k} p_θ(w_s(j)).

Taking the logarithm of both sides and averaging over s in [0, m + ℓ), we obtain

(6.5)    log p_θ(Y_0^n) ≤ (1/(m + ℓ)) Σ_{s=0}^{m+ℓ−1} Σ_{j=1}^{k} [log p_θ(w_s(j)) + log C_m(θ, w_s(j))]
             + (1/(m + ℓ)) Σ_{s=0}^{m+ℓ−1} Σ_{j=1}^{k} log γ_θ(v_s(j))
             + (1/(m + ℓ)) Σ_{s=0}^{m+ℓ−1} [log γ_θ(b_s) + log γ_θ(e_s)].

Let us now take the supremum over θ in U in (6.5) and evaluate the limits of the three terms on the right-hand side as n tends to infinity. Let ξ_1 : Y^{m+1} → R and ξ_2 : Y^{m+1} → R be defined by

    ξ_1(y_0^m) = sup_{θ∈U} log p_θ(y_0^m),
    ξ_2(y_0^m) = sup_{θ∈U} log C_m(θ, y_0^m).

With this notation, we have that

    (1/n) Σ_{s=0}^{m+ℓ−1} Σ_{j=1}^{k} [ sup_{θ∈U} log p_θ(w_s(j)) + sup_{θ∈U} log C_m(θ, w_s(j)) ]
    = (1/n) Σ_{i=0}^{n−m} [ξ_1(Y_i^{i+m}) + ξ_2(Y_i^{i+m})].

Since (Y_k) is ergodic (by Proposition 6.1), it follows from Birkhoff's ergodic theorem and conditions (S3) and (S5) that the following limit exists P_{θ_0}-a.s.:

(6.6)    lim_n (1/n) Σ_{s=0}^{m+ℓ−1} Σ_{j=1}^{k} [ sup_{θ∈U} log p_θ(w_s(j)) + sup_{θ∈U} log C_m(θ, w_s(j)) ]
         = lim_n (1/n) Σ_{i=0}^{n−m} [ξ_1(Y_i^{i+m}) + ξ_2(Y_i^{i+m})]
         = E_{θ_0}[ξ_1(Y_0^m)] + E_{θ_0}[ξ_2(Y_0^m)]
         = E_{θ_0}[ sup_{θ∈U} log p_θ(Y_0^m) ] + E_{θ_0}[ sup_{θ∈U} log C_m(θ, Y_0^m) ].
Similarly, using Birkhoff's ergodic theorem and condition (S3), we have that the following holds P_{θ_0}-a.s.:

(6.7)    limsup_n (1/n) Σ_{s=0}^{m+ℓ−1} Σ_{j=1}^{k} sup_{θ∈U} log γ_θ(v_s(j))
         ≤ limsup_n (1/n) Σ_{i=0}^{n} sup_{θ∈U} log^+ γ_θ(Y_i^{i+ℓ−1})
         ≤ ℓ limsup_n (1/n) Σ_{i=0}^{n} sup_{θ∈U} log^+ γ_θ(Y_i)
         = ℓ E_{θ_0}[ sup_{θ∈U} log^+ γ_θ(Y_0) ].

Finally, Birkhoff's ergodic theorem and condition (S3) again imply that the following limit holds P_{θ_0}-a.s.:

(6.8)    lim_n (1/n) Σ_{s=0}^{m+ℓ−1} [ sup_{θ∈U} log^+ γ_θ(b_s) + sup_{θ∈U} log^+ γ_θ(e_s) ] = 0,

where we have used that max(|B_s|, |E_s|) ≤ m + ℓ. Combining the inequalities in (6.5)–(6.8), we obtain that

    limsup_{n→∞} sup_{θ∈U} (1/n) log p_θ(Y_0^n)
    ≤ (1/(m + ℓ)) E_{θ_0}[ sup_{θ∈U} log p_θ(Y_0^m) ]
      + (1/(m + ℓ)) E_{θ_0}[ sup_{θ∈U} log C_m(θ, Y_0^m) ]
      + (ℓ/(m + ℓ)) E_{θ_0}[ sup_{θ∈U} log^+ γ_θ(Y_0) ],

as desired. □

The following proposition is a direct application of Lemma 10 in [15] to the present setting, and we omit the proof.
Proposition 6.4.
Suppose that the following conditions hold: ergodicity (S1), logarithmic integrability at θ_0 (S2) and exponential identifiability (S6). Then for θ ∉ [θ_0], it holds that

    limsup_n (1/n) E_{θ_0}[log p_θ(Y_0^n)] < h(θ_0).

The following proposition provides an essential estimate in the proof of Theorem 3.1.
Suppose that conditions (S1)–(S6) hold, and let ℓ be as in (S5). Then for θ′ ∉ [θ_0], there exist m > 0 and a neighborhood U of θ′ such that

    h(θ_0) > (1/(m + ℓ)) E_{θ_0}[ sup_{θ∈U} log p_θ(Y_0^m) ]
             + (ℓ/(m + ℓ)) E_{θ_0}[ sup_{θ∈U} log^+ γ_θ(Y_0) ]
             + (1/(m + ℓ)) E_{θ_0}[ sup_{θ∈U} log C_m(θ, Y_0^m) ].

Proof.
Suppose θ′ ∉ [θ_0]. By Proposition 6.4, there exists ε > 0 such that

(6.9)    limsup_n (1/n) E_{θ_0}[log p_{θ′}(Y_0^n)] < h(θ_0) − ε.

By conditions (S3) (logarithmic integrability away from θ_0) and (S5) (mixing), there exist a neighborhood U′ of θ′ and m_0 > 0 such that for all m ≥ m_0, we have that

(6.10)    ε/3 > (ℓ/(m + ℓ)) E_{θ_0}[ sup_{θ∈U′} log^+ γ_θ(Y_0) ]
              + (1/(m + ℓ)) E_{θ_0}[ sup_{θ∈U′} log C_m(θ, Y_0^m) ].

Fix m ≥ m_0 such that

(6.11)    (1/(m + ℓ)) E_{θ_0}[log p_{θ′}(Y_0^m)] < limsup_n (1/n) E_{θ_0}[log p_{θ′}(Y_0^n)] + ε/3.

For η > 0, let B(θ′, η) denote the ball of radius η about θ′ in Θ. For η such that B(θ′, η) ⊂ U′, we have that

    sup_{θ∈B(θ′,η)} log p_θ(Y_0^m) ≤ Σ_{k=0}^{m} sup_{θ∈U′} log^+ γ_θ(Y_k).

The sum above is integrable with respect to P_{θ_0} and does not depend on η. Then (the reverse) Fatou's lemma implies that

    limsup_{η→0} E_{θ_0}[ sup_{θ∈B(θ′,η)} log p_θ(Y_0^m) ] ≤ E_{θ_0}[ limsup_{η→0} sup_{θ∈B(θ′,η)} log p_θ(Y_0^m) ].

By condition (S4) [upper semi-continuity of θ ↦ p_θ(Y_0^m)], we see that

    E_{θ_0}[ limsup_{η→0} sup_{θ∈B(θ′,η)} log p_θ(Y_0^m) ] ≤ E_{θ_0}[log p_{θ′}(Y_0^m)].
Now by an appropriate choice of η > 0, we have shown that there exists a neighborhood U ⊂ U′ of θ′ such that

(6.12)    (1/(m + ℓ)) E_{θ_0}[ sup_{θ∈U} log p_θ(Y_0^m) ] < (1/(m + ℓ)) E_{θ_0}[log p_{θ′}(Y_0^m)] + ε/3.

Combining estimates (6.9)–(6.12), we obtain the desired inequality. □

Proof of Theorem 3.1.
Let h(θ_0) be defined as in Proposition 6.2. We prove the theorem by showing the following statement: for each closed set C in Θ such that C ∩ [θ_0] = ∅, it holds that

(6.13)    limsup_n sup_{θ∈C} (1/n) log p_θ(Y_0^n) < h(θ_0).

Let C be a closed subset of Θ such that C ∩ [θ_0] = ∅. Since Θ is compact, C is compact. Suppose that for each θ′ ∈ C, there exists a neighborhood U of θ′ such that

(6.14)    limsup_n sup_{θ∈U∩C} (1/n) log p_θ(Y_0^n) < h(θ_0).

Then by compactness, we would conclude that (6.13) holds and thus complete the proof of the theorem.

Let θ′ be in C. Let us now show that there exists a neighborhood U of θ′ such that (6.14) holds. Since θ′ is in C, we have that θ′ ∉ [θ_0]. Let ℓ be as in (S5). By Proposition 6.5, there exist m > 0 and a neighborhood U′ of θ′ such that

(6.15)    h(θ_0) > (1/(m + ℓ)) E_{θ_0}[ sup_{θ∈U′} log p_θ(Y_0^m) ]
                  + (ℓ/(m + ℓ)) E_{θ_0}[ sup_{θ∈U′} log^+ γ_θ(Y_0) ]
                  + (1/(m + ℓ)) E_{θ_0}[ sup_{θ∈U′} log C_m(θ, Y_0^m) ].

By Proposition 6.3, there exists a neighborhood U ⊂ U′ of θ′ such that

(6.16)    limsup_{n→∞} sup_{θ∈U} (1/n) log p_θ(Y_0^n)
          ≤ (1/(m + ℓ)) E_{θ_0}[ sup_{θ∈U} log p_θ(Y_0^m) ]
            + (ℓ/(m + ℓ)) E_{θ_0}[ sup_{θ∈U} log^+ γ_θ(Y_0) ]
            + (1/(m + ℓ)) E_{θ_0}[ sup_{θ∈U} log C_m(θ, Y_0^m) ].

Combining (6.15) and (6.16), we obtain (6.14), which completes the proof of the theorem. □
7. Concluding remarks.
In this paper, we demonstrate how the properties of a family of dynamical systems affect the asymptotic consistency of maximum likelihood parameter estimation. We have exhibited a collection of general statistical conditions on families of dynamical systems observed with noise, and we have shown that under these general conditions, maximum likelihood estimation is a consistent method of parameter estimation. Furthermore, we have shown that these general conditions are indeed satisfied by some classes of well-studied families of dynamical systems. As mentioned in the Introduction, our results can be considered as a theoretical validation of the notion from dynamical systems that these classes of systems have “good” statistical properties.

However, there remain interesting families of systems to which our results do not apply, including some classes of systems that are also believed to have “good” statistical properties. In particular, the class of systems modeled by Young towers with exponential tail [51] has exponential decay of correlations and certain large deviations estimates [41]. These families include a positive measure set of maps from the quadratic family [x ↦ ax(1 − x)] and the Hénon family, as well as certain billiards and many other systems of physical and mathematical interest [51]. In short, the setting of systems modeled by Young towers with exponential tail provides a very attractive setting in which to consider consistency of maximum likelihood estimation. Unfortunately, our proof does not apply to systems in this setting in general, mainly due to the presence of the mixing condition (S5), which is not satisfied by these systems in general.

A natural next step might be to obtain rates of convergence and derive central limit theorems for maximum likelihood estimation. To this end, it might be possible to build on analogous results for HMMs [8, 23].
We leave these questions for future work.

SUPPLEMENTARY MATERIAL

Supplement to “Consistency of maximum likelihood estimation for some dynamical systems” (DOI: 10.1214/14-AOS1259SUPP; .pdf). We provide three technical appendices. In Appendix A, we present proofs of Propositions 4.1, 4.2 and 4.3. In Appendix B, we discuss shifts of finite type and Gibbs measures and prove Theorem 5.1. Finally, Appendix C contains definitions for Axiom A systems, as well as a proof of Theorem 5.7.

REFERENCES

[1] Adams, T. M. and Nobel, A. B. (2001). Finitary reconstruction of a measure preserving transformation. Israel J. Math.
[2] Alves, J. F., Carvalho, M. and Freitas, J. M. (2010). Statistical stability and continuity of SRB entropy for systems with Gibbs–Markov structures. Comm. Math. Phys.
[3] Baladi, V. (2001). Decay of correlations. In Smooth Ergodic Theory and Its Applications (Seattle, WA, 1999). Proc. Sympos. Pure Math.
[4] Barron, A. R. (1985). The strong ergodic theorem for densities: Generalized Shannon–McMillan–Breiman theorem. Ann. Probab.
[5] Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Ann. Math. Statist.
[6] Berliner, L. M. (1992). Statistics, probability and chaos. Statist. Sci.
[7] Bertsekas, D. P. and Shreve, S. E. (1978). Stochastic Optimal Control: The Discrete Time Case. Mathematics in Science and Engineering. Academic Press, New York. MR0511544
[8] Bickel, P. J., Ritov, Y. and Rydén, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. Ann. Statist.
[9] Bowen, R. (1970). Markov partitions for Axiom A diffeomorphisms. Amer. J. Math.
[10] Bowen, R. (2008). Equilibrium States and the Ergodic Theory of Anosov Diffeomorphisms, revised ed. Lecture Notes in Math. Springer, Berlin. MR2423393
[11] Cappé, O., Moulines, E. and Rydén, T. (2005). Inference in Hidden Markov Models. Springer, New York. MR2159833
[12] Chatterjee, S. and Yilmaz, M. R. (1992). Chaos, fractals and statistics. Statist. Sci.
[13] Douc, R. and Matias, C. (2001). Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli
[14] Douc, R. and Moulines, E. (2012). Asymptotic properties of the maximum likelihood estimation in misspecified hidden Markov models. Ann. Statist.
[15] Douc, R., Moulines, E., Olsson, J. and van Handel, R. (2011). Consistency of the maximum likelihood estimator for general hidden Markov models. Ann. Statist.
[16] Douc, R., Moulines, É. and Rydén, T. (2004). Asymptotic properties of the maximum likelihood estimator in autoregressive models with Markov regime. Ann. Statist.
[17] Freitas, J. M. and Todd, M. (2009). The statistical stability of equilibrium states for interval maps. Nonlinearity
[18] Genon-Catalot, V. and Laredo, C. (2006). Leroux's method for general hidden Markov models. Stochastic Process. Appl.
[19] Ionides, E., Bretó, C. and King, A. (2006). Inference for nonlinear dynamical systems. Proc. Natl. Acad. Sci. USA
[20] Ionides, E. L., Bhadra, A., Atchadé, Y. and King, A. (2011). Iterated filtering. Ann. Statist.
[21] Isham, V. (1993). Statistical aspects of chaos: A review. In Networks and Chaos—Statistical and Probabilistic Aspects. Monogr. Statist. Appl. Probab.
[22] Jensen, J. L. (1993). Chaotic dynamical systems with a view towards statistics: A review. In Networks and Chaos—Statistical and Probabilistic Aspects. Monogr. Statist. Appl. Probab.
[23] Jensen, J. L. and Petersen, N. V. (1999). Asymptotic normality of the maximum likelihood estimator in state space models. Ann. Statist.
[24] Judd, K. (2007). Failure of maximum likelihood methods for chaotic dynamical systems. Phys. Rev. E (3)
[25] Lalley, S. P. (1999). Beneath the noise, chaos. Ann. Statist.
[26] Lalley, S. P. (2001). Removing the noise from chaos plus noise. In Nonlinear Dynamics and Statistics (Cambridge, 1998)
[27] Lalley, S. P. and Nobel, A. B. (2006). Denoising deterministic time series. Dyn. Partial Differ. Equ.
[28] Law, K. J. H. and Stuart, A. M. (2012). Evaluating data assimilation algorithms. Monthly Weather Review
[29] Leroux, B. G. (1992). Maximum-likelihood estimation for hidden Markov models. Stochastic Process. Appl.
[30] Le Gland, F. and Mevel, L. (2000). Basic properties of the projective product with application to products of column-allowable nonnegative matrices. Math. Control Signals Systems
[31] Le Gland, F. and Mevel, L. (2000). Exponential forgetting and geometric ergodicity in hidden Markov models. Math. Control Signals Systems
[32] McGoff, K., Mukherjee, S., Nobel, A. and Pillai, N. (2014). Supplement to “Consistency of maximum likelihood estimation for some dynamical systems.” DOI:10.1214/14-AOS1259SUPP.
[33] McGoff, K., Mukherjee, S. and Pillai, N. (2013). Statistical inference for dynamical systems: A review. Available at arXiv:1204.6265.
[34] Nobel, A. (2001). Consistent estimation of a dynamical map. In Nonlinear Dynamics and Statistics (Cambridge, 1998)
[35] Nobel, A. B. and Adams, T. M. (2001). Estimating a function from ergodic samples with additive noise. IEEE Trans. Inform. Theory
[36] Petersen, K. (1989). Ergodic Theory. Cambridge Studies in Advanced Mathematics. Cambridge Univ. Press, Cambridge. MR1073173
[37] Petrie, T. (1969). Probabilistic functions of finite state Markov chains. Ann. Math. Statist.
[38] Pisarenko, V. F. and Sornette, D. (2004). Statistical methods of parameter estimation for deterministically chaotic time series.
Phys. Rev. E (3) Poole, D. and
Raftery, A. E. (2000). Inference for deterministic simulation mod-els: The Bayesian melding approach.
J. Amer. Statist. Assoc. Ramsay, J. O. , Hooker, G. , Campbell, D. and
Cao, J. (2007). Parameter esti-mation for differential equations: A generalized smoothing approach.
J. R. Stat.Soc. Ser. B Stat. Methodol. Rey-Bellet, L. and
Young, L.-S. (2008). Large deviations in non-uniformlyhyperbolic dynamical systems.
Ergodic Theory Dynam. Systems Ruelle, D. (1997). Differentiation of SRB states.
Comm. Math. Phys.
Ruelle, D. (2004).
Thermodynamic Formalism: The Mathematical Structures ofEquilibrium Statistical Mechanics , 2nd ed. Cambridge Univ. Press, Cambridge.MR2129258[44]
Shalizi, C. R. (2009). Dynamics of Bayesian updating with dependent data andmisspecified models.
Electron. J. Stat. MCGOFF, MUKHERJEE, NOBEL AND PILLAI[45]
Steinwart, I. and
Anghel, M. (2009). Consistency of support vector machinesfor forecasting the evolution of an unknown ergodic dynamical system fromobservations with unknown noise.
Ann. Statist. Toni, T. , Welch, D. , Strelkowa, N. , Ipsen, A. and
Stumpf, M. (2009). Approxi-mate Bayesian computation scheme for parameter inference and model selectionin dynamical systems.
J. R. Soc. Interface V´asquez, C. H. (2007). Statistical stability for diffeomorphisms with dominatedsplitting.
Ergodic Theory Dynam. Systems Walters, P. (1982).
An Introduction to Ergodic Theory . Graduate Texts in Mathe-matics . Springer, New York. MR0648108[49] Wood, S. N. (2010). Statistical inference for noisy nonlinear ecological dynamicsystems.
Nature
Young, L.-S. (1990). Large deviations in dynamical systems.
Trans. Amer. Math.Soc.
Young, L.-S. (1998). Statistical properties of dynamical systems with some hyper-bolicity.
Ann. of Math. (2)
Young, L.-S. (2002). What are SRB measures, and which dynamical systems havethem?
J. Stat. Phys.
K. McGoff
Department of Mathematics
Duke University
Durham, North Carolina 27708
USA
E-mail: mcgoff@math.duke.edu

S. Mukherjee
Departments of Statistical Science, Computer Science, and Mathematics
Institute for Genome Sciences & Policy
Duke University
Durham, North Carolina 27708
USA
E-mail: [email protected]

A. Nobel
Department of Statistics and Operations Research
University of North Carolina
Chapel Hill, North Carolina 27599-3260
USA
E-mail: [email protected]