Asymptotic Properties of the Maximum Likelihood Estimator in Endogenous Regime-Switching Models
Chaojun Li† (East China Normal University)
Yan Liu‡ (Boston University)

November 11, 2020
Abstract
This study proves the asymptotic properties of the maximum likelihood estimator (MLE) in a wide range of endogenous regime-switching models. This class of models extends the constant state transition probability in Markov-switching models to a time-varying probability that includes information from observations. A feature of importance in this proof is the mixing rate of the state process conditional on the observations, which is time varying owing to the time-varying transition probabilities. Consistency and asymptotic normality follow from the almost surely geometrically decaying bound of the mixing rate. Relying on low-level assumptions that have been shown to hold in general, this study provides theoretical foundations for statistical inference in most endogenous regime-switching models in the literature. As an empirical application, an endogenous regime-switching autoregressive conditional heteroscedasticity (ARCH) model is estimated and analyzed with the obtained inferential results.
Keywords:
Regime-switching model, Endogenous regime-switching model, Asymptotic property, Maximum likelihood estimator.
JEL codes:
C13, C22

∗ We are grateful to Yoosoon Chang, Yoshihiko Nishiyama, and Joon Y. Park for their guidance in writing the paper. All errors are our own. This work was partially supported by the National Natural Science Foundation of China (Grant No. 91546202).
† Academy of Statistics and Interdisciplinary Sciences, Faculty of Economics and Management, East China Normal University, Shanghai 200062, China. Email: [email protected]
‡ Department of Economics, Boston University, 270 Bay State Road, Boston, MA 02215, USA. Email: [email protected]

1 Introduction
Regime-switching models have been applied extensively since Hamilton (1989) to study how time-series patterns change across different underlying economic states, such as boom and recession, high-volatility and low-volatility financial market environments, and active and passive monetary and fiscal policies. This class of models features a bivariate process $(S_t, Y_t)$, where $(S_t)$ is an unobservable Markov chain determining the regime in each period, and $(Y_t)$ is an observable process whose conditional distribution is governed by the underlying state $(S_t)$. The concept of Markov switching has been introduced in a broad class of time-series econometric models, including but not limited to Markov-switching vector autoregressions (Krolzig, 1997), the switching autoregressive conditional heteroscedasticity (ARCH) model (Cai, 1994; Hamilton and Susmel, 1994), and the Markov-switching generalized autoregressive conditional heteroscedasticity model (Gray, 1996; Klaassen, 2002).

In basic Markov-switching models, like Hamilton (1989), the unobservable state process is assumed to follow a homogeneous Markov chain. The implication is that the transition probability and expected duration of each regime are constant through all periods, regardless of the level of the observable and how long the regime has lasted. This setup is restrictive in application and conflicts with empirical findings, such as those of Watson (1992) and Filardo and Gordon (1998). Diebold et al. (1994) extended the state transition probability to be time varying by allowing it to depend on predetermined variables $(X_t)$, which are often chosen as economic covariates for predicting the regime change. This class of models, usually referred to as time-varying transition probability regime-switching models, is widely used in many empirical studies; see, e.g., Filardo (1994), Bekaert and Harvey (1995), Gray (1996), Filardo and Gordon (1998), and Ang et al. (2008).
Such models, however, still assume that the evolution of the underlying state is independent of the other parts of the model. Chib and Dueker (2004), Kim et al. (2008), and Chang et al. (2017) proposed endogenous regime-switching models in which the determination of states depends on the realizations of the observable process. The model of particular interest to this study is Chang et al. (2017), which allows the transition of states to depend on past realizations of the observable $(Y_t)$. In their model, the accumulation of sustained positive (or negative) shocks to the observable makes a switch of regime more likely. (Chib and Dueker (2004) used a Bayesian estimation method and is thus beyond the scope of this study. We do not consider Kim et al. (2008), since its determination of the current state depends on the current shock to the observable process $Y_t$: when the regime is determined, $Y_t$ is a determined one-point realization, so we cannot interpret the model in the conventional way, in which the observable process follows a particular pattern in some regime, as we can in basic regime-switching models.)

Ailliot and Pene (2015) established consistency of the MLE in models with time-inhomogeneous Markov regimes. Pouzo et al. (2018) showed the asymptotic properties of the MLE in autoregressive models with time-inhomogeneous Markov regime switching and possible model misspecifications. Their model set-ups encompass endogenous regime switching in the sense that they allow the transition kernel of the state to depend on past realizations of the observable. However, the authors restricted the observable time series to depend only on the current state, and the state process to be only a first-order inhomogeneous Markov chain. Such restrictions make their theory inapplicable to some models used in the empirical literature, such as Filardo (1994), Filardo and Gordon (1998), and Chang et al. (2017).

The aim of this study is to show the asymptotic properties of the MLE in a wide range of endogenous regime-switching models, including Chang et al. (2017) as a special case. The model we consider is general enough to incorporate the time-varying transition probability regime-switching models represented by Diebold et al. (1994). Our model is more general than that of Douc et al. (2004) and Kasahara and Shimotsu (2019) in that it allows the state transition probability to depend on past realizations and other economic fundamentals. (Baum and Petrie (1966), Leroux (1992), Bickel and Ritov (1996), Bickel et al. (1998), Jensen and Petersen (1999), Le Gland and Mevel (2000), and Douc and Matias (2001) contribute to the asymptotic theory for models less general than those of Douc et al. (2004) and Kasahara and Shimotsu (2019). Their models, usually referred to as hidden Markov models, do not allow autoregression, and $(Y_t)$ are conditionally independent given the current state.) Our model is more general than that of Ailliot and Pene (2015) and Pouzo et al. (2018) in that it allows the transition of the observable and the state process to depend on more than one lag of the processes. Moreover, our assumptions on the state transition probabilities, which are key to the establishment of the asymptotic theory, are less restrictive than those of Ailliot and Pene (2015) and Pouzo et al. (2018). We show that our assumptions hold in some widely used endogenous regime-switching models. Thus, this study provides theoretical foundations for statistical inference in endogenous regime-switching models in most empirical research. Some interesting and important statistical tests can now be conducted, such as whether the observables affect the transition probability and whether the effect is positive
or negative.

The general difficulty in proving asymptotic theory for regime-switching models is that the predictive densities of the observable given past realizations do not form a stationary sequence, and thus the ergodic theorem does not apply directly. The strategy, which originated with Baum and Petrie (1966), is to approximate the log-likelihood function by the partial sum of a stationary ergodic sequence. The cornerstone of the approximation is the almost surely geometrically decaying bound of the mixing rate of the conditional chain $S \mid Y, X$. The time-varying state transition probability makes it more complex to show the bound, because it enters the mixing rate and causes the rate to approach unity as the transition probability approaches zero. By contrast, the constant state transition probability is bounded away from zero in Douc et al. (2004) and Kasahara and Shimotsu (2019). The main theoretical contribution of this study is to show that the mixing rate is eventually bounded away from unity by assuming that there is only a small probability that the observable takes extreme values, which is vital for our findings of consistency and asymptotic normality of the MLE.

The rest of this paper is organized as follows. Section 2 lists the main assumptions and examples of endogenous regime-switching models. Section 3 shows the mixing rate of the conditional chain and the approximation of the log-likelihood function with an ergodic stationary process. Sections 4 and 5 show the consistency and asymptotic normality of the MLE, respectively. Section 6 reports the simulation results. Section 7 estimates an endogenous regime-switching ARCH model using weekly S&P 500 data after World War II (WWII). Using the inference results established in Sections 2–5, we find strong evidence of endogeneity in regime switching: today's bad news about stock returns makes a switch into the high-volatility regime tomorrow more likely. Our empirical result provides a new perspective on how stock returns affect volatility. Section 8 concludes.
2 Endogenous regime-switching models

Endogenous regime-switching models can generally be defined by a transition equation of the observed process and a state transition probability that is allowed to depend on the past observables $Y_{t-1}, \ldots, Y_{t-r}$:

transition equation of the observed process:
$$Y_t = f_\theta(S_t, \ldots, S_{t-r+1}, Y_{t-1}, \ldots, Y_{t-r}, X_t; U_t) \quad (1)$$

state transition probability:
$$q_\theta(S_t \mid S_{t-1}, \ldots, S_{t-r}, Y_{t-1}, \ldots, Y_{t-r}, X_t) \quad (2)$$

where $X_t$ is a predetermined variable (vector), $(U_t)$ is an independent and identically distributed (i.i.d.) sequence of random variables, $f_\theta$ is a family of functions indexed by $\theta$, and $q_\theta$ is a family of probabilities indexed by $\theta$. Pouzo et al. (2018) dealt with the model with $r = 1$. Douc et al. (2004) and Kasahara and Shimotsu (2019) dealt with the special case in which the transition probability in (2) depends only on $(S_{t-1}, \ldots, S_{t-r})$, so that the model reduces to the basic Markov-switching model. The case in which (2) depends on $(S_{t-1}, \ldots, S_{t-r}, X_t)$ is the time-varying transition probability regime-switching model represented by Diebold et al. (1994).

Some widely applied transition equations of the observed process are summarized as
$$Y_t = m(Y_{t-1}, \ldots, Y_{t-k}, S_t, \ldots, S_{t-k}, X_t) + \sigma(Y_{t-1}, \ldots, Y_{t-k}, S_t, \ldots, S_{t-k}) U_t = m_t + \sigma_t U_t. \quad (3)$$

An example of (3) is the autoregressive model with switching in mean and variance:
$$\gamma(L)(Y_t - \mu_t) = \gamma_X' X_t + \sigma_t U_t \quad (4)$$
where $\gamma(z) = 1 - \gamma_1 z - \cdots - \gamma_k z^k$, $\mu_t = \mu(S_t)$, $\sigma_t = \sigma(S_t)$, and $S_t = 1, 2, \ldots,$ or $J$.

Another example is the autoregressive model with state-dependent autoregression coefficients:
$$Y_t = \mu(S_t) + \gamma_1(S_t) Y_{t-1} + \cdots + \gamma_k(S_t) Y_{t-k} + \gamma_X(S_t)' X_t + \sigma(S_t) U_t. \quad (5)$$
Cai (1994) and Hamilton and Susmel (1994) proposed regime-switching ARCH models:
$$Y_t = \mu + \phi_1 Y_{t-1} + \cdots + \phi_k Y_{t-k} + \xi_t, \qquad \xi_t = \sqrt{h_t}\, U_t, \qquad h_t = C(S_t) + \gamma_1(S_t)\xi_{t-1}^2 + \cdots + \gamma_k(S_t)\xi_{t-k}^2. \quad (6)$$

(The transition equation of the observed process here is more general than it may appear, by allowing for different numbers of lags of $(S_t)$ and $(Y_t)$, as in $f_\theta(S_t, \ldots, S_{t-p+1}, Y_{t-1}, \ldots, Y_{t-q}, X_t; U_t)$. If, without loss of generality, $p \le q$, then we can make an innocuous change by including more lags, as in $f_\theta(S_t, \ldots, S_{t-p+1}, S_{t-p}, \ldots, S_{t-q+1}, Y_{t-1}, \ldots, Y_{t-q}, X_t; U_t)$. Then, this is the model in (1). Moreover, the transition probability here can accommodate different numbers of lags in $(S_t)$ and $(Y_t)$, as well as different numbers of lags from those in (1), by making similar changes. Note also that the state transition probability in Pouzo et al. (2018) cannot easily be generalized to accommodate more lags by defining the state variable $S_t$ as a vector $(\tilde{S}_t, \tilde{S}_{t-1}, \ldots, \tilde{S}_{t-r})$, because their Assumption 1 requires $q_\theta(S_t \mid S_{t-1}, Y_{t-1}) > 0$ for all $S_t$ and $S_{t-1}$, which is violated when, for $r = 2$, $q_\theta(S_t = (s_1, s_2) \mid S_{t-1} = (s_3, s_4), Y_{t-1}) = 0$. Similarly, the transition equation of the observed process cannot be generalized to accommodate more lags owing to their Assumption 5.)

For the state transition probability, we mainly consider the specifications proposed by Diebold et al. (1994) and Chang et al. (2017) as examples.

Example 1. (Diebold et al., 1994) Transition probabilities are functions mapping $X_t$ to $[0, 1]$. With $S_t = 0$ or $1$,
$$q_\theta(S_t = s_t \mid S_{t-1} = s_{t-1}, X_t) = \begin{pmatrix} p_0(X_t) & 1 - p_0(X_t) \\ 1 - p_1(X_t) & p_1(X_t) \end{pmatrix}. \quad (7)$$
The functions used most often are logistic functions and probit functions. The case in which $p_0(X_t)$ and $p_1(X_t)$ are constants reduces to the basic Markov-switching model. Although the transition probability in Diebold et al.
(1994) does not depend on past observations of $Y_t$, and thus is strictly speaking not an endogenous regime-switching model, the reason we include it here is twofold. First, the model is widely used, but its asymptotic properties have not been fully discussed. Second, the model can easily be extended to endogenous regime-switching models by including past realizations of the observable among the predetermined variables.

Example 2. (Chang et al., 2017) Chang et al. (2017) proposed a new approach to modelling switching when the number of regimes is two. The regime is decided by an autoregressive latent factor, depending on whether the factor takes a value above or below a threshold $\tau$. The latent factor follows an AR(1) process
$$W_t = \alpha W_{t-1} + V_t \quad (8)$$
for $t = 1, 2, \ldots$ with parameter $\alpha \in (-1, 1)$ and i.i.d. standard normal innovations $(V_t)$. $(U_t)$ and $(V_t)$ are jointly i.i.d. and distributed as
$$\begin{pmatrix} U_t \\ V_{t+1} \end{pmatrix} =_d N\left( \begin{pmatrix} 0 \\ 0 \end{pmatrix}, \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix} \right). \quad (9)$$
The regime is decided by
$$S_t = \begin{cases} 1 & \text{if } W_t \ge \tau, \\ 0 & \text{if } W_t < \tau. \end{cases} \quad (10)$$
The correlation $\rho$ between innovations to the latent factor and lagged innovations to the observed time series connects the state transition probability to the observed time series. A positive correlation means that information raising the level of the observed time series makes the economy more likely to be in the high regime ($S_t = 1$) in the future. A negative correlation works in the opposite direction. A zero-correlation model reduces to the basic Markov-switching model.

Theorem 3.1 in Chang et al. (2017) clarifies the state transition probability. The theorem states that with the transition equation of the observed process (3), when $|\alpha| < 1$ and $|\rho| < 1$, $(S_t, Y_t)$ together follow a $(k+1)$th-order Markov process, and the state transition probability is
$$q_\theta(S_t \mid S_{t-1}, \ldots, S_{t-k-1}, Y_{t-1}, \ldots, Y_{t-k-1}) = (1 - S_t)\,\omega_\rho + S_t\,(1 - \omega_\rho) \quad (11)$$
with $\omega_\rho = \omega_\rho(S_{t-1}, \ldots, S_{t-k-1}, Y_{t-1}, \ldots, Y_{t-k-1})$ defined as
$$\omega_\rho = \frac{\Big[(1 - S_{t-1})\int_{-\infty}^{\tau\sqrt{1-\alpha^2}} + S_{t-1}\int_{\tau\sqrt{1-\alpha^2}}^{\infty}\Big]\, \Phi\Big(\frac{\tau - \rho U_{t-1}}{\sqrt{1-\rho^2}} - \frac{\alpha x}{\sqrt{1-\alpha^2}\sqrt{1-\rho^2}}\Big)\, \varphi(x)\, dx}{(1 - S_{t-1})\,\Phi(\tau\sqrt{1-\alpha^2}) + S_{t-1}\,\big[1 - \Phi(\tau\sqrt{1-\alpha^2})\big]} \quad (12)$$
where $U_{t-1} = \frac{Y_{t-1} - m_{t-1}}{\sigma_{t-1}}$, and $\Phi(\cdot)$ and $\varphi(\cdot)$ are the distribution function and the density function of the standard normal distribution, respectively. The state transition probability fits (2) with $r = k + 1$.

Let "$\triangleq$" denote "equals by definition." For any probability measures $\mu_1$ and $\mu_2$, we define the total variation distance $\|\mu_1 - \mu_2\|_{TV} = \sup_A |\mu_1(A) - \mu_2(A)|$. For any $x \in \mathbb{R}_+$, let $\lfloor x \rfloor$ denote the largest integer not exceeding $x$. Let $\nabla_\theta$ be the gradient and $\nabla^2_\theta$ the Hessian operator with respect to the parameter $\theta$. For any matrix or vector $A$, $\|A\| = \sum_{i,j} |A_{ij}|$. "i.o." stands for "infinitely often." Let $\xrightarrow{a.s.}$, $\xrightarrow{p}$, and $\Rightarrow$ denote almost sure convergence, convergence in probability, and weak convergence, respectively.

For short notation, we define $Y_m^n \triangleq (Y_n, Y_{n-1}, \ldots, Y_m)'$ for $n \ge m$, $\mathbf{Y}_t \triangleq Y_{t-r+1}^t = (Y_t, Y_{t-1}, \ldots, Y_{t-r+1})'$, $\mathbf{Y}_m^n \triangleq (\mathbf{Y}_n, \ldots, \mathbf{Y}_m)'$ for $n \ge m$, and similarly for $\mathbf{S}_t$, $\mathbf{X}_t$ and realizations $\mathbf{s}_t$, $\mathbf{y}_t$, and $\mathbf{x}_t$.

We assume that $\{S_t\}_{t=-r+1}^{\infty}$ takes values in a discrete set $\mathcal{S}$ with $J$ elements. Let $\mathbf{S} \triangleq \mathcal{S}^r$, and use $\mathcal{P}(\mathbf{S})$ to denote the power set of $\mathbf{S}$. For each $t \ge 1$, conditional on $(Y_{t-r}^{t-1}, S_{t-r}^{t-1}, X_t)$, $S_t$ is conditionally independent of $(Y_{-r+1}^{t-r-1}, S_{-r+1}^{t-r-1}, X_1^{t-1}, X_{t+1}^{\infty})$. The transition probability is $q_\theta(s \mid \mathbf{S}_{t-1}, \mathbf{Y}_{t-1}, X_t)$. $\{Y_t\}_{t=-r+1}^{\infty}$ takes values in a set $\mathcal{Y}$, which is separable and metrizable by a complete metric. Let $\mathbf{Y} \triangleq \mathcal{Y}^r$.
For each $t \ge 1$, conditional on $(Y_{t-r}^{t-1}, S_{t-r+1}^{t}, X_t)$, $Y_t$ is independent of $(Y_{-r+1}^{t-r-1}, S_{-r+1}^{t-r}, X_1^{t-1}, X_{t+1}^{\infty})$. The conditional law has a density $g_\theta(y \mid \mathbf{Y}_{t-1}, \mathbf{S}_t, X_t)$ with respect to some fixed $\sigma$-finite measure $\nu$ on the Borel $\sigma$-field $\mathcal{B}(\mathcal{Y})$. $\{X_t\}_{t=1}^{\infty}$ takes values in a set $\mathcal{X}$. Conditionally on $X_t$, $\{X_k\}_{k \ge t+1}$ is independent of $\{Y_k\}_{k \le t}$ and $\{S_k\}_{k \le t}$. Conditionally on $X_t$, $\{X_k\}_{k \le t-1}$ is independent of $\{Y_k\}_{k \ge t}$ and $\{S_k\}_{k \ge t}$.

Under this setup, conditional on $X_1^{\infty}$, $(S_t, Y_t)$ follows a Markov chain of order $r$ with transition density
$$p_\theta(S_t, Y_t \mid S_{t-1}, \ldots, S_{t-r}, Y_{t-1}, \ldots, Y_{t-r}, X_t) = g_\theta(Y_t \mid \mathbf{Y}_{t-1}, \mathbf{S}_t, X_t)\, q_\theta(S_t \mid \mathbf{S}_{t-1}, \mathbf{Y}_{t-1}, X_t) = p_\theta(S_t, Y_t \mid \mathbf{S}_{t-1}, \mathbf{Y}_{t-1}, X_t).$$
It also follows that for $1 \le t \le n$,
$$p_\theta(Y_1^t \mid \mathbf{Y}_0, \mathbf{S}_0 = \mathbf{s}_0, X_1^n) = p_\theta(Y_1^t \mid \mathbf{Y}_0, \mathbf{S}_0 = \mathbf{s}_0, X_1^t), \qquad p_\theta(Y_1^t \mid \mathbf{Y}_0, X_1^n) = p_\theta(Y_1^t \mid \mathbf{Y}_0, X_1^t).$$

This study works with the conditional likelihood function given initial observations $\mathbf{Y}_0 = (Y_0, \ldots, Y_{-r+1})$, (unobservable) initial states $\mathbf{S}_0 = (S_0, \ldots, S_{-r+1})$, and predetermined variables $X_t$, owing to the difficulty of obtaining a closed-form expression for the unconditional stationary likelihood function. We can write the conditional log-likelihood function as
$$\ell_n(\theta, \mathbf{s}_0) = \log p_\theta(Y_1, \ldots, Y_n \mid \mathbf{Y}_0, \mathbf{S}_0 = \mathbf{s}_0, X_1^n) = \sum_{t=1}^{n} \log p_\theta(Y_t \mid Y_{-r+1}^{t-1}, \mathbf{S}_0 = \mathbf{s}_0, X_1^t) \quad (13)$$
with predictive density
$$p_\theta(Y_t \mid Y_{-r+1}^{t-1}, \mathbf{S}_0, X_1^t) = \sum_{\mathbf{s}_t, \mathbf{s}_{t-1}} g_\theta(Y_t \mid \mathbf{Y}_{t-1}, \mathbf{s}_t, X_t)\, q_\theta(s_t \mid \mathbf{s}_{t-1}, \mathbf{Y}_{t-1}, X_t)\, P_\theta(\mathbf{S}_{t-1} = \mathbf{s}_{t-1} \mid Y_{-r+1}^{t-1}, \mathbf{S}_0, X_1^{t-1}). \quad (14)$$
When the number of observations is $n + r$, we condition on the first $r$ observations and arbitrarily choose the initial state $\mathbf{s}_0$. The aim of this study is to show consistency and asymptotic normality of the MLE $\hat{\theta}_{n, \mathbf{s}_0} = \arg\max_\theta \ell_n(\theta, \mathbf{s}_0)$ with any choice of $\mathbf{s}_0$, even when it is not the true underlying initial state.

The following are the basic assumptions.

Assumption 1.
The parameter $\theta$ belongs to $\Theta$, and $\Theta$ is compact. Let $\theta^*$ denote the true parameter; $\theta^*$ lies in the interior of $\Theta$.

Assumption 2. $(S_t, Y_t, X_t)$ is a strictly stationary ergodic process.

Assumption 3. (a) $\sigma_-(\mathbf{y}_0, x) := \inf_\theta \min_{\mathbf{s}_0 \in \mathbf{S},\, s_1 \in \mathcal{S}} q_\theta(s_1 \mid \mathbf{s}_0, \mathbf{y}_0, x) > 0$ for all $\mathbf{y}_0 \in \mathbf{Y}$ and $x \in \mathcal{X}$;
(b) $b_-(y_{-r+1}^1, x) := \inf_\theta \min_{\mathbf{s}_1 \in \mathbf{S}} g_\theta(y_1 \mid \mathbf{y}_0, \mathbf{s}_1, x) > 0$ for all $y_{-r+1}^1 \in \mathcal{Y}^{r+1}$ and $x \in \mathcal{X}$, and $E_{\theta^*} |\log b_-(Y_{-r+1}^1, X_1)| < \infty$;
(c) $b_+ := \sup_\theta \sup_{y_1, \mathbf{y}_0, x} \max_{\mathbf{s}_1} g_\theta(y_1 \mid \mathbf{y}_0, \mathbf{s}_1, x) < \infty$.

Assumption 4. (a) Constants $\alpha_1 > 0$, $C_1, C_2 \in (0, +\infty)$, and $\beta_1 > 1$ exist such that, for any $\xi > 0$,
$$P_{\theta^*}\big(\sigma_-(\mathbf{Y}_0, X_1) \le C_1 e^{-\alpha_1 \xi}\big) \le C_2 \xi^{-\beta_1}. \quad (15)$$
(b) Constants $\alpha_2 > 0$, $C_3, C_4 \in (0, +\infty)$, and $\beta_2 > 1$ exist such that, for any $\xi > 0$,
$$P_{\theta^*}\big(b_-(Y_{-r+1}^1, X_1) \le C_3 e^{-\alpha_2 \xi}\big) \le C_4 \xi^{-\beta_2}. \quad (16)$$

Remark 1.
Assumptions 3(a) and 4(a) impose restrictions on the state transition probabilities. In basic Markov-switching models, $\sigma_-(\cdot)$ reduces to a constant. In this special case, Assumption 3(a) is the same as Assumption (1d) in Kasahara and Shimotsu (2019). It states that the state transition probability is bounded away from zero. The assumption is key to their proof of the geometrically decaying bound of the mixing rate of the conditional chain $(S \mid Y, X)$, which is the cornerstone of showing the asymptotic properties.

In endogenous regime-switching models, $\sigma_-(\cdot)$ is a function of the observations. Assumption 3(a) states that the transition probability is positive. We do not strengthen the assumption to restrict $\sigma_-(\mathbf{y}_0, x)$ to be bounded away from zero as in Ailliot and Pene (2015), because doing so can be too restrictive and exclude examples in use. For instance, in Chang et al. (2017), when $\rho > 0$ and $S_t = 0$, the transition probability approaches zero as $U_{t-1} = (Y_{t-1} - m_{t-1})/\sigma_{t-1}$ approaches infinity. The possibility that $\sigma_-(\cdot)$ might approach zero makes it more complex to prove the asymptotic properties, because the bound on the mixing rate need not decay.

We note, however, that the case where $\sigma_-(\mathbf{y}_0, x)$ takes extremely small values does not happen very often in the example of Chang et al. (2017). This motivates us to impose Assumption 4(a), which implies a low probability of $\sigma_-(\mathbf{y}_0, x)$ taking an extremely small value. This is the key assumption we use in Lemma 2 to show that the bound on the mixing rate decays geometrically with probability close to one. Assumption 4(a) is not very restrictive. We show in Appendix C.1 that the state transition probabilities in Chang et al. (2017) and Diebold et al. (1994) satisfy the assumption. A sufficient condition that is easier to verify in practice is $E_{\theta^*}\big[|\log \sigma_-(\mathbf{Y}_0, X_1)|^{\delta}\big] < \infty$ for some $\delta > 1$.

The following assumptions concern the continuity of $q_\theta$ and $g_\theta$ and the identifiability of $\theta^*$:

Assumption 5.
For all $(s_1, \mathbf{s}_0) \in \mathcal{S} \times \mathbf{S}$, $(\mathbf{y}_0, y_1) \in \mathbf{Y} \times \mathcal{Y}$, and $x \in \mathcal{X}$, the maps $\theta \mapsto q_\theta(s_1 \mid \mathbf{s}_0, \mathbf{y}_0, x)$ and $\theta \mapsto g_\theta(y_1 \mid \mathbf{y}_0, \mathbf{s}_1, x)$ are continuous.

Assumption 6. $\theta$ and $\theta^*$ are identical (up to a permutation of state indexes) if and only if $P_\theta(Y_1^n \in \cdot \mid \mathbf{Y}_0, X_{-r+1}^n) = P_{\theta^*}(Y_1^n \in \cdot \mid \mathbf{Y}_0, X_{-r+1}^n)$ for all $n \ge 1$.

Remark 2.
Assumption 6 involves distributions conditional on the initial observations. By contrast, the identifiability assumption in Kasahara and Shimotsu (2019) involves $P_\theta(Y_1 \in \cdot \mid \mathbf{Y}_{-n}^0, X_{-n}^1)$ for all $n \ge 0$, which depends on all past observations and makes their assumption difficult to verify in practice. We show in the appendix how to verify our assumption using the identifiability of the mixture normal distribution.
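Before moving on, it is worth sketching why the moment condition in Remark 1 is sufficient for Assumption 4(a) (taking the constants in (15) to be $C_1 = 1$ and $\alpha_1 = 1$ for simplicity). Since $\sigma_-(\cdot) \le 1$, Markov's inequality applied to $|\log \sigma_-|$ gives, for any $\xi > 0$,
$$P_{\theta^*}\big(\sigma_-(\mathbf{Y}_0, X_1) \le e^{-\xi}\big) = P_{\theta^*}\big(|\log \sigma_-(\mathbf{Y}_0, X_1)| \ge \xi\big) \le \frac{E_{\theta^*}\big[|\log \sigma_-(\mathbf{Y}_0, X_1)|^{\delta}\big]}{\xi^{\delta}},$$
so the polynomial tail bound in (15) holds with constant $E_{\theta^*}[|\log \sigma_-(\mathbf{Y}_0, X_1)|^{\delta}]$ and tail exponent $\delta$.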
3 Approximation with a stationary ergodic sequence
Consistency and asymptotic normality follow if we can show the following two results: (a) the normalized log-likelihood $n^{-1}\ell_n(\theta, \mathbf{s}_0)$ converges to a deterministic function $\ell(\theta)$ uniformly with respect to $\theta$, and $\theta^*$ is a well-separated point of maximum of $\ell(\theta)$; and (b) a central limit theorem holds for the Fisher score function, together with a locally uniform law of large numbers for the observed Fisher information.

The updating distribution $P_\theta(\mathbf{S}_{t-1} = \mathbf{s}_{t-1} \mid Y_{-r+1}^{t-1}, \mathbf{S}_0, X_1^{t-1})$ in (14) is not stationary ergodic, and thus the predictive density is not stationary ergodic. Therefore, the ergodic theorem cannot be applied directly to conclude a law of large numbers or a central limit theorem. The strategy, which originated with Baum and Petrie (1966), is to approximate the predictive density with a stationary ergodic sequence. This section gives the theoretical foundations of the approximation.

We choose the stationary ergodic sequence to be
$$p_\theta(Y_t \mid Y_{-\infty}^{t-1}, X_{-\infty}^{t}) = \sum_{\mathbf{s}_t, \mathbf{s}_{t-1}} g_\theta(Y_t \mid \mathbf{Y}_{t-1}, \mathbf{s}_t, X_t)\, q_\theta(s_t \mid \mathbf{s}_{t-1}, \mathbf{Y}_{t-1}, X_t)\, P_\theta(\mathbf{S}_{t-1} = \mathbf{s}_{t-1} \mid Y_{-\infty}^{t-1}, X_{-\infty}^{t-1}). \quad (17)$$
(17) differs from (14) in that the updating distribution now depends on the whole history of $(Y_t, X_t)$ from the infinite past and does not depend on the initial state. We can extend $(Y_t, X_t)$ to infinite time $\{Y_t, X_t\}_{-\infty}^{+\infty}$ because of Assumption 2. Stationary ergodicity of (17) follows from Theorem 7.1.3 in Durrett (2010). The approximation builds on the almost surely geometrically decaying bound of the mixing rate of the conditional chain $(S \mid Y, X)$. The bound guarantees that the influence of observations and the initial state far in the past quickly vanishes, and thus the difference between the exact and approximated predictive log densities becomes asymptotically negligible. Lemma 1 establishes an upper bound for the mixing rate.
Lemma 1 (Uniform ergodicity). Assume 3. Let $m, n \in \mathbb{Z}$, $-m \le n$, and $\theta \in \Theta$. Then, for $-m \le k \le n$, for all probability measures $\mu_1$ and $\mu_2$ defined on $\mathcal{P}(\mathbf{S})$ and for all $Y_{-m}^n$ and $X_{-m}^n$,
$$\Big\| \sum_{\mathbf{s} \in \mathbf{S}} P_\theta(S_k \in \cdot \mid \mathbf{S}_{-m} = \mathbf{s}, Y_{-m}^n, X_{-m}^n)\, \mu_1(\mathbf{s}) - \sum_{\mathbf{s} \in \mathbf{S}} P_\theta(S_k \in \cdot \mid \mathbf{S}_{-m} = \mathbf{s}, Y_{-m}^n, X_{-m}^n)\, \mu_2(\mathbf{s}) \Big\|_{TV} \le \prod_{i=1}^{\lfloor (k+m)/r \rfloor} \Big(1 - \omega\big(Y_{-m+ri-r}^{-m+ri-1}, X_{-m+ri-r+1}^{-m+ri}\big)\Big) \quad (18)$$
with
$$\omega(Y_{k-r}^{k-1}, X_{k-r+1}^{k}) = \frac{\prod_{\ell=k-r+1}^{k} \sigma_-(\mathbf{Y}_{\ell-1}, X_\ell)\, \prod_{\ell=k-r+1}^{k-1} b_-(Y_{\ell-r}^{\ell}, X_\ell)}{b_+^{r-1}}.$$

If $\omega(Y_{k-r}^{k-1}, X_{k-r+1}^{k})$ is bounded away from zero, then the bound decays geometrically. In basic Markov-switching models, the state transition probability does not include information from observations, and thus $\sigma_-(\mathbf{Y}_{k-1}, X_k)$ in the upper bound reduces to a constant strictly larger than zero. Kasahara and Shimotsu (2019) showed that in this case, $\omega(Y_{k-r}^{k-1}, X_{k-r+1}^{k})$ is bounded away from zero with probability close to one, and the conditional chain forgets its past at an exponential rate. In endogenous regime-switching models, $\sigma_-(\mathbf{Y}_{k-1}, X_k)$, and hence $\omega(Y_{k-r}^{k-1}, X_{k-r+1}^{k})$, can approach zero when the observations take extreme values. However, we show in the following lemma that as long as the extreme cases do not occur very often, $\omega(Y_{k-r}^{k-1}, X_{k-r+1}^{k})$ is bounded away from zero with probability close to one. More specifically, under Assumption 4(a), which restricts the probability of $\sigma_-(\mathbf{Y}_{k-1}, X_k)$ being close to zero, the bound in Lemma 1 decays geometrically. In what follows, for short notation, define $V_t \triangleq (Y_{t-r}^{t-1}, X_{t-r+1}^{t})$.

Lemma 2 (Bound for mixing rate). Assume 2–4. Then, $\varepsilon \in (0, 1/r)$ and $\rho \in (0, 1)$ exist such that for all $m, n \in \mathbb{Z}_+$,
$$P_{\theta^*}\Big( \prod_{k=1}^{n} \big(1 - \omega(V_{t_k})\big) \ge \rho^{(1-\varepsilon)n} \ \text{i.o.} \Big) = 0, \quad (19)$$
$$E_{\theta^*}\Big[ \prod_{k=1}^{n} \big(1 - \omega(V_{t_k})\big)^m \Big] \le \rho^n, \quad (20)$$
where $\{t_k\}_{1 \le k \le n}$ is a sequence of integers such that $t_k \ne t_{k'}$ for $1 \le k, k' \le n$ and $k \ne k'$. It also holds that for $t \ge r$,
$$P_{\theta^*}\big( \omega(V_t) \le C_0\, \rho^{\varepsilon \lfloor (t-r)/r \rfloor} \ \text{i.o.} \big) = 0, \quad (21)$$
where $C_0 = C_1^r C_3^{r-1} b_+^{-(r-1)}$, with $C_1$ from Assumption 4(a), $C_3$ from Assumption 4(b), and $b_+$ from Assumption 3(c).

In the special case with $r = 1$ and no exogenous variables, Lemma 1 parallels Theorem 2 in Pouzo et al. (2018). In this case, $b_-(\cdot)$ reduces to a constant smaller than $b_+$, and the bound in (18) reduces to $\prod_{i=1}^{k+m} (1 - \sigma_-(Y_{-m+i-1}))$. Pouzo et al. (2018) imposed their Assumption 4 on this bound to show the approximation. Compared to our Assumption 4(a), which serves the same function, their assumption is more restrictive. More specifically, they assumed $\sum_{k=0}^{\infty} E_{\theta^*}\big[\prod_{i=0}^{k} (1 - \sigma_-(Y_i))\big] < \infty$ in the notation of our paper. From (20), $E_{\theta^*}\big[\prod_{i=0}^{k} (1 - \sigma_-(Y_i))\big] \le \rho^k$ and hence $\sum_{k=0}^{\infty} E_{\theta^*}\big[\prod_{i=0}^{k} (1 - \sigma_-(Y_i))\big] \le \sum_{k=0}^{\infty} \rho^k < \infty$. Thus, Assumption 4 in Pouzo et al. (2018) follows as a consequence of our Lemma 2 and is more restrictive than our assumption.

(20) establishes geometrically decaying rates for the moments of the upper bound. The rates play a role when we establish the approximation in $L^2(P_{\theta^*})$. Kasahara and Shimotsu (2019) did not derive such rates. Instead, they made additional high-level assumptions on the moments of $\sup_{m \ge 0} \sup_{\theta \in G} \sup_{\mathbf{s}_0} \nabla_\theta^i \log p_\theta(Y_1 \mid Y_{-m}^0, \mathbf{S}_{-m} = \mathbf{s}_0, X_{-m}^1)$ in $L^2(P_{\theta^*})$. By contrast, this study does not need such additional assumptions: (20) is derived from the same assumptions as (19) and (21).

4 Consistency

This section proves the consistency of the MLE $\hat{\theta}_{n, \mathbf{s}_0} = \arg\max_\theta \ell_n(\theta, \mathbf{s}_0)$. We first show that the normalized log-likelihood $n^{-1}\ell_n(\theta, \mathbf{s}_0)$ converges to a deterministic function $\ell(\theta)$ uniformly with respect to $\theta$. The normalized log-likelihood function is
$$n^{-1}\ell_n(\theta, \mathbf{s}_0) = \frac{1}{n} \sum_{t=1}^{n} \log p_\theta(Y_t \mid Y_{-r+1}^{t-1}, \mathbf{S}_0, X_1^t),$$
with $p_\theta(Y_t \mid Y_{-r+1}^{t-1}, \mathbf{S}_0, X_1^t)$ expressed in (14).
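For intuition, the likelihood (13)–(14) is computed in practice by a forward filtering recursion, and the forgetting property behind Lemmas 1–2 can be observed by running that recursion from two different initial states and tracking the total variation distance between the two filtered state distributions. The sketch below is a minimal illustration for $r = 1$, two regimes, and no exogenous $X_t$, with a hypothetical logistic, observation-driven transition probability; all parameter values and the functions `p`, `q`, `g` are arbitrary stand-ins for $q_\theta$ and $g_\theta$, not the paper's empirical specification:

```python
import numpy as np

def forward_filter(Y, q, g, s0, J=2):
    """Forward recursion for the conditional likelihood (13)-(14), r = 1.

    q(st, sp, yp): state transition probability q_theta(st | sp, yp)
    g(y, yp, s):   conditional density g_theta(y | yp, s)
    Y[0] plays the role of the initial observation Y_0; s0 is the
    arbitrarily chosen initial state S_0.  Returns the log-likelihood
    and the path of filtered distributions P(S_t = . | Y^t, S_0 = s0)."""
    filt = np.zeros(J); filt[s0] = 1.0     # point mass at the chosen S_0
    path = [filt.copy()]
    loglik = 0.0
    for t in range(1, len(Y)):
        # joint[st, sp] = g(Y_t|Y_{t-1}, st) q(st|sp, Y_{t-1}) P(S_{t-1}=sp|...)
        joint = np.array([[g(Y[t], Y[t - 1], st) * q(st, sp, Y[t - 1]) * filt[sp]
                           for sp in range(J)] for st in range(J)])
        pred = joint.sum()                 # predictive density, cf. (14)
        loglik += np.log(pred)             # accumulates the sum in (13)
        filt = joint.sum(axis=1) / pred    # Bayes update of the filtered state
        path.append(filt.copy())
    return loglik, np.array(path)

# Hypothetical two-state specification: logistic observation-driven
# transitions and Gaussian regime-dependent means (arbitrary values).
p = lambda y: 1.0 / (1.0 + np.exp(-0.5 * y))
q = lambda st, sp, yp: p(yp) if st == sp else 1.0 - p(yp)
g = lambda y, yp, s: np.exp(-(y - (2 * s - 1)) ** 2 / 2) / np.sqrt(2 * np.pi)

rng = np.random.default_rng(1)
Y = rng.standard_normal(200)

# Run the filter from both possible initial states: the total variation
# distance between the filtered distributions collapses quickly, so the
# arbitrary choice of s0 does not matter asymptotically.
ll0, f0 = forward_filter(Y, q, g, s0=0)
ll1, f1 = forward_filter(Y, q, g, s0=1)
tv = 0.5 * np.abs(f0 - f1).sum(axis=1)
```

With constant `p`, the recursion reduces to the standard filter for basic Markov-switching models; the observation-driven `p` is what makes the contraction coefficient of each step, and hence the mixing rate, time varying.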
The first step is to approximate $n^{-1}\ell_n(\theta, \mathbf{s}_0)$ with the sample mean of a $P_{\theta^*}$-stationary ergodic sequence of random variables. The construction of the stationary ergodic sequence is as follows. Define
$$\Delta_{t,m,\mathbf{s}}(\theta) \triangleq \log p_\theta(Y_t \mid Y_{-m}^{t-1}, \mathbf{S}_{-m} = \mathbf{s}, X_{-m+1}^{t}), \qquad \Delta_{t,m}(\theta) \triangleq \log p_\theta(Y_t \mid Y_{-m}^{t-1}, X_{-m+1}^{t}).$$
We show in (22) of Lemma 3 that $\{\Delta_{t,m,\mathbf{s}}(\theta)\}_{m \ge 0}$ is a uniformly Cauchy sequence with respect to $\theta \in \Theta$, $P_{\theta^*}$-almost surely (a.s.).

Lemma 3.
Assume 2–4. Then, there exist constants $\rho \in (0, 1)$ and $M < \infty$ and random sequences $\{A_{t,m}\}_{t \ge 1, m \ge 0}$ such that, for all $t \ge 1$ and $m' \ge m \ge 0$,
$$\sup_{\theta \in \Theta} \max_{\mathbf{s}_i, \mathbf{s}_j \in \mathbf{S}} \big|\Delta_{t,m,\mathbf{s}_i}(\theta) - \Delta_{t,m',\mathbf{s}_j}(\theta)\big| \le A_{t,m}\, \rho^{\lfloor (t+m)/r \rfloor}, \quad (22)$$
$$\sup_{\theta \in \Theta} \max_{\mathbf{s}_i \in \mathbf{S}} \big|\Delta_{t,m,\mathbf{s}_i}(\theta) - \Delta_{t,m}(\theta)\big| \le A_{t,m}\, \rho^{\lfloor (t+m)/r \rfloor}, \quad (23)$$
where $P_{\theta^*}(A_{t,m} \ge M \ \text{i.o.}) = 0$.

To prove (22), apply Lemma 1 with $\mu_1(\mathbf{s}) = \delta_{\mathbf{s}_i}(\mathbf{s})$ and $\mu_2(\mathbf{s}) = P(\mathbf{S}_{-m} = \mathbf{s} \mid \mathbf{S}_{-m'} = \mathbf{s}_j, Y_{-m'}^n, X_{-m'}^n)$. Then (18) gives an upper bound for the distance between $P_\theta(S_k \in \cdot \mid \mathbf{S}_{-m} = \mathbf{s}_i, Y_{-m}^n, X_{-m}^n)$ and $P_\theta(S_k \in \cdot \mid \mathbf{S}_{-m'} = \mathbf{s}_j, Y_{-m'}^n, X_{-m'}^n)$. In other words, Lemma 1 gives an upper bound for the distance between state processes that start from different time points $-m$ and $-m'$. (22) follows by combining Lemmas 1 and 2. We give the details in the appendix.

(22) indicates that $\{\Delta_{t,m,\mathbf{s}}(\theta)\}_{m \ge 0}$ converges uniformly $P_{\theta^*}$-a.s., and the limit does not depend on $\mathbf{s}_i$. Define $\Delta_{t,\infty}(\theta) \triangleq \lim_{m \to \infty} \Delta_{t,m,\mathbf{s}}(\theta)$. We show in Lemma 6 that $\{\Delta_{t,m,\mathbf{s}}(\theta)\}$ is uniformly bounded in $L^2(P_{\theta^*})$. Thus $\Delta_{t,\infty}(\theta)$ is also well-defined in $L^2(P_{\theta^*})$. From (23), $\Delta_{t,\infty}(\theta) = \lim_{m \to \infty} \Delta_{t,m}(\theta)$. From Assumption 2, $\{\Delta_{t,\infty}\}_{t \ge 1}$ is stationary ergodic, and the ergodic theorem applies:
$$n^{-1} \sum_{t=1}^{n} \Delta_{t,\infty}(\theta) \xrightarrow{a.s.} \ell(\theta) \triangleq E_{\theta^*}[\Delta_{0,\infty}(\theta)]. \quad (24)$$
Let $m = 0$ and $m' \to \infty$ in (22), which yields
$$\frac{1}{n} \sum_{t=1}^{n} \sup_{\theta \in \Theta} \max_{\mathbf{s} \in \mathbf{S}} \big|\Delta_{t,0,\mathbf{s}}(\theta) - \Delta_{t,\infty}(\theta)\big| \xrightarrow{a.s.} 0. \quad (25)$$
(25) shows that the difference between $\frac{1}{n}\ell_n(\theta, \mathbf{s}_0)$ and $\frac{1}{n}\sum_{t=1}^{n} \Delta_{t,\infty}(\theta)$ is asymptotically negligible. Then $\frac{1}{n}\ell_n(\theta, \mathbf{s}_0)$ also converges to $\ell(\theta)$ almost surely. From Assumption 5, we can show the continuity of $\ell(\theta)$. Then $\frac{1}{n}\ell_n(\theta, \mathbf{s}_0)$ converges to $\ell(\theta)$ uniformly with respect to $\theta$, as shown in Proposition 1.

Proposition 1 (Uniform convergence). Assume 1–5. Then, $\sup_{\theta \in \Theta} |n^{-1}\ell_n(\theta, \mathbf{s}_0) - \ell(\theta)| \xrightarrow{a.s.} 0$ as $n \to \infty$.

The consistency follows once we show that $\theta^*$ is a well-separated point of maximum of $\ell(\theta)$.
Theorem 1 summarizes the finding; the details of the proof are in the appendix.

Theorem 1.
Assume 1–6. Then, for any $\mathbf{s}_0 \in \mathbf{S}$, $\hat{\theta}_{n, \mathbf{s}_0} \xrightarrow{a.s.} \theta^*$ as $n \to \infty$.

5 Asymptotic normality

This section establishes the asymptotic normality of the MLE. We need additional regularity assumptions. Assume a positive real $\delta_0$ exists such that on $G \triangleq \{\theta \in \Theta : |\theta - \theta^*| < \delta_0\}$, the following conditions hold:

Assumption 7. For all $\mathbf{s}_1 \in \mathbf{S}$, $\mathbf{s}_0 \in \mathbf{S}$ and $y_1 \in \mathcal{Y}$, $\mathbf{y}_0 \in \mathbf{Y}$, the functions $\theta \mapsto q_\theta(s_1 \mid \mathbf{s}_0, \mathbf{y}_0, x)$ and $\theta \mapsto g_\theta(y_1 \mid \mathbf{y}_0, \mathbf{s}_1, x)$ are twice continuously differentiable on $G$.

Assumption 8.
$$E_{\theta^*}\Big[\sup_{\theta \in G} \max_{s_1, \mathbf{s}_0} \big\|\nabla_\theta \log q_\theta(s_1 \mid \mathbf{s}_0, \mathbf{Y}_0, X_1)\big\|^2\Big] < \infty, \qquad E_{\theta^*}\Big[\sup_{\theta \in G} \max_{s_1, \mathbf{s}_0} \big\|\nabla_\theta^2 \log q_\theta(s_1 \mid \mathbf{s}_0, \mathbf{Y}_0, X_1)\big\|\Big] < \infty;$$
$$E_{\theta^*}\Big[\sup_{\theta \in G} \max_{\mathbf{s}_1} \big\|\nabla_\theta \log g_\theta(Y_1 \mid \mathbf{Y}_0, \mathbf{s}_1, X_1)\big\|^2\Big] < \infty, \qquad E_{\theta^*}\Big[\sup_{\theta \in G} \max_{\mathbf{s}_1} \big\|\nabla_\theta^2 \log g_\theta(Y_1 \mid \mathbf{Y}_0, \mathbf{s}_1, X_1)\big\|\Big] < \infty.$$

Assumption 9. (a) For almost all $(\mathbf{y}_0, y_1, x)$, a finite function $f_{\mathbf{y}_0, y_1, x}: \mathbf{S} \to \mathbb{R}_+$ exists such that $\sup_{\theta \in G} g_\theta(y_1 \mid \mathbf{y}_0, \mathbf{s}_1, x) \le f_{\mathbf{y}_0, y_1, x}(\mathbf{s}_1)$. (b) For almost all $(\mathbf{y}_0, \mathbf{s}_1, x)$, functions $f^1_{\mathbf{y}_0, \mathbf{s}_1, x}: \mathcal{Y} \to \mathbb{R}_+$ and $f^2_{\mathbf{y}_0, \mathbf{s}_1, x}: \mathcal{Y} \to \mathbb{R}_+$ exist in $L^1(\nu)$ such that $\|\nabla_\theta g_\theta(y_1 \mid \mathbf{y}_0, \mathbf{s}_1, x)\| \le f^1_{\mathbf{y}_0, \mathbf{s}_1, x}(y_1)$ and $\|\nabla_\theta^2 g_\theta(y_1 \mid \mathbf{y}_0, \mathbf{s}_1, x)\| \le f^2_{\mathbf{y}_0, \mathbf{s}_1, x}(y_1)$ for all $\theta \in G$.

This subsection shows the asymptotic normality of the score function. First, we show that the score function can be approximated by a sequence of integrable martingale increments. We use the Fisher identity (Cappé et al., 2009, p. 353) to write the score function in a model with missing data as the expectation of the complete score conditional on the observed data:
$$\nabla_\theta \ell_n(\theta, \mathbf{s}_0) = E_\theta\Big[ \sum_{t=1}^{n} \phi_{\theta,t} \,\Big|\, Y_1^n, \mathbf{S}_0 = \mathbf{s}_0, X_1^n \Big],$$
where
$$\phi_{\theta,k} = \nabla_\theta \log p_\theta(Y_k, S_k \mid \mathbf{Y}_{k-1}, \mathbf{S}_{k-1}, X_k) = \nabla_\theta \log\big( g_\theta(Y_k \mid \mathbf{Y}_{k-1}, \mathbf{S}_k, X_k)\, q_\theta(S_k \mid \mathbf{S}_{k-1}, \mathbf{Y}_{k-1}, X_k) \big).$$
We write the period score function as
$$\dot\Delta_{t,m,s_0}(\theta) \triangleq \nabla_\theta \Delta_{t,m,s_0}(\theta) = \nabla_\theta \log p_\theta(Y_t \mid Y^{t-1}_{-m}, S_{-m} = s_0, X^t_{-m+1})$$
$$= E_\theta\Big[\sum_{k=-m+1}^{t} \phi_{\theta,k} \,\Big|\, Y^t_{-m}, S_{-m} = s_0, X^t_{-m+1}\Big] - E_\theta\Big[\sum_{k=-m+1}^{t-1} \phi_{\theta,k} \,\Big|\, Y^{t-1}_{-m}, S_{-m} = s_0, X^{t-1}_{-m+1}\Big], \quad (26)$$
so that $\nabla_\theta \ell_n(\theta, s_0) = \sum_{t=1}^n \dot\Delta_{t,0,s_0}(\theta)$. We can similarly define
$$\dot\Delta_{t,m}(\theta) \triangleq E_\theta\Big[\sum_{k=-m+1}^{t} \phi_{\theta,k} \,\Big|\, Y^t_{-m}, X^t_{-m+1}\Big] - E_\theta\Big[\sum_{k=-m+1}^{t-1} \phi_{\theta,k} \,\Big|\, Y^{t-1}_{-m}, X^{t-1}_{-m+1}\Big]. \quad (27)$$
The stationary conditional score is constructed by conditioning on the whole history of $(Y_t, X_t)$ starting from the infinite past: $\dot\Delta_{t,\infty}(\theta^*) \triangleq \lim_{m \to \infty} \dot\Delta_{t,m}(\theta^*)$. Define the filtration $\mathcal{F}_t = \sigma\big((Y_k, X_{k+1}), -\infty < k \le t\big)$ for $t \in \mathbb{Z}$. (A.16)–(A.19) show that $\{\dot\Delta_{t,\infty}(\theta^*)\}_{t=-\infty}^{\infty}$ is an $(\mathcal{F}_t, P_{\theta^*})$-adapted stationary ergodic and square-integrable martingale increment sequence. The central limit theorem for the sums of such a sequence shows that
$$n^{-1/2} \sum_{t=1}^n \dot\Delta_{t,\infty}(\theta^*) \Rightarrow N(0, I(\theta^*)),$$
where $I(\theta^*) \triangleq E_{\theta^*}[\dot\Delta_{0,\infty}(\theta^*)\dot\Delta_{0,\infty}(\theta^*)^T]$ is the asymptotic Fisher information matrix. The next lemma shows that $\dot\Delta_{t,0,s_0}(\theta^*)$ can be approximated by $\dot\Delta_{t,\infty}(\theta^*)$ in $L^2(P_{\theta^*})$. The approximation, like Lemma 3, is derived from the results in Section 3.

Lemma 4.
Assume 2–4 and 7–8. Then, for all $s_0 \in S$,
$$E_{\theta^*}\Big\|\frac{1}{\sqrt n} \sum_{t=1}^n \big(\dot\Delta_{t,0,s_0}(\theta^*) - \dot\Delta_{t,\infty}(\theta^*)\big)\Big\|^2 \to 0.$$
By Lemma 4, $n^{-1/2} \sum_{t=1}^n \dot\Delta_{t,0,s_0}(\theta^*)$ has the same limiting distribution as $n^{-1/2} \sum_{t=1}^n \dot\Delta_{t,\infty}(\theta^*)$.

Theorem 2.
Assume 2–4 and 7–9. Then, for any $s_0 \in S$, $n^{-1/2}\nabla_\theta \ell_n(\theta^*, s_0) \Rightarrow N(0, I(\theta^*))$.

This subsection presents the law of large numbers for the observed Fisher information. We use the Louis missing information principle (Louis, 1982) to express the observed Fisher information in terms of the Hessian of the complete log-likelihood function:
$$\nabla^2_\theta \ell_n(\theta, s_0) = E_\theta\Big[\sum_{t=1}^n \dot\phi_{\theta,t} \,\Big|\, Y^n_0, S_0 = s_0, X^n_1\Big] + \mathrm{Var}_\theta\Big[\sum_{t=1}^n \phi_{\theta,t} \,\Big|\, Y^n_0, S_0 = s_0, X^n_1\Big],$$
where $\dot\phi_{\theta,t} \triangleq \nabla_\theta \phi_{\theta,t} = \nabla^2_\theta \log p_\theta(Y_t, S_t \mid Y_{t-1}, S_{t-1}, X_t)$.
We construct the stationary ergodic sequence by conditioning the observed Hessian on the entire history of $(Y_t, X_t)$ from the infinite past. The construction is similar to that in Subsection 5.1, so we leave the details to the appendix and state the result directly.

Theorem 3.
Assume 2–5 and 7–9. Let $\{\theta^*_n\}$ be any, possibly stochastic, sequence in $\Theta$ such that $\theta^*_n \xrightarrow{a.s.} \theta^*$. Then, for all $s_0 \in S$, $-n^{-1}\nabla^2_\theta \ell_n(\theta^*_n, s_0) \xrightarrow{a.s.} I(\theta^*)$.

Theorems 2 and 3 together yield the following asymptotic normality theorem.
Theorem 4.
Assume 1–6 and 7–9. Then, for any $s_0 \in S$, $n^{1/2}(\hat\theta_{n,s_0} - \theta^*) \Rightarrow N(0, I(\theta^*)^{-1})$.

From Theorems 1, 3, and 4, the negative of the inverse of the Hessian matrix evaluated at the MLE is a consistent estimator of the asymptotic variance.
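In practice, Theorems 1, 3, and 4 license computing standard errors from the inverse of the negative numerical Hessian of the log likelihood at the MLE. A minimal generic sketch, not tied to the paper's specific model (`negloglik` is any user-supplied negative log-likelihood):

```python
import numpy as np
from scipy.optimize import minimize

def mle_standard_errors(negloglik, theta0, h=1e-5):
    """Maximize the likelihood, then estimate Var(theta_hat) by the inverse
    of the observed information, i.e. the Hessian of the negative
    log-likelihood at the MLE, as justified by Theorems 1, 3, and 4."""
    res = minimize(negloglik, theta0, method="BFGS")
    theta_hat = res.x
    k = len(theta_hat)
    H = np.zeros((k, k))
    # central finite-difference Hessian of the negative log-likelihood
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.eye(k)[i] * h, np.eye(k)[j] * h
            H[i, j] = (negloglik(theta_hat + e_i + e_j)
                       - negloglik(theta_hat + e_i - e_j)
                       - negloglik(theta_hat - e_i + e_j)
                       + negloglik(theta_hat - e_i - e_j)) / (4 * h * h)
    cov = np.linalg.inv(H)          # H is minus the Hessian of the log-lik
    return theta_hat, np.sqrt(np.diag(cov))
```

The Wald tests reported in Section 7 are then ratios of estimates to these standard errors.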
We use simulation to examine the finite-sample distributions of the MLE in endogenous regime-switching models. We consider three experiments. The first is a variance-switching model with Chang et al. (2017)-type transition probabilities:
$$Y_t = \sigma(S_t) U_t, \quad (28)$$
where the process $S_t$ is determined by (8) and (10), and the innovations $U_t$ and $V_{t+1}$ are jointly i.i.d. and distributed as (9). We fix $(\sigma(0)^*, \sigma(1)^*)$ and $(\alpha^*, \tau^*)$ and consider three levels of $\rho^*$: one negative, zero, and one positive.

The second experiment is the variance-switching model in (28) with a Diebold et al. (1994)-type transition matrix
$$\begin{pmatrix} \dfrac{\exp(\beta_{q,0} + \beta_{q,1} X_t)}{1+\exp(\beta_{q,0} + \beta_{q,1} X_t)} & 1 - \dfrac{\exp(\beta_{q,0} + \beta_{q,1} X_t)}{1+\exp(\beta_{q,0} + \beta_{q,1} X_t)} \\ 1 - \dfrac{\exp(\beta_{p,0} + \beta_{p,1} X_t)}{1+\exp(\beta_{p,0} + \beta_{p,1} X_t)} & \dfrac{\exp(\beta_{p,0} + \beta_{p,1} X_t)}{1+\exp(\beta_{p,0} + \beta_{p,1} X_t)} \end{pmatrix},$$
whose rows and columns are indexed by the states 0 and 1, with $(\sigma(0)^*, \sigma(1)^*)$ as in the first experiment and fixed values of $(\beta^*_{q,0}, \beta^*_{q,1}, \beta^*_{p,0}, \beta^*_{p,1})$.

The third experiment is an ARCH-type model with a switching intercept and a leverage term:
$$Y_t = \xi_t, \quad \xi_t = \sqrt{h_t}\, U_t, \quad h_t = C(S_t) + \gamma\, 1\{\xi_{t-1} \le 0\}\xi^2_{t-1}, \quad \gamma > 0,$$
where the process $S_t$ is determined by (8) and (10), and the innovations $U_t$ and $V_{t+1}$ are jointly i.i.d. and distributed as (9). We fix $(C(0)^*, C(1)^*, \gamma^*, \alpha^*, \tau^*)'$ and again consider three levels of $\rho^*$: one negative, zero, and one positive.

For each experiment, we generate 1000 independent samples of size $n = 800$. Figure 1 gives the histograms of the standardized ML estimates in the second experiment and in the first and third experiments with negative $\rho^*$; the complete results are shown in the appendix. Each histogram is scaled so that the sum of the bar heights times the bar widths equals 1, and the density of the standard normal distribution is superimposed. The asymptotic theory appears to provide a good approximation to the finite-sample distributions.

Figure 1: Histogram of standardized MLE.
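Equations (8)–(10) are not reproduced in this section; the sketch below assumes the Chang et al. (2017) formulation in which the regime is driven by a latent AR(1) factor, $w_t = \alpha w_{t-1} + v_t$ with $S_t = 1\{w_t \ge \tau\}$, and endogeneity enters through $\mathrm{corr}(U_t, V_{t+1}) = \rho$, so that today's return shock shifts tomorrow's regime. The function and parameter names are illustrative, not the paper's code.

```python
import numpy as np

def simulate_endogenous_switching(n, sigma0, sigma1, alpha, tau, rho, seed=0):
    """Simulate the first experiment, Y_t = sigma(S_t) U_t, under a
    Chang et al. (2017)-type endogenous regime process (sketch)."""
    rng = np.random.default_rng(seed)
    # draw (U_t, V_{t+1}) jointly standard normal with correlation rho
    cov = np.array([[1.0, rho], [rho, 1.0]])
    shocks = rng.multivariate_normal([0.0, 0.0], cov, size=n + 1)
    u, v_next = shocks[:, 0], shocks[:, 1]
    w = np.zeros(n + 1)             # latent AR(1) factor
    s = np.zeros(n + 1, dtype=int)  # regime indicator
    y = np.zeros(n + 1)
    for t in range(1, n + 1):
        w[t] = alpha * w[t - 1] + v_next[t - 1]  # v_t was drawn with u_{t-1}
        s[t] = int(w[t] >= tau)
        y[t] = (sigma1 if s[t] else sigma0) * u[t]
    return y[1:], s[1:]
```

With $\rho < 0$, a large negative $U_t$ raises $V_{t+1}$'s conditional mean downward or upward depending on the sign convention chosen for (9); the sketch simply wires the correlation directly into the joint draw.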
In this section, we propose an endogenous regime-switching ARCH model to describe stock volatility. Cai (1994) and Hamilton and Susmel (1994) first considered regime-switching ARCH models to study the different volatility dynamics of crisis and normal periods, and concluded that the high persistence of volatility can be attributed to the persistence of each regime. Their models assume a basic Markov-switching regime process. In other words, the determination of volatility dynamics is purely driven by random draws and is independent of the other parts of the model, especially the stock returns. This assumption can be restrictive and conflicts with the observation that stock volatility and returns are closely related. For instance, a well-documented empirical finding is the leverage effect: stock volatility and returns are negatively correlated. Hamilton and Susmel (1994) found supportive evidence of the leverage effect in their regime-switching ARCH model. To allow the transition probability to include information from stock returns, we adopt the Chang et al. (2017) approach to model regime switching. More specifically, we allow the latent factor determining regimes to be correlated with the innovations to stock returns: if $\rho < 0$, negative return innovations today make the high-volatility regime more likely tomorrow, and if $\rho > 0$, positive return innovations do. The model is
$$Y_t = \mu + \varphi Y_{t-1} + \xi_t, \quad \xi_t = \sqrt{h_t}\, U_t, \quad h_t = C(S_t) + \gamma_1 \xi^2_{t-1} + \gamma_2\, 1\{\xi_{t-1} \le 0\}\xi^2_{t-1}, \quad \gamma_1, \gamma_2 \ge 0,$$
where $S_t$ follows the Chang et al. (2017)-type transition probability determined by (8) and (10), and the innovations $U_t$ and $V_{t+1}$ are jointly i.i.d. and distributed as (9). We restrict $C(1) > C(0) >$
0. Then the events $\{S_t = 0\}$ and $\{S_t = 1\}$ respectively represent the low-volatility and high-volatility regimes. We include $\gamma_2\, 1\{\xi_{t-1} \le 0\}\xi^2_{t-1}$ to account for the leverage effect. To study how the effect varies across regimes, we also allow $\gamma_2$ to switch, that is,
$$h_t = C(S_t) + \gamma_1 \xi^2_{t-1} + \gamma_2(S_t)\, 1\{\xi_{t-1} \le 0\}\xi^2_{t-1}, \quad \gamma_1, \gamma_2(0), \gamma_2(1) \ge 0.$$

The model is estimated using the weekly percentage log return of the Standard and Poor's 500 index. The index is collected every Tuesday and is adjusted for both dividends and splits; given the index $p_t$, the log return is calculated as $100 \times \log(p_t/p_{t-1})$. The data set is available from Yahoo Finance. The estimation results for the case where both $C$ and $\gamma_2$ are allowed to switch are reported in Table 1; the results for the model with a switch only in $C$ are listed in Table E.1.

Table 1: Maximum Likelihood Estimates for S&P500 Weekly Data

Parameter      | No-switch ARCH    | Exogenous switch in $c$ & $\gamma_2$ ($\rho = 0$) | Endogenous switch in $c$ & $\gamma_2$ ($-1 < \rho < 1$)
$\mu$          |                   |                   |
$\varphi$      | -0.1093 (0.0157)  | -0.0389 (0.0189)  | -0.0384 (0.0189)
$c(0)$         | 7.2573 (0.2106)   | 2.1660 (0.0962)   | 2.1843 (0.1036)
$c(1)$         |                   | 10.5479 (0.8946)  | 10.4745 (0.8434)
$\gamma_1$     |                   |                   |
$\gamma_2(0)$  | 0.8772 (0.1052)   | 0.2031 (0.0455)   | 0.1766 (0.0497)
$\gamma_2(1)$  |                   | 0.4098 (0.0822)   | 0.3722 (0.0757)
$\rho$         |                   |                   | -0.6961 (0.2238)
$\alpha$       |                   |                   |
$\tau$         |                   |                   |

In both specifications, $\rho$ is significant on the basis of Wald tests and likelihood ratio tests: in the specification with a switch only in $C$, the p-values of the Wald test and the likelihood ratio test are respectively 0.10% and 0.05%, and in the specification with switches in $C$ and $\gamma_2$, they are respectively 0.19% and 0.07%. The estimate of $\rho$ is negative, implying that today's bad news about stock returns makes a change into the high-volatility regime tomorrow more likely. Moreover, when we allow for endogeneity, both tests reject the hypothesis $\gamma_2(0) = \gamma_2(1)$ at the 5% level of significance, with p-values of the Wald test and the likelihood ratio test of 3% and 2.71%, respectively. The estimates of $\gamma_2(0)$ and $\gamma_2(1)$ are 0.1766 (0.0497) and 0.3722 (0.0757): the conditional variance reacts more strongly to bad news about stock returns in the high-volatility regime than in the low-volatility regime. Further, the estimates of $\gamma_2(0)$ and $\gamma_2(1)$ become smaller after we account for endogeneity, which implies that the indirect effect of stock returns on the conditional variance would be mistakenly attributed to the direct leverage effect if endogeneity were ignored.

This study proves consistency and asymptotic normality of the MLE in endogenous regime-switching models. The key step in the proof is to approximate the sequence of non-ergodic period predictive densities with a stationary ergodic process, using the mixing rate of the unobservable state process conditional on the observations. We provide almost deterministic geometrically decaying bounds for the time-varying mixing rate, based on the assumption that the observables take extreme values only with small probability. The assumptions in the proof are low-level ones that have been shown to hold in general. As an empirical application, we estimate an endogenous regime-switching ARCH model using weekly S&P500 data. Using the inference established in this study, we find strong evidence of endogeneity in regime switching: today's bad news about stock returns makes a change into the high-volatility regime tomorrow more likely.

Throughout the paper, we assume that the number of regimes is known. If the number of regimes is set greater than the true value, not all of the parameters are identified. There are few studies on criteria for selecting the proper number of regimes in endogenous regime-switching models; we leave this topic for future research.
References
Ailliot, P. and Pene, F. (2015). Consistency of the maximum likelihood estimate for non-homogeneous Markov-switching models. ESAIM: Probability and Statistics, 19:268–292.
Ang, A., Bekaert, G., and Wei, M. (2008). The term structure of real rates and expected inflation. Journal of Finance, 63(2):797–849.
Baum, L. E. and Petrie, T. (1966). Statistical inference for probabilistic functions of finite state Markov chains. Annals of Mathematical Statistics, 37(6):1554–1563.
Bekaert, G. and Harvey, C. R. (1995). Time-varying world market integration. Journal of Finance, 50(2):403–444.
Bickel, P. J. and Ritov, Y. (1996). Inference in hidden Markov models I: Local asymptotic normality in the stationary case. Bernoulli, 2(3):199–228.
Bickel, P. J., Ritov, Y., and Rydén, T. (1998). Asymptotic normality of the maximum-likelihood estimator for general hidden Markov models. Annals of Statistics, pages 1614–1635.
Brandt, A. (1986). The stochastic equation $Y_{n+1} = A_n Y_n + B_n$ with stationary coefficients. Advances in Applied Probability, 18(1):211–220.
Cai, J. (1994). A Markov model of switching-regime ARCH. Journal of Business and Economic Statistics, 12(3):309–316.
Cappé, O., Moulines, E., and Rydén, T. (2009). Inference in hidden Markov models. In Proceedings of EUSFLAT Conference, pages 14–16.
Chang, Y., Choi, Y., and Park, J. Y. (2017). A new approach to model regime switching. Journal of Econometrics, 196(1):127–143.
Chib, S. and Dueker, M. (2004). Non-Markovian regime switching with endogenous states and time-varying state strengths. FRB of St. Louis Working Paper No. 2004-030A.
Diebold, F. X., Lee, J.-H., and Weinbach, G. C. (1994). Regime switching with time-varying transition probabilities. In C. Hargreaves (ed.), Nonstationary Time Series Analysis and Cointegration (Advanced Texts in Econometrics, C. W. J. Granger and G. Mizon, eds.), pages 283–302. Oxford: Oxford University Press.
Douc, R. and Matias, C. (2001). Asymptotics of the maximum likelihood estimator for general hidden Markov models. Bernoulli, 7(3):381–420.
Douc, R., Moulines, E., and Rydén, T. (2004). Asymptotic properties of the maximum likelihood estimator in autoregressive models with Markov regime. Annals of Statistics, 32(5):2254–2304.
Durrett, R. (2010). Probability: Theory and Examples. Cambridge University Press, 2nd edition.
Filardo, A. J. (1994). Business-cycle phases and their transitional dynamics. Journal of Business and Economic Statistics, 12(3):299–308.
Filardo, A. J. and Gordon, S. F. (1998). Business cycle durations. Journal of Econometrics, 85(1):99–123.
Francq, C., Roussignol, M., and Zakoïan, J.-M. (2001). Conditional heteroskedasticity driven by hidden Markov chains. Journal of Time Series Analysis, 22(2):197–220.
Francq, C. and Zakoïan, J.-M. (2001). Stationarity of multivariate Markov-switching ARMA models. Journal of Econometrics, 102(2):339–364.
Gray, S. F. (1996). Modeling the conditional distribution of interest rates as a regime-switching process. Journal of Financial Economics, 42(1):27–62.
Hamilton, J. D. (1989). A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57(2):357–384.
Hamilton, J. D. and Susmel, R. (1994). Autoregressive conditional heteroskedasticity and changes in regime. Journal of Econometrics, 64(1):307–333.
Jensen, J. L. and Petersen, N. V. (1999). Asymptotic normality of the maximum likelihood estimator in state space models. Annals of Statistics, pages 514–535.
Kasahara, H. and Shimotsu, K. (2019). Asymptotic properties of the maximum likelihood estimator in regime switching econometric models. Journal of Econometrics, 208(2):442–467.
Kim, C.-J., Piger, J., and Startz, R. (2008). Estimation of Markov regime-switching regression models with endogenous switching. Journal of Econometrics, 143(2):263–273.
Klaassen, F. (2002). Improving GARCH volatility forecasts with regime-switching GARCH. Empirical Economics, 27(2):363–394.
Krolzig, H.-M. (1997). Markov-Switching Vector Autoregressions: Modelling, Statistical Inference, and Application to Business Cycle Analysis, volume 454. Springer-Verlag Berlin Heidelberg.
Le Gland, F. and Mevel, L. (2000). Exponential forgetting and geometric ergodicity in hidden Markov models. Mathematics of Control, Signals and Systems, 13(1):63–93.
Leroux, B. G. (1992). Maximum-likelihood estimation for hidden Markov models. Stochastic Processes and their Applications, 40(1):127–143.
Ling, S. and Li, W. K. (1997). On fractionally integrated autoregressive moving-average time series models with conditional heteroscedasticity. Journal of the American Statistical Association, 92(439):1184–1194.
Louis, T. A. (1982). Finding the observed information matrix when using the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), pages 226–233.
Pouzo, D., Psaradakis, Z., and Sola, M. (2018). Maximum likelihood estimation in possibly misspecified dynamic models with time inhomogeneous Markov regimes. Working Paper 1612.04932, arXiv.org.
Shiryaev, A. N. (1996). Probability. New York, NY: Springer, 2nd edition.
Teicher, H. (1967). Identifiability of mixtures of product measures. Annals of Mathematical Statistics, 38(4):1300–1302.
Watson, M. W. (1992). Business cycle durations and postwar stabilization of the US economy. Technical report, National Bureau of Economic Research.
Yao, J. (2001). On square-integrability of an AR process with Markov switching. Statistics and Probability Letters, 52(3):265–270.
APPENDIX A: PROOFS
For notational brevity, define $\|\phi_{\theta^*,t}\|_\infty = \max_{s_t, s_{t-1}} \|\phi_{\theta^*}((s_t, Y_t), (s_{t-1}, Y_{t-1}), X_t)\|$.

A.1 Proof of lemmas and corollaries in Section 3
Proof of Lemma 1:
The lemma is a consequence of Lemma 5 and Kasahara and Shimotsu (2019, Lemma 7). □
Proof of Lemma 2:
First, we show (20). By Assumption 4, for any $\xi > 0$,
$$P_{\theta^*}\Big(1 - \omega(V_t) \ge 1 - C e^{-[\alpha_1 r + \alpha_2(r-1)]\xi}\Big) \le r C \xi^{-\beta} + (r-1) C \xi^{-\beta}. \quad (A.1)$$
We choose $\xi_0$ such that it satisfies
$$\frac{1}{16r}\Big(1 - C e^{-[\alpha_1 r + \alpha_2(r-1)]\xi_0}\Big)^{mn} = r C \xi_0^{-\beta} + (r-1) C \xi_0^{-\beta}. \quad (A.2)$$
The existence of such $\xi_0$ is guaranteed because the left-hand side is monotone increasing in $\xi_0$ from $\frac{1}{16r}(1-C)^{mn}$ to $\frac{1}{16r}$, while the right-hand side is monotone decreasing in $\xi_0$ from $+\infty$ to 0. We define
$$\rho = 1 - C e^{-[\alpha_1 r + \alpha_2(r-1)]\xi_0} \quad (A.3)$$
and
$$I_{t_k} \triangleq 1\{1 - \omega(V_{t_k}) \ge \rho\}. \quad (A.4)$$
Notice that $\rho \in (0,1)$ and $(1-\omega(V_{t_k}))^m \le \rho^{(1-I_{t_k})m}$, so
$$E_{\theta^*}\Big[\prod_{k=1}^n (1-\omega(V_{t_k}))^m\Big] \le E_{\theta^*}\Big[\prod_{k=1}^n \rho^{(1-I_{t_k})m}\Big] = \rho^{mn} E_{\theta^*}\Big[\prod_{k=1}^n \rho^{-m I_{t_k}}\Big].$$
Next, we find an upper bound for $E_{\theta^*}[\prod_{k=1}^n \rho^{-m I_{t_k}}]$. Using the generalized Hölder inequality,
$$E_{\theta^*}\Big[\prod_{k=1}^n \rho^{-m I_{t_k}}\Big] \le \prod_{k=1}^n \big(E_{\theta^*}[\rho^{-mn I_{t_k}}]\big)^{1/n}. \quad (A.5)$$
Given (A.1) and (A.2),
$$E_{\theta^*}[\rho^{-mn I_{t_k}}] = \rho^{-mn} P_{\theta^*}(I_{t_k} = 1) + P_{\theta^*}(I_{t_k} = 0) \le \rho^{-mn} P_{\theta^*}(1-\omega(V_{t_k}) \ge \rho) + 1 \le \rho^{-mn} \cdot \frac{\rho^{mn}}{16r} + 1 \le 2. \quad (A.6)$$
The proof of (20) is completed by plugging (A.6) back into (A.5).

Next, we show (19). Let $\rho$ and $I_{t_k}$ be defined as in (A.3) and (A.4), respectively. Define
$$\varepsilon_0 = r C \xi_0^{-\beta} + (r-1) C \xi_0^{-\beta}. \quad (A.7)$$
Then $\varepsilon_0 \in (0, \frac{1}{16r})$ and $P_{\theta^*}(1-\omega(V_t) \ge \rho) \le \varepsilon_0$. Since $1-\omega(V_{t_k}) \le \rho^{1-I_{t_k}}$,
$$\prod_{k=1}^n \big(1-\omega(V_{t_k})\big) \le \rho^{\,n - \sum_{k=1}^n I_{t_k}} = \rho^n a_n \quad (A.8)$$
with $a_n := \rho^{-\sum_{k=1}^n I_{t_k}}$. Since $V_{t_k}$ is stationary and ergodic, it follows from the strong law of large numbers that $n^{-1}\sum_{k=1}^n I_{t_k} \xrightarrow{a.s.} E_{\theta^*}[I_{t_k}] \le \varepsilon_0 < 2\varepsilon_0$. Hence,
$$P_{\theta^*}\Big(\liminf_{n\to\infty}\Big\{\prod_{k=1}^n \big(1-\omega(V_{t_k})\big) \le \rho^{(1-2\varepsilon_0)n}\Big\}\Big) \ge P_{\theta^*}\Big(\liminf_{n\to\infty}\{a_n \le \rho^{-2\varepsilon_0 n}\}\Big) = 1, \quad (A.9)$$
and (19) follows.

Next, we show (21). Let $\rho$ and $\varepsilon_0$ be defined as in (A.3) and (A.7), respectively. For each $t \ge 2r$, set $\xi_t$ such that it satisfies
$$\rho^{\varepsilon_0 \lfloor (t-r)/r \rfloor} = e^{-[\alpha_1 r + \alpha_2(r-1)]\xi_t}.$$
The existence of $\xi_t$ is guaranteed, since $\rho^{\varepsilon_0 \lfloor (t-r)/r \rfloor} \in (0,1)$ and $e^{-[\alpha_1 r + \alpha_2(r-1)]\xi_t}$ is monotone decreasing in $\xi_t$ from 1 to 0. Then,
$$\sum_{t=1}^\infty P_{\theta^*}\big(\omega(V_t) \le C \rho^{\varepsilon_0 \lfloor (t-r)/r \rfloor}\big) \le (2r-1) + \sum_{t=2r}^\infty P_{\theta^*}\big(\omega(V_t) \le C e^{-[\alpha_1 r + \alpha_2(r-1)]\xi_t}\big)$$
$$\le (2r-1) + r \sum_{t=2r}^\infty P_{\theta^*}\big(\sigma^{-1}(Y_{t-1}, X_t) \le C e^{-\alpha_1 \xi_t}\big) + (r-1)\sum_{t=2r}^\infty P_{\theta^*}\big(b^{-1}(Y^t_{t-r}, X_t) \le C e^{-\alpha_2 \xi_t}\big) < \infty.$$
By the Borel–Cantelli lemma, $P_{\theta^*}\big(\omega(V_t) \le C\rho^{\varepsilon_0 \lfloor (t-r)/r \rfloor} \text{ i.o.}\big) = 0$. □

A.2 Proof of propositions and theorems in Section 4
Proof of Lemma 3:
Write
$$p_\theta(Y_t \mid Y^{t-1}_{-m}, S_{-m} = s_i, X^t_{-m+1}) = \sum_{s_t, s_{t-1}, s_{-m}} g_\theta(Y_t \mid Y_{t-1}, s_t, X_t)\, q_\theta(s_t \mid s_{t-1}, Y_{t-1}, X_t)$$
$$\times P_\theta(S_{t-1} = s_{t-1} \mid S_{-m} = s_{-m}, Y^{t-1}_{-m}, X^{t-1}_{-m+1})\, \delta_{s_i}(s_{-m}),$$
and similarly for $p_\theta(Y_t \mid Y^{t-1}_{-m'}, S_{-m'} = s_j, X^t_{-m'+1})$. It follows that
$$\big|p_\theta(Y_t \mid Y^{t-1}_{-m}, S_{-m} = s_i, X^t_{-m+1}) - p_\theta(Y_t \mid Y^{t-1}_{-m'}, S_{-m'} = s_j, X^t_{-m'+1})\big|$$
$$= \Big|\sum_{s_t, s_{t-1}, s_{-m}} g_\theta(Y_t \mid Y_{t-1}, s_t, X_t)\, q_\theta(s_t \mid s_{t-1}, Y_{t-1}, X_t)\, P_\theta(S_{t-1} = s_{t-1} \mid S_{-m} = s_{-m}, Y^{t-1}_{-m}, X^{t-1}_{-m+1})$$
$$\times \big(\delta_{s_i}(s_{-m}) - P_\theta(S_{-m} = s_{-m} \mid S_{-m'} = s_j, Y^{t-1}_{-m'}, X^{t-1}_{-m'+1})\big)\Big|$$
$$\le \prod_{i=1}^{\lfloor (t+m)/r \rfloor} \big(1 - \omega(V_{-m+ri})\big) \sum_{s_t, s_{t-1}} g_\theta(Y_t \mid Y_{t-1}, s_t, X_t)\, q_\theta(s_t \mid s_{t-1}, Y_{t-1}, X_t).$$
Moreover, we rewrite the period predictive density as
$$p_\theta(Y_t \mid Y^{t-1}_{-m}, S_{-m} = s_i, X^t_{-m+1}) = \sum_{s_t, s_{t-1}, s_{t-r-1}, s_{-m}} \Big(g_\theta(Y_t \mid Y_{t-1}, s_t, X_t)\, q_\theta(s_t \mid s_{t-1}, Y_{t-1}, X_t)$$
$$\times P_\theta(S_{t-1} = s_{t-1} \mid S_{t-r-1} = s_{t-r-1}, Y^{t-1}_{t-r-1}, X^{t-1}_{t-r})\, P_\theta(S_{t-r-1} = s_{t-r-1} \mid S_{-m} = s_{-m}, Y^{t-1}_{-m}, X^{t-1}_{-m+1})\, \delta_{s_i}(s_{-m})\Big)$$
and similarly for $p_\theta(Y_t \mid Y^{t-1}_{-m'}, S_{-m'} = s_j, X^t_{-m'+1})$. It follows that
$$p_\theta(Y_t \mid Y^{t-1}_{-m}, S_{-m} = s_i, X^t_{-m+1}) \wedge p_\theta(Y_t \mid Y^{t-1}_{-m'}, S_{-m'} = s_j, X^t_{-m'+1})$$
$$\ge \min_{s'_{t-1}, s_{t-r-1}} P_\theta(S_{t-1} = s'_{t-1} \mid S_{t-r-1} = s_{t-r-1}, Y^{t-1}_{t-r-1}, X^{t-1}_{t-r}) \sum_{s_t, s_{t-1}} g_\theta(Y_t \mid Y_{t-1}, s_t, X_t)\, q_\theta(s_t \mid s_{t-1}, Y_{t-1}, X_t)$$
$$= \omega(V_{t-1}) \sum_{s_t, s_{t-1}} g_\theta(Y_t \mid Y_{t-1}, s_t, X_t)\, q_\theta(s_t \mid s_{t-1}, Y_{t-1}, X_t).$$
By $|\log x - \log y| \le |x - y|/(x \wedge y)$,
$$\big|\log p_\theta(Y_t \mid Y^{t-1}_{-m}, S_{-m} = s_i, X^t_{-m+1}) - \log p_\theta(Y_t \mid Y^{t-1}_{-m'}, S_{-m'} = s_j, X^t_{-m'+1})\big| \le \frac{\prod_{i=1}^{\lfloor (t+m)/r \rfloor}\big(1 - \omega(V_{-m+ri})\big)}{\omega(V_{t-1})}. \quad (A.10)$$
By (19) and (21) in Lemma 2,
$$P_{\theta^*}\big(|\Delta_{t,m,s_i}(\theta) - \Delta_{t,m',s_j}(\theta)| \ge C \rho^{(1-2\varepsilon_0)\lfloor (t+m)/r \rfloor} \text{ i.o.}\big) = 0.$$
Then (22) follows because $(1-2\varepsilon_0)\lfloor (t+m)/r \rfloor \ge \lfloor (t+m)/r \rfloor/2 \ge \lfloor (t+m)/(2r) \rfloor$. □

Proof of Proposition 1: By the compactness of $\Theta$, it suffices to show that for all $\theta \in \Theta$,
$$\limsup_{\delta \to 0}\, \limsup_{n \to \infty}\, \sup_{\theta': |\theta'-\theta| \le \delta} |n^{-1}\ell_n(\theta', s_0) - \ell(\theta)| = 0, \quad P_{\theta^*}\text{-a.s.}$$
We decompose the difference as
$$\limsup_{\delta \to 0}\, \limsup_{n \to \infty}\, \sup_{\theta': |\theta'-\theta| \le \delta} |n^{-1}\ell_n(\theta', s_0) - \ell(\theta)| \le \limsup_{\delta \to 0}\, \limsup_{n \to \infty}\, \sup_{\theta': |\theta'-\theta| \le \delta} |n^{-1}\ell_n(\theta', s_0) - \ell(\theta')| + \limsup_{\delta \to 0}\, \sup_{\theta': |\theta'-\theta| \le \delta} |\ell(\theta') - \ell(\theta)|.$$
On the one hand, the first term satisfies
$$\limsup_{\delta \to 0}\, \limsup_{n \to \infty}\, \sup_{\theta': |\theta'-\theta| \le \delta} |n^{-1}\ell_n(\theta', s_0) - \ell(\theta')| \le \limsup_{n \to \infty} \frac{1}{n}\sum_{t=1}^n \sup_{\theta' \in \Theta} |\Delta_{t,0,s_0}(\theta') - \Delta_{t,\infty}(\theta')| + \limsup_{n \to \infty}\, \sup_{\theta' \in \Theta} \Big|\frac{1}{n}\sum_{t=1}^n \Delta_{t,\infty}(\theta') - \ell(\theta')\Big| = 0, \quad P_{\theta^*}\text{-a.s.},$$
where the last step follows from (25) and (24). On the other hand, note that
$$\Delta_{0,m,s_0}(\theta) = \log p_\theta(Y_0 \mid Y^{-1}_{-m}, S_{-m} = s_{-m}, X^0_{-m+1}) = \log \frac{p_\theta(Y^0_{-m+1} \mid Y_{-m}, S_{-m} = s_{-m}, X^0_{-m+1})}{p_\theta(Y^{-1}_{-m+1} \mid Y_{-m}, S_{-m} = s_{-m}, X^{-1}_{-m+1})},$$
where
$$p_\theta(Y^j_{-m+1} \mid Y_{-m}, S_{-m} = s_{-m}, X^j_{-m+1}) = \sum_{s^j_{-m+1}} \prod_{\ell=-m+1}^{j} q_\theta(s_\ell \mid s_{\ell-1}, Y_{\ell-1}, X_\ell) \prod_{\ell=-m+1}^{j} g_\theta(Y_\ell \mid Y_{\ell-1}, s_\ell, X_\ell)$$
for $j = 0, -1$. Therefore, $\Delta_{0,m,s_0}(\theta)$ is continuous with respect to $\theta$ by Assumption 5. It follows that $\Delta_{0,\infty}(\theta)$ is continuous by uniform convergence.
Then, the second term is bounded by
$$\limsup_{\delta \to 0}\, \sup_{\theta': |\theta'-\theta| \le \delta} |\ell(\theta') - \ell(\theta)| \le \limsup_{\delta \to 0}\, \sup_{\theta': |\theta'-\theta| \le \delta} E_{\theta^*}|\Delta_{0,\infty}(\theta') - \Delta_{0,\infty}(\theta)|$$
$$\le \limsup_{\delta \to 0}\, E_{\theta^*}\Big[\sup_{\theta': |\theta'-\theta| \le \delta} |\Delta_{0,\infty}(\theta') - \Delta_{0,\infty}(\theta)|\Big] = E_{\theta^*}\Big[\limsup_{\delta \to 0}\, \sup_{\theta': |\theta'-\theta| \le \delta} |\Delta_{0,\infty}(\theta') - \Delta_{0,\infty}(\theta)|\Big] = 0,$$
where the first equality follows by the dominated convergence theorem. □

Proposition 2 (Identification). Under Assumptions 2 and 6, $\ell(\theta) \le \ell(\theta^*)$, and $\ell(\theta) = \ell(\theta^*)$ if and only if $\theta = \theta^*$.

Proof:
$$\ell(\theta) = E_{\theta^*}\Big[\lim_{m \to \infty} \log p_\theta(Y_1 \mid Y^0_{-m}, S_{-m}, X^1_{-m+1})\Big] = \lim_{m \to \infty} E_{\theta^*}\Big[\log p_\theta(Y_1 \mid Y^0_{-m}, S_{-m}, X^1_{-m+1})\Big] \quad (A.11)$$
$$= \lim_{m \to \infty} E_{\theta^*}\Big[E_{\theta^*}\big[\log p_\theta(Y_1 \mid Y^0_{-m}, S_{-m}, X^1_{-m+1}) \,\big|\, Y^0_{-m}, S_{-m}, X^1_{-m+1}\big]\Big].$$
It follows that
$$\ell(\theta^*) - \ell(\theta) = \lim_{m \to \infty} E_{\theta^*}\bigg[E_{\theta^*}\Big[\log \frac{p_{\theta^*}(Y_1 \mid Y^0_{-m}, S_{-m}, X^1_{-m+1})}{p_\theta(Y_1 \mid Y^0_{-m}, S_{-m}, X^1_{-m+1})} \,\Big|\, Y^0_{-m}, S_{-m}, X^1_{-m+1}\Big]\bigg] \ge 0.$$
The nonnegativity follows because the Kullback–Leibler divergence is nonnegative, and thus the limit of its expectation is nonnegative. Next, we show that $\theta^*$ is the unique maximizer. The proof closely follows Douc et al. (2004, pp. 2269–2270). Assume $\ell(\theta) = \ell(\theta^*)$. For any $t \ge 1$ and $m \ge 0$,
$$E_{\theta^*}[\log p_\theta(Y^t_1 \mid Y^0_{-m}, S_{-m}, X^t_{-m+1})] = \sum_{k=1}^t E_{\theta^*}[\log p_\theta(Y_k \mid Y^{k-1}_{-m}, S_{-m}, X^k_{-m+1})].$$
By (A.11) and stationarity, $\lim_{m \to \infty} E_{\theta^*}[\log p_\theta(Y^t_1 \mid Y^0_{-m}, S_{-m}, X^t_{-m+1})] = t\,\ell(\theta)$.
For $1 \le k \le t - r + 1$,
$$0 = t\big(\ell(\theta^*) - \ell(\theta)\big) = \lim_{m \to \infty} E_{\theta^*}\Big[\log \frac{p_{\theta^*}(Y^t_1 \mid Y^0_{-m}, S_{-m}, X^t_{-m+1})}{p_\theta(Y^t_1 \mid Y^0_{-m}, S_{-m}, X^t_{-m+1})}\Big]$$
$$= \limsup_{m \to \infty} E_{\theta^*}\Big[\log \frac{p_{\theta^*}(Y^t_{t-k+1} \mid Y^{t-k}_1, Y^0_{-m}, S_{-m}, X^t_{-m+1})}{p_\theta(Y^t_{t-k+1} \mid Y^{t-k}_1, Y^0_{-m}, S_{-m}, X^t_{-m+1})} + \log \frac{p_{\theta^*}(Y^{t-k}_1 \mid Y^0_{-m}, S_{-m}, X^t_{-m+1})}{p_\theta(Y^{t-k}_1 \mid Y^0_{-m}, S_{-m}, X^t_{-m+1})}\Big]$$
$$\ge \limsup_{m \to \infty} E_{\theta^*}\Big[\log \frac{p_{\theta^*}(Y^t_{t-k+1} \mid Y^{t-k}_1, Y^0_{-m}, S_{-m}, X^t_{-m+1})}{p_\theta(Y^t_{t-k+1} \mid Y^{t-k}_1, Y^0_{-m}, S_{-m}, X^t_{-m+1})}\Big]$$
$$= \limsup_{m \to \infty} E_{\theta^*}\Big[\log \frac{p_{\theta^*}(Y^k_1 \mid Y^0_{-t+k}, S_{-m-t+k}, X^k_{-m+1-t+k})}{p_\theta(Y^k_1 \mid Y^0_{-t+k}, S_{-m-t+k}, X^k_{-m+1-t+k})}\Big],$$
where the inequality drops a term that, being the limit of expected Kullback–Leibler divergences, is nonnegative, and the last equality uses stationarity to shift time by $t-k$. Let $t \to \infty$. It suffices to show that, for all $t \ge 1$ and $\theta \in \Theta$,
$$\sup_{m \ge k} \bigg|E_{\theta^*}\Big[\log \frac{p_{\theta^*}(Y^t_1 \mid Y^0_1, Y^{-k}_{-m}, S_{-m}, X^t_{-m})}{p_\theta(Y^t_1 \mid Y^0_1, Y^{-k}_{-m}, S_{-m}, X^t_{-m})}\Big] - E_{\theta^*}\Big[\log \frac{p_{\theta^*}(Y^t_1 \mid Y^0_1, X^t_{t-r+1})}{p_\theta(Y^t_1 \mid Y^0_1, X^t_{t-r+1})}\Big]\bigg| \to 0 \quad (A.12)$$
as $k \to \infty$. From (A.12) and the previous inequality, if $\ell(\theta^*) = \ell(\theta)$, then $E_{\theta^*}\big[\log\big(p_{\theta^*}(Y^t_1 \mid Y^0_1, X^t_{t-r+1})/p_\theta(Y^t_1 \mid Y^0_1, X^t_{t-r+1})\big)\big] = 0$. The laws $P_{\theta^*}(Y^t_1 \in \cdot \mid Y^0_1, X^t_{t-r+1})$ and $P_\theta(Y^t_1 \in \cdot \mid Y^0_1, X^t_{t-r+1})$ therefore agree. From Assumption 6, $\theta = \theta^*$.

Next we show (A.12). Define $U_{k,m}(\theta) = \log p_\theta(Y^t_1 \mid Y^0_1, Y^{-k}_{-m}, S_{-m}, X^t_{-m})$ and $U(\theta) = \log p_\theta(Y^t_1 \mid Y^0_1, X^t_{t-r+1})$. It is enough to show that, for all $\theta \in \Theta$,
$$E_{\theta^*}\Big[\sup_{m \ge k} |U_{k,m}(\theta) - U(\theta)|\Big] \to 0, \quad \text{as } k \to \infty.$$
Put
$$A_{k,m} = p_\theta(Y^t_1 \mid Y^{-k}_{-m}, S_{-m}, X^t_{-m}), \quad A = p_\theta(Y^t_1 \mid X^t_{-r+1}),$$
$$B_{k,m} = p_\theta(Y^0_1 \mid Y^{-k}_{-m}, S_{-m}, X^0_{-m}), \quad B = p_\theta(Y^0_1 \mid X^0_{-r+1}).$$
Then
$$\big|p_\theta(Y^t_1 \mid Y^0_1, Y^{-k}_{-m}, S_{-m}, X^t_{-m}) - p_\theta(Y^t_1 \mid Y^0_1, X^t_{t-r+1})\big| = \Big|\frac{A_{k,m}}{B_{k,m}} - \frac{A}{B}\Big| \le \frac{B\,|A_{k,m} - A| + A\,|B_{k,m} - B|}{B\, B_{k,m}}. \quad (A.13)$$
For all $t \ge 1$ and $k \ge 2r$, write
$$p_\theta(Y^t_1 \mid Y^{-k}_{-m}, S_{-m}, X^t_{-m}) = \int\!\!\int p_\theta(Y^t_1 \mid Z_{-r} = z_{-r}, X^t_{-r+1})\, P_\theta(dz_{-r} \mid Z_{-k+1} = z_{-k+1}, X^{-r}_{-k+2})\, P_\theta(dz_{-k+1} \mid Y^{-k}_{-m}, S_{-m}, X^{-k+1}_{-m}),$$
$$p_\theta(Y^t_1 \mid X^t_{-r+1}) = \int\!\!\int p_\theta(Y^t_1 \mid Z_{-r} = z_{-r}, X^t_{-r+1})\, P_\theta(dz_{-r} \mid X^{-r}_{-r+1})\, P_\theta(dz_{-k+1} \mid Y^{-k}_{-m}, S_{-m}, X^{-k+1}_{-m}),$$
where $Z_t \triangleq (Y_t, S_t)'$. The second expression holds because, conditionally on $X^t_1$, $\{X_k\}_{k \ge t+1}$ is independent of $\{Z_k\}_{k \le t}$, and, conditionally on $X^t_1$, $\{X_k\}_{k \le t-1}$ is independent of $\{Z_k\}_{k \ge t}$. An upper bound for their difference is
$$\big|p_\theta(Y^t_1 \mid Y^{-k}_{-m}, S_{-m}, X^t_{-m}) - p_\theta(Y^t_1 \mid X^t_{-r+1})\big|$$
$$\le \int\!\!\int p_\theta(Y^t_1 \mid Z_{-r} = z_{-r}, X^t_{-r+1})\, \big|P_\theta(dz_{-r} \mid Z_{-k+1} = z_{-k+1}, X^{-r}_{-k+2}) - P_\theta(dz_{-r} \mid X^{-r}_{-r+1})\big|\, P_\theta(dz_{-k+1} \mid Y^{-k}_{-m}, S_{-m}, X^t_{-m})$$
$$\le b_+^{t+r} \int \big\|P_\theta(Z_{-r} \in \cdot \mid Z_{-k+1} = z_{-k+1}, X^{-r}_{-k+2}) - P_\theta(Z_{-r} \in \cdot \mid X^{-r}_{-r+1})\big\|_{TV}\, P_\theta(dz_{-k+1} \mid Y^{-k}_{-m}, S_{-m}, X^t_{-m}).$$
The total variation distance $\|P_\theta(Z_{-r} \in \cdot \mid Z_{-k+1} = z_{-k+1}, X^{-r}_{-k+2}) - P_\theta(Z_{-r} \in \cdot \mid X^{-r}_{-r+1})\|_{TV}$ goes to zero as $k$ goes to infinity, owing to the Markov property of $Z_t$ conditional on $X^{+\infty}_{-\infty}$ and Assumption 2. Thus
$$\sup_{m \ge k} \big|p_\theta(Y^t_1 \mid Y^{-k}_{-m}, S_{-m}, X^t_{-m}) - p_\theta(Y^t_1 \mid X^t_{-r+1})\big| \xrightarrow{a.s.} 0, \quad \text{as } k \to \infty.
$$
(A.14)

Using Assumption 3,
$$B = p_\theta(Y^0_1 \mid X^0_{-r+1}) = \int \sum_{s_{-r}} \Big(\prod_{k=-r+1}^{0} p_\theta(Y_k \mid Y^{k-1}_{k-r}, S_{-r} = s_{-r}, X^k_{-r+1})\Big) P_\theta(S_{-r} = s_{-r} \mid Y^{-1}_{-r}, X^0_{-r+1})\, P_\theta(Y^{-1}_{-r} \in dy^{-1}_{-r} \mid X^0_{-r+1})$$
$$\ge \int \prod_{k=-r+1}^{0} b_-(Y^k_{k-r}, X_k)\, P_\theta(Y^{-1}_{-r} \in dy^{-1}_{-r} \mid X^0_{-r+1}) > 0.$$
With $P_{\theta^*}$-probability arbitrarily close to 1, $B_{k,m}$ is uniformly bounded away from zero for $m \ge k$ and $k$ sufficiently large. By (A.13) and (A.14),
$$\sup_{m \ge k} \big|p_\theta(Y^t_1 \mid Y^0_1, Y^{-k}_{-m}, S_{-m}, X^t_{-m}) - p_\theta(Y^t_1 \mid Y^0_1, X^t_{t-r+1})\big| \xrightarrow{p} 0, \quad \text{as } k \to \infty.$$
Since $p_\theta(Y^t_1 \mid Y^0_1, Y^{-k}_{-m}, S_{-m}, X^t_{-m}) = \sum_s p_\theta(Y^t_1 \mid Y^0_1, S_0 = s, X^t_1)\, P_\theta(S_0 = s \mid Y^0_1, Y^{-k}_{-m}, S_{-m}, X^t_{-m})$, it is bounded below by $\prod_{s=1}^t b_-(Y^s_{s-r}, X_s)$ and bounded above by $b_+^t$. The same lower bound holds for $p_\theta(Y^t_1 \mid Y^0_1, X^t_{t-r+1})$. Using the inequality $|\log x - \log y| \le |x-y|/(x \wedge y)$,
$$\sup_{m \ge k} |U_{k,m}(\theta) - U(\theta)| \xrightarrow{p} 0, \quad \text{as } k \to \infty.$$
Using the bounds on $p_\theta(Y^t_1 \mid Y^0_1, Y^{-k}_{-m}, S_{-m}, X^t_{-m})$, we have $E_{\theta^*}[\sup_k \sup_{m \ge k} |U_{k,m}(\theta)|] < \infty$. The proof is completed by applying the bounded convergence theorem. □
Proof of Theorem 1: As $\theta^*$ is a well-separated maximum of $\ell(\theta)$, we have, for all $\varepsilon > 0$, $\sup_{\theta: |\theta-\theta^*| \ge \varepsilon} \ell(\theta) < \ell(\theta^*)$. Hence, for each $\varepsilon > 0$ there exists $\delta > 0$ such that $|\theta - \theta^*| > \varepsilon$ implies $\ell(\theta) < \ell(\theta^*) - \delta$, and thus,
$$P_{\theta^*}(|\hat\theta_{n,s_0} - \theta^*| > \varepsilon) \le P_{\theta^*}\big(\ell(\hat\theta_{n,s_0}) < \ell(\theta^*) - \delta\big) = P_{\theta^*}\big(\ell(\theta^*) - \ell(\hat\theta_{n,s_0}) > \delta\big).$$
The proof is completed by Proposition 1 and
$$\ell(\theta^*) - \ell(\hat\theta_{n,s_0}) = n^{-1}\ell_n(\theta^*, s_0) - \ell(\hat\theta_{n,s_0}) + \ell(\theta^*) - n^{-1}\ell_n(\theta^*, s_0)$$
$$\le n^{-1}\ell_n(\hat\theta_{n,s_0}, s_0) - \ell(\hat\theta_{n,s_0}) + \ell(\theta^*) - n^{-1}\ell_n(\theta^*, s_0) \le 2\sup_{\theta} |n^{-1}\ell_n(\theta, s_0) - \ell(\theta)|. \qquad \square$$

A.3 Proof of lemmas and theorems in Subsection 5.1
We first show that $\dot\Delta_{t,\infty}(\theta^*)$ is well defined in $L^2(P_{\theta^*})$. Since $\{E_{\theta^*}[\phi_{\theta^*,k} \mid Y^t_{-m}, X^t_{-m}]\}_{m \ge 0}$ is a martingale, by Jensen's inequality $\{\|E_{\theta^*}[\phi_{\theta^*,k} \mid Y^t_{-m}, X^t_{-m}]\|^2\}_{m \ge 0}$ is a submartingale. Moreover, for any $m$,
$$E_{\theta^*}\big[\|E_{\theta^*}[\phi_{\theta^*,k} \mid Y^t_{-m}, X^t_{-m}]\|^2\big] \le E_{\theta^*}\big[E_{\theta^*}[\|\phi_{\theta^*,k}\|^2 \mid Y^t_{-m}, X^t_{-m}]\big] = E_{\theta^*}\big[\|\phi_{\theta^*,k}\|^2\big] < \infty$$
under Assumption 8. Then, by the martingale convergence theorem (see, e.g., Shiryaev (1996, p.508)),
$$E_{\theta^*}[\phi_{\theta^*,k} \mid Y^t_{-m}, X^t_{-m}] \xrightarrow{a.s.} E_{\theta^*}[\phi_{\theta^*,k} \mid Y^t_{-\infty}, X^t_{-\infty}] \quad (A.15)$$
as $m \to \infty$, and
$$E_{\theta^*}\big[\|E_{\theta^*}[\phi_{\theta^*,k} \mid Y^t_{-\infty}, X^t_{-\infty}]\|^2\big] < \infty. \quad (A.16)$$
On the other hand, by setting $m = \infty$ in (B.13),
$$E_{\theta^*}\Big[\sum_{k=-\infty}^{t-1} \big\|E_{\theta^*}[\phi_{\theta^*,k} \mid Y^t_{-\infty}, X^t_{-\infty}] - E_{\theta^*}[\phi_{\theta^*,k} \mid Y^{t-1}_{-\infty}, X^{t-1}_{-\infty}]\big\|\Big] < \infty. \quad (A.17)$$
Combining (A.16) and (A.17), $\dot\Delta_{t,\infty}(\theta^*)$ is well defined in $L^2(P_{\theta^*})$. To see that $\{\dot\Delta_{t,\infty}(\theta^*)\}_{t=-\infty}^{\infty}$ is an $(\mathcal{F}_t, P_{\theta^*})$-adapted martingale increment sequence, note that $E_{\theta^*}[\dot\Delta_{t,\infty} \mid \mathcal{F}_{t-1}] = 0$ follows from
$$E_{\theta^*}\Big[\sum_{k=-\infty}^{t-1} \big(E_{\theta^*}[\phi_{\theta^*,k} \mid Y^t_{-\infty}, X^t_{-\infty}] - E_{\theta^*}[\phi_{\theta^*,k} \mid Y^{t-1}_{-\infty}, X^{t-1}_{-\infty}]\big) \,\Big|\, Y^{t-1}_{-\infty}, X^{t-1}_{-\infty}\Big] = 0 \quad (A.18)$$
and
$$E_{\theta^*}[\phi_{\theta^*,t} \mid Y^{t-1}_{-\infty}, X^{t-1}_{-\infty}] = E_{\theta^*}\big[E_{\theta^*}[\phi_{\theta^*,t} \mid Y^{t-1}_{-\infty}, S_{t-1}, X^{t-1}_{-\infty}] \,\big|\, Y^{t-1}_{-\infty}, X^{t-1}_{-\infty}\big] = 0. \quad (A.19)$$

Proof of Lemma 4:
It suffices to show $\lim_{n \to \infty} E_{\theta^*}\big[\|\frac{1}{\sqrt n}\sum_{t=1}^n (\dot\Delta_{t,0,s_0}(\theta^*) - \dot\Delta_{t,0}(\theta^*))\|^2\big] = 0$ and $\lim_{n \to \infty} E_{\theta^*}\big[\|\frac{1}{\sqrt n}\sum_{t=1}^n (\dot\Delta_{t,0}(\theta^*) - \dot\Delta_{t,\infty}(\theta^*))\|^2\big] = 0$.

First, consider the first term. Using the Minkowski inequality and (B.11),
$$\Big(E_{\theta^*}\Big[\Big\|\frac{1}{\sqrt n}\sum_{t=1}^n \big(\dot\Delta_{t,0,s_0}(\theta^*) - \dot\Delta_{t,0}(\theta^*)\big)\Big\|^2\Big]\Big)^{1/2} = \Big(E_{\theta^*}\Big[\Big\|\frac{1}{\sqrt n}\sum_{t=1}^n \big(E_{\theta^*}[\phi_{\theta^*,t} \mid Y^t_0, S_0 = s_0, X^t_1] - E_{\theta^*}[\phi_{\theta^*,t} \mid Y^t_0, X^t_1]\big)\Big\|^2\Big]\Big)^{1/2}$$
$$\le \frac{1}{\sqrt n}\sum_{t=1}^n \Big(E_{\theta^*}\big[\|E_{\theta^*}[\phi_{\theta^*,t} \mid Y^t_0, S_0 = s_0, X^t_1] - E_{\theta^*}[\phi_{\theta^*,t} \mid Y^t_0, X^t_1]\|^2\big]\Big)^{1/2}$$
$$\le \frac{1}{\sqrt n}\sum_{t=1}^n \big(E_{\theta^*}[\|\phi_{\theta^*,0}\|^2_\infty]\big)^{1/2} \rho^{\lfloor (t-1)/(2r) \rfloor} \le \frac{1}{\sqrt n}\big(E_{\theta^*}[\|\phi_{\theta^*,0}\|^2_\infty]\big)^{1/2} \frac{2r}{1-\rho},$$
which converges to zero as $n \to \infty$.

Next, consider the second term. We have
$$\big(E_{\theta^*}[\|\dot\Delta_{t,0}(\theta^*) - \dot\Delta_{t,\infty}(\theta^*)\|^2]\big)^{1/2} \le \big(E_{\theta^*}[\|E_{\theta^*}[\phi_{\theta^*,t} \mid Y^t_{-\infty}, X^t_{-\infty}] - E_{\theta^*}[\phi_{\theta^*,t} \mid Y^t_0, X^t_1]\|^2]\big)^{1/2}$$
$$+ \sum_{k=1}^{t-1} \Big(E_{\theta^*}\big[\big\|\big(E_{\theta^*}[\phi_{\theta^*,k} \mid Y^t_{-\infty}, X^t_{-\infty}] - E_{\theta^*}[\phi_{\theta^*,k} \mid Y^{t-1}_{-\infty}, X^{t-1}_{-\infty}]\big) - \big(E_{\theta^*}[\phi_{\theta^*,k} \mid Y^t_0, X^t_1] - E_{\theta^*}[\phi_{\theta^*,k} \mid Y^{t-1}_0, X^{t-1}_1]\big)\big\|^2\big]\Big)^{1/2}$$
$$+ \sum_{k=-\infty}^{0} \big(E_{\theta^*}[\|E_{\theta^*}[\phi_{\theta^*,k} \mid Y^t_{-\infty}, X^t_{-\infty}] - E_{\theta^*}[\phi_{\theta^*,k} \mid Y^{t-1}_{-\infty}, X^{t-1}_{-\infty}]\|^2]\big)^{1/2}$$
$$\le \big(E_{\theta^*}[\|\phi_{\theta^*,0}\|^2_\infty]\big)^{1/2}\Big(\rho^{\lfloor (t-1)/(2r) \rfloor} + 2\sum_{k=1}^{t-1} \rho^{\lfloor (k-1)/(2r) \rfloor} \wedge \rho^{\lfloor (t-1-k)/(2r) \rfloor} + 2\sum_{k=-\infty}^{0} \rho^{\lfloor (t-1-k)/(2r) \rfloor}\Big),$$
which is bounded by a constant multiple of $\rho^{\lfloor (t-1)/(4r) \rfloor}$. Using the Minkowski inequality again,
$$\Big(E_{\theta^*}\Big[\Big\|\frac{1}{\sqrt n}\sum_{t=1}^n \big(\dot\Delta_{t,0}(\theta^*) - \dot\Delta_{t,\infty}(\theta^*)\big)\Big\|^2\Big]\Big)^{1/2} \le \frac{1}{\sqrt n}\sum_{t=1}^n \big(E_{\theta^*}[\|\dot\Delta_{t,0}(\theta^*) - \dot\Delta_{t,\infty}(\theta^*)\|^2]\big)^{1/2},$$
which converges to 0 as $n \to \infty$. □
We define
$$\Gamma_{t,m,s_{-m}}(\theta) \triangleq E_\theta\Big[\sum_{k=-m+1}^{t}\dot\phi_{\theta,k}\,\Big|\,Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t\Big] - E_\theta\Big[\sum_{k=-m+1}^{t-1}\dot\phi_{\theta,k}\,\Big|\,Y_{-m}^{t-1}, S_{-m}=s_{-m}, X_{-m}^{t-1}\Big],$$
$$\Phi_{t,m,s_{-m}}(\theta) \triangleq \mathrm{Var}_\theta\Big[\sum_{k=-m+1}^{t}\phi_{\theta,k}\,\Big|\,Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t\Big] - \mathrm{Var}_\theta\Big[\sum_{k=-m+1}^{t-1}\phi_{\theta,k}\,\Big|\,Y_{-m}^{t-1}, S_{-m}=s_{-m}, X_{-m}^{t-1}\Big],$$
so that $\nabla_\theta^2\,\ell_n(\theta, s_0) = \sum_{t=1}^n\big(\Gamma_{t,0,s_0}(\theta) + \Phi_{t,0,s_0}(\theta)\big)$. Similarly, we define
$$\Gamma_{t,m}(\theta) \triangleq E_\theta\Big[\sum_{k=-m+1}^{t}\dot\phi_{\theta,k}\,\Big|\,Y_{-m}^t, X_{-m}^t\Big] - E_\theta\Big[\sum_{k=-m+1}^{t-1}\dot\phi_{\theta,k}\,\Big|\,Y_{-m}^{t-1}, X_{-m}^{t-1}\Big],$$
$$\Phi_{t,m}(\theta) \triangleq \mathrm{Var}_\theta\Big[\sum_{k=-m+1}^{t}\phi_{\theta,k}\,\Big|\,Y_{-m}^t, X_{-m}^t\Big] - \mathrm{Var}_\theta\Big[\sum_{k=-m+1}^{t-1}\phi_{\theta,k}\,\Big|\,Y_{-m}^{t-1}, X_{-m}^{t-1}\Big].$$
Lemmas 12 and 13 show that $\{\Gamma_{t,m,s_{-m}}(\theta)\}_{m\ge0}$ converges uniformly with respect to $\theta\in G$, $P_{\theta^*}$-a.s. and in $L^1(P_{\theta^*})$, to a random variable that we denote by $\Gamma_{t,\infty}(\theta)$, and the limit does not depend on $s_{-m}$. Lemmas 14 and 15 show similar results for $\{\Phi_{t,m,s_{-m}}(\theta)\}_{m\ge0}$. We construct the stationary ergodic sequence by conditioning the observed Hessian on the entire history of $(Y_t, X_t)$ from the infinite past: $\Gamma_{t,\infty}(\theta) = \lim_{m\to\infty}\Gamma_{t,m,s_{-m}}(\theta)$ and $\Phi_{t,\infty}(\theta) = \lim_{m\to\infty}\Phi_{t,m,s_{-m}}(\theta)$. Using the Louis missing information principle and Assumption 9,
$$E_{\theta^*}[\Gamma_{0,\infty}(\theta^*) + \Phi_{0,\infty}(\theta^*)] = -E_{\theta^*}[\dot\Delta_{0,\infty}(\theta^*)\dot\Delta_{0,\infty}(\theta^*)^T] = -I(\theta^*). \quad\text{(A.20)}$$
Propositions 3 and 4 and (A.20) together yield Theorem 3.

Proposition 3.
Assume 2–5 and 7–8. Then, for all $s_0\in S$ and $\theta\in G$,
$$\lim_{\delta\to0}\lim_{n\to\infty}\sup_{\theta':|\theta'-\theta|<\delta}\Big|\frac{1}{n}\sum_{t=1}^n\Gamma_{t,0,s_0}(\theta') - E_{\theta^*}[\Gamma_{0,\infty}(\theta)]\Big| = 0, \quad P_{\theta^*}\text{-a.s.}$$

Proof:
Write
$$\lim_{\delta\to0}\lim_{n\to\infty}\sup_{\theta':|\theta'-\theta|<\delta}\Big|\frac{1}{n}\sum_{t=1}^n\Gamma_{t,0,s_0}(\theta') - E_{\theta^*}[\Gamma_{0,\infty}(\theta)]\Big| \le \lim_{\delta\to0}\lim_{n\to\infty}\sup_{\theta':|\theta'-\theta|<\delta}\frac{1}{n}\sum_{t=1}^n\big|\Gamma_{t,0,s_0}(\theta') - \Gamma_{t,\infty}(\theta')\big|$$
$$+ \lim_{\delta\to0}\lim_{n\to\infty}\sup_{\theta':|\theta'-\theta|<\delta}\Big|\frac{1}{n}\sum_{t=1}^n\Gamma_{t,\infty}(\theta') - E_{\theta^*}[\Gamma_{0,\infty}(\theta')]\Big| + \lim_{\delta\to0}\sup_{\theta':|\theta'-\theta|<\delta}E_{\theta^*}\big|\Gamma_{0,\infty}(\theta') - \Gamma_{0,\infty}(\theta)\big|.$$
The first term on the right-hand side is zero $P_{\theta^*}$-a.s. by Lemma 12. The second term is zero $P_{\theta^*}$-a.s., owing to the ergodic theorem. The third term is zero by the continuity of $\Gamma_{0,\infty}(\theta)$ in $L^1(P_{\theta^*})$, the proof of which follows an argument similar to that of Douc et al. (2004, Lemma 14) and is omitted here. $\Box$

Proposition 4.
Assume 2–5 and 7–8. Then, for all $s_0\in S$ and $\theta\in G$,
$$\lim_{\delta\to0}\lim_{n\to\infty}\sup_{\theta':|\theta'-\theta|<\delta}\Big|\frac{1}{n}\sum_{t=1}^n\Phi_{t,0,s_0}(\theta') - E_{\theta^*}[\Phi_{0,\infty}(\theta)]\Big| = 0, \quad P_{\theta^*}\text{-a.s.}$$
Proof: By Lemmas 14 and 15, $\{\Phi_{t,m,s_{-m}}(\theta)\}_{m\ge0}$ is a uniform Cauchy sequence with respect to $\theta$, $P_{\theta^*}$-a.s. and in $L^1(P_{\theta^*})$. The rest of the proof of Proposition 4 follows along the same lines as the proof of Proposition 3. $\Box$

Appendix B: Auxiliary results
Lemma 5 (Minorization condition). Let $m, n\in\mathbb{Z}$ with $-m\le n$ and $\theta\in\Theta$. Conditionally on $Y_{-m}^n$ and $X_{-m}^n$, $\{S_k\}_{-m\le k\le n}$ satisfies the Markov property. Assume 3. Then, for all $-m+r\le k\le n$, a function $\mu_k(Y_{k-r}^n, X_k^n, A)$ exists, such that: (a) for any $A\in\mathcal{P}(S)$, $(y_{k-r}^n, x_k^n)\mapsto\mu_k(y_{k-r}^n, x_k^n, A)$ is a Borel function; and (b) for any $y_{k-r}^n$ and $x_k^n$, $\mu_k(y_{k-r}^n, x_k^n,\cdot)$ is a probability measure on $\mathcal{P}(S)$. Moreover, for $A\in\mathcal{P}(S)$, the following holds:
$$\min_{s_{k-r}\in S} P_\theta(S_k\in A\mid S_{k-r}=s_{k-r}, Y_{-m}^n, X_{-m}^n) \ge \omega(Y_{k-r}^{k-1}, X_{k-r+1}^k)\cdot\mu_k(Y_{k-r}^n, X_k^n, A),$$
where
$$\omega(Y_{k-r}^{k-1}, X_{k-r+1}^k) := \frac{\prod_{\ell=k-r+1}^{k}\sigma_-(Y^{\ell-1}, X_\ell)\prod_{\ell=k-r+1}^{k-1}b_-(Y_{\ell-r}^\ell, X_\ell)}{b_+^{\,r-1}}.$$

Proof:
For $-m\le k\le n$, $\{S_k\}$ is a Markov chain conditionally, since
$$p_\theta(S_k\mid S_{-m}^{k-1}, Y_{-m}^n, X_{-m}^n) = \frac{p_\theta(S_k, Y_k^n\mid S_{-m}^{k-1}, Y_{-m}^{k-1}, X_{-m}^n)}{p_\theta(Y_k^n\mid S_{-m}^{k-1}, Y_{-m}^{k-1}, X_{-m}^n)} = \frac{p_\theta(S_k, Y_k^n\mid S_{k-1}, Y^{k-1}, X_k^n)}{p_\theta(Y_k^n\mid S_{k-1}, Y^{k-1}, X_k^n)} = p_\theta(S_k\mid S_{k-1}, Y_{k-1}^n, X_k^n).$$
To see the minorization condition, observe that
$$P_\theta(S_k\in A\mid S_{k-r}, Y_{-m}^n, X_{-m}^n) = P_\theta(S_k\in A\mid S_{k-r}, Y_{k-r}^n, X_{k-r+1}^n) = \frac{\sum_{s_k\in A} p_\theta(Y_k^n\mid S_k=s_k, Y^{k-1}, X_k^n)\,P_\theta(S_k=s_k\mid S_{k-r}, Y_{k-r}^{k-1}, X_{k-r+1}^k)}{\sum_{s_k\in S} p_\theta(Y_k^n\mid S_k=s_k, Y^{k-1}, X_k^n)\,P_\theta(S_k=s_k\mid S_{k-r}, Y_{k-r}^{k-1}, X_{k-r+1}^k)}.$$
Since
$$P_\theta(S_k=s_k\mid s_{k-r}, Y_{k-r}^{k-1}, X_{k-r+1}^k) = \frac{\sum_{s_{k-r+1}^{k-1}}\prod_{\ell=k-r+1}^{k-1}g_\theta(Y_\ell\mid Y^{\ell-1}, s_\ell, X_\ell)\prod_{\ell=k-r+1}^{k}q_\theta(s_\ell\mid s_{\ell-1}, Y^{\ell-1}, X_\ell)}{\sum_{s_{k-r+1}^{k}}\prod_{\ell=k-r+1}^{k-1}g_\theta(Y_\ell\mid Y^{\ell-1}, s_\ell, X_\ell)\prod_{\ell=k-r+1}^{k}q_\theta(s_\ell\mid s_{\ell-1}, Y^{\ell-1}, X_\ell)} \ge \frac{\prod_{\ell=k-r+1}^{k-1}b_-(Y_{\ell-r}^\ell, X_\ell)\prod_{\ell=k-r+1}^{k}\sigma_-(Y^{\ell-1}, X_\ell)}{b_+^{\,r-1}} = \omega(Y_{k-r}^{k-1}, X_{k-r+1}^k),$$
it readily follows that
$$P_\theta(S_k\in A\mid S_{k-r}, Y_{-m}^n, X_{-m}^n) \ge \omega(Y_{k-r}^{k-1}, X_{k-r+1}^k)\,\mu_k(Y_{k-r}^n, X_k^n, A)$$
with
$$\mu_k(Y_{k-r}^n, X_k^n, A) \triangleq \frac{\sum_{s_k\in A} p_\theta(Y_k^n\mid S_k=s_k, Y^{k-1}, X_k^n)}{\sum_{s_k\in S} p_\theta(Y_k^n\mid S_k=s_k, Y^{k-1}, X_k^n)}. \quad\text{(B.1)}$$
In order for $\mu_k(Y_{k-r}^n, X_k^n, A)$ to be a well-defined probability measure, we need show only that the denominator is strictly positive. The summand term in the denominator of (B.1) is
$$p_\theta(Y_k^n\mid S_k=s_k, Y^{k-1}, X_k^n) = \sum_{s_{k+1}^n}\prod_{\ell=k}^{n}g_\theta(Y_\ell\mid Y^{\ell-1}, s_\ell, X_\ell)\prod_{\ell=k+1}^{n}q_\theta(s_\ell\mid s_{\ell-1}, Y^{\ell-1}, X_\ell) \ge \prod_{\ell=k}^{n}b_-(Y_{\ell-r}^\ell, X_\ell)\prod_{\ell=k+1}^{n}\sigma_-(Y^{\ell-1}, X_\ell) > 0. \qquad\Box$$

Lemma 6.
Assume 2–4. For $m'\ge m\ge0$,
$$\sup_{\theta\in\Theta}\sup_{m\ge0}\max_{s_{-m}\in S}|\Delta_{t,m,s_{-m}}(\theta)| \le \max\big\{|\log b_+|,\ |\log(b_-(Y_{t-r}^t, X_t))|\big\}. \quad\text{(B.2)}$$
Proof: (23) follows by replacing $P_\theta(S_{-m}=s_{-m}\mid S_{-m'}=s_j, Y_{-m'}^{t-1}, X_{-m'+1}^t)$ in the proof of Lemma 3 with $P_\theta(S_{-m}=s_{-m}\mid Y_{-m}^{t-1}, X_{-m+1}^t)$. (B.2) follows from
$$b_-(Y_{t-r}^t, X_t) \le p_\theta(Y_t\mid Y_{-m}^{t-1}, S_{-m}=s_i, X_{-m+1}^t) \le b_+. \qquad\Box$$

Lemma 7 (Minorization condition of the time-reversed Markov process). Let $m, n\in\mathbb{Z}$ with $-m\le n$ and $\theta\in\Theta$. Conditionally on $Y_{-m}^n$ and $X_{-m}^n$, $\{S_{n-k}\}_{0\le k\le n+m}$ satisfies the Markov property. Assume 3. Then, for all $r\le k\le n+m$, a function $\tilde\mu_k(Y_{-m}^{n-k+r-1}, X_{-m}^{n-k+r}, A)$ exists such that: (a) for any $A\in\mathcal{P}(S)$, $(y_{-m}^{n-k+r-1}, x_{-m}^{n-k+r})\mapsto\tilde\mu_k(y_{-m}^{n-k+r-1}, x_{-m}^{n-k+r}, A)$ is a Borel function; and (b) for any $y_{-m}^{n-k+r-1}$ and $x_{-m}^{n-k+r}$, $\tilde\mu_k(y_{-m}^{n-k+r-1}, x_{-m}^{n-k+r},\cdot)$ is a probability measure on $\mathcal{P}(S)$. Moreover, for $A\in\mathcal{P}(S)$, the following holds:
$$\min_{s_{n-k+r}\in S} P_\theta(S_{n-k}\in A\mid S_{n-k+r}=s_{n-k+r}, Y_{-m}^n, X_{-m}^n) \ge \omega(Y_{n-k-r+1}^{n-k+r-1}, X_{n-k+1}^{n-k+r})\cdot\tilde\mu_k(Y_{-m}^{n-k+r-1}, X_{-m}^{n-k+r}, A),$$
where
$$\omega(Y_{n-k-r+1}^{n-k+r-1}, X_{n-k+1}^{n-k+r}) \triangleq \frac{\prod_{\ell=n-k+1}^{n-k+r-1}b_-(Y_{\ell-r}^\ell, X_\ell)\prod_{\ell=n-k+1}^{n-k+r}\sigma_-(Y^{\ell-1}, X_\ell)}{b_+^{\,r-1}}.$$

Proof:
To observe the Markov property, for $2\le k\le m+n$,
$$p_\theta(S_{n-k}\mid S_{n-k+1}^n, Y_{-m}^n, X_{-m}^n) = \frac{p_\theta(S_{n-k+2}^n, Y_{n-k+1}^n\mid S_{n-k+1}, Y^{n-k}, X_{-m}^n)\,p_\theta(S_{n-k}^{n-k+1}, Y_{-m}^{n-k}\mid X_{-m}^n)}{p_\theta(S_{n-k+2}^n, Y_{n-k+1}^n\mid S_{n-k+1}, Y^{n-k}, X_{-m}^n)\,p_\theta(S_{n-k+1}, Y_{-m}^{n-k}\mid X_{-m}^n)} = p_\theta(S_{n-k}\mid S_{n-k+1}, Y_{-m}^{n-k}, X_{-m}^{n-k+1}). \quad\text{(B.3)--(B.5)}$$
To observe the minorization condition, note that
$$P_\theta(S_{n-k}\in A\mid S_{n-k+r}, Y_{-m}^n, X_{-m}^n) = P_\theta(S_{n-k}\in A\mid S_{n-k+r}, Y_{-m}^{n-k+r-1}, X_{-m}^{n-k+r}) = \frac{\sum_{s_{n-k}\in A} p_\theta(S_{n-k}=s_{n-k}, Y_{-m}^{n-k+r-1}, X_{-m}^{n-k+r})\,p_\theta(S_{n-k+r}\mid S_{n-k}=s_{n-k}, Y_{n-k}^{n-k+r-1}, X_{n-k+1}^{n-k+r})}{\sum_{s_{n-k}\in S} p_\theta(S_{n-k}=s_{n-k}, Y_{-m}^{n-k+r-1}, X_{-m}^{n-k+r})\,p_\theta(S_{n-k+r}\mid S_{n-k}=s_{n-k}, Y_{n-k}^{n-k+r-1}, X_{n-k+1}^{n-k+r})}.$$
Since $p_\theta(S_{n-k+r}\mid S_{n-k}=s_{n-k}, Y_{n-k}^{n-k+r-1}, X_{n-k+1}^{n-k+r}) \le 1$, and for all $s_{n-k+r}, s_{n-k}\in S$,
$$P_\theta(S_{n-k+r}=s_{n-k+r}\mid S_{n-k}=s_{n-k}, Y_{n-k}^{n-k+r-1}, X_{n-k+1}^{n-k+r}) \ge \omega(Y_{n-k-r+1}^{n-k+r-1}, X_{n-k+1}^{n-k+r}),$$
it readily follows that
$$P_\theta(S_{n-k}\in A\mid S_{n-k+r}, Y_{-m}^n, X_{-m}^n) \ge \omega(Y_{n-k-r+1}^{n-k+r-1}, X_{n-k+1}^{n-k+r})\,\tilde\mu_k(Y_{-m}^{n-k+r-1}, X_{-m}^{n-k+r}, A)$$
with $\tilde\mu_k(Y_{-m}^{n-k+r-1}, X_{-m}^{n-k+r}, A) \triangleq P_\theta(S_{n-k}\in A\mid Y_{-m}^{n-k+r-1}, X_{-m}^{n-k+r})$. $\Box$

Lemma 8 (Uniform ergodicity of the time-reversed Markov process). Assume 3. Let $m, n\in\mathbb{Z}$, $-m\le n$, and $\theta\in\Theta$. Then, for $-m\le k\le n$, for all probability measures $\mu_1$ and $\mu_2$ defined on $\mathcal{P}(S)$ and all $Y_{-m}^n$,
$$\Big\|\sum_{s\in S}P_\theta(S_k\in\cdot\mid S_n=s, Y_{-m}^n, X_{-m}^n)\mu_1(s) - \sum_{s\in S}P_\theta(S_k\in\cdot\mid S_n=s, Y_{-m}^n, X_{-m}^n)\mu_2(s)\Big\|_{TV} \le \prod_{i=1}^{\lfloor(n-k)/r\rfloor}\big(1-\omega(Y_{n-ri-r+1}^{n-ri+r-1}, X_{n-ri+1}^{n-ri+r})\big) = \prod_{i=1}^{\lfloor(n-k)/r\rfloor}\big(1-\omega(V_{n+r-ri})\big).$$
Moreover, the same bound holds when the conditioning additionally includes $S_{-m}$:
$$\Big\|\sum_{s\in S}P_\theta(S_k\in\cdot\mid S_n=s, Y_{-m}^n, X_{-m}^n, S_{-m})\mu_1(s) - \sum_{s\in S}P_\theta(S_k\in\cdot\mid S_n=s, Y_{-m}^n, X_{-m}^n, S_{-m})\mu_2(s)\Big\|_{TV} \le \prod_{i=1}^{\lfloor(n-k)/r\rfloor}\big(1-\omega(V_{n+r-ri})\big).$$

Lemma 9.
Assume 2–4. Then, $\varepsilon\in(0, 1/2r)$ and $\rho\in(0,1)$ exist such that for all $m, n\in\mathbb{Z}_+$,
$$E_{\theta^*}\Big[\prod_{k=1}^n\big(1-\omega(V_{t_k^1})\big)^m \wedge \prod_{k=1}^n\big(1-\omega(V_{t_k^2})\big)^m\Big] \le \rho^n, \quad\text{(B.6)}$$
$$E_{\theta^*}\Big[\prod_{k=1}^n\big(1-\omega(V_{t_k^1})\big)^m \wedge \prod_{k=1}^n\big(1-\omega(V_{t_k^2})\big)^m \wedge \prod_{k=1}^n\big(1-\omega(V_{t_k^3})\big)^m\Big] \le \rho^n. \quad\text{(B.7)}$$
Proof: (B.6) follows from (20) given
$$E_{\theta^*}\Big[\prod_{k=1}^n\big(1-\omega(V_{t_k^1})\big)^m \wedge \prod_{k=1}^n\big(1-\omega(V_{t_k^2})\big)^m\Big] \le \min\Big\{E_{\theta^*}\Big[\prod_{k=1}^n\big(1-\omega(V_{t_k^1})\big)^m\Big],\ E_{\theta^*}\Big[\prod_{k=1}^n\big(1-\omega(V_{t_k^2})\big)^m\Big]\Big\}.$$
(B.7) similarly follows. $\Box$

Lemma 10.
Assume 2–4. For $1\le m\le n$, let $\{t_{k_i}\}_{1\le i\le a_m}$ and $\{t_{h_i}\}_{1\le i\le b_m}$ be two sequences of integers satisfying (i) $t_{k_i} < t_{k_{i'}}$ for $1\le i<i'\le a_m$; (ii) $t_{h_i} < t_{h_{i'}}$ for $1\le i<i'\le b_m$; and (iii) $t_{k_i}\ne t_{h_{i'}}$ for all $1\le i\le a_m$ and $1\le i'\le b_m$. Then it holds that, for the same $\varepsilon$ and $\rho$ defined as in Lemma 2, there exists a random sequence $\{A_{t_0,t_1}\}$ such that
$$\sum_{m=1}^{n}\Big[\prod_{i=1}^{a_m}\big(1-\omega(V_{t_{k_i}})\big) \wedge \prod_{i=1}^{b_m}\big(1-\omega(V_{t_{h_i}})\big)\Big] \le \rho^{-\varepsilon(t_1-t_0+1)}A_{t_0,t_1}\sum_{m=1}^{n}\rho^{a_m\vee b_m}, \quad\text{(B.8)}$$
where $t_0 = \min\{t_{k_1}, t_{h_1}\}$, $t_1 = \max\{t_{k_{a_m}}, t_{h_{b_m}}\}$, and $P_{\theta^*}(A_{t_0,t_1}\ge M\ i.o.) = 0$ for a constant $M<\infty$.

For $1\le m\le n$, let $\{t_{k_i}\}_{1\le i\le a_m}$, $\{t_{h_i}\}_{1\le i\le b_m}$, and $\{t_{\ell_i}\}_{1\le i\le c_m}$ be three sequences of integers, each strictly increasing and pairwise non-overlapping in the same sense; then the analogous bound with $\rho^{a_m\vee b_m\vee c_m}$ on the right-hand side holds, which we refer to as (B.9).

Proof: First, we show (B.8). We define $\rho$, $I_t$, and $\varepsilon$ as in (A.3), (A.4), and (A.7), respectively. Notice that $\varepsilon\in(0, 1/2r)$. Using $1-\omega(V_t)\le\rho^{1-I_t}$, it follows that
$$\prod_{i=1}^{a_m}\big(1-\omega(V_{t_{k_i}})\big) \le \rho^{a_m-\sum_{i=1}^{a_m}I_{t_{k_i}}} \le \rho^{a_m-\sum_{t=t_0}^{t_1}I_t}, \qquad \prod_{i=1}^{b_m}\big(1-\omega(V_{t_{h_i}})\big) \le \rho^{b_m-\sum_{t=t_0}^{t_1}I_t},$$
and
$$\sum_{m=1}^{n}\Big(\prod_{i=1}^{a_m}\big(1-\omega(V_{t_{k_i}})\big) \wedge \prod_{i=1}^{b_m}\big(1-\omega(V_{t_{h_i}})\big)\Big) \le \sum_{m=1}^{n}\Big(\rho^{a_m-\sum_{t=t_0}^{t_1}I_t}\wedge\rho^{b_m-\sum_{t=t_0}^{t_1}I_t}\Big) = \rho^{-\sum_{t=t_0}^{t_1}I_t}\sum_{m=1}^{n}\rho^{a_m\vee b_m}. \quad\text{(B.10)}$$
Since $V_t$ forms a stationary and ergodic sequence for $t_0\le t\le t_1$, the strong law of large numbers yields $(t_1-t_0+1)^{-1}\sum_{t=t_0}^{t_1}I_t \xrightarrow{a.s.} E_{\theta^*}[I_0] < \varepsilon$ as $t_1-t_0\to\infty$. Hence, $P_{\theta^*}\big(\rho^{-\sum_{t=t_0}^{t_1}I_t} \ge \rho^{-\varepsilon(t_1-t_0+1)}\ i.o.\big) = 0$. Letting $\{A_{t_0,t_1}\}$ denote a random sequence such that $P_{\theta^*}(A_{t_0,t_1}\ge M\ i.o.) = 0$ for a constant $M<\infty$, (B.10) is bounded by $\rho^{-\varepsilon(t_1-t_0+1)}A_{t_0,t_1}\sum_{m=1}^{n}\rho^{a_m\vee b_m}$. (B.9) similarly follows. $\Box$

Lemma 11. Assume 2–4 and 7–8. Then, for $-m' < -m < k\le t$,
$$E_{\theta^*}\big[\big\|E_{\theta^*}[\phi_{\theta^*,k}\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t] - E_{\theta^*}[\phi_{\theta^*,k}\mid Y_{-m}^t, X_{-m}^t]\big\|^2\big] \le 4\big(E_{\theta^*}[\|\phi_{\theta^*,0}\|_\infty^4]\big)^{1/2}\rho^{\lfloor(k+m-1)/r\rfloor/2}, \quad\text{(B.11)}$$
$$E_{\theta^*}\big[\big\|E_{\theta^*}[\phi_{\theta^*,k}\mid Y_{-m}^t, X_{-m}^t] - E_{\theta^*}[\phi_{\theta^*,k}\mid Y_{-m'}^t, X_{-m'}^t]\big\|^2\big] \le 4\big(E_{\theta^*}[\|\phi_{\theta^*,0}\|_\infty^4]\big)^{1/2}\rho^{\lfloor(k+m-1)/r\rfloor/2}, \quad\text{(B.12)}$$
$$E_{\theta^*}\big[\big\|E_{\theta^*}[\phi_{\theta^*,k}\mid Y_{-m}^t, X_{-m}^t] - E_{\theta^*}[\phi_{\theta^*,k}\mid Y_{-m}^{t-1}, X_{-m}^{t-1}]\big\|^2\big] \le 4\big(E_{\theta^*}[\|\phi_{\theta^*,0}\|_\infty^4]\big)^{1/2}\rho^{\lfloor(t-1-k)/r\rfloor/2}. \quad\text{(B.13)}$$

Proof:
First, show (B.11).
$$\big\|E_{\theta^*}[\phi_{\theta^*,k}\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t] - E_{\theta^*}[\phi_{\theta^*,k}\mid Y_{-m}^t, X_{-m}^t]\big\|$$
$$\le \|\phi_{\theta^*,k}\|_\infty\Big\|\sum_{s'_{-m}}P_{\theta^*}(S_{k-1}\in\cdot\mid Y_{-m}^t, S_{-m}=s'_{-m}, X_{-m}^t)\,\delta_{s_{-m}}(s'_{-m}) - \sum_{s'_{-m}}P_{\theta^*}(S_{k-1}\in\cdot\mid Y_{-m}^t, S_{-m}=s'_{-m}, X_{-m}^t)\,P_{\theta^*}(S_{-m}=s'_{-m}\mid Y_{-m}^t, X_{-m}^t)\Big\|_{TV}$$
$$\le 2\|\phi_{\theta^*,k}\|_\infty\prod_{i=1}^{\lfloor(k+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big).$$
Using Lemma 2 and Hölder's inequality, the second moment is bounded by
$$4E_{\theta^*}\Big[\|\phi_{\theta^*,k}\|_\infty^2\prod_{i=1}^{\lfloor(k+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big)^2\Big] \le 4\big(E_{\theta^*}[\|\phi_{\theta^*,0}\|_\infty^4]\big)^{1/2}\Big(E_{\theta^*}\Big[\prod_{i=1}^{\lfloor(k+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big)^4\Big]\Big)^{1/2} \le 4\big(E_{\theta^*}[\|\phi_{\theta^*,0}\|_\infty^4]\big)^{1/2}\rho^{\lfloor(k+m-1)/r\rfloor/2}.$$
We can show (B.12) by replacing $P(S_{k-1}\in\cdot\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t)$ with $P(S_{k-1}\in\cdot\mid Y_{-m'}^t, X_{-m'}^t)$. For (B.13),
$$\big\|E_{\theta^*}[\phi_{\theta^*,k}\mid Y_{-m}^t, X_{-m}^t] - E_{\theta^*}[\phi_{\theta^*,k}\mid Y_{-m}^{t-1}, X_{-m}^{t-1}]\big\| \le \|\phi_{\theta^*,k}\|_\infty\Big\|\sum_{s_{t-1}}P_{\theta^*}(S_k\in\cdot\mid S_{t-1}=s_{t-1}, Y_{-m}^{t-1}, X_{-m}^{t-1})\,\big(P_{\theta^*}(S_{t-1}=s_{t-1}\mid Y_{-m}^t, X_{-m}^t) - P_{\theta^*}(S_{t-1}=s_{t-1}\mid Y_{-m}^{t-1}, X_{-m}^{t-1})\big)\Big\|_{TV}$$
$$\le 2\|\phi_{\theta^*,k}\|_\infty\prod_{i=1}^{\lfloor(t-1-k)/r\rfloor}\big(1-\omega(V_{t-1+r-ri})\big).$$
The bound for its second moment follows similarly to the proof for (B.11). $\Box$
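The geometric factors in (B.11)–(B.13) all come from expectations of products of the form $\prod_i(1-\omega(V_i))$. A minimal numerical sketch of this mechanism, under the simplifying (and hypothetical) assumption that the $\omega(V_i)$ are i.i.d. uniform on $(0.2, 0.8)$ — in which case the expected squared product is exactly geometric:

```python
import random

random.seed(2)

# Hypothetical i.i.d. sketch: omega(V_i) ~ Uniform(0.2, 0.8), hence
# 1 - omega(V_i) ~ Uniform(0.2, 0.8) as well, and
# rho := E[(1 - omega)^2] = (0.8^3 - 0.2^3) / (3 * 0.6) = 0.28.
rho = (0.8 ** 3 - 0.2 ** 3) / (3 * 0.6)

n_rep, horizon = 100_000, 12
est = [0.0] * (horizon + 1)
for _ in range(n_rep):
    prod = 1.0
    for n in range(1, horizon + 1):
        prod *= (1.0 - random.uniform(0.2, 0.8)) ** 2
        est[n] += prod
est = [e / n_rep for e in est]

# under independence, E[prod_{i<=n}(1 - omega_i)^2] = rho^n exactly, so the
# Monte Carlo estimate should track rho^n at every horizon
ok = all(abs(est[n] - rho ** n) <= 0.15 * rho ** n for n in range(1, horizon + 1))
assert ok
```

In the paper the $V_i$ are serially dependent, which is exactly why the dedicated Lemmas 2 and 9 are needed in place of this independence shortcut.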
Lemma 12.
Assume 2–4 and 7–8. There exist a random sequence $\{A_{t,m}\}$ and a random variable $K\in L^1(P_{\theta^*})$ such that, for all $t\ge1$ and $0\le m\le m'$,
$$\max_{s_{-m}}\sup_{\theta\in G}\|\Gamma_{t,m,s_{-m}}(\theta) - \Gamma_{t,m',s_{-m'}}(\theta)\| \le K(t\vee m)^2\rho^{(t+m)/2r}A_{t,m}, \quad\text{(B.14)}$$
$$\max_{s_{-m}}\sup_{\theta\in G}\|\Gamma_{t,m,s_{-m}}(\theta) - \Gamma_{t,m}(\theta)\| \le K(t\vee m)^2\rho^{(t+m)/2r}A_{t,m}, \quad\text{(B.15)}$$
where $P_{\theta^*}(A_{t,m}\ge M\ i.o.) = 0$ for $M<\infty$.
Put $\|\dot\phi_k\|_\infty = \max_{s_k, s_{k-1}}\sup_{\theta\in G}\|\dot\phi_\theta((s_k, Y_k), (s_{k-1}, Y_{k-1}), X_k)\|$. For $-m' < -m < k\le t$,
$$\|E_\theta[\dot\phi_{\theta,k}\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t] - E_\theta[\dot\phi_{\theta,k}\mid Y_{-m}^t, X_{-m}^t]\| \le 2\|\dot\phi_k\|_\infty\prod_{i=1}^{\lfloor(k+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big),$$
$$\|E_\theta[\dot\phi_{\theta,k}\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t] - E_\theta[\dot\phi_{\theta,k}\mid Y_{-m'}^t, S_{-m'}=s_{-m'}, X_{-m'}^t]\| \le 2\|\dot\phi_k\|_\infty\prod_{i=1}^{\lfloor(k+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big),$$
$$\|E_\theta[\dot\phi_{\theta,k}\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t] - E_\theta[\dot\phi_{\theta,k}\mid Y_{-m}^{t-1}, S_{-m}=s_{-m}, X_{-m}^{t-1}]\| \le 2\|\dot\phi_k\|_\infty\prod_{i=1}^{\lfloor(t-1-k)/r\rfloor}\big(1-\omega(V_{t-1+r-ri})\big),$$
$$\|E_\theta[\dot\phi_{\theta,k}\mid Y_{-m}^t, X_{-m}^t] - E_\theta[\dot\phi_{\theta,k}\mid Y_{-m}^{t-1}, X_{-m}^{t-1}]\| \le 2\|\dot\phi_k\|_\infty\prod_{i=1}^{\lfloor(t-1-k)/r\rfloor}\big(1-\omega(V_{t-1+r-ri})\big).$$
First, we show (B.15).
$$\|\Gamma_{t,m,s_{-m}} - \Gamma_{t,m}\| \le 2\|\dot\phi_t\|_\infty\prod_{i=1}^{\lfloor(t+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big) + 4\sum_{k=-m+1}^{t-1}\|\dot\phi_k\|_\infty\Big(\prod_{i=1}^{\lfloor(k+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big) \wedge \prod_{i=1}^{\lfloor(t-k-1)/r\rfloor}\big(1-\omega(V_{t+r-ri})\big)\Big)$$
$$\le 4\max_{-m+1\le k\le t}\|\dot\phi_k\|_\infty\sum_{k=-m+1}^{t}\Big(\prod_{i=1}^{\lfloor(k+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big) \wedge \prod_{i=1}^{\lfloor(t-k-1)/r\rfloor}\big(1-\omega(V_{t+r-ri})\big)\Big). \quad\text{(B.16)}$$
The first factor of (B.16) is bounded by
$$4\max_{-m+1\le k\le t}\|\dot\phi_k\|_\infty \le 4\sum_{k=-m+1}^{t}\frac{(|k|\vee1)^2}{(|k|\vee1)^2}\|\dot\phi_k\|_\infty \le 4(t\vee m)^2\sum_{k=-\infty}^{\infty}\frac{\|\dot\phi_k\|_\infty}{(|k|\vee1)^2}. \quad\text{(B.17)}$$
We proceed to bound the second part of (B.16). Since $-m + r\lfloor(k+m-1)/r\rfloor < t + r - r\lfloor(t-k-1)/r\rfloor$, we can apply Lemma 10. Using $\rho^{\lfloor(k+m-1)/r\rfloor}\wedge\rho^{\lfloor(t-k-1)/r\rfloor} = \rho^{\lfloor(t-k-1)/r\rfloor}$ for $k\le(t-m)/2$ and $= \rho^{\lfloor(k+m-1)/r\rfloor}$ for $k\ge(t-m)/2$,
$$\sum_{k=-m+1}^{t}\big(\rho^{\lfloor(k+m-1)/r\rfloor}\wedge\rho^{\lfloor(t-k-1)/r\rfloor}\big) \le \sum_{k\le(t-m)/2}\rho^{\lfloor(t-k-1)/r\rfloor} + \sum_{k\ge(t-m)/2}\rho^{\lfloor(k+m-1)/r\rfloor} \le \frac{2r\,\rho^{(t+m)/2r-2}}{1-\rho}. \quad\text{(B.18)}$$
Because of (B.17), (B.18), and Lemma 10, (B.15) follows.

Next, we follow a similar procedure to show (B.14).
$$\|\Gamma_{t,m,s_{-m}}(\theta) - \Gamma_{t,m',s_{-m'}}(\theta)\| \le 2\|\dot\phi_t\|_\infty\prod_{i=1}^{\lfloor(t+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big) + 4\sum_{k=-m+1}^{t-1}\|\dot\phi_k\|_\infty\Big(\prod_{i=1}^{\lfloor(k+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big) \wedge \prod_{i=1}^{\lfloor(t-k-1)/r\rfloor}\big(1-\omega(V_{t+r-ri})\big)\Big) + 2\sum_{k=-m'+1}^{-m}\|\dot\phi_k\|_\infty\prod_{i=1}^{\lfloor(t-k-1)/r\rfloor}\big(1-\omega(V_{t+r-ri})\big). \quad\text{(B.19)}$$
The first two terms on the right-hand side can be bounded as above. We proceed to show the bound for the third term. Define $\rho$, $I_t$, and $\varepsilon$ as in (A.3), (A.4), and (A.7), respectively. Notice that the $\rho$ and $\varepsilon$ are the same as the ones in Lemma 2. Using $1-\omega(V_t)\le\rho^{1-I_t}$,
$$2\sum_{k=-m'+1}^{-m}\|\dot\phi_k\|_\infty\prod_{i=1}^{\lfloor(t-k-1)/r\rfloor}\big(1-\omega(V_{t+r-ri})\big) \le 2\sum_{k=-m'+1}^{-m}\|\dot\phi_k\|_\infty\,\rho^{\lfloor(t-k-1)/r\rfloor-\sum_{i=1}^{\lfloor(t-k-1)/r\rfloor}I_{t+r-ri}} \le 2\,\rho^{-\sum_{i=-m'+r+1}^{t}I_i}\sum_{k=-m'+1}^{-m}\|\dot\phi_k\|_\infty\,\rho^{\lfloor(t-k-1)/r\rfloor}.$$
Using (A.5) and (A.6), $E_{\theta^*}\big[\rho^{-\sum_{i=-m'+r+1}^{t}I_i}\big]$ is bounded uniformly in $m'$ and $t$. Moreover, $\rho^{-\sum_{i=-m'+r+1}^{t}I_i} \le \rho^{-\sum_{i=-\infty}^{+\infty}I_i}$, which is in $L^1(P_{\theta^*})$. Since $\lfloor(t-k-1)/r\rfloor \ge (t+m)/2r + (t-m-k)/2r - 2$ for $k\le-m$, we can follow a similar procedure to Douc et al. (2004, p. 2295) to obtain
$$2\sum_{k=-m'+1}^{-m}\|\dot\phi_k\|_\infty\,\rho^{\lfloor(t-k-1)/r\rfloor} \le 2\,\rho^{(t+m)/2r}\sum_{k=-m'+1}^{-m}\|\dot\phi_k\|_\infty\,\rho^{(t-m-k)/2r-2} \le 2\,\rho^{(t+m)/2r}\sum_{k=-\infty}^{\infty}\|\dot\phi_k\|_\infty\,\rho^{|k|/2r-2}.$$
Then (B.14) follows. $\Box$
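The midpoint-split sum appearing in (B.18) can be checked numerically. The sketch below uses a crude envelope constant of the same geometric order (not the paper's exact constant), with arbitrary illustrative values of $\rho$, $r$, $t$, and $m$:

```python
rho, r = 0.6, 3

def min_sum(t, m):
    # sum over k of rho^floor((k+m-1)/r) ^ rho^floor((t-k-1)/r), as in (B.16)
    return sum(min(rho ** ((k + m - 1) // r), rho ** ((t - k - 1) // r))
               for k in range(-m + 1, t))

def envelope(t, m):
    # crude geometric envelope: splitting the sum at k = (t - m)/2 leaves two
    # geometric series whose leading exponent is at least (t + m)/(2r) - 2
    return 2 * r / (1 - rho) * rho ** ((t + m) / (2 * r) - 2)

checks = [(min_sum(t, m), envelope(t, m)) for t, m in [(20, 5), (40, 10), (80, 20)]]
assert all(s <= e for s, e in checks)
```

Doubling $t + m$ roughly squares the bound's geometric factor, which is the decay rate that drives (B.14) and (B.15).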
Lemma 13.
Assume 2–4 and 7–8. Then, for all $t\ge1$ and $0\le m\le m'$,
$$\lim_{m\to\infty}E_{\theta^*}\big[\max_{s_{-m}}\sup_{\theta\in G}\|\Gamma_{t,m,s_{-m}}(\theta) - \Gamma_{t,m',s_{-m'}}(\theta)\|\big] = 0, \quad\text{(B.20)}$$
$$\lim_{m\to\infty}E_{\theta^*}\big[\max_{s_{-m}}\sup_{\theta\in G}\|\Gamma_{t,m,s_{-m}}(\theta) - \Gamma_{t,m}(\theta)\|\big] = 0. \quad\text{(B.21)}$$
Proof: (B.20) follows from (B.19), the Cauchy–Schwarz inequality, and Lemma 9: taking expectations in (B.19) term by term yields
$$E_{\theta^*}\big[\sup_{\theta\in G}\max_{s_{-m}\in S}\|\Gamma_{t,m,s_{-m}}(\theta) - \Gamma_{t,m',s_{-m'}}(\theta)\|\big] \le C\big(E_{\theta^*}[\|\dot\phi_0\|_\infty^2]\big)^{1/2}\frac{\rho^{(t+m-1)/2r}}{1-\rho^{1/2r}}$$
for a constant $C$ depending only on $r$ and $\rho$, which converges to 0 as $m\to\infty$. Similarly, we can obtain (B.21) by using (B.16). $\Box$

Lemma 14. Assume 2–4 and 7–8. Then, there exist a random sequence $\{A_{t,m}\}$ and a random variable $K\in L^1(P_{\theta^*})$ such that, for all $t\ge1$ and $0\le m\le m'$,
$$\max_{s_{-m}}\sup_{\theta\in G}\|\Phi_{t,m,s_{-m}}(\theta) - \Phi_{t,m}(\theta)\| \le K(t\vee m)^4\rho^{(t+m)/3r}A_{t,m},$$
$$\max_{s_{-m}}\sup_{\theta\in G}\|\Phi_{t,m,s_{-m}}(\theta) - \Phi_{t,m',s_{-m'}}(\theta)\| \le K(t\vee m)^4\rho^{(t+m)/3r}A_{t,m},$$
where $P_{\theta^*}(A_{t,m}\ge M\ i.o.) = 0$ for $M<\infty$.

Proof:
Put $\|\phi_k\|_\infty = \max_{s_k, s_{k-1}}\sup_{\theta\in G}\|\phi_\theta((s_k, Y_k), (s_{k-1}, Y_{k-1}), X_k)\|$. For $m'\ge m\ge0$, all $-m<\ell\le k\le n$, all $\theta\in G$, and all $s_{-m}\in S$, there is an absolute constant $c$ such that
$$\|\mathrm{Cov}_\theta[\phi_{\theta,k}, \phi_{\theta,\ell}\mid Y_{-m}^n, X_{-m}^n]\| \le c\,\|\phi_k\|_\infty\|\phi_\ell\|_\infty\prod_{i=1}^{\lfloor(k-\ell-1)/r\rfloor}\big(1-\omega(V_{\ell+ri})\big),$$
$$\|\mathrm{Cov}_\theta[\phi_{\theta,k}, \phi_{\theta,\ell}\mid Y_{-m}^n, S_{-m}=s_{-m}, X_{-m}^n]\| \le c\,\|\phi_k\|_\infty\|\phi_\ell\|_\infty\prod_{i=1}^{\lfloor(k-\ell-1)/r\rfloor}\big(1-\omega(V_{\ell+ri})\big),$$
$$\|\mathrm{Cov}_\theta[\phi_{\theta,k}, \phi_{\theta,\ell}\mid Y_{-m}^n, S_{-m}=s_{-m}, X_{-m}^n] - \mathrm{Cov}_\theta[\phi_{\theta,k}, \phi_{\theta,\ell}\mid Y_{-m}^n, X_{-m}^n]\| \le c\,\|\phi_k\|_\infty\|\phi_\ell\|_\infty\prod_{i=1}^{\lfloor(\ell+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big),$$
$$\|\mathrm{Cov}_\theta[\phi_{\theta,k}, \phi_{\theta,\ell}\mid Y_{-m}^n, X_{-m}^n] - \mathrm{Cov}_\theta[\phi_{\theta,k}, \phi_{\theta,\ell}\mid Y_{-m}^{n-1}, X_{-m}^{n-1}]\| \le c\,\|\phi_k\|_\infty\|\phi_\ell\|_\infty\prod_{i=1}^{\lfloor(n-k-1)/r\rfloor}\big(1-\omega(V_{n+r-ri})\big),$$
$$\|\mathrm{Cov}_\theta[\phi_{\theta,k}, \phi_{\theta,\ell}\mid Y_{-m}^n, S_{-m}=s_{-m}, X_{-m}^n] - \mathrm{Cov}_\theta[\phi_{\theta,k}, \phi_{\theta,\ell}\mid Y_{-m}^{n-1}, S_{-m}=s_{-m}, X_{-m}^{n-1}]\| \le c\,\|\phi_k\|_\infty\|\phi_\ell\|_\infty\prod_{i=1}^{\lfloor(n-k-1)/r\rfloor}\big(1-\omega(V_{n+r-ri})\big).$$
We define $\Lambda_a^b = \sum_{i=a}^{b}\phi_{\theta,i}$. Then, $\Phi_{t,m,s_{-m}}(\theta) - \Phi_{t,m}(\theta)$ may be decomposed as $A + 2B + C$, where
$$A = \mathrm{Var}_\theta[\Lambda_{-m+1}^{t-1}\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t] - \mathrm{Var}_\theta[\Lambda_{-m+1}^{t-1}\mid Y_{-m}^{t-1}, S_{-m}=s_{-m}, X_{-m}^{t-1}] - \mathrm{Var}_\theta[\Lambda_{-m+1}^{t-1}\mid Y_{-m}^t, X_{-m}^t] + \mathrm{Var}_\theta[\Lambda_{-m+1}^{t-1}\mid Y_{-m}^{t-1}, X_{-m}^{t-1}],$$
$$B = \mathrm{Cov}_\theta[\Lambda_{-m+1}^{t-1}, \phi_{\theta,t}\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t] - \mathrm{Cov}_\theta[\Lambda_{-m+1}^{t-1}, \phi_{\theta,t}\mid Y_{-m}^t, X_{-m}^t],$$
$$C = \mathrm{Var}_\theta[\phi_{\theta,t}\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t] - \mathrm{Var}_\theta[\phi_{\theta,t}\mid Y_{-m}^t, X_{-m}^t].$$
We have
$$\|A\| \le \max_{-m+1\le\ell\le k\le t-1}\|\phi_k\|_\infty\|\phi_\ell\|_\infty \cdot c\sum_{-m+1\le\ell\le k\le t-1}\Big(\prod_{i=1}^{\lfloor(\ell+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big) \wedge \prod_{i=1}^{\lfloor(k-\ell-1)/r\rfloor}\big(1-\omega(V_{\ell+ri})\big) \wedge \prod_{i=1}^{\lfloor(t-k-1)/r\rfloor}\big(1-\omega(V_{t+r-ri})\big)\Big),$$
$$\|B\| \le \max_{-m+1\le k\le t-1}\|\phi_k\|_\infty\|\phi_t\|_\infty \cdot c\sum_{-m+1\le k\le t-1}\Big(\prod_{i=1}^{\lfloor(k+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big) \wedge \prod_{i=1}^{\lfloor(t-k-1)/r\rfloor}\big(1-\omega(V_{k+ri})\big)\Big),$$
$$\|C\| \le c\,\|\phi_t\|_\infty^2\prod_{i=1}^{\lfloor(t+m-1)/r\rfloor}\big(1-\omega(V_{-m+ri})\big).$$
Similar to the calculation on p. 2299 of Douc et al. (2004), we derive
$$\max_{-m+1\le\ell\le k\le t-1}\|\phi_k\|_\infty\|\phi_\ell\|_\infty \le (m+t)^4\Big(\sum_{k=-\infty}^{\infty}\frac{\|\phi_k\|_\infty}{(|k|\vee1)^2}\Big)^2.$$
In view of Lemma 10 and
$$\sum_{-m+1\le\ell\le k\le t-1}\big(\rho^{\lfloor(\ell+m-1)/r\rfloor}\wedge\rho^{\lfloor(k-\ell-1)/r\rfloor}\wedge\rho^{\lfloor(t-k-1)/r\rfloor}\big) \le \frac{C_1\,\rho^{(t+m)/3r}}{(1-\rho^{1/3r})^2}$$
for a constant $C_1$ depending only on $r$ and $\rho$, there exists a random sequence $\{A_{t,m}\}$ with $P_{\theta^*}(A_{t,m}\ge M\ i.o.) = 0$ for $M<\infty$ such that $\|A\|$, $\|B\|$, and $\|C\|$ are each bounded by $K(t\vee m)^4\rho^{(t+m)/3r}A_{t,m}$ with $K\in L^1(P_{\theta^*})$. This proves the first inequality.

The difference $\Phi_{t,m,s_{-m}}(\theta) - \Phi_{t,m',s_{-m'}}(\theta)$ can be decomposed as $A + 2B + C - D - 2E - 2F$, where
$$A = \mathrm{Var}_\theta[\Lambda_{-m+1}^{t-1}\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t] - \mathrm{Var}_\theta[\Lambda_{-m+1}^{t-1}\mid Y_{-m}^{t-1}, S_{-m}=s_{-m}, X_{-m}^{t-1}] - \mathrm{Var}_\theta[\Lambda_{-m+1}^{t-1}\mid Y_{-m'}^t, S_{-m'}=s_{-m'}, X_{-m'}^t] + \mathrm{Var}_\theta[\Lambda_{-m+1}^{t-1}\mid Y_{-m'}^{t-1}, S_{-m'}=s_{-m'}, X_{-m'}^{t-1}],$$
$$B = \mathrm{Cov}_\theta[\Lambda_{-m+1}^{t-1}, \phi_{\theta,t}\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t] - \mathrm{Cov}_\theta[\Lambda_{-m+1}^{t-1}, \phi_{\theta,t}\mid Y_{-m'}^t, S_{-m'}=s_{-m'}, X_{-m'}^t],$$
$$C = \mathrm{Var}_\theta[\phi_{\theta,t}\mid Y_{-m}^t, S_{-m}=s_{-m}, X_{-m}^t] - \mathrm{Var}_\theta[\phi_{\theta,t}\mid Y_{-m'}^t, S_{-m'}=s_{-m'}, X_{-m'}^t],$$
$$D = \mathrm{Var}_\theta[\Lambda_{-m'+1}^{-m}\mid Y_{-m'}^t, S_{-m'}=s_{-m'}, X_{-m'}^t] - \mathrm{Var}_\theta[\Lambda_{-m'+1}^{-m}\mid Y_{-m'}^{t-1}, S_{-m'}=s_{-m'}, X_{-m'}^{t-1}],$$
$$E = \mathrm{Cov}_\theta[\Lambda_{-m+1}^{t-1}, \Lambda_{-m'+1}^{-m}\mid Y_{-m'}^t, S_{-m'}=s_{-m'}, X_{-m'}^t] - \mathrm{Cov}_\theta[\Lambda_{-m+1}^{t-1}, \Lambda_{-m'+1}^{-m}\mid Y_{-m'}^{t-1}, S_{-m'}=s_{-m'}, X_{-m'}^{t-1}],$$
$$F = \mathrm{Cov}_\theta[\Lambda_{-m'+1}^{-m}, \phi_{\theta,t}\mid Y_{-m'}^t, S_{-m'}=s_{-m'}, X_{-m'}^t].$$
$\|A\|$, $\|B\|$, and $\|C\|$ are bounded as above. For the other terms, we have
$$\|D\| \le c\sum_{-m'+1\le\ell\le k\le-m}\Big(\prod_{i=1}^{\lfloor(t-k-1)/r\rfloor}\big(1-\omega(V_{t+r-ri})\big) \wedge \prod_{i=1}^{\lfloor(k-\ell-1)/r\rfloor}\big(1-\omega(V_{\ell+ri})\big)\Big)\|\phi_k\|_\infty\|\phi_\ell\|_\infty,$$
$$\|E\| \le c\sum_{k=-m+1}^{t-1}\sum_{\ell=-m'+1}^{-m}\Big(\prod_{i=1}^{\lfloor(t-k-1)/r\rfloor}\big(1-\omega(V_{t+r-ri})\big) \wedge \prod_{i=1}^{\lfloor(k-\ell-1)/r\rfloor}\big(1-\omega(V_{\ell+ri})\big)\Big)\|\phi_k\|_\infty\|\phi_\ell\|_\infty,$$
$$\|F\| \le c\sum_{-m'+1\le\ell\le-m}\prod_{i=1}^{\lfloor(t-\ell-1)/r\rfloor}\big(1-\omega(V_{\ell+ri})\big)\|\phi_\ell\|_\infty\|\phi_t\|_\infty.$$
Define $\rho$, $I_t$, and $\varepsilon$ as in (A.3), (A.4), and (A.7), respectively. Notice that the $\rho$ and $\varepsilon$ are the same as the ones in Lemma 2. Using $1-\omega(V_t)\le\rho^{1-I_t}$,
$$\sum_{-m'+1\le\ell\le k\le-m}\Big(\prod_{i=1}^{\lfloor(t-k-1)/r\rfloor}\big(1-\omega(V_{t+r-ri})\big) \wedge \prod_{i=1}^{\lfloor(k-\ell-1)/r\rfloor}\big(1-\omega(V_{\ell+ri})\big)\Big)\|\phi_k\|_\infty\|\phi_\ell\|_\infty \le \rho^{-\sum_{i=-m'+1}^{t}I_i}\sum_{-m'+1\le\ell\le k\le-m}\big(\rho^{\lfloor(t-k-1)/r\rfloor}\wedge\rho^{\lfloor(k-\ell-1)/r\rfloor}\big)\|\phi_k\|_\infty\|\phi_\ell\|_\infty.$$
By (A.5) and (A.6), $E_{\theta^*}\big[\rho^{-\sum_{i=-m'+1}^{t}I_i}\big]$ is bounded uniformly in $m'$ and $t$, and $\rho^{-\sum_{i=-m'+1}^{t}I_i} \le \rho^{-\sum_{i=-\infty}^{+\infty}I_i}$, which is in $L^1(P_{\theta^*})$. Following a similar procedure to Douc et al. (2004, p. 2301), we obtain
$$\sum_{-m'+1\le\ell\le k\le-m}\big(\rho^{\lfloor(t-k-1)/r\rfloor}\wedge\rho^{\lfloor(k-\ell-1)/r\rfloor}\big)\|\phi_k\|_\infty\|\phi_\ell\|_\infty \le \rho^{(t+m)/2r-2}\sum_{\ell=-\infty}^{\infty}\rho^{|\ell|/2r}\|\phi_\ell\|_\infty\sum_{k=-\infty}^{\infty}\rho^{|k|/2r}\|\phi_k\|_\infty.$$
Thus,
$$\|D\| \le c\,\rho^{(t+m)/2r-2}\,\rho^{-\sum_{i=-\infty}^{+\infty}I_i}\sum_{\ell=-\infty}^{\infty}\rho^{|\ell|/2r}\|\phi_\ell\|_\infty\sum_{k=-\infty}^{\infty}\rho^{|k|/2r}\|\phi_k\|_\infty.$$
Similarly, we can derive
$$\|E\| \le c\,\rho^{(t+m)/2r-2}\,\rho^{-\sum_{i=-\infty}^{+\infty}I_i}\sum_{\ell=-\infty}^{\infty}\rho^{|\ell|/2r}\|\phi_\ell\|_\infty\sum_{k=-\infty}^{\infty}\rho^{|k|/2r}\|\phi_k\|_\infty, \qquad \|F\| \le c\,\rho^{(t+m)/2r}\,\rho^{-\sum_{i=-\infty}^{+\infty}I_i}\sum_{\ell=-\infty}^{\infty}\rho^{|\ell|/2r}\|\phi_\ell\|_\infty\sum_{k=-\infty}^{\infty}\rho^{|k|/2r}\|\phi_k\|_\infty.$$
The proof is complete. $\Box$
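The truncation device $1-\omega(V_t)\le\rho^{1-I_t}$, used repeatedly in this and the preceding proofs, is a pointwise inequality that is easy to verify once $I_t$ is read as the indicator that $\omega(V_t)$ falls below $1-\rho$ (our assumed reading of (A.4)):

```python
import random

random.seed(3)

rho = 0.55

def indicator(w):
    # I = 1 exactly when omega falls below 1 - rho (assumed form of (A.4))
    return 1 if w < 1.0 - rho else 0

# pointwise: if I = 0 then 1 - w <= rho = rho^1; if I = 1 then 1 - w <= 1 = rho^0
ws = [random.random() for _ in range(10_000)]
bound_holds = all(1.0 - w <= rho ** (1 - indicator(w)) for w in ws)
assert bound_holds
```

The indicator is what lets the proofs trade a product of data-dependent factors for a deterministic geometric rate times the correction $\rho^{-\sum I_t}$, which the strong law of large numbers then controls.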
Lemma 15.
Assume 2–4 and 7–8. Then, for all $t\ge1$ and $0\le m\le m'$,
$$E_{\theta^*}\big[\max_{s_{-m}}\sup_{\theta\in G}\|\Phi_{t,m,s_{-m}}(\theta) - \Phi_{t,m',s_{-m'}}(\theta)\|\big] \to 0, \quad\text{as } m\to\infty,$$
$$E_{\theta^*}\big[\max_{s_{-m}}\sup_{\theta\in G}\|\Phi_{t,m,s_{-m}}(\theta) - \Phi_{t,m}(\theta)\|\big] \to 0, \quad\text{as } m\to\infty.$$
The proof of Lemma 15 follows from the inequalities in Lemma 14 using an argument similar to that of Lemma 13.
Appendix C: Discussion of assumptions
This section shows that Assumption 4(a), the key assumption delivering the main result of a geometric decay rate for the mixing coefficient, holds for state transition probabilities of the Diebold et al. (1994) logistic form (7) and of the Chang et al. (2017) type (11). In addition, we derive explicit conditions for Assumptions 2 and 6 to hold. The other assumptions are either standard in the literature or easy to verify, and are not discussed here.
C.1 Discussion of Assumption 4(a)
We first give a sufficient condition for Assumption 4(a) that is easier to verify in practice.
Lemma 16.
A sufficient condition for Assumption 4(a) is that, for some $\delta > 0$, $E_{\theta^*}\big[|\log\sigma_-(Y^0, X_1)|^{1+\delta}\big] < \infty$.
Set $C_0 = 1$. Then
$$P_{\theta^*}\big(\sigma_-(Y^0, X_1) \le e^{-\alpha_0\xi}\big) = P_{\theta^*}\big(|\log\sigma_-(Y^0, X_1)|^{1+\delta} \ge (\alpha_0\xi)^{1+\delta}\big) \le \frac{E_{\theta^*}\big[|\log\sigma_-(Y^0, X_1)|^{1+\delta}\big]}{(\alpha_0\xi)^{1+\delta}} = C_1\,\xi^{-(1+\delta)},$$
where $C_1 = E_{\theta^*}[|\log\sigma_-(Y^0, X_1)|^{1+\delta}]/\alpha_0^{1+\delta}$, and the inequality follows from Markov's inequality. $\Box$

Proposition 5.
Assumption 4(a) holds for a state transition probability of logistic form,
$$q_\theta(s_t = s\mid s_{t-1} = s, s_{t-2}, \dots, s_{t-r}, X_t) = \frac{\exp(\beta_s'X_t)}{1+\exp(\beta_s'X_t)},$$
if for all $s\in S$, $\beta_s$ takes values in a compact set, and $\exp(-2\beta_s'X_t)$ and $\exp(2\beta_s'X_t)$ have finite first moments for all $\beta_s$.

Proof:
We show that the condition in Lemma 16 is satisfied with $\delta = 1$. For $s\in S$, we use $\log(1+x)\le x$ for $x\ge0$ to obtain
$$E_{\theta^*}\big[|\log q_\theta(s_1=s\mid s_0=s, s_{-1}, \dots, s_{-r+1}, X_1)|^2\big] = E_{\theta^*}\Big[\Big|\log\frac{\exp(\beta_s'X_1)}{1+\exp(\beta_s'X_1)}\Big|^2\Big] = E_{\theta^*}\big[\big(\log(1+\exp(-\beta_s'X_1))\big)^2\big] \le E_{\theta^*}\big[\exp(-2\beta_s'X_1)\big] < \infty.$$
For $s, \tilde s\in S$ and $\tilde s\ne s$,
$$E_{\theta^*}\big[|\log q_\theta(s_1=\tilde s\mid s_0=s, s_{-1}, \dots, s_{-r+1}, X_1)|^2\big] = E_{\theta^*}\Big[\Big|\log\frac{1}{1+\exp(\beta_s'X_1)}\Big|^2\Big] = E_{\theta^*}\big[\big(\log(1+\exp(\beta_s'X_1))\big)^2\big] \le E_{\theta^*}\big[\exp(2\beta_s'X_1)\big] < \infty. \qquad\Box$$
From Proposition 5, the transition probability of logistic form satisfies Assumption 4(a) when $X_t$ is normally distributed.

Proposition 6.
Assumption 4(a) holds in the Chang et al. (2017)-type transition probability (11).

Proof: It suffices to show that $E_{\theta^*}|\log q_\theta(s_1\mid s_0, Y^0, X_1)| < \infty$ for all $s_1\in S$, $s_0\in S$, and $\theta\in\Theta$. First, consider the case of $s_1=0$ and $s_0=0$:
$$q_\theta(s_1=0\mid s_0=0, s_{-1}, \dots, s_{-r+1}, Y^0, X_1) = \frac{\int_{-\infty}^{\tau\sqrt{1-\alpha^2}}\Phi\Big(\frac{\tau-\rho U_0}{\sqrt{1-\rho^2}} - \frac{\alpha x/\sqrt{1-\alpha^2}}{\sqrt{1-\rho^2}}\Big)\varphi(x)\,dx}{\Phi(\tau\sqrt{1-\alpha^2})}.$$
The following result holds for the numerator:
$$\int_{-\infty}^{\tau\sqrt{1-\alpha^2}}\Phi\Big(\frac{\tau-\rho U_0}{\sqrt{1-\rho^2}} - \frac{\alpha x/\sqrt{1-\alpha^2}}{\sqrt{1-\rho^2}}\Big)\varphi(x)\,dx \ge \int_{-|\tau\sqrt{1-\alpha^2}|-1}^{-|\tau\sqrt{1-\alpha^2}|}\Phi\Big(\frac{\tau-\rho U_0}{\sqrt{1-\rho^2}} - \frac{|\alpha\tau\sqrt{1-\alpha^2}|+|\alpha|}{\sqrt{1-\alpha^2}\sqrt{1-\rho^2}}\Big)\varphi(x)\,dx$$
$$\ge \Phi(-D)\Big(\Phi\big(-|\tau\sqrt{1-\alpha^2}|\big) - \Phi\big(-|\tau\sqrt{1-\alpha^2}|-1\big)\Big) = \Phi^c(D)\Big(\Phi\big(-|\tau\sqrt{1-\alpha^2}|\big) - \Phi\big(-|\tau\sqrt{1-\alpha^2}|-1\big)\Big)$$
$$\ge \sqrt{\frac{2}{\pi}}\,\frac{e^{-D^2/2}}{D+\sqrt{D^2+4}}\Big(\Phi\big(-|\tau\sqrt{1-\alpha^2}|\big) - \Phi\big(-|\tau\sqrt{1-\alpha^2}|-1\big)\Big),$$
where $D \triangleq \Big|\frac{\tau-\rho U_0}{\sqrt{1-\rho^2}} - \frac{|\alpha\tau\sqrt{1-\alpha^2}|+|\alpha|}{\sqrt{1-\alpha^2}\sqrt{1-\rho^2}}\Big|$ and $\Phi^c(x)\triangleq 1-\Phi(x)$. The last inequality follows from $\Phi^c(x) \ge \sqrt{2/\pi}\,\frac{e^{-x^2/2}}{x+\sqrt{x^2+4}}$. Therefore,
$$E_{\theta^*}\big[|\log q_\theta(s_1=0\mid s_0=0, s_{-1}, \dots, s_{-r+1}, Y^0, X_1)|\big] \le E_{\theta^*}\Big[\Big|\log\Big(\frac{e^{-D^2/2}}{D+\sqrt{D^2+4}}\Big) + \log\sqrt{\frac{2}{\pi}} + \log\Big(\Phi\big(-|\tau\sqrt{1-\alpha^2}|\big) - \Phi\big(-|\tau\sqrt{1-\alpha^2}|-1\big)\Big) - \log\Phi\big(\tau\sqrt{1-\alpha^2}\big)\Big|\Big].$$
We need show only that $E_{\theta^*}\big[\big|\log\big(\frac{e^{-D^2/2}}{D+\sqrt{D^2+4}}\big)\big|\big] < \infty$. Note that $D$ is folded normal distributed and has finite moments. Using $\log x \le x-1$ for $x\ge1$,
$$E_{\theta^*}\Big[\Big|\log\Big(\frac{e^{-D^2/2}}{D+\sqrt{D^2+4}}\Big)\Big|\Big] = E_{\theta^*}\Big[\frac{D^2}{2} + \log\big(D+\sqrt{D^2+4}\big)\Big] \le E_{\theta^*}\Big[\frac{D^2}{2} + D + \sqrt{D^2+4} - 1\Big] < \infty. \quad\text{(C.1)}$$
Next, consider the case of $s_1=1$ and $s_0=0$:
$$q_\theta(s_1=1\mid s_0=0, s_{-1}, \dots, s_{-r+1}, Y^0, X_1) = \frac{\int_{-\infty}^{\tau\sqrt{1-\alpha^2}}\Phi^c\Big(\frac{\tau-\rho U_0}{\sqrt{1-\rho^2}} - \frac{\alpha x/\sqrt{1-\alpha^2}}{\sqrt{1-\rho^2}}\Big)\varphi(x)\,dx}{\Phi(\tau\sqrt{1-\alpha^2})}.$$
The numerator satisfies
$$\int_{-\infty}^{\tau\sqrt{1-\alpha^2}}\Phi^c\Big(\frac{\tau-\rho U_0}{\sqrt{1-\rho^2}} - \frac{\alpha x/\sqrt{1-\alpha^2}}{\sqrt{1-\rho^2}}\Big)\varphi(x)\,dx \ge \Phi^c(\tilde D)\Big(\Phi\big(-|\tau\sqrt{1-\alpha^2}|\big) - \Phi\big(-|\tau\sqrt{1-\alpha^2}|-1\big)\Big) \ge \sqrt{\frac{2}{\pi}}\,\frac{e^{-\tilde D^2/2}}{\tilde D+\sqrt{\tilde D^2+4}}\Big(\Phi\big(-|\tau\sqrt{1-\alpha^2}|\big) - \Phi\big(-|\tau\sqrt{1-\alpha^2}|-1\big)\Big),$$
where $\tilde D \triangleq \Big|\frac{\tau-\rho U_0}{\sqrt{1-\rho^2}} + \frac{|\alpha\tau\sqrt{1-\alpha^2}|+|\alpha|}{\sqrt{1-\alpha^2}\sqrt{1-\rho^2}}\Big|$. Then, we can show $E_{\theta^*}|\log q_\theta(s_1=1\mid s_0=0, s_{-1}, \dots, s_{-r+1}, Y^0, X_1)| < \infty$ by showing that $E_{\theta^*}\big[\big|\log\big(\frac{e^{-\tilde D^2/2}}{\tilde D+\sqrt{\tilde D^2+4}}\big)\big|\big] < \infty$; the proof follows a similar procedure to that in (C.1). For the case of $s_1=0$, $s_0=1$,
$$\int_{\tau\sqrt{1-\alpha^2}}^{\infty}\Phi\Big(\frac{\tau-\rho U_0}{\sqrt{1-\rho^2}} - \frac{\alpha x/\sqrt{1-\alpha^2}}{\sqrt{1-\rho^2}}\Big)\varphi(x)\,dx \ge \sqrt{\frac{2}{\pi}}\,\frac{e^{-\tilde D^2/2}}{\tilde D+\sqrt{\tilde D^2+4}}\Big(\Phi\big(|\tau\sqrt{1-\alpha^2}|+1\big) - \Phi\big(|\tau\sqrt{1-\alpha^2}|\big)\Big);$$
for $s_1=1$, $s_0=1$,
$$\int_{\tau\sqrt{1-\alpha^2}}^{\infty}\Phi^c\Big(\frac{\tau-\rho U_0}{\sqrt{1-\rho^2}} - \frac{\alpha x/\sqrt{1-\alpha^2}}{\sqrt{1-\rho^2}}\Big)\varphi(x)\,dx \ge \sqrt{\frac{2}{\pi}}\,\frac{e^{-D^2/2}}{D+\sqrt{D^2+4}}\Big(\Phi\big(|\tau\sqrt{1-\alpha^2}|+1\big) - \Phi\big(|\tau\sqrt{1-\alpha^2}|\big)\Big),$$
and the result follows. $\Box$

C.2 Discussion of Assumption 2
In this subsection, we derive explicit conditions for Assumption 2 to hold in models with a transition equation of the observed process of the form (5) or (6) and transition probability (11).

Note that the Chang et al. (2017)-type transition probability (11) is essentially a function of $(S_t, S_{t-1}, U_{t-1})$, which allows us to use the alternative notation $q_\theta(S_t\mid S_{t-1}, U_{t-1})$. Since $W_{t-1}$ is independent of $U_{t-1}$, and so is $S_{t-1}$, integration with respect to $U_{t-1}$ yields the unconditional transition probabilities
$$p_{ij} \triangleq E[q_\theta(S_t=j\mid S_{t-1}=i, U_{t-1})\mid S_{t-1}=i] = (1-j)\,\omega_i + j\,(1-\omega_i), \quad\text{(C.2)}$$
where
$$\omega_i = \frac{\Big[(1-i)\int_{-\infty}^{\tau\sqrt{1-\alpha^2}} + i\int_{\tau\sqrt{1-\alpha^2}}^{\infty}\Big]\,\Phi\big(\tau - \alpha x/\sqrt{1-\alpha^2}\big)\,\varphi(x)\,dx}{(1-i)\,\Phi(\tau\sqrt{1-\alpha^2}) + i\,[1-\Phi(\tau\sqrt{1-\alpha^2})]}.$$
We denote by $\otimes$ the Kronecker tensor product. Define the $2k^2\times 2k^2$ matrix
$$M = \begin{pmatrix} (A(0)\otimes A(0))\,p_{00} & (A(0)\otimes A(0))\,p_{10} \\ (A(1)\otimes A(1))\,p_{01} & (A(1)\otimes A(1))\,p_{11}\end{pmatrix}.$$
We denote the spectral radius of a real matrix $A = (a_{ij})$ by $\rho(A)$.

We first investigate Assumption 2 for the observed process (5) with transition probability (11). We follow Francq and Zakoïan (2001) to write the transition equation of (5) in vector form as
$$Y_{t-k+1}^t = A(S_t)\,Y_{t-k}^{t-1} + B(S_t, X_t), \quad\text{(C.3)}$$
where $B(S_t, X_t) = (\mu(S_t) + \gamma_X(S_t)'X_t + \sigma(S_t)U_t,\ 0,\ \dots,\ 0)'$ and
$$A(S_t) = \begin{pmatrix} \gamma_1(S_t) & \gamma_2(S_t) & \cdots & \gamma_{k-1}(S_t) & \gamma_k(S_t)\\ 1 & 0 & \cdots & 0 & 0\\ 0 & 1 & \cdots & 0 & 0\\ \vdots & & \ddots & & \vdots\\ 0 & 0 & \cdots & 1 & 0\end{pmatrix}.$$
Results on stationarity for autoregressive processes with basic Markov switching have been shown in Yao (2001) and Francq and Zakoïan (2001). We extend their proofs to the Chang et al. (2017)-type transition probability, and summarize our findings in Proposition 7.
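The condition $\rho(M) < 1$ is straightforward to check numerically for given parameter values. A minimal sketch for $k = 1$ (one autoregressive lag, so $A(s)\otimes A(s)$ reduces to $\gamma_1(s)^2$ and $M$ is $2\times2$), with hypothetical coefficients and with the unconditional probabilities $p_{ij}$ supplied directly rather than computed from (C.2):

```python
import math

def spectral_radius_2x2(m):
    (a, b), (c, d) = m
    tr, det = a + d, a * d - b * c
    disc = tr * tr - 4.0 * det
    if disc >= 0.0:
        root = math.sqrt(disc)
        return max(abs((tr + root) / 2.0), abs((tr - root) / 2.0))
    return math.sqrt(det)   # complex pair: both eigenvalues have modulus sqrt(det)

def build_M(g0, g1, p00, p01, p10, p11):
    # k = 1: A(s) (x) A(s) = gamma_1(s)^2, so M is 2 x 2
    return [[g0 * g0 * p00, g0 * g0 * p10],
            [g1 * g1 * p01, g1 * g1 * p11]]

# a mildly explosive regime (gamma = 1.05) can still give rho(M) < 1
# provided the chain leaves that regime often enough ...
stable = spectral_radius_2x2(build_M(0.5, 1.05, 0.9, 0.1, 0.2, 0.8))
# ... while a strongly explosive regime (gamma = 1.3) violates the condition
unstable = spectral_radius_2x2(build_M(0.5, 1.3, 0.9, 0.1, 0.2, 0.8))
assert stable < 1.0 < unstable
```

Under (C.2) the $p_{ij}$ would themselves be functions of $\tau$ and $\alpha$; they are fixed here only to keep the sketch self-contained.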
Proposition 7.
Assumption 2 holds in a two-regime endogenous regime-switching model with a transition equation of the observed process (5) such that $X_t$ is strictly stationary, and a Chang et al. (2017)-type transition probability (11) satisfying $|\alpha| < 1$, $W_0 \sim N(0, 1/(1-\alpha^2))$, and $\rho(M) < 1$.

Proof:
In the proof, we omit the subscript $\theta$ and write $B(S_t)$ instead of $B(S_t, X_t)$ for simplicity. Note that the stationarity and ergodicity of $\{S_t\}$ are implied by those of $\{W_t\}$. Under the assumptions that $|\alpha| < 1$ and $W_0 \sim N(0, 1/(1-\alpha^2))$, $\{W_t\}$ is a strictly stationary process. In addition, the dynamics of $\{W_t\}$ can be equivalently written as
\[
W_t = \alpha W_{t-1} + \rho U_{t-1} + \sqrt{1-\rho^2}\, V_t = \sum_{l=0}^{\infty} \rho\, \alpha^l U_{t-1-l} + \sum_{l=0}^{\infty} \sqrt{1-\rho^2}\, \alpha^l V_{t-l}.
\]
According to Theorem 7.1.3 in Durrett (2010), the ergodicity of $\{W_t\}$ follows from the ergodicity of $\{U_t\}$ and $\{V_t\}$. Consequently, $\{(A(S_t), B(S_t))\}$ is a strictly stationary and ergodic sequence.

We show that $H_{t,p} \triangleq A(S_t) A(S_{t-1}) \cdots A(S_{t-p+1})$ converges to zero in $L^2$ at an exponential rate as $p$ goes to infinity. We obtain
\[
\begin{aligned}
E[\mathrm{vec}(H_{t,p} H_{t,p}')] &= E[(A(S_t)\otimes A(S_t)) \cdots (A(S_{t-p+1})\otimes A(S_{t-p+1}))] \\
&= \int\!\cdots\!\int \sum_{s_t,\ldots,s_{t-p+1}} (A(s_t)\otimes A(s_t)) \cdots (A(s_{t-p+1})\otimes A(s_{t-p+1})) \\
&\qquad\times p(s_t, s_{t-1}, u_{t-1}, s_{t-2}, u_{t-2}, \ldots, s_{t-p+1}, u_{t-p+1})\, du_{t-1} \cdots du_{t-p+1} \tag{C.4}
\end{aligned}
\]
with
\[
\begin{aligned}
&p(s_t, s_{t-1}, u_{t-1}, s_{t-2}, u_{t-2}, \ldots, s_{t-p+1}, u_{t-p+1}) \\
&\quad= p(s_t \mid s_{t-1}, u_{t-1})\, p(u_{t-1})\, p(s_{t-1} \mid s_{t-2}, u_{t-2})\, p(u_{t-2}) \cdots p(s_{t-p+2} \mid s_{t-p+1}, u_{t-p+1})\, p(u_{t-p+1})\, p(s_{t-p+1}) \\
&\quad= p(s_t, u_{t-1} \mid s_{t-1})\, p(s_{t-1}, u_{t-2} \mid s_{t-2}) \cdots p(s_{t-p+2}, u_{t-p+1} \mid s_{t-p+1})\, p(s_{t-p+1}), \tag{C.5}
\end{aligned}
\]
where the first equality follows because $U_t$ is independent of $W_t$, $W_{t-1}$, and $U_{t-1}$. Plugging (C.5) back into (C.4) and integrating out $u_{t-1}, \ldots, u_{t-p+1}$ yields
\[
E[\mathrm{vec}(H_{t,p} H_{t,p}')] = \sum_{s_t,\ldots,s_{t-p+1}} (A(s_t)\otimes A(s_t)) \cdots (A(s_{t-p+1})\otimes A(s_{t-p+1}))\, p(s_t \mid s_{t-1})\, p(s_{t-1} \mid s_{t-2}) \cdots p(s_{t-p+2} \mid s_{t-p+1})\, p(s_{t-p+1}).
\]
We define the $2k^2 \times k^2$ matrix
\[
N = \begin{pmatrix} \big(A(0)\otimes A(0)\big)\, p(s_t = 0) \\ \big(A(1)\otimes A(1)\big)\, p(s_t = 1) \end{pmatrix}.
\]
Then, we obtain $E[\mathrm{vec}(H_{t,p} H_{t,p}')] = I' M^{p-1} N$, where $I = (I_{k^2}, I_{k^2})'$ and $I_{k^2}$ are a $2k^2 \times k^2$ matrix and a $k^2 \times k^2$ identity matrix, respectively. It follows that
\[
\|H_{t,p}\|_{L^2} \le \|I' M^{p-1} N\|^{1/2} \le \|I'\|^{1/2} \|M^{p-1}\|^{1/2} \|N\|^{1/2}. \tag{C.6}
\]
Now we can verify that the top Lyapunov exponent defined by
\[
\gamma = \inf_{t \ge 1} E\Big[\frac{1}{t} \log \|A(S_t) A(S_{t-1}) \cdots A(S_1)\|\Big]
\]
is strictly negative. Since, for fixed $p$, the stationarity of $\{H_{t,p}\}_t$ is implied by the stationarity of $\{A(S_t)\}$, we have
\[
E\big[\log \|A(S_t) A(S_{t-1}) \cdots A(S_1)\|\big] = E\big[\log \|H_{t,t}\|\big] = E\big[\log \|H_{1,t}\|\big] = \frac{1}{2} E\big[\log \|H_{1,t}\|^2\big] \le \frac{1}{2} \log E\big[\|H_{1,t}\|^2\big] \le C + \frac{1}{2} \log \|M^{t-1}\|,
\]
where the last inequality follows from (C.6) with $C = \frac{1}{2}\log(\|I'\|\,\|N\|)$. Hence, if $\rho(M) < 1$, then, since $\frac{1}{t}\log\|M^{t-1}\| \to \log \rho(M)$ by Gelfand's formula,
\[
\gamma \le \lim_{t \to \infty} \Big(\frac{C}{t} + \frac{1}{2t} \log \|M^{t-1}\|\Big) = \frac{1}{2} \log \rho(M) < 0.
\]
It follows that
\[
Y_{t-k+1}^{t} = B(S_t) + A(S_t) B(S_{t-1}) + A(S_t) A(S_{t-1}) B(S_{t-2}) + \cdots = \sum_{p=0}^{\infty} H_{t,p} B(S_{t-p})
\]
converges almost surely for each $t$ and is the unique strictly stationary solution of (C.3). Furthermore, since $\{(A(S_t), B(S_t))\}$ is strictly stationary and ergodic, by Theorem 7.1.3 in Durrett (2010), $\{Y_{t-k+1}^{t}\}_t$ is ergodic. $\Box$

We proceed to investigate Assumption 2 for the observed process (6) and transition probability (11). Following Francq et al. (2001), we write the third equation in (6) in vector form as
\[
h_t = \Gamma_t(S_t)\, h_{t-1} + C(S_t), \tag{C.7}
\]
where
\[
C(S_t) = \begin{pmatrix} C(S_t) \\ 0 \\ \vdots \\ 0 \end{pmatrix}, \qquad h_t = \begin{pmatrix} h_t \\ h_{t-1} \\ \vdots \\ h_{t-k+1} \end{pmatrix}, \qquad \Gamma_t(S_t) = \begin{pmatrix} \gamma_1(S_t) U_{t-1}^2 & \gamma_2(S_t) U_{t-2}^2 & \cdots & \gamma_{k-1}(S_t) U_{t-k+1}^2 & \gamma_k(S_t) U_{t-k}^2 \\ 1 & 0 & \cdots & 0 & 0 \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & 0 \end{pmatrix}.
\]
Results on stationarity for ARCH processes with basic Markov switching have been established in Francq et al. (2001). We extend their proofs to the Chang et al. (2017)-type transition probability and summarize our findings in Proposition 8.
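Before turning to Proposition 8, here is a small numeric illustration of the Lyapunov-exponent step above. It is a sketch for $k = 1$ (so $A(s) \otimes A(s) = \gamma_1(s)^2$ is scalar) with illustrative parameter values that are not from the paper: Jensen's inequality gives $\gamma \le \frac{1}{2}\log\rho(M)$, and the example has $\gamma < 0$ while $\rho(M) > 1$, so $\rho(M) < 1$ is sufficient but not necessary for $\gamma < 0$.

```python
import math

# Illustrative two-regime AR(1): A(s) = gamma1(s) is scalar; p[i][j] plays
# the role of the unconditional transition probabilities from (C.2).
p = [[0.9, 0.1], [0.1, 0.9]]
gamma1 = [0.5, 1.1]

# Exact top Lyapunov exponent for k = 1: gamma = E[log |gamma1(S_t)|]
# under the stationary distribution of the chain.
pi0 = p[1][0] / (p[0][1] + p[1][0])
lyap = pi0 * math.log(abs(gamma1[0])) + (1.0 - pi0) * math.log(abs(gamma1[1]))

# Spectral radius of the 2x2 matrix M of Section C.2 (k = 1).
g = [gamma1[0] ** 2, gamma1[1] ** 2]
M = [[p[0][0] * g[0], p[1][0] * g[0]],
     [p[0][1] * g[1], p[1][1] * g[1]]]
tr = M[0][0] + M[1][1]
det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
rho_M = (tr + math.sqrt(tr * tr - 4.0 * det)) / 2.0

# gamma <= (1/2) log rho(M) by Jensen; here gamma < 0 although rho(M) > 1.
assert lyap < 0.0 < 0.5 * math.log(rho_M)
```

Note that the chain used here is the marginalized (unconditional) one; in the model itself $S_t$ also depends on $U_{t-1}$, which the proof handles through the factorization (C.5).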
Proposition 8.
Let $\tilde{\Gamma}(S_t)$ be the $k \times k$ matrix obtained by replacing $U_{t-l}^2$ by $1/\sigma_-$ in $\Gamma_t(S_t)$, for $l = 1, \ldots, k$, where $\sigma_- = \min_{i,j} p_{ij}$ with $p_{ij}$ defined in (C.2). Define the $2k \times 2k$ matrix
\[
M_\Gamma = \begin{pmatrix} p_{00}\, \tilde{\Gamma}(0) & p_{10}\, \tilde{\Gamma}(0) \\ p_{01}\, \tilde{\Gamma}(1) & p_{11}\, \tilde{\Gamma}(1) \end{pmatrix}.
\]
Assumption 2 holds in a two-regime endogenous regime-switching model with a transition equation of the observed process (6) and a Chang et al. (2017)-type transition probability (11) satisfying $|\alpha| < 1$, $W_0 \sim N(0, 1/(1-\alpha^2))$, and $\rho(M_\Gamma) < 1$.

Proof:
The strict stationarity and ergodicity of $\{S_t\}$ follow from the same argument as in the proof of Proposition 7. According to Theorem 7.1.3 in Durrett (2010), $\{(\Gamma_t(S_t), C(S_t))\}$ is a strictly stationary and ergodic sequence.

We then turn to $\{h_t\}$. Since $\Gamma_t(S_t)\Gamma_{t-1}(S_{t-1})\cdots\Gamma_{t-p+1}(S_{t-p+1})\, C(S_{t-p})$ has nonnegative elements, for $p > k$,
\[
\begin{aligned}
&E[\Gamma_t(S_t)\Gamma_{t-1}(S_{t-1})\cdots\Gamma_{t-p+1}(S_{t-p+1})\, C(S_{t-p})] \\
&= \int\!\cdots\!\int \sum_{s_t,\ldots,s_{t-p+1-k}} \Gamma_t(s_t)\Gamma_{t-1}(s_{t-1})\cdots\Gamma_{t-p+1}(s_{t-p+1})\, C(s_{t-p}) \\
&\qquad\times p(s_t, s_{t-1}, u_{t-1}, s_{t-2}, u_{t-2}, \ldots, s_{t-p+1-k}, u_{t-p+1-k})\, du_{t-1}\cdots du_{t-p+1-k} \\
&= \int\!\cdots\!\int \sum_{s_t,\ldots,s_{t-p+1-k}} \Gamma_t(s_t)\Gamma_{t-1}(s_{t-1})\cdots\Gamma_{t-p+1}(s_{t-p+1})\, C(s_{t-p}) \\
&\qquad\times p(s_t, u_{t-1} \mid s_{t-1}) \cdots p(s_{t-p+2-k}, u_{t-p+1-k} \mid s_{t-p+1-k})\, p(s_{t-p+1-k})\, du_{t-1}\cdots du_{t-p+1-k} \\
&\le \sum_{s_t,\ldots,s_{t-p+1-k}} \tilde{\Gamma}(s_t)\cdots\tilde{\Gamma}(s_{t-p+1})\, C(s_{t-p})\, p(s_t \mid s_{t-1}) \cdots p(s_{t-p+1} \mid s_{t-p})\, p(s_{t-p},\ldots,s_{t-p+1-k}),
\end{aligned}
\]
where the second equality is derived using the same technique as (C.5), the inequality is elementwise, and the last inequality follows from the fact that, for each $l = 1, \ldots, p-k$,
\[
\int u_{t-l}^2\, p(s_{t-l+1}, u_{t-l} \mid s_{t-l})\, du_{t-l} \le E[u_{t-l}^2] = 1 \le p(s_{t-l+1} \mid s_{t-l})/\sigma_-.
\]
We define the $2k \times 1$ vector
\[
\bar{C} = \begin{pmatrix} C(0)\, p(s_t = 0) \\ C(1)\, p(s_t = 1) \end{pmatrix}.
\]
Then, we obtain, elementwise,
\[
E[\Gamma_t(S_t)\Gamma_{t-1}(S_{t-1})\cdots\Gamma_{t-p+1}(S_{t-p+1})\, C(S_{t-p})] \le I' M_\Gamma^{p} \bar{C},
\]
where $I = (I_k, I_k)'$ and $I_k$ are a $2k \times k$ matrix and a $k \times k$ identity matrix, respectively. Now we follow Francq et al. (2001) to define, for $t, p \in \mathbb{Z}$, the following $\mathbb{R}^k$-valued random vectors:
\[
H_p(t) = \begin{cases} 0 & \text{if } p < 0, \\ C(S_t) + \Gamma_t(S_t)\, H_{p-1}(t-1) & \text{if } p \ge 0. \end{cases}
\]
We have
\[
E\|H_p(t) - H_{p-1}(t)\| = E\|\Gamma_t(S_t)\Gamma_{t-1}(S_{t-1})\cdots\Gamma_{t-p+1}(S_{t-p+1})\, C(S_{t-p})\| = \big\|E[\Gamma_t(S_t)\cdots\Gamma_{t-p+1}(S_{t-p+1})\, C(S_{t-p})]\big\| \le \|I' M_\Gamma^{p} \bar{C}\|.
\]
When $\rho(M_\Gamma) < 1$, $\|I' M_\Gamma^{p} \bar{C}\|$ tends to zero at an exponential rate as $p \to \infty$. Hence, for fixed $t$, $H_p(t)$ converges both in $L^1$ and almost surely as $p \to \infty$. Write $h_t$ for the limit in $L^1$ and almost surely of $\{H_p(t)\}_p$. Since, for all $p$, $\{H_p(t)\}_{t \in \mathbb{Z}}$ is strictly stationary, the limit $\{h_t\}_{t \in \mathbb{Z}}$ is also strictly stationary. It is straightforward to check that $\{h_t\}_{t \in \mathbb{Z}}$ satisfies (C.7). Therefore, for each $t$, $h_t$ is the unique strictly stationary solution of (C.7). Furthermore, since $h_t$ satisfies
\[
h_t = C(S_t) + \Gamma_t(S_t)\, C(S_{t-1}) + \Gamma_t(S_t)\Gamma_{t-1}(S_{t-1})\, C(S_{t-2}) + \cdots,
\]
and $\{(\Gamma_t(S_t), C(S_t))\}$ is strictly stationary and ergodic, by Theorem 7.1.3 in Durrett (2010), $\{h_t\}$ is ergodic.

Finally, by Theorem 2.3 of Ling and Li (1997), if all roots of $\phi(L) = 1 - \phi_1 L - \cdots - \phi_k L^k$ lie outside the unit circle, then $\{Y_t\}$ is strictly stationary and ergodic. $\Box$

C.3 Discussion of Assumption 6
Write the conditional density as
\[
p_\theta(Y_1, \ldots, Y_n \mid Y_{-r+1}^{0}, X_{-r+1}^{n}) = \sum_{s_{-r+2}^{n}} \Big( \prod_{t=1}^{n} g_\theta(Y_t \mid Y^{t-1}, s_t, X_t) \prod_{t=2}^{n} q_\theta(s_t \mid s_{t-1}, Y^{t-1}, X_t)\, P_\theta(S_{-r+2}^{1} = s_{-r+2}^{1} \mid Y_{-r+1}^{0}, X_{-r+1}^{n}) \Big).
\]
It is a finite mixture of the $n$-fold product densities $\prod_{t=1}^{n} g_\theta(Y_t \mid Y^{t-1}, s_t, X_t)$ with mixing distributions $\prod_{t=2}^{n} q_\theta(s_t \mid s_{t-1}, Y^{t-1}, X_t)\, P_\theta(S_{-r+2}^{1} = s_{-r+2}^{1} \mid Y_{-r+1}^{0}, X_{-r+1}^{n})$. According to Teicher (1967), when the transition equation of the observed process is (3) and $U_t \sim N(0,1)$, the mixture is identifiable provided that, first, $m(Y_{t-k}^{t}, S_{t-k}^{t}, X_t) \ne m(Y_{t-k}^{t}, \tilde{S}_{t-k}^{t}, X_t)$ or $\sigma(Y_{t-k}^{t}, S_{t-k}^{t}, X_t) \ne \sigma(Y_{t-k}^{t}, \tilde{S}_{t-k}^{t}, X_t)$ for $S_{t-k}^{t} \ne \tilde{S}_{t-k}^{t}$, and, second, $q_\theta(\cdot) = q_{\theta^*}(\cdot)$ if and only if $\theta = \theta^*$.

APPENDIX D: SIMULATION RESULTS

[Figure D.1: Histograms of the standardized MLE in Experiment 1; panels (a)-(c) show $\rho^* < 0$, $\rho^* = 0$, and $\rho^* > 0$, respectively.]

In this section, we first explain how to compute the standardized MLE. Notice that some of the parameters are subject to constraints. To facilitate optimization, we perform transformations on the parameters. That is, $\theta = f(\zeta)$, where $\theta$ is the original parameter in the simulation experiments and $\zeta$ is the parameter over which we solve the optimization problems. We list the functions $f$ below. Instead of plotting the histograms of the original parameters, we plot standardized transformed ML estimates $\hat{\zeta}_n$. From the asymptotic result, $\sqrt{n}(\hat{\zeta}_n - \zeta^*) \Rightarrow N(0, I(\zeta^*)^{-1})$, and an estimator $\hat{V}_n$ of the asymptotic variance is the negative of the inverse of the Hessian matrix at the MLE value. $\zeta^*$ can be obtained from $\zeta^* = f^{-1}(\theta^*)$, where $f^{-1}$ is the inverse function of $f$. The standardized MLE is calculated as $\hat{\zeta}_{\mathrm{std}} = \hat{V}_n^{-1/2}\sqrt{n}(\hat{\zeta}_n - \zeta^*)$.

The transformation functions $f$ are as follows. For $\sigma(0)$ and $\sigma(1)$ in Experiments 1 and 2 and $C(0)$ and $C(1)$ in Experiment 3, $f(x) = \exp(x)$ and $f^{-1}(x) = \log(x)$.
For $\rho$ and $\alpha$ in Experiments 1 and 2, $f(x) = 2\arctan(x)/\pi$ and $f^{-1}(x) = \tan(0.5\pi x)$. For $\gamma$ in Experiment 3, $f(x) = \arctan(x)/\pi + 0.5$ and $f^{-1}(x) = \tan(\pi(x - 0.5))$.

[Figure D.2: Histograms of the standardized MLE in Experiment 3; panels (a)-(c) show $\rho^* < 0$, $\rho^* = 0$, and $\rho^* > 0$, respectively.]

APPENDIX E: TABLES OF EMPIRICAL APPLICATION

[Table E.1: Maximum likelihood estimates for S&P weekly data. Columns: no-switch ARCH; exogenous switching in $c$ ($\rho = 0$); endogenous switching in $c$ ($-1 < \rho < 1$); exogenous switching in $c$ and $\gamma$ ($\rho = 0$); endogenous switching in $c$ and $\gamma$ ($-1 < \rho < 1$). Rows: $\mu$, $\phi$, $c(0)$, $c(1)$, $\gamma$, $\gamma(0)$, $\gamma(1)$, $\rho$, $\alpha$, $\tau$, and log-likelihood, with standard errors in parentheses; the numerical entries are not recoverable from the extracted source.]
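The reparameterizations and the standardization described in Appendix D can be sketched as follows (a minimal sketch; the function names are ours, and the scalar standardization is shown for illustration, whereas the experiments use the full matrix $\hat{V}_n^{-1/2}$):

```python
import math

# Reparameterizations theta = f(zeta) mapping unconstrained zeta to the
# constrained parameter spaces used in Appendix D.
def f_positive(x):        # sigma(0), sigma(1), C(0), C(1): range (0, inf)
    return math.exp(x)

def f_unit_interval(x):   # rho, alpha: range (-1, 1)
    return 2.0 * math.atan(x) / math.pi

def f_zero_one(x):        # gamma in Experiment 3: range (0, 1)
    return math.atan(x) / math.pi + 0.5

def finv_positive(y):
    return math.log(y)

def finv_unit_interval(y):
    return math.tan(0.5 * math.pi * y)

def finv_zero_one(y):
    return math.tan(math.pi * (y - 0.5))

# Round-trip and range checks over a few unconstrained values.
for x in (-2.0, -0.3, 0.0, 0.7, 3.0):
    assert abs(finv_unit_interval(f_unit_interval(x)) - x) < 1e-9
    assert -1.0 < f_unit_interval(x) < 1.0
    assert 0.0 < f_zero_one(x) < 1.0
assert abs(finv_positive(f_positive(1.7)) - 1.7) < 1e-12

# Scalar version of the standardization zeta_std = V^{-1/2} sqrt(n) (zeta_hat - zeta_star).
def standardize(zeta_hat, zeta_star, v_hat, n):
    return math.sqrt(n) * (zeta_hat - zeta_star) / math.sqrt(v_hat)
```

Under the stated asymptotics, `standardize` applied coordinatewise to independent coordinates should yield draws that are approximately standard normal, which is what the histograms in Figures D.1 and D.2 assess.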