AdaVol: An Adaptive Recursive Volatility Prediction Method
Nicklas Werge^a, Olivier Wintenberger^a
^a LPSM, Sorbonne Université, 4 place Jussieu, 75005 Paris, France
Abstract
The Quasi-Maximum Likelihood (QML) procedure is widely used for statistical inference due to its robustness against overdispersion. However, while there are extensive references on non-recursive QML estimation, recursive QML estimation has attracted little attention until recently. In this paper, we investigate the convergence properties of the QML procedure in a general conditionally heteroscedastic time series model, extending the classical offline optimization routines to recursive approximation. We propose an adaptive recursive estimation routine for GARCH models that uses the technique of Variance Targeting Estimation (VTE) to alleviate the convergence difficulties encountered in the usual QML estimation. Finally, empirical results demonstrate a favorable trade-off between the ability to adapt to time-varying estimates and the stability of the estimation routine.
Keywords: volatility models, quasi-likelihood, recursive algorithm, GARCH, prediction method, stock index
1. Introduction
Time series analysis has attracted considerable attention in the last three decades. A crucial issue for time series analysis is modeling heteroscedasticity of the conditional variance, e.g. volatility clustering in financial time series. The best-known models capturing this feature are the autoregressive conditional heteroscedasticity (ARCH) model and the generalized ARCH (GARCH) model, introduced by Engle (1982) and Bollerslev (1986), respectively. The success of these models can be explained by several reasons: one is that they constitute a stationary time series model with a time-varying conditional variance; another is that they model time series with heavier tails than the Gaussian one, which often occur in financial time series.

Quasi-Maximum Likelihood (QML) estimation is widely used for statistical inference in GARCH models since it tolerates this overdispersion. In this paper, we study the Quasi-Maximum Likelihood Estimator (QMLE) for the broader class of conditionally heteroscedastic time series models of multiplicative form given by

X_t = h_t(θ₀)^{1/2} Z_t, t ∈ ℤ, (1.1)

where θ₀ is the true underlying parameter vector and the (non-negative) volatility process (h_t)_{t∈ℤ} is defined as

h_t(θ) = g_θ(X_{t−1}, …, X_{t−p}, h_{t−1}(θ), …, h_{t−q}(θ)), p, q ≥ 1, (1.2)

where (Z_t) is a sequence of i.i.d. random variables with E[Z₀] = 0 and E[Z₀²] = 1. Suppose that the parameter set Θ ⊂ ℝ^d and that {g_θ | θ ∈ Θ} denotes the (finite) parametric family of non-negative functions on ℝ^p × [0, ∞)^q, fulfilling certain regularity conditions. We also require that h_t is F_{t−1}-measurable, where for all t ∈ ℤ, F_t = σ(Z_k : k ≤ t) denotes the σ-field generated by the random variables {Z_k : k ≤ t}.

The stability of model (1.1)-(1.2) is accomplished under the assumption that "g_θ is a contraction". This condition is a random Lipschitz coefficient condition where the Lipschitz coefficient has a negative logarithmic moment. The notion of contractivity is clarified in Straumann and Mikosch (2006), where they study QML inference of general conditionally heteroscedastic models with emphasis on the approximation (ĥ_t) of the stochastic volatility (h_t).

QML estimation of the parameters in the class of conditionally heteroscedastic time series models has been studied frequently in recent years, see e.g. Berkes et al. (2003), Francq and Zakoïan (2004), Straumann and Mikosch (2006) and Wintenberger (2013). However, all these references consider non-recursive (offline) estimation, where one assembles a batch of data and afterward performs the statistical inference. Nevertheless, recursive estimates made using recursive algorithms are arguably advantageous, as one treats each observation only once. Indeed, in recursive QML estimation, we update the QMLE at time t − 1 to yield the QMLE of the parameters at time t.

In modern statistical analysis, it is becoming increasingly common to work with streaming data, where one observes only a group of observations at a time. Naturally, this has led to an expanded interest in scalable (in time) recursive estimation procedures, e.g. see Bottou and Bousquet (2007).

Email addresses: [email protected] (Nicklas Werge), [email protected] (Olivier Wintenberger)
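To make the setup concrete, model (1.1)-(1.2) can be simulated once a specific g_θ is chosen. The following is a minimal sketch assuming a GARCH(1,1)-type recursion h_t = ω + α₁X_{t−1}² + β₁h_{t−1}; the function name and the stationary-variance initialization are illustrative choices, not taken from the paper.

```python
import numpy as np

def simulate_garch11(n, omega, alpha, beta, seed=0):
    """Simulate X_t = h_t(theta0)^{1/2} Z_t as in (1.1), with a GARCH(1,1)
    volatility recursion h_t = omega + alpha * X_{t-1}^2 + beta * h_{t-1}."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal(n)           # i.i.d. innovations, E[Z]=0, E[Z^2]=1
    x, h = np.empty(n), np.empty(n)
    h[0] = omega / (1.0 - alpha - beta)  # start at the stationary variance
    x[0] = np.sqrt(h[0]) * z[0]
    for t in range(1, n):
        h[t] = omega + alpha * x[t - 1] ** 2 + beta * h[t - 1]
        x[t] = np.sqrt(h[t]) * z[t]
    return x, h
```

With α₁ + β₁ < 1, the unconditional variance is ω/(1 − α₁ − β₁), which a long simulated path should roughly reproduce.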
However, only a small amount of attention has been given to recursive (online) estimation in conditionally heteroscedastic time series models. Dahlhaus and Subba Rao (2007) presented a recursive method to estimate the parameters in an ARCH process. Under sufficient conditions on the underlying process, Aknouche and Guerbyenne (2006) showed consistency of their recursive least squares method for GARCH processes. Kierkegaard et al. (2000) also developed a recursive estimation method for GARCH processes supported by empirical evidence. The authors of Gerencsér et al. (2010) show convergence analysis of recursive QML estimation for GARCH processes based on BMP theory with the use of a resetting mechanism. A self-weighted recursive estimation algorithm for GARCH models was proposed by Cipra and Hendrych (2018), with a robustification in Hendrych and Cipra (2018). However, none of the above references mention convexity or the estimation issues of small ω parameter values for GARCH models.

In settings with streaming data sets, the difficulty of estimating time-varying parameters of statistical models increases. To sustain computational efficiency and remain adaptive, one may decrease the number of observations in each iteration of the optimization procedure, which may increase the instability of the statistical inference. We propose a natural adaptation of the QML method relying on stochastic approximations with the use of the Variance Targeting Estimation (VTE) technique. We give empirical evidence that the procedure achieves a favorable trade-off between adaptation ability and stability.

Preprint submitted to Econometrics and Statistics, June 4, 2020.
2. QML Estimation in Conditionally Heteroscedastic Time Series Models
The approximate QMLE θ̂*_n is defined as

θ̂*_n ∈ arg min_{θ∈K} L̂_n(θ), (2.1)

where the parameter set K is a suitable compact subset of the parameter space Θ. The QL function L_n(θ) and the approximate QL function L̂_n(θ) are, respectively, given by

L_n(θ) = Σ_{t=1}^n l_t(θ) and L̂_n(θ) = Σ_{t=1}^n l̂_t(θ), (2.2)

where the QL losses, l_t(θ) and l̂_t(θ), are given by

l_t(θ) = 2^{−1}(X_t²/h_t(θ) + log h_t(θ)) and l̂_t(θ) = 2^{−1}(X_t²/ĥ_t(θ) + log ĥ_t(θ)), (2.3)

where (ĥ_t) is an approximation of (h_t) defined recursively for t ≥ 1 by mimicking (1.2) with an arbitrary initialization ĥ_{−q+1} = · · · = ĥ₀ ≥ 0; the difference between (ĥ_t) and the true (h_t) vanishes exponentially fast almost surely by (Straumann, 2005, Proposition 5.2.12). Assuming that Z₀ is standard normally distributed, note that X_t is also Gaussian with variance h_t conditionally on F_{t−1}. The QL function L_n(·) in (2.2) is derived under this Gaussian assumption.

The consistency and asymptotic properties of the QMLE θ̂*_n, combined with the robustness of the QL function with respect to overdispersion (see Patton (2006)), make the method widely used in practice. Under the conditions in (Straumann and Mikosch, 2006, N.1, N.2, N.3 and N.4), the QMLE θ̂*_n is strongly consistent and asymptotically normal, i.e.

θ̂*_n → θ₀ a.s. and √n (θ̂*_n − θ₀) → N(0, V₀) as n → ∞, (2.4)

with θ₀ as the true parameter vector and V₀ the asymptotic covariance matrix.

Unfortunately, these asymptotic properties in (2.4) come with a drawback on the QL loss; the robustness is achieved thanks to careful domination with logarithms. The concavity of those logarithms makes the criterion insensitive to extreme values, but it also implies that the criterion behaves as a concave function.
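To fix ideas, the approximate QL function in (2.2)-(2.3) can be evaluated in a few lines for a GARCH(1,1) specification. The sketch below uses the sample variance as an ad-hoc positive initialization of the volatility recursion (an illustrative choice; its effect vanishes exponentially fast, as noted above).

```python
import numpy as np

def ql_loss(theta, x):
    """n^{-1} times the approximate QL function of (2.2), with QL losses
    (2.3), for a GARCH(1,1) model; theta = (omega, alpha, beta)."""
    omega, alpha, beta = theta
    n = len(x)
    sigma2 = x.var()                     # ad-hoc positive initialization
    total = 0.0
    for t in range(1, n):
        sigma2 = omega + alpha * x[t - 1] ** 2 + beta * sigma2
        total += 0.5 * (x[t] ** 2 / sigma2 + np.log(sigma2))
    return total / n
```

On simulated data, the loss evaluated at the true parameter vector is typically smaller than at a clearly misspecified one, in line with the consistency statement (2.4).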
As most optimization algorithms are based on convexity assumptions, this is striking. In the next section, we show that the approximate Hessian Ĥ_n(θ) = n^{−1}∇²L̂_n(θ) admits strictly positive eigenvalues for n large enough, depending on the model specifications and the underlying data process. Meaning, for sufficiently large batch sizes of observations, the QMLE θ̂*_n can be seen as the unique solution of a locally strongly convex optimization problem. The existence and uniqueness of θ̂*_n ensure that it can be efficiently approximated by the usual (offline) optimization routines for n large enough.

In order to establish the asymptotic local convexity of the QL function of model (1.1)-(1.2), we need the following assumptions, Assumptions W1, W2 and W3, which naturally emerge from the arguments and properties (Straumann and Mikosch, 2006, N.1, N.2, N.3 and N.4) made in order to have stability of the QL function and the QMLE procedure. We will use two different matrix norms in this paper: let ‖A‖_op denote the matrix operator norm of the matrix A ∈ ℝ^{d×d} with respect to the Euclidean norm, i.e. ‖A‖_op = sup_{v≠0} |Av|/|v|. Denote by ‖A‖_K the norm of the continuous matrix-valued function A on K, i.e. ‖A‖_K = sup_{x∈K} ‖A(x)‖_op, where K is a compact set of ℝ^d.

Assumption W1.
Assume model (1.1)-(1.2) with θ = θ₀ admits a unique stationary ergodic solution.

Assumption W2.
Assume K ⊂ Θ is a compact set with the true parameter vector θ₀ ∈ K in its interior. The random functions fulfill certain conditions such that E[‖l₀‖_K] < ∞ and E[‖∇²l₀‖_K] < ∞, and further have the following uniform convergences:

‖n^{−1}(L̂_n − L_n)‖_K → 0 a.s. and n^{−1}‖∇²L̂_n − ∇²L_n‖_K → 0 a.s. as n → ∞.

Assumption W3.
Assume the components of the vector ∇_θ g_θ(X₀, h₀) from (1.2) with θ = θ₀ are linearly independent random variables.

The following Theorem 2.1 is an extension of Ip et al. (2006), which established similar results for the likelihood function of GARCH models under the assumption that (X_t) is strictly stationary and strongly mixing with geometric rate and (Z_t) is Gaussian. Solving the QML estimation problem (2.1) for θ̂*_n is known to be computationally heavy, since one has to find the solution of the non-linear equation (2.2). Nonetheless, Theorem 2.1 ensures the existence of an N such that we have a unique global QMLE θ̂*_n.

Theorem 2.1. Under Assumptions W1, W2 and W3, there exist positive constants C, δ > 0 and a random positive integer N ∈ ℕ₊ such that we have

gᵀ Ĥ_n(θ) g > C gᵀg, ∀ n ≥ N, g ∈ ℝ^d \ {0}, a.s., (2.5)

for all θ ∈ B(θ₀, δ).

The above result shows local strong convexity of the QL function L̂_n. The following corollary arises from the proof of Theorem 2.1:

Corollary 2.1.
Under Assumptions W1, W2 and W3, the QMLE θ̂*_n exists and is unique, namely θ̂*_n = arg min_{θ∈K} L̂_n(θ).

Local strong convexity is crucial in order to have convergence of an optimization algorithm. Thus, Theorem 2.1 is an essential result for computing the QMLE θ̂*_n of the parameters in model (1.1)-(1.2). But to guarantee the property in (2.5), we need a sufficiently large (and maybe unbounded) random N, which depends on the true parameter vector, the parameter estimates and the observations. In practice, one often has a fixed number of observations, which is why the iterative algorithm may not converge. In our experience, this phenomenon occurs when the true parameter vector is near the boundary of K or when the initial values are far away from the true parameters.

2.1. GARCH(p, q) Parameters
The general class of conditionally heteroscedastic time series models includes the very popular ARCH and GARCH models. These models have attracted considerable attention in the literature for more than three decades since their introduction. A process (X_t) is called a GARCH(p, q) process with parameter vector θ = (ω, α₁, …, α_p, β₁, …, β_q)ᵀ if it satisfies

X_t = σ_t Z_t, σ_t² = ω + Σ_{i=1}^p α_i X_{t−i}² + Σ_{j=1}^q β_j σ_{t−j}², (2.6)

where ω, α_i and β_j for 1 ≤ i ≤ p and 1 ≤ j ≤ q are non-negative parameters ensuring the non-negativity of the conditional variance process (σ_t²). The innovations (Z_t) form a sequence of i.i.d. random variables with E[Z₀] = 0 and E[Z₀²] = 1. Similarly, one can define an ARCH(p) process by setting β_j = 0 for 1 ≤ j ≤ q in (2.6). A GARCH(p, q) process (X_t) given in (2.6) has QL losses given by l̂_t(θ) = 2^{−1}(X_t²/σ̂_t²(θ) + log σ̂_t²(θ)) with first derivative

∇l̂_t(θ) = ∇σ̂_t²(θ) (σ̂_t²(θ) − X_t²) / (2σ̂_t⁴(θ)), (2.7)

and second derivative

∇²l̂_t(θ) = ∇σ̂_t²(θ)ᵀ∇σ̂_t²(θ) (2X_t² − σ̂_t²(θ)) / (2σ̂_t⁶(θ)) + ∇²σ̂_t²(θ) (σ̂_t²(θ) − X_t²) / (2σ̂_t⁴(θ)), (2.8)

where ∇σ̂_t²(θ) = ϑ_t(θ) + Σ_{j=1}^q β_j ∇σ̂_{t−j}²(θ) with ϑ_t(θ) = (1, X_{t−1}², …, X_{t−p}², σ̂_{t−1}²(θ), …, σ̂_{t−q}²(θ))ᵀ ∈ ℝ^{p+q+1}, and Hessian Ĥ_n(θ) = n^{−1} Σ_{t=1}^n ∇²l̂_t(θ).

Equation (2.6) creates a complicated probabilistic structure that is not easily understood, although it looks rather simple. A solution to the problem of finding conditions for the existence and uniqueness of a stationary solution to the equations (2.6) for GARCH(1, 1) was provided by Nelson (1990), while Bougerol and Picard (1992) showed it for the GARCH(p, q) model. Bougerol and Picard (1992) make use of the fact that GARCH(p, q) can be embedded in a Random Iterated Lipschitz Map (RILM). See Bougerol (1993) for a formal definition of RILMs.

We can illustrate the RILM method on the GARCH(1, 1) model with θ = (ω, α₁, β₁)ᵀ. The RILM for σ_t² is then given by σ_t² = A_t σ_{t−1}² + B_t with t ∈ ℤ, where A_t = α₁Z_{t−1}² + β₁ and B_t = ω. Remark that ((A_t, B_t)) constitutes an i.i.d. sequence. From the literature on RILMs, it is well known that the conditions E[log|A₀|] < 0 and E[log⁺|B₀|] < ∞ guarantee the existence and uniqueness of a strictly stationary solution of the RILM Y_t = A_t Y_{t−1} + B_t for t ∈ ℤ, provided ((A_t, B_t)) is a stationary ergodic sequence. Applying this to the GARCH(1, 1) model, E[log(α₁Z₀² + β₁)] < 0 is a sufficient condition for the existence of a stationary solution. This also implies β₁ < 1, since log(β₁) ≤ E[log(α₁Z₀² + β₁)] < 0. ARCH(1) processes (β₁ = 0) then require E[log(α₁Z₀²)] < 0, which holds if and only if α₁ < 2e^γ ≈ 3.56 with Z₀ Gaussian, where γ denotes the Euler–Mascheroni constant. Thus, the condition for stationarity is much weaker than the second-order stationarity condition, where α₁ < 1 is required.

In practice, the distribution of the innovations (Z_t) remains unspecified, and thus one usually determines the likelihoods under the hypothesis of standard Gaussian innovations. Moreover, the volatility (σ_t²) is an unobserved quantity that is approximated by mimicking the recursion (2.6) with an initialization X_{−p+1} = · · · = X₀ = σ̂²_{−q+1} = · · · = σ̂²₀ = 0.

Local strong convexity of the QL function holds for GARCH(p, q) models by Theorem 2.1. However, the number of observations needed to guarantee local strong convexity varies. This can easily be seen by looking at the simplest case, namely, where (X_t) is an ARCH(1) process with parameter vector θ = (ω, α₁)ᵀ. The volatility process σ_t²(θ) is given as ω + α₁X_{t−1}². The non-negativity of ∇²l_t(θ₀), proportional to (2X_t² − σ_t²(θ₀))/(2σ_t⁶(θ₀)), would ensure convexity at iteration t in our QML procedure. However, having convexity at each iteration is unlikely, as P(∩_{t=1}^n {∇²l_t(θ₀) ≥ 0}) = P(∩_{t=1}^n {Z_t² ≥ 1/2}) = P(Z₀² ≥ 1/2)^n, which is approximately 0.48^n with i.i.d. Gaussian innovations (Z_t), i.e. (Z_t²) is chi-squared distributed with 1 degree of freedom, Z₀² ∼ χ²₁. On the other hand, increasing the number of observations used at each iteration would increase the probability of having local strong convexity.

3. Adaptive Recursive QML Estimation

Our recursive QML method relies on stochastic approximations, introduced by Robbins and Monro (1951), which only require the previous parameter estimate at each iterate to update the parameter estimate using the new observation. We perform the first-order stochastic gradient method defined as

θ̂_t = θ̂_{t−1} − η_{t−1} ∇l̂_t(θ̂_{t−1}), (3.1)

where η_{t−1} > 0 is the step-size and ∇l̂_t(θ̂_{t−1}) is the gradient using the observation X_t and the QMLE estimate θ̂_{t−1}.
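For an ARCH(1) model, the update (3.1) is fully explicit: σ̂_t² = ω + α₁X_{t−1}², ∇σ̂_t² = (1, X_{t−1}²)ᵀ, and the QL gradient is given by (2.7). A single step can be sketched as follows (the function name is an illustrative choice):

```python
import numpy as np

def sgd_step(theta, x_t, x_prev, eta):
    """One Robbins-Monro step (3.1) for ARCH(1); theta = (omega, alpha)."""
    omega, alpha = theta
    sigma2 = omega + alpha * x_prev ** 2           # volatility recursion
    grad_sigma2 = np.array([1.0, x_prev ** 2])     # d(sigma2)/d(theta)
    grad = grad_sigma2 * (sigma2 - x_t ** 2) / (2.0 * sigma2 ** 2)  # (2.7)
    return theta - eta * grad
```

When the realized X_t² exceeds the current σ̂_t², the gradient pushes ω and α₁ upward, and vice versa.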
Depending on the number of observations, we have a trade-off between the accuracy of the QML estimates and the time it takes to perform a parameter update (see Bottou and Bousquet (2007)).

According to Robbins and Monro (1951), we must schedule the step-size such that Σ_{t=1}^∞ η_t = ∞ and Σ_{t=1}^∞ η_t² < ∞. But these bounds do not make the choice of an appropriate step-size η_t easier in practice. A more suitable approach is an adaptive learning rate, which updates the step-size in (3.1) on the fly according to the gradient ∇l̂_t(·). Thus, our choice of step-size η_t has less impact on performance, making convergence more robust and lowering the demand for manual fine-tuning. This is often used in streaming settings, where generic methods are preferred. Adaptive learning rates with a separate learning rate for each parameter were proposed by Duchi et al. (2011) in their AdaGrad procedure. This speeds up convergence in situations where the appropriate learning rates vary across parameters. Other well-known examples of adaptive learning rates are AdaDelta by Zeiler (2012), RMSProp by Tieleman and Hinton (2012) and ADAM by Kingma and Ba (2015). As we may expect a lack of convexity, we select the AdaGrad algorithm, since it has shown promising results in non-convex optimization (see Ward et al. (2018)). The AdaGrad procedure is given by the updates

θ̂_t = θ̂_{t−1} − η (Σ_{i=1}^t ∇l̂_i(θ̂_{i−1})² + ε)^{−1/2} ∇l̂_t(θ̂_{t−1}), (3.2)

(thought element-wise), where η > 0 is a constant learning rate, ε > 0 is a small constant, and ∇l̂_i(θ̂_{i−1})² indicates the element-wise square ∇l̂_i(θ̂_{i−1}) ⊙ ∇l̂_i(θ̂_{i−1}).

As the QL loss is only defined for θ̂_n ∈ K, we require that the recursive algorithm always lies in K. As suggested by Zinkevich (2003), we project our approximation θ̂_n onto K, preventing large jumps and enforcing our stochastic gradient method to converge.
This is implemented on (3.2), and our method is given by

θ̂_t = P_K[θ̂_{t−1} − η (Σ_{i=1}^t ∇l̂_i(θ̂_{i−1})² + ε)^{−1/2} ∇l̂_t(θ̂_{t−1})]. (3.3)

The parameters in a GARCH process (X_t) are numerically difficult to estimate in empirical applications. The numerical optimization algorithms can easily fail or converge to irregular solutions (see Zumbach (2000)). Therefore, the approximation of the QMLE θ̂*_n must be examined with a healthy dose of skepticism. A well-discussed problem for GARCH(p, q) models is that the QMLE performs badly for numerically small (but still positive) ω values. The parameter ω is a vital and often tricky parameter to estimate. Stabilizing the estimation of ω would not only improve the ω estimate but also have a positive impact on the other model parameters.

One way to overcome small ω values for the GARCH(p, q) model is by scaling (X_t) with some factor λ > 0. However, we wish to avoid this form of inference in our recursive algorithm, as one then needs to come up with a scaling parameter that has to be estimated beforehand. Instead, we address this issue by introducing a concept called Variance Targeting Estimation (VTE) (see Francq et al. (2011)). We apply VTE for estimating ω by use of γ², which is the unconditional variance, estimated recursively by the sample variance as seen in (3.4). Thus, we have a two-step estimator, where we estimate γ² recursively as the sample variance, and the remaining parameters are estimated by the QML method. Pseudo-code of our proposed adaptive recursive algorithm is presented in Algorithm 1. The reparametrization is obtained by defining

ω = γ² (1 − Σ_{i=1}^p α_i − Σ_{j=1}^q β_j). (3.4)

The volatility process in the GARCH(p, q) process can then be rewritten as

σ_t² − γ² = Σ_{i=1}^p α_i (X_{t−i}² − γ²) + Σ_{j=1}^q β_j (σ_{t−j}² − γ²). (3.5)

Similarly, one can define an ARCH(p) process by setting β_j = 0 for 1 ≤ j ≤ q.
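The reparametrization (3.4)-(3.5) leaves the volatility recursion unchanged: writing ω = γ²(1 − Σα_i − Σβ_j) and centering by γ² yields the same σ_t². A quick numerical check for GARCH(1,1) (helper names are illustrative):

```python
def vol_garch(omega, alpha, beta, x2_prev, s2_prev):
    """Original GARCH(1,1) volatility recursion (2.6)."""
    return omega + alpha * x2_prev + beta * s2_prev

def vol_vte(gamma2, alpha, beta, x2_prev, s2_prev):
    """Variance-targeted recursion (3.5), centered by gamma^2."""
    return gamma2 + alpha * (x2_prev - gamma2) + beta * (s2_prev - gamma2)
```

With γ² = ω/(1 − α₁ − β₁), the two recursions coincide term by term.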
The GARCH(p, q) process (X_t) in (3.5) has similar QL losses as before, except for ∇σ̂_t²(θ) in (2.7) and (2.8), where ϑ_t(θ) is given as (X_{t−1}² − γ², …, X_{t−p}² − γ², σ̂_{t−1}²(θ) − γ², …, σ̂_{t−q}²(θ) − γ²)ᵀ ∈ ℝ^{p+q}, and the corresponding parameter set is K = {(α₁, …, α_p, β₁, …, β_q) ∈ ℝ^{p+q}₊ | Σ_{i=1}^p α_i + Σ_{j=1}^q β_j ≤ 1}.

An advantage of VTE is that we ensure a consistent estimate of the long-run variance even if the model is misspecified. Also, given that γ² is well estimated, we reduce the dimension of the parameter space and increase the speed of convergence of the optimization routines. Moreover, the nice geometry of the new optimization set K lets the projection step in (3.3) be efficiently implemented following Duchi et al. (2008). However, VTE requires stronger assumptions for the existence of the variance and is likely to suffer from efficiency loss. Francq et al. (2011) also showed that VTE will never be asymptotically more accurate than the QMLE. Another drawback of using VTE is that one needs a finite fourth moment of the process (X_t). Meaning, one would need α₁ < 0.57 for an ARCH(1) model using standard Gaussian noise, as E[X_t⁴] < ∞ if and only if α₁² + (E[Z₀⁴] − 1)α₁² < 1. For a GARCH(1, 1) model we should have (α₁ + β₁)² + (E[Z₀⁴] − 1)α₁² < 1. These parameter bounds restrict the usefulness and range of applications of our method.

Let (X_t) follow a GARCH(p, q) process with parameter vector θ = (ω, α₁, …, α_p, β₁, …, β_q)ᵀ and innovations (Z_t). Then for any λ > 0, (√λ X_t) is a GARCH(p, q) process with parameter vector θ = (λω, α₁, …, α_p, β₁, …, β_q)ᵀ and identical innovations (Z_t).

Algorithm 1: Adaptive recursive QML estimation for GARCH(p, q) models using the technique of VTE, applied on (X_t)_{t≥1} with an initialization X_{−p+1} = · · · = X₀ = σ̂²_{−q+1} = · · · = σ̂²₀ = 0. The projection onto K is indicated by P_K[·]. All vector operations are element-wise, e.g. ĝ_t² denotes the element-wise square ĝ_t ⊙ ĝ_t. Good default settings are η = 0.1 and ε = 10⁻⁸.

input: θ̂₀ (initial parameter vector)
begin
  initialize: σ̂₁² = X₀², μ̂₀ = 0, γ̂₀² = 0, Ĝ₀ = 0, t = 0
  while θ̂_t not converged do
    t = t + 1
    μ̂_t = t(t + 1)^{−1} μ̂_{t−1} + (t + 1)^{−1} X_t
    γ̂_t² = (t − 1)t^{−1} γ̂²_{t−1} + t^{−1}(X_t − μ̂_t)²
    ĝ_t = ∇l̂_t(θ̂_{t−1})
    Ĝ_t = Ĝ_{t−1} + ĝ_t²
    θ̂_t = P_K[θ̂_{t−1} − η(Ĝ_t + ε)^{−1/2} ĝ_t]
    σ̂²_{t+1} = γ̂_t² + Σ_{i=1}^p α̂_i^{(t)}(X_{t+1−i}² − γ̂_t²) + Σ_{j=1}^q β̂_j^{(t)}(σ̂²_{t+1−j} − γ̂_t²)
  end
end
Result: θ̂_t (resulting estimates)
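Putting the pieces together, Algorithm 1 can be sketched for GARCH(1,1) as follows. The projection here is a crude clip-and-rescale stand-in for the exact simplex projection of Duchi et al. (2008), and the gradient drops the recursive part of ∇σ̂_t² for brevity, so this is an illustrative approximation of the algorithm rather than the reference implementation.

```python
import numpy as np

def project_K(theta, margin=1e-6):
    """Stand-in projection onto K = {alpha, beta >= 0, alpha + beta <= 1}:
    clip to non-negativity, then rescale onto the simplex if needed."""
    theta = np.clip(theta, 0.0, None)
    s = theta.sum()
    if s > 1.0 - margin:
        theta *= (1.0 - margin) / s
    return theta

def adavol_garch11(x, eta=0.1, eps=1e-8, theta0=(0.3, 0.3)):
    """Adaptive recursive QML with VTE (Algorithm 1 sketch) for GARCH(1,1).
    theta = (alpha1, beta1); omega is targeted via the recursive variance."""
    theta = np.asarray(theta0, dtype=float)
    mu, gamma2, G = 0.0, 0.0, np.zeros(2)
    sigma2 = x[0] ** 2 + eps                         # initialize volatility
    for t in range(1, len(x)):
        mu = t / (t + 1) * mu + x[t] / (t + 1)       # recursive sample mean
        gamma2 = (t - 1) / t * gamma2 + (x[t] - mu) ** 2 / t  # recursive variance
        # truncated QL gradient (2.7) w.r.t. (alpha1, beta1)
        grad_s2 = np.array([x[t - 1] ** 2 - gamma2, sigma2 - gamma2])
        g = grad_s2 * (sigma2 - x[t] ** 2) / (2.0 * sigma2 ** 2)
        G += g ** 2                                   # AdaGrad accumulator
        theta = project_K(theta - eta * g / np.sqrt(G + eps))
        alpha, beta = theta
        # variance-targeted volatility recursion (3.5)
        sigma2 = max(gamma2 + alpha * (x[t] ** 2 - gamma2)
                     + beta * (sigma2 - gamma2), eps)
    return theta, gamma2
```

The defaults η = 0.1 and ε = 10⁻⁸ mirror those of Algorithm 1; θ₀ = (0.3, 0.3) is an arbitrary admissible starting point.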
4. Applications
In this section, we apply our recursive method on simulated and real-life data. Our implementation of Algorithm 1 is provided in a repository at Werge (2019). We compare our approach to the non-recursive QMLE approximation θ̃_n, which is re-estimated every two thousand observations using all observations up to that point, i.e. (θ̃_t)_{2000(k−1)+1 ≤ t ≤ 2000k} is estimated using (X_t)_{1 ≤ t ≤ 2000k} for k = 1, 2, …. As suggested by Ip et al. (2006), we use the L-BFGS-B algorithm to solve the nonlinear optimization problem in (2.1) for θ̃_n with initial guess θ̃₀ ∈ K. Our two-step recursive QMLE approximation θ̂_n is described in Section 3.1 for the GARCH(p, q) model. It takes our initial value θ̂₀ ∈ K, the learning rate η = 0.1 and ε = 10⁻⁸ as input.¹ At last, for a fair comparison, we always use the same initial guess for both methods, namely θ̂₀ = θ̃₀ ∈ K.

¹Algorithm 1 can be tuned by changing the learning parameter η, e.g. by choosing the best-performing learning rate η using the first part of the observations. The choice of learning rate is arduous, as a too-large learning rate may cause the algorithm to diverge away from the parameter estimate. Contrarily, a too-small learning rate may result in slow convergence. However, a small learning rate can be preferred if one only wishes to keep track of minor changes in the parameter estimates.

All simulations are performed using twenty thousand observations (n = 20000), and (X_t) is always generated using Gaussian innovations with zero mean and unit variance. As discussed before, the non-recursive QMLE approximation θ̃_n performs badly for numerically small ω > 0. To investigate the impact of the size of the ω parameter, Figure 1 shows the trajectories of both QMLE approximations using an ARCH(1) process with true parameter vector and initial values given by

θ₀ = (ω₀, α₁)ᵀ and θ̂₀ = θ̃₀ ∈ K. (4.1)

Figure 1 shows a very reasonable convergence of both estimators, θ̂_n = (ω̂^(n), α̂₁^(n))ᵀ and θ̃_n = (ω̃^(n), α̃₁^(n))ᵀ, when the true parameter ω₀ is not small. Not surprisingly, our method experiences some fluctuations in the beginning, but as the learning rate decreases, the fluctuations evaporate.
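For reference, the non-recursive benchmark θ̃_n can be obtained with an off-the-shelf bounded quasi-Newton routine. Below is a sketch for ARCH(1) using SciPy's L-BFGS-B; the bounds standing in for K and the initial guess are illustrative choices.

```python
import numpy as np
from scipy.optimize import minimize

def qmle_arch1(x, theta0=(0.5, 0.5)):
    """Offline QMLE (2.1) for ARCH(1) via L-BFGS-B on the QL function (2.2)."""
    def loss(theta):
        omega, alpha = theta
        sigma2 = omega + alpha * x[:-1] ** 2  # sigma_t^2 = omega + alpha X_{t-1}^2
        return 0.5 * np.mean(x[1:] ** 2 / sigma2 + np.log(sigma2))
    res = minimize(loss, x0=np.asarray(theta0), method="L-BFGS-B",
                   bounds=[(1e-6, 10.0), (0.0, 1.0)])
    return res.x
```

The lower bound 10⁻⁶ on ω keeps the loss well defined; on a long simulated path, the minimizer should land near the true parameters.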
Figure 1: Trajectory of the recursive θ̂_n (solid line) and non-recursive θ̃_n (semi-dotted line) for an ARCH(1) process with true parameter vector (dotted line) and initial guess given in (4.1).

Likewise, in Figure 2 we have the trajectories of the QMLE approximations for an ARCH(1) process, but now with the true parameter vector and initial guess given as

θ₀ = (ω₀, α₁)ᵀ with ω₀ numerically small, and θ̂₀ = θ̃₀ ∈ K. (4.2)

Figure 2 indicates a modest convergence of θ̂_n but shows slow convergence of α̃_n towards the true α₁ parameter; in addition, α̃_n seems biased towards the initial value α̃₁^(0) rather than the true α₁.

Figure 2: Trajectory of the recursive θ̂_n (solid line) and non-recursive θ̃_n (semi-dotted line) for an ARCH(1) process with true parameter vector (dotted line) and initial guess given in (4.2).

A way of demonstrating the variation in performance of θ̂_n and θ̃_n for small ω values is presented in Figure 3 and Figure 4, respectively, where we show the average trajectory of one hundred trajectories with their corresponding boxplots displaying the distribution of these one hundred trajectories. Here, in Figure 3, we can see that our recursive algorithm converges to the true parameter values with low sensitivity with respect to the initial values. But one should still expect some likelihood of ending up with an irregular solution where the algorithm fails to converge. However, in Figure 4 we see somewhat the opposite, in that θ̃_n has convergence issues; it consistently underestimates the ω parameter and, for some iterations, also the α₁ parameter.

Figure 3: Average trajectory (solid line) of one hundred recursive θ̂_n's for an ARCH(1) process with true parameter vector (dotted line) and initial guess from (4.2). The boxplots show the distribution of the one hundred trajectories.

Figure 4: Average trajectory (solid line) of one hundred non-recursive θ̃_n's for an ARCH(1) process with true parameter vector (dotted line) and initial guess from (4.2). The boxplots show the distribution of the one hundred trajectories.

As we observe the true volatility process (σ_t²) in this section, we can evaluate the accuracy of our recursive (σ̂_t²) and non-recursive (σ̃_t²) volatility processes.
We do this by use of the Mean Percentage Errors (MPE), given as

σ̂_MPE = n^{−1} Σ_{t=1}^n (σ_t² − σ̂_t²)/σ_t² and σ̃_MPE = n^{−1} Σ_{t=1}^n (σ_t² − σ̃_t²)/σ_t², (4.3)

and the Mean Absolute Percentage Errors (MAPE), given by

σ̂_MAPE = n^{−1} Σ_{t=1}^n |σ_t² − σ̂_t²|/σ_t² and σ̃_MAPE = n^{−1} Σ_{t=1}^n |σ_t² − σ̃_t²|/σ_t². (4.4)

Boxplots of one hundred accuracy scores, MPE in (4.3) and MAPE in (4.4), can be found in Figure 5. To avoid bias due to the choice of the true parameter vector θ₀ and initial values θ̂₀, θ̃₀, the accuracy scores in Figure 5 are computed with a random parameter vector θ₀ ∈ K and random initial guesses θ̂₀, θ̃₀ ∈ K. In the top graph of Figure 5, one can observe that the MPE for both methods is symmetric around zero, but σ̃_MPE has a negative tail (meaning the non-recursive method may overestimate the volatility in some cases). Also, the spread of σ̃_MPE is higher than that of σ̂_MPE, which is clearly seen by looking at σ̃_MAPE in the bottom graph of Figure 5.
Figure 5: Boxplots of accuracy scores, MPE from (4.3) and MAPE from (4.4), for one hundred trajectories of θ̂_n and θ̃_n using an ARCH(1) process with random true parameter vector and initial guess in K.

Another way of measuring the accuracy is to study the conditional quantiles using the recursive (σ̂_t) and non-recursive (σ̃_t) predicted volatility processes (see Biau and Patra (2011)). Under the assumption of standard Gaussian innovations, X_t is Gaussian with zero mean and variance σ_t², conditionally on F_{t−1}. Thus, for any α ∈ (0, 1), the α-quantile of the Gaussian distribution N(0, σ_t²) is σ_t Φ^{−1}(α), where Φ^{−1}(α) is the α-quantile of the standard Gaussian one. We use the so-called α-quantile loss function proposed by Koenker and Bassett (1978): the α-quantile loss function ρ_α using the volatility process σ_t is defined as

ρ_α(X_t, σ_t) = α (X_t − Φ^{−1}(α) σ_t) for X_t > Φ^{−1}(α) σ_t, and (1 − α)(Φ^{−1}(α) σ_t − X_t) for X_t ≤ Φ^{−1}(α) σ_t, (4.5)

with tilting parameter α ∈ (0, 1). The idea of the α-quantile loss function is to penalize quantiles of low probability more for overestimation than for underestimation (and contrariwise in the case of high-probability quantiles). We evaluate across the α-quantile scores ρ_α of (σ_t) by the (normalized) cumulative α-quantile scoring function QS_α:

QS_α(X_n, σ_n) = n^{−1} Σ_{t=1}^n Σ_{m=1}^M ρ_{α_m}(X_t, σ_t), (4.6)

with M as the number of quantile levels α = {α₁, …, α_M}. The best volatility forecasting ability is indicated by the lowest QS_α score. The findings of one hundred QS_α(X_n, σ̂_n) and QS_α(X_n, σ̃_n) scores, over a grid of quantile levels with random true parameter vector and random initialization in K, are presented in Figure 6. The QS_α scores in Figure 6 are indistinguishable.

Figure 6: Boxplots of one hundred QS_α scores using the recursive σ̂_t and non-recursive σ̃_t volatility processes, respectively, over a grid of quantile levels, for an ARCH(1) model with random true parameter vector and initial value in K.

We have similar findings for the GARCH(1, 1) model, presented in Figure 7 for the recursive θ̂_n and Figure 8 for the non-recursive θ̃_n, with true parameter vector and initial guess given by

θ₀ = (ω₀, α₁, β₁)ᵀ with ω₀ numerically small, and θ̂₀ = θ̃₀ ∈ K. (4.7)

That said, the non-recursive θ̃_n consistently overestimates the β₁ parameter. It is worth mentioning that even if all initial values are chosen in the stationary region, i.e. θ̂₀ = θ̃₀ = θ₀, we still have a proper amount of fluctuation in our estimate trajectories. As discussed before, this may partially be due to the volatility that the stochastic gradient descent introduces and the flatness of the QL loss (see Zumbach (2000)).

Figure 7: Average trajectory (solid line) of one hundred θ̂_n's for a GARCH(1, 1) process with true parameter vector (dotted line) and initial guess from (4.7).

The accuracy scores, namely MPE in (4.3) and MAPE in (4.4), can be found in Figure 9 for the GARCH(1, 1) model, using both a random true parameter vector and random initial values in K. As in the ARCH(1) case, we obtain a lower spread for σ̂_MPE than σ̃_MPE. Figure 10 presents the results of one hundred QS_α scores with random true parameter vector and initial value in K. Again, the QS_α scores are indistinguishable (even when the non-recursive method is forward-looking).

We demonstrate our method on real-life observations, showing how our technique works in practice. Table 1 shows an overview of the stock market indices used. All empirical studies are made using the GARCH(1, 1) model, but parameters of higher order may yield a better fit. As the observation period spans a long time, it is unlikely that the log-return series is stationary. To exhibit our method's ability to adapt to time-varying estimates, we begin by considering the S&P500 Index in Section 4.2.1. Thereafter, in Section 4.2.2, we investigate the remaining six stock market indices presented in Table 1.

Stock Market Index | Period (years)
CAC 40 (CAC) | 1991-2020
DAX 30 (DAX) | 1988-2020
Dow Jones Industrial Average (DJIA) | 1986-2020
NASDAQ Composite (NDAQ) | 1971-2020
Nikkei 225 (NKY) | 1965-2020
Russell 2000 (RUT) | 1988-2020
Standard & Poor's 500 (S&P500) | 1950-2020

Table 1: Overview of considered stock market indices including their observation periods. The observations consist of daily log-returns, which are defined as log differences of the closing prices of the index between two consecutive days.

4.2.1. S&P500 Index
We apply our method on the S&P500 Index for the years 1950 to 2020, with initial values θ̂_0 = θ̃_0 given in (4.8). The QML trajectories can be seen in Figure 11: our recursive approximation θ̂_n = (ω̂^(n), α̂_1^(n), β̂_1^(n))^T fluctuates more than the QMLE approximation θ̃_n = (ω̃^(n), α̃_1^(n), β̃_1^(n))^T, which is re-estimated incrementally after every two thousand observations. Remember that the fluctuations we experience in the recursive method can be reduced by lowering the learning rate η. It is remarkable that the QMLE estimate β̃_1^(n) is so low in some of the years between 1990 and 2000, even though it is estimated using over half of the observations.

Figure 8: Average trajectory (solid line) of one hundred θ̃_n's for a GARCH(1,1) model.

In Figure 12, we show the log-returns r_t of the S&P500 Index and the confidence intervals r̄ ± 1.96 σ̂_t and r̄ ± 1.96 σ̃_t using the recursive σ̂_t and non-recursive σ̃_t predicted volatilities, respectively, where r̄ is the mean of the log-returns r_t. The recursive method σ̂_t appears to adapt more rapidly than the non-recursive one σ̃_t to changes in the S&P500 Index observations r_t.

The efficiency of our recursive (σ̂_t) and non-recursive (σ̃_t) volatilities can be appraised using the squared log-returns in the absence of the true (unobserved) variance process.

Figure 9: Boxplots of accuracy scores, MPE from (4.3) and MAPE from (4.4), for one hundred trajectories of θ̂_n and θ̃_n using a GARCH(1,1) process with true parameter vector and random initial guess in K.
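The recursive updates behind θ̂_n can be pictured as projected stochastic gradient descent on the QL loss. The following is a simplified sketch for a GARCH(1,1) model, not the paper's exact AdaVol/VTE routine: it omits the variance-targeting reparametrization and uses a crude box projection in place of the compact set K (all names and the learning-rate choice are illustrative):

```python
import numpy as np

def recursive_qml_garch11(x, theta0, lr=1e-3, eps=1e-8):
    """Projected SGD on the QL loss l_t = log h_t^2 + x_t^2 / h_t^2
    for a GARCH(1,1): h_t^2 = omega + alpha * x_{t-1}^2 + beta * h_{t-1}^2.

    Simplified sketch (no variance targeting); theta = (omega, alpha, beta).
    Returns the trajectory of estimates, one row per update.
    """
    theta = np.asarray(theta0, dtype=float)
    lo = np.array([eps, eps, eps])             # crude box projection standing
    hi = np.array([np.inf, 0.999, 0.999])      # in for the compact set K
    h2 = max(np.var(x), eps)                   # crude variance initialization
    dh2 = np.zeros(3)                          # d h_t^2 / d theta
    path = []
    for x_prev, x_t in zip(x[:-1], x[1:]):
        omega, alpha, beta = theta
        # Recursions for h_t^2 and its gradient (gradient uses h_{t-1}^2).
        dh2 = np.array([1.0, x_prev**2, h2]) + beta * dh2
        h2 = omega + alpha * x_prev**2 + beta * h2
        # Gradient of the QL loss: (dh2 / h_t^2) * (1 - x_t^2 / h_t^2).
        grad = (dh2 / h2) * (1.0 - x_t**2 / h2)
        theta = np.clip(theta - lr * grad, lo, hi)
        path.append(theta.copy())
    return np.array(path)
```

Lowering lr damps the fluctuations discussed above; averaging grad over small mini-batches before each update, as suggested in the Discussion, would have a similar smoothing effect.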
Figure 10: Boxplots of one hundred QS_α scores using the recursive σ̂_t and non-recursive σ̃_t volatility processes, respectively, for a grid of quantile levels α, using the GARCH(1,1) model with random true parameter vector and initial value in K.

In Table 2, we report the Mean Absolute Errors (MAE), defined by

MAE(σ̂) = n^{-1} Σ_{t=1}^n |r_t^2 − σ̂_t^2| and MAE(σ̃) = n^{-1} Σ_{t=1}^n |r_t^2 − σ̃_t^2|, (4.9)

for the same periods used in Figure 12, as well as for the full dataset. The results in Table 2 confirm our conclusions about Figure 12: the recursive method tracks the volatility better than the non-recursive method.

Figure 13 contains the results of one hundred QS_α scores using the recursive (σ̂_t) and non-recursive (σ̃_t) volatility processes, respectively, with random initial values in K. It is remarkable that the recursive method outperforms the non-recursive method although the latter uses future information, i.e. (θ̃_t) for (k−1)·2000 < t ≤ k·2000 is estimated using (r_t) for 1 ≤ t ≤ k·2000.

Figure 11: Trajectory of recursive QML estimates θ̂_n (solid line) and non-recursive θ̃_n (semi-dotted line) using a GARCH(1,1) model on S&P500 Index log-returns from year 1950 to 2020. Both methods use the initial value from (4.8).

Periods (years)   Recursive σ̂   Non-recursive σ̃

Table 2: MAEs (4.9) using the log-returns r_t of the S&P500 Index with the recursive σ̂_t and non-recursive σ̃_t predicted volatilities, respectively. Both methods use the initial value given in (4.8).

4.2.2. Remaining stock market indices

The results of one hundred QS_α scores using the recursive (σ̂_t) and non-recursive (σ̃_t) volatility processes, respectively, with random initial values in K are presented in Figure 14 for the remaining six stock market indices in Table 1 (i.e. the CAC, DAX, DJIA, NDAQ, NKY and RUT indices). As for the S&P500 Index (see Figure 13), these findings indicate that the recursive approach estimates the quantiles better than the non-recursive method, both on average and with a lower spread.

Figure 12: Log-returns r_t of the S&P500 Index (solid lines) and confidence intervals r̄ ± 1.96 σ̂_t and r̄ ± 1.96 σ̃_t (dotted lines) using the recursive σ̂_t (blue) and non-recursive σ̃_t (red) predicted volatilities, respectively, where r̄ is the mean of the log-returns r_t. From top to bottom, the periods are: 1950 to 1952, 1985 to 1987, and 2018 to 2020.

Figure 13: Boxplots of one hundred QS_α scores using the recursive σ̂_t and non-recursive σ̃_t volatility processes, respectively, for a grid of quantile levels α, using the GARCH(1,1) model on the log-returns r_t of the S&P500 Index with random initial value in K.

The assumption of an underlying data-generating process with constant "true" parameters may not hold in real-life examples. Thus, our recursive method seems to have an advantage over the non-recursive method, as it estimates the parameters step by step, whereas the non-recursive method always has to estimate the parameters using all observations over a large period of time.
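The two evaluation criteria used above can be sketched as follows. The MAE of (4.9) compares predicted variances with squared log-returns; for QS_α we use the standard pinball (quantile) score with Gaussian quantile forecasts q_t = r̄ + z_α σ_t, mirroring the confidence-band construction of Figure 12 — the paper's exact QS_α definition is given earlier in Section 4 and may differ in detail:

```python
import numpy as np
from statistics import NormalDist

def mae_vol(r, sigma2):
    """MAE of (4.9): mean absolute deviation between squared log-returns
    and predicted variances (a proxy for the unobserved true variance)."""
    r, s2 = np.asarray(r, float), np.asarray(sigma2, float)
    return float(np.mean(np.abs(r**2 - s2)))

def quantile_score(r, sigma, alpha):
    """Pinball score of the quantile forecasts q_t = mean(r) + z_alpha * sigma_t;
    lower is better. Assumed forecast form, not the paper's exact definition."""
    r, s = np.asarray(r, float), np.asarray(sigma, float)
    q = r.mean() + NormalDist().inv_cdf(alpha) * s
    return float(np.mean((alpha - (r < q)) * (r - q)))
```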
5. Discussion
We proved asymptotic local convexity of the QL function in a general conditionally heteroscedastic time series model of multiplicative form. An interesting question arises: can one prove Theorem 2.1 for a bounded number N of observations? Expressed differently, can one find a bounded N such that we have convergence/convexity of recursive algorithms, e.g. for the GARCH, EGARCH and AGARCH models? To our knowledge, this has not been proved yet.

We proposed an adaptive approach to recursively estimate the parameters of GARCH models in an online setting with use of the VTE technique (see Algorithm 1), obtaining a more stable, reliable and adaptive method. The stability of our recursive approach to the QML problem could be improved further by a mini-batch approach, which would lower the volatility of each incremental update, since more observations are used per iteration to update the QML estimate. Furthermore, applying a mini-batch method does not require much more computational power than stochastic gradient descent, and by using more observations per update we could gain consistency and smoothness in the convergence of the estimation procedure. The choice of mini-batch size is left to future research.

Appendix A.
Proof of Theorem 2.1.
To prove local strong convexity of the approximate QL function L̂_n using the approximate QMLE θ̂*_n, we first list some bounds for the Hessians. Under the regularity conditions on the derivatives of h_t, by use of (2.3) we can write

∇l_t(θ) = (∇h_t^2(θ) / h_t^2(θ)) (1 − X_t^2 / h_t^2(θ)),

∇^2 l_t(θ) = h_t^{-4}(θ) ( ∇h_t^2(θ)^T ∇h_t^2(θ) (2 X_t^2 / h_t^2(θ) − 1) + ∇^2 h_t^2(θ) (h_t^2(θ) − X_t^2) ),

where the Hessian H_n(θ) is defined as n^{-1} ∇^2 L_n(θ) = n^{-1} Σ_{t=1}^n ∇^2 l_t(θ). Similarly, for ∇l̂_t(θ), ∇^2 l̂_t(θ) and Ĥ_n(θ), we replace h_t^2(θ), ∇h_t^2(θ) and ∇^2 h_t^2(θ) by ĥ_t^2(θ), ∇ĥ_t^2(θ) and ∇^2 ĥ_t^2(θ), respectively. From Assumption W2, we know n^{-1} ‖∇^2 L̂_n − ∇^2 L_n‖_K → 0 a.s. as n → ∞. Hence, for any ε > 0 there exists some random N_1 large enough such that n^{-1} ‖∇^2 L̂_n − ∇^2 L_n‖_K < ε for all n ≥ N_1 a.s. As a consequence, we get

‖Ĥ_n − H_n‖_K < ε, a.s., (A.1)

for all n ≥ N_1. Similarly, applying the ergodic theorem to the (uniformly over K) integrable sequence (∇^2 l_t) of continuous functions over the compact set K, we obtain ‖n^{-1} Σ_{t=1}^n ∇^2 l_t − E[∇^2 l_0]‖_K → 0 a.s. as n → ∞. Then there exists N_2 such that

‖H_n − H‖_K < ε, a.s., (A.2)

for all n ≥ N_2. Thus, by equations (A.1) and (A.2), there exists N = max(N_1, N_2) such that for all n ≥ N we have

‖Ĥ_n − H‖_K ≤ ‖Ĥ_n − H_n‖_K + ‖H_n − H‖_K < 2ε, a.s.

In particular, as ‖Ĥ_n − H‖_K is defined as sup_{θ∈K} ‖Ĥ_n(θ) − H(θ)‖_op, we have

‖Ĥ_n(θ) − H(θ)‖_op < 2ε, (A.3)

for all θ ∈ K.

Figure 14: Boxplots of one hundred QS_α scores using the recursive σ̂_t and non-recursive σ̃_t volatility processes, respectively, for a grid of quantile levels α, using the GARCH(1,1) model on the log-returns r_t of the CAC (top left), DAX (top right), DJIA (mid left), NDAQ (mid right), NKY (bottom left) and RUT (bottom right) indices with random initial value in K.

From (Straumann and Mikosch, 2006, Lemma 7.2), the asymptotic Hessian H(θ_0) = E[∇^2 l_0(θ_0)] is a symmetric positive definite matrix a.s. under Assumption W3. As H(θ) is the limit of the continuous matrix-valued functions H_n(θ), it is itself a continuous matrix-valued function. Thus the eigenvalue functions λ_i(θ), 1 ≤ i ≤ d, of H(θ) are also continuous. The eigenvalues λ_i(θ_0) are positive real numbers, with the smallest one denoted by λ_0^min = min_{1≤i≤d} λ_i(θ_0) > 0, satisfying g^T H(θ_0) g ≥ λ_0^min g^T g for all g ∈ R^d \ {0}. To shorten the notation, we write with no ambiguity H(θ_0) ⪰ λ_0^min I_d, where I_d denotes the d-dimensional identity matrix. Note that by continuity the smallest eigenvalue is also positive on a neighborhood B(θ_0, δ), so that

H(θ) ⪰ (λ_0^min − ε) I_d, θ ∈ B(θ_0, δ).

Hence, for θ ∈ B(θ_0, δ) and g ∈ R^d \ {0}, we have

g^T Ĥ_n(θ) g / (g^T g) = g^T H(θ) g / (g^T g) + g^T (Ĥ_n(θ) − H(θ)) g / (g^T g)
  ≥ λ_0^min − ε − ‖Ĥ_n(θ) − H(θ)‖_op
  > λ_0^min − 3ε
  > C, a.s.,

with use of (A.3), for all n ≥ N, by taking 0 < ε < λ_0^min / 3 and letting C = λ_0^min − 3ε > 0. Then we have the desired inequality (2.5).

Proof of Corollary 2.1.
The uniqueness of the QMLE θ̂*_n follows from a Pfanzagl argument (see Pfanzagl (1969)). By Theorem 2.1, we know there exists N such that

inf_{θ ∈ B(θ_0, δ)} g^T Ĥ_n(θ) g > C g^T g, a.s.,

for all n ≥ N, where B(θ_0, δ) denotes the open ball around θ_0 with radius δ > 0. For each element θ_i ∈ K we take an open ball B(θ_i, δ_i) with δ_i > 0 such that the union of the B(θ_i, δ_i) contains θ_0 only once, i.e. θ_0 ∉ B(θ_i, δ_i) for θ_i ≠ θ_0. As K is compact and contained in the union of all the B(θ_i, δ_i), there is a finite covering of K, i.e. K ⊆ ∪_{i=1}^k B(θ_i, δ_i). Let K' = K \ B(θ_0, δ). As K' is compact, the minimum of the continuous QL function E[l_0] over K' exists. Moreover, as E[l_0] has a unique minimum at θ_0 under Assumption W1, we get

inf_{θ ∈ K'} E[l_0(θ)] > E[l_0(θ_0)], a.s.

From Assumption W2, we know that ‖n^{-1} L̂_n − L‖_{K'} → 0 a.s. as n → ∞. Hence, we have

inf_{θ ∈ K'} n^{-1} L̂_n(θ) → inf_{θ ∈ K'} L(θ), a.s.,

where inf_{θ ∈ K'} L(θ) > E[l_0(θ_0)]. Thus, the ball B(θ_0, δ) gives us a unique global minimum of the QL function L̂_n, i.e.

inf_{θ ∈ K} n^{-1} L̂_n(θ) ≥ E[l_0(θ_0)], a.s.,

where equality is only attained for θ = θ_0.

References