The Hyvärinen scoring rule in Gaussian linear time series models
Silvia Columbu^a, Valentina Mameli^b, Monica Musio^a and A. Philip Dawid^c

^a Dep. of Mathematics and Computer Science, Univ. of Cagliari, Italy
^b Dep. of Environmental Sciences, Informatics and Statistics, Ca' Foscari Univ. of Venice, Italy
^c Dep. of Pure Mathematics and Mathematical Statistics, Univ. of Cambridge, UK

arXiv preprint [stat.ME]
Abstract
Likelihood-based estimation methods involve the normalising constant of the model distributions, expressed as a function of the parameter. In many problems, however, this function is not easily available, and then less efficient but more easily computed estimators may be attractive. In this work we study stationary time-series models, and construct and analyse "score-matching" estimators that do not involve the normalising constant. We consider two scenarios: a single series of increasing length, and an increasing number of independent series of fixed length. In the latter case there are two variants, one based on the full data, and another based on a sufficient statistic. We study the empirical performance of these estimators in three special cases, autoregressive (AR), moving average (MA) and fractionally differenced white noise (ARFIMA) models, and make comparisons with full and pairwise likelihood estimators. The results are somewhat model-dependent, with the new estimators doing well for MA and ARFIMA models, but less so for AR models.
Keywords: Scoring rule estimators, Hyvärinen scoring rule, Gaussian linear time series.
1 Introduction

The evaluation of the exact full likelihood may be difficult or even impossible in situations where complex problems deal with large correlated datasets. Such problems arise, for instance, in spatial statistics and time series frameworks, in which direct computation of the normalising constant can be a very challenging task, entailing multidimensional integration of the full joint density for each value of the parameter. Many approaches have been proposed to tackle this issue. In this paper, we investigate an appealing approach based on the score matching estimator proposed by Hyvärinen in 2005 [23].

The score matching estimator can be regarded as a specific case of estimators derived from proper scoring rules (see Dawid and Musio (2014) [13]), which are loss functions for measuring the quality of probabilistic forecasts. In particular, this estimator derives from the Hyvärinen scoring rule, which is a homogeneous proper scoring rule (see Ehm and Gneiting (2012) [17] and Parry et al. (2012) [33]), namely a proper scoring rule which does not require knowledge of the normalising constant. Homogeneous scoring rules have been characterised for continuous real variables (Parry et al. (2012) [33]) and for discrete variables (Dawid et al. (2012) [12]). In a Bayesian framework, Dawid and Musio (2015) [15] have shown, for the case of continuous variables, how to handle Bayesian model selection with improper within-model prior distributions by exploiting homogeneous proper scoring rules. The discrete counterpart has been empirically studied by Dawid et al. (2017) [16]. In a recent contribution, Shao et al. (2018) [35] consider the use of the Hyvärinen score for model comparison. Although the majority of contributions involving the Hyvärinen scoring rule focus on Euclidean spaces, scholars have also investigated extensions to non-Euclidean spaces: for an early study see Dawid (2007) [11]. Recently, Mardia et al.
(2016) [29] proposed an extension of the Hyvärinen scoring rule to compact oriented Riemannian manifolds, and Takasu et al. (2018) [36] constructed a novel class of homogeneous strictly proper scoring rules for statistical models on spheres.

Given the growing interest in the use of this scoring rule for very complex statistical models, in this paper we aim to derive an estimation method based on the Hyvärinen scoring rule for estimating linear Gaussian time series models. We distinguish two separate cases: a first in which the length of a single time series increases to infinity, and a second in which the length of the time series is fixed and the number of series increases to infinity. The consistency and asymptotic distribution of the Hyvärinen estimator are derived for the case of a single time series of increasing length. In particular, under some mild regularity conditions we derive consistency of the proposed estimator for linear Gaussian time series models, while the asymptotic distribution is proved in the specific case of autoregressive moving average (ARMA) causal invertible models. For time series with fixed length and the number of time series increasing to infinity, the performances of two estimators based on the Hyvärinen scoring rule, namely the total Hyvärinen estimator and the matrix Hyvärinen estimator, are compared through simulation studies with the full maximum likelihood and the pairwise maximum likelihood estimators. Three simple time series models are considered in the design of the simulations: autoregressive (AR), moving average (MA) and fractionally differenced white noise (ARFIMA). Different behaviours can be detected for the total Hyvärinen estimator among the settings examined. In particular, it outperforms the pairwise likelihood estimator in terms of efficiency with respect to the full maximum likelihood estimator for the MA and ARFIMA processes.

The paper unfolds as follows. Section 2 introduces basic notions on scoring rules.
In Section 3 we introduce the Hyvärinen scoring rule for Gaussian linear time series, and give some asymptotic results for the Hyvärinen estimator. In the specific case of n independent series we describe the total Hyvärinen estimator and the matrix Hyvärinen estimator. Section 4 summarises the results of the simulation studies. Section 5 presents some concluding remarks. Technical details are postponed to the Appendices.

2 Proper scoring rules

A scoring rule is a loss function designed to measure the quality of a proposed probability distribution Q, for a random variable X, in the light of the outcome x of X. Specifically, if a forecaster quotes a predictive distribution Q for X and the event X = x realises, then the forecaster's loss will be S(x, Q). The expected value of S(X, Q) when X has distribution P is denoted by S(P, Q). The scoring rule S is proper (relative to the class of distributions P) if

S(P, Q) ≥ S(P, P), for all P, Q ∈ P.    (1)

It is strictly proper if equality obtains only when Q = P. Any proper scoring rule gives rise to a general method for parameter estimation, based on an unbiased estimating equation, as described below.

Some important proper scoring rules are the log-score, S(x, Q) = −log q(x) (Good (1952) [20]), where q(·) is the density function of Q, which recovers the full (negative log) likelihood, and the Brier score (Brier (1950) [3]). A particularly interesting case, which avoids the need to compute the normalising constant, produces the score matching estimation method of Hyvärinen (2005) [23], based on the following proper scoring rule:

S(x, Q) = Δ_x ln q(x) + (1/2) ‖∇_x ln q(x)‖²,    (2)

where X ranges over the whole of ℝ^p supplied with the Euclidean norm ‖·‖, q(·) is assumed twice continuously differentiable, and x is the realised value of X. In (2), ∇_x denotes the gradient operator, and Δ_x the Laplacian operator, with respect to x. For p = 1 we can express

S(x, Q) = q″(x)/q(x) − (1/2) {q′(x)/q(x)}².    (3)

The scoring rule (2) is a homogeneous proper scoring rule (see Parry et al. (2012) [33]). It is homogeneous in the density function q(·), i.e. its value is unaffected by applying a positive scale factor to the density q, and so it can be computed even if we only know the density function up to a scale factor. Inference performed using any homogeneous scoring rule does not require knowledge of the normalising constant of the distribution.

Let (x₁, . . . , x_n) be independent realisations of a random variable X, having distribution P_θ depending on an unknown parameter θ ∈ Θ, where Θ is an open subset of ℝ^m. Given a proper scoring rule S, let S(x, θ) denote S(x, P_θ). Inference for the parameter θ may be performed by minimising the total empirical score,

S(θ) = Σ_{p=1}^n S(x_p, θ),    (4)

resulting in the minimum score estimator, θ̂_S = arg min_θ S(θ). Under broad regularity conditions on the model (see e.g. Barndorff-Nielsen & Cox (1994) [2]), θ̂_S satisfies the score equation:

s(θ) := Σ_{p=1}^n s(x_p, θ) = 0,    (5)

where s(x, θ) := ∇_θ S(x, θ), the gradient vector of S(x, θ) with respect to θ. The score equation is an unbiased estimating equation (Dawid & Lauritzen (2005) [10]). When S is the log-score, the minimum score estimator becomes the maximum likelihood estimator.

From the general theory of unbiased estimating functions, under broad regularity conditions on the model the minimum score estimate θ̂_S is asymptotically consistent and normally distributed: θ̂_S ∼ N(θ, {nG(θ)}⁻¹), where G(θ) denotes the Godambe information matrix

G(θ) := K(θ) J(θ)⁻¹ K(θ),

where J(θ) = E{s(X, θ) s(X, θ)^⊤} is the variability matrix, and K(θ) = E{∇_θ s(X, θ)^⊤} is the sensitivity matrix. In contrast to the case of the full likelihood, J and K are different in general: see Dawid & Musio (2014) [13], Dawid et al. (2016) [14].
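The normalising-constant-free property can be illustrated numerically. The sketch below (ours, not from the paper; Python with NumPy assumed, function names hypothetical) evaluates the one-dimensional score (3), in its equivalent form (ln q)″(x) + ½{(ln q)′(x)}², by finite differences of the log-density, and checks that multiplying q by an arbitrary positive constant leaves the score unchanged; for a N(μ, σ²) density the score has the closed form −1/σ² + (x − μ)²/(2σ⁴).

```python
import numpy as np

def hyvarinen_score_1d(x, log_q, eps=1e-4):
    # S(x, Q) = (ln q)''(x) + 0.5 * ((ln q)'(x))**2, computed by central
    # finite differences; this equals q''/q - 0.5*(q'/q)**2, i.e. (3).
    # Only derivatives of ln q enter, so any multiplicative constant in q
    # (in particular the normalising constant) drops out.
    d1 = (log_q(x + eps) - log_q(x - eps)) / (2 * eps)
    d2 = (log_q(x + eps) - 2 * log_q(x) + log_q(x - eps)) / eps**2
    return d2 + 0.5 * d1**2

mu, sigma2 = 1.0, 2.0
# Gaussian log-density known only up to an arbitrary constant log(7.3):
log_q_unnorm = lambda x: -(x - mu) ** 2 / (2 * sigma2) + np.log(7.3)
# Fully normalised version of the same log-density:
log_q_norm = lambda x: -(x - mu) ** 2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2)

x0 = 0.4
s_unnorm = hyvarinen_score_1d(x0, log_q_unnorm)
s_norm = hyvarinen_score_1d(x0, log_q_norm)
analytic = -1.0 / sigma2 + (x0 - mu) ** 2 / (2 * sigma2 ** 2)
```

The two numerical scores agree with each other and with the analytic value, whereas the log-score itself would of course differ between the two densities.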
We point out that estimation of the matrix J(θ), and (to a somewhat lesser extent) of the matrix K(θ), is not an easy task: see Varin (2008) [37], Varin et al. (2011) [38] and Cattelan and Sartori (2016) [5].

3 The Hyvärinen scoring rule for Gaussian linear time series

In this section we introduce some results based on the use of the Hyvärinen scoring rule in the setting of Gaussian linear time series models. Consider the Gaussian linear time series model

y_t = μ + Σ_{j=0}^∞ ψ_j z_{t−j}, t = 1, 2, . . . ,    (6)

parametrised by μ, σ² and λ ∈ ℝ^{m−2}, where for j ≥ 0, ψ_j = ψ_j(λ) satisfies ψ₀ = 1 and Σ_{t=0}^∞ ψ_t² < ∞. The (z_t) are iid Gaussian variables with mean 0 and variance σ². Let θ = (μ, σ², λ) be the vector of model parameters. The autocovariance function is E{(y_{t+j} − μ)(y_t − μ)} = σ² Σ_{t=0}^∞ ψ_t ψ_{t+j} = σ² γ_λ(j), say. Using basic differentiation rules, it is easy to find the Hyvärinen score based on the single time series Y_T = (y₁, y₂, . . . , y_T):

H(θ, Y_T) = −(1/σ²) Σ_{i=1}^T Γ^{ii} + (1/2) Σ_{i=1}^T { Σ_{t=1}^T (1/σ²) Γ^{it} (y_t − μ) }²,    (7)

where the matrix Γ has (i, j) entry Γ_ij = γ_λ(|i − j|) and Γ^{ij} is the (i, j) entry of Γ⁻¹. We will denote the Hyvärinen estimator based on a single series by θ̂_H.

3.1 Asymptotic results for a single time series

In this subsection we analyse the asymptotic statistical properties of the Hyvärinen scoring rule estimator, based on (7), for a single time series. The following theorem shows the consistency of the estimator θ̂_H in the Gaussian linear time series setting. The proof of the theorem is deferred to the Appendix, and follows arguments similar to those used by Davis and Yau (2011) [8] to prove the consistency of the pairwise likelihood estimator.

Theorem 1.
Suppose (y_t) is the linear process in (6) with μ = 0 and true parameter θ₀ = (σ₀², λ₀). Let θ̂_H = argmin_θ H(θ, Y_T) be the minimum score estimator, where θ = (σ², λ) and λ ∈ Λ, with Λ a compact set. If the identifiability condition

σ² γ_λ(j) = σ₀² γ_{λ₀}(j) for j = 0, 1, . . . , k if and only if (σ², λ) = (σ₀², λ₀)    (8)

is satisfied, then θ̂_H → θ₀ almost surely as T → ∞.

Once consistency has been proved, we focus on the asymptotic distribution of θ̂_H. Its analytic form involves the elements Γ^{ij} of the inverse of the autocovariance matrix. In order to guarantee their absolute summability, we restrict our attention to the case of ARMA causal invertible processes. Defining b_ij = Γ^{ij}/σ², the gradient and the Hessian of H(θ, Y_T) with respect to θ are given, respectively, by the following two expressions:

J(θ) = ∂H(θ, Y_T)/∂θ = −Σ_{i=1}^T ∂b_ii/∂θ + Σ_{i,j,t=1}^T (∂b_ij/∂θ) b_it y_j y_t,    (9)

K(θ) = ∂²H(θ, Y_T)/∂θ∂θ^⊤ = −Σ_{i=1}^T ∂²b_ii/∂θ∂θ^⊤ + Σ_{i,j,t=1}^T (∂b_ij/∂θ)(∂b_it/∂θ)^⊤ y_j y_t + Σ_{i,j,t=1}^T (∂²b_ij/∂θ∂θ^⊤) b_it y_j y_t,    (10)

where ∂/∂θ denotes differentiation with respect to the components of the vector θ.

Theorem 2.
Suppose that (y_t) is an ARMA(p, q) causal and invertible process. Furthermore, assume that the Hessian matrix K(θ) is invertible in a neighbourhood of θ₀. If the identifiability condition (8) holds, then

√T (θ̂_T − θ₀) →_D N_{m−1}( 0, K(θ₀)⁻¹ V {K(θ₀)⁻¹}^⊤ ),

where V = Σ_{r=−∞}^∞ Σ_{k=−∞}^∞ V(r, k, θ₀) with

V(r, k, θ₀) = { ∂/∂θ (γ⁻¹(0)/σ₀²) }² + ∂/∂θ (γ⁻¹(k)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(k + r)/σ₀²) · γ(r),

γ(·) denoting the autocovariance function of the process and γ⁻¹(·) its inverse autocovariance function.

Theorem 2 states that θ̂_H, in the case that (y_t) is an ARMA causal invertible process, is asymptotically normally distributed with rate of convergence √T. As is well known, the autocovariance function of an ARMA process decays exponentially; an ARMA process is therefore a short memory process, and its autocovariance function is absolutely summable (Brockwell & Davis (1991) [4]). This property, together with the duality of ARMA models under causality and invertibility, allows us to prove asymptotic normality. For the complete proof refer to the Appendix.

3.2 n independent time series

In the remainder of this section we discuss the case of n independent series of length T. We assume that T is fixed while n increases to infinity. We also specialise to the case in which the common mean μ and innovation variance σ² = σ₀² are known; without loss of generality we take μ = 0. Consider now n independent and identically distributed processes Y₁, . . . , Y_n, where Y_p = (y_{p1}, . . . , y_{pT}), each having the T-variate normal distribution with mean vector 0 and variance-covariance matrix σ²Γ, with unknown parameter λ. Let the (n × T) random matrix Y have the vector Y_p as its p-th row. We define the total Hyvärinen score (HT) as the sum of the n single Hyvärinen scores in (7):

HT(λ) = Σ_{p=1}^n H_p(λ, Y_p),    (11)

where

H_p(λ, Y_p) = −(1/σ²) Σ_{i=1}^T Γ^{ii} + (1/2) Σ_{i=1}^T { Σ_{t=1}^T (1/σ²) Γ^{it} y_{pt} }².    (12)

The estimate of λ minimising the total Hyvärinen score will be denoted by λ̂_HT.

An alternative approach is to consider as basic observable the sum-of-squares-and-products matrix SSP = Y^⊤Y, which is a sufficient statistic for the multivariate normal model, having the Wishart distribution with n degrees of freedom and scale matrix σ²Γ. Inference for the parameter λ can then be performed by resorting to the Hyvärinen score based directly on the Wishart model. We will call this scoring rule the matrix Hyvärinen score. Assuming n ≥ T, so that the joint distribution of the upper triangle (s_ij : 1 ≤ i ≤ j ≤ T) of the sum-of-squares-and-products random matrix SSP (which has a Wishart distribution with parameters n and σ²Γ) has a density, and taking into consideration the properties of the derivatives of traces and determinants, it can be shown that the Hyvärinen score based on this joint density is

HW(SSP, Γ) = −(n − T − 1) Σ_{i=1}^T (s^{ii})² + (1/2) Σ_{i,j=1}^T { (n − T − 1) s^{ij} − (1/σ²) Γ^{ij} }²,    (13)

where s^{ij} and Γ^{ij} are the elements of the inverse matrices SSP⁻¹ and Γ⁻¹, respectively. The matrix Hyvärinen estimator for λ, minimising HW(SSP, Γ) with respect to λ, will be denoted by λ̂_HW. The derivative of HW(SSP, Γ) with respect to λ is

HW_λ(SSP, Γ) = −(1/σ²) Σ_{i,j=1}^T { (n − T − 1) s^{ij} − (1/σ²) Γ^{ij} } ∂Γ^{ij}/∂λ,    (14)

and E{HW_λ(SSP, Γ)} = 0, since E(s^{ij}) = Γ^{ij}/{σ²(n − T − 1)}. The sensitivity function is K(λ) = E{HW_λλ(SSP, Γ)} = Σ_{i,j=1}^T (∂Γ^{ij}/∂λ)² / σ⁴.
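To make the total Hyvärinen estimator concrete, the following sketch (ours, illustrative only; Python with NumPy assumed, function names hypothetical) minimises HT(λ) of (11)-(12) by grid search in the AR(1) case treated in Section 4, with μ = 0 and σ² = 1 known, using the standard tridiagonal closed form of the AR(1) inverse covariance matrix in place of a numerical inversion:

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_precision(phi, T):
    # Inverse of the AR(1) covariance Gamma with Gamma_lm = phi**|l-m|/(1-phi**2)
    # (unit innovation variance): tridiagonal, a standard AR(1) result.
    Q = np.full(T, 1.0 + phi ** 2)
    Q[0] = Q[-1] = 1.0
    P = np.diag(Q)
    idx = np.arange(T - 1)
    P[idx, idx + 1] = P[idx + 1, idx] = -phi
    return P

def total_hyvarinen_score(Y, phi, sigma2=1.0):
    # HT(lambda) = sum_p [ -(1/s2) tr(Gamma^{-1}) + 0.5 || Gamma^{-1} y_p / s2 ||^2 ],
    # cf. (11)-(12) with mu = 0.
    n, T = Y.shape
    Q = ar1_precision(phi, T)
    grad = Y @ Q / sigma2          # row p holds Gamma^{-1} y_p / sigma2 (Q symmetric)
    return -n * np.trace(Q) / sigma2 + 0.5 * np.sum(grad ** 2)

# Simulate n independent stationary AR(1) series of length T.
n, T, phi_true = 200, 50, 0.5
Y = np.empty((n, T))
Y[:, 0] = rng.normal(size=n) / np.sqrt(1 - phi_true ** 2)
for t in range(1, T):
    Y[:, t] = phi_true * Y[:, t - 1] + rng.normal(size=n)

grid = np.linspace(-0.95, 0.95, 191)
phi_hat = grid[np.argmin([total_hyvarinen_score(Y, p) for p in grid])]
```

Since the scoring rule is proper and the model is correctly specified, the minimiser lands close to the true φ; an off-the-shelf optimiser could replace the grid search when λ is multidimensional.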
The derivation of the variability function J(λ), which after taking the expectation of the square of (14) reduces to

J(λ) = {(n − T − 1)²/σ⁴} Σ_{i,j,k,l=1}^T (∂Γ^{ij}/∂λ)(∂Γ^{kl}/∂λ) cov(s^{ij}, s^{kl}),    (15)

involves calculations requiring the covariance matrix of the random matrix SSP⁻¹, which has an Inverse Wishart distribution with scale matrix σ²Γ⁻¹: see Von Rosen (1998) [39] for details on the components of this covariance matrix. In general, the Godambe information needed to estimate the standard error of λ̂_HW is not easy to derive analytically, due to the form of the matrix Γ. It should also be pointed out that this approach cannot be used if we have a single time series of length T with T increasing to ∞, since for non-singularity of the Wishart distribution we need to assume n ≥ T.

4 Simulation studies

In this section we report simulation studies designed to assess and compare the behaviours of the total and the matrix Hyvärinen estimators. We refer to the case described in Section 3.2, in which T is fixed and n increases to ∞. For comparison, we also consider the full and pairwise maximum likelihood estimators (Davis & Yau (2011) [8]). We discuss three examples: the first order autoregressive AR(1), the first order moving average MA(1), and the fractionally differenced white noise ARFIMA(0, d, 0) [34]. The summary statistics shown are: average estimates of the parameters, asymptotic standard deviations (sd), and the asymptotic relative efficiency (ARE) with respect to the maximum likelihood estimator.

The stationary univariate autoregressive process of order 1, denoted by AR(1), is defined by

y₁ = μ + z₁/√(1 − φ²),
y_t = μ + φ(y_{t−1} − μ) + z_t, (t = 2, . . . , T),

where (z_t) is a Gaussian white noise process with mean 0 and variance σ². Here φ, with |φ| <
1, is the autoregressive parameter. Then y₁, . . . , y_T are jointly normal with mean vector μ1_T (where 1_T is the T-dimensional unit vector) and covariance matrix σ²Γ, with Γ having components Γ_lm = φ^{|l−m|}/(1 − φ²) (l, m = 1, . . . , T).

For comparison purposes we consider also the numerical performance of a class of pairwise likelihood estimators. Since, in the time series considered, dependence decreases in time, as in Davis & Yau (2011) [8] we restrict to the first order consecutive pairwise likelihood, rather than the complete pairwise likelihood, so that only adjacent observations, which are more closely related than the others, are paired. This choice is motivated also by the loss in efficiency incurred in using the k-th order consecutive pairwise likelihood as k increases (see Davis and Yau (2011) [8]; Joe and Lee (2009) [27]). Note that, when it is known that μ = 0 but σ² is unknown, the pairwise likelihood estimator of φ is

φ̂_PL = 2 Σ_{t=2}^T y_t y_{t−1} / Σ_{t=2}^T (y_t² + y_{t−1}²),

which is also the Yule-Walker estimator (Davis & Yau (2011) [8]).

Simulation 1
The values of the model parameters are μ = 0 and σ² = 1, with the autoregressive parameter φ ranging over a grid of values in (−1, 1). In the simulation study, 1000 replicates are generated of n = 200 processes of length T = 50. Results are summarised in Table 1. The numerical results in Table 1 and in the left-hand panel of Figure 1 suggest that φ̂_HT and φ̂_HW do not have high efficiency as φ approaches 1: in particular, the asymptotic efficiency of φ̂_HW tends to 0 for large values of |φ|. In contrast, under the same model setting, there is only a modest loss of efficiency for the pairwise likelihood-based estimator φ̂_PL.

The univariate moving average process of order 1, denoted by MA(1), is defined by

y_t = μ + α z_{t−1} + z_t, (t = 1, . . . , T),    (16)

where |α| < 1 and z₀, z₁, . . . , z_T are independent Gaussian random variables with mean 0 and variance σ². Then the random variables y₁, . . . , y_T are jointly normal, each having mean μ and variance σ²(1 + α²). The variables y_t and y_{t+k} are independent for |k| >
1, while y_t and y_{t+1} have covariance σ²α (t = 1, . . . , T − 1). The covariance matrix σ²Γ of Y = (y₁, y₂, . . . , y_T) has components σ²Γ_ss = σ²(1 + α²), σ²Γ_st = σ²α if |s − t| = 1, and σ²Γ_st = 0 otherwise.

As before we consider the first order consecutive pairwise likelihood, since the use of a higher order consecutive pairwise likelihood is unrealistic due to the independence of y_t and y_{t+k} for k ≥
2. For t = 1, . . . , T −
1, the pair (y_t, y_{t+1}) has a bivariate Gaussian distribution, in which the two components both have mean μ and variance σ²(1 + α²), and have covariance σ²α.

Simulation 2
The values of the model parameters are μ = 0 and σ² = 1, with the moving average parameter α ranging over a grid of values in (−1, 1). In the simulation study, 1000 replicates are generated of n = 200 processes of length T = 50. Results are summarised in Table 2. The simulation shows that the total Hyvärinen estimator α̂_HT achieves the same efficiency as the MLE in the MA(1) model for values of the moving average parameter near 0; see Table 2 and the middle panel of Figure 1. Moreover, the loss in efficiency of the total Hyvärinen estimator α̂_HT is modest even when the absolute value of the moving average parameter approaches 1. In contrast, the pairwise likelihood estimator α̂_PL shows very poor performance in terms of asymptotic relative efficiency: its ARE decreases steadily from 1 as |α| increases.

The fractionally differenced white noise process, ARFIMA(0, d, 0), is defined by

(1 − B)^d y_t = z_t, t = 1, . . . , T,

where B denotes the backshift operator, d ∈ (0, 0.5), and z₁, . . . , z_T are independent Gaussian random variables with mean 0 and variance σ². Then the random variables y₁, . . . , y_T are jointly normal, with covariance matrix σ²Γ whose components (see Hosking (1981) [22]) are

σ²Γ_ij = (−1)^{|k|} σ² Γ(1 − 2d) / {Γ(|k| − d + 1) Γ(−|k| − d + 1)}, k = i − j,    (17)

where on the right-hand side of (17) Γ denotes the gamma function.

As before we consider the first order consecutive pairwise likelihood, since no great improvement can be detected by using a higher order consecutive pairwise likelihood: see the results of Davis and Yau (2011) [8]. For t = 1, . . . , T −
1, the pair (y_t, y_{t+1}) has a bivariate Gaussian distribution, in which the two components both have mean μ and variance σ²Γ(1 − 2d)/Γ(1 − d)², and have covariance −σ²Γ(1 − 2d)/{Γ(2 − d)Γ(−d)}.

Simulation 3
The values of the model parameters are μ = 0 and σ² = 1, with the fractional parameter d ranging over a grid of values in (0, 0.5). In the simulation study, 1000 replicates are generated of n = 100 processes of length T = 50. Results are summarised in Table 3. Simulation 3 shows that the total Hyvärinen estimator d̂_HT achieves the same efficiency as the MLE in the ARFIMA(0, d, 0) model for d near 0 and near 0.3; see Table 3 and the right-hand panel of Figure 1. The loss in efficiency of the total Hyvärinen estimator d̂_HT is very slight when d ∈ (0, 0.3). The efficiency of d̂_HW is poor, with ARE values ranging from 0 to 0.45, and for all the estimators considered the ARE deteriorates when d ∈ (0.3, 0.5). The pairwise likelihood estimator d̂_PL performs better than d̂_HW; however, its values of ARE range from 0.6 to 0.96, reaching a maximum when d = 0.1, with a major loss of efficiency with respect to the total Hyvärinen estimator.
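The behaviour of the estimators over d reflects the slowly decaying autocovariance (17). The small check below (ours, not from the paper; Python standard library only, function name hypothetical) evaluates (17) and confirms the standard ARFIMA(0, d, 0) lag-one autocorrelation d/(1 − d) implied by it:

```python
import math

def arfima_gamma(k, d, sigma2=1.0):
    # Autocovariance of ARFIMA(0, d, 0) at lag k, cf. (17) / Hosking (1981):
    # gamma(k) = sigma2 * (-1)**|k| * Gamma(1-2d) / (Gamma(|k|-d+1) * Gamma(-|k|-d+1)).
    # math.gamma handles the negative non-integer argument -|k|-d+1.
    k = abs(k)
    return (sigma2 * (-1) ** k * math.gamma(1 - 2 * d)
            / (math.gamma(k - d + 1) * math.gamma(-k - d + 1)))

d = 0.3
var0 = arfima_gamma(0, d)                    # = sigma2 * Gamma(1-2d)/Gamma(1-d)**2
rho1 = arfima_gamma(1, d) / var0             # lag-one autocorrelation, = d/(1-d)
```

The lag-zero value reproduces the variance σ²Γ(1 − 2d)/Γ(1 − d)² used above, and rho1 matches d/(1 − d) to machine precision.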
It should be noted that for the MA(1) and the ARFIMA(0, d, 0) models no analytic expressions for the derivatives of (7) are available. The standard deviations of φ̂_HT, α̂_HT and d̂_HT are empirical estimates based on the Godambe information, obtained by compounding the empirical estimates of J and K. The standard deviations of the pairwise maximum likelihood estimator and the maximum likelihood estimator are obtained using the analytic expressions (see Pace et al. (2011) [31]) for the AR(1) model, and the empirical counterparts for the MA(1) model. Numerical evaluation of scoring rule derivatives has been carried out using the R package numDeriv; see Gilbert & Varadhan (2012) [19].

Results from the simulations reveal that the estimators considered produce estimates very close to the true values of the parameters. However, results not shown here suggest that when the length T of the series is small, the pairwise likelihood estimator performs worse in terms of bias than the other estimators in all the models considered. The left, middle and right-hand panels of Figure 1 depict the asymptotic relative efficiency as a function of φ for the AR(1) model, as a function of α for the MA(1) model, and as a function of d for the ARFIMA(0, d, 0) model, respectively.

5 Concluding remarks

In this paper we have considered the use of Hyvärinen scoring rules in linear time series estimation under different conditions. We have established the consistency of the Hyvärinen scoring rule estimator for a single time series under some general conditions, and its asymptotic normality in an ARMA time series context. We have investigated, for n independent time series, the performances of two estimators based on the Hyvärinen scoring rule, which can be regarded as a surrogate for a complex full likelihood. The properties of the estimators found using this scoring rule are compared with the full and pairwise maximum likelihood estimators.
Three simple models are discussed: the first a stationary first order autoregressive model, the second a first order moving average model, and the third a fractionally differenced white noise. In the first case the total Hyvärinen method leads to poor estimators; in contrast, in the second and third this method produces good estimators. The opposite behaviour is observed for the pairwise estimators. For the moving average process and the fractionally differenced white noise, there can be a large gain in efficiency, as compared to the pairwise likelihood method, in using the total or the matrix Hyvärinen scoring rule estimators. For the autoregressive model, in contrast, the total Hyvärinen score methods suffer a loss of efficiency as |φ| approaches 1.

In all examples, results not reported here show that there is a great improvement in the performance of the matrix Hyvärinen estimator based on the Wishart model as the ratio T/n becomes negligible. It is clear that the loss of efficiency incurred in using the Hyvärinen scoring rules or pairwise likelihood can be substantial, but this depends on the underlying model, and no overall general principle has emerged that might offer guidance for cases not yet considered. The matrix Hyvärinen estimator has the apparent advantage over the other estimators (apart from full maximum likelihood) of being based on the sufficient statistic of the model; nevertheless the total Hyvärinen estimator shows good performance in terms of efficiency. We conclude that minimising the total Hyvärinen score may offer a viable and useful approach to estimation in models where computation of the normalising constant in the likelihood function is not feasible, and pairwise likelihood methods lead to poor estimators.
Acknowledgements
Monica Musio was supported by the project GESTA of the Fondazione di Sardegna and Regione Autonoma di Sardegna.
Appendix: Proof of Theorem 1
Let θ = (σ², λ) and let E_{θ₀} denote the expectation with respect to the probability distribution of (y_t) defined in Equation (6). Let θ₀ = (σ₀², λ₀) denote the true parameter value. From the ergodicity of (y_t), it follows that H(θ, Y_T) is ergodic and stationary, and therefore

(1/T) H(θ, Y_T) → H(θ₀, θ) := E_{θ₀} H(θ, y) almost surely.    (18)

Since the Hyvärinen score is strictly proper, we have

H(θ₀, θ) ≥ H(θ₀, θ₀),    (19)

with equality if and only if θ = θ₀, by the identifiability condition (8). The approach used to derive the consistency of the Hyvärinen estimator now follows the same general argument used to derive the consistency of the pairwise likelihood estimator in Davis & Yau (2011) [8]. In particular, the compactness of Λ and the inequality (19) are used as devices for proving the claim.

Appendix: Proof of Theorem 2
Define the sample gradient and Hessian as

J_T(θ) := −(1/√T) Σ_{i=1}^T ∂b_ii/∂θ + (1/√T) Σ_{i,j,t=1}^T (∂b_ij/∂θ) b_it y_j y_t,

K_T(θ) := −(1/T) Σ_{i=1}^T ∂²b_ii/∂θ∂θ^⊤ + (1/T) Σ_{i,j,t=1}^T (∂b_ij/∂θ)(∂b_it/∂θ)^⊤ y_j y_t + (1/T) Σ_{i,j,t=1}^T (∂²b_ij/∂θ∂θ^⊤) b_it y_j y_t.

Using a Taylor expansion of J_T(θ₀) around θ₀ and the consistency of the Hyvärinen scoring rule estimator, it can be proved that, for some θ_T⁺ between θ₀ and θ̂_T,

J_T(θ₀) = K_T(θ_T⁺) √T (θ₀ − θ̂_T).    (20)

The asymptotic distribution of θ̂_T can then be derived by exploiting the asymptotic properties of K_T(θ_T⁺) and J_T(θ₀), together with the ergodic theorem and the fact that θ_T⁺ → θ₀ almost surely. Writing

∂/∂θ = ∂/∂θ |_{θ=θ₀}, Γ = Γ(λ₀),

it can be shown that

K_T(θ_T⁺) → lim_{T→∞} (1/T) Σ_{i,j,t=1}^T (∂b_ij/∂θ)(∂b_it/∂θ)^⊤ σ₀² Γ_tj = K(θ₀) almost surely.

In order to calculate the asymptotic distribution of θ̂_T we need the expectation and the variance of J_T(θ₀). The calculation of the expectation of J_T(θ₀) follows easily from the unbiasedness of the scoring rule estimating equation (Dawid & Lauritzen (2005) [10]). However, calculation of the variance of J_T(θ₀) is challenging due to the presence of the non-deterministic term

B_i = Σ_{j,t=1}^T (∂b_ij/∂θ) b_it y_j y_t.

It relies on the following calculation:

var(J_T(θ₀)) = (1/T) Σ_{i=1}^T var(B_i)
= (1/T) Σ_{i,j,t,ℓ,h=1}^T (∂b_ij/∂θ) b_it (∂b_iℓ/∂θ) b_ih cov(y_j y_t, y_ℓ y_h)
= (1/T) Σ_{i,j,t,ℓ,h=1}^T (∂b_ij/∂θ) (Γ^{it}/σ₀²) (∂b_iℓ/∂θ) (Γ^{ih}/σ₀²) { cov(y_j, y_ℓ) cov(y_t, y_h) + cov(y_j, y_h) cov(y_t, y_ℓ) + cum(y_j, y_t, y_ℓ, y_h) }
= (1/T) Σ_{i,j,t,ℓ,h=1}^T ( A_{jℓth} + C_{jhtℓ} + D_{jtℓh} ),    (21)

where

A_{jℓth} = (∂b_ij/∂θ) (Γ^{it}/σ₀²) (∂b_iℓ/∂θ) (Γ^{ih}/σ₀²) cov(y_j, y_ℓ) cov(y_t, y_h),
C_{jhtℓ} = (∂b_ij/∂θ) (Γ^{it}/σ₀²) (∂b_iℓ/∂θ) (Γ^{ih}/σ₀²) cov(y_j, y_h) cov(y_t, y_ℓ),
D_{jtℓh} = (∂b_ij/∂θ) (Γ^{it}/σ₀²) (∂b_iℓ/∂θ) (Γ^{ih}/σ₀²) cum(y_j, y_t, y_ℓ, y_h).

The first term in (21) simplifies as

Σ_{i,j,t,ℓ,h=1}^T A_{jℓth} = Σ_{i,j,ℓ=1}^T (∂b_ij/∂θ) Γ^{ii} (∂b_iℓ/∂θ) Γ_jℓ.
(22)

The second term simplifies as

Σ_{i,j,t,ℓ,h=1}^T C_{jhtℓ} = T { ∂/∂θ (γ⁻¹(0)/σ₀²) }².    (23)

The third term in (21), which involves the fourth cumulant, vanishes, since for Gaussian linear processes all the cumulant functions cum_k for k > 2 are zero. Hence var(J_T(θ₀)) is evaluated by considering only the first non-constant term (22). Equation (22) can be rewritten as

Σ_{i,j,t,ℓ,h=1}^T A_{jℓth} = Σ_{i,j,ℓ=1}^T ∂/∂θ (γ⁻¹(i − j)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(i − ℓ)/σ₀²) · γ(j − ℓ),

where γ(j − ℓ) = Γ_jℓ and γ⁻¹(i − j) = Γ^{ij}. Let k = i − j and r = j − ℓ. Without loss of generality, we assume that γ(h) = 0 if |h| > T. Then the previous expression, and consequently the first term in (21), simplifies to

Σ_{k,r=−T}^T (T − max{|k|, |k + r|, |r|}) ∂/∂θ (γ⁻¹(k)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(k + r)/σ₀²) · γ(r).

Hence

lim_{T→∞} Σ_{r,k=−T}^T {(T − max{|k|, |k + r|, |r|})/T} ∂/∂θ (γ⁻¹(k)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(k + r)/σ₀²) · γ(r)
= Σ_{r,k=−∞}^∞ ∂/∂θ (γ⁻¹(k)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(k + r)/σ₀²) · γ(r)
= Σ_{r,k=−∞}^∞ V(r, k, θ₀),    (24)

where

V(r, k, θ₀) = { ∂/∂θ (γ⁻¹(0)/σ₀²) }² + ∂/∂θ (γ⁻¹(k)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(k + r)/σ₀²) · γ(r).

Combining equations (24) and (23) we obtain

var(J_T(θ₀)) → V,    (25)

where V = Σ_{r,k=−∞}^∞ V(r, k, θ₀). Since J_T(θ₀) depends on the B_i's, which involve the sample autocovariances, it follows from the asymptotic normality of the sample autocovariances of ARMA processes that J_T(θ₀) is also asymptotically normal, with zero mean and variance V. From (20) and (25) we obtain the asymptotic normality of θ̂_T:

√T (θ̂_T − θ₀) →_D N_{m−1}( 0, K(θ₀)⁻¹ V {K(θ₀)⁻¹}^⊤ ).

References

[1] Almeida, M. P. and Gidas, B. (1993). A variational method for estimating the parameters of MRF from complete or incomplete data.
Annals of Applied Probability, 103–136.

[2] Barndorff-Nielsen, O. E. & Cox, D. R. (1994). Inference and Asymptotics. Chapman & Hall, London.

[3] Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3.

[4] Brockwell, P. J. & Davis, R. A. (1991). Time Series: Theory and Methods. Springer, New York.

[5] Cattelan, M. & Sartori, N. (2016). Empirical and simulated adjustments of composite likelihood ratio statistics. Journal of Statistical Computation and Simulation, 86, 1056–1067.

[6] Chatfield, C. Inverse autocorrelations. Journal of the Royal Statistical Society, Series A, 142(3), 363–377.

[7] Cleveland, W. S. (1972). The inverse autocorrelations of a time series and their applications. Technometrics, 14(2), 277–293.

[8] Davis, R. A. & Yau, C. Y. (2011). Comments on pairwise likelihood in time series models. Statistica Sinica, 21, 255–277.

[9] Davison, A. C. (2003). Statistical Models. Cambridge University Press, Cambridge.

[10] Dawid, A. P. & Lauritzen, S. L. (2005). The geometry of decision theory. In Proceedings of the Second International Symposium on Information Geometry and its Applications, 22–28. University of Tokyo.

[11] Dawid, A. P. (2007). The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59, 77–93.

[12] Dawid, A. P., Lauritzen, S. L. & Parry, M. (2012). Proper local scoring rules on discrete sample spaces. The Annals of Statistics, 40, 593–608.

[13] Dawid, A. P. & Musio, M. (2014). Theory and applications of proper scoring rules. Metron, 72, 169–183.

[14] Dawid, A. P., Musio, M. & Ventura, L. (2016). Minimum scoring rule inference. Scandinavian Journal of Statistics, 43(1), 123–134.

[15] Dawid, A. P. & Musio, M. (2015). Bayesian model selection based on proper scoring rules (with discussion). Bayesian Analysis, 479–521.

[16] Dawid, A. P., Musio, M. & Columbu, S. (2017). A note on Bayesian model selection for discrete data using proper scoring rules. Statistics & Probability Letters, 129, 101–106.

[17] Ehm, W. & Gneiting, T. (2012). Local proper scoring rules of order two. The Annals of Statistics, 40, 609–637.

[18] Forbes, P. G. M. & Lauritzen, S. (2015). Linear estimating equations for exponential families with application to Gaussian linear concentration models. Linear Algebra and its Applications, 473, 261–283.

[19] Gilbert, P. & Varadhan, R. (2012). numDeriv: Accurate Numerical Derivatives. http://CRAN.R-project.org/package=numDeriv.

[20] Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society, Series B, 14, 107–114.

[21] Hosking, J. R. M. (1980). The asymptotic distribution of the sample inverse autocorrelations of an autoregressive-moving average process. Biometrika, 67(1), 223–226.

[22] Hosking, J. R. M. (1981). Fractional differencing. Biometrika, 68(1), 165–176.

[23] Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6, 695–709.

[24] Hyvärinen, A. (2007). Some extensions of score matching. Computational Statistics and Data Analysis, 51, 2499–2512.

[25] Kollo, T. & von Rosen, D. (2005). Advanced Multivariate Statistics with Matrices. Springer, Dordrecht.

[26] Kroese, D. P., Taimre, T. & Botev, Z. I. (2011). Handbook of Monte Carlo Methods. John Wiley & Sons, Hoboken, New Jersey.

[27] Joe, H. & Lee, Y. (2009). On weighting of bivariate margins in pairwise likelihood. Journal of Multivariate Analysis, 100, 670–685.

[28] Le Cessie, S. & Van Houwelingen, J. C. (1994). Logistic regression for correlated binary data. Journal of the Royal Statistical Society, Series C, 43, 95–108.

[29] Mardia, K., Kent, J. & Laha, A. (2016). Score matching estimators for directional distributions. arXiv:1604.08470.

[30] Musio, M. & Dawid, A. P. (2013). Local scoring rules: A versatile tool for inference. In Proceedings of the 59th ISI World Statistics Congress, Hong Kong. http://2013.isiproceedings.org/Files/STS019-P3-S.pdf

[31] Pace, L., Salvan, A. & Sartori, N. (2011). Adjusting composite likelihood ratio statistics. Statistica Sinica, 21, 129–148.

[32] Park, J. & Haran, M. (2017). Bayesian inference in the presence of intractable normalizing functions. Journal of the American Statistical Association, 113(523), 1372–1390.

[33] Parry, M. F., Dawid, A. P. & Lauritzen, S. L. (2012). Proper local scoring rules. The Annals of Statistics, 40, 561–592.

[34] R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

[35] Shao, S., Jacob, P. E., Ding, J. & Tarokh, V. (2018). Bayesian model comparison with the Hyvärinen score: computation and consistency. Journal of the American Statistical Association, online first, DOI: 10.1080/01621459.2018.1518237.

[36] Takasu, Y., Yano, K. & Komaki, F. (2018). Scoring rules for statistical models on spheres. Statistics & Probability Letters, 111–115.

[37] Varin, C. (2008). On composite marginal likelihoods. AStA Advances in Statistical Analysis, 92, 1–28.

[38] Varin, C., Reid, N. & Firth, D. (2011). An overview of composite likelihood methods. Statistica Sinica, 21, 5–42.

[39] von Rosen, D. (1988). Moments for the inverted Wishart distribution.
Scandinavian Journal of Statistics, 15, 97–109.

Figure 1: Asymptotic relative efficiency of estimators for the AR(1) model (left top panel), for the MA(1) model (right top panel) and for the ARFIMA(0, d, 0) model (left bottom panel).

Table 1: Simulation 1. Estimated mean (Est.), asymptotic standard deviation (sd), and asymptotic relative efficiency (ARE) of estimators of the parameter $\phi$ in the AR(1) model, for $n = 200$, $T = 50$, and varying values of $\phi$. We denote by $\hat{\phi}$ the maximum likelihood estimate, by $\hat{\phi}_{PL}$ the pairwise likelihood estimate, and by $\hat{\phi}_{HT}$ and $\hat{\phi}_{HW}$ the total and the matrix Hyvärinen estimates, respectively.

Table 2: Estimated mean (Est.), asymptotic standard deviation (sd), and asymptotic relative efficiency (ARE) of estimators of the parameter $\alpha$ in the MA(1) model, for $n = 200$, $T = 50$, and varying values of $\alpha$. We denote by $\hat{\alpha}$ the maximum likelihood estimate, by $\hat{\alpha}_{PL}$ the pairwise likelihood estimate, and by $\hat{\alpha}_{HT}$ and $\hat{\alpha}_{HW}$ the total and the matrix Hyvärinen estimates, respectively.

Table 3: Estimated mean (Est.), asymptotic standard deviation (sd), and asymptotic relative efficiency (ARE) of estimators of the parameter $d$ in the ARFIMA model, for $n = 200$, $T = 50$, and varying values of $d$. We denote by $\hat{d}$ the maximum likelihood estimate, by $\hat{d}_{PL}$ the pairwise likelihood estimate, and by $\hat{d}_{HT}$ and $\hat{d}_{HW}$ the total and the matrix Hyvärinen estimates, respectively.