The Hyvärinen scoring rule in Gaussian linear time series models
Silvia Columbu^a, Valentina Mameli^b, Monica Musio^a and A. Philip Dawid^c

^a Dep. of Mathematics and Computer Science, Univ. of Cagliari, Italy
^b Dep. of Environmental Sciences, Informatics and Statistics, Ca' Foscari Univ. of Venice, Italy
^c Dep. of Pure Mathematics and Mathematical Statistics, Univ. of Cambridge, UK

arXiv preprint [stat.ME]
Abstract
Likelihood-based estimation methods involve the normalising constant of the model distributions, expressed as a function of the parameter. In many problems, however, this function is not easily available, and then less efficient but more easily computed estimators may be attractive. In this work we study stationary time-series models, and construct and analyse "score-matching" estimators that do not involve the normalising constant. We consider two scenarios: a single series of increasing length, and an increasing number of independent series of fixed length. In the latter case there are two variants, one based on the full data, and another based on a sufficient statistic. We study the empirical performance of these estimators in three special cases, autoregressive (AR), moving average (MA) and fractionally differenced white noise (ARFIMA) models, and make comparisons with full and pairwise likelihood estimators. The results are somewhat model-dependent, with the new estimators doing well for MA and ARFIMA models, but less so for AR models.
Keywords: Scoring rule estimators, Hyvärinen scoring rule, Gaussian linear time series.
1 Introduction

The evaluation of the exact full likelihood may be difficult or even impossible in situations where complex problems deal with large correlated datasets. Such problems arise, for instance, in spatial statistics and time series frameworks, in which direct computation of the normalising constant can be a very challenging task, entailing multidimensional integration of the full joint density for each value of the parameter. Many approaches have been proposed to tackle this issue. In this paper, we investigate an appealing approach based on the score matching estimator proposed by Hyvärinen in 2005 [23].

The score matching estimator can be regarded as a specific case of estimators derived from proper scoring rules (see Dawid and Musio (2014) [13]), which are loss functions for measuring the quality of probabilistic forecasts. In particular, this estimator derives from the Hyvärinen scoring rule, which is a homogeneous proper scoring rule (see Ehm and Gneiting (2012) [17] and Parry et al. (2012) [33]), namely a proper scoring rule which does not require knowledge of the normalising constant. Homogeneous scoring rules have been characterised for continuous real variables (Parry et al. (2012) [33]) and for discrete variables (Dawid et al. (2012) [12]). In a Bayesian framework, Dawid and Musio (2015) [15] have shown, for the case of continuous variables, how to handle Bayesian model selection with improper within-model prior distributions by exploiting homogeneous proper scoring rules. The discrete counterpart has been empirically studied by Dawid et al. (2017) [16]. In a recent contribution, Shao et al. (2018) [35] consider the use of the Hyvärinen score for model comparison. Although the majority of contributions involving the Hyvärinen scoring rule focus on Euclidean spaces, scholars have also investigated extensions to non-Euclidean spaces: for an early study see Dawid (2007) [11]. Recently, Mardia et al.
(2016) [29] proposed an extension of the Hyvärinen scoring rule to compact oriented Riemannian manifolds, and Takasu et al. (2018) [36] constructed a novel class of homogeneous strictly proper scoring rules for statistical models on spheres.

Given the growing interest in the use of this scoring rule for very complex statistical models, in this paper we aim to derive an estimation method based on the Hyvärinen scoring rule for estimating linear Gaussian time series models. We distinguish two separate cases: a first in which the length of a single time series increases to infinity, and a second in which the length of the time series is fixed and the number of series increases to infinity. The consistency and asymptotic distribution of the Hyvärinen estimator are derived for the case of a single time series of increasing length. In particular, under some mild regularity conditions we derive consistency of the proposed estimator for linear Gaussian time series models, while the asymptotic distribution is proved in the specific case of autoregressive moving average (ARMA) causal invertible models. For time series with fixed length and the number of time series increasing to infinity, the performances of two estimators based on the Hyvärinen scoring rule, namely the total Hyvärinen estimator and the matrix Hyvärinen estimator, are compared through simulation studies with the full maximum likelihood and the pairwise maximum likelihood estimators. Three simple time series models are considered in the design of the simulations: autoregressive (AR), moving average (MA) and fractionally differenced white noise (ARFIMA). Different behaviours can be detected for the total Hyvärinen estimator among the settings examined. In particular, it outperforms the pairwise likelihood estimator in terms of efficiency with respect to the full maximum likelihood estimator for the MA and ARFIMA processes.

The paper unfolds as follows. Section 2 introduces basic notions on scoring rules.
In Section 3 we introduce the Hyvärinen scoring rule for Gaussian linear time series, and give some asymptotic results for the Hyvärinen estimator. In the specific case of n independent series we describe the total Hyvärinen estimator and the matrix Hyvärinen estimator. Section 4 summarises the results of the simulation studies. Section 5 presents some concluding remarks. Technical details are postponed to the Appendices.

2 Proper scoring rules

A scoring rule is a loss function designed to measure the quality of a proposed probability distribution Q, for a random variable X, in the light of the outcome x of X. Specifically, if a forecaster quotes a predictive distribution Q for X and the event X = x realises, then the forecaster's loss will be S(x, Q). The expected value of S(X, Q) when X has distribution P is denoted by S(P, Q). The scoring rule S is proper (relative to the class of distributions P) if

S(P, Q) ≥ S(P, P), for all P, Q ∈ P.    (1)

It is strictly proper if equality obtains only when Q = P. Any proper scoring rule gives rise to a general method for parameter estimation, based on an unbiased estimating equation, as described below.

Some important proper scoring rules are the log-score, S(x, Q) = −log q(x) (Good (1952) [20]), where q(·) is the density function of Q, which recovers the full (negative log) likelihood, and the Brier score (Brier (1950) [3]). A particularly interesting case, which avoids the need to compute the normalising constant, produces the score matching estimation method of Hyvärinen (2005) [23], based on the following proper scoring rule:

S(x, Q) = Δ_x ln q(x) + (1/2) ‖∇_x ln q(x)‖²,    (2)

where X ranges over the whole of ℝ^p supplied with the Euclidean norm ‖·‖, q(·) is assumed twice continuously differentiable, and x is the realised value of X. In (2), ∇_x denotes the gradient operator, and Δ_x the Laplacian operator, with respect to x. For p = 1 we can express

S(x, Q) = q″(x)/q(x) − (1/2) {q′(x)/q(x)}².    (3)

The scoring rule (2) is a homogeneous proper scoring rule (see Parry et al. (2012) [33]). It is homogeneous in the density function q(·), i.e. its value is unaffected by applying a positive scale factor to the density q, and so it can be computed even if we only know the density function up to a scale factor. Inference performed using any homogeneous scoring rule does not require knowledge of the normalising constant of the distribution.

Let (x₁, . . . , x_n) be independent realisations of a random variable X, having distribution P_θ depending on an unknown parameter θ ∈ Θ, where Θ is an open subset of ℝ^m. Given a proper scoring rule S, let S(x, θ) denote S(x, P_θ). Inference for the parameter θ may be performed by minimising the total empirical score,

S(θ) = Σ_{p=1}^n S(x_p, θ),    (4)

resulting in the minimum score estimator, θ̂_S = arg min_θ S(θ). Under broad regularity conditions on the model (see e.g. Barndorff-Nielsen & Cox (1994) [2]), θ̂_S satisfies the score equation:

s(θ) := Σ_{p=1}^n s(x_p, θ) = 0,    (5)

where s(x, θ) := ∇_θ S(x, θ), the gradient vector of S(x, θ) with respect to θ. The score equation is an unbiased estimating equation (Dawid & Lauritzen (2005) [10]). When S is the log-score, the minimum score estimator becomes the maximum likelihood estimator.

From the general theory of unbiased estimating functions, under broad regularity conditions on the model the minimum score estimate θ̂_S is asymptotically consistent and normally distributed: θ̂_S ∼ N(θ, {nG(θ)}⁻¹), where G(θ) denotes the Godambe information matrix

G(θ) := K(θ) J(θ)⁻¹ K(θ),

where J(θ) = E{s(X, θ) s(X, θ)^⊤} is the variability matrix, and K(θ) = E{∇_θ s(X, θ)^⊤} is the sensitivity matrix. In contrast to the case of the full likelihood, J and K are different in general: see Dawid & Musio (2014) [13], Dawid et al. (2016) [14].
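The normalising-constant-free property can be illustrated numerically. The sketch below (ours, not from the paper; Python with NumPy assumed, function names hypothetical) evaluates the one-dimensional score (3), in its equivalent form (ln q)″(x) + ½{(ln q)′(x)}², by finite differences of the log-density, and checks that multiplying q by an arbitrary positive constant leaves the score unchanged; for a N(μ, σ²) density the score has the closed form −1/σ² + (x − μ)²/(2σ⁴).

```python
import numpy as np

def hyvarinen_score_1d(x, log_q, eps=1e-4):
    # S(x, Q) = (ln q)''(x) + 0.5 * ((ln q)'(x))**2, computed by central
    # finite differences; this equals q''/q - 0.5*(q'/q)**2, i.e. (3).
    # Only derivatives of ln q enter, so any multiplicative constant in q
    # (in particular the normalising constant) drops out.
    d1 = (log_q(x + eps) - log_q(x - eps)) / (2 * eps)
    d2 = (log_q(x + eps) - 2 * log_q(x) + log_q(x - eps)) / eps**2
    return d2 + 0.5 * d1**2

mu, sigma2 = 1.0, 2.0
# Gaussian log-density known only up to an arbitrary constant log(7.3):
log_q_unnorm = lambda x: -(x - mu) ** 2 / (2 * sigma2) + np.log(7.3)
# Fully normalised version of the same log-density:
log_q_norm = lambda x: -(x - mu) ** 2 / (2 * sigma2) - 0.5 * np.log(2 * np.pi * sigma2)

x0 = 0.4
s_unnorm = hyvarinen_score_1d(x0, log_q_unnorm)
s_norm = hyvarinen_score_1d(x0, log_q_norm)
analytic = -1.0 / sigma2 + (x0 - mu) ** 2 / (2 * sigma2 ** 2)
```

The two numerical scores agree with each other and with the analytic value, whereas the log-score itself would of course differ between the two densities.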
We point out that estimation of the matrix J(θ), and (to a somewhat lesser extent) of the matrix K(θ), is not an easy task: see Varin (2008) [37], Varin et al. (2011) [38] and Cattelan and Sartori (2016) [5].

3 The Hyvärinen scoring rule for Gaussian linear time series

In this section we introduce some results based on the use of the Hyvärinen scoring rule in the setting of Gaussian linear time series models. Consider the Gaussian linear time series model

y_t = μ + Σ_{j=0}^∞ ψ_j z_{t−j}, t = 1, 2, . . . ,    (6)

parametrised by μ, σ² and λ ∈ ℝ^{m−2}, where for j ≥ 0, ψ_j = ψ_j(λ) satisfies ψ₀ = 1 and Σ_{t=0}^∞ ψ_t² < ∞. The (z_t) are iid Gaussian variables with mean 0 and variance σ². Let θ = (μ, σ², λ) be the vector of model parameters. The autocovariance function is E{(y_{t+j} − μ)(y_t − μ)} = σ² Σ_{t=0}^∞ ψ_t ψ_{t+j} = σ² γ_λ(j), say. Using basic differentiation rules, it is easy to find the Hyvärinen score based on the single time series Y_T = (y₁, y₂, . . . , y_T):

H(θ, Y_T) = −(1/σ²) Σ_{i=1}^T Γ^{ii} + (1/2) Σ_{i=1}^T { Σ_{t=1}^T (1/σ²) Γ^{it} (y_t − μ) }²,    (7)

where the matrix Γ has (i, j) entry Γ_ij = γ_λ(|i − j|) and Γ^{ij} is the (i, j) entry of Γ⁻¹. We will denote the Hyvärinen estimator based on a single series by θ̂_H.

3.1 Asymptotic results for a single time series

In this subsection we analyse the asymptotic statistical properties of the Hyvärinen scoring rule estimator, based on (7), for a single time series. The following theorem shows the consistency of the estimator θ̂_H in the Gaussian linear time series setting. The proof of the theorem is deferred to the Appendix, and follows arguments similar to those used by Davis and Yau (2011) [8] to prove the consistency of the pairwise likelihood estimator.

Theorem 1.
Suppose (y_t) is the linear process in (6) with μ = 0 and true parameter θ₀ = (σ₀², λ₀). Let θ̂_H = argmin_θ H(θ, Y_T) be the minimum score estimator, where θ = (σ², λ) and λ ∈ Λ, with Λ a compact set. If the identifiability condition

σ² γ_λ(j) = σ₀² γ_{λ₀}(j) for j = 0, 1, . . . , k if and only if (σ², λ) = (σ₀², λ₀)    (8)

is satisfied, then θ̂_H → θ₀ almost surely as T → ∞.

Once consistency has been proved, we focus on the asymptotic distribution of θ̂_H. Its analytic form involves the elements Γ^{ij} of the inverse of the autocovariance matrix. In order to guarantee their absolute summability, we restrict our attention to the case of ARMA causal invertible processes. Defining b_ij = Γ^{ij}/σ², the gradient and the Hessian of H(θ, Y_T) with respect to θ are given, respectively, by the following two expressions:

J(θ) = ∂H(θ, Y_T)/∂θ = −Σ_{i=1}^T ∂b_ii/∂θ + Σ_{i,j,t=1}^T (∂b_ij/∂θ) b_it y_j y_t,    (9)

K(θ) = ∂²H(θ, Y_T)/∂θ∂θ^⊤ = −Σ_{i=1}^T ∂²b_ii/∂θ∂θ^⊤ + Σ_{i,j,t=1}^T (∂b_ij/∂θ)(∂b_it/∂θ)^⊤ y_j y_t + Σ_{i,j,t=1}^T (∂²b_ij/∂θ∂θ^⊤) b_it y_j y_t,    (10)

where ∂/∂θ denotes differentiation with respect to the components of the vector θ.

Theorem 2.
Suppose that (y_t) is an ARMA(p, q) causal and invertible process. Furthermore, assume that the Hessian matrix K(θ) is invertible in a neighbourhood of θ₀. If the identifiability condition (8) holds, then

√T (θ̂_T − θ₀) →_D N_{m−1}( 0, K(θ₀)⁻¹ V {K(θ₀)⁻¹}^⊤ ),

where V = Σ_{r=−∞}^∞ Σ_{k=−∞}^∞ V(r, k, θ₀) with

V(r, k, θ₀) = { ∂/∂θ (γ⁻¹(0)/σ₀²) }² + ∂/∂θ (γ⁻¹(k)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(k + r)/σ₀²) · γ(r),

γ(·) denoting the autocovariance function of the process and γ⁻¹(·) its inverse autocovariance function.

Theorem 2 states that θ̂_H, in the case that (y_t) is an ARMA causal invertible process, is asymptotically normally distributed with rate of convergence √T. As is well known, the autocovariance function of an ARMA process decays exponentially; an ARMA process is therefore a short memory process, and its autocovariance function is absolutely summable (Brockwell & Davis (1991) [4]). This property, together with the duality of ARMA models under causality and invertibility, allows us to prove asymptotic normality. For the complete proof refer to the Appendix.

3.2 n independent time series

In the remainder of this section we discuss the case of n independent series of length T. We assume that T is fixed while n increases to infinity. We also specialise to the case in which the common mean μ and innovation variance σ² = σ₀² are known; without loss of generality we take μ = 0. Consider now n independent and identically distributed processes Y₁, . . . , Y_n, where Y_p = (y_{p1}, . . . , y_{pT}), each having the T-variate normal distribution with mean vector 0 and variance-covariance matrix σ²Γ, with unknown parameter λ. Let the (n × T) random matrix Y have the vector Y_p as its p-th row. We define the total Hyvärinen score (HT) as the sum of the n single Hyvärinen scores in (7):

HT(λ) = Σ_{p=1}^n H_p(λ, Y_p),    (11)

where

H_p(λ, Y_p) = −(1/σ²) Σ_{i=1}^T Γ^{ii} + (1/2) Σ_{i=1}^T { Σ_{t=1}^T (1/σ²) Γ^{it} y_{pt} }².    (12)

The estimate of λ minimising the total Hyvärinen score will be denoted by λ̂_HT.

An alternative approach is to consider as basic observable the sum-of-squares-and-products matrix SSP = Y^⊤Y, which is a sufficient statistic for the multivariate normal model, having the Wishart distribution with n degrees of freedom and scale matrix σ²Γ. Inference for the parameter λ can then be performed by resorting to the Hyvärinen score based directly on the Wishart model. We will call this scoring rule the matrix Hyvärinen score. Assuming n ≥ T, so that the joint distribution of the upper triangle (s_ij : 1 ≤ i ≤ j ≤ T) of the sum-of-squares-and-products random matrix SSP (which has a Wishart distribution with parameters n and σ²Γ) has a density, and taking into consideration the properties of the derivatives of traces and determinants, it can be shown that the Hyvärinen score based on this joint density is

HW(SSP, Γ) = −(n − T − 1) Σ_{i=1}^T (s^{ii})² + (1/2) Σ_{i,j=1}^T { (n − T − 1) s^{ij} − (1/σ²) Γ^{ij} }²,    (13)

where s^{ij} and Γ^{ij} are the elements of the inverse matrices SSP⁻¹ and Γ⁻¹, respectively. The matrix Hyvärinen estimator for λ, minimising HW(SSP, Γ) with respect to λ, will be denoted by λ̂_HW. The derivative of HW(SSP, Γ) with respect to λ is

HW_λ(SSP, Γ) = −(1/σ²) Σ_{i,j=1}^T { (n − T − 1) s^{ij} − (1/σ²) Γ^{ij} } ∂Γ^{ij}/∂λ,    (14)

and E{HW_λ(SSP, Γ)} = 0, since E(s^{ij}) = Γ^{ij}/{σ²(n − T − 1)}. The sensitivity function is K(λ) = E{HW_λλ(SSP, Γ)} = Σ_{i,j=1}^T (∂Γ^{ij}/∂λ)² / σ⁴.
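To make the total Hyvärinen estimator concrete, the following sketch (ours, illustrative only; Python with NumPy assumed, function names hypothetical) minimises HT(λ) of (11)-(12) by grid search in the AR(1) case treated in Section 4, with μ = 0 and σ² = 1 known, using the standard tridiagonal closed form of the AR(1) inverse covariance matrix in place of a numerical inversion:

```python
import numpy as np

rng = np.random.default_rng(0)

def ar1_precision(phi, T):
    # Inverse of the AR(1) covariance Gamma with Gamma_lm = phi**|l-m|/(1-phi**2)
    # (unit innovation variance): tridiagonal, a standard AR(1) result.
    Q = np.full(T, 1.0 + phi ** 2)
    Q[0] = Q[-1] = 1.0
    P = np.diag(Q)
    idx = np.arange(T - 1)
    P[idx, idx + 1] = P[idx + 1, idx] = -phi
    return P

def total_hyvarinen_score(Y, phi, sigma2=1.0):
    # HT(lambda) = sum_p [ -(1/s2) tr(Gamma^{-1}) + 0.5 || Gamma^{-1} y_p / s2 ||^2 ],
    # cf. (11)-(12) with mu = 0.
    n, T = Y.shape
    Q = ar1_precision(phi, T)
    grad = Y @ Q / sigma2          # row p holds Gamma^{-1} y_p / sigma2 (Q symmetric)
    return -n * np.trace(Q) / sigma2 + 0.5 * np.sum(grad ** 2)

# Simulate n independent stationary AR(1) series of length T.
n, T, phi_true = 200, 50, 0.5
Y = np.empty((n, T))
Y[:, 0] = rng.normal(size=n) / np.sqrt(1 - phi_true ** 2)
for t in range(1, T):
    Y[:, t] = phi_true * Y[:, t - 1] + rng.normal(size=n)

grid = np.linspace(-0.95, 0.95, 191)
phi_hat = grid[np.argmin([total_hyvarinen_score(Y, p) for p in grid])]
```

Since the scoring rule is proper and the model is correctly specified, the minimiser lands close to the true φ; an off-the-shelf optimiser could replace the grid search when λ is multidimensional.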
The derivation of the variability function J(λ), which after taking the expectation of the square of (14) reduces to

J(λ) = {(n − T − 1)²/σ⁴} Σ_{i,j,k,l=1}^T (∂Γ^{ij}/∂λ)(∂Γ^{kl}/∂λ) cov(s^{ij}, s^{kl}),    (15)

involves calculations requiring the covariance matrix of the random matrix SSP⁻¹, which has an Inverse Wishart distribution with scale matrix σ²Γ⁻¹: see Von Rosen (1998) [39] for details on the components of this covariance matrix. In general, the Godambe information needed to estimate the standard error of λ̂_HW is not easy to derive analytically, due to the form of the matrix Γ. It should also be pointed out that this approach cannot be used if we have a single time series of length T with T increasing to ∞, since for non-singularity of the Wishart distribution we need to assume n ≥ T.

4 Simulation studies

In this section we report simulation studies designed to assess and compare the behaviours of the total and the matrix Hyvärinen estimators. We refer to the case described in Section 3.2, in which T is fixed and n increases to ∞. For comparison, we also consider the full and pairwise maximum likelihood estimators (Davis & Yau (2011) [8]). We discuss three examples: the first order autoregressive AR(1), the first order moving average MA(1), and the fractionally differenced white noise ARFIMA(0, d, 0) [34]. The summary statistics shown are: average estimates of the parameters, asymptotic standard deviations (sd), and the asymptotic relative efficiency (ARE) with respect to the maximum likelihood estimator.

The stationary univariate autoregressive process of order 1, denoted by AR(1), is defined by

y₁ = μ + z₁/√(1 − φ²),
y_t = μ + φ(y_{t−1} − μ) + z_t, (t = 2, . . . , T),

where (z_t) is a Gaussian white noise process with mean 0 and variance σ². Here φ, with |φ| <
1, is the autoregressive parameter. Then y₁, . . . , y_T are jointly normal with mean vector μ1_T (where 1_T is the T-dimensional unit vector) and covariance matrix σ²Γ, with Γ having components Γ_lm = φ^{|l−m|}/(1 − φ²) (l, m = 1, . . . , T).

For comparison purposes we consider also the numerical performance of a class of pairwise likelihood estimators. Since, in the time series considered, dependence decreases in time, as in Davis & Yau (2011) [8] we restrict to the first order consecutive pairwise likelihood, rather than the complete pairwise likelihood, so that only adjacent observations, which are more closely related than the others, are paired. This choice is motivated also by the loss in efficiency incurred in using the k-th order consecutive pairwise likelihood as k increases (see Davis and Yau (2011) [8]; Joe and Lee (2009) [27]). Note that, when it is known that μ = 0 but σ² is unknown, the pairwise likelihood estimator of φ is

φ̂_PL = 2 Σ_{t=2}^T y_t y_{t−1} / Σ_{t=2}^T (y_t² + y_{t−1}²),

which is also the Yule-Walker estimator (Davis & Yau (2011) [8]).

Simulation 1
The values of the model parameters are μ = 0 and σ² = 1, with the autoregressive parameter φ ranging over a grid of values in (−1, 1). In the simulation study, 1000 replicates are generated of n = 200 processes of length T = 50. Results are summarised in Table 1. The numerical results in Table 1 and in the left-hand panel of Figure 1 suggest that φ̂_HT and φ̂_HW do not have high efficiency as φ approaches 1: in particular, the asymptotic efficiency of φ̂_HW tends to 0 for large values of |φ|. In contrast, under the same model setting, there is only a modest loss of efficiency for the pairwise likelihood-based estimator φ̂_PL.

The univariate moving average process of order 1, denoted by MA(1), is defined by

y_t = μ + α z_{t−1} + z_t, (t = 1, . . . , T),    (16)

where |α| < 1 and z₀, z₁, . . . , z_T are independent Gaussian random variables with mean 0 and variance σ². Then the random variables y₁, . . . , y_T are jointly normal, each having mean μ and variance σ²(1 + α²). The variables y_t and y_{t+k} are independent for |k| >
1, while y_t and y_{t+1} have covariance σ²α (t = 1, . . . , T − 1). The covariance matrix σ²Γ of Y = (y₁, y₂, . . . , y_T) has components σ²Γ_ss = σ²(1 + α²), σ²Γ_st = σ²α if |s − t| = 1, and σ²Γ_st = 0 otherwise.

As before we consider the first order consecutive pairwise likelihood, since the use of a higher order consecutive pairwise likelihood is unrealistic due to the independence of y_t and y_{t+k} for k ≥
2. For t = 1, . . . , T −
1, the pair (y_t, y_{t+1}) has a bivariate Gaussian distribution, in which the two components both have mean μ and variance σ²(1 + α²), and have covariance σ²α.

Simulation 2
The values of the model parameters are μ = 0 and σ² = 1, with the moving average parameter α ranging over a grid of values in (−1, 1). In the simulation study, 1000 replicates are generated of n = 200 processes of length T = 50. Results are summarised in Table 2. The simulation shows that the total Hyvärinen estimator α̂_HT achieves the same efficiency as the MLE in the MA(1) model for values of the moving average parameter near 0; see Table 2 and the middle panel of Figure 1. Moreover, the loss in efficiency of the total Hyvärinen estimator α̂_HT is modest even when the absolute value of the moving average parameter approaches 1. In contrast, the pairwise likelihood estimator α̂_PL shows very poor performance in terms of asymptotic relative efficiency: its ARE decreases steadily from 1 as |α| increases.

The fractionally differenced white noise process, ARFIMA(0, d, 0), is defined by

(1 − B)^d y_t = z_t, t = 1, . . . , T,

where B denotes the backshift operator, d ∈ (0, 0.5), and z₁, . . . , z_T are independent Gaussian random variables with mean 0 and variance σ². Then the random variables y₁, . . . , y_T are jointly normal, with covariance matrix σ²Γ whose components (see Hosking (1981) [22]) are

σ²Γ_ij = (−1)^{|k|} σ² Γ(1 − 2d) / {Γ(|k| − d + 1) Γ(−|k| − d + 1)}, k = i − j,    (17)

where on the right-hand side of (17) Γ denotes the gamma function.

As before we consider the first order consecutive pairwise likelihood, since no great improvement can be detected by using a higher order consecutive pairwise likelihood: see the results of Davis and Yau (2011) [8]. For t = 1, . . . , T −
1, the pair (y_t, y_{t+1}) has a bivariate Gaussian distribution, in which the two components both have mean μ and variance σ²Γ(1 − 2d)/Γ(1 − d)², and have covariance −σ²Γ(1 − 2d)/{Γ(2 − d)Γ(−d)}.

Simulation 3
The values of the model parameters are μ = 0 and σ² = 1, with the fractional parameter d ranging over a grid of values in (0, 0.5). In the simulation study, 1000 replicates are generated of n = 100 processes of length T = 50. Results are summarised in Table 3. Simulation 3 shows that the total Hyvärinen estimator d̂_HT achieves the same efficiency as the MLE in the ARFIMA(0, d, 0) model for d near 0 and near 0.3; see Table 3 and the right-hand panel of Figure 1. The loss in efficiency of the total Hyvärinen estimator d̂_HT is very slight when d ∈ (0, 0.3). The efficiency of d̂_HW is poor, with ARE values ranging from 0 to 0.45, and for all the estimators considered the ARE deteriorates when d ∈ (0.3, 0.5). The pairwise likelihood estimator d̂_PL performs better than d̂_HW; however, its values of ARE range from 0.6 to 0.96, reaching a maximum when d = 0.1, with a major loss of efficiency with respect to the total Hyvärinen estimator.
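The behaviour of the estimators over d reflects the slowly decaying autocovariance (17). The small check below (ours, not from the paper; Python standard library only, function name hypothetical) evaluates (17) and confirms the standard ARFIMA(0, d, 0) lag-one autocorrelation d/(1 − d) implied by it:

```python
import math

def arfima_gamma(k, d, sigma2=1.0):
    # Autocovariance of ARFIMA(0, d, 0) at lag k, cf. (17) / Hosking (1981):
    # gamma(k) = sigma2 * (-1)**|k| * Gamma(1-2d) / (Gamma(|k|-d+1) * Gamma(-|k|-d+1)).
    # math.gamma handles the negative non-integer argument -|k|-d+1.
    k = abs(k)
    return (sigma2 * (-1) ** k * math.gamma(1 - 2 * d)
            / (math.gamma(k - d + 1) * math.gamma(-k - d + 1)))

d = 0.3
var0 = arfima_gamma(0, d)                    # = sigma2 * Gamma(1-2d)/Gamma(1-d)**2
rho1 = arfima_gamma(1, d) / var0             # lag-one autocorrelation, = d/(1-d)
```

The lag-zero value reproduces the variance σ²Γ(1 − 2d)/Γ(1 − d)² used above, and rho1 matches d/(1 − d) to machine precision.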
It should be noted that for the MA(1) and the ARFIMA(0, d, 0) models no analytic expressions for the derivatives of (7) are available. The standard deviations of φ̂_HT, α̂_HT and d̂_HT are empirical estimates based on the Godambe information, obtained by compounding the empirical estimates of J and K. The standard deviations of the pairwise maximum likelihood estimator and the maximum likelihood estimator are obtained using the analytic expressions (see Pace et al. (2011) [31]) for the AR(1) model, and the empirical counterparts for the MA(1) model. Numerical evaluation of scoring rule derivatives has been carried out using the R package numDeriv; see Gilbert & Varadhan (2012) [19].

Results from the simulations reveal that the estimators considered produce estimates very close to the true values of the parameters. However, results not shown here suggest that when the length T of the series is small, the pairwise likelihood estimator performs worse in terms of bias than the other estimators in all the models considered. The left, middle and right-hand panels of Figure 1 depict the asymptotic relative efficiency as a function of φ for the AR(1) model, as a function of α for the MA(1) model, and as a function of d for the ARFIMA(0, d, 0) model, respectively.

5 Concluding remarks

In this paper we have considered the use of Hyvärinen scoring rules in linear time series estimation under different conditions. We have established the consistency of the Hyvärinen scoring rule estimator for a single time series under some general conditions, and its asymptotic normality in an ARMA time series context. We have investigated, for n independent time series, the performances of two estimators based on the Hyvärinen scoring rule, which can be regarded as a surrogate for a complex full likelihood. The properties of the estimators found using this scoring rule are compared with the full and pairwise maximum likelihood estimators.
Three simple models are discussed: the first a stationary first order autoregressive model, the second a first order moving average model, and the third a fractionally differenced white noise. In the first case the total Hyvärinen method leads to poor estimators; in contrast, in the second and third this method produces good estimators. The opposite behaviour is observed for the pairwise estimators. For the moving average process and the fractionally differenced white noise, there can be a large gain in efficiency, as compared to the pairwise likelihood method, in using the total or the matrix Hyvärinen scoring rule estimators. For the autoregressive model, in contrast, the total Hyvärinen score methods suffer a loss of efficiency as |φ| approaches 1.

In all examples, results not reported here show that there is a great improvement in the performance of the matrix Hyvärinen estimator based on the Wishart model as the ratio T/n becomes negligible. It is clear that the loss of efficiency incurred in using the Hyvärinen scoring rules or pairwise likelihood can be substantial, but this depends on the underlying model, and no overall general principle has emerged that might offer guidance for cases not yet considered. The matrix Hyvärinen estimator has the apparent advantage over the other estimators (apart from full maximum likelihood) of being based on the sufficient statistic of the model; nevertheless the total Hyvärinen estimator shows good performance in terms of efficiency. We conclude that minimising the total Hyvärinen score may offer a viable and useful approach to estimation in models where computation of the normalising constant in the likelihood function is not feasible, and pairwise likelihood methods lead to poor estimators.
Acknowledgements
Monica Musio was supported by the project GESTA of the Fondazione di Sardegna and Regione Autonoma di Sardegna.
Appendix: Proof of Theorem 1
Let θ = (σ², λ) and let E_{θ₀} denote the expectation with respect to the probability distribution of (y_t) defined in Equation (6). Let θ₀ = (σ₀², λ₀) denote the true parameter value. From the ergodicity of (y_t), it follows that H(θ, Y_T) is ergodic and stationary, and therefore

(1/T) H(θ, Y_T) → H(θ₀, θ) := E_{θ₀} H(θ, y) almost surely.    (18)

Since the Hyvärinen score is strictly proper, we have

H(θ₀, θ) ≥ H(θ₀, θ₀),    (19)

with equality if and only if θ = θ₀, by the identifiability condition (8). The approach used to derive the consistency of the Hyvärinen estimator now follows the same general argument used to derive the consistency of the pairwise likelihood estimator in Davis & Yau (2011) [8]. In particular, the compactness of Λ and the inequality (19) are used as devices for proving the claim.

Appendix: Proof of Theorem 2
Define the sample gradient and Hessian as

J_T(θ) := −(1/√T) Σ_{i=1}^T ∂b_ii/∂θ + (1/√T) Σ_{i,j,t=1}^T (∂b_ij/∂θ) b_it y_j y_t,

K_T(θ) := −(1/T) Σ_{i=1}^T ∂²b_ii/∂θ∂θ^⊤ + (1/T) Σ_{i,j,t=1}^T (∂b_ij/∂θ)(∂b_it/∂θ)^⊤ y_j y_t + (1/T) Σ_{i,j,t=1}^T (∂²b_ij/∂θ∂θ^⊤) b_it y_j y_t.

Using a Taylor expansion of J_T(θ₀) around θ₀ and the consistency of the Hyvärinen scoring rule estimator, it can be proved that, for some θ_T⁺ between θ₀ and θ̂_T,

J_T(θ₀) = K_T(θ_T⁺) √T (θ₀ − θ̂_T).    (20)

The asymptotic distribution of θ̂_T can then be derived by exploiting the asymptotic properties of K_T(θ_T⁺) and J_T(θ₀), together with the ergodic theorem and the fact that θ_T⁺ → θ₀ almost surely. Writing

∂/∂θ = ∂/∂θ |_{θ=θ₀}, Γ = Γ(λ₀),

it can be shown that

K_T(θ_T⁺) → lim_{T→∞} (1/T) Σ_{i,j,t=1}^T (∂b_ij/∂θ)(∂b_it/∂θ)^⊤ σ₀² Γ_tj = K(θ₀) almost surely.

In order to calculate the asymptotic distribution of θ̂_T we need the expectation and the variance of J_T(θ₀). The calculation of the expectation of J_T(θ₀) follows easily from the unbiasedness of the scoring rule estimating equation (Dawid & Lauritzen (2005) [10]). However, calculation of the variance of J_T(θ₀) is challenging due to the presence of the non-deterministic term

B_i = Σ_{j,t=1}^T (∂b_ij/∂θ) b_it y_j y_t.

It relies on the following calculation:

var(J_T(θ₀)) = (1/T) Σ_{i=1}^T var(B_i)
= (1/T) Σ_{i,j,t,ℓ,h=1}^T (∂b_ij/∂θ) b_it (∂b_iℓ/∂θ) b_ih cov(y_j y_t, y_ℓ y_h)
= (1/T) Σ_{i,j,t,ℓ,h=1}^T (∂b_ij/∂θ) (Γ^{it}/σ₀²) (∂b_iℓ/∂θ) (Γ^{ih}/σ₀²) { cov(y_j, y_ℓ) cov(y_t, y_h) + cov(y_j, y_h) cov(y_t, y_ℓ) + cum(y_j, y_t, y_ℓ, y_h) }
= (1/T) Σ_{i,j,t,ℓ,h=1}^T ( A_{jℓth} + C_{jhtℓ} + D_{jtℓh} ),    (21)

where

A_{jℓth} = (∂b_ij/∂θ) (Γ^{it}/σ₀²) (∂b_iℓ/∂θ) (Γ^{ih}/σ₀²) cov(y_j, y_ℓ) cov(y_t, y_h),
C_{jhtℓ} = (∂b_ij/∂θ) (Γ^{it}/σ₀²) (∂b_iℓ/∂θ) (Γ^{ih}/σ₀²) cov(y_j, y_h) cov(y_t, y_ℓ),
D_{jtℓh} = (∂b_ij/∂θ) (Γ^{it}/σ₀²) (∂b_iℓ/∂θ) (Γ^{ih}/σ₀²) cum(y_j, y_t, y_ℓ, y_h).

The first term in (21) simplifies as

Σ_{i,j,t,ℓ,h=1}^T A_{jℓth} = Σ_{i,j,ℓ=1}^T (∂b_ij/∂θ) Γ^{ii} (∂b_iℓ/∂θ) Γ_jℓ.
(22)

The second term simplifies as

Σ_{i,j,t,ℓ,h=1}^T C_{jhtℓ} = T { ∂/∂θ (γ⁻¹(0)/σ₀²) }².    (23)

The third term in (21), which involves the fourth cumulant, vanishes, since for Gaussian linear processes all the cumulant functions cum_k for k > 2 are zero. Hence var(J_T(θ₀)) is evaluated by considering only the first non-constant term (22). Equation (22) can be rewritten as

Σ_{i,j,t,ℓ,h=1}^T A_{jℓth} = Σ_{i,j,ℓ=1}^T ∂/∂θ (γ⁻¹(i − j)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(i − ℓ)/σ₀²) · γ(j − ℓ),

where γ(j − ℓ) = Γ_jℓ and γ⁻¹(i − j) = Γ^{ij}. Let k = i − j and r = j − ℓ. Without loss of generality, we assume that γ(h) = 0 if |h| > T. Then the previous expression, and consequently the first term in (21), simplifies to

Σ_{k,r=−T}^T (T − max{|k|, |k + r|, |r|}) ∂/∂θ (γ⁻¹(k)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(k + r)/σ₀²) · γ(r).

Hence

lim_{T→∞} Σ_{r,k=−T}^T {(T − max{|k|, |k + r|, |r|})/T} ∂/∂θ (γ⁻¹(k)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(k + r)/σ₀²) · γ(r)
= Σ_{r,k=−∞}^∞ ∂/∂θ (γ⁻¹(k)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(k + r)/σ₀²) · γ(r)
= Σ_{r,k=−∞}^∞ V(r, k, θ₀),    (24)

where

V(r, k, θ₀) = { ∂/∂θ (γ⁻¹(0)/σ₀²) }² + ∂/∂θ (γ⁻¹(k)/σ₀²) · γ⁻¹(0) · ∂/∂θ (γ⁻¹(k + r)/σ₀²) · γ(r).

Combining equations (24) and (23) we obtain

var(J_T(θ₀)) → V,    (25)

where V = Σ_{r,k=−∞}^∞ V(r, k, θ₀). Since J_T(θ₀) depends on the B_i's, which involve the sample autocovariances, it follows from the asymptotic normality of the sample autocovariances of ARMA processes that J_T(θ₀) is also asymptotically normal, with zero mean and variance V. From (20) and (25) we obtain the asymptotic normality of θ̂_T:

√T (θ̂_T − θ₀) →_D N_{m−1}( 0, K(θ₀)⁻¹ V {K(θ₀)⁻¹}^⊤ ).

References

[1] Almeida, M. P. and Gidas, B. (1993). A variational method for estimating the parameters of MRF from complete or incomplete data.
Annals of Applied Probability, 103–136.

[2] Barndorff-Nielsen, O. E. & Cox, D. R. (1994). Inference and Asymptotics. Chapman & Hall, London.

[3] Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78, 1–3.

[4] Brockwell, P. J. & Davis, R. A. (1991). Time Series: Theory and Methods. Springer, New York.

[5] Cattelan, M. & Sartori, N. (2016). Empirical and simulated adjustments of composite likelihood ratio statistics. Journal of Statistical Computation and Simulation, 86, 1056–1067.

[6] Chatfield, C. Inverse autocorrelations. Journal of the Royal Statistical Society, Series A, 142(3), 363–377.

[7] Cleveland, W. S. (1972). The inverse autocorrelations of a time series and their applications. Technometrics, 14(2), 277–293.

[8] Davis, R. A. & Yau, C. Y. (2011). Comments on pairwise likelihood in time series models. Statistica Sinica, 21, 255–277.

[9] Davison, A. C. (2003). Statistical Models. Cambridge University Press, Cambridge.

[10] Dawid, A. P. & Lauritzen, S. L. (2005). The geometry of decision theory. In Proceedings of the Second International Symposium on Information Geometry and its Applications, 22–28. University of Tokyo.

[11] Dawid, A. P. (2007). The geometry of proper scoring rules. Annals of the Institute of Statistical Mathematics, 59, 77–93.

[12] Dawid, A. P., Lauritzen, S. L. & Parry, M. (2012). Proper local scoring rules on discrete sample spaces. The Annals of Statistics, 40, 593–608.

[13] Dawid, A. P. & Musio, M. (2014). Theory and applications of proper scoring rules. Metron, 72, 169–183.

[14] Dawid, A. P., Musio, M. & Ventura, L. (2016). Minimum scoring rule inference. Scandinavian Journal of Statistics, 43(1), 123–134.

[15] Dawid, A. P. & Musio, M. (2015). Bayesian model selection based on proper scoring rules (with discussion). Bayesian Analysis, 479–521.

[16] Dawid, A. P., Musio, M. & Columbu, S. (2017). A note on Bayesian model selection for discrete data using proper scoring rules. Statistics & Probability Letters, 129, 101–106.

[17] Ehm, W. & Gneiting, T. (2012). Local proper scoring rules of order two. The Annals of Statistics, 40, 609–637.

[18] Forbes, P. G. M. & Lauritzen, S. (2015). Linear estimating equations for exponential families with application to Gaussian linear concentration models. Linear Algebra and its Applications, 473, 261–283.

[19] Gilbert, P. & Varadhan, R. (2012). numDeriv: Accurate Numerical Derivatives. http://CRAN.R-project.org/package=numDeriv.

[20] Good, I. J. (1952). Rational decisions. Journal of the Royal Statistical Society, Series B, 14, 107–114.

[21] Hosking, J. R. M. (1980). The asymptotic distribution of the sample inverse autocorrelations of an autoregressive-moving average process. Biometrika, 67(1), 223–226.

[22] Hosking, J. R. M. (1981). Fractional differencing. Biometrika, 68(1), 165–176.

[23] Hyvärinen, A. (2005). Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research, 6, 695–709.

[24] Hyvärinen, A. (2007). Some extensions of score matching. Computational Statistics and Data Analysis, 51, 2499–2512.

[25] Kollo, T. & von Rosen, D. (2005). Advanced Multivariate Statistics with Matrices. Springer, Dordrecht.

[26] Kroese, D. P., Taimre, T. & Botev, Z. I. (2011). Handbook of Monte Carlo Methods. John Wiley & Sons, Hoboken, New Jersey.

[27] Joe, H. & Lee, Y. (2009). On weighting of bivariate margins in pairwise likelihood. Journal of Multivariate Analysis, 100, 670–685.

[28] Le Cessie, S. & Van Houwelingen, J. C. (1994). Logistic regression for correlated binary data. Journal of the Royal Statistical Society, Series C, 43, 95–108.

[29] Mardia, K., Kent, J. & Laha, A. (2016). Score matching estimators for directional distributions. arXiv:1604.08470.

[30] Musio, M. & Dawid, A. P. (2013). Local scoring rules: A versatile tool for inference. In Proceedings of the 59th ISI World Statistics Congress, Hong Kong. http://2013.isiproceedings.org/Files/STS019-P3-S.pdf

[31] Pace, L., Salvan, A. & Sartori, N. (2011). Adjusting composite likelihood ratio statistics. Statistica Sinica, 21, 129–148.

[32] Park, J. & Haran, M. (2017). Bayesian inference in the presence of intractable normalizing functions. Journal of the American Statistical Association, 113(523), 1372–1390.

[33] Parry, M. F., Dawid, A. P. & Lauritzen, S. L. (2012). Proper local scoring rules. The Annals of Statistics, 40, 561–592.

[34] R Core Team (2013). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0.

[35] Shao, S., Jacob, P. E., Ding, J. & Tarokh, V. (2018). Bayesian model comparison with the Hyvärinen score: computation and consistency. Journal of the American Statistical Association, online first, DOI: 10.1080/01621459.2018.1518237.

[36] Takasu, Y., Yano, K. & Komaki, F. (2018). Scoring rules for statistical models on spheres. Statistics & Probability Letters, 111–115.

[37] Varin, C. (2008). On composite marginal likelihoods. AStA Advances in Statistical Analysis, 92, 1–28.

[38] Varin, C., Reid, N. & Firth, D. (2011). An overview of composite likelihood methods. Statistica Sinica, 21, 5–42.

[39] von Rosen, D. (1988). Moments for the inverted Wishart distribution.
Scandinavian Journal of Statistics, 15, 97–109.

Figure 1: Asymptotic relative efficiency of estimators for the AR(1) model (left top panel), for the MA(1) model (right top panel) and for the ARFIMA(0, d, 0) model (left bottom panel).

Table 1: Simulation 1. Estimated mean (Est.), asymptotic standard deviation (sd), and asymptotic relative efficiency (ARE) of estimators of the parameter $\phi$ in the AR(1) model, for $n = 200$, $T = 50$, and varying values of $\phi$. We denote by $\hat{\phi}$ the maximum likelihood estimate, by $\hat{\phi}_{PL}$ the pairwise likelihood estimate, and by $\hat{\phi}_{HT}$ and $\hat{\phi}_{HW}$ the total and the matrix Hyvärinen estimates, respectively.

Table 2: Estimated mean (Est.), asymptotic standard deviation (sd), and asymptotic relative efficiency (ARE) of estimators of the parameter $\alpha$ in the MA(1) model, for $n = 200$, $T = 50$, and varying values of $\alpha$. We denote by $\hat{\alpha}$ the maximum likelihood estimate, by $\hat{\alpha}_{PL}$ the pairwise likelihood estimate, and by $\hat{\alpha}_{HT}$ and $\hat{\alpha}_{HW}$ the total and the matrix Hyvärinen estimates, respectively.

Table 3: Estimated mean (Est.), asymptotic standard deviation (sd), and asymptotic relative efficiency (ARE) of estimators of the parameter $d$ in the ARFIMA model, for $n = 200$, $T = 50$, and varying values of $d$. We denote by $\hat{d}$ the maximum likelihood estimate, by $\hat{d}_{PL}$ the pairwise likelihood estimate, and by $\hat{d}_{HT}$ and $\hat{d}_{HW}$ the total and the matrix Hyvärinen estimates, respectively.