Cholesky-based multivariate Gaussian regression

Thomas Muschinski a,b,∗, Georg J. Mayr b, Thorsten Simon a,b, Achim Zeileis a

a Faculty of Economics and Statistics, Universität Innsbruck, Austria
b Department of Atmospheric and Cryospheric Sciences, Universität Innsbruck, Austria

∗ Corresponding author. Email addresses: [email protected] (Thomas Muschinski), [email protected] (Georg J. Mayr), [email protected] (Thorsten Simon), [email protected] (Achim Zeileis)
Abstract
Multivariate Gaussian regression is embedded into a general distributional regression framework using flexible additive predictors determining all distributional parameters. While this is relatively straightforward for the means of the multivariate dependent variable, it is more challenging for the full covariance matrix Σ due to two main difficulties: (i) ensuring positive-definiteness of Σ and (ii) regularizing the high model complexity. Both challenges are addressed by adopting a parameterization of Σ based on its basic or modified Cholesky decomposition, respectively. Unlike the decomposition into variances and a correlation matrix, the Cholesky decomposition guarantees positive-definiteness for any predictor values regardless of the distributional dimension. Thus, this enables linking all distributional parameters to flexible predictors without any joint constraints that would substantially complicate other parameterizations. Moreover, this approach enables regularization of the flexible additive predictors through penalized maximum likelihood or Bayesian estimation as for other distributional regression models. Finally, the Cholesky decomposition allows the number of parameters to be reduced when the components of the multivariate dependent variable have a natural order (typically time) and a maximum lag can be assumed for the dependencies among the components.

The proposed multivariate Gaussian regression models are illustrated in a weather forecasting application: The temperature in Innsbruck, Austria, at ten lead times between 186 hours (+7.75 days) and 240 hours (+10 days) is modeled jointly, thus leveraging the temporal correlations over the prediction period to obtain more precise and meteorologically consistent forecasts. Both the basic and modified Cholesky parameterizations outperform reference methods based on assuming (i) a constant correlation matrix and (ii) a seasonally-varying first-order autoregressive correlation, respectively.
1. Introduction
Generalized additive models for location, scale and shape (GAMLSS, Rigby and Stasinopoulos, 2005) extend generalized additive models (GAM, Hastie and Tibshirani, 1990) to allow any parametric distribution for the response. Furthermore, each parameter of the response distribution – not just the mean – can be linked to an additive predictor. This class of models is therefore also known as distributional regression (Stasinopoulos et al., 2018). Many different univariate response distributions have been employed in such additive distributional regressions, ranging from zero-inflated and overdispersed count data (Klein et al., 2015b; Simon et al., 2019) to survival analysis (Köhler et al., 2017; Burke et al., 2019) or geoadditive hazards regression (Kneib and Fahrmeir, 2007).

Much fewer applications exist for multivariate response distributions. A notable exception is Klein et al. (2015a), where a bivariate response for childhood undernutrition in India is modeled with a bivariate Gaussian distribution based on the two means (with identity link), variances (with log link) and the correlation (with rhogit link). However, an extension to higher dimensions is not straightforward because linking individual pairwise correlations would not assure that the corresponding prediction of the covariance matrix Σ is positive definite (which in turn is necessary for a well-defined probability density function). Moreover, the number of parameters for Σ increases quadratically with the dimension of the response, thus necessitating some form of regularization for the high model complexity.

We embed multivariate Gaussian regression into the general GAMLSS or distributional regression framework by parameterizing Σ through the entries of its basic or modified Cholesky decomposition (Pourahmadi, 1999), respectively. The resulting parameterizations are unconstrained, meaning that regardless of the values the additive predictors take, the corresponding covariance matrix Σ is guaranteed to be positive definite. This facilitates regularization through penalized maximum likelihood or Bayesian estimation of the regression coefficients because the additive predictors can be regularized separately. Furthermore, the Cholesky parameterizations allow model complexity to be restricted a priori in cases where the response variables are ordered (for example with respect to time or one dimension in space). Namely, a structured covariance of order-r antedependence type (AD-r, Gabriel, 1962; Zimmerman et al., 1998) can be adopted when a maximum lag may be assumed for the autocorrelations.

The remainder of this paper is structured as follows: A brief overview of methods for covariance matrix estimation (without dependence on regressors) in Sec. 2 motivates leveraging the basic and modified Cholesky decompositions of Σ for multivariate Gaussian regression beyond the two-dimensional case in Sec. 3. The proposed model is employed in Sec. 4 for a ten-dimensional weather forecasting illustration, comparing different parameterizations. A discussion of strengths and limitations of our Cholesky-based distributional regression model is found in Sec. 5. Summarizing remarks conclude the paper in Sec. 6.
2. Parameterizations of the covariance matrix
For addressing the challenges in multivariate Gaussian regression described above in Sec. 1, the important first step is to adopt an unconstrained parameterization of the covariance matrix Σ. This not only facilitates estimation of the parameters with standard optimizers and without complicated constraints, it also enables different forms of regularization or restriction of the model complexity. Hence, we review different parameterizations of the covariance matrix proposed in the literature, especially with respect to their suitability in multivariate Gaussian regression. An overview is provided in Table 1.
The covariance Σ of a k-dimensional random variable y from a multivariate Gaussian distribution is a symmetric k × k matrix, containing k · (k + 1)/2 distinct entries. These entries cannot vary independently but must jointly satisfy

z^⊤ Σ z > 0 for all z ≠ 0   (1)

to ensure Σ is positive definite (PD). Only in this case does the corresponding probability density function f(y | µ, Σ) exist:

f(y | µ, Σ) = (1 / √((2π)^k |Σ|)) exp{ −(1/2) (y − µ)^⊤ Σ^{−1} (y − µ) },   (2)

where µ = E(y) is the expectation of y.

To ensure the PD property, joint restrictions for the elements of Σ are necessary. The same is true for two other natural parameterizations, namely the precision matrix Σ^{−1} and the variance-correlation decomposition of Σ. Similarly, a parameterization using the spectral decomposition is interpretable with respect to the eigenstructure of Σ but constraints enter through the orthogonality of the corresponding eigenvectors (Pourahmadi, 2013). When estimating a fixed covariance matrix from empirical observations, some techniques ensure the PD property – e.g., glasso (Friedman et al., 2008) and tapering (Furrer et al., 2006) – while others do not – e.g., hard thresholding (Bickel and Levina, 2008).

Parameterization                No constraints required for PD    Natural interpretation of parameters
Covariance                                                        ✓
Precision                                                         ✓
Variance-correlation (k > 2)                                      ✓
Spectral decomposition
Matrix logarithm                ✓
Cholesky                        ✓
Modified Cholesky               ✓                                 ✓

Table 1: Possible parameterizations of the covariance matrix in multivariate Gaussian distributions, along with properties that are crucial for linking the covariance structure to regressor variables in a distributional regression setup. The (modified) Cholesky decomposition is particularly appealing as its derivation is less burdensome than the matrix logarithm.

Ensuring a PD Σ becomes even more difficult in the context of regression when the parameters underlying the covariance matrix should be linked to regressor variables. Here, it is particularly beneficial to employ a parameterization which ensures PD without requiring joint constraints and then to combine this with link functions that map the parameters to the real line. The simplest illustration for this is the case of a univariate Gaussian distribution (i.e., k = 1) with variance σ². To assure positivity, typically a log link is used, mapping the set of positive real numbers to an unrestricted predictor (Stasinopoulos et al., 2018). Another notable case is the bivariate Gaussian distribution (i.e., k = 2) where the variance-correlation decomposition can be adopted with log links for the two variances and a rhogit link for the correlation parameter in the interval (−1, 1) (Klein et al., 2015a).
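To see concretely why element-wise links on pairwise correlations break down beyond the bivariate case, consider the following small numerical illustration (ours, in R, the language of the software accompanying this paper): each pairwise correlation lies within (−1, 1), yet the resulting matrix violates Eq. 1.

## Three pairwise correlations, each individually valid ...
P <- matrix(c(1.0,  0.9,  0.9,
              0.9,  1.0, -0.9,
              0.9, -0.9,  1.0), nrow = 3, byrow = TRUE)

## ... but jointly invalid: one eigenvalue is negative, so P is not
## positive definite and cannot be a correlation matrix.
eigen(P, only.values = TRUE)$values
det(P)  ## -2.888 < 0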
It is also possible to extend the log-link idea to k > 2 by employing the matrix logarithm A = log Σ with unconstrained entries (Pourahmadi, 2013). However, the disadvantages are that (i) the parameters in A have no natural interpretation and (ii) the matrix logarithm involves a Taylor series expansion that is rather burdensome to compute.

A mathematically and computationally more appealing approach that also yields an unconstrained parameterization is using the Cholesky decomposition of Σ. Any Σ can be uniquely decomposed as the product of a positive-diagonal lower triangular matrix L with its transpose L^⊤:

Σ = LL^⊤,   Σ^{−1} = (L^{−1})^⊤ L^{−1}.   (3)

Subsequently the precision matrix Σ^{−1} results from a product based on the inverse Cholesky factor L^{−1}. Both L and L^{−1} offer unconstrained parameterizations of Σ. Although neither the individual parameters in L nor those in L^{−1} are easily interpretable, the latter matrix as a whole has an elegant interpretation. If y ∼ N(µ, Σ), multiplication with L^{−1} can be used to uncorrelate y: L^{−1}(y − µ) ∼ N(0, I).

To obtain parameters that are not only unconstrained but individually interpretable, Pourahmadi (1999) suggests a modified Cholesky decomposition that diagonalizes L^{−1}:

Σ^{−1} = T^⊤ D^{−1} T.   (4)

In setups where the k components of y have a natural order (e.g., longitudinal data), the entries of the matrices T and D are related to the autoregressive structure of y ∼ N(µ, Σ). The elements of the lower triangular matrix T are determined by the parameters φ_ij (i < j) – corresponding to autocorrelation coefficients – and the elements of the diagonal matrix D are denoted as ψ_i (i = 1, ..., k) – corresponding to the innovation variances:

ŷ_j = µ_j + ∑_{i=1}^{j−1} φ_ij · (y_i − µ_i) for j = 2, ..., k,   (5)
ψ_i = var(y_i − ŷ_i) for i = 1, ..., k.   (6)

These intuitive interpretations of the parameters φ_ij and ψ_i facilitate regularization of the high model complexity, particularly when k is large. Suggestions from the literature include: using lasso penalties on the φ_ij (Levina et al., 2008); approximating the elements of T and D by low-order polynomials (Pourahmadi, 1999, 2000; Pan and Pan, 2017); or cutting off the autocorrelation coefficients at a maximum lag of r (Wu and Pourahmadi, 2003), i.e., setting φ_ij = 0 for higher lags j − i > r. The latter approach thus yields a banded T matrix, corresponding to a so-called order-r antedependence (AD-r, Gabriel, 1962; Zimmerman et al., 1998). Note that since T and L^{−1} (Eq. 3) share the same pattern of zeros, AD-r covariances can be modeled using both the modified and basic Cholesky parameterizations, although the individual elements of L^{−1} are not directly interpretable as autocorrelation coefficients.

In summary, both Cholesky-based parameterizations are appealing candidates for a distributional multivariate Gaussian regression approach. They are relatively easy to compute, yield an unconstrained parameterization that still ensures PD covariances, can be regularized using penalized estimation, and can additionally be restricted to an AD-r antedependence if the k components are autocorrelated with a plausible maximum lag of r. The modified Cholesky decomposition has the advantage that individual parameters are interpretable while the basic Cholesky is slightly easier to compute.
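The relationships in Eqs. 3–6 are easily verified numerically. The following sketch (ours, base R only) recovers the basic Cholesky factor, the inverse factor, and the modified Cholesky quantities ψ_i and φ_ij from an arbitrary positive definite Σ; setting the entries of either factor to zero beyond a band of width r would yield the AD-r structure discussed above.

set.seed(42)
k <- 4
A <- matrix(rnorm(k^2), k, k)
Sigma <- crossprod(A) + diag(k)   ## an arbitrary positive definite covariance

## Basic Cholesky decomposition (Eq. 3): Sigma = L L' with L lower triangular.
L <- t(chol(Sigma))               ## chol() returns the upper triangular factor
Linv <- solve(L)                  ## inverse Cholesky factor
all.equal(Sigma, L %*% t(L))                  ## TRUE
all.equal(solve(Sigma), t(Linv) %*% Linv)     ## TRUE

## Modified Cholesky decomposition (Eq. 4), via L^{-1} = D^{-1/2} T:
psi  <- 1 / diag(Linv)^2          ## innovation variances (diagonal of D)
Tmat <- diag(sqrt(psi)) %*% Linv  ## unit lower triangular T
phi  <- -t(Tmat)                  ## above-diagonal entries phi[i, j], i < j,
                                  ## are the autoregressive coefficients of Eq. 5
all.equal(solve(Sigma), t(Tmat) %*% diag(1 / psi) %*% Tmat)  ## TRUE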
3. Cholesky-based multivariate Gaussian regression
We propose to establish a new flexible multivariate Gaussian regression model by combining the distributional regression approach (e.g., Klein et al., 2015b; Umlauf et al., 2018) with the Cholesky-based parameterizations (Pourahmadi, 1999, 2013) of the covariance matrix discussed in Sec. 2.
3.1. Distributional regression

The idea in distributional regression (e.g., Klein et al., 2015b; Umlauf et al., 2018) is to adopt some parametric distribution for the response variable y and then to link each of the distributional parameters to a separate flexible additive predictor η, typically using a monotone function mapping the support of each parameter to the unrestricted real values of the predictor η. The predictors can combine additively various effects based on some regressor variable(s) x. These comprise simple linear terms x · β but also nonlinear effects based on splines s(x), varying coefficients, random intercepts, or spatial effects. Rather than explicitly listing all common types of terms, we refer to the literature on GAMs (Wood, 2017), GAMLSSs (Rigby and Stasinopoulos, 2005), or Bayesian versions thereof (Umlauf et al., 2018).

3.2. Multivariate Gaussian regression

Multivariate Gaussian regression is embedded into this general distributional regression framework by assuming that the response y is a length-k vector following a k-dimensional Gaussian distribution

y ∼ N(µ, Σ),   (7)

with probability density function provided in Eq. 2. All parameters of N – the k components of µ and the k · (k + 1)/2 parameters determining Σ – are modeled through flexible additive predictors. For the k means in µ this is straightforward as these parameters are unconstrained and may take any real value. Therefore, we simply link them to the corresponding additive flexible predictors using the identity function:

µ_i = η_{µ,i},   i = 1, ..., k.   (8)

In contrast, as already argued in Sec. 2, it is not possible to simply link the k · (k + 1)/2 parameters determining Σ to unconstrained additive predictors in the same way.

3.3. Basic Cholesky parameterization

In the basic Cholesky parameterization, Σ is defined through the k diagonal elements λ_ii > 0 and the k · (k − 1)/2 off-diagonal elements λ_ij of the inverse Cholesky factor:

(L^{−1})^⊤ =
[ λ_11  λ_12  λ_13  ···  λ_1k ]
[ 0     λ_22  λ_23  ···  λ_2k ]
[ 0     0     λ_33  ···  λ_3k ]
[ ⋮     ⋮     ⋮     ⋱    ⋮    ]
[ 0     0     0     ···  λ_kk ],   where Σ = LL^⊤.   (9)

Restricting the diagonal elements to be positive ensures a unique decomposition, motivating the use of a log link on these parameters, while the off-diagonal elements may take any real value so that an identity link can be used:

log(λ_ii) = η_{λ,ii}, where i = 1, ..., k,   (10)
λ_ij = η_{λ,ij}, where i = 1, ..., k − 1, and j = i + 1, ..., k.   (11)

Modeling the elements of the inverse Cholesky factor L^{−1} is motivated by the following considerations: (i) Unlike for the parameterization based on L, no computationally intensive matrix inversions are required during model estimation. (ii) There is an autoregressive interpretation for parameter values equal to zero.

In some situations it may not be necessary to model all k · (k − 1)/2 off-diagonal elements. If the components of y have a natural order (e.g., longitudinal data) and a maximum lag in the autocorrelations is reasonable, then an order-r antedependence (AD-r, Gabriel, 1962; Zimmerman et al., 1998) model can be employed. This sets all λ_ij = 0 with j − i > r. For large k and small r this yields a significant reduction in model complexity.

3.4. Modified Cholesky parameterization

Alternatively, the modified Cholesky decomposition of Pourahmadi (1999) is obtained by diagonalizing (L^{−1})^⊤ from Eq. 9 for the basic decomposition so that Σ^{−1} = T^⊤ D^{−1} T. The new parameters are those contained in the diagonal matrix D and the upper unitriangular T^⊤:

D =
[ ψ_1  0    ···  0   ]
[ 0    ψ_2  ···  0   ]
[ ⋮    ⋮    ⋱    ⋮   ]
[ 0    0    ···  ψ_k ],   T^⊤ =
[ 1  −φ_12  −φ_13  ···  −φ_1k ]
[ 0   1     −φ_23  ···  −φ_2k ]
[ 0   0      1     ···  −φ_3k ]
[ ⋮   ⋮      ⋮     ⋱     ⋮    ]
[ 0   0      0     ···    1   ].   (12)

The ψ_i in D and the φ_ij in T^⊤ are called the innovation variances and generalized autoregressive parameters of y, respectively. They have meaningful interpretations when the components of y have a natural order. Analogously to the basic Cholesky parameterization, a log link is used for the innovation variances to ensure positive definiteness while the generalized autoregressive parameters may take any real values:

log(ψ_i) = η_{ψ,i}, where i = 1, ..., k,   (13)
φ_ij = η_{φ,ij}, where i = 1, ..., k − 1, and j = i + 1, ..., k.   (14)

Again, it is possible to reduce model complexity when an AD-r model can be assumed. Similar to the basic Cholesky parameterization, this sets all φ_ij = 0 with j − i > r.

3.5. The log-likelihood and its derivatives

By rearranging the probability density function of the multivariate Gaussian distribution (Eq. 2) we obtain the likelihood of the distributional parameters for an observation vector y. For mathematical ease, we work with the log-transformed likelihood:

ℓ(µ, L^{−1} | y) = −(k/2) log(2π) + log(|L^{−1}|) − (1/2) (y − µ)^⊤ (L^{−1})^⊤ L^{−1} (y − µ).   (15)

Computationally efficient likelihood-based model estimation, be it with a frequentist or Bayesian approach, requires derivatives of the log-likelihood with respect to the additive predictors. We derive analytical solutions (Appendix A and Appendix B) for the first and second partial derivatives of ℓ with respect to all η_∗. The first derivatives in the basic parameterization are found to be

∂ℓ/∂η_{µ,i} = ∑_{j=1}^{k} ς_ij ỹ_j,
∂ℓ/∂η_{λ,ii} = 1 − λ_ii ỹ_i ∑_{m=1}^{i} (ỹ_m λ_mi),
∂ℓ/∂η_{λ,ij} = −ỹ_i ∑_{m=1}^{j} (ỹ_m λ_mj),   (16)

where ỹ = y − µ and ς_ij = (Σ^{−1})_ij. The corresponding second derivatives are

∂²ℓ/∂η²_{µ,i} = −ς_ii = −∑_{j=i}^{k} λ²_ij,
∂²ℓ/∂η²_{λ,ii} = −2 λ²_ii ỹ²_i − λ_ii ỹ_i ∑_{m=1}^{i−1} (ỹ_m λ_mi),
∂²ℓ/∂η²_{λ,ij} = −ỹ²_i.   (17)

These are always negative, which means likelihood-based estimation of the proposed regression model is a convex optimization problem. The same is true for the modified Cholesky parameterization (Appendix B).

3.6. Estimation and regularization

Distributional regression models may be estimated either in a frequentist (e.g., Rigby and Stasinopoulos, 2005) or Bayesian (e.g., Umlauf et al., 2018) framework. Nonlinear terms in the models for each distributional parameter are identified through different approaches such as penalized maximum likelihood (Wood, 2017) or gradient boosting (Mayr et al., 2012). In a Bayesian setting, estimates for model parameters are not fixed, but instead take the form of probability distributions. Typically these do not have closed forms and may instead be approximated using Markov chain Monte Carlo (MCMC) sampling. In the following application (Sec. 4) we adopt the Bayesian approach, as it facilitates quantifying the estimated effects using credible intervals derived from the MCMC samples.

Regularization is required to limit the high complexity of the multivariate Gaussian regression models. Their complexity is influenced by (i) the number of distributional parameters and (ii) how flexibly these are modeled on predictors. For the latter, maximizing the log-likelihood subject to penalties on model parameters tends to shrink these towards zero and smooth any nonlinear functions.

In Sec. 4, model parameter estimates used to initialize MCMC sampling are identified using a backfitting algorithm (bfit in bamlss, see Sec. 3.7). This allows nonlinear effects based on cubic regression splines (Wood, 2017) to be estimated efficiently using the chain rule and analytical derivatives of the log-likelihood with respect to predictors (Sec. 3.5). New optimal smoothing variances are identified in each iteration of the backfitting algorithm, thereby regulating the trade-off between under- and overfitting for each individual effect. For a more thorough description the reader is referred to Sec. 4.2 of Umlauf et al. (2018) and Appendix A.2 of Rigby and Stasinopoulos (2005).

3.7. Software

The multivariate Gaussian regression models with basic and modified Cholesky parameterizations are implemented as distributional families compatible with the R package bamlss (Umlauf et al., 2018). Currently the families are available as an additional package mvnchol hosted on the Gitlab server of Universität Innsbruck with public access, https://git.uibk.ac.at/c4031039/mvnchol. In the future, we plan to migrate them to bamlss.
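As a numerical sanity check of the derivations in Sec. 3.5, the log-likelihood of Eq. 15 can be evaluated directly through the inverse Cholesky factor, and the analytical derivatives of Eq. 16 compared against finite differences. The sketch below is ours; the values of mu, Linv and y are arbitrary illustrative choices.

## Log-likelihood of Eq. 15, evaluated via the inverse Cholesky factor Linv.
loglik <- function(y, mu, Linv) {
  z <- Linv %*% (y - mu)   ## uncorrelated residuals, z ~ N(0, I)
  -0.5 * length(y) * log(2 * pi) + sum(log(diag(Linv))) - 0.5 * sum(z^2)
}

set.seed(1)
k <- 3
mu <- rnorm(k)
Linv <- diag(exp(rnorm(k)))                      ## positive diagonal (log link)
Linv[lower.tri(Linv)] <- rnorm(k * (k - 1) / 2)  ## unconstrained off-diagonals
y  <- rnorm(k)
yt <- y - mu
z  <- as.vector(Linv %*% yt)

## Analytical derivative wrt eta_{lambda,11} = log(lambda_11), cf. Eq. 16:
analytic <- 1 - Linv[1, 1] * yt[1] * z[1]

## Finite-difference approximation of the same derivative:
eps <- 1e-6
Linv2 <- Linv
Linv2[1, 1] <- exp(log(Linv[1, 1]) + eps)
numeric <- (loglik(y, mu, Linv2) - loglik(y, mu, Linv)) / eps
all.equal(analytic, numeric, tolerance = 1e-4)   ## TRUE (up to discretization error)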
4. An application in probabilistic weather forecasting
Numerical weather prediction (NWP) models attempt to forecast future atmospheric states through numerical integration of the governing physical equations. They have systematic and random errors which result in part from sparse or asynchronous initial observations and a kilometers-resolution model grid and which vary with location, season, time of day and weather situation (Bauer et al., 2015). To quantify and correct these errors the raw NWP output is commonly postprocessed by statistical models (Gneiting et al., 2007).

Weather services run a set of NWP models to assess the uncertainty of NWPs arising from imperfect initial conditions and unresolved atmospheric processes. Such a set of NWP realizations is called an ensemble (Bauer et al., 2015). Ensembles consist of multiple NWP model runs, each with perturbed initial conditions and/or differing numerical atmospheric representations (Leutbecher and Palmer, 2008). Their purpose is to characterize the uncertainty of the numerical prediction.

Distributional regression has become a popular method to postprocess NWP ensembles, but its state of the art is limited to univariate (e.g., Gneiting et al., 2005, and many others thereafter) or bivariate responses (Pinson, 2012; Schuhen et al., 2012; Lang et al., 2019). Some meteorological applications require not just univariate or bivariate, but higher-dimensional joint probability forecasts – across several quantities, locations, or lead times (Feldmann et al., 2015; Worsnop et al., 2018; Schoenach et al., 2020). These joint probability forecasts are influenced by dependencies among the univariate prediction errors. Until now these dependencies have not been explicitly modeled on predictors, but rather reconstructed from the empirical NWP ensemble structure or historical observations (e.g., Schefzik et al., 2013).

Here, Cholesky-based multivariate Gaussian regression models are used to postprocess 10-dimensional joint probability forecasts. The response vector contains the temperature at Innsbruck for ten sequential future lead times. The basic and modified Cholesky parameterizations are employed with and without assumptions on the covariance structure. Cholesky-based models outperform references assuming (i) constant or (ii) seasonally-varying first-order autoregressive correlation matrices. Further discussion, including a comparison with the empirical ensemble approach of Schefzik et al. (2013), is provided in Sec. 5.
4.1. Data

Predictor variables are obtained from the Global Ensemble Forecast System (GEFS, Hamill et al., 2013). Two-meter temperature forecasts from the 11 GEFS members are available at a spatial resolution of approximately 70 km and a temporal resolution of 6 hours. The ensemble forecasts are bilinearly interpolated to the spatial coordinates of Innsbruck, Austria, for the ten lead times between 186 hours (+7.75 days) and 240 hours (+10 days). The dataset contains ensemble forecasts started at 00 UTC of 1798 days (5 years).

Three types of predictor variables are derived from the GEFS (see Table 2). Temperature is often approximated by a Gaussian distribution and postprocessed using the ensemble mean and standard deviation (Gneiting et al., 2005; Gebetsberger et al., 2019). Therefore the mean of the ensemble forecast for each lead time i, mean_i, is used as a predictor variable for the corresponding distributional mean component. The log-transformed standard deviations of the ensemble forecasts logsd_i are included to represent the ensemble forecast uncertainty. Seasonal variability in subgrid-scale atmospheric processes is accounted for through the day of the year yday of the GEFS run initialization.

Variable   Description
mean_i     Ensemble mean temperature forecast for lead time i
logsd_i    Logarithm of ensemble standard deviation for lead time i
yday       Day of year (to capture seasonal variations)

Table 2: Predictor variables used to model multivariate Gaussian parameters for the postprocessing application. The placeholder i stands for one of ten lead times (+7.75d, +8d, ..., +10d).

Model name              Section   No. of covariance parameters
                                  Flexible   Intercept   Zero
Basic Cholesky          4.2.1     55         0           0
Modified Cholesky       4.2.1     55         0           0
Basic Cholesky AD5      4.2.2     45         0           10
Modified Cholesky AD5   4.2.2     45         0           10
AR1                     4.2.3     11         0           0
Constant correlation    4.2.3     10         45          0

Table 3: A 10-dimensional error covariance matrix Σ has 55 degrees of freedom. The regression models compared employ different parameterizations of Σ. Depending on the parameterization, parameters may either be modeled on predictors, estimated as intercepts or restricted to zero a priori.
4.2. Model specifications

The observed temperature obs at Innsbruck is modeled for ten sequential lead times – every 6 hours between 7.75 and 10 days in the future – by a 10-dimensional Gaussian distribution

y = (obs_{+7.75d}, ..., obs_{+10d})^⊤ ∼ N(µ, Σ).   (18)

The distributional parameters of N are linked to flexible additive predictors containing mean_i, logsd_i and yday. In all regressions, the ten mean parameters are modeled in the same way:

µ_i = s_{1,i}(yday) + s_{2,i}(yday) · mean_i.   (19)

These are linear models of the ensemble mean forecasts, but with seasonally varying intercepts and slopes estimated by nonlinear cyclical splines s_{1,i} and s_{2,i}, respectively. The regressions differ in how the covariance matrix is parameterized and subsequently how flexibly it may be modeled (Table 3).

4.2.1. Cholesky parameterizations with flexible Σ

With the basic Cholesky parameterization, all 55 covariance-specifying parameters may be linked to predictors. In the following, this model is referred to as basic Cholesky:

log(λ_ii) = g_{1,i}(yday) + g_{2,i}(yday) · logsd_i,
λ_ij = h_ij(yday).   (20)

Again, g_{1,i} and g_{2,i} are nonlinear cyclical functions of the year day, but this time the linear models relate the log-transformed ensemble standard deviations with the diagonal elements λ_ii of the Cholesky factor. Seasonal variations are permitted for the off-diagonal elements λ_ij and approximated by cyclical splines h_ij.

The modified Cholesky parameterization can be employed analogously by replacing λ_ii with ψ_i, and λ_ij with φ_ij. The mean parameters and the diagonal elements λ_ii/ψ_i have seasonally varying linear dependencies on the corresponding ensemble mean and its log-transformed standard deviation, respectively. Subsequently, this is referred to as the modified Cholesky model.

4.2.2. Cholesky parameterizations with assumed structure for Σ

Motivated by a seasonal autoregressive model, an antedependence model of order 5 (AD-5) may be adopted for the covariance structure. Namely, combining autocorrelations for lag 1 (previous lead time, 6 hours ago) and lag 4 (previous day, 24 hours ago) in a multiplicative way would lead to autocorrelations up to lag 5 and these are captured here by the AD-5 specification. Thus, only the diagonal and off-diagonal parameters with lags at most 5 are modeled for the covariance as in Eq. 20 and higher lag parameters are fixed at zero. In the following, this model is referred to as basic Cholesky AD5 when it is based on the basic Cholesky parameterization:

λ_ij = h_ij(yday) if j − i ≤ 5,   λ_ij = 0 if j − i > 5.   (21)

Restricting Σ to AD-r has the advantage that the number of covariance parameters increases linearly with the dimension k rather than quadratically, as with unstructured covariances. For the modified Cholesky parameterization, the λ_ii are again replaced by ψ_i and λ_ij by φ_ij. This model is referred to as modified Cholesky AD5.

4.2.3. Reference models

The Cholesky-based multivariate Gaussian regression models are compared to two reference methods which parameterize Σ through its standard deviations σ_i and correlations ρ_ij. In both reference models, standard deviations are linked to the same additive predictors as were the diagonal elements of the basic and modified Cholesky decompositions. Again a log link is required to ensure the estimated parameters are positive:

log(σ_i) = g_{1,i}(yday) + g_{2,i}(yday) · logsd_i.   (22)

The problem with a variance-correlation parameterization is that positive definiteness is not generally guaranteed when linking all correlations to predictors (Sec. 2).
It is possible though to estimate a model where the correlation matrix P is assumed to follow a first-order autoregressive structure:

P =
[ 1        ρ        ρ²       ···  ρ^{k−1} ]
[ ρ        1        ρ        ···  ρ^{k−2} ]
[ ρ²       ρ        1        ···  ρ^{k−3} ]
[ ⋮        ⋮        ⋮        ⋱    ⋮       ]
[ ρ^{k−1}  ρ^{k−2}  ρ^{k−3}  ···  1       ].   (23)

As a result, P is determined by one single parameter ρ instead of k · (k − 1)/2 individual correlations. Positive definiteness only requires |ρ| < 1, which allows us to model a seasonally varying P, but comes at the cost of very inflexible assumptions about the covariance across the k = 10 lead times. This model is referred to as AR1 and can be denoted by

rhogit(ρ) = h(yday),   (24)

where rhogit is a link function mapping the range (−1, 1) to the unrestricted predictor.

Finally, we compare the Cholesky-based parameterizations against another alternative model with constant correlation for each element i, j. In terms of the distributional regression model this means that every correlation parameter is modeled as an intercept only:

rhogit(ρ_ij) = intercept_ij.   (25)

Thus, unlike all previous specifications considered above, the correlation structure remains fixed and does not change across the days of the year.
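To connect these model specifications with the software from Sec. 3.7, the following schematic sketch shows how predictors like those in Eqs. 19–20 could be passed to bamlss. It is illustrative only: the family constructor mvnchol_bamlss() and the parameter names (obs1, lamdiag1, lambda12, ...) are assumptions about the mvnchol interface and should be checked against the package documentation; d stands for a hypothetical data frame holding the observations and the GEFS-derived predictors.

library("bamlss")
library("mvnchol")  ## https://git.uibk.ac.at/c4031039/mvnchol (interface assumed)

f <- list(
  ## Mean (Eq. 19): seasonally varying intercept and slope on the ensemble
  ## mean, both as cyclic splines in the day of the year.
  obs1 ~ s(yday, bs = "cc") + s(yday, bs = "cc", by = mean1),
  ## Diagonal element lambda_11 (Eq. 20, log link): slope on log ensemble sd.
  lamdiag1 ~ s(yday, bs = "cc") + s(yday, bs = "cc", by = logsd1),
  ## Off-diagonal element lambda_12 (Eq. 20): cyclic seasonal effect only.
  lambda12 ~ s(yday, bs = "cc")
  ## ... analogous formulas for the remaining means and covariance parameters.
)

b <- bamlss(f, family = mvnchol_bamlss(k = 10, type = "basic"), data = d)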
Figure 1: Selected nonlinear effects estimated by the modified Cholesky model. Shaded regions indicate 95% credible intervals obtained from MCMC sampling. Upper row: seasonally varying intercepts for the means (left) and log-variances (center) and autoregressive parameters for lag 1 (6 h, right). Lower row: seasonally varying slopes for the means (left) and log-variances (center) and autoregressive parameters for lag 4 (24 h, right). Red colors indicate forecasts valid for daytime, blue ones for nighttime.
4.3. Estimated effects and predictions

To highlight the flexibility of the Cholesky-based regression models, a selection of the nonlinear effects estimated by the modified Cholesky model are presented in Figure 1.

The functions s_{1,i} and s_{2,i} in Eq. 19 can be thought of as seasonally varying intercepts and slopes in linear models relating ensemble and distributional means. The slopes s_{2,i} are significantly less than the value of 1 expected for a perfect NWP model (where the ensemble means/variances would directly correspond to the observed means/variances). This means that the GEFS forecast mean_i contains limited additional information about the true temperature compared to that inherent in yday. Therefore the intercepts s_{1,i} begin to approximate a temperature climatology, with summer maxima approximately 15 degrees higher than winter minima (Fig. 1).

For the innovation variances, g_{1,i} and g_{2,i} can again be thought of as seasonally varying intercepts and slopes in linear models. This time though, they relate the log-transformed standard deviations of the ensemble logsd_i to the log-transformed innovation variances log(ψ_i). Since the GEFS means did not contain much information about the distributional means, it comes as no surprise that the logsd_i are even less valuable predictors. The slopes g_{2,i} average close to zero throughout the year. The intercepts g_{1,i} have significant seasonal variations for nighttime (blue) but not for daytime (red).

The effects h_ij = φ_ij can be directly interpreted as seasonal variations of the generalized autoregressive parameters and paint a complex picture. For some combinations of i and j the estimated seasonal variations are significant and for others they are not, with no simple dependency on lag j − i or index i.

Once all s_⋆, g_⋆ and h_⋆ have been estimated, the predicted mean µ̂ and covariance Σ̂ are determined only by the NWP-derived variables yday, mean_⋆ and logsd_⋆. In Figure 2 we visualize forecasts from the model for two days in 2015: one in winter (top) and one in fall (bottom). In the left panels, simulated temperature vectors across all ten lead times are shown in gray along with the actual observations in black, showing that the mean pattern is approximated reasonably well albeit with relatively large variance (due to the long lead times). In the right panels, the associated correlation matrices P̂ are depicted.

Figure 2: Predictions from the modified Cholesky model for µ and Σ given values for yday, mean, and logsd for two specific days: 2015-01-03 (in winter, top) and 2015-10-10 (in fall, bottom). Left: vectors containing simulated forecasts for the ten lead times from the predicted distribution are depicted by thin gray lines with true observations shown as thick black circles and lines. Right: heat maps depicting the corresponding estimated correlation matrices P̂.

Clearly the correlation is not constant throughout the year – as assumed in the constant correlation model – but differs substantially between winter and fall. First, correlations are generally higher in winter but also the pattern of correlations is much more complex in fall. In winter a first-order autoregressive process – as assumed in the AR1 model – might fit reasonably well.
However, in the fall, this is not the case and instead there are large diurnal variations in correlations for a given lag. For example, forecast errors at 6 UTC in the morning (e.g., +8.25d, +9.25d) have little influence on the subsequent daytime predictions. This is not the case in wintertime, where correlations are less variable for a given lag.
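The gray trajectories in Figure 2 can be produced from any predicted µ̂ and Σ̂ by the standard Cholesky sampling mechanism. A minimal sketch (ours; mu_hat and Sigma_hat stand for predictions from the fitted model):

## Simulate n joint temperature trajectories from N(mu_hat, Sigma_hat);
## each column of the result is one simulated 10-dimensional forecast path.
simulate_paths <- function(n, mu_hat, Sigma_hat) {
  L <- t(chol(Sigma_hat))   ## lower Cholesky factor of the predicted covariance
  replicate(n, mu_hat + as.vector(L %*% rnorm(length(mu_hat))))
}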
4.4. Verification

It is evident that Cholesky-based regressions allow Σ to be modeled flexibly based on the additive predictors. Another question is whether this increased flexibility improves the quality of the postprocessed joint probability forecasts. As the true distributions are unknown, the quality of the predicted distributions is evaluated using the Dawid-Sebastiani score (DSS, Dawid and Sebastiani, 1999; Gneiting and Raftery, 2007). The DSS is a popular multivariate score in postprocessing and linearly related to the log-likelihood of the predicted distributional parameters for a given observation vector. Scores are evaluated out of sample using five-fold cross-validation.
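For a multivariate Gaussian prediction the DSS has a simple closed form, DSS(µ̂, Σ̂; y) = log|Σ̂| + (y − µ̂)^⊤ Σ̂^{−1} (y − µ̂), which equals −2ℓ up to an additive constant (Gneiting and Raftery, 2007). A minimal implementation (ours):

## Dawid-Sebastiani score of a Gaussian prediction N(mu, Sigma) for an
## observation y; negatively oriented, i.e., smaller values are better.
dss <- function(y, mu, Sigma) {
  yt <- y - mu
  as.numeric(determinant(Sigma, logarithm = TRUE)$modulus +
             t(yt) %*% solve(Sigma, yt))
}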
Figure 3: Differences in Dawid-Sebastiani score (DSS) to the reference constant correlation model, aggregated by year and month. Positive values mean the given model outperforms the reference. The half circle at the left figure edge indicates that not the entire boxplot for AR1 is shown (minimum value of −…).

Scores for each method are aggregated by year and month and differences calculated relative to the reference constant correlation model (Fig. 3). All Cholesky models perform better than the constant correlation model (vertical line at zero) and much better than the AR1 model. The models employing the basic parameterization (basic Cholesky and basic Cholesky AD5) are better than the constant correlation model in 75% of months. The modified Cholesky models are slightly worse, but still outperform the constant correlation model most of the time.
5. Discussion
Cholesky-based multivariate Gaussian regression offers a way to flexibly model both the mean and full error covariance matrix of a vector response. First, the utility for postprocessing multivariate numerical weather predictions is discussed in Sec. 5.1. Second, despite what one would expect from fixed covariance estimation, the Cholesky-based regressions do not appear to be very sensitive in this application with respect to changes in the response vector ordering (Sec. 5.2). Third, limitations of the proposed regression framework along with potential remedies are considered in Sec. 5.3.
5.1. Utility for NWP postprocessing

In state-of-the-art NWP postprocessing, joint probability forecasts are typically obtained through ensemble copula coupling (ECC, Schefzik et al., 2013). In ECC the margins of an NWP ensemble are calibrated through univariate postprocessing while retaining the ensemble's order statistics. For this application, ECC performed much worse than all other models (median DSS difference of 6 compared to the constant correlation model). This is mainly due to the GEFS as a whole having poor predictive skill for forecasts more than a week in advance.

ECC may perform more favorably at shorter lead times, but even here there is a limit to how well tens of ensemble members can possibly capture multivariate dependencies with dimensions of the same order. Additionally it is quite a strong assumption that the ensemble order statistics should reflect error dependencies across the postprocessed univariate forecasts. Cholesky-based multivariate Gaussian regression is freed from these assumptions and may also be applied when no ensemble (and just a direct forecast) is available.
5.2. Sensitivity to response vector ordering

A known limitation of the modified Cholesky decomposition for fixed covariance estimation is its sensitivity to order (Pourahmadi, 2013). Many regularization techniques impose an assumed structure on the parameters which would be changed through rearranging the variables. Somewhat surprisingly, we find that for this 10-dimensional NWP application the unstructured Cholesky models are quite insensitive to random permutations of the variables (Fig. 4). This may be because, in contrast to fixed covariance estimation, in the context of a regression no structure is imposed on the distributional parameters. Instead, models for individual parameters are regularized through a combined penalized likelihood maximization.
Figure 4: Differences in Dawid-Sebastiani score (DSS) to constant correlation as in Fig. 3. The five reorderings have the same model setup as the basic Cholesky, but are estimated after random permutations of the variable order.
5.3. Limitations and potential remedies

Model complexity is still manageable for this 10-dimensional application, but even here there are 65 distributional parameters and more than 500 model parameters to estimate from data with a sample size of n = 1798. A fully flexible parameterization becomes computationally demanding long before k = 100, where 5150 distributional parameters would need to be modeled. For very large k it is also not sufficient to reduce complexity just by assuming Σ is AD-r.

When there is a natural order to the variables, very parsimonious covariance parameterizations can be obtained by enforcing structure among the Cholesky parameters. Pourahmadi (1999) for example assumes polynomial dependencies among the innovation variances and autoregressive parameters. Σ is then subsequently defined through the coefficients of these polynomials. The polynomial coefficients could be modeled on predictors in place of the Cholesky parameters. Alternatively, smooth nonlinear functions may be used to approximate the parameter structure. Reparameterizations of this sort would extend the applicability of multivariate Gaussian regression to much higher dimensions.
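The complexity reduction from the AD-r restriction (Sec. 4.2.2) can be quantified directly. Counting k means, k diagonal elements, and ∑_{l=1}^{r} (k − l) off-diagonal elements gives the totals below (our helper functions):

## Number of distributional parameters: fully flexible vs. AD-r restricted.
n_full <- function(k) k * (k + 3) / 2
n_adr  <- function(k, r) 2 * k + r * k - r * (r + 1) / 2

n_full(10)     ## 65, as in the application above
n_adr(10, 5)   ## 55 under the AD-5 restriction
n_full(100)    ## 5150
n_adr(100, 5)  ## 685: growth is linear rather than quadratic in k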
6. Conclusions
We have developed regression models for a multivariate Gaussian response, where all distributional parameters may be linked to flexible additive predictors. Modeling the mean components of the multivariate dependent variable in such cases is no different from the univariate case, but it becomes difficult to ensure the covariance matrix is positive definite for dimensions greater than two. Common parameterizations such as variances and a correlation matrix require joint constraints among parameters to guarantee this property. Such constraints are difficult to ensure in the context of a regression.

These challenges are addressed by adopting a parameterization of Σ based on its basic and modified Cholesky decomposition, respectively. These parameterizations are unconstrained, ensuring positive definite Σ for any predictors. Subsequently all parameters of the distribution – the means and those specifying the covariance – may be flexibly modeled.

The ability to model all k · (k + 3)/2 distributional parameters through flexible additive predictors is demonstrated in a ten-dimensional weather forecasting application, where the Cholesky-based models outperform reference methods with more restrictive covariance specifications.

Appendix A. Basic Cholesky parameterization
The log-likelihood of the distributional parameters for an observation vector y is given by

ℓ(µ, L^{−1} | y) = −(k/2) log(2π) + log(|L^{−1}|) − (1/2) (y − µ)^⊤ (L^{−1})^⊤ L^{−1} (y − µ).   (A.1)

In terms of the individual matrix entries, Eq. A.1 can be expressed as

ℓ(µ, L^{−1} | y) = −(k/2) log(2π) + ∑_{i=1}^{k} log λ_ii − (1/2) z^⊤ z,   (A.2)

where z is the vector

z = L^{−1} ỹ =
[ λ_11  0     0     ···  0    ] [ ỹ_1 ]
[ λ_12  λ_22  0     ···  0    ] [ ỹ_2 ]
[ λ_13  λ_23  λ_33  ···  0    ] [ ỹ_3 ]
[ ⋮     ⋮     ⋮     ⋱    ⋮    ] [ ⋮   ]
[ λ_1k  λ_2k  λ_3k  ···  λ_kk ] [ ỹ_k ]   (A.3)

and ỹ = y − µ.

The mean parameters and off-diagonal Cholesky entries only influence the log-likelihood through this third term containing z. Partial derivatives with respect to the means are given by

∂ℓ/∂µ_i = ∂ℓ/∂η_{µ,i} = ∑_{j=1}^{k} ς_ij ỹ_j,   (A.4)

where ς_ij refers to the corresponding element of Σ^{−1} = (L^{−1})^⊤ L^{−1}.

For the off-diagonal Cholesky entries,

∂ℓ/∂η_{λ,ij} = ∂ℓ/∂λ_ij = −∑_{n=1}^{k} [ ( ∑_{m=1}^{n} (ỹ_m λ_mn) ) ỹ_i 1_j(n) ] = −ỹ_i ∑_{m=1}^{j} (ỹ_m λ_mj),   (A.5)

where the indicator 1_j(n) equals 1 for n = j and 0 otherwise. Derivatives with respect to the diagonal entries of L^{−1} also involve the second likelihood term and are given by

∂ℓ/∂λ_ii = 1/λ_ii − ỹ_i ∑_{m=1}^{i} (ỹ_m λ_mi).   (A.6)

The log link on λ_ii means ∂λ_ii/∂η_{λ,ii} = λ_ii, and so

∂ℓ/∂η_{λ,ii} = 1 − λ_ii ỹ_i ∑_{m=1}^{i} (ỹ_m λ_mi).   (A.7)

Second derivatives for parameters with identity link are found to be

∂²ℓ/∂η²_{µ,i} = −ς_ii = −∑_{j=i}^{k} λ²_ij,   (A.8)
∂²ℓ/∂η²_{λ,ij} = −ỹ²_i.   (A.9)

The log link on the diagonal entries results in

∂²ℓ/∂η²_{λ,ii} = ∂/∂η_{λ,ii} [ 1 − λ_ii ỹ_i ∑_{m=1}^{i} (ỹ_m λ_mi) ]
             = −2 λ_ii · (∂λ_ii/∂η_{λ,ii}) · ỹ²_i − (∂λ_ii/∂η_{λ,ii}) · ỹ_i ∑_{m=1}^{i−1} (ỹ_m λ_mi)
             = −2 λ²_ii ỹ²_i − λ_ii ỹ_i ∑_{m=1}^{i−1} (ỹ_m λ_mi).   (A.10)

Appendix B. Modified Cholesky parameterization

The modified Cholesky parameters are related to the basic parameters by L^{−1} = D^{−1/2} T, which implies

λ_ii = ψ_i^{−1/2},   λ_ij = −φ_ij · ψ_j^{−1/2}.   (B.1)

The log-likelihood in Eq. A.2 can be rewritten with respect to the new parameters:

ℓ(µ, ψ, φ | y) = −(k/2) log(2π) − (1/2) ∑_{i=1}^{k} log ψ_i − (1/2) ∑_{j=1}^{k} ( ∑_{i=1}^{j} ỹ_i φ_ij ψ_j^{−1/2} )².   (B.2)

For notational simplicity we define φ_ii = −1. The derivatives with respect to µ, λ_ij and λ_ii (Eqs. A.4, A.5, A.6) can be related to derivatives with respect to the modified Cholesky parameters using Eq. B.1:

∂ℓ/∂µ = T^⊤ D^{−1} T ỹ,   (B.3)
∂ℓ/∂φ_mn = ∂ℓ/∂λ_mn · ∂λ_mn/∂φ_mn,   (B.4)

and

∂ℓ/∂ψ_n = ∂ℓ/∂λ_nn · ∂λ_nn/∂ψ_n + ∑_{m=1}^{n−1} ( ∂ℓ/∂λ_mn · ∂λ_mn/∂ψ_n ).   (B.5)

Subsequently

∂λ_mn/∂φ_mn = −ψ_n^{−1/2}   (B.6)

and

∂λ_nn/∂ψ_n = −(1/2) ψ_n^{−3/2},   ∂λ_mn/∂ψ_n = (1/2) φ_mn ψ_n^{−3/2}.   (B.7)

Substituting the partial derivatives of ℓ with respect to λ_nn and λ_mn in the basic Cholesky parameterization, one obtains

∂ℓ/∂φ_mn = −ỹ_m ψ_n^{−1} ∑_{i=1}^{n} ỹ_i φ_in   (B.8)

and

∂ℓ/∂ψ_n = −(1/2) [ ψ_n^{−1} + ỹ_n ∑_{i=1}^{n} ( ỹ_i φ_in ψ_n^{−2} ) ] + (1/2) ψ_n^{−2} ∑_{m=1}^{n−1} [ ỹ_m φ_mn ( ∑_{i=1}^{n} ỹ_i φ_in ) ],   (B.9)

which simplifies to

∂ℓ/∂ψ_n = (1/2) ψ_n^{−2} ( ∑_{i=1}^{n} ỹ_i φ_in )² − 1/(2 ψ_n).   (B.10)

Since the ψ_n are estimated using a log link (log(ψ_n) = η_{ψ,n}), derivatives with respect to predictors become

∂ℓ/∂η_{ψ,n} = ∂ℓ/∂ψ_n · ψ_n = (1/(2 ψ_n)) ( ∑_{i=1}^{n} ỹ_i φ_in )² − 1/2.   (B.11)

The remaining parameters use an identity link so ∂ℓ/∂η_{φ,mn} = ∂ℓ/∂φ_mn and ∂ℓ/∂η_{µ,m} = ∂ℓ/∂µ_m. Continuing with the second derivatives,

∂²ℓ/∂η²_{φ,mn} = −ỹ²_m / ψ_n,   ∂²ℓ/∂η²_{ψ,n} = −(1/(2 ψ_n)) ( ∑_{i=1}^{n} ỹ_i φ_in )²,   and   ∂²ℓ/∂η²_{µ,n} = −ς_nn,   (B.12)

where ς_nn is the n-th diagonal entry of Σ^{−1}.

Acknowledgements

This project was funded by the Austrian Science Fund (FWF, grant no. P 31836). The authors thank the Zentralanstalt für Meteorologie und Geodynamik (ZAMG) for providing observational data. The computational results presented here have been achieved (in part) using the LEO HPC infrastructure of Universität Innsbruck.
References
Bauer, P., Thorpe, A., Brunet, G., 2015. The quiet revolution of numerical weather prediction. Nature 525, 47–55.
Bickel, P.J., Levina, E., 2008. Covariance regularization by thresholding. The Annals of Statistics 36, 2577–2604.
Burke, K., Jones, M.C., Noufaily, A., 2019. A flexible parametric modelling framework for survival analysis. arXiv 1901.03212. arXiv.org E-Print Archive. URL: https://arxiv.org/abs/1901.03212.
Dawid, A.P., Sebastiani, P., 1999. Coherent dispersion criteria for optimal experimental design. The Annals of Statistics 27, 65–81.
Feldmann, K., Scheuerer, M., Thorarinsdottir, T.L., 2015. Spatial postprocessing of ensemble forecasts for temperature using nonhomogeneous Gaussian regression. Monthly Weather Review 143, 955–971.
Friedman, J., Hastie, T., Tibshirani, R., 2008. Sparse inverse covariance estimation with the graphical lasso. Biostatistics 9, 432–441.
Furrer, R., Genton, M.G., Nychka, D., 2006. Covariance tapering for interpolation of large spatial datasets. Journal of Computational and Graphical Statistics 15, 502–523.
Gabriel, K.R., 1962. Ante-dependence analysis of an ordered set of variables. The Annals of Mathematical Statistics 33, 201–212.
Gebetsberger, M., Stauffer, R., Mayr, G.J., Zeileis, A., 2019. Skewed logistic distribution for statistical temperature post-processing in mountainous areas. Advances in Statistical Climatology, Meteorology and Oceanography 5, 87–100. URL: https://ascmo.copernicus.org/articles/5/87/2019/.
Gneiting, T., Balabdaoui, F., Raftery, A.E., 2007. Probabilistic forecasts, calibration and sharpness. Journal of the Royal Statistical Society B 69, 243–268.
Gneiting, T., Raftery, A.E., 2007. Strictly proper scoring rules, prediction, and estimation. Journal of the American Statistical Association 102, 359–378.
Gneiting, T., Raftery, A.E., Westveld III, A.H., Goldman, T., 2005. Calibrated probabilistic forecasting using ensemble model output statistics and minimum CRPS estimation. Monthly Weather Review 133, 1098–1118.
Hamill, T.M., Bates, G.T., Whitaker, J.S., Murray, D.R., Fiorino, M., Galarneau Jr., T.J., Zhu, Y., Lapenta, W., 2013. NOAA's second-generation global medium-range ensemble reforecast dataset. Bulletin of the American Meteorological Society 94, 1553–1565.
Hastie, T.J., Tibshirani, R.J., 1990. Generalized Additive Models. Volume 43. Chapman & Hall/CRC.
Klein, N., Kneib, T., Klasen, S., Lang, S., 2015a. Bayesian structured additive distributional regression for multivariate responses. Journal of the Royal Statistical Society C 64, 569–591.
Klein, N., Kneib, T., Lang, S., 2015b. Bayesian generalized additive models for location, scale, and shape for zero-inflated and overdispersed count data. Journal of the American Statistical Association 110, 405–419.
Kneib, T., Fahrmeir, L., 2007. A mixed model approach for geoadditive hazard regression. Scandinavian Journal of Statistics 34, 207–228.
Köhler, M., Umlauf, N., Beyerlein, A., Winkler, C., Ziegler, A.G., Greven, S., 2017. Flexible Bayesian additive joint models with an application to type 1 diabetes research. Biometrical Journal 59, 1144–1165.
Lang, M.N., Mayr, G.J., Stauffer, R., Zeileis, A., 2019. Bivariate Gaussian models for wind vectors in a distributional regression framework. Advances in Statistical Climatology, Meteorology and Oceanography 5, 115–132.
Leutbecher, M., Palmer, T.N., 2008. Ensemble forecasting. Journal of Computational Physics 227, 3515–3539.
Levina, E., Rothman, A., Zhu, J., 2008. Sparse estimation of large covariance matrices via a nested lasso penalty. The Annals of Applied Statistics 2, 245–263.
Mayr, A., Fenske, N., Hofner, B., Kneib, T., Schmid, M., 2012. Generalized additive models for location, scale and shape for high dimensional data – a flexible approach based on boosting. Journal of the Royal Statistical Society C 61, 403–427.
Pan, J., Pan, Y., 2017. jmcm: An R package for joint mean-covariance modeling of longitudinal data. Journal of Statistical Software 82, 1–29.
Pinson, P., 2012. Adaptive calibration of (u, v)-wind ensemble forecasts. Quarterly Journal of the Royal Meteorological Society 138, 1273–1284.
Pourahmadi, M., 1999. Joint mean-covariance models with applications to longitudinal data: Unconstrained parameterisation. Biometrika 86, 677–690.
Pourahmadi, M., 2000. Maximum likelihood estimation of generalised linear models for multivariate normal covariance matrix. Biometrika 87, 425–435.
Pourahmadi, M., 2013. High-Dimensional Covariance Estimation: With High-Dimensional Data. Volume 882. John Wiley & Sons.
Rigby, R.A., Stasinopoulos, D.M., 2005. Generalized additive models for location, scale and shape. Journal of the Royal Statistical Society C 54, 507–554.
Schefzik, R., Thorarinsdottir, T.L., Gneiting, T., 2013. Uncertainty quantification in complex simulation models using ensemble copula coupling. Statistical Science 28, 616–640.
Schoenach, D., Simon, T., Mayr, G.J., 2020. Postprocessing ensemble forecasts of vertical temperature profiles. Advances in Statistical Climatology, Meteorology and Oceanography 6, 45–60.
Schuhen, N., Thorarinsdottir, T.L., Gneiting, T., 2012. Ensemble model output statistics for wind vectors. Monthly Weather Review 140, 3204–3219.
Simon, T., Mayr, G.J., Umlauf, N., Zeileis, A., 2019. NWP-based lightning prediction using flexible count data regression. Advances in Statistical Climatology, Meteorology and Oceanography 5, 1–16.
Stasinopoulos, M.D., Rigby, R.A., De Bastiani, F., 2018. GAMLSS: A distributional regression approach. Statistical Modelling 18, 248–273.
Umlauf, N., Klein, N., Zeileis, A., 2018. BAMLSS: Bayesian additive models for location, scale, and shape (and beyond). Journal of Computational and Graphical Statistics 27, 612–627.
Wood, S.N., 2017. Generalized Additive Models: An Introduction with R. 2nd ed., Chapman & Hall/CRC, Boca Raton.
Worsnop, R.P., Scheuerer, M., Hamill, T.M., Lundquist, J.K., 2018. Generating wind power scenarios for probabilistic ramp event prediction using multivariate statistical post-processing. Wind Energy Science 3, 371–393.
Wu, W.B., Pourahmadi, M., 2003. Nonparametric estimation of large covariance matrices of longitudinal data. Biometrika 90, 831–844.
Zimmerman, D.L., Núñez-Antón, V., El-Barmi, H., 1998. Computational aspects of likelihood-based estimation of first-order antedependence models. Journal of Statistical Computation and Simulation 60, 67–84.