Bayesian estimation of trend components within Markovian regime-switching models for wholesale electricity prices: an application to the South Australian wholesale electricity market
Angus Lewis, Nigel Bean, and Giang T. Nguyen

The University of Adelaide, School of Mathematical Sciences, and Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers

September 17, 2020
Abstract
We discuss and extend methods for estimating Markovian-Regime-Switching (MRS) and trend models for wholesale electricity prices. We argue that the existing methods of trend estimation used in the electricity price modelling literature either require an ambiguous definition of an ‘extreme price’, or lead to issues when implementing model selection [23]. The first main contribution of this paper is to design and infer a model which has a model-based definition of extreme prices and permits the use of model selection criteria.

Due to the complexity of MRS models, inference is not straightforward. In the existing literature an approximate EM algorithm is used [26]. Another contribution of this paper is to implement exact inference in a Bayesian setting. This also allows the use of posterior predictive checks to assess model fit. We demonstrate the methodologies with data from the South Australian electricity market.
Keywords— Regime-switching models, electricity spot price, price spikes, trend model.

1 Introduction
Electricity is a unique commodity because it is not currently economically storable at scale, and demand is effectively inelastic. This causes electricity spot prices to exhibit behaviours not seen in other markets, such as large price spikes and drops, negative prices, mean reversion, weekly, seasonal and yearly trends, and a strong dependence on weather and business activities [4, 12, 14, 23, 32, 39]. This makes modelling electricity prices a challenging problem.

The South Australian (SA) market is a particularly interesting case study due to its relative isolation, extreme hot weather and high penetration of non-baseload renewables – approximately 49% of SA’s generation in 2018 was from non-baseload renewables, primarily wind [1].

A standard approach to modelling electricity prices decomposes the price series into a seasonal component, s_t, and a stochastic component, X_t. The price at time t is then given by P_t = s_t + X_t or P_t = s_t X_t. Furthermore, it is common to decompose X_t as a mixture of base prices and price spikes. In the existing literature, the seasonal and stochastic components are estimated separately: first, the seasonal component is estimated and removed from the price series to create a stationary process, from which the stochastic component is then estimated [6, 23].

For wholesale electricity prices, estimating the seasonal component is made more complex by extreme price spikes which, due to their magnitude, affect trend estimates [23]. One solution posed in the existing literature is to first identify extreme observations and replace them with a “normal” value, then estimate the seasonal component on this simplified data using standard techniques [23]. However, if the stochastic model is used to define extreme prices, then some statistical model comparisons are not possible. To overcome this issue, we propose that the trend model is combined with the stochastic model, and that the two components are estimated jointly.
This ensures that statistical model selection criteria are permissible and, furthermore, can be used to compare both the stochastic and trend components.

We use cubic splines and wavelet-based models to capture long-term trends, and piecewise constant functions to capture weekly periodicities. The use of wavelet filtering for long-term trends, and piecewise constant models for weekly seasonalities, is common in the electricity pricing literature [34, 23], but cubic splines are less common. Since we combine the trend and stochastic models into one, and work in a Bayesian paradigm, we cannot apply wavelet filtering directly; instead we construct a wavelet-based regression model. This model is relatively complex due to the need for padding, so we also introduce the simpler cubic spline-based model.

To capture stochastic variations in electricity prices, we use a Markovian-Regime-Switching (MRS) model, which is an extension of the Hidden Markov Model (HMM) that allows for dependence between observations. For reviews of existing models for electricity prices see [5, 12, 42]. The MRS models used in the electricity pricing literature [7, 10, 25, 33, 41], and considered in this paper, have a subtle but important difference in their specification compared to MRS models used in the wider literature: the MRS models in this paper have independent regimes (see Section 2.2 for details). This specification is more suitable for electricity spot markets, where prices return rapidly to the base level following a price spike. Independent-regime models were introduced to electricity price modelling by Huisman and De Jong [22], and have since been popular [7, 10, 25, 33, 41].

Inference for MRS models with independent regimes is not straightforward since the regime sequence is unobservable.
In the electricity pricing literature, an approximation to the EM algorithm, which we will refer to as the EM-like algorithm, is used for inference of these models (see Janczura and Weron [26], who extend work by Kim [29]). However, since this is an approximation to the EM algorithm, the theory of the EM algorithm does not apply, and there is no guarantee of convergence, or error bounds, for this algorithm [31]. We address this by working under a Bayesian paradigm and using augmented Markov Chain Monte Carlo (MCMC) methods for inference. Working in a Bayesian paradigm also alleviates identifiability issues related to estimation of shifted distributions, compared to working in a maximum-likelihood context [20, 31].

Once a model has been fitted, we must assess the assumptions underlying it. Janczura and Weron [27] suggest a method to assess goodness-of-fit when parameters are estimated using their EM-like algorithm. This is not directly applicable in our Bayesian setting. Instead, we adopt posterior predictive checks [13, Chapter 6]. The logic behind posterior predictive checks is that, if the model is good, then data replicated under the posterior predictive distribution should look similar to the observed data. By comparing the predictive distribution to observed data, we can assess whether there are any significant deficiencies in the model. We also compute Bayes factors, which we use to compare models.

To demonstrate our methodologies, we apply them to daily-average wholesale electricity prices in South Australia. Here we model daily average prices, as is common in the existing literature [4, 11, 15, 25, 32, 35, 41]. Throughout this paper we use “spot price” to refer to the daily average spot price. We choose to model the raw daily-average price series, rather than log prices. There is some existing evidence to suggest that this is a reasonable approach [41].
Our justification is that, to take the logarithm of prices, all prices need to be shifted upward so that they are positive. However, there is limited advice on how to shift the data, and the shift can have a large effect on the features of the price series.
Spot price evolution is a discrete-time process, hence we use a discrete-time model. The model is composed of two parts: a seasonal component {s_t}_{t∈N}, capturing deterministic trends in electricity prices, and a stochastic component {X_t}_{t∈N}, capturing mean reversion, spikes and drops. We use an additive model; at time t the price, P_t, is given by P_t = s_t + X_t.

Extreme prices in electricity markets can bias estimates of seasonal components if we do not treat them carefully. A solution proposed by Janczura et al. [23] is to use some method to identify and replace price spikes, and then estimate the trend model on the altered dataset. Once the trend model has been estimated, it can be removed from the data, and the stochastic model estimated. For example, one method of spike identification used by Janczura et al. [23] is to regard any price further than 3 standard deviations above the mean of the data as a spike. An alternative, also proposed by Janczura et al. [23], is to iterate between estimating the seasonal component and estimating the stochastic component: a byproduct of estimating the stochastic component is a classification of observations into regimes, which can then be used to identify spikes.

If a spike classification method which is independent of the stochastic model is used, such as classifying prices as spikes if they are three or more standard deviations away from the mean, then we can remove the trend from the data and estimate all stochastic components on the same detrended dataset. Since the stochastic components all use the same detrended data, statistical (likelihood-based) metrics can be used to compare the stochastic components of the model. Issues with this method are that the choice of spike classification method is quite arbitrary, and that we cannot use these metrics to compare trend components.

Using the stochastic component to identify and replace spikes is attractive as it is model-based, so there is an obvious definition of a price spike.
Different specifications of the stochastic component will likely lead to different classifications of which prices are extreme, and therefore of which prices must be treated before we estimate the trend component. Ultimately, the estimates of the trends will differ, leading to different detrended datasets on which we then estimate the stochastic components. The result is that the likelihoods of two different models are incomparable, and therefore statistical (likelihood-based) model comparisons such as the BIC or Bayes factors are invalid.

In this paper, we propose a different method. We combine the stochastic and trend models into one, and estimate them jointly. The advantages of this approach are that the classification of data as spikes is model-based, while also enabling statistical model comparison techniques for both the stochastic components and the trend components.

The seasonal component, s_t

Electricity spot prices exhibit seasonality on weekly, seasonal and yearly scales. To capture this multi-scale seasonality, the seasonal component consists of two parts: a short-term component, g_t, and a long-term component, h_t. We present two different models for the long-term component: cubic splines, and a wavelet-based model. It is common to use wavelet filtering to estimate long-term trends in the electricity pricing literature [23], but cubic splines are less common. Some authors discount the use of splines because they do not perform well for prediction, since it is not clear how to extrapolate spline models beyond the data [34]. However, wavelet filtering has similar issues. The appeal of cubic splines and wavelet models is that they are able to capture a diverse range of behaviours (unlike sinusoidal-based models, for example, which can capture periodic trends only).

The general idea of wavelet filtering is to project a time series on to a set of functions which can represent only the low-frequency components of the data.
Specifically, for wavelet filtering the data is projected on to a set of scaling functions. Due to properties of wavelet and scaling functions, the coefficients of the scaling functions can be computed with a linear filter which is applied recursively to the data. The weights of the filter take values defined by the wavelet and scaling basis functions.

There are different possible choices of wavelet bases, which give the filter different properties. Here, in line with the existing literature, we choose to use Daubechies wavelets and apply the filter recursively 8 times [41]. In [41], Daubechies wavelets of order 24 are specified; however, we use Daubechies wavelets of order 8 due to the computational demands of the algorithms used in this work. This serves to reduce the number of parameters of the wavelet component from 55 to 24, and also to reduce the amount of padding required. In order to apply the wavelet filtering method, the data must be padded to the correct length. Here we use symmetric padding and extend the data at both ends by reflecting the signal (see the supplementary material for more details).

Let the vector of data be x = (x_1, ..., x_T)′, where y′ is the transpose of the vector y, and assume it is an appropriate length (or has been suitably padded). We show, in the supplementary material, that computation of the scaling function coefficients at level j, â_j = (â_{1,j}, ..., â_{k,j})′, can be represented as a matrix-vector product, â_j = W_j x, where W_j is a matrix of filter coefficients constructed in the supplementary material. We also show that, due to the properties of wavelet filtering, the coefficients â_j are the least-squares estimates of the regression model x = W′_j a_j + σε, where ε is a column vector of independent standard normal random variables.
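This matrix-vector view of wavelet filtering can be checked numerically. The sketch below is our own illustration, not the authors' code: for brevity it uses the simplest Daubechies scaling filter (Haar, weights (1/√2, 1/√2)) on a power-of-two-length series so that no padding is needed, whereas the paper uses higher-order Daubechies filters with symmetric padding.

```python
import numpy as np

def scaling_coeffs(x, levels):
    """Apply the Haar low-pass filter and downsample, recursively."""
    a = np.asarray(x, dtype=float)
    for _ in range(levels):
        a = (a[0::2] + a[1::2]) / np.sqrt(2.0)
    return a

rng = np.random.default_rng(1)
T, j = 256, 3
x = np.cumsum(rng.normal(size=T))            # a rough stand-in for a price series

# Build W_j by filtering the standard basis vectors, so that a_j = W_j @ x.
W = np.column_stack([scaling_coeffs(e, j) for e in np.eye(T)])
a = scaling_coeffs(x, j)
assert np.allclose(W @ x, a)

# Because the rows of W_j are orthonormal, a_j is also the least-squares
# solution of the regression x = W_j' a + noise, as claimed in the text.
a_ls, *_ = np.linalg.lstsq(W.T, x, rcond=None)
assert np.allclose(a_ls, a)

# The fitted values W_j' a_j give the low-frequency (here, blockwise-mean)
# part of x, which is what the trend model uses as a basis.
trend = W.T @ a
assert trend.shape == x.shape
```

The same construction works for any orthogonal Daubechies filter; only the filter weights and the treatment of the boundaries change.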
Motivated by this, we use the design matrix W′_j as a basis for our wavelet-based trend model.

Similarly to how wavelet filtering is a projection on to scaling functions, we can also project the data on to a set of cubic spline functions. Let η_0 < η_1 < ... < η_k be a set of points called knots. Then cubic splines are piecewise cubic polynomials on the intervals [η_0, η_1], [η_1, η_2], ..., [η_{k−1}, η_k], with continuous first and second derivatives at the interior knots, η_1, ..., η_{k−1}. In the supplementary material we show how to construct the design matrix of a cubic spline ordinary least squares regression model. We denote the design matrix of the cubic spline regression as C′. Here we use B-splines to construct the design matrix and place knots every 180 days (approximately two per year). An advantage of the cubic spline model is that no padding is required, which simplifies the design matrix and computations.

The other component of the trend model is the short-term component, which is made up of piecewise constant functions and captures the average price fluctuations over the course of a week. That is,

g_t = β_Mon I(t ∈ Mon) + β_Tue I(t ∈ Tue) + ... + β_Sun I(t ∈ Sun),

where β_Mon, β_Tue, ..., β_Sun are the mean prices, after accounting for long-term trends, on Monday, Tuesday, ..., Sunday, respectively, and I is the indicator function. The design matrix for this component of the trend is denoted M′ and defined in the supplementary material.

Putting this all together, the trend model has design matrix Z = [M′ | W′_j] in the case of the wavelet-based model (which includes padding), and design matrix Z = [M′ | C′] in the case of the cubic spline-based model. We denote the parameters of the trend model as γ = (γ_1, ..., γ_p)′, so the trend is given by s = Zγ.

The Markovian-Regime-Switching (MRS) model is a generalisation of the hidden Markov model: MRS models allow for dependence between observations. MRS models are comprised of two components: a hidden regime sequence, {R_t}_{t∈N}, and an observation sequence, {X_t}_{t∈N}. The evolution of the regime process {R_t}_{t∈N} is governed by a discrete-time Markov chain and is assumed to be unobservable. The regime process is completely defined by its state space, S = {1, 2, ..., N}, which corresponds to the set of regimes, a transition probability matrix, P = (p_ij), i, j ∈ S, and an initial distribution of the state of the chain. At each time t, the observation, x_t, follows a distribution that is known given the regime sequence up to time t, R_t = (R_1, ..., R_t)′, and all previous observations, (x_1, ..., x_{t−1})′ =: x_{t−1}. That is, the density function, f(x_t | x_{t−1}, R_t), is known. Note that, unlike in a hidden Markov model, the observations are not conditionally independent given the hidden regimes.

A key element of the independent-regime MRS models used in the electricity price modelling literature is that observations generated by one regime are independent of observations generated by any other regime. That is, defining A_i := {X_t : R_t = i} for i ∈ S, the sets A_i and A_j are independent for i ≠ j.
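Before continuing, it may help to see the combined trend design matrix Z = [M′ | C′] described above made concrete. The following is our own sketch, not the authors' code: the assumption that day 0 is a Monday and the clamped boundary-knot layout are ours, and in practice the day-of-week block should follow the actual calendar.

```python
import numpy as np
from scipy.interpolate import BSpline

T = 2342                              # number of daily observations, as in the paper
t_grid = np.arange(T, dtype=float)

# Short-term block M': day-of-week indicators (the piecewise constant weekly
# component g_t). Assumption: day 0 is a Monday.
M = np.zeros((T, 7))
M[np.arange(T), np.arange(T) % 7] = 1.0

# Long-term block C': cubic B-spline basis with interior knots every 180 days
# (approximately two per year) and clamped (repeated) boundary knots.
k = 3                                 # cubic
interior = np.arange(180.0, T, 180.0)
knots = np.concatenate([np.zeros(k + 1), interior, np.full(k + 1, float(T))])
C = BSpline.design_matrix(t_grid, knots, k).toarray()

# Combined design matrix; the trend is s = Z @ gamma.
Z = np.hstack([M, C])
assert Z.shape == (T, 7 + len(knots) - k - 1)
assert np.allclose(C.sum(axis=1), 1.0)   # B-spline bases form a partition of unity
```

For the wavelet-based model, the block C would be replaced by the (padded) wavelet design matrix W′_j constructed in the supplementary material.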
One way to think about the evolution of these models is as follows. Consider N random processes evolving simultaneously, each of which corresponds to a regime in the MRS model, and each of which produces a realisation at all time points. At each time t we observe only one of these processes, and which process is observed is determined by the hidden regime process, {R_t}. That is, at time t, we observe the i-th process if R_t = i, and all other realisations are unobserved. Due to this assumption, the distributions f(x_t | x_{t−1}, R_t) may depend on the history of R_t.

We wish to combine the MRS and trend models into one so that we may estimate them jointly. We specify two types of regimes: base regimes, denoted {B^{(i)}_t}_{t∈N}, i = 1, ..., n, which are AR(1) processes and include trend components, and non-base regimes, {Y^{(i)}_t}_{t∈N}, i = n+1, ..., N, which are i.i.d. processes and do not have trend components. The processes {B^{(i)}_t}_{t∈N} have the form

B^{(i)}_t = φ_i (B^{(i)}_{t−1} − s_{t−1}) + s_t + σ_i ε^{(i)}_t      (1)
          = φ_i (B^{(i)}_{t−1} − z_{t−1} γ) + z_t γ + σ_i ε^{(i)}_t,

where γ, φ_i and σ_i are parameters and {ε^{(i)}_t}_{t∈N} is a sequence of i.i.d. N(0, 1) random variables, i = 1, 2, ..., n. The terms z_{t−1} and z_t are the rows of Z corresponding to times t−1 and t, respectively. When multiple AR(1) regimes are included in a model we impose the restriction σ_i < σ_{i+1} for all i = 1, 2, ..., n−1. This condition serves to identify the model. Rearranging Equation (1) to

B^{(i)}_t − s_t = φ_i (B^{(i)}_{t−1} − s_{t−1}) + σ_i ε^{(i)}_t

shows that this model corresponds to modelling the noise around the trend as a stationary AR(1) process.

For all other regimes, let {Y^{(i)}_t}_{t∈N} be a sequence of i.i.d. random variables for each i = n+1, ..., N. These regimes are labelled as either spikes or drops depending on their nature.

The largest models that we consider have N = 5 regimes: two base regimes, two spike regimes, and a drop regime. Other models considered are subsets of this largest model. The inclusion of two base regimes in our modelling is motivated by our observation of a structural change in volatility in 2016 in the dataset. As for spike regimes, one spike regime captures typical spikes and another captures extraordinary spikes, which is motivated by our observation that a single spike regime is unable to capture extreme price spikes in the SA market data. The drop regime captures large downward price movements, which can occur when the market is oversupplied. Therefore, the largest model that we consider takes the following form:

P_t = B^{(1)}_t if R_t = 1,
      B^{(2)}_t if R_t = 2,
      Y^{(3)}_t if R_t = 3,
      Y^{(4)}_t if R_t = 4,
      Y^{(5)}_t if R_t = 5,

where P_t is the price at time t, B^{(i)}_t, i = 1, 2, are as in Equation (1), and Y^{(j)}_t, j = 3, 4, 5, are i.i.d. random variables.
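The independent-regime structure can be made concrete with a minimal simulation, reduced to two regimes: one AR(1) base regime around a trend, as in Equation (1), and one i.i.d. shifted log-normal spike regime. This is our own sketch with illustrative parameter values, not the authors' code or estimates; the key point is that the latent base process keeps evolving at every time step, whether or not it is observed.

```python
import numpy as np

rng = np.random.default_rng(42)

T = 500
s = 40 + 5 * np.sin(2 * np.pi * np.arange(T) / 365)   # toy trend s_t
phi, sigma_base = 0.8, 2.0                            # AR(1) parameters (assumed)
q, mu, sigma_spike = 60.0, 3.0, 0.5                   # shifted log-normal spike (assumed)
P_trans = np.array([[0.95, 0.05],                     # regime transition matrix
                    [0.60, 0.40]])

R = np.zeros(T, dtype=int)   # hidden regime sequence (0 = base, 1 = spike)
B = np.empty(T)              # latent base process: evolves at every t
X = np.empty(T)              # observed price
B[0] = s[0]
X[0] = B[0]
for t in range(1, T):
    R[t] = rng.choice(2, p=P_trans[R[t - 1]])
    # The base process evolves whether or not it is observed -- this is
    # exactly what makes the regimes independent of one another.
    B[t] = phi * (B[t - 1] - s[t - 1]) + s[t] + sigma_base * rng.normal()
    X[t] = B[t] if R[t] == 0 else q + rng.lognormal(mu, sigma_spike)

# After a spike, the observed price returns immediately to the base level,
# so the base-regime observations are far less volatile than the full series.
assert set(np.unique(R)) == {0, 1}
assert X[R == 0].std() < X.std()
```

A plot of X against t from this sketch reproduces the qualitative behaviour of Figure 1: spike observations interrupt the base series, and the latent base values between observed base prices are what must later be integrated out.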
We consider two different specifications for Regimes 3 and 4 (spike regimes): a shifted log-normal and a shifted gamma distribution,

Y^{(i)}_t − q_i ∼ F_i(µ_i, σ_i) for i = 3, 4,

where F_i is either a log-normal or a gamma distribution, and q_i is a shifting parameter. Note that we parameterise the gamma distribution with shape parameter µ_i and scale parameter σ_i. Regime 5, the drop regime, follows an i.i.d. shifted log-normal distribution but with the direction of the distribution reversed,

q_5 − Y^{(5)}_t ∼ LN(µ_5, σ_5),

and is included to capture large downward price movements. The shifting parameters q_3, q_4 and q_5 can be prespecified [26], although we leave them to be inferred. We restrict the support of their prior distributions for identifiability and interpretability.

As mentioned earlier, the regimes of the MRS model are assumed to be mutually independent. This feature allows for the true spike-like behaviour observed in electricity markets, because it ensures that the return to base prices after spikes is immediate. However, this assumption does create some complicated dependence between observations of the AR(1) regimes. To determine the distribution of the AR(1) process at time t, B^{(i)}_t, we either need to know the previous price from that regime, B^{(i)}_{t−1} (which may or may not have been observed), or the time at which the last observation from that regime occurred, in which case we can integrate out the missing values of the AR(1) process between B^{(i)}_t and the last observed price from that regime, to construct its distribution. We take the latter approach. To illustrate, Figure 1 shows a simulation of a two-regime model with one AR(1) regime and one i.i.d. spike regime.

Figure 1: Simulation of an MRS model with two independent regimes: an AR(1) base regime and an i.i.d. spike regime. To write down the distribution of the observation at time t = 13, we would like to know either the base regime value immediately before it, circled in green, or the time at which the last observed price from that regime occurred, highlighted by the pink box.

To complete our model we need to specify prior distributions for its parameters. We look, where possible, to specify uninformative priors. However, for some parameters we use the prior distributions to enforce certain behaviours.

We use the prior distribution to restrict the support of the parameters q_i of the shifted spike and drop distributions. If left unrestricted, these parameters can make the spike and drop regimes capture base prices, rather than the behaviour they were designed to capture. For example, for spike distributions the shifting parameter may become negative, and the corresponding regime then captures prices that we would not logically classify as spikes. Janczura and Weron [24] suggest setting q_5, the shifting parameter of the drop regime, to the 25th percentile of the data, and q_3, the shifting parameter of the spike regime, to the 75th percentile of the data, if they cannot be estimated. Motivated by this, we place uniform priors on the shifting parameters, each supported on an interval of empirical percentiles of the data around the corresponding value.

Estimating shifted log-normal and shifted gamma distributions is not straightforward, as noted by Hill [20] and Johnson and Kotz [28].
This is because there are points where the likelihood may take arbitrarily large values. For example, if the shifting parameter is equal to a data point, q_i = x_t, then the likelihood of the shifted log-normal distribution can be made arbitrarily large as the variance tends to 0.

For the shifted log-normal distributions, the scale parameters, σ_i, have prior distributions with densities f(σ_i) ∝ 1/σ_i, i = 3, 4, 5, with support on a finite interval bounded away from zero, with upper end a multiple of s, the standard deviation of the data. A key feature here is that we restrict σ_i so that it cannot be arbitrarily small. This form of prior specification is often justified by the fact that, if the support of this prior were (0, ∞), then it would be the Jeffreys prior for this distribution, which is invariant to reparameterisation; equivalently, log σ then has a uniform distribution on R. The restriction of the prior to a finite interval ensures that the prior can be normalised and that σ_i cannot become arbitrarily small, preventing the likelihood from taking arbitrarily large values. We set the prior distribution for the location parameter of the shifted log-normal to a normal distribution with mean 0 and standard deviation a multiple of s, the standard deviation of the data.

It can also be shown that the likelihood of the shifted gamma distribution can be infinite when q_i = x_t and the shape parameter µ_i ≤ 1. In addition, the density of the gamma distribution is equal to 0 at x = q_i only if µ_i > 1. Therefore, when included in our MRS models, if we allow µ_i ≤ 1, the shifted gamma spike regimes will have a discontinuous peak at x = q_i, and most of their mass will lie near this point. Since we do not expect price spikes to be largely clumped around the shifting parameters q_i, we use the prior distribution of µ_i to force the shifted gamma distributions to have a peak away from the boundary.

For the shifted gamma distributions, the prior distributions for the shape parameters µ_i are shifted inverse-gamma distributions, which are chosen so that we can enforce desirable behaviour on the distribution in these regimes. These prior distributions are shifted to have support from 1 to ∞ (rather than 0 to ∞) to ensure that the shape parameters µ_i are strictly greater than 1. Furthermore, the inverse-gamma prior distributions are given shape parameter α = 3 (the smallest integer such that the prior distribution has finite variance) and scale parameter β = 5.5, so that the prior distribution has its mode at 2.5. This follows advice from Johnson and Kotz [28] who, in a maximum-likelihood context, observed issues related to the likelihood becoming large when µ_i is near 1, and who advise against maximum likelihood estimation of the shifting parameter q_i when µ_i is small. For the scale parameters of the shifted gamma distributions we again use priors with densities f(σ_i) ∝ 1/σ_i, with support on a finite interval bounded away from zero, with upper end a multiple of s, the standard deviation of the data.

For the transition probabilities of the hidden Markov chain we assign a Dirichlet prior distribution with all parameters equal to 1:

(p_{1j}, ..., p_{Nj}) ∼ Dirichlet(1, ..., 1).

This is a uniform prior distribution subject to a unit-sum condition.

The prior distribution of the correlation parameters φ_i is U(−1, 1); in particular, |φ_i| < 1, so each AR(1) regime is stationary. For the scale parameters σ_i of the AR(1) regimes, i = 1, 2, we use priors with densities f(σ_i) ∝ 1/σ_i, with support bounded below by 1 and above by a multiple of s, where s is the standard deviation of the data. The justification for this is the same as for the prior distributions of the log-normal distributions specified above. For the parameters of the trend models, γ, we assign independent, normally distributed prior distributions with mean 0 and standard deviation 10 × s.

As a small investigation of the sensitivity of our modelling to the choice of priors, we replaced the prior distributions of the form f(σ_i) ∝ 1/σ_i with uniform prior distributions with the same support, and observed no substantial change in our conclusions.

To estimate the parameters of our MRS models we use an adaptive, data-augmented, block Metropolis-Hastings algorithm to sample from the posterior distribution. In the following we use the notation θ ∈ Θ for a parameter vector, where Θ is the parameter space; π(θ) for the prior distribution; p(θ|x) for the posterior distribution; p(x|θ) for the likelihood; and define p(x) = ∫_Θ p(x|θ) π(θ) dθ.

Likelihood.
Due to the specification of the MRS model, the distribution of the observations is determined by knowledge of the hidden regime sequence and previous price data; therefore we must write the likelihood as a marginal distribution. Let S = {1, ..., N} be the state space of the regime process and x = (x_1, ..., x_T)′ be our sequence of observed prices. Then the likelihood can be written as

p(x|θ) = Σ_{R ∈ S^T} p(x, R | θ) = Σ_{R ∈ S^T} p(x | θ, R) p(R | θ).      (2)

The number of sequences in S^T is N^T. For most realistic datasets it is impossible to enumerate all N^T possible regime sequences, so we cannot naively compute the sum on the right-hand side of Equation (2).

Recall Figure 1 and the related discussion about the complex dependence structure between observations. Due to this dependence structure, the standard EM algorithm for MRS models is not valid (see [16, 17] for the algorithm and [26, 31] for a discussion of why it is not applicable to this problem). In the existing literature an EM-like algorithm is used for inference of these electricity price models [26]. However, the EM-like algorithm is an approximation to the EM algorithm, hence the theory of the EM algorithm does not apply and there are no known error bounds, or guarantee of convergence [31]. Janczura and Weron [26] provide some simulation evidence to show that their algorithm works well, but it is not too hard to find examples where it fails [31].

To resolve the inference problem we work in a Bayesian setting and use data-augmented MCMC methods. Our methodology is thus backed by a vast literature which suggests we will achieve reasonable parameter estimates. The data augmentation allows us to work with p(x | θ, R) instead of the likelihood, overcoming the computational issue in Equation (2).

Let s_{ti} be the length of time since the regime was last in state i before time t, given that the regime at time t is i. For example, the event {s_{t1} = k} = {R_t = 1, R_{t−1} ≠ 1, ..., R_{t−k+1} ≠ 1, R_{t−k} = 1}.
Then we can write p(x | θ, R) as

p(x | θ, R) = ∏_{i=1}^{2} p_{B^{(i)}}(x_1 | θ)^{I(R_1 = i)} × ∏_{i=1}^{2} ∏_{t=2}^{T} ∏_{k=1}^{t−1} p_{B^{(i)},k}(x_t | x_{t−k}, θ)^{I(s_{ti} = k)} × ∏_{i=3}^{N} ∏_{t=1}^{T} p_{Y^{(i)}}(x_t | θ)^{I(R_t = i)}.

Here p_{B^{(i)},k}(· | x_{t−k}, θ) is the density of the i-th AR(1) process, given that the last observation from the i-th process was x_{t−k}, k lags ago, and p_{Y^{(i)}}(· | θ) is the density of the i-th i.i.d. regime. For simplicity we assume that R_1 = 1 and, as there is no prior price information, set p_{B^{(i)}}(x_k | θ) = 1 for i = 1, 2, where k is the first time in regime i. That is, we effectively ignore the first observation from each autoregressive regime. For large sample sizes, this assumption is reasonable and has minimal effect.

The decision to evaluate p(x | θ, R) instead of the likelihood means that we infer the joint posterior distribution of the parameters and the unobserved regime sequence. That is, we infer

p(θ, R | x) = p(x | θ, R) p(θ, R) / p(x) ∝ p(x | θ, R) p(R | θ) π(θ).

The marginal distribution p(θ | x) that we seek is extracted by integrating the posterior p(θ, R | x) over R. Conveniently, when using the Markov chain Monte Carlo (MCMC) methods that we use here, this integration is equivalent to simply ignoring the samples of R in our MCMC output. Thus, data augmentation enables us to construct an MCMC algorithm that efficiently samples from the posterior distribution, at the cost of also requiring us to infer the sequence of regimes, R.

We use posterior predictive checks (PPCs) for model checking and comparison. PPCs are typically used as a flexible tool for model checking in Bayesian settings [13, Chapter 6]. The general idea is to sample parameters from the posterior distribution and then use these to calculate statistics from the observed data. This is repeated for many samples from the posterior, and we look to see whether, overall, our model looks reasonable.

More specifically, for the implementation of PPCs in this context, we sample a parameter vector θ* and a regime sequence R* from the posterior distribution. The sampled regime sequence R* is used to classify points into base regime i, which gives a partially observed AR(1) process.
For each point classified into base regime i, except the first such point, the standardised residual, r^{(i)}_t, is constructed using

r^{(i)}_t = [ x_t − z_t γ* − (φ*_i)^{s_{ti}} ( x_{t−s_{ti}} − z_{t−s_{ti}} γ* ) ] / [ σ*_i √( (1 − (φ*_i)^{2 s_{ti}}) / (1 − (φ*_i)^2) ) ],      (3)

where s_{ti} is the number of lags since the process was last in regime i before time t, and is determined by R*.

For each regime the residuals are used to perform the following model checks:

1. Construct QQ plots and compare to the distributional assumptions.
2. Plot residuals against t, to check for constant variance across time.
3. Plot residuals against the absolute value of the last observed value in that regime; i.e., for base regime i, plot r^{(i)}_t against |x_{t−s_{ti}}|, where R* determines s_{ti}. We use this plot to check that the variance of the process does not depend on the magnitude of prices.

In Section 4, for the sake of brevity, we provide only a single representative plot from our PPCs.

PPCs are easy to implement but, as with any checking procedure, we should be aware of their interpretation and drawbacks. A word of caution is in order: PPCs are useful for determining whether our model fails in any obvious way but, because we use the same data to fit and to check the model, these procedures tend to make models look better than they actually are.

As another model check, we also plot the price series together with a classification of observed prices into regimes estimated under the model. This allows us to visually assess which regime captures which data points, to see if the model is behaving as we expect.

We also compute Bayes factors to compare models. For two models M_1 and M_2, the Bayes factor is the ratio K = p(x | M_1) / p(x | M_2): the ratio of the probability of the observed data under M_1 to the probability of the data under M_2. Thus, if K is greater than 1, then the probability of the observed data is greater under model M_1 than under M_2, and M_1 would be preferable.
The quantity p(x | M) is approximated for each model within the implementation of the MCMC procedure.

The National Electricity Market (NEM) comprises five interconnected states of Australia that also act as price regions: Queensland, New South Wales (including the Australian Capital Territory), South Australia, Victoria, and Tasmania. Each state has its own generation capacity and can also import/export electricity via interconnectors between states.
Dispatch prices for each region are set every 5 minutes in an auction process managed by the Australian Energy Market Operator (AEMO), and the spot price is the average of the dispatch prices over 30 minutes, resulting in 48 realisations of the spot price process per day. The spot price is the price at which transactions are settled.

The South Australian electricity market is a particularly interesting case study due to a number of factors including its relative isolation, occasional extremely hot weather, and generation mix – in 2016 SA had 39.2% of its total generation come from wind farms, 50.5% from gas and 9.2% from residential solar panels [1]. Our dataset consists of 112,416 half-hourly electricity spot prices from the South Australian electricity market (available at the AEMO website [2]) for the period 00:00 hours, 1st of January 2013 to 23:30 hours, 1st of May 2019. This gives us a dataset of 2,342 daily average price observations to which we fit our model. To be clear, we calculate the daily average prices as the arithmetic mean of the 48 spot prices in a day. The data that we model are plotted in Figure 2.

We fit the models discussed above to the South Australian dataset, using our MCMC algorithm to sample from the posterior distribution. Code is available online at https://github.com/angus-lewis/MRSMCMC. For each model, 4 MCMC chains of length 2,000,000 were generated, and the first 500,000 samples of each chain were used for adaptation and discarded as burn-in. First, we discuss the two trend models introduced
Figure 2: The daily average wholesale electricity spot price for South Australia for the period from the 1st of January 2013 to the 31st of May 2019, quoted in $AUD per megawatt hour. The plot has been truncated at $1500 so that details are clearer. The price which has been cut off in the truncated plot is a price of $3359.81, which occurred in 2019.

above, and then the MRS models.
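The daily averaging step described above (the arithmetic mean of the 48 half-hourly spot prices in each day) is straightforward; a minimal sketch, assuming the half-hourly prices are stored in a flat array whose length is a multiple of 48:

```python
import numpy as np

def daily_average(half_hourly_prices):
    """Average 48 half-hourly spot prices into one observation per day."""
    prices = np.asarray(half_hourly_prices, dtype=float)
    assert prices.size % 48 == 0, "expect complete days of 48 observations"
    return prices.reshape(-1, 48).mean(axis=1)
```

For the dataset above, 112,416 half-hourly prices reduce to 112,416 / 48 = 2,342 daily averages.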
Trend model
Recall the two trend models introduced earlier in the paper: the wavelet-based model and the cubic-spline-based model. It turns out that, under the model checking procedure that we follow below, the choice of trend model does not significantly affect the conclusions about the distributional assumptions of the model. That is, the distributional assumptions appear to be sufficient or deficient in the same ways for either trend model. Therefore, to simplify this discussion, we focus on the trend model for log-normally distributed spikes only.

In Figure 3, posterior mean estimates of the short-term and long-term trends are shown for stochastic Models 1-3 (defined below) for both the spline and wavelet models. The difference between the estimated mean long-term trend components is most notable at the edges of the dataset. This is not unexpected, since the wavelet-based model incorporates edge effects due to padding, while this is not an issue for the spline-based model. Both trend models, however, seem to capture the main features of the data: a trough in 2015 and a peak in 2017.

Notably, Model 1 suggests that the trough in prices in 2015 was not as low as the other models suggest. This can be explained by the fact that Model 1 includes a drop regime intended to capture sudden price drops; for this model, however, rather than capture sudden price drops, the drop regime captures the period of low prices in 2015. In contrast, Models 2 and 3 do not include drop regimes, and so we see a lower trough in 2015 for these models.

Figure 3: Wavelet-based (left) and spline-based (right) posterior mean long-term (top) and short-term (bottom) seasonal components for stochastic Models 1-3. The short-term trend models have been adjusted so that they represent the average price on a given day of the week compared to the average price on a Monday.

There is not much difference between estimates of the short-term trend component for the wavelet-based and spline-based models.

We can use Bayes factors to compare the cubic-spline-based and wavelet-based models. Table 1 contains the natural logarithm of the Bayes factor comparing the wavelet-based model to the cubic-spline-based model for stochastic Models 1-3.

Table 1: The natural log of the Bayes factor comparing the wavelet-based and cubic-spline-based trend models for stochastic Models 1-3.

                      Wavelet vs. Cubic-spline
Stochastic model 1    6.59
Stochastic model 2    12.1
Stochastic model 3    34.1

Table 1 shows that the cubic-spline model is favourable in every case. Furthermore, the cubic-spline-based model is simpler to implement and interpret, hence we recommend that the cubic-spline-based model is used. The results presented henceforth are all based on the spline-based model.
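Log Bayes factors such as those in Table 1 are often mapped onto an interpretive scale. The sketch below encodes the widely used Kass-Raftery thresholds on 2 ln K; this scale is an external convention, not something defined in this paper.

```python
def interpret_log_bayes_factor(ln_k):
    """Map ln(K) for model M1 vs M2 to an evidence category,
    using the Kass-Raftery thresholds on 2*ln(K)."""
    two_ln_k = 2.0 * ln_k
    if two_ln_k < 2:
        return "not worth more than a bare mention"
    if two_ln_k < 6:
        return "positive"
    if two_ln_k < 10:
        return "strong"
    return "very strong"
```

For example, ln K = 6.59 gives 2 ln K ≈ 13.2, i.e. "very strong" evidence for the favoured model on this scale.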
MRS model
To estimate the MRS elements of the model we take a stepwise approach to model building. Starting with a 3-regime model, we fit it to the data and then interrogate the model using PPCs to find which aspects of the data the model fails to capture. We then add or remove elements of the model to attempt to improve the fit. This is repeated until a suitable model is found.
Model 1.
First consider the following 3-regime model,
$$ X_t = \begin{cases} B^{(1)}_t & \text{if } R_t = 1, \\ Y^{(3)}_t & \text{if } R_t = 3, \\ Y^{(5)}_t & \text{if } R_t = 5, \end{cases} \qquad \text{(Model 1)} $$
which is a reduced version of the model introduced in Section 2.2. Regime Y^{(3)}_t follows a shifted log-normal distribution and captures price spikes, and Y^{(5)}_t follows a shifted and reversed log-normal distribution, to capture price drops.

Fitting this model to the data, we notice two aspects where the model clearly fails to describe the data appropriately. The drop regime, Regime 5, does not capture large downward price movements, as it was designed to. Rather, it captures a period of low 'base' prices around the end of 2014 and into the first half of 2015, as shown in Figure 4. Also, the assumption that the spikes follow a shifted log-normal distribution appears to be violated, which is indicated by the posterior predictive checks in Figure 5. For brevity, Figure 5 includes just one representative posterior predictive check plot.

For these reasons a second spike regime is added to the model, one with a larger shifting parameter to capture extreme spikes, while also removing the drop regime.

Model 2.
This model therefore includes one base regime, B^{(1)}_t, two spike regimes, one for typical spikes, Y^{(3)}_t, and another for extreme spikes, Y^{(4)}_t, and no drop regime:
$$ X_t = \begin{cases} B^{(1)}_t & \text{if } R_t = 1, \\ Y^{(3)}_t & \text{if } R_t = 3, \\ Y^{(4)}_t & \text{if } R_t = 4, \end{cases} \qquad \text{(Model 2)} $$
Figure 4: The daily average wholesale electricity spot price for South Australia for the period from the 1st of January 2013 to the 31st of May 2019, quoted in $AUD per megawatt hour. The points highlighted in red were classified into the drop regime, Regime 5, of Model 1, using the following classification rule: if an observation at time t has the greatest posterior probability of being in the drop regime, that is, arg max_j P(R_t = j | x) = 5, then the observation is classified as coming from the drop regime. Clearly the drop regime is not capturing the behaviour it was designed to capture.
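The classification rule in the caption of Figure 4 can be sketched from MCMC output: the marginal posterior probabilities P(R_t = j | x) are estimated by the relative frequency of each regime across the sampled regime sequences, and each time point is assigned the arg max. The array name `R_samples` (of shape n_samples × T) is a hypothetical stand-in for the sampled sequences.

```python
import numpy as np

def classify_regimes(R_samples, n_regimes):
    """Classify each time point into the regime with the greatest
    marginal posterior probability, estimated from MCMC samples."""
    n_samples, T = R_samples.shape
    counts = np.zeros((n_regimes, T))
    for j in range(n_regimes):
        # counts[j, t] = number of samples with R_t = j
        counts[j] = (R_samples == j).sum(axis=0)
    return counts.argmax(axis=0)  # arg max_j P(R_t = j | x), for each t
```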
Figure 5: A representative posterior predictive check QQ plot of the spike process, Regime 3, in Model 1. Due to the curvature in the points on the plot, the assumption that spikes follow a log-normal distribution is not valid.
Figure 6: Representative posterior predictive check QQ plots of residuals from the posterior for Model 2. (Left) QQ plot of standardised residuals for the base regime, Regime 1, calculated using Equation (3). (Right) QQ plot of residuals for the shifted log-normal spikes, Regime 3. (Bottom) QQ plot of the residuals of the shifted log-normal spikes, Regime 4, for extreme spikes.

In Model 2, Y^{(3)}_t and Y^{(4)}_t follow shifted log-normal distributions.

In Figure 6, representative QQ plots from our PPCs for Model 2 are shown, which are used to assess the distributional assumptions for each regime. Figure 6 shows that the log-normal distribution for both of the spike regimes is not reasonable: the log-normal is too heavy-tailed. Figure 6 also indicates that the standardised residuals of the base regime are likely not normally distributed; the QQ plot suggests that the errors have excess kurtosis.

Furthermore, as Figure 7 shows, the variance of the residuals of Regime 1 increases over time for Model 2. In fact, there appears to be a distinct jump in the variance of the residuals around t = 1200, which corresponds to April 2016. One notable event that occurred around this time was the closure of South Australia's only coal generation facility [30]. It could be that this event disrupted the market, causing the jump in volatility, but more research would be needed to determine the cause of the increased volatility.
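The kind of variance jump visible in Figure 7 can be checked numerically by comparing the spread of residuals before and after the candidate change point. The sketch below does this on synthetic residuals; the change point t = 1200 matches the figure, but the two standard deviations are illustrative assumptions, not estimates from the paper.

```python
import numpy as np

rng = np.random.default_rng(42)
# synthetic residuals mimicking Figure 7: a variance jump at t = 1200
residuals = np.concatenate([rng.normal(0.0, 0.6, 1200),
                            rng.normal(0.0, 1.2, 800)])

sd_before = residuals[:1200].std()
sd_after = residuals[1200:].std()
ratio = sd_after / sd_before  # a ratio well above 1 indicates increased volatility
```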
Figure 7: A representative posterior predictive check plot of the residuals, calculated using Equation (3), of the base Regime 1 in Model 2, versus time. There is a distinct jump in the variance of residuals near t = 1200.

Model 3.
To investigate the findings from the interrogation of Model 2, Model 3 is proposed, with two base regimes, to address heteroskedasticity, and two shifted log-normal spike regimes:
$$ X_t = \begin{cases} B^{(1)}_t & \text{if } R_t = 1, \\ B^{(2)}_t & \text{if } R_t = 2, \\ Y^{(3)}_t & \text{if } R_t = 3, \\ Y^{(4)}_t & \text{if } R_t = 4. \end{cases} \qquad \text{(Model 3)} $$
Recall that we assume σ₁ < σ₂, so that the variance of base Regime 1 is less than the variance of base Regime 2.

Observing the QQ plots in Figure 8 that were produced as part of our PPCs, the distributional assumptions of Regimes 2, 3 and 4 are all reasonable. However, the distributional assumptions of Regime 1 are questionable: there appears to be too much mass in the tails of the data compared to a normal distribution.

To check for constant variance of residuals as a function of the most recently observed lagged value of the AR processes, PPCs are used and the standardised residuals from the base regimes are plotted against the most recently observed lagged values; a representative plot is provided in Figure 9 for Model 3. In Figure 9 (a) there is some evidence that the variance depends on the magnitude of lagged values, since the spread of the points increases with |x_{t−s_{t1}}|. In Figure 9 (b) there are no obvious signs that the variance in Regime 2
Figure 8: Representative posterior predictive check QQ plots of residuals for a single sample of parameters and hidden regime sequence from the posterior for Model 3. (Top-left) QQ plot of standardised residuals of the AR(1) base Regime 1, calculated using Equation (3). (Top-right) QQ plot of the standardised residuals of the AR(1) base Regime 2, calculated using Equation (3). (Bottom-left) QQ plot of residuals for the shifted log-normal spike Regime 3, for typical spikes. (Bottom-right) QQ plot of the residuals of the shifted log-normal spike Regime 4, for extreme spikes. The QQ plots for Regimes 2, 3 and 4 show little curvature in the points, suggesting the distributional assumptions of these regimes are reasonable. The QQ plot for Regime 1 shows some curvature, suggesting the distributional assumptions for this regime may still be incorrect.
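As a concrete illustration of the structure of Model 3, the sketch below simulates a 4-regime process: two AR(1) base regimes that are updated only when visited (matching the partially observed AR dynamics described above) and two shifted log-normal spike regimes. All parameter values and the transition matrix are illustrative assumptions, not the posterior estimates reported in this paper.

```python
import numpy as np

def simulate_model3(T, P, phi, sigma, q, mu, tau, seed=0):
    """Simulate a Model-3-style process: regimes 0,1 are AR(1) base regimes
    (with sigma[0] < sigma[1]); regimes 2,3 are shifted log-normal spike
    regimes with shift parameters q."""
    rng = np.random.default_rng(seed)
    R = np.zeros(T, dtype=int)
    X = np.zeros(T)
    B = np.zeros(2)  # last value of each base AR(1) process
    for t in range(1, T):
        R[t] = rng.choice(4, p=P[R[t - 1]])
        if R[t] < 2:   # base regime: the AR(1) process advances only when visited
            i = R[t]
            B[i] = phi[i] * B[i] + sigma[i] * rng.standard_normal()
            X[t] = B[i]
        else:          # spike regime: shifted log-normal draw
            j = R[t] - 2
            X[t] = q[j] + rng.lognormal(mu[j], tau[j])
    return X, R

# illustrative parameters (assumptions only)
P = np.array([[0.90, 0.05, 0.04, 0.01],
              [0.05, 0.90, 0.04, 0.01],
              [0.45, 0.45, 0.09, 0.01],
              [0.45, 0.45, 0.05, 0.05]])
X, R = simulate_model3(500, P, phi=[0.6, 0.6], sigma=[1.0, 5.0],
                       q=[50.0, 500.0], mu=[3.0, 4.0], tau=[1.0, 1.0])
```

Every observation generated by a spike regime lies above its shift parameter q, which is how the shifted log-normal regimes produce only large positive deviations.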
Figure 9: Representative posterior predictive check plots of the standardised residuals of the base regimes for Model 3, calculated using Equation (3), against the absolute value of the most recently observed lagged values, x_{t−s_{ti}}, of the same base regime. The lagged values x_{t−s_{ti}} are determined using the sample R* from the posterior and are the last observed value before time t from (a) base Regime 1, (b) base Regime 2. In panel (a) there is some evidence that the variance of the residuals depends on the magnitude of lagged values, since the spread of the points increases with |x_{t−s_{t1}}|. Note that in panel (a) there are more observed values between 20 and 70 than there are above 70 (approximately 70% of the points lie between 20 and 70), which makes this increase in spread with |x_{t−s_{t1}}| appear less pronounced than it actually is.

depends on the magnitude of lagged values.

In Figure 10 we plot the price series and a classification of prices into regimes according to Model 3. We classify a price into Regime i if the posterior probability of the price being generated by Regime i in Model 3 is the greatest of all the regimes; that is, we classify into Regime i if i = arg max_j P(R_t = j | x). In Figure 10 we see that the low-volatility base regime, Regime 1, mostly captures base prices from January 2013 to April 2016, and the high-volatility base regime, Regime 2, mostly captures base prices from April 2016 onwards. Thus, Model 3 captures the jump in market volatility in 2016 with Regime 2. Previously, non-constant variance of the base regime has been modelled by a constant elasticity of volatility (CEV) process, B_t = φB_{t−1} + σ|B_{t−1}|^γ ε_t [24]. However, this type of model is not an obvious candidate to capture the jump in volatility here, since the jump in volatility is more persistent than we would expect from the CEV model.
The CEV model could, however, be used to rectify the excess mass in the tails of the distribution of the residuals for Regime 1 in Model 3. Figure 10 also illustrates that the regime process, {R_t}, may not be time-homogeneous, or even Markovian, for Model 3.

Figure 10: These plots display the price process, with points highlighted in red if they have the greatest (marginal) posterior probability of belonging to that regime; (a) Regime 1, base regime 1; (b) Regime 2, base regime 2; (c) Regime 3, spikes; (d) Regime 4, extreme spikes.

In Table 2 we show the posterior mean estimates for the parameters of the models discussed above, as well as the natural logarithm of the Bayes factor with reference to Model 3. The Bayes factor provides decisive evidence that Model 3 is favourable over the others.

Table 2: Posterior mean parameter estimates.

                          Model 1   Model 2   Model 3
Base regime 1    φ
                 σ          351       306       153
Base regime 2    φ           -         -       0.585
                 σ           -         -       756
Spike regime     q           -        135       126
                 µ           -        4.58      4.26
                 σ           -        2.56      2.45
Drop regime      q         -50.2       -         -
                 µ, σ
p_ij
log-Bayes factor           149       159        0

For completeness, we also fitted a model with all 5 regimes, but found that the drop regime was completely unnecessary, since no points were classified as drops. Moreover, the Bayes factor for the 5-regime model also provided strong evidence that the drop regime was unnecessary: the natural log of the Bayes factor between Model 3 and the same model with an added drop regime was 13.07.

We also fit Model 3 with shifted gamma spike regimes instead of log-normal. For this model the PPCs (not shown) showed little evidence of an ill-fitting model; they were comparable to the PPCs of Model 3. However, the Bayes factor suggested that log-normally distributed spikes were better: the natural log of the Bayes factor of the model with gamma spikes compared to Model 3 was 8.06.

Discussion: All models are wrong, some are useful – George Box
Here we have used PPCs, as well as a classification of prices into regimes, to interrogate models and assess in which ways they fail to fit the data. Using these methods we are able to see clearly how models may misrepresent the data. For example, we are able to recognise that the models used in this paper are unable to completely capture the distributional properties of base prices. Furthermore, our PPCs pointed to a distinct change-point in volatility in the data, which led us to include a regime to capture this feature. Thus PPCs are an extremely valuable tool to inform us about which aspects of the models are wrong, and in which ways, so that we may make more informed decisions about the conclusions we draw from our modelling, or about how we use the models. We have also used Bayes factors to compare models. Statistical summaries of models, such as Bayes factors, are useful metrics for model comparison but, unlike PPCs, give no indication of whether models accurately capture the features of the data.

This modelling highlights another issue, which is that electricity markets are dynamic; they are constantly changing. For example, there is an increasing push for renewable resources and less reliance on traditional fossil fuels; there are regular policy and regulatory changes; and there are systematic influences from other commodity markets, such as gas markets. As a result, it is hard to describe these markets with models which cannot account for such changes. There has been some work to reconcile this, see [4, 33] for example, where exogenous factors are included in MRS models to capture changing market conditions. The novel methods presented in this paper – the PPCs, the Bayesian approach including the use of MCMC, and the integrated trend modelling process – can be used to build models with exogenous factors and infer their parameters.
We see this as an interesting direction for future research.

We should also look to challenge other modelling assumptions, such as the time-homogeneous Markovian nature of the occurrence of price spikes. The occurrence of price spikes as a point process has been studied, see [3, 8, 9, 18, 21], for example, but these results have not been used to inform MRS models for electricity prices. We believe that the methodologies presented in this paper can be used to develop non-time-homogeneous MRS models and to include exogenous factors in the switching parameters. We suggest this as an area for further research for the electricity price modelling community.
Conclusion

In this paper we have discussed and extended the methodologies for stochastic MRS and trend models for wholesale electricity prices. Trend modelling for electricity prices is not straightforward, since large price spikes can skew trend estimates. We suggest that the trend model is integrated with the stochastic model and all components estimated jointly. By modelling the trend in this way we are able to prevent large price spikes from affecting trend estimates, since the MRS model is able to identify spikes in a very natural way as part of the estimation procedure. Furthermore, combining the trend and stochastic models permits the use of statistical model selection criteria, such as the BIC and Bayes factors, for both the trend and stochastic components. We have also shown how Bayes factors and posterior predictive checks can be used to inform modelling choices and for model comparisons. We have shown how to include two different types of trend models within the stochastic component: a wavelet-based model and a cubic-spline-based model. Fitting the models to the South Australian data, we found that the choice of trend model did not greatly affect the ultimate conclusions. Therefore, due to its relative simplicity, we recommend the cubic-spline model over the wavelet-based model.

We have also shown how to estimate the model in a Bayesian setting. For electricity price modelling, model estimation is typically done via an approximation to the EM algorithm. By using a Bayesian approach we are able to estimate model parameters exactly. Furthermore, a Bayesian approach to estimation of the shifting parameters of spike distributions can alleviate some issues regarding identifiability of the model in the maximum-likelihood context. This approach also allows us to easily implement posterior predictive checks to assess model fit. The methods presented in this paper were demonstrated by an application to the South Australian wholesale electricity market.
We found that the cubic-spline-based trend model was favourable over the wavelet-based model. We also found that no drop regime was necessary; that two spike regimes may be required, one to capture typical spikes and one to capture the very extreme observations; and that there was a significant jump in the volatility of prices in 2016, which could not be explained by the stationary base regimes. More work is required to adequately capture this feature, and we suggest including exogenous factors in the model as a way to do this. The methodologies developed in this paper can be extended to include exogenous factors in the model.

Acknowledgements
All authors would like to acknowledge the financial support of the Australian Research Council Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS), and the first author would also like to acknowledge the support of the Australian Government Research Training Program (RTP) and The University of Adelaide Master of Philosophy (No Honours) scholarship.
References

[1] Australian Energy Market Operator. South Australian electricity report, 2017. Accessed: 2018-02-07.
[2] Australian Energy Market Operator. Data dashboard, 2018. Accessed: 2018-02-17.
[3] R. Becker, A. Clements, and W. Zainudin. Modeling electricity price events as point processes. The Journal of Energy Markets, 6, 2013.
[4] R. Becker, S. Hurn, and V. Pavlov. Modelling spikes in electricity prices. Economic Record, 83(263):371–382, 2007.
[5] F. E. Benth, J. S. Benth, and S. Koekebakker. Stochastic Modelling of Electricity and Related Markets. Advanced Series on Statistical Science and Applied Probability. World Scientific, Singapore; Hackensack, N.J., 2008.
[6] F. E. Benth, R. Kiesel, and A. Nazarova. A critical empirical study of three electricity spot price models. Energy Economics, 34(5):1589–1616, 2012.
[7] M. Bierbrauer, C. Menn, S. T. Rachev, and S. Trück. Spot and derivative pricing in the EEX power market. Journal of Banking & Finance, 31(11):3462–3485, 2007.
[8] T. Christensen, A. Hurn, and K. Lindsay. Forecasting spikes in electricity prices. International Journal of Forecasting, 28(2):400–411, 2012.
[9] T. Christensen, S. Hurn, and K. Lindsay. It never rains but it pours: Modeling the persistence of spikes in electricity prices. The Energy Journal, 30(1):25–48, 2009.
[10] M. Eichler and D. Türk. Fitting semiparametric Markov regime-switching models to electricity spot prices. Energy Economics, 36:614–624, 2013.
[11] C. Erlwein, F. E. Benth, and R. Mamon. HMM filtering and parameter estimation of an electricity spot price model. Energy Economics, 32(5):1034–1043, 2010.
[12] A. Eydeland and K. Wolyniec. Energy and Power Risk Management: New Developments in Modeling, Pricing, and Hedging. Wiley Finance Series. Wiley, 2003.
[13] A. Gelman, J. B. Carlin, H. S. Stern, and D. B. Rubin. Bayesian Data Analysis, Second Edition. Chapman and Hall/CRC, 2003.
[14] A. M. Gonzalez, A. M. S. Roque, and J. Garcia-Gonzalez. Modeling and forecasting electricity prices with input/output hidden Markov models. IEEE Transactions on Power Systems, 20(1):13–24, 2005.
[15] J. Gonzalez, J. Moriarty, and J. Palczewski. Bayesian calibration and number of jump components in electricity spot price models. Energy Economics, 65:375–388, 2017.
[16] J. D. Hamilton. A new approach to the economic analysis of nonstationary time series and the business cycle. Econometrica, 57(2):357–384, 1989.
[17] J. D. Hamilton. Analysis of time series subject to changes in regime. Journal of Econometrics, 45(1):39–70, 1990.
[18] R. Handika, C. Truong, S. Trück, and R. Weron. Modelling price spikes in electricity markets: the impact of load, weather and capacity. HSC Research Reports HSC/14/08, Hugo Steinhaus Center, Wroclaw University of Technology, 2014.
[19] J. S. Henneke, S. T. Rachev, F. J. Fabozzi, and M. Nikolov. MCMC-based estimation of Markov switching ARMA-GARCH models. Applied Economics, 43(3):259–271, 2011.
[20] B. M. Hill. The three-parameter lognormal distribution and Bayesian analysis of a point-source epidemic. Journal of the American Statistical Association, 58(301):72–84, 1963.
[21] R. Huisman. The influence of temperature on spike probability in day-ahead power prices. Energy Economics, 30(5):2697–2704, 2008.
[22] R. Huisman and C. de Jong. Option pricing for power prices with spikes. Energy Power Risk Management, 7(11):12–16, 2003.
[23] J. Janczura, S. Trück, R. Weron, and R. C. Wolff. Identifying spikes and seasonal components in electricity spot price data: A guide to robust modeling. Energy Economics, 38:96–110, 2013.
[24] J. Janczura and R. Weron. Regime-switching models for electricity spot prices: Introducing heteroskedastic base regime dynamics and shifted spike distributions. Pages 1–6, May 2009.
[25] J. Janczura and R. Weron. An empirical comparison of alternate regime-switching models for electricity spot prices. Energy Economics, 32(5):1059–1073, 2010.
[26] J. Janczura and R. Weron. Efficient estimation of Markov regime-switching models: An application to electricity spot prices. Advances in Statistical Analysis, 96(3):385–407, 2012.
[27] J. Janczura and R. Weron. Goodness-of-fit testing for the marginal distribution of regime-switching models with an application to electricity spot prices. Advances in Statistical Analysis, 97(3):239–270, 2013.
[28] N. L. Johnson, S. Kotz, and N. Balakrishnan. Continuous Univariate Distributions. Wiley, New York, 2nd edition, 1994.
[29] C.-J. Kim. Dynamic linear models with Markov-switching. Journal of Econometrics, 60(1-2):1–22, 1994.
[30] S. Letts. Port Augusta's coal-fired power station closes in South Australia. ABC News.
[31] Inference of Markovian-regime-switching models with application to South Australian electricity prices. Master's thesis in mathematical sciences, The University of Adelaide, 2018.
[32] T. D. Mount, Y. Ning, and X. Cai. Predicting price spikes in electricity markets using a regime-switching model with time-varying parameters. Energy Economics, 28(1):62–80, 2006.
[33] V. Norén. Modelling Power Spikes with Inhomogeneous Markov-switching Models. Master's thesis in mathematical sciences, Lund University, 2013.
[34] J. Nowotarski, J. Tomczyk, and R. Weron. Robust estimation and forecasting of the long-term seasonal component of electricity spot prices. HSC Research Reports HSC/12/06, Hugo Steinhaus Center, Wroclaw University of Technology, 2012.
[35] F. Regland and E. Lindström. Independent spike models: Estimation and validation. Finance a Uver: Czech Journal of Economics & Finance, 62(2):180–196, 2012.
[36] C. P. Robert and G. Casella. Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer, New York, 2nd edition, 2004.
[37] G. O. Roberts and J. S. Rosenthal. Optimal scaling for various Metropolis-Hastings algorithms. Statistical Science, 16(4):351–367, 2001.
[38] G. O. Roberts and J. S. Rosenthal. Examples of adaptive MCMC. Journal of Computational and Graphical Statistics, 18(2):349–367, 2009.
[39] J. Seifert and M. Uhrig-Homburg. Modelling jumps in electricity prices: theory and empirical evidence. Review of Derivatives Research, 10(1):59–85, 2007.
[40] S. Mallat. Chapter 7: Wavelet bases. In A Wavelet Tour of Signal Processing (Third Edition), pages 263–376. Academic Press, Boston, 2009.
[41] R. Weron. Heavy-tails and regime-switching in electricity prices. Mathematical Methods of Operations Research, 69(3):457–473, 2009.
[42] R. Weron. Electricity price forecasting: A review of the state-of-the-art with a look into the future. International Journal of Forecasting, 30(4):1030–1081, 2014.
Supplementary material
Preliminaries
Wavelet filtering is based on multiresolution analysis, where a function (or signal, or time series) is decomposed and represented as a sum of functions at various levels of resolution. That is, we have a function f(x) that we can represent as
$$ f(x) = \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} b_{j,k}\, \psi_{j,k}(x), $$
where b_{j,k}, j, k ∈ Z, are coefficients, and ψ_{j,k}, j, k ∈ Z, are translated and dilated wavelet functions. For fixed j, we may think of the terms involving the functions ψ_{j,k}, k ∈ Z, as representing the features of the function f at resolution j. We define the wavelet functions ψ_{j,k}, j, k ∈ Z, from a single mother wavelet, ψ(x), by translation, shifting the function by integers k ∈ Z, and dilation, scaling the function by 2^j:
$$ \psi_{j,k}(x) = 2^{j/2}\, \psi(2^j x - k). $$
To form a multiresolution analysis, we require a sequence of subsets {V_j} such that
$$ \{0\} \subset \cdots \subset V_{-1} \subset V_0 \subset V_1 \subset \cdots \subset L^2, $$
where L² is the set of square integrable functions (the set of functions g such that ∫_R |g(x)|² dx < ∞). Moreover, we also require that ∪_{j∈Z} V_j is dense in L², that ∩_{j∈Z} V_j = {0}, the scaling property g(x) ∈ V_j ⟺ g(2x) ∈ V_{j+1}, and the closure-under-translation property g(x) ∈ V_j ⟺ g(x + 2^{−j}) ∈ V_j.

We also need a set of orthonormal basis functions, φ(x − k), k ∈ Z, called scaling functions, which form a basis for V_0; therefore φ_{j,k}(x) = 2^{j/2} φ(2^j x − k), k ∈ Z, is a basis for V_j, for each j ∈ Z. Since V_0 ⊂ V_1,
$$ \phi(x) = \sum_{k=-\infty}^{\infty} h_k\, \phi(2x - k). \qquad (4) $$
Here we assume there are only a finite number, N, of non-zero terms in {h_k, k ∈ Z}, namely h_0 > 0, …, h_{N−1} >
0, which is the case for the scaling function related to the Daubechies wavelets.

The Daubechies wavelet family is a family of wavelet functions indexed by the number of vanishing moments which they possess. A wavelet has p vanishing moments if ∫_R ψ(x) x^ℓ dx = 0 for ℓ = 0, 1, …, p −
1. This property means that any function of the form f(x) = α_0 + α_1 x + ⋯ + α_{p−1} x^{p−1} can be exactly represented as f(x) = Σ_{k=−∞}^{∞} a_{0,k} φ(x − k).

We can also characterise the difference between the subspaces V_j and V_{j+1}. Let W_j be the orthogonal complement of V_j in V_{j+1}. Then the W_j, j ∈ Z, are orthogonal, and their direct sum is ⊕_{j∈Z} W_j = L². Now let ψ(x − k), k ∈ Z, be an orthonormal set of basis functions for W_0; therefore ψ_{j,k}(x) = 2^{j/2} ψ(2^j x − k), k ∈ Z, is an orthonormal basis for W_j. Since W_0 ⊂ V_1, ψ can be represented as
$$ \psi(x) = \sum_{k=-\infty}^{\infty} g_k\, \phi(2x - k). \qquad (5) $$
The function ψ is known as the mother wavelet, and the ψ_{j,k} are wavelet functions. Moreover, {ψ_{j,k}(x), j, k ∈ Z} is an orthonormal basis for L². That is, any function f in L² can be represented as
$$ f(x) = \sum_{j=-\infty}^{\infty} \sum_{k=-\infty}^{\infty} b_{j,k}\, \psi_{j,k}(x), $$
where the coefficients b_{j,k} = ∫_R f(x) ψ_{j,k}(x) dx are the inner products of f and ψ_{j,k}.

Approximation
A smoothed approximation to f may be obtained by removing all terms above resolution M −
1. That is, a smoothed approximation at resolution M, f̂_M, to f is
$$ \hat f_M(x) = \sum_{j=-\infty}^{M-1} \sum_{k=-\infty}^{\infty} b_{j,k}\, \psi_{j,k}(x). \qquad (6) $$
Since ⊕_{j<M} W_j = V_M, the approximation f̂_M lies in V_M. Because W_{M−1} and V_{M−1} are orthogonal and W_{M−1} ⊕ V_{M−1} = V_M, we may decompose f̂_M in terms of the bases for W_{M−1} and V_{M−1}:
$$ \hat f_M(x) = \sum_{k=-\infty}^{\infty} a_{M,k}\, \phi_{M,k}(x) = \sum_{k=-\infty}^{\infty} a_{M-1,k}\, \phi_{M-1,k}(x) + \sum_{k=-\infty}^{\infty} b_{M-1,k}\, \psi_{M-1,k}(x) = \hat f_{M-1}(x) + \sum_{k=-\infty}^{\infty} b_{M-1,k}\, \psi_{M-1,k}(x), \qquad (8) $$
where a_{M−1,k} = ∫_R f(x) φ_{M−1,k}(x) dx and b_{M−1,k} = ∫_R f(x) ψ_{M−1,k}(x) dx. Rearranging gives
$$ \hat f_{M-1}(x) = \hat f_M(x) - \sum_{k=-\infty}^{\infty} b_{M-1,k}\, \psi_{M-1,k}(x). \qquad (9) $$
Due to orthogonality properties we may write a_{M−1,k} = ∫_R f̂_{M−1}(x) φ_{M−1,k}(x) dx, and upon substituting Equation (9) we find that
$$ a_{M-1,k} = 2^{-1/2} \sum_{j=-\infty}^{\infty} a_{M,j}\, h_{j-2k}, \qquad (10) $$
where the {h_k} are the same as those in Equation (4). Similarly, we find a relation for b_{M−1,k} in terms of the a_{M,k}:
$$ b_{M-1,k} = 2^{-1/2} \sum_{j=-\infty}^{\infty} a_{M,j}\, (-1)^j h_{N-1-j+2k}. \qquad (11) $$
It can be shown that the filter coefficients which appear in Equation (5) are related to these by g_k = (−1)^k h_{N−1−k}.

Similar formulas for reconstructing the coefficients {a_{M,k}, k ∈ Z} from {a_{M−1,k}, k ∈ Z} and {b_{M−1,k}, k ∈ Z} can be derived. Using Equation (8) in a_{M,k} = ∫_R f̂_M(x) φ_{M,k}(x) dx, we find
$$ a_{M,k} = 2^{-1/2} \sum_{\ell=-\infty}^{\infty} a_{M-1,\ell}\, h_{k-2\ell} + 2^{-1/2} \sum_{\ell=-\infty}^{\infty} b_{M-1,\ell}\, (-1)^k h_{N-1-k+2\ell}. \qquad (12) $$

Discrete data and initialisation. Let x = {x_k}_{k=−∞}^{∞} be a discretely observed process that we want to approximate using wavelet filtering. We can consider the observations x_n as averages of a function f over the intervals (n − 1/2, n + 1/2). To proceed, we need to find a suitable function f with the property
$$ x_n = \int_{n-1/2}^{n+1/2} f(x)\, dx, $$
and then compute a_{N,k} = ∫_R f(x) φ_{N,k}(x) dx, after which we may proceed with the decomposition algorithm.
Theoretically, under certain conditions, the correct $a_{N,k}$ can be found; however, here we use a simpler and faster approximation method [40]. Consider
\[
f(x) = \sum_{k=-\infty}^{\infty} x_k\,\phi_{N,k}(x).
\]
Therefore
\[
a_{N,m} = \int_{\mathbb{R}} f(x)\,\phi_{N,m}(x)\,dx = \int_{\mathbb{R}} \sum_{k=-\infty}^{\infty} x_k\,\phi_{N,k}(x)\,\phi_{N,m}(x)\,dx = x_m. \qquad (13)
\]
Also, since $\int_{\mathbb{R}} \phi_{N,m}(x)\,dx = 1$, the coefficient $a_{N,m} = \int_{\mathbb{R}} f(x)\,\phi_{N,m}(x)\,dx$ can be viewed as a weighted average of $f(x)$ over the support of $\phi_{N,m}$. Hence, assuming $f$ is nicely behaved, we can approximate $x_m \approx f$ on the support of $\phi_{N,m}(x)$.

Decomposition and reconstruction as matrix operations on discrete data

To approximate (smooth) a time series by removing the high-resolution (high-frequency) components of the data, the relations in Equations (10) and (11), along with the initialisation in Equation (13), suggest a sequential approximation procedure. We may represent this sequential procedure by matrix operations, and this will help to illuminate the equivalence between wavelet filtering and ordinary least squares regression.

Assume a doubly-infinite signal $x = (\ldots, x_{t-1}, x_t, x_{t+1}, \ldots)'$, and let $H$ be the infinite-dimensional matrix
\[
H = 2^{-1/2}
\begin{pmatrix}
\ddots & \ddots & & & & & \\
\cdots & h_0 & h_1 & h_2 & \cdots & h_{N-1} & 0 & \cdots \\
\cdots & 0 & 0 & h_0 & h_1 & \cdots & h_{N-1} & \cdots \\
& & & \ddots & \ddots & & &
\end{pmatrix},
\]
in which each row contains the filter $(h_0, \ldots, h_{N-1})$ shifted two places to the right relative to the row above, and denote $a_j = (\ldots, a_{j,k-1}, a_{j,k}, a_{j,k+1}, \ldots)'$. Note that $H$ has the property $HH' = I$, the identity. Then, to find the coefficients $a_{N-J}$, we initialise the algorithm with $a_{N,k} = x_k$ and compute
\[
a_{N-J} = H a_{N-J+1} = H^J a_N = H^J x.
\]
Defining $G$ analogously, with the coefficients $g_0, \ldots, g_{N-1}$ in place of $h_0, \ldots, h_{N-1}$, we may similarly calculate the coefficients $b_j = (\ldots, b_{j,k-1}, b_{j,k}, b_{j,k+1}, \ldots)'$, via
\[
b_{N-J} = G a_{N-J+1} = G H^{J-1} a_N = G H^{J-1} x.
\]
Reconstruction can also be written in terms of the operators $G$ and $H$:
\[
a_M = H' a_{M-1} + G' b_{M-1}.
\]
This reconstruction formula can be used to find an estimate of the smoothed signal.
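A finite sketch of the operators $H$ and $G$, again assuming the Haar filter and circulant (wrap-around) boundaries instead of the symmetric padding discussed for finite sequences; it checks $HH' = I$, exact reconstruction, and the agreement of wavelet smoothing with a least squares regression on the columns of the transposed analysis operator:

```python
import numpy as np

# Circulant stand-ins for the operators H and G (Haar filter, N = 2).
# Wrap-around edges are an assumption made for brevity; the paper instead
# uses symmetric padding.
def analysis_ops(T):
    H, G = np.zeros((T // 2, T)), np.zeros((T // 2, T))
    h = np.array([1.0, 1.0]) / np.sqrt(2)
    g = np.array([1.0, -1.0]) / np.sqrt(2)
    for row in range(T // 2):
        for m in range(2):
            H[row, (2 * row + m) % T] = h[m]
            G[row, (2 * row + m) % T] = g[m]
    return H, G

T = 16
H1, G1 = analysis_ops(T)
H2, G2 = analysis_ops(T // 2)

assert np.allclose(H1 @ H1.T, np.eye(T // 2))    # rows of H orthonormal: HH' = I
assert np.allclose(H1 @ G1.T, 0)                 # V and W subspaces orthogonal

x = np.random.default_rng(0).normal(size=T)
a1, b1 = H1 @ x, G1 @ x
assert np.allclose(H1.T @ a1 + G1.T @ b1, x)     # a_M = H'a_{M-1} + G'b_{M-1}

W = H2 @ H1                                      # two levels of smoothing
a_ols, *_ = np.linalg.lstsq(W.T, x, rcond=None)  # OLS regression of x on W'
assert np.allclose(a_ols, W @ x)                 # same as the pyramid output
f_smooth = W.T @ a_ols                           # smoothed signal
```

The final two assertions preview the regression view developed later: because the rows of the analysis operators are orthonormal, the least squares coefficients coincide with the pyramid-algorithm coefficients.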
Let $\hat f_{N-J}$ be an approximation to $f$; then we may evaluate $\hat f_{N-J}$ as
\[
\hat f_{N-J} = \left(H^J\right)' a_{N-J} = \left(H^J\right)' H^J x.
\]

Finite sequences and edge effects

In reality, $x$ is only finitely long, which can introduce issues at the boundaries. Suppose $x = (x_1, \ldots, x_T)' = (a_{N,1}, \ldots, a_{N,T})'$, and consider an analysis based on the Daubechies wavelet of order $p$. If we wish to produce an approximation to $x$ at level $N-J$, there will be exactly
\[
q = \left\lceil \frac{T + 2(p-1)(2^J - 1)}{2^J} \right\rceil
\]
scaling functions at level $N-J$ whose support overlaps with the support of the dataset, $(1, 2, \ldots, T)$. In total, the support of these scaling functions is of length $L = 2(p-1)(2^J - 1) + 2^J q$. To deal with this we extend the dataset to the appropriate length before applying truncated versions of the operators $H$ and $G$. In this work, we use a symmetric padding, where the dataset is reflected at $t = 1$ and $t = T$. In particular, we place $2(p-1)(2^J - 1)$ extra terms at the beginning of the dataset, and $L - T - 2(p-1)(2^J - 1)$ terms at the end of the dataset.

We now define the truncated versions of $H$ and $G$ to apply to this dataset. First, set $s_1 = L$ and $s_2 = \lfloor (L-1)/2 \rfloor + p$, then recursively define $s_{j+1} = \lfloor (s_j - 1)/2 \rfloor + p$, for $j = 2, 3, \ldots, J$. Then let $H_{N-j}$ and $G_{N-j}$ be the $s_{j+1} \times s_j$ matrices
\[
H_{N-j} = \begin{pmatrix}
h_0 & h_1 & h_2 & \cdots & h_{N-1} & 0 & \cdots & & 0 \\
0 & 0 & h_0 & h_1 & \cdots & h_{N-1} & 0 & \cdots & 0 \\
\vdots & & & \ddots & & & \ddots & & \vdots \\
0 & \cdots & & & 0 & h_0 & \cdots & h_{N-2} & h_{N-1}
\end{pmatrix},
\]
with $G_{N-j}$ of identical structure but with the coefficients $g_0, \ldots, g_{N-1}$ in place of $h_0, \ldots, h_{N-1}$. Thus we may compute the coefficients $a_{N-J}$ and $b_{N-J}$ via
\[
a_{N-J} = H_{N-J} H_{N-J+1} \cdots H_{N-1}\, \tilde x, \quad \text{and} \quad b_{N-J} = G_{N-J} H_{N-J+1} \cdots H_{N-1}\, \tilde x,
\]
where $\tilde x$ is the extended dataset. Similarly to before, the smoothed version of the signal is given by
\[
\hat f_{N-J} = \left(H_{N-J} H_{N-J+1} \cdots H_{N-1}\right)' H_{N-J} H_{N-J+1} \cdots H_{N-1}\, \tilde x.
\]

An ordinary least squares regression view of wavelet filtering

Wavelet filtering can also be viewed as an ordinary least squares regression problem. Consider the regression model
\[
\tilde x = \left(H_{N-J} H_{N-J+1} \cdots H_{N-1}\right)' a + \sigma \varepsilon_t,
\]
where $\varepsilon_t \sim N(0, 1)$. The ordinary least squares estimate of $a$ is
\[
\hat a = \left[\left(H_{N-J} \cdots H_{N-1}\right)\left(H_{N-J} \cdots H_{N-1}\right)'\right]^{-1} \left(H_{N-J} \cdots H_{N-1}\right) \tilde x
= \left[H_{N-J} \cdots H_{N-1} H_{N-1}' \cdots H_{N-J}'\right]^{-1} \left(H_{N-J} \cdots H_{N-1}\right) \tilde x
= H_{N-J} \cdots H_{N-1}\, \tilde x = a_{N-J},
\]
due to the orthogonality of the columns of the $H_j'$. The estimated line of best fit to the data is given by
\[
\left(H_{N-J} \cdots H_{N-1}\right)' \hat a = \left(H_{N-J} \cdots H_{N-1}\right)' a_{N-J} = \hat f_{N-J}.
\]
Since the wavelet functions at resolution $N-J$ and above are orthogonal to the scaling functions at resolution $N-J$, we do not need to include columns in the design matrix corresponding to wavelet functions in the regression model when all we wish to estimate is a smoothed trend. That is, the inclusion of wavelet components in the model would not affect the estimate $\hat a$, and therefore would not affect the estimate of the smooth trend.

A wavelet-filtering-inspired model for trends in a Bayesian setting

In this paper, we consider a Bayesian setting to infer model parameters. Hence, we wish to translate the wavelet trend model described above into a Bayesian setting. With the view of wavelet filtering as a regression model, this becomes relatively simple. We include the columns of the design matrix $W_J' = (H_{N-J} H_{N-J+1} \cdots H_{N-1})'$ in a design matrix for the trend component of our model. Specifically, we consider a model constructed using the Daubechies-8 wavelet, with $J = 8$ levels of smoothing.

A cubic spline is a piecewise cubic polynomial which is continuous and has continuous first and second derivatives. Let $x = (x_1, \ldots, x_T)$ be a dataset and $\eta_1, \ldots, \eta_k$ be knots, where $\eta_1 < \eta_2 < \cdots < \eta_k$.
The knots define the intervals over which cubic splines are piecewise cubic, $[\eta_1, \eta_2], [\eta_2, \eta_3], \ldots, [\eta_{k-1}, \eta_k]$. Within each of these intervals, cubic splines are infinitely differentiable. At the boundaries, however, we place restrictions on the basis functions so that the cubic splines have continuous first and second derivatives at the knots.

We choose to use B-splines to construct the basis functions which define the cubic splines. There are many equivalent bases that can be used for representing cubic splines, but B-splines are numerically stable, since they avoid calculating powers of large, or small, numbers.

Before constructing our cubic B-spline basis functions, we first need to define an augmented set of knots, $\tau_1, \tau_2, \ldots, \tau_{k+2M}$. Set $\tau_1 = \tau_2 = \cdots = \tau_M < 0$; set $\tau_M < \tau_{M+1} = \eta_1$, $\tau_{M+2} = \eta_2$, \ldots, $\tau_{M+k} = \eta_k < \tau_{M+k+1}$; and set $T < \tau_{M+k+1} = \tau_{M+k+2} = \cdots = \tau_{2M+k}$.

B-splines are constructed recursively. Let $B_{i,m}(x)$ be the $i$th basis function of an $m$-spline, $m \le M$, with
\[
B_{i,1}(x) = \begin{cases} 1, & \tau_i \le x < \tau_{i+1}, \\ 0, & \text{otherwise}, \end{cases}
\qquad i = 1, \ldots, k + 2M - 1.
\]
Then $B_{i,m}(x)$ can be constructed recursively via
\[
B_{i,m}(x) = \frac{x - \tau_i}{\tau_{i+m-1} - \tau_i}\, B_{i,m-1}(x) + \frac{\tau_{i+m} - x}{\tau_{i+m} - \tau_{i+1}}\, B_{i+1,m-1}(x),
\]
for $i = 1, \ldots, k + 2M - m$. We are interested in cubic splines; hence the basis functions we require are $B_{i,4}(x)$ for $i = 1, \ldots, k+4$. Note that we have some repeated knots, which would mean dividing by 0. To avoid this, we define $B_{i,1} = 0$ when $\tau_i = \tau_{i+1}$.

For a cubic spline with $k$ knots, $\eta_1, \ldots, \eta_k$, there are $k+4$ parameters to be estimated using a B-spline basis. The number and location of the knots need to be chosen, which is a model selection problem. To simplify matters, we use 13 equally spaced knots in this work, which roughly corresponds to 1 knot every 180 days; we also set $\eta_0 = -1$ and $\eta_{k+1} = T + 1$.

The observations in the dataset used here are taken at times $0, 1, \ldots, T$; hence the design matrix looks like
\[
C' = \begin{pmatrix}
B_{1,4}(0) & B_{2,4}(0) & \cdots & B_{k+4,4}(0) \\
B_{1,4}(1) & B_{2,4}(1) & \cdots & B_{k+4,4}(1) \\
\vdots & \vdots & \ddots & \vdots \\
B_{1,4}(T) & B_{2,4}(T) & \cdots & B_{k+4,4}(T)
\end{pmatrix}.
\]

Another element of the trend component is the short-term periodic component. This component captures a day-of-week effect observed in prices. The model we use is
\[
g_t = \beta_{\text{Mon}}\, I(t \in \text{Mon}) + \beta_{\text{Tue}}\, I(t \in \text{Tue}) + \cdots + \beta_{\text{Sun}}\, I(t \in \text{Sun}),
\]
where $\beta_{\text{Mon}}, \beta_{\text{Tue}}, \ldots, \beta_{\text{Sun}}$ are parameters to be estimated, relating to average prices on Monday, Tuesday, \ldots, Sunday, respectively. In a regression context, this model can be encoded as the columns of the design matrix $M' = \mathbf{1} \otimes I_7$, where $\mathbf{1}$ is a vector of ones and $I_7$ is the $7 \times 7$ identity matrix. That is, $M'$ contains the identity matrix repeated as many times as required so that $M'$ has $T$ rows.

Let $Z$ be the design matrix of some trend model, and $z_k$ be the row of $Z$ corresponding to the time point $t_k$. Given a vector of parameters $\gamma$ for the trend model, the estimate of the trend at time $t_k$ is therefore $z_k \gamma$.

For the cubic spline model, the matrix $Z$ is simple to construct given the work above: $Z = [M' \,|\, C']$, that is, the matrices $M'$ and $C'$ concatenated. The wavelet-based model is slightly trickier, since we need to account for edge effects using padding. The design matrix for the wavelet trend component includes extra rows at the top and the bottom, corresponding to padding. Specifically, since we use the Daubechies-8 wavelet and $J = 8$ levels of smoothing, there are $\ell_1 = 3570$ rows at the top, and $\ell_2 = 3802$ rows at the bottom, of the matrix $W'$ which align with padding. So that we may concatenate the matrices $M'$ and $W'$, we pad the columns of $M'$ via the same symmetric extension we use to pad the dataset. The full design matrix for the wavelet-based trend model is $Z = [M^* \,|\, W']$, where $M^*$ is the padded version of $M'$.
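The B-spline recursion can be sketched directly. The knot vector below is a small hypothetical example ($k = 3$ interior knots, $M = 4$), not the 13-knot configuration used in the paper; the final assertion checks the partition-of-unity property of cubic B-splines on a clamped knot vector.

```python
import numpy as np

# Cox-de Boor recursion for the B-spline basis, matching the recurrence in
# the text, with terms on repeated knots dropped (the 0/0 convention).
def bspline_basis(i, m, tau, x):
    """B_{i,m}(x); i is 0-based here, where the text uses 1-based indexing."""
    if m == 1:
        return 1.0 if tau[i] <= x < tau[i + 1] else 0.0
    left = right = 0.0
    if tau[i + m - 1] > tau[i]:
        left = (x - tau[i]) / (tau[i + m - 1] - tau[i]) \
            * bspline_basis(i, m - 1, tau, x)
    if tau[i + m] > tau[i + 1]:
        right = (tau[i + m] - x) / (tau[i + m] - tau[i + 1]) \
            * bspline_basis(i + 1, m - 1, tau, x)
    return left + right

# Hypothetical augmented knots: M = 4 repeated boundary knots around
# k = 3 interior knots at 1, 2, 3, on the interval [0, 4).
tau = np.array([0, 0, 0, 0, 1, 2, 3, 4, 4, 4, 4], dtype=float)
k, M = 3, 4
x = 1.7
vals = [bspline_basis(i, M, tau, x) for i in range(k + M)]
assert abs(sum(vals) - 1.0) < 1e-9   # cubic B-splines sum to one
```

A design matrix like $C'$ is then built by evaluating all $k + M$ cubic basis functions at each observation time, one row per time point.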
Only when estimating parameters corresponding to coefficients of scaling functions in the wavelet model do we use the extra rows in the design matrix corresponding to padding. In this setting, padding can be thought of as a kind of regularisation, keeping the model from overfitting at the boundaries.

The Metropolis-Hastings (MH) algorithm is a very popular MCMC algorithm. For a target density $p(\theta \mid x) \propto p(x \mid \theta)\,\pi(\theta)$, the Metropolis-Hastings algorithm constructs a Markov chain that converges to the target as follows.

1. Propose that the chain move to a new position $\theta'$, which is drawn randomly from a proposal distribution centred at the current state of the chain, $\theta^{(n)}$.

2. Calculate the ratio
\[
\lambda(\theta', \theta^{(n)}) = \frac{p(x \mid \theta')\,\pi(\theta')\,R(\theta^{(n)} \mid \theta', s)}{p(x \mid \theta^{(n)})\,\pi(\theta^{(n)})\,R(\theta' \mid \theta^{(n)}, s)},
\]
where $p(x \mid \theta)$ is the likelihood evaluated with parameter $\theta$, $\pi(\theta)$ is the prior, and $R(\cdot \mid \theta, s)$ is the proposal density centred at $\theta$.

3. Generate a random number $u \sim U([0, 1])$. If $u < \lambda(\theta', \theta^{(n)})$, set $\theta^{(n+1)} = \theta'$; else, set $\theta^{(n+1)} = \theta^{(n)}$.

The choice of proposal distribution for the Metropolis-Hastings algorithm is theoretically arbitrary, up to some not very restrictive sufficient conditions (see [36, Chapter 7] for sufficient conditions), in that the chain converges to the stationary distribution no matter what proposal distributions we use. However, if the chain is to reach stationarity in a reasonable amount of time, we must choose 'good' proposal distributions. This is the justification for the block structure and adaptive steps in our MCMC implementation.

The block Metropolis-Hastings algorithm can be seen as an extension of the Gibbs sampler: the sequential update structure of a Gibbs sampler is used, but instead of sampling parameters directly from their conditional posteriors, Metropolis-Hastings-type steps are used to update the chain.
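The three Metropolis-Hastings steps above can be sketched on a toy problem. The target below is a hypothetical unnormalised posterior $p(\theta \mid x) \propto e^{-\theta^2/2}$, and the symmetric random-walk proposal makes the $R$ terms in $\lambda$ cancel:

```python
import numpy as np

# Random-walk Metropolis-Hastings on a toy target (standard normal).
# Working on the log scale, the acceptance test u < lambda becomes
# log(u) < log-lambda, which avoids numerical underflow.
rng = np.random.default_rng(0)

def log_target(theta):            # unnormalised log posterior
    return -0.5 * theta ** 2

s, theta, chain = 1.0, 0.0, []
for _ in range(20000):
    theta_prop = rng.normal(theta, s)                     # step 1: propose
    log_lam = log_target(theta_prop) - log_target(theta)  # step 2: ratio
    if np.log(rng.uniform()) < log_lam:                   # step 3: accept/reject
        theta = theta_prop
    chain.append(theta)

samples = np.array(chain[5000:])                          # discard burn-in
assert abs(samples.mean()) < 0.1
assert abs(samples.std() - 1.0) < 0.1
```

The proposal scale $s$ here is fixed; the adaptive scheme described later tunes such scales automatically towards a target acceptance rate.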
To implement a Gibbs sampler, the parameter vector is partitioned into blocks: $\theta = (\theta_1, \theta_2, \ldots, \theta_\ell)'$. The hidden regime sequence is also partitioned into blocks; each element of the regime sequence, $R_t$, is a block, $R = (R_1, \ldots, R_T)'$. The Gibbs sampler iterates block-by-block, sequentially updating each block by sampling from the conditional posterior distributions $p(\theta_i \mid \theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_\ell, x, R)$ for $i = 1, \ldots, \ell$, and $p(R_t \mid x, \theta, R_1, \ldots, R_{t-1}, R_{t+1}, \ldots, R_T)$ for $t = 1, \ldots, T$. Once each block is updated, the Gibbs sampler goes back to the first block and repeats the sequential update process. We refer to a single update of all the blocks as a sweep of the algorithm.

In some cases the Gibbs conditional distributions will be expensive to compute, or even intractable. A solution is to propose updates to each block using a Metropolis-Hastings-type update rather than a Gibbs update. Hence the conditional posterior distributions do not need to be computed, and the performance of the algorithm may be improved. For the models considered here, Metropolis-Hastings updates for the regime sequence are significantly faster than constructing and sampling from the conditional posteriors $p(R_t \mid x, \theta, R_1, \ldots, R_{t-1}, R_{t+1}, \ldots, R_T)$. In contrast, sampling from the conditional distributions $p(p_{i1}, \ldots, p_{iN} \mid x, \theta, R^{(n)}) = p(p_{i1}, \ldots, p_{iN} \mid R^{(n)})$ is relatively efficient.

For the models in this paper, the parameter vector $\theta$ contains the parameters for the trend component, $\gamma$, as well as appropriate subsets of the parameters $\phi_i, \sigma_i$, $i = 1, 2$; $\mu_i, \sigma_i, q_i$, $i = 3, 4, 5$; and $p_{ij}$, $i, j \in S$, for the transition matrix of the regimes, depending on which regimes are included in the model. Each individual element of $\gamma$, and each individual parameter $\phi_i, \sigma_i$, $i = 1, 2$, $\mu_i, \sigma_i, q_i$, $i = 3, 4, 5$, and $(p_{i1}, \ldots, p_{iN})$, is also a block.
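A minimal sketch of the sweep structure: a two-block Gibbs sampler on a toy bivariate normal target (correlation $\rho$) whose conditionals are available in closed form. The paper's models substitute Metropolis-Hastings-type updates where such conditionals are expensive or intractable.

```python
import numpy as np

# Two-block Gibbs sampler for a bivariate normal with correlation rho.
# Each "sweep" updates every block once from its conditional posterior:
# theta1 | theta2 ~ N(rho * theta2, 1 - rho^2), and symmetrically.
rho = 0.8
rng = np.random.default_rng(3)
th1, th2 = 0.0, 0.0
draws = []
for sweep in range(20000):
    th1 = rng.normal(rho * th2, np.sqrt(1 - rho ** 2))  # update block 1
    th2 = rng.normal(rho * th1, np.sqrt(1 - rho ** 2))  # update block 2
    draws.append((th1, th2))

samples = np.array(draws[2000:])                        # discard burn-in
assert abs(np.corrcoef(samples.T)[0, 1] - rho) < 0.05
```

Replacing either closed-form draw with a propose-accept-reject step turns this into the block Metropolis-within-Gibbs structure used for the regime-switching models.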
Since $p_{iN} = 1 - \sum_{k \in S,\, k \neq N} p_{ik}$, we need to infer only $N(N-1)$ transition probabilities.

Proposal distributions

At each sweep of the algorithm, the parameters $p_{ij}$ are updated using a Gibbs-type update. That is, for each $i \in S$, we sample $p_{i1}, \ldots, p_{iN}$ from [19]
\[
p(p_{i1}, \ldots, p_{iN} \mid R^{(n)}) \sim \text{Dirichlet}(\eta_{i,1} + 1, \ldots, \eta_{i,N} + 1), \qquad (14)
\]
where $\eta_{i,j}$ is the number of transitions from state $i$ to state $j$ in the current regime sequence $R^{(n)}$.

At each sweep of the algorithm, 10% of indices from the set $I = \{1, 2, \ldots, T\}$ are randomly selected, and for each sampled index $t$ we update $R_t$ as follows, using a uniform proposal distribution. Suppose the current regime sequence is $R^{(n)}$ and $R^{(n)}_t = i$. Uniformly propose a move to any other state $R'_t = j \neq i$, and accept or reject this move with the appropriate probability as given by the Metropolis-Hastings algorithm. By only choosing 10% of the elements of the hidden sequence to update at each sweep of the algorithm, rather than updating all $T$ elements, we greatly decrease the running time of the algorithm, and yet this does not significantly affect the mixing of the MCMC chain.

At each sweep of the algorithm, given the current state of the MCMC chain, $\theta^{(n)}$, the $k$th block of the parameter vector is updated by proposing a move from a normal distribution with mean $\theta^{(n)}_k$ and variance $s_k^2$. The move is accepted or rejected according to a Metropolis-Hastings-type acceptance probability. The proposal scales $s_k$ are found using an adaptive algorithm, which we discuss next.

Adaptive proposal distributions

Roberts and Rosenthal [37, 38] provide an adaptive algorithm to find effective scale parameters, $s_k$, for an MCMC algorithm similar to ours. In [37], Roberts and Rosenthal prove that, for an idealised version of our block Metropolis-Hastings algorithm, the optimal acceptance rate is 0.44.
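The Gibbs-type update in Equation (14) can be sketched as follows; the regime sequence here is simulated stand-in data, not output from the model.

```python
import numpy as np

# Gibbs update for the rows of the transition matrix (Equation (14)):
# count transitions eta_{i,j} in the current regime sequence, then draw
# each row from a Dirichlet with those counts plus one.
rng = np.random.default_rng(42)
N = 3                                        # number of regimes (illustrative)
R = rng.integers(0, N, size=1000)            # stand-in regime sequence R^(n)

counts = np.zeros((N, N))
for t in range(len(R) - 1):
    counts[R[t], R[t + 1]] += 1              # eta_{i,j}

P = np.vstack([rng.dirichlet(counts[i] + 1.0) for i in range(N)])
assert np.allclose(P.sum(axis=1), 1.0)       # each row is a valid distribution
```

Because the Dirichlet is conjugate to the multinomial transition counts, this draw is exact and cheap, which is why these blocks use Gibbs updates while other blocks use Metropolis-Hastings-type updates.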
In [38] they provide an example of an adaptive scheme, which automatically adjusts the parameters $s_k$ in the algorithm to asymptotically reach the optimal acceptance rate while maintaining the necessary ergodicity and convergence properties, for the same idealised problem. The problem they considered has a posterior distribution that is multivariate normal, whereas for our problems it is unlikely that the posterior distributions are normal, and in addition we also have to sample the hidden regime sequence. Nonetheless, this adaptive scheme works well for our purposes.

Our implementation of their adaptive scheme is as follows. For each parameter we initialise the standard deviation of the proposal to $s_k = 1$ and begin our block Metropolis-Hastings algorithm. After the $n$th batch of 50 iterations of the block Metropolis-Hastings algorithm, we update $s_k$ by multiplying by $\exp(\delta(n))$ if the acceptance rate over the batch is above 0.44, or by $\exp(-\delta(n))$ if the acceptance rate is less than 0.44. Following the ideas in [38], we define
\[
\delta(n) = \min\left(0.01,\ \frac{1}{\sqrt{n}}\right).
\]
Note that, to satisfy the conditions for convergence of this algorithm outlined in [38], we also need to specify a bound