Accurate Kernel Learning for Linear Gaussian Markov Processes using a Scalable Likelihood Computation
Stijn de Waele
ExxonMobil Research and Engineering Company, Annandale, New Jersey, USA

Abstract
We report an exact likelihood computation for Linear Gaussian Markov processes that is more scalable than existing algorithms for complex models and sparsely sampled signals. Better scaling is achieved through elimination of repeated computations in the Kalman likelihood, and by using the diagonalized form of the state transition equation. Using this efficient computation, we study the accuracy of kernel learning using maximum likelihood and the posterior mean in a simulation experiment. The posterior mean with a reference prior is more accurate for complex models and sparse sampling. Because of its lower computation load, the maximum likelihood estimator is an attractive option for more densely sampled signals and lower order models. We confirm estimator behavior in experimental data through their application to speleothem data.
1. Introduction
Gaussian processes (GPs) are used in a wide range of applications in science and engineering (Rasmussen & Williams, 2006). A GP model with a fixed kernel suffices for some applications. However, typically it is required to learn the kernel from available data. Furthermore, there is a trend towards more complex kernel parametrizations beyond the basic models with a few parameters, such as the Matérn or squared exponential kernels. An example of this trend is deep kernel learning, where a neural network is used as part of the model (Wilson et al., 2016). This trend agrees well with results from the domain of time series analysis, where it was found that simple models are adequate for describing some processes, but models of moderate to high complexity are typically preferred, and often critical to solving the problem at hand (Prado & West, 2010).
In many applications of Gaussian processes, kernel learning is performed from irregularly spaced samples, either by experimental design, e.g. in Bayesian optimization (Ghahramani, 2015), or as a result of the sampling process, e.g. in climate data (Sinha et al., 2015) and exoplanet detection in astronomy (Khan et al., 2017). While irregular sampling offers the benefit of spectral estimation beyond the average sampling frequency (Broersen, 2007), this type of sampling also poses challenges for the statistical accuracy of estimated kernels (Broersen, 2010).

Motivated by these trends and challenges, we investigate the quality of kernel learning algorithms from irregularly sampled data. Linear Gaussian Markov models are an excellent practical choice for the parametrization of Gaussian processes with a one-dimensional index set (or: time series), because of their wide applicability, computational convenience, and availability of estimation algorithms (Prado & West, 2010; Murphy, 2012; Rasmussen & Williams, 2006). As motivated in section 2.2, we focus on two Markov models: the linear Gaussian state-space model and the autoregressive model.

We compare the frequentist maximum likelihood (ML) estimate to the Bayesian posterior mean. We do not use strong priors, so that both the ML and Bayesian estimates are based only on the model structure and the data. In this way, the Bayesian estimate can be used as a direct replacement for ML where desired. However, this work is also relevant for the situation where we use stronger priors, because in this situation we still have to decide between using the more efficient posterior mode estimate and other point estimates such as the posterior mean. This choice is analogous to the choice between the ML estimate and the posterior mean as discussed in this paper.

The benefit of the ML estimator is that it is considerably more computationally efficient than Bayesian estimates. Many theoretical results show exact equivalence between ML and Bayesian point estimates with uninformative priors for certain model structures, as well as the general result that the estimates converge in the limit of large samples (Bayarri & Berger, 2004). Furthermore, the ML estimate, unlike the posterior mode, is independent of the chosen model parametrization, a desirable property that it shares with preferred Bayesian point estimates such as the posterior mean. However, (Broersen, 2010) reports that for the kernel learning problem the ML estimator can perform poorly under certain conditions, resulting in very inaccurate models, which manifests in the kernel spectrum as spurious peaks.

In an extensive simulation study, we quantify this problem under various estimation conditions, and show that the posterior mean estimate is successful in reducing the spurious peaks, resulting in more accurate models. The simulation study is a key contribution of this paper, because only a simulation study allows a meaningful comparison between the estimators. The existing asymptotic theory does not describe the significant difference in performance between the estimators that is observed in practice.

Computational complexity is a significant challenge for kernel learning and is therefore an active area of research. For example, (Dong et al., 2017) reports an algorithm for an approximate likelihood computation that scales linearly with the number of observations.
As reported in an early paper by Jones (Jones, 1980), the exact likelihood for linear Gaussian state-space models can be computed with the same linear scaling; this result has been used for GP kernel learning using maximum likelihood (Sarkka et al., 2013; Gilboa et al., 2015). In many parameter estimation problems, using the exact likelihood as opposed to an approximation leads to more accurate and robust estimators, and is therefore preferable. For example, for the autoregressive model, using the approximate likelihood can result in non-stationary models, whereas the exact ML estimator is guaranteed to result in stationary models (Broersen, 2006). We introduce an improvement of the exact likelihood computation that further reduces computational cost and improves scalability with model complexity.

While we focus on parameter estimation from irregularly sampled data, the presented algorithms can directly be used for the problem of missing data. The reason is that the problem of irregular sampling is addressed by rounding sampling times to a fine regular grid, effectively converting the problem into a regularly sampled signal with a large fraction of missing data.

Section 2 provides definitions of the considered models and the error metric. Section 3 introduces the likelihood computation with improved scalability. Section 4 describes the estimators based on this likelihood computation. Crucial to the quality evaluation of the estimators, section 5 describes the simulation experiments comparing the proposed estimators. Finally, the algorithms are applied to experimental speleothem data in section 6.
2. Definitions
We consider a zero mean, stationary Gaussian Markov process $y_n \in \mathbb{R}^m$ over the discrete one-dimensional index (time) variable $n \in \mathbb{Z}$. The available observations of this process are $N_a$ irregularly spaced observations $y_{n_i}$ at indices $n_i$, taken over a measurement interval of length $N$. The set of available index values is denoted $\mathcal{N}$. If a mean value or trend is present in a dataset, it can be subtracted as preprocessing, and added back to the predictions made with the estimated model.

We define two types of linear Markov models: the linear Gaussian state-space (LG-SS) model and the autoregressive (AR) model. Please refer to (Prado & West, 2010) and (Broersen, 2006) for the basic properties and results for these models that are used in the remainder of this section.

Following the notation in (Murphy, 2012), we write the linear Gaussian state-space model as:

$$z_n = A z_{n-1} + \epsilon_n, \qquad y_n = C z_n + \delta_n$$

with state vector $z_n \in \mathbb{R}^s$ and matrices $A \in \mathbb{R}^{s \times s}$ and $C \in \mathbb{R}^{m \times s}$; $\epsilon_n \in \mathbb{R}^s$ and $\delta_n \in \mathbb{R}^m$ are normally distributed, temporally uncorrelated stationary stochastic processes with covariance matrices $Q \in \mathbb{R}^{s \times s}$ and $R \in \mathbb{R}^{m \times m}$, respectively. The LG-SS model is the more general of the two models; the autoregressive model can be rewritten as an equivalent LG-SS model. Furthermore, this model is used to compute the Kalman likelihood in section 3.

The autoregressive model of order $p$ is defined by:

$$y_n = \sum_{i=1}^{p} a_i y_{n-i} + v_n$$

where the $a_i$ are referred to as prediction coefficients, and $v_n$ is a normally distributed, temporally uncorrelated stationary stochastic process with covariance matrix $V \in \mathbb{R}^{m \times m}$, written as $\sigma_v^2$ for $m = 1$.

We will now discuss some of the many practical and theoretical motivations for the AR model. (i) Under mild conditions, a stationary Gaussian process can be approximated arbitrarily well by an AR($p$) model of sufficiently high order, i.e. every process satisfying these conditions has an AR($\infty$) representation. (ii) AR models of low to moderate order $p$ are successfully used to model a wide range of processes: the AR(1) model is the discrete-time version of the Matérn-1/2 (exponential) kernel or Ornstein–Uhlenbeck process used in machine learning, physics and economics; models of moderate order (5-30) are successful in an even wider range of applications, see, e.g., (Zhang et al., 2017; Ramadona et al., 2016). Finally, (iii) AR models are used as the basis for a range of successful reduced-statistics moving average (MA) and autoregressive-moving average (ARMA) estimators.

Conversely, maximum likelihood estimation for the MA and ARMA models is inaccurate even for regularly spaced observations. For ARMA models, the theoretical result explaining these problems is that the Cramér-Rao lower bound for the estimation error of even the simplest ARMA(1,1) model is infinite. Because of the correspondence between ARMA models and LG-SS models, the same issue exists for general LG-SS parameter estimation (Auger-Méthé et al., 2016).

Since the AR model parameters fully characterize the Gaussian process, both the covariance function $k$ and the power spectral density $h$ can be computed from the AR model parameters. An alternative parametrization of the AR($p$) model is given by the partial autocorrelations $\phi_i$. Because the requirement of stationarity can be simply stated as $|\phi_i| < 1$, this representation is central in AR parameter estimation. Partial autocorrelations are also defined for the vector AR process (Marple, 1987).
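As an illustration of how the AR model maps onto the LG-SS form used later for the Kalman likelihood, the following sketch builds companion-form matrices for a scalar AR($p$) model. It is a minimal sketch: the function name and the choice of a noise-free measurement ($R = 0$) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def ar_to_lgss(a, sigma_v2):
    """Companion-form LG-SS representation of a scalar AR(p) model
    y_n = sum_i a_i y_{n-i} + v_n (illustrative sketch)."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a                     # first row holds the prediction coefficients
    A[1:, :-1] = np.eye(p - 1)      # shift register for the remaining state entries
    C = np.zeros((1, p))
    C[0, 0] = 1.0                   # the observation is the first state entry
    Q = np.zeros((p, p))
    Q[0, 0] = sigma_v2              # process noise enters the first state only
    R = np.zeros((1, 1))            # noise-free measurement for a pure AR model
    return A, C, Q, R

# Example: AR(1) with a_1 = 0.8 and unit innovation variance.
A, C, Q, R = ar_to_lgss(np.array([0.8]), 1.0)
```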
In general, models should not be evaluated on the difference between estimated and true model parameters, because a given difference in parameter values can have a vastly different impact on model performance depending on the location in parameter space where this difference is observed. Furthermore, it precludes comparing models of a different structure. Rather, models should be evaluated on their accuracy when used for inference. We evaluate models using the model error ME (Broersen, 2006). We will now summarize some key results for the ME as needed for this paper.

The model error ME is a normalized version of the one-step-ahead prediction error PE:

$$\mathrm{ME} = N_a \left( \frac{\mathrm{PE}}{\sigma_v^2} - 1 \right),$$

where PE is the expectation of the one-step-ahead prediction error of the estimated model with respect to the generating process. Besides this direct time domain interpretation as a normalized one-step-ahead prediction error, the ME is asymptotically equivalent to the spectral distortion. This equivalence motivates reporting estimated models in the frequency domain using the log power spectral density. Furthermore, the model error is asymptotically equivalent to the Kullback-Leibler divergence (KLD) for regularly sampled data. For regularly sampled data, the asymptotic expectation of the ME for estimated models is equal to the number of estimated parameters: $E[\mathrm{ME}] = p$.

The ME can be computed efficiently for general ARMA (and consequently for LG-SS) models. Because of its use in our simulation study, we report here the expression for an estimated AR($p$) model $\hat{a} = (\hat{a}_1, \hat{a}_2, ..., \hat{a}_p)^T$ with respect to an AR($p$) process with parameter vector $a$:

$$\mathrm{ME}(\hat{a}, a) = \frac{N_a}{\sigma_v^2} (\hat{a} - a)^T R\, (\hat{a} - a), \qquad (1)$$

where $R \in \mathbb{R}^{p \times p}$ is the covariance matrix of $p$ consecutive observations of the true process.
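Equation (1) can be evaluated directly once $R$ is available. A minimal sketch, assuming the reconstruction of eq. (1) given above and using illustrative function names: $R$ is obtained from the companion-form state-space representation via the discrete Lyapunov equation.

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

def ar_model_error(a_hat, a, sigma_v2, n_a):
    """Model error of eq. (1): ME = N_a (a_hat - a)^T R (a_hat - a) / sigma_v^2,
    with R the covariance matrix of p consecutive observations of the
    true AR(p) process (illustrative sketch)."""
    p = len(a)
    A = np.zeros((p, p))
    A[0, :] = a
    A[1:, :-1] = np.eye(p - 1)
    Q = np.zeros((p, p))
    Q[0, 0] = sigma_v2
    # Stationary covariance of p consecutive samples of the true process.
    R = solve_discrete_lyapunov(A, Q)
    d = np.asarray(a_hat) - np.asarray(a)
    return n_a * d @ R @ d / sigma_v2

# Example: error of a slightly perturbed AR(1) model with N_a = 200 observations.
print(ar_model_error([0.75], [0.8], 1.0, 200))
```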
3. Scalable exact likelihood computation
In this section we develop a scalable exact computation of the likelihood for the LG-SS model that is more computationally efficient than the existing Kalman likelihood. This computation is used in the estimation algorithms described in section 4.
The likelihood for observations $y \in \mathbb{R}^{N_a}$, $y = (y_{n_1}\ y_{n_2}\ \cdots\ y_{n_{N_a}})^T$ of a Gaussian process can be written as (Murphy, 2012):

$$L(y) = \log \mathcal{N}(0, K) = -\tfrac{1}{2} \log |K| - \tfrac{1}{2} y^T K^{-1} y + \text{constant}$$

where $\mathcal{N}(\mu, S)$ is the multivariate Gaussian distribution, and $K \in \mathbb{R}^{N_a \times N_a}$ is the data covariance matrix: $K_{ij} = k(n_i - n_j)$. We refer to this computation of the likelihood for general covariance matrices $K$ as the covariance matrix (COVM) method. This method is computationally expensive, i.e. $O(N_a^3)$, and consequently its application is limited to small datasets.

The Kalman likelihood (KAL) is a more efficient, exact likelihood computation for LG-SS models, which uses the decomposition of the likelihood into a sum of conditional likelihoods (Murphy, 2012):

$$L(y) = \sum_{n \in \mathcal{N}} \log \mathcal{N}(y_n; \mu_n, \Sigma_n) \qquad (2)$$

where the mean vector $\mu_n$ and covariance matrix $\Sigma_n$ are computed using the Kalman measurement equations for each $n \in \mathcal{N}$, while the Kalman prediction step is performed for all $N$ grid points. The Kalman likelihood for the initial state distribution is given by the Lyapunov equation. This computation is $O(N)$. Since it is often beneficial to use a small grid time $T_g$, it is typical that $N_a \ll N$. Hence, we proceed to improve algorithm efficiency to achieve an exact computation of $O(N_a)$.

For sparsely sampled data, it will occur frequently that no data is available for several consecutive sample points. For the likelihood computation, this results in repeated application of the Kalman prediction step. In this section, we describe an algorithm that precomputes elements of the repeated prediction step, thus reducing the overall computational complexity of the likelihood computation.

Performing $k$ prediction steps in the absence of measurements yields:

$$\mu_{n+k|n} = A^k \mu_n, \qquad \Sigma_{n+k|n} = F(A, \Sigma_n, k) + G(A, Q, k) \qquad (3)$$

where $\Sigma_n$ is the state covariance matrix conditional on all available preceding observations, and

$$F(A, \Sigma_n, k) = A^k \Sigma_n (A^k)^T, \qquad G(A, Q, k) = \sum_{i=0}^{k-1} A^i Q (A^i)^T$$

Computation of each of these contributions can be accelerated by precomputing components that only depend on the model parameters, i.e. $A^k$ and $G(A, Q, k)$. We write $G$ in terms of the contributions to the sum: $G = \sum_{i=0}^{k-1} g_i$, with $g_i = A^i Q (A^i)^T$. For efficient computation of $g_i$ we use the recursive relations $g_i = A g_{i-1} A^T$ and $G(A, Q, k) = G(A, Q, k-1) + g_{k-1}$.
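The precomputation can be sketched as follows: tables of $A^k$ and $G(A, Q, k)$ are built once per parameter value and then reused for every gap of length $k$. The table layout and function names are illustrative assumptions rather than the paper's (Julia) implementation.

```python
import numpy as np

def precompute_gap_terms(A, Q, k_max):
    """Precompute A^k and G(A, Q, k) for k = 0..k_max using the recursions
    g_k = A g_{k-1} A^T and G(k) = G(k-1) + g_{k-1} (sketch)."""
    s = A.shape[0]
    A_pow = [np.eye(s)]
    G = [np.zeros((s, s))]
    g = Q.copy()                    # g_0 = Q
    for _ in range(k_max):
        A_pow.append(A @ A_pow[-1])
        G.append(G[-1] + g)         # G(k) = G(k-1) + g_{k-1}
        g = A @ g @ A.T             # g_k = A g_{k-1} A^T
    return A_pow, G

def predict_over_gap(mu, Sigma, k, A_pow, G):
    """Kalman prediction over k grid steps without measurements (eq. 3)."""
    Ak = A_pow[k]
    return Ak @ mu, Ak @ Sigma @ Ak.T + G[k]
```

During the likelihood evaluation, a call such as `predict_over_gap` replaces $k$ individual prediction steps with a single expression of constant cost in $k$.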
After completion of the precomputation, we perform $N_a$ repeated prediction steps and measurement steps, each of which has constant complexity for constant $s$. Therefore, the computational complexity has been reduced to $O(N_a)$. For AR models, the computation can be further accelerated by exploiting the sparsity in $C$ and $Q$.

For more efficient computation of the matrix powers in the Kalman likelihood for larger state dimension $s$, we decompose the state vector $z$ according to the eigenbasis of $A$, yielding the following equations for the state-space model:

$$z^e_n = \Lambda z^e_{n-1} + \epsilon^e_n, \qquad y_n = C^e z^e_n + \delta_n$$

where $z^e$, $\epsilon^e$ and $C^e$ are the state, process noise and $C$ matrix, expressed in the eigenbasis of $A$.

This decomposition requires that $A$ is diagonalizable. The $A$ matrix corresponding to an AR($p$) model is diagonalizable as long as $a_p \neq 0$. If $a_p = 0$, we can remove trailing zero-valued coefficients to obtain an AR($p^*$) model with $p^* < p$ and $a_{p^*} \neq 0$, which can be diagonalized. This model order reduction does not affect the covariance function of $y$ and will therefore yield the correct likelihood value.

The measurement step is the same as in the original Kalman likelihood (KAL). The repeated prediction step (eq. 3) becomes:

$$\mu^e_{n+k|n} = \Lambda^k \mu^e_n, \qquad \Sigma^e_{n+k|n} = \Lambda^k \Sigma^e_n \Lambda^{kH} + \sum_{i=0}^{k-1} \Lambda^i Q^e \Lambda^{iH}$$

which is more computationally efficient to evaluate because $\Lambda$ is diagonal. For the functions $F$ and $G$, we find:

$$F(\Lambda, \Sigma^e_n, k) = F_p(\Lambda, k) \circ \Sigma^e_n$$

where $\circ$ is the elementwise matrix product or Hadamard product, and $F_p(\Lambda, k)[i,j] = (\lambda_i \lambda_j^*)^k$. Since $F_p$ is data-independent, it can be precomputed. For the second contribution $G$, we find:

$$G(\Lambda, Q^e, k) = G_p(\Lambda, k) \circ Q^e \qquad (4)$$

with

$$G_p(\Lambda, k)[i,j] = \frac{1 - (\lambda_i \lambda_j^*)^k}{1 - \lambda_i \lambda_j^*}$$

Equation 4 has reduced computational complexity compared to the original expression because (i) it uses the elementwise product, which is $O(s^2)$, whereas the matrix multiplication is $O(s^3)$, and (ii) the usage of the geometric series allows for efficient exponentiation, e.g. exponentiation by squaring. Finally, the covariance of the unconditional distribution can also be computed more efficiently, because the Lyapunov equation reduces to a per-element expression: the $[i,j]$ element of the unconditional state covariance is $Q^e[i,j] / (1 - \lambda_i \lambda_j^*)$.

The computational load of the likelihood computation is measured experimentally on speech sample data from (Wilson & Nickisch, 2015). Irregularly sampled data is derived by random subsampling. In figure 1, we report the computational load of the described likelihood computations, implemented in Julia 0.6.0 and executed on a 2.60 GHz Intel Xeon E5-2670 CPU.

The experiments confirm the expected scaling behavior: Kalman-based computations scale much better with signal duration $N$ (fig. 1a); precomputation methods PRE-KAL and DIAG-PRE-KAL are more efficient for larger sampling intervals compared to the original Kalman likelihood KAL (fig. 1b). Finally, the usage of the diagonalized state-space formulation (DIAG-PRE-KAL) scales better with increasing model complexity (fig. 1c). However, because PRE-KAL uses only real-valued variables, it has a lower memory footprint, which makes it more efficient for low-complexity models.

(Legend: KAL, PRE_KAL, DIAG_PRE_KAL, COVM.)
Figure 1.
Execution time in seconds for likelihood computation methods as a function of the signal duration (a), average sampling interval (b), and model complexity (c).
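A sketch of the diagonalized prediction step benchmarked as DIAG-PRE-KAL above, assuming distinct eigenvalues of $A$; all function and variable names are illustrative, and the final checks compare the Hadamard-product expressions against direct evaluation in the original basis.

```python
import numpy as np

def diag_gap_terms(lam, k):
    """Elementwise factors F_p and G_p for a gap of length k, with
    G_p[i, j] = (1 - (lam_i lam_j*)^k) / (1 - lam_i lam_j*)."""
    L = np.outer(lam, lam.conj())          # lam_i * conj(lam_j)
    return L ** k, (1.0 - L ** k) / (1.0 - L)

def predict_over_gap_diag(mu_e, Sigma_e, Q_e, lam, k):
    """Diagonalized prediction over k steps using Hadamard products (sketch)."""
    Fp, Gp = diag_gap_terms(lam, k)
    return lam ** k * mu_e, Fp * Sigma_e + Gp * Q_e

# Consistency check against the companion-form prediction for a stable A.
A = np.array([[0.5, -0.3], [1.0, 0.0]])    # AR(2) companion matrix (illustrative)
Q = np.diag([1.0, 0.0])
Sigma0, mu0 = np.eye(2), np.array([1.0, 0.0])
lam, V = np.linalg.eig(A)
Vinv = np.linalg.inv(V)
Q_e = Vinv @ Q @ Vinv.conj().T
Sigma_e = Vinv @ Sigma0 @ Vinv.conj().T
mu_e = Vinv @ mu0
k = 7
mu_p, Sig_p = predict_over_gap_diag(mu_e, Sigma_e, Q_e, lam, k)
# Map back to the original basis and compare with direct evaluation of eq. (3).
Ak = np.linalg.matrix_power(A, k)
direct_S = Ak @ Sigma0 @ Ak.T + sum(
    np.linalg.matrix_power(A, i) @ Q @ np.linalg.matrix_power(A, i).T for i in range(k))
assert np.allclose(V @ Sig_p @ V.conj().T, direct_S)
assert np.allclose(V @ mu_p, Ak @ mu0)
```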
4. Estimators
In this section we describe our implementation of the maximum likelihood and Bayesian point estimators for kernel learning for the autoregressive (AR) model. We use the PRE-KAL algorithm, which is the most efficient for the considered model complexity, implemented in the probabilistic programming language Stan (Carpenter et al., 2016). The Stan script is provided as supplementary material with this paper.
In case observations are made on a continuous index variable $t$, the time indices are converted to integer indices $n_i$ by rounding the continuous index variable to values on a regularly spaced grid with spacing $T_g$. A coarser grid will result in larger errors in the estimated process parameters. Considerations in setting this value include (i) the time scale of interest, (ii) the process dynamics, and (iii) computational load and numerical stability.

An effective way to achieve accurate models at the time scale of interest $T_A$ is to simply set the grid spacing $T_g$ equal to $T_A$. However, this does not fully exploit all available data, and may introduce errors in the process dynamics that are too large. In this case, we can use modeling at interval $T_A$ (De Waele & Broersen, 2000). The discrete-time signal can be split into $T_A/T_g$ segments of data that can be treated independently, which means that the likelihood can be computed as the sum of the likelihoods per segment.

Maximization of the likelihood is performed over the partial autocorrelations $\phi$ using the L-BFGS algorithm, initiated from a number of different starting points to reduce the probability of finding a local optimum. Three starting points are obtained using the Burg AR parameter estimator applied to regularly sampled signals derived from the original data (i) using nearest neighbor interpolation, (ii) using linear interpolation, and (iii) by ignoring missing data. Finally, (iv) the white noise model, i.e. $\phi_i = 0$ for $i \in [1, p]$, is used as a starting point. The starting point resulting in the highest likelihood is used as the final estimate.

In addition to the ML estimate, we propose a Bayesian point estimate for the model parameters. While it is possible and perfectly valid to use the generated parameter samples for further inference, we use a point estimate here for the following reasons: (i) in practice, a posterior is rarely the end result of the analysis. Rather, the samples are used to draw a conclusion that can be formulated as a Bayesian decision problem. A point estimate can also be interpreted as the result of a decision problem (Gelman et al., 2014); (ii) the point estimate allows a direct comparison with the ML estimator, and can be used as a direct replacement for it. Finally, (iii) usage of a point estimate greatly reduces the computational complexity of subsequent application of the estimated kernel parameters, e.g. prediction, optimization or interpolation.

We approximate an uninformative prior by application of the AR(1) reference prior (Berger & Yang, 1994) to the partial autocorrelations $\phi_i$: $p(\phi_i) \propto (1 - \phi_i^2)^{-1/2}$. The approximate location parametrization is given by $\kappa_i = \arcsin \phi_i$, $-\pi/2 < \kappa_i < \pi/2$. By definition, this is the parametrization where the reference prior is uniform (Bernardo, 2005).

For many estimation problems, asymptotic theory accurately describes estimator performance. In this regime, the posterior is narrow, and various point estimators converge to the same estimate, including the posterior mode, posterior mean, and the ML estimate (Gelman et al., 2014). Notably, because in the asymptotic regime $E[f(\theta)] \approx f(E[\theta])$, the posterior means of different parametrizations are equivalent. However, for irregularly sampled data, we are typically not in this regime and therefore have to be more precise in defining a Bayesian point estimate.
Because the model error ME (eq. 1) is approximately quadratic in the prediction parameters $\hat{a}$, we use the posterior mean of $a$ as the point estimate: $\hat{a} = E[a | y]$, which is the optimal Bayesian decision for a quadratic loss in $\hat{a}$. This estimator is referred to as PMEAN. Posterior samples are drawn with the Stan implementation of Hamiltonian Monte Carlo (HMC), initiated from the ML estimate.
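As an illustration of how the PMEAN point estimate can be formed from posterior draws of the partial autocorrelations, the sketch below converts each draw to prediction coefficients with the standard Levinson-type recursion (for the sign convention $y_n = \sum_i a_i y_{n-i} + v_n$ used in this paper) and averages in the $a$-parametrization. Function names and the example draws are illustrative assumptions; the paper's actual sampling is done in Stan.

```python
import numpy as np

def pacf_to_ar(phi):
    """Convert partial autocorrelations phi_1..phi_p to prediction
    coefficients a_1..a_p via the Levinson-type recursion (sketch)."""
    a = np.array([])
    for phi_k in phi:
        # a_i^(k) = a_i^(k-1) - phi_k * a_{k-i}^(k-1),  a_k^(k) = phi_k
        a = np.concatenate([a - phi_k * a[::-1], [phi_k]])
    return a

def posterior_mean_ar(phi_draws):
    """PMEAN point estimate: posterior mean taken in the a-parametrization."""
    return np.mean([pacf_to_ar(phi) for phi in phi_draws], axis=0)

# Example with two hypothetical posterior draws for an AR(2) model.
print(posterior_mean_ar([[0.6, 0.1], [0.5, -0.1]]))
```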
5. Simulation study
Simulation experiments are critical for the performance evaluation of kernel learning algorithms, because the asymptotic results that can be obtained theoretically are quite different from estimator performance in practice. In section 5.1, we quantify the impact on the model error of the occurrence of spurious spectral peaks. In section 5.2, we determine the dependence of the model error on various process and model parameters.
The case study kernels are defined as follows:

• case A: exponential kernel with covariance function $\rho(r) = \exp(-r/…)$, or, equivalently, an AR(1) process with parameter $\phi_1 = a_1 = \exp(-…)$;

• case B: squared exponential kernel with covariance function $\rho(r) = \exp(-r^2/…)$;

• case C: AR(4) process with two complex conjugate pole pairs, corresponding to spectral peaks at $f = 0.…$ and $f = 0.…$.

Repeated irregular samples are drawn from this process as follows (sketched below): (i) a random draw of $N$ observations of the process is generated from the prescribed covariance; (ii) samples are randomly selected with probability $1/T$, where $T$ is the average sampling interval. We use $N = 1000$ and $T = 5$. Parameters are estimated for the AR(8) model for cases A and B, and an AR(4) model for case C. The motivation for the lower order for case C is to create a best-case scenario for the ML estimator, by matching the estimated model order to the true model order. For the ML estimator, the ME increases more strongly with increasing model order compared to PMEAN.

The average model error over $S = 400$ simulation runs is given in figure 2. We draw the following main conclusions from this experiment: (i) the Bayesian PMEAN estimator significantly reduces the model error compared to ML for cases A and B; the ME is reduced by a factor of 8.2 for case A and by a factor of 3.6 for case B; (ii) all estimators perform well for case C, which shows that PMEAN is successful at suppressing spurious peaks in cases A and B, while at the same time it correctly estimates a peak when one occurs in the true spectrum.

In addition to PMEAN, we also report the results for the posterior mean with a flat prior on the partial autocorrelations $\phi_i$ (PMEAN-f). The results for PMEAN-f are comparable to those of PMEAN, showing that the results are not sensitive to the kind of uninformative prior used. Since this result holds equally for the other reported simulation experiments, PMEAN-f is not reported separately elsewhere.
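A minimal sketch of the sampling scheme of steps (i)-(ii) above for case A: draw a realization of an AR(1) process on the fine grid and keep each sample with probability $1/T$. The AR(1) coefficient and the random seed used here are illustrative placeholders, since the case-study correlation length is not reproduced from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T, a1 = 1000, 5, 0.9        # grid length, average sampling interval, illustrative AR(1) coefficient

# (i) generate a realization of the AR(1) process on the full grid (unit process variance)
sigma_v = np.sqrt(1.0 - a1 ** 2)
y = np.zeros(N)
y[0] = rng.normal()
for n in range(1, N):
    y[n] = a1 * y[n - 1] + sigma_v * rng.normal()

# (ii) keep each grid point with probability 1/T
observed = rng.random(N) < 1.0 / T
n_idx, y_obs = np.flatnonzero(observed), y[observed]
print(len(n_idx), "available samples (expected about", N // T, ")")
```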
Figure 2.
Model error ME for the kernel case studies for the ML and PMEAN estimators. The posterior mean estimator PMEAN achieves a significant error reduction for cases A and B. The ME is the average over 400 simulations.
Spectral estimates for representative simulation draws are given in figure 3 in comparison to the true spectrum, along with the resulting model error ME. For case A, the spectral estimate at higher frequencies for the ML estimate has an error of close to 3 orders of magnitude, resulting in a large ME of 1045, while PMEAN is much more accurate in this frequency range. Also for case B, we observe that PMEAN is successful at suppressing spurious peaks compared to ML. Finally, both estimators accurately model the spectral peak when it occurs in the true spectrum in case C.

Figure 4 shows the posterior distribution and point estimates for the first few AR parameters of an estimated AR(8) model for case A. Higher order coefficients are omitted because they show a similar pattern. The posterior for the (non-zero) parameter $a_1$ is quite narrow. Both the ML and PMEAN estimates are accurate for this parameter. However, we observe a wide posterior for the higher order coefficients, roughly centered around the true value of 0. For these coefficients, the posterior mean, being the center of mass of the distribution, more robustly estimates a value closer to 0 because it takes into account the entire distribution, thus resulting in a more accurate model. Conversely, the maximum likelihood estimator does not take into account the entire shape of the distribution, but only its maximum, and thus is more prone to shift dramatically due to small random variations, resulting in a larger error.

(Panel legends: (a) ML 1045, PMEAN 131; (b) ML 827, PMEAN 247; (c) ML 27, PMEAN 22.)
Figure 3.
Estimated power spectra for representative simulation runs for the kernel case studies, compared to the true kernel spectrum. The legend shows the model error ME for each estimate.
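The spectra in figure 3 follow from the estimated AR parameters via the standard AR power spectral density $h(f) = \sigma_v^2 / |1 - \sum_i a_i e^{-2\pi i f i}|^2$; figure 3 reports its logarithm. The following is a minimal sketch with illustrative names, not the paper's implementation.

```python
import numpy as np

def ar_psd(a, sigma_v2, freqs):
    """Power spectral density of an AR(p) model y_n = sum_i a_i y_{n-i} + v_n
    at normalized frequencies freqs (cycles per sample); sketch."""
    a = np.asarray(a)
    i_vals = np.arange(1, len(a) + 1)
    denom = 1.0 - np.exp(-2j * np.pi * np.outer(freqs, i_vals)) @ a
    return sigma_v2 / np.abs(denom) ** 2

freqs = np.linspace(0.0, 0.5, 256)
h = ar_psd([0.9], 1.0, freqs)   # AR(1) example; log10(h) is what the figures plot
```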
For AR(8) model estimation in cases A and B, the average computation time for the ML estimator is … seconds, while PMEAN takes … seconds. Hence, ML is the best option for kernel learning when its speed is required. However, the computation time for PMEAN is reasonable, and its significantly greater accuracy makes it the algorithm of choice when the computational resources are available. Furthermore, our results motivate future work on accelerating the sampling, e.g. by a specialized derivative computation instead of relying on the Stan autodiff algorithm; usage of the DIAG-PRE-KAL algorithm; and faster inference algorithms such as variational inference.

We report the results of a simulation experiment for a range of process parameters, using case A from section 5.1 as a reference. When varying the average sampling interval $T$, the measurement time is increased in proportion to $T$ so that the average number of available samples $N_a$ remains constant. The results are given in figure 5.

Figure 5a shows that the ML estimator has a larger model error with increasing correlation length $T_s$, corresponding to a larger dynamic range in the frequency domain; with increasing average sampling time $T$ (fig. 5b); and with increasing estimated model order $p_{est}$ (fig. 5d). Conversely, the model error decreases with increasing measurement time $N$ (fig. 5c), indicating that the estimator converges to an accurate result as the number of observations increases. Note that the model error is scaled with $N_a$. Hence, the unscaled error decreases faster than $1/N_a$. This is caused by finite sample effects, which are not described by asymptotic theory. In the asymptotic regime, the model error is independent of $N_a$.

(Legend: true, posterior, ML 959, PMEAN 60.)
Figure 4. Posterior distribution and point estimates for the first few AR parameters (top to bottom) of an estimated AR(8) model for case A. The legend shows the model error ME for each estimate.

Figure 5. Model error ME of the ML and PMEAN estimators as a function of correlation length $T_s$ (a); average sampling time $T$ (b); measurement time $N$ (c); and model order $p_{est}$ (d).
6. Monsoon rainfall variability data
We investigate long-term monsoon rainfall variability based on radiometrically dated speleothem oxygen isotope $\delta^{18}$O data (Sinha et al., 2015). This data enables evaluation of climate variability on a much larger time scale than the few decades of precipitation recorded by meteorologists. The data is intrinsically irregularly sampled, because it is formed by natural deposition rather than experimenter-controlled sampling. For the same reason, irregular sampling occurs for many other long-term climate records as well, e.g. ice core data (Petit et al., 1999).

Speleothem (cave formation) data as studied across various locations can have a range of average sampling rates. The current dataset is particularly suitable for algorithm benchmarking because it has a higher average sampling rate than datasets collected from other locations. This allows us to study algorithm performance as a function of sampling rate by subsampling the original data, and comparing the results to estimates obtained from the full dataset.

The oxygen isotope data consists of $N = 1848$ irregularly sampled observations of $\delta^{18}$O anomalies over a time span of … years, resulting in an average sampling interval of $T = 1.…$ years. We convert the irregularly sampled data to a regularly sampled grid with missing data, with a grid spacing of $T_g = 2$ years, and use this data to estimate AR(8) models. Because of the high sampling rate and selected grid spacing, only 13% of samples on the regular grid are missing when all samples are used, and we can consequently estimate a reliable reference model. This is confirmed by the fact that the ML and PMEAN estimates are practically identical, with an ME between the two estimates of less than ….

We proceed to increase the average sampling interval $T$ by random subsampling to create lower-fidelity datasets as they may be observed at other locations. The sampling interval $T$ is expressed in grid time steps, so that $T = 1$ corresponds to no missing data, as in the simulation experiments.

For each subsampling rate, we generate different signals by repeated random subsampling. The error of estimated AR(8) models compared to the reference model as a function of $T$, as well as estimated power spectra for $T = 6.…$, are given in figure 6.

Consistent with the simulation results, we observe that both estimators are accurate for low $T$, but that the PMEAN estimator has a substantially lower error as $T$ increases. Also, PMEAN successfully suppresses the spurious peaks at higher frequencies.

Figure 6.
Estimator performance on randomly subsampled speleothem $\delta^{18}$O data. As shown by the model error ME as a function of the average sampling interval $T$ (a), the posterior mean PMEAN is substantially more accurate than ML with increasing $T$. As in the simulation experiments, the ML estimate shows spurious peaks, which are strongly reduced in the PMEAN estimate (b), resulting in a lower ME (37 vs 91).
7. Concluding remarks
The reported algorithms and results are directly relevant to kernel learning when using linear Gaussian Markov models for scalar index variables or time series. Furthermore, the results can guide a wide range of related research, as we motivate below.

First, we report that the posterior mean based on the reference prior yields considerably more accurate models than the ML estimate, which strongly deteriorates for complex models. It is expected that this result will generalize to other kernels. Considering the trend towards more complex models, this is a key result for the field of kernel learning.

Second, the linear Gaussian Markov model can be used to model data with multidimensional index variables by using it to compose a separable multidimensional covariance function (Gilboa et al., 2015).
References
Auger-Méthé, M., Field, C., Albertsen, C. M., Derocher, A. E., Lewis, M. A., Jonsen, I. D., and Flemming, J. M. State-space models' dirty little secrets: even simple linear Gaussian models can have estimation problems. Scientific Reports, 6:26677, 2016.

Bayarri, M. J. and Berger, J. O. The interplay of Bayesian and frequentist analysis. Statistical Science, pp. 58-80, 2004.

Berger, J. O. and Yang, R.-Y. Noninformative priors and Bayesian testing for the AR(1) model. Econometric Theory, 10(3-4):461-482, 1994.

Bernardo, J. M. Reference analysis. Handbook of Statistics, 25:17-90, 2005.

Broersen, P. M. T. Automatic Autocorrelation and Spectral Analysis. Springer Science & Business Media, 2006.

Broersen, P. M. T. Spectral estimation from irregularly sampled data for frequencies far above the mean data rate. In IEEE Instrumentation and Measurement Technology Conference (IMTC), pp. 1-6, May 2007. doi: 10.1109/IMTC.2007.379314.

Broersen, P. M. T. The removal of spurious spectral peaks from autoregressive models for irregularly sampled data. IEEE Transactions on Instrumentation and Measurement, 59(1):205-214, Jan 2010. ISSN 0018-9456. doi: 10.1109/TIM.2009.2022451.

Carpenter, B., Gelman, A., Hoffman, M., et al. Stan: A probabilistic programming language. Journal of Statistical Software, 2016.

De Waele, S. and Broersen, P. M. T. Order selection for the multirate Burg algorithm. In Signal Processing Conference, 2000 10th European, pp. 1-4. IEEE, 2000.

Dong, K., Eriksson, D., Nickisch, H., Bindel, D., and Wilson, A. G. Scalable log determinants for Gaussian process kernel learning. In Advances in Neural Information Processing Systems, pp. 6330-6340, 2017.

Gelman, A., Carlin, J. B., Stern, H. S., and Rubin, D. B. Bayesian Data Analysis, volume 2. Chapman & Hall/CRC, Boca Raton, FL, USA, 2014.

Ghahramani, Z. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452-459, 2015.

Gilboa, E., Saatçi, Y., and Cunningham, J. P. Scaling multidimensional inference for structured Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):424-436, 2015.

Jones, R. H. Maximum likelihood fitting of ARMA models to time series with missing observations. Technometrics, 22(3):389-395, 1980.

Khan, M. S., Jenkins, J., and Yoma, N. B. Discovering new worlds: A review of signal processing methods for detecting exoplanets from astronomical radial velocity data [applications corner]. IEEE Signal Processing Magazine, 34(1):104-115, 2017.

Marple, S. L. Digital Spectral Analysis: With Applications, volume 5. Prentice-Hall, Englewood Cliffs, NJ, 1987.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Petit, J.-R., Jouzel, J., Raynaud, D., Barkov, N. I., et al. Climate and atmospheric history of the past 420,000 years from the Vostok ice core, Antarctica. Nature, 399(6735):429-436, 1999.

Prado, R. and West, M. Time Series: Modeling, Computation, and Inference. CRC Press, 2010.

Ramadona, A. L., Lazuardi, L., Hii, Y. L., et al. Prediction of dengue outbreaks based on disease surveillance and meteorological data. PLoS ONE, 11(3):e0152688, 2016.

Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning, volume 1. MIT Press, Cambridge, 2006.

Sarkka, S., Solin, A., and Hartikainen, J. Spatiotemporal learning via infinite-dimensional Bayesian filtering and smoothing: A look at Gaussian process regression through Kalman filtering. IEEE Signal Processing Magazine, 30(4):51-61, 2013.

Sinha, A., Kathayat, G., Cheng, H., et al. Trends and oscillations in the Indian summer monsoon rainfall over the last two millennia. Nature Communications, 6, 2015.

Wilson, A. G. and Nickisch, H. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pp. 1775-1784, 2015.

Wilson, A. G., Hu, Z., Salakhutdinov, R., and Xing, E. P. Deep kernel learning. In Artificial Intelligence and Statistics, pp. 370-378, 2016.

Zhang, Y., Liu, B., Ji, X., and Huang, D. Classification of EEG signals based on autoregressive model and wavelet packet decomposition.