Scalable Grouped Gaussian Processes via Direct Cholesky Functional Representations
Astrid Dahl
The University of New South Wales [email protected]
Edwin V. Bonilla
CSIRO’s Data61 [email protected]
Abstract
We consider multi-task regression models where observations are assumed to be a linear combination of several latent node and weight functions, all drawn from Gaussian process (GP) priors that allow nonzero covariance between grouped latent functions. We show that when these grouped functions are conditionally independent given a group-dependent pivot, it is possible to parameterize the prior through sparse Cholesky factors directly, hence avoiding their computation during inference. Furthermore, we establish that kernels that are multiplicatively separable over input points give rise to such sparse parameterizations naturally without any additional assumptions. Finally, we extend the use of these sparse structures to approximate posteriors within variational inference, further improving scalability on the number of functions. We test our approach on multi-task datasets concerning distributed solar forecasting and show that it outperforms several multi-task GP baselines and that our sparse specifications achieve the same or better accuracy than non-sparse counterparts.
1 Introduction

Gaussian process (GP) models are a flexible nonparametric Bayesian approach that can be applied to various problems such as regression and classification [24] and have been extended to numerous multivariate and multi-task problems including spatial and spatio-temporal contexts [7]. Multi-task GP methods have been developed along several lines (see e.g. [3] for a review), including mixing approaches that combine multiple latent univariate GPs via linear or nonlinear mixing to predict multiple related tasks [37, 10, 5, 30, 18, 14]. To maintain scalability in multi-task mixing models, various constraints have been employed. In particular, latent GPs may be constrained to be statistically independent as in [37, 11], or covarying with constrained kernel structures to allow algebraic efficiencies as in [10, 5].

In this paper we consider the multi-task setting where subsets of latent functions in Gaussian process regression networks (GPRN; [37]) covary within a constrained structure. We build upon the grouped Gaussian process (GGP) approach of [10], where groups of latent functions may covary arbitrarily with a separable kernel structure. Posterior estimation in this GGP framework, as originally proposed in [10], is underpinned by variational inference based on inducing variables [31] and its stochastic optimization extension [17], hence it should be inherently scalable to a large number of observations. However, both the time and space complexity deteriorate significantly when considering a large number of tasks, due to the assumed grouping structure.

Therefore, to address the above limitation, we consider the case where grouped functions are conditionally independent given a specific group-dependent pivot function. We exploit this structure in the prior and in the approximate posterior within a variational inference framework to develop an efficient inference algorithm for the GGP model. Our approach outperforms competitive multi-task baselines and is significantly more efficient (in time and memory) than its non-sparse counterpart. Our specific contributions are given below.

Direct sparse Cholesky functional representation: We show that when the grouped functions in a GGP model are conditionally independent given a group-dependent pivot, it is possible to parameterize the prior through sparse Cholesky factors directly, hence avoiding their computation during inference. We refer to these factors as Cholesky functions as, for a given kernel, they allow us to operate in the Cholesky factorization space directly as a function of the data and the GP hyper-parameters.

Exact construction of sparse Cholesky functionals: We establish for the first time (to the best of our knowledge) that kernels that are multiplicatively separable over input points give rise to such sparse parameterizations naturally without any additional assumptions. This enables direct construction of sparse Cholesky factors without resorting to iterative routines that are inherently costly and unstable, potentially increasing the scope for such kernels to be adopted in other machine learning settings and applications.

Sparse approximate posteriors: Finally, we extend the use of sparse structures to the corresponding approximate distributions within a variational inference framework. The required expectations over these distributions are neatly estimated using a simple but effective 'indirect' sampling approach, which further improves scalability on the number of functions.

Experiments: Our approach is driven by the problem of forecasting solar power output at multiple sites. We apply our GGP model and sparse inference framework to this problem using two real multi-task datasets, showing that it outperforms competitive multi-task benchmarks and achieves the same or better forecast performance than non-sparse counterparts in significantly less time.
2 Grouped Gaussian processes

GP regression. A Gaussian process (GP; [24]) is formally defined as a distribution over functions such that $f(\mathbf{x}) \sim \mathcal{GP}(\mu(\mathbf{x}), \kappa(\mathbf{x}, \mathbf{x}'))$ is a Gaussian process with mean function $\mu(\mathbf{x})$ and covariance function $\kappa(\mathbf{x}, \mathbf{x}')$ iff any subset of function values $f(\mathbf{x}_1), f(\mathbf{x}_2), \ldots, f(\mathbf{x}_N)$ follows a Gaussian distribution with mean $\boldsymbol{\mu}$ and covariance $\mathbf{K}$, which are obtained by evaluating the corresponding mean function and covariance function at the input points $X = \{\mathbf{x}_1, \ldots, \mathbf{x}_N\}$.

In this paper we consider a form of multi-task GP regression where multiple outputs are modeled as a linear combination of node and weight functions, each of which is a GP. Data is of the form $\mathcal{D} = \{X \in \mathbb{R}^{N \times D}, Y \in \mathbb{R}^{N \times P}\}$, where each $\mathbf{x}^{(n)}$ in $X$ is a $D$-dimensional vector of input features and each $\mathbf{y}^{(n)}$ in $Y$ is a $P$-dimensional vector of task outputs. We consider the grouped Gaussian process (GGP) model of [10], who place a prior over $Q$ latent GP functions $F = \{f_j(\mathbf{x})\}_{j=1}^{Q}$ such that arbitrary, non-overlapping subsets ('groups') of latent functions may have nonzero covariance.

We denote arbitrarily chosen subsets in $F$ as $F_r \in \mathbb{R}^{N \times Q_r}$, $r = 1, \ldots, R$, where $R$ is the total number of groups. For each group the number of latent functions within is denoted $Q_r$ (group size), such that $\sum_{r=1}^{R} Q_r = Q$. In the GGP, each group is comprised of latent functions $F_r = \{f_j\}_{j \in \text{group } r}$ and the covariance between two functions is nonzero iff the corresponding processes belong to the same group. This leads to a prior over functions given by

$$p(F \mid \theta) = \prod_{r=1}^{R} p(F_r \mid \theta_r) = \prod_{r=1}^{R} \mathcal{N}(F_r; \mathbf{0}, \mathbf{K}^r_{ff}), \quad (1)$$

where $\mathbf{K}^r_{ff} \in \mathbb{R}^{NQ_r \times NQ_r}$ is the covariance matrix generated by the group kernel function $\kappa_r(f_j(\mathbf{x}), f_{j'}(\mathbf{x}'))$, which evaluates the covariance of functions $f_j$ and $f_{j'}$ at the locations $\mathbf{x}$ and $\mathbf{x}'$, respectively. $\kappa_r(f_j(\mathbf{x}), f_{j'}(\mathbf{x}')) = 0$ iff $f_j$ and $f_{j'}$ do not belong to the same group $r$.

Correlations between outputs are modeled as in the Gaussian process regression network (GPRN) likelihood of [37], where $\mathbf{f}^{(n)}$ is a $Q$-dimensional vector of latent function values at time $n$ and

$$p(Y \mid F, \phi) = \prod_{n=1}^{N} p(\mathbf{y}^{(n)} \mid \mathbf{f}^{(n)}, \phi) = \prod_{n=1}^{N} \mathcal{N}(\mathbf{y}^{(n)}; \mathbf{W}^{(n)} \mathbf{g}^{(n)}, \boldsymbol{\Sigma}_y).$$

Here we define $W$ and $G$ as subsets of $F$ formed by gathering $PQ_g$ and $Q_g$ functions in $F$, respectively, with $Q_g(P+1) = Q$, $\phi = \boldsymbol{\Sigma}_y$, $\mathbf{f}^{(n)} = \{\mathbf{W}^{(n)}, \mathbf{g}^{(n)}\}$, and $\boldsymbol{\Sigma}_y$ a diagonal matrix. $P$-dimensional outputs are constructed at $\mathbf{x}^{(n)}$ as the product of a $P \times Q_g$ matrix of weight functions, $\mathbf{W}^{(n)}$, and a $Q_g$-dimensional vector of node functions $\mathbf{g}^{(n)}$. Partitions of $F$ with respect to $W$ and $G$ need not align with partitions into groups $F_r$.
Hence grouping in the prior can be independent of the likelihood definition and, for brevity, inference is presented below simply in terms of $F$ and $F_r$ rather than $W$ and $G$.

To maintain scalability of the approach, we consider separable kernels of the form $\kappa_r(f_j(\mathbf{x}), f_{j'}(\mathbf{x}')) = \kappa_r(\mathbf{x}, \mathbf{x}')\,\kappa_r(\mathbf{h}_j, \mathbf{h}_{j'})$, where for each group $Q_r$ vectors $\mathbf{h} \in H$ form a feature matrix $H_r \in \mathbb{R}^{Q_r \times H}$ that governs covariance across functions $f_j \in F_r$. Group covariance $\mathbf{K}^r_{ff} = \mathbf{K}^r_{hh} \otimes \mathbf{K}^r_{xx}$ thus decomposes into $\mathbf{K}^r_{xx} \in \mathbb{R}^{N \times N}$ and $\mathbf{K}^r_{hh} \in \mathbb{R}^{Q_r \times Q_r}$.

In this paper we propose sparse forms of $\mathbf{K}^r_{hh}$ arising from a constrained form of cross-function covariance, whereby functions within a group are conditionally independent given a group 'pivot' latent function. By exploiting conditional independence constraints that can reasonably fit with spatio-temporal applications such as distributed solar prediction (rationale are discussed in the supplement, §F), it is possible to dramatically reduce the complexity of the GP with respect to group size. A minimal sketch of the Kronecker structure implied by the separable kernel is given below.
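The following sketch (ours, with toy RBF kernels and made-up features, not the paper's released code) illustrates how a separable kernel makes the group Gram matrix factorize as a Kronecker product:

```python
import numpy as np

# Separable group kernel: k_r(f_j(x), f_j'(x')) = k_x(x, x') * k_h(h_j, h_j'),
# so the group Gram matrix factorizes as K_ff = K_hh (Kronecker) K_xx.
rbf = lambda a, b, ls: np.exp(-0.5 * ((a[:, None] - b[None, :]) / ls) ** 2)
x = np.linspace(0.0, 1.0, 4)        # N = 4 input points
h = np.array([0.0, 0.5, 2.0])       # Q_r = 3 cross-function features
K_xx, K_hh = rbf(x, x, 0.5), rbf(h, h, 1.0)
K_ff = np.kron(K_hh, K_xx)          # (N * Q_r) x (N * Q_r)

# The entry for (function j at point n) vs (function j' at point n') is the
# product of the two kernel evaluations.
j, n, jp, np_ = 0, 1, 2, 3
assert np.isclose(K_ff[j * 4 + n, jp * 4 + np_], K_hh[j, jp] * K_xx[n, np_])
```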
3 Sparse cross-function covariance structures

In this section we describe the main assumptions on the task-dependent covariance of our sparse GGP model that will yield significant computational gains over the original GGP. The starting point is that of conditional independence across functions given a group-pivot latent function.

3.1 Properties of multivariate Gaussians under conditional independence
When variables are jointly normally distributed and subsets of these variables are conditionally independent, their multivariate normal distribution is known to have certain useful properties. Suppose variables $F_r = f_1, f_2, \ldots, f_{Q_r}$ are jointly normally distributed with covariance $K$ (subscripts on $K$ are dropped for ease of exposition), and suppose that, given some variable $f_k$, the remaining variables are conditionally independent. That is,

$$F_r \sim \mathcal{N}(\mu, K) \quad \text{and} \quad f_i \perp f_j \mid f_k, \;\; \forall\, i, j \neq k,\; i \neq j, \quad (2)$$

where $f_i \perp f_j \mid f_k$ denotes independence of $f_i$ and $f_j$ given $f_k$. This joint distribution can be represented as a 'star' or 'tree' undirected graphical model, where $f_k$ can be conceived as a 'pivot' variable connecting all other variables (see Figure 2 in the supplement).

Where variables are jointly distributed according to (2), known, sparse expressions for the Cholesky factor and inverse covariance, that allow direct construction of these matrices, can be obtained analytically [29, 28, 34]. For $i, j \neq k$, $i \neq j$, the covariance element $K_{i,j} = \mathrm{Cov}(f_i, f_j)$ is given by

$$K_{i,j} = K_{i,k} K_{k,k}^{-1} K_{k,j} \iff f_i \perp f_j \mid f_k, \quad \text{leading to} \quad \Phi_{i,j} = 0 \;\text{ and }\; \Lambda_{i,j} = 0, \quad (3)$$

where $\Lambda$ and $\Phi$ are the precision matrix and lower triangular Cholesky factor of $K$. Moreover, nonzero elements of $\Lambda$ and $\Phi$ have a known form (see [29, 28] for a useful summary). (This result also holds for the multivariate analogue where, for example, $F_r$ is partitioned into subsets $(F_{Q_{r1}}, F_{Q_{r2}}, \ldots)$, in which case $K_{i,j} = \mathrm{Cov}(F_i, F_j)$ represents a submatrix, and similarly for $\Phi_{i,j}$ and $\Lambda_{i,j}$.)

Without loss of generality, where $f_2, \ldots, f_{Q_r}$ are conditionally independent given $f_1$, the precision matrix takes a known winged-diagonal form (see the supplement). Of key importance for our model, the associated Cholesky lower triangular factor $\Phi$ also has a known, sparse form where $\Phi_{11} = \mathrm{Chol}(K_{11})$, $\Phi_{i1} = K_{i1}(\Phi_{11})^{-1}$ and $\Phi_{ii} = \mathrm{Chol}(K_{ii \cdot 1})$, $i = 2, \ldots, Q_r$, with $K_{ii \cdot 1} = K_{ii} - K_{i1} K_{11}^{-1} K_{1i}$, and all other elements zero. $\mathrm{Chol}(\cdot)$ denotes the Cholesky factorization of the given argument.

3.2 Sparse Cholesky functions for GP priors

Let $f(\mathbf{h})$ be drawn from a Gaussian process, $f(\mathbf{h}) \sim \mathcal{GP}(\mathbf{0}, \kappa(\mathbf{h}, \mathbf{h}'))$, and assume that $f(\mathbf{h}_i) \perp f(\mathbf{h}_j) \mid f(\mathbf{h}_k)$ for some $i, j \neq k$, $i \neq j$. Thus, the properties of multivariate Gaussians above imply the constrained covariance form

$$\kappa(\mathbf{h}_i, \mathbf{h}_j) \equiv \kappa(\mathbf{h}_i, \mathbf{h}_k)\,\kappa(\mathbf{h}_k, \mathbf{h}_k)^{-1}\,\kappa(\mathbf{h}_k, \mathbf{h}_j). \quad (4)$$

Setting $k = 1$ and $i = 2, \ldots, Q_r$ yields a sparse form of the Cholesky factor where

$$\Phi_{11} = \mathrm{Chol}(\kappa_{11}), \quad \Phi_{i1} = \kappa_{i1}(\Phi_{11})^{-1}, \quad \text{and} \quad \Phi_{ii} = \mathrm{Chol}(\kappa_{ii} - \kappa_{i1}\kappa_{11}^{-1}\kappa_{1i}). \quad (5)$$

This form has several useful characteristics. Computation involves only univariate operations, meaning that $\mathrm{Chol}(\cdot) = \sqrt{(\cdot)}$, $(\Phi_{11})^{-1} = 1/\Phi_{11}$ and $\kappa_{11}^{-1} = 1/\kappa_{11}$. Since there are $2Q_r - 1$ nonzero terms, complexity of both computation and storage is $O(Q_r)$ rather than the more general $O(Q_r^3)$ and $O(Q_r^2)$. Computation involving univariate operations only also allows direct control over numerical stability of the factorization.

In addition, the sparse form can be decomposed as two sparse matrices with a single nonzero column and nonzero diagonal respectively, i.e.

$$\Phi = [\Phi_{\cdot 1}, \mathbf{0}_1, \ldots, \mathbf{0}_{Q_r - 1}] + \mathrm{diag}(0, \Phi_{22}, \ldots, \Phi_{Q_r Q_r}), \quad (6)$$

where $[\cdot]$ is the concatenation of the first column of $\Phi$ ($\Phi_{\cdot 1}$) and $Q_r - 1$ zero column vectors of length $Q_r$, and $\mathrm{diag}(\cdot)$ is the diagonal matrix with diagonal elements $(0, \Phi_{22}, \ldots, \Phi_{Q_r Q_r})$.
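As an illustration of operating in the Cholesky factorization space directly, the following sketch (ours, assuming the univariate case of (5); not from the released implementation) constructs the factor from the pivot covariances alone and checks it against a dense factorization:

```python
import numpy as np

def sparse_cholesky(k_pivot, k_cross, k_diag):
    """Build the sparse Cholesky factor of Eq. (5) directly: variables
    2..Q_r are conditionally independent given the pivot (variable 1)."""
    q = len(k_cross) + 1
    phi = np.zeros((q, q))
    phi[0, 0] = np.sqrt(k_pivot)              # Phi_11 = sqrt(kappa_11)
    phi[1:, 0] = k_cross / phi[0, 0]          # Phi_i1 = kappa_i1 / Phi_11
    schur = k_diag - k_cross ** 2 / k_pivot   # kappa_ii - kappa_i1^2 / kappa_11
    phi[np.arange(1, q), np.arange(1, q)] = np.sqrt(schur)  # Phi_ii
    return phi

# A covariance with the constrained form of Eq. (3): off-diagonal entries
# among non-pivot variables equal K_i1 K_11^{-1} K_1j; diagonals are free.
rng = np.random.default_rng(0)
k_pivot = 2.0
k_cross = rng.uniform(0.1, 1.0, size=4)
k_diag = k_cross ** 2 / k_pivot + rng.uniform(0.5, 1.0, size=4)  # keeps K PD
K = np.empty((5, 5))
K[0, 0], K[1:, 0], K[0, 1:] = k_pivot, k_cross, k_cross
K[1:, 1:] = np.outer(k_cross, k_cross) / k_pivot
K[np.arange(1, 5), np.arange(1, 5)] = k_diag

assert np.allclose(sparse_cholesky(k_pivot, k_cross, k_diag), np.linalg.cholesky(K))
```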
In practice, this means matrix operations can be evaluated using efficient vector-broadcasting, rather than matrix multiplication, routines.

To the best of our knowledge, direct construction of sparse Cholesky factors using explicit expressions as above has not been employed in Gaussian process models. Rather, where constrained covariance forms such as given at (2) are used, including in the sparse methods discussed at §6, it is in the context of prediction on test points that are conditionally independent given some inducing set of variables or latent functions (see e.g. [22]).

3.2.1 Implicit sparsity in exact GP priors

In some spatio-temporal settings it may be reasonable to explicitly impose the constraint at (4), which we term 'explicit' sparsity. In other cases, construction of a Gram matrix, associated Cholesky factor and inverse using general routines and using (3)–(5) are equivalent. This is due to certain kernels implicitly giving rise to the identity in (4). Where kernels can be expressed as products of real-valued functions of $\mathbf{h}_i$ and $\mathbf{h}_j$, i.e. assuming $\kappa(\mathbf{h}_i, \mathbf{h}_j) = \phi(\mathbf{h}_i)\psi(\mathbf{h}_j)$, and assuming the inverses $(\phi(\mathbf{h}_i)^{-1}, \psi(\mathbf{h}_j)^{-1})$ are defined for all $\mathbf{h}_i, \mathbf{h}_j \in H$, kernels give rise to this identity (see the supplement for details). The requirement for symmetry in Mercer kernels (see e.g. [15]) requires $\kappa(\mathbf{h}_i, \mathbf{h}_j) = \kappa(\mathbf{h}_j, \mathbf{h}_i)$, implying $\phi(\mathbf{h}_i)\psi(\mathbf{h}_j) = \phi(\mathbf{h}_j)\psi(\mathbf{h}_i)$ for all $\mathbf{h}_i, \mathbf{h}_j \in H$; however, we note that the functions $(\phi(\mathbf{h}), \psi(\mathbf{h}))$ need not be identical for (4) to hold.

Trivially, multiplicative kernels comprised of kernel functions that have the property at (4) retain this property. Thus Gram matrices that can be expressed as the Hadamard product of matrices constructed via kernel functions of this type also retain the properties at (3). Kernels that meet this criterion include constant kernels, polynomial kernels in one dimension, dot product wavelet kernels, and separable stationary kernels (see [15]). All these kernels decompose multiplicatively into real-valued functions of input points $\mathbf{h}_i$ and $\mathbf{h}_j$ in a straightforward way. When kernels decompose multiplicatively for any points $\mathbf{h}_i$, $\mathbf{h}_j$ and $\mathbf{h}_k$, the Cholesky has the form defined by (5). The Cholesky can be expressed and directly constructed in this way because the relationship holds for any three points: any point can be assigned as the 'pivot' point, and pairwise covariances between any other two points can be expressed in terms of covariances with the pivot.

Degeneracy: A corollary of this, however, is that where (4) holds, only one point on the diagonal of $\Phi$ is nonzero, $\Phi_{11}$, since for all other points $\kappa_{ii} - \kappa_{i1}\kappa_{11}^{-1}\kappa_{1i} = 0$ (from (5)), implying that the GP is degenerate. The kernels listed above that decompose multiplicatively are positive semi-definite, as opposed to strictly positive definite, Mercer kernels [24, 39]. One of the criticisms of the degenerate model is that, for distance-based kernels, predictive variance at new points reduces with distance from observed points and in the limit can be zero [21]. The latter follows from the result provided in [23] that a GP is degenerate iff the covariance function has a finite number of nonzero eigenvalues.
Avoiding degeneracy in GGP models: Issues around degeneracy are avoided in our framework for two reasons. Firstly, use of a sparse construction is limited to $\mathbf{K}^r_{hh}$. In essence, the model induces sparsity over tasks rather than observations. Under multi-task latent function mixing approaches, including the GGP, it is generally the case that predictions are not made on new tasks. As such, there is no degeneration over test points since locations across latent functions are fixed for training and test observations. Secondly, as is common practice, we add a Kronecker delta kernel (see [15]) to diagonal elements, however excluding the pivot, which maintains the model as an exact GP prior without losing the direct Cholesky construction. This is possible because, for $\mathbf{K}^r_{hh}$, the pivot input point does not change. Pivot choice is discussed at §5 and in the supplement.
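The implicit-sparsity and correction arguments can be checked numerically. The sketch below (ours, with a hypothetical one-dimensional multiplicative kernel $\phi(h) = \exp(h)$ standing in for the wavelet kernel used later) verifies identity (4) and shows that the off-pivot diagonal correction preserves the sparse pattern of (6):

```python
import numpy as np

# Toy multiplicatively separable kernel k(h, h') = phi(h) * phi(h').
phi = lambda h: np.exp(h)
h = np.array([0.3, -1.2, 0.7, 1.5, -0.4])    # pivot listed first
K = np.outer(phi(h), phi(h))                 # rank-one Gram matrix

# Identity (4) holds with the first point as pivot (or any other point):
assert np.allclose(K, np.outer(K[:, 0], K[0, :]) / K[0, 0])

# Kronecker delta correction on every diagonal entry except the pivot keeps
# the prior non-degenerate while preserving the sparse Cholesky pattern.
sigma2 = 0.25
K_corr = K + sigma2 * np.diag([0.0] + [1.0] * (len(h) - 1))
L = np.linalg.cholesky(K_corr)
off = np.tril(L, k=-1)
off[:, 0] = 0.0                              # first column may be dense
assert np.allclose(off, 0.0)                 # rest of the factor is diagonal
```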
4 Inference

Inference for the sparse GGP follows the sparse variational approach of [31], as extended by [11, 10], where the prior at (1) is augmented with inducing variables $\{\mathbf{u}_r\}_{r=1}^{R}$, drawn from the same GP priors as $F_r$ at $M$ inducing points $Z_r$ in the same space as $X$, giving

$$p(\mathbf{u} \mid \theta) = \prod_{r=1}^{R} \mathcal{N}(\mathbf{u}_r; \mathbf{0}, \mathbf{K}^r_{uu}) \quad \text{and} \quad p(F \mid \mathbf{u}) = \prod_{r=1}^{R} \mathcal{N}(F_r; \tilde{\boldsymbol{\mu}}_r, \widetilde{\mathbf{K}}_r), \quad (7)$$

where $\tilde{\boldsymbol{\mu}}_r = \mathbf{A}_r \mathbf{u}_r$, $\widetilde{\mathbf{K}}_r = \mathbf{K}^r_{ff} - \mathbf{A}_r \mathbf{K}^r_{uf}$ and $\mathbf{A}_r = \mathbf{K}^r_{fu}(\mathbf{K}^r_{uu})^{-1} = \mathbf{I}_{Q_r} \otimes \mathbf{K}^r_{xz}(\mathbf{K}^r_{zz})^{-1}$. $\mathbf{K}^r_{uu} \in \mathbb{R}^{MQ_r \times MQ_r}$ is governed by $\kappa_r(f_j(\mathbf{x}), f_{j'}(\mathbf{x}'))$ evaluated over $Z_r, H_r$, similarly yielding the decomposition $\mathbf{K}^r_{uu} = \mathbf{K}^r_{hh} \otimes \mathbf{K}^r_{zz}$. The joint posterior distribution over $\{F, \mathbf{u}\}$ is approximated by variational inference with $p(F, \mathbf{u} \mid Y) \approx q(F, \mathbf{u} \mid \lambda) \overset{\text{def}}{=} p(F \mid \mathbf{u})\,q(\mathbf{u} \mid \lambda)$ with

$$q(\mathbf{u} \mid \lambda) = \sum_{k=1}^{K} \pi_k \prod_{r=1}^{R} q_k(\mathbf{u}_r \mid \lambda_{kr}), \quad (8)$$

where $q_k(\mathbf{u}_r \mid \lambda_{kr}) = \mathcal{N}(\mathbf{u}_r; \mathbf{m}_{kr}, \mathbf{S}_{kr})$ and $\lambda_{kr} = \{\mathbf{m}_{kr}, \mathbf{S}_{kr}, \pi_k\}$. $\mathbf{m}_{kr}$ and $\mathbf{S}_{kr}$ are, respectively, a freely parameterized mean vector and covariance matrix, and $\pi_k$ are mixture weights. Prediction for a new point $\mathbf{y}_\star$ given $\mathbf{x}_\star$ is computed as the expectation over the variational posterior for the new point:

$$p(\mathbf{y}_\star \mid \mathbf{x}_\star) = \sum_{k=1}^{K} \pi_k \int p(\mathbf{y}_\star \mid \mathbf{f}_\star)\, q_k(\mathbf{f}_\star \mid \lambda_k)\, dF_\star, \quad (9)$$

where $q_k(\mathbf{f}_\star \mid \lambda_k)$ is defined as for $q_{k(n)}(\mathbf{f}^{(n)} \mid \lambda_k)$ below (see §4.1). The expectation in Eq. (9) is estimated via Monte Carlo (MC) sampling: $\mathbb{E}_{p(\mathbf{y}_\star \mid \mathbf{x}_\star)}[\mathbf{y}_\star] \approx \frac{1}{S}\sum_{s=1}^{S} \mathbf{W}^s_\star \mathbf{g}^s_\star$, where $\{\mathbf{W}^s_\star, \mathbf{g}^s_\star\} = \mathbf{f}^s_\star$ are samples from $q_k(\mathbf{f}_\star \mid \lambda_k)$. To estimate the posterior parameters we optimize the evidence lower bound (ELBO), defined as $\mathcal{L}_{\text{elbo}} \overset{\text{def}}{=} \mathcal{L}_{\text{ent}} + \mathcal{L}_{\text{cross}} + \mathcal{L}_{\text{ell}}$, where $\mathcal{L}_{\text{ent}}$, $\mathcal{L}_{\text{cross}}$ and $\mathcal{L}_{\text{ell}}$ are entropy, cross-entropy and expected log likelihood terms, respectively. Derivations of the general expressions for the GP model and ELBO can be found at [10, 11]. In evaluating the efficiency of our approach we consider both a diagonal and a Kronecker structure for $\mathbf{S}_{kr}$. We define the Kronecker specification as $\mathbf{S}_{kr} = \mathbf{S}_{krb} \otimes \mathbf{S}_{krw}$, where $\mathbf{S}_{krb} \in \mathbb{R}^{Q_r \times Q_r}$ and $\mathbf{S}_{krw} \in \mathbb{R}^{M \times M}$ are both sparse, freely parameterized matrices that heuristically correspond to 'between' and 'within' covariance components. Sparsity is induced via Cholesky factors of the form at (5). The corresponding components of the ELBO are as follows:
$$\mathcal{L}_{\text{ent}} \geq -\sum_{k=1}^{K} \pi_k \log \sum_{l=1}^{K} \pi_l \prod_{r=1}^{R} \mathcal{N}(\mathbf{m}_{kr}; \mathbf{m}_{lr}, \mathbf{S}_{kr} + \mathbf{S}_{lr}) \overset{\text{def}}{=} \widehat{\mathcal{L}}_{\text{ent}}, \quad (10)$$

$$\mathcal{L}_{\text{cross}}(\lambda) = -\frac{1}{2}\sum_{k=1}^{K} \pi_k \sum_{r=1}^{R} \left[ c_r + \log|\mathbf{K}^r_{uu}| + \mathbf{m}'_{kr}(\mathbf{K}^r_{uu})^{-1}\mathbf{m}_{kr} + \mathrm{tr}\left((\mathbf{K}^r_{uu})^{-1}\mathbf{S}_{kr}\right) \right], \quad (11)$$

$$\mathcal{L}_{\text{ell}}(\lambda) = \sum_{n=1}^{N} \mathbb{E}_{q^{(n)}(\mathbf{f}^{(n)} \mid \lambda)}\left[\log p(\mathbf{y}^{(n)} \mid \mathbf{f}^{(n)}, \phi)\right], \quad (12)$$

where $c_r = M_r \log 2\pi$ and $\widehat{\mathcal{L}}_{\text{ent}}$ is used instead of $\mathcal{L}_{\text{ent}}$. The main computational gains, following the efficiencies due to the Kronecker factorization, arise from the sparsification of the prior and the approximate posterior. Indeed, we can show that costly algebraic operations on the matrices $\mathbf{S}_{kr}$ and $\mathbf{K}^r_{hh}$ (obtained from the Kronecker factorization of $\mathbf{K}^r_{uu}$), such as log determinants and inverses, are reduced from $O(Q_r^3)$ to $O(Q_r)$ for the entropy and the cross-entropy terms in Eqs. (10) and (11).

However, for the expected log likelihood (ELL) term in Eq. (12) we still need to sample from the marginal posterior $q^{(n)}(\mathbf{f}^{(n)} \mid \lambda)$, which is done by sampling from component group mixture posteriors, i.e. $q_{k(n)}(\mathbf{f}^{(n)}_{Q_r} \mid \lambda_{kr}) = \mathcal{N}(\mathbf{b}_{kr(n)}, \boldsymbol{\Sigma}_{kr(n)})$, with mean and covariance expressions given by $\mathbf{b}_{kr(n)} = \mathbf{A}_{r(n)}\mathbf{m}_{kr}$ and $\boldsymbol{\Sigma}_{kr(n)} = \widetilde{\mathbf{K}}_{r(n)} + \mathbf{A}_{r(n)}\mathbf{S}_{kr}\mathbf{A}'_{r(n)}$, where $\widetilde{\mathbf{K}}_r$ and $\mathbf{A}_r$ are defined as in (7) (detailed expressions are provided in the supplement, §E). Naïvely, this is $O(Q_r^3)$. We address this issue by 'indirect sampling', i.e. drawing independent samples from two distributions specified as $\mathcal{N}(\mathbf{0}, \widetilde{\mathbf{K}}_{r(n)})$ and $\mathcal{N}(\mathbf{b}_{kr(n)}, \mathbf{A}_{r(n)}\mathbf{S}_{kr}\mathbf{A}'_{r(n)})$, and summing these to return samples from $\mathcal{N}(\mathbf{b}_{kr(n)}, \boldsymbol{\Sigma}_{kr(n)})$, hence reducing the time complexity to $O(Q_r)$. Similarly, these savings for the ELL apply to predictions as obtained from Eq. (9). Finally, there are significant gains from the proposed sparsification and indirect sampling method in terms of memory complexity, going from $O(Q_r^2)$ to $O(Q_r)$. Full complexity analysis is given in the supplement, §E.
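A minimal sketch of the indirect sampling step (ours; toy dimensions and hypothetical factors rather than the model's actual terms) shows that summing draws from the two component Gaussians reproduces draws from $\mathcal{N}(\mathbf{b}_{kr(n)}, \boldsymbol{\Sigma}_{kr(n)})$ without factorizing the summed covariance:

```python
import numpy as np

rng = np.random.default_rng(0)
Qr, S = 3, 200_000                            # group size, MC sample count

# Hypothetical per-point factors: sparse Cholesky of the prior conditional
# covariance K_tilde (first column + diagonal) and a factor of A S A'.
L_prior = np.array([[1.0, 0.0, 0.0],
                    [0.4, 0.8, 0.0],
                    [0.3, 0.0, 0.9]])
tmp = rng.normal(size=(Qr, Qr))
L_post = np.linalg.cholesky(tmp @ tmp.T + np.eye(Qr))
b = np.array([0.5, -0.2, 0.1])                # posterior mean b_kr(n)

# Indirect sampling: eps1 ~ N(0, K_tilde), eps2 ~ N(0, A S A'); the sum plus
# the mean is a draw from N(b, K_tilde + A S A').
f = (b[:, None]
     + L_prior @ rng.normal(size=(Qr, S))
     + L_post @ rng.normal(size=(Qr, S)))

target = L_prior @ L_prior.T + L_post @ L_post.T
assert np.allclose(np.cov(f), target, atol=0.1)
assert np.allclose(f.mean(axis=1), b, atol=0.02)
```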
GGP model on solar forecasting applications, where tasks are solar sites and themodelling goal is to jointly predict power output at all sites at each test time point n . The sparse GGP model for solar follows the approach in [10] where, for P tasks, there are P latent node functions,meaning Q g ( P + 1) = Q . Latent weight functions are grouped according to rows of W ( n ) whilelatent node functions are assumed to be independent. Thus, predicted output at site i , y ( n ) i , is a linearcombination of (site-associated) node functions. Spatial features, h j = ( latitude j , longitude j ) , j =1 , . . . , Q r , populate the cross function feature matrix H r for every group. We test three forms of K r hh Cholesky factors: an implicitly sparse form via multiplicatively separable kernels; an explicitlysparse form where the equality in (4) is imposed rather than naturally arising; and a sparse, freelyparameterized factor. Group pivot functions are set as the diagonal elements of W ( n ) (see thesupplement, §F). Data are five minute average power output between 7am and 7pm. We test our model using twodatasets (with P = 25 and P = 50 ). The first dataset consists of 25 proximate sites over 45 daysin Spring with 105,000 (57,000) observations for training (testing). The second dataset consists of50 proximate sites in a different location and season (Autumn) with 210,000 (114,000) observationsin total for training (testing). We forecast power output 15 minutes ahead at each site over the testperiod.For P = 25 we evaluate performance relative to the general GGP and three classes of non-
GGP benchmarks: the linear coregional model (
LCM ), GPRN and multi-task model with task-specificfeatures (
MTG ) of [6]. In addition, we test a general
GGP model with full, freely parameterized crosssite covariance. For P = 50 , we evaluate sparse models using Kronecker posteriors against non GGP benchmarks. However, other
GGP models could not be estimated under the same computationalconstraints. We consider the root mean squared error (
RMSE ), mean absolute error (
MAE ), negativelog predictive density (
NLPD ), average model rank ( M - RANK ) and forecast variance ( F - VAR ) toevaluate the performance of all methods. Full details of the experimental set-up are given in thesupplement, §F.
Results for all models are reported at Table 1, with further results available in the supplement (§G). For $P = 25$, forecast accuracy of sparse and general GGP models is comparable. Results for sparse and non-sparse GGP models differ by less than a percentage point on RMSE and MAE, while NLPD results are slightly more variable. For $P = 25$, accuracy is comparable for GGP and GPRN on MAE and RMSE, however GGP based models perform significantly better than benchmarks on NLPD. The LCM and MTG perform relatively poorly on all measures. Mean forecast variance was also found to be significantly lower (28 percent on average) under GGP models relative to benchmark models. For $P = 50$, sparse GGP models outperform benchmark models on all accuracy and variance measures. (Datasets and implementations are available on GitHub; see the supplement for details.)

Table 1: Forecast accuracy and variance of GGP, sparse GGP and benchmark models. Results shown for Kronecker posteriors with $K = 1$ for Adelaide ($P = 25$) and Sydney ($P = 50$) datasets. Results are reported for best performing GPRN and LCM benchmarks (based on RMSE) and for LCM where $Q_g = P$. All metrics are losses, i.e. the lower the better.

P = 25 (Kronecker)        RMSE    MAE     NLPD    M-RANK   F-VAR
GGP
GGP (free)                0.344
Sparse GGP (implicit)     0.342   0.213   0.414   4.7      0.120
Sparse GGP (explicit)
Sparse GGP (free)         0.344   0.216   0.378   7.0
LCM (Qg = P)              0.375   0.236   0.583   16.0     0.180
LCM (Qg = 4)              0.367   0.240   0.475   14.7     0.147
GPRN (Qg = 2)             0.342   0.214   0.446   5.7      0.150
MTG

P = 50 (Kronecker)        RMSE    MAE     NLPD    M-RANK   F-VAR
Sparse GGP (implicit)
Sparse GGP (explicit)     0.421   0.257   0.626   2.3      0.141
Sparse GGP (free)         0.423   0.258   0.625   2.7
LCM (Qg = P)              0.451   0.283   0.807   5.3      0.211
GPRN (Qg = 2)             0.428   0.263   0.664   4.0      0.185
MTG

Computation times are shown at Figure 1 for sparse versus general GGP models and benchmark models. Results confirm that sparse GGP models have substantially lower time costs than general counterparts; step time and ELBO computation time are decreased by 49 to 57 percent and 43 to 57 percent respectively, with Kronecker posteriors showing greater reductions than diagonal posteriors (55 versus 50 percent, respectively, on average). The most substantial improvements, however, are for prediction, where sparse models are three to four times faster over the full test set. Further, mean sparse model prediction time scales close to linearly between $P = 25$ (at 3.6 seconds) and $P = 50$ (at 8.5 seconds). Computation time for step and ELBO for $P = 50$ (not shown) scales at a higher-than-linear rate overall (on average three times the cost of comparable models for $P = 25$), which we mainly attribute to the grouping structure selected.

We also find that, for the same prior, average time cost is always lower under the Kronecker posterior, consistent with their lower complexity as discussed in the supplement, §E. Further, freely parameterized GGP models have lower time costs than 'standard' counterparts, reflecting the lower complexity of operations on $\mathbf{K}^r_{hh}$ elements ($O(1)$ versus $O(H)$ for explicitly defined kernels). Time costs for GPRN and LCM benchmarks are lower again than sparse GGP based models, in the order of 1.3 to 6.3 times faster per operation. The MTG, however, has very low step time (up to 27 times faster than sparse GGP models) and comparable time cost for ELBO in the diagonal case, but substantially poorer results for ELBO and prediction computation times for remaining tests.
Figure 1: Computation times for GGP versus sparse GGP models ($P = 25$) for diagonal and Kronecker posteriors. Left: average time to compute the full evidence lower bound. Right: average time to compute predictions over the full test set.

6 Related work

The problem of forecasting local solar output in the short term is of significant interest for the purpose of distributed grid control and household energy management [32, 35]. A number of studies confirm that exploiting spatial dependencies between proximate sites can yield significant gains in forecast accuracy and stability [10, 38]. More importantly, inherent to this application is the need for modeling uncertainty in a flexible and principled way (see e.g. [4]), hence our motivation to use GPs.

The literature on scalable inference for GP models is vast and the reader is referred to [20] for a comprehensive review of this area. Here we focus on closely related approaches to ours. Early methods adopt low-rank decompositions or assume some kind of conditional independence [26, 22], with later studies extending these conditional independence ideas to convolutional multi-task models [1, 2]. Nowadays it is widely recognized that the inducing variable framework of [31], made scalable by [17], is one of the de facto standards for scalable GPs. However, other approaches, such as those based on random feature expansions of the covariance function, remain relevant [16, 8]. Another idea that has been used in sparse methods is that of structured inputs, e.g. regular one-dimensional gridded inputs, which induce Toeplitz matrices with sparse inverses [36].

Other methods make use of conditional independence relationships arising naturally in Gaussian Markov random fields, particularly the integrated nested Laplace approximation (INLA) of [25]. These approaches assume a more general conditional independence structure arising from the Markov blanket property. Closely related to methods inducing conditional independence, several recent techniques consider low-rank approximations in inducing point frameworks [13, 9]. In fact, [13] use a low-rank pivoted Cholesky approximation with pre-conditioning for various GP models including sparse methods. Other approaches using low-rank approximations based on spectral methods have also been proposed, as in [27, 12].

While sparse variational and other recent low-rank GP methods have provided substantial gains in scalability in recent years, as general approaches they do not necessarily exploit efficiencies where sparse, cross-task covariances can be specified a priori, as may be possible in well understood spatio-temporal problems. Finally, given the increasing capabilities of current GPU-based computing architectures, it is not surprising to see very recent developments to solve GP regression exactly in very large datasets [33].

7 Conclusion

We have shown that by exploiting known properties of multivariate Gaussians and conditional independence across latent functions, it is possible to significantly improve the scalability of multi-task mixing models with respect to the number of functions. We have applied this approach to the problem of solar forecasting at multiple distributed sites using two large multi-task datasets and demonstrated that both spatially driven and freely parameterized sparse cross-function covariance structures can be exploited to achieve faster learning and prediction without sacrificing forecast performance. In particular, the cost of prediction is dramatically reduced and shown to be linear in the number of grouped latent functions.
A An example of a sparse GGP prior

Figure 2: Example of an undirected graphical model in star formation for $F_r = (f_1, f_2, \ldots, f_{Q_r})'$.

B Code and Datasets
We have made our implementation of the sparse and non-sparse GGP models available at the GitHub repository: https://github.com/spggp/sparse-ggp. The repository also contains the datasets used in reported experiments and example scripts for execution.
C Expressions for multivariate Gaussian precision matrix
The precision matrix from [28, 34] has a sparse, winged-diagonal form

$$\Lambda = \begin{pmatrix} \Lambda_{11} & \Lambda_{21}' & \Lambda_{31}' & \cdots & \Lambda_{Q_r 1}' \\ \Lambda_{21} & \Lambda_{22} & 0 & \cdots & 0 \\ \Lambda_{31} & 0 & \Lambda_{33} & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \Lambda_{Q_r 1} & 0 & 0 & \cdots & \Lambda_{Q_r Q_r} \end{pmatrix}$$

where

$$\Lambda_{11} = K_{11}^{-1} + \sum_{i=2}^{Q_r} K_{11}^{-1} K_{1i} K_{ii \cdot 1}^{-1} K_{i1} K_{11}^{-1}, \qquad \Lambda_{i1} = -K_{11}^{-1} K_{1i} K_{ii \cdot 1}^{-1}, \qquad \Lambda_{ii} = K_{ii \cdot 1}^{-1},$$

with $K_{ii \cdot 1} = K_{ii} - K_{i1} K_{11}^{-1} K_{1i}$, $i = 2, \ldots, Q_r$. The sparse Cholesky factor of $K$ has the form

$$\Phi = \begin{pmatrix} \Phi_{11} & 0 & \cdots & 0 \\ \Phi_{21} & \Phi_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ \Phi_{Q_r 1} & 0 & \cdots & \Phi_{Q_r Q_r} \end{pmatrix}$$

where $\Phi_{11} = \mathrm{Chol}(\kappa_{11})$, $\Phi_{i1} = \kappa_{i1}(\Phi_{11})^{-1}$, and $\Phi_{ii} = \mathrm{Chol}(\kappa_{ii} - \kappa_{i1}\kappa_{11}^{-1}\kappa_{1i})$.

D Proof of implicit sparsity in exact GP priors

Construction of a Gram matrix, associated Cholesky factor and inverse using general routines and direct sparse constructions are equivalent where kernels implicitly give rise to the identity

$$\kappa(\mathbf{h}_i, \mathbf{h}_j) = \kappa(\mathbf{h}_i, \mathbf{h}_k)\,\kappa(\mathbf{h}_k, \mathbf{h}_k)^{-1}\,\kappa(\mathbf{h}_k, \mathbf{h}_j) \iff f(\mathbf{h}_i) \perp f(\mathbf{h}_j) \mid f(\mathbf{h}_k) \quad (13)$$

for any points $\mathbf{h}_i$, $\mathbf{h}_j$ and $\mathbf{h}_k$. Where kernels can be expressed as products of real-valued functions of $\mathbf{h}_i$ and $\mathbf{h}_j$, i.e. assuming $\kappa(\mathbf{h}_i, \mathbf{h}_j) = \phi(\mathbf{h}_i)\psi(\mathbf{h}_j)$, and assuming the inverses $(\phi(\mathbf{h}_i)^{-1}, \psi(\mathbf{h}_j)^{-1})$ are defined for all $\mathbf{h}_i, \mathbf{h}_j \in H$, kernels give rise to this identity. To see this, consider a positive semi-definite kernel such that

$$\kappa(\mathbf{h}_i, \mathbf{h}_j) = \phi(\mathbf{h}_i)\psi(\mathbf{h}_j) \implies \kappa(\mathbf{h}_k, \mathbf{h}_k)^{-1} = \phi(\mathbf{h}_k)^{-1}\psi(\mathbf{h}_k)^{-1}$$
$$\implies \kappa(\mathbf{h}_i, \mathbf{h}_j) = \phi(\mathbf{h}_i)\psi(\mathbf{h}_k)\,\phi(\mathbf{h}_k)^{-1}\psi(\mathbf{h}_k)^{-1}\,\phi(\mathbf{h}_k)\psi(\mathbf{h}_j) = \kappa(\mathbf{h}_i, \mathbf{h}_k)\,\kappa(\mathbf{h}_k, \mathbf{h}_k)^{-1}\,\kappa(\mathbf{h}_k, \mathbf{h}_j). \quad (14)$$

Kernels of this form are valid, positive semi-definite kernels so long as the properties of symmetry and positive semi-definiteness are satisfied. Symmetry implies $\kappa(\mathbf{h}_i, \mathbf{h}_j) = \kappa(\mathbf{h}_j, \mathbf{h}_i)$, implying $\phi(\mathbf{h}_i)\psi(\mathbf{h}_j) = \phi(\mathbf{h}_j)\psi(\mathbf{h}_i)$ for all $\mathbf{h}_i, \mathbf{h}_j \in H$. Positive semi-definiteness implies that for any examples $\mathbf{h}_1, \ldots, \mathbf{h}_l$ and any set of real numbers $\lambda_1, \ldots, \lambda_l$, $\sum_{i=1}^{l}\sum_{j=1}^{l} \lambda_i \lambda_j \kappa(\mathbf{h}_i, \mathbf{h}_j) \geq 0$ (see e.g. [15]).

We note that following diagonal correction the implicitly sparse model is no longer invariant to pivot choice, however it is still an exact GP when used as in our model (i.e. to characterize cross-function covariance in the mixing model). For a collection of random variables to meet the definition of a GP, it is only required that any finite subset of the collection is jointly Gaussian distributed, not precluding finite index sets [24]. This observation also extends to the explicitly sparse case and the undefined, freely parameterized case.

E Computational Complexity
We consider complexity of key terms required for inference and prediction for sparse versus non-sparse models in the following sections, specifically complexity per group of latent functions. In the discussion that follows, it is assumed that $d$-dimensional matrix inverses can be evaluated with complexity $O(d^3)$ in the general case and $O(d)$ in sparse or diagonal cases, due to nonzero elements of sparse matrices growing linearly with $d$ (see §3.1).

Entropy:
The entropy component of the ELBO is approximated by

$$\mathcal{L}_{\text{ent}} \geq -\sum_{k=1}^{K} \pi_k \log \sum_{l=1}^{K} \pi_l\, \mathcal{N}(\mathbf{m}_k; \mathbf{m}_l, \mathbf{S}_k + \mathbf{S}_l) \quad \text{with} \quad \mathcal{N}(\mathbf{m}_k; \mathbf{m}_l, \mathbf{S}_k + \mathbf{S}_l) = \prod_{r=1}^{R} \mathcal{N}(\mathbf{m}_{kr}; \mathbf{m}_{lr}, \mathbf{S}_{kr} + \mathbf{S}_{lr}). \quad (15)$$

The normal terms in (15) differ in complexity over posterior forms and whether $l = k$. In the case where $K > 1$ and with a Kronecker posterior, both the general and sparse GGP have poor scalability, since evaluation requires the log determinant and inverse of $\mathbf{S}_{kr} + \mathbf{S}_{lr}$ with complexity $O((MQ_r)^3)$. Hence, we consider $K = 1$ or diagonal posteriors for $K > 1$. In the diagonal case, time and space complexity is $O(MQ_r)$ for sparse and non-sparse models ($O(KMQ_r)$ for $K > 1$).

For the non-diagonal case, the entropy component for each group $r$ reduces to $-\frac{1}{2}\log|\mathbf{S}_{kr}| - C$, where $C$ is constant with respect to model parameters. Since the log determinant decomposes as $\log|\mathbf{S}_{kr}| = Q_r M \ln 2 + M \ln|\mathbf{S}_{krb}| + Q_r \ln|\mathbf{S}_{krw}|$, evaluation depends only on the diagonal with complexity $O(Q_r + M)$. The cost of storage is $O(Q_r + M)$ versus $O(Q_r^2 + M^2)$ for sparse and general cases.

Cross Entropy: $\mathcal{L}_{\text{cross}}$ has several components that differ across sparse and general models, since

$$\mathcal{L}_{\text{cross}}(\lambda) = -\frac{1}{2}\sum_{k=1}^{K} \pi_k \sum_{r=1}^{R} \left[ M_r \log(2\pi) + \log|\mathbf{K}^r_{uu}| + \mathbf{m}'_{kr}(\mathbf{K}^r_{uu})^{-1}\mathbf{m}_{kr} + \mathrm{tr}\left((\mathbf{K}^r_{uu})^{-1}\mathbf{S}_{kr}\right) \right]. \quad (16)$$

Evaluation of $\mathcal{L}_{\text{cross}}$ involves three potentially costly expressions, $|\mathbf{K}^r_{uu}|$, $(\mathbf{K}^r_{uu})^{-1}\mathbf{m}_{kr}$ and $\mathrm{tr}((\mathbf{K}^r_{uu})^{-1}\mathbf{S}_{kr})$, naively $O((MQ_r)^3)$, considered in turn below.

Log determinant:
Similar to the entropy term, the expression $|\mathbf{K}^r_{uu}|$ decomposes to require only the calculation of $|\mathbf{K}^r_{hh}|$ and $|\mathbf{K}^r_{zz}|$, in $O(M^3 + Q_r^3)$ or $O(M^3 + Q_r)$ per group for general and sparse models respectively.

Matrix-vector term:
The winged-diagonal form of $(\mathbf{K}^r_{hh})^{-1}$ enables computation of $(\mathbf{K}^r_{uu})^{-1}\mathbf{m}_{kr}$ in $O(M^3 + Q_r)$ time in the sparse case versus $O(M^3 + Q_r^3)$ in the general case.

Trace term:
The trace term involves both prior and posterior covariance matrices. For the Kronecker posterior, which allows $\mathrm{tr}((\mathbf{K}^r_{uu})^{-1}\mathbf{S}_{kr}) = \mathrm{tr}((\mathbf{K}^r_{hh})^{-1}\mathbf{S}_{krb})\,\mathrm{tr}((\mathbf{K}^r_{zz})^{-1}\mathbf{S}_{krw})$, complexity is reduced from $O(M^3 + Q_r^3)$ in the general case to $O(M^3 + Q_r)$ in the sparse case. Where $\mathbf{S}_{kr}$ is diagonal, the trace term requires only diagonal elements of $(\mathbf{K}^r_{hh})^{-1}$, $(\mathbf{K}^r_{zz})^{-1}$ and $\mathbf{S}_{kr}$, leading to $O(Q_r + M^3 + MQ_r)$ versus $O(Q_r^3 + M^3 + MQ_r)$ in the general case.

Expected Log Likelihood: $\mathcal{L}_{\text{ell}}$ is defined as

$$\mathcal{L}_{\text{ell}}(\lambda) = \sum_{n=1}^{N} \mathbb{E}_{q^{(n)}(\mathbf{f}^{(n)} \mid \lambda)}\left[\log p(\mathbf{y}^{(n)} \mid \mathbf{f}^{(n)}, \phi)\right], \quad (17)$$

where we estimate the expectation in (17) by MC sampling from $q^{(n)}(\mathbf{f}^{(n)} \mid \lambda)$ as discussed below.
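The Kronecker identities underlying the log-determinant and trace reductions above are straightforward to verify numerically; a small sketch (ours, with random SPD stand-in factors):

```python
import numpy as np
rng = np.random.default_rng(0)

def rand_spd(d):
    a = rng.normal(size=(d, d))
    return a @ a.T + d * np.eye(d)   # symmetric positive definite

Qr, M = 3, 4
K_hh, K_zz = rand_spd(Qr), rand_spd(M)     # prior Kronecker factors
S_b, S_w = rand_spd(Qr), rand_spd(M)       # posterior Kronecker factors
K_uu = np.kron(K_hh, K_zz)

# log |K_uu| = M log |K_hh| + Q_r log |K_zz|
lhs = np.linalg.slogdet(K_uu)[1]
rhs = M * np.linalg.slogdet(K_hh)[1] + Qr * np.linalg.slogdet(K_zz)[1]
assert np.isclose(lhs, rhs)

# tr(K_uu^{-1} S) = tr(K_hh^{-1} S_b) * tr(K_zz^{-1} S_w) for S = S_b (x) S_w
lhs = np.trace(np.linalg.solve(K_uu, np.kron(S_b, S_w)))
rhs = np.trace(np.linalg.solve(K_hh, S_b)) * np.trace(np.linalg.solve(K_zz, S_w))
assert np.isclose(lhs, rhs)
```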
E.1 Indirect sampling

Sampling from the posterior distribution of $\mathbf{f}^{(n)}$, and similarly $\mathbf{f}_\star$, is required for both $\mathcal{L}_{\text{ell}}$ and prediction. Samples from $q^{(n)}(\mathbf{f}^{(n)} \mid \lambda)$ can be drawn from component group posteriors, i.e. $q_{k(n)}(\mathbf{f}^{(n)}_{Q_r} \mid \lambda_{kr}) = \mathcal{N}(\mathbf{b}_{kr(n)}, \boldsymbol{\Sigma}_{kr(n)})$, with mean and covariance expressions given by $\mathbf{b}_{kr(n)} = \mathbf{A}_{r(n)}\mathbf{m}_{kr}$ and $\boldsymbol{\Sigma}_{kr(n)} = \widetilde{\mathbf{K}}_{r(n)} + \mathbf{A}_{r(n)}\mathbf{S}_{kr}\mathbf{A}'_{r(n)}$, where

$$\widetilde{\mathbf{K}}_{r(n)} = \mathbf{K}^r_{hh} \times \left[\kappa_r(\mathbf{x}^{(n)}, \mathbf{x}^{(n)}) - \kappa_r(\mathbf{x}^{(n)}, Z_r)(\mathbf{K}^r_{zz})^{-1}\kappa_r(Z_r, \mathbf{x}^{(n)})\right] \quad \text{and} \quad \mathbf{A}_{r(n)} = \mathbf{I}_{Q_r} \otimes \kappa_r(\mathbf{x}^{(n)}, Z_r)(\mathbf{K}^r_{zz})^{-1}.$$

Given a Kronecker posterior, the quadratic term simplifies to

$$\mathbf{A}_{r(n)}\mathbf{S}_{kr}\mathbf{A}'_{r(n)} = \mathbf{S}_{krb} \otimes \kappa_r(\mathbf{x}^{(n)}, Z_r)(\mathbf{K}^r_{zz})^{-1}\mathbf{S}_{krw}(\mathbf{K}^r_{zz})^{-1}\kappa_r(Z_r, \mathbf{x}^{(n)}).$$

Direct sampling as in [10] requires factorizing the posterior covariance to obtain a premultiplier $\Psi(\boldsymbol{\Sigma}_{kr(n)})$ such that $\Psi\Psi' = \boldsymbol{\Sigma}_{kr(n)}$. However, since this differs for each observation, the associated cost per group is $O(NQ_r^3)$, where $N$ is mini-batch size for $\mathcal{L}_{\text{ell}}$ sampling or $N_{\text{test}}$ for prediction. We avoid this by making use of known Cholesky factors and sampling from two component distributions, using the property that the sum of two independent, normally distributed random variables, $X \sim \mathcal{N}(\mu_X, \Sigma_X)$ and $Y \sim \mathcal{N}(\mu_Y, \Sigma_Y)$, is also normally distributed with mean $\mu_X + \mu_Y$ and covariance $\Sigma_X + \Sigma_Y$. We draw independent samples from two distributions specified as $\mathcal{N}(\mathbf{0}, \widetilde{\mathbf{K}}_{r(n)})$ and $\mathcal{N}(\mathbf{b}_{kr(n)}, \mathbf{A}_{r(n)}\mathbf{S}_{kr}\mathbf{A}'_{r(n)})$, and sum these to return samples from $\mathcal{N}(\mathbf{b}_{kr(n)}, \boldsymbol{\Sigma}_{kr(n)})$.

Complexity of factorizations for $\widetilde{\mathbf{K}}_{r(n)}$ and $\mathbf{A}_{r(n)}\mathbf{S}_{kr}\mathbf{A}'_{r(n)}$, once obtained, differs across variants, but where a sparse prior is adopted it is reduced to $O(NQ_r)$. Critically, in the sparse case memory complexity is significantly reduced by replacing the premultiplier $\Psi(\widetilde{\mathbf{K}}_{r(n)})$ with vector operations, reducing memory complexity from $O(NQ_r^2)$ to $O(NQ_r)$.

The time costs to obtain the quadratic term $\mathbf{A}_{r(n)}\mathbf{S}_{kr}\mathbf{A}'_{r(n)}$ are $O(NM^2Q_r + M^3)$, $O(N(M^2 + Q_r^2) + M^3)$ and $O(N(M^2 + Q_r) + M^3)$ for the diagonal, general Kronecker and sparse Kronecker posteriors respectively. Similarly, the time costs of obtaining $\widetilde{\mathbf{K}}_{r(n)}$ are $O(N(M^2 + Q_r^2) + M^3)$ or $O(N(M^2 + Q_r) + M^3)$ for the general and sparse priors respectively. Direct sampling would then require combination and factorization of these components in $O(NQ_r^3)$ time and with $O(NQ_r^2)$ memory.

Indirect sampling requires factorization of $\widetilde{\mathbf{K}}_{r(n)}$ in $O(NQ_r^2 + Q_r^3)$ time for general priors and $O(NQ_r)$ for sparse priors. Similarly, given a Kronecker posterior, Cholesky factors for the quadratic term can be obtained in $O(NQ_r^2 + Q_r^3)$ and $O(NQ_r)$ for general and sparse specifications, respectively, and $O(NQ_r)$ in the diagonal case.

Overall, indirect sampling reduces time, but not memory, complexity relative to direct sampling in the general model. However, by using sparse priors it is possible to achieve significant reductions from the general indirect model. Further gains still are possible if a sparse Kronecker posterior is used in lieu of a diagonal posterior. The last point arises since (broadly speaking) operations involving sparse matrices can exploit decomposition properties of the Kronecker product to achieve linear complexity that is additive, rather than multiplicative, over $M$ and $Q_r$.
The same observation may be made about complexity reductions in entropy and cross-entropy terms. Further details are provided in the supplement.

F Details of experimental set-up
Three classes of non-GGP benchmark models are considered: the linear coregionalization model (LCM), the GPRN, and the multi-task model with task-specific features (MTG) of [6]. These benchmarks are chosen as they can be implemented in the same inference framework as the GGP and sparse GGP, allowing direct comparison of model performance. For the LCM model, we report two specifications for $P = 25$. The first mirrors the GGP with $P = Q_g$, and the second is a lower ranked model with $Q_g = 5$ ($Q_g = 4$) for diagonal (Kronecker) posteriors. For $P = 50$ we report only the first specification (the best LCM model based on RMSE). For the GPRN, we report models with $Q_g = 2$. This was the best performing specification for $P = 25$, and the largest value for $Q_g$ able to run on the same platform as sparse GGP models.
F.1 Kernels
Kernels for MTG models are defined as $\kappa(f_j(\mathbf{x}), f_{j'}(\mathbf{x}')) = \kappa_{\text{Per.}}(t, s)\,\kappa_{\text{RBF}}(\mathbf{l}_{jt}, \mathbf{l}_{j's})\,\kappa^{(2)}_{\text{RBF-Ep.}}(\mathbf{h}_j, \mathbf{h}_{j'})$. For GPRN latent functions in a row $\mathbf{W}_{i\cdot}$, the kernel is defined as $\kappa_{\text{Per.}}(t, s)\,\kappa_{\text{RBF}}(\mathbf{l}_{it}, \mathbf{l}_{is})$. For node functions, we use a radial basis kernel with lag features. For the LCM with $P = Q_g$, node functions are defined as combined periodic-radial basis kernels. For lower rank mixing models, we tested two node specifications for $Q_g = 2, \ldots$: (a) lags for all sites assigned to each node function, and (b) subsets of lags assigned to node functions based on k-means clustering of lags into $Q_g$ groups. Reported benchmarks for LCM use clustered lags, while GPRN benchmarks use complete lags per node.

For node functions, $\kappa_{g_j}(\mathbf{x}_t, \mathbf{x}_s)$ is a radial basis function kernel on a vector of lagged power features at site $i$, i.e. for site $i$ at time $t$, $\mathbf{l}_{i,t} = (y_{i,t}, y_{i,t-1})$. For each row-group $r$ in $\mathbf{W}^{(n)}$, $\kappa_r(f_j(\mathbf{x}), f_{j'}(\mathbf{x}')) = \kappa_r(\mathbf{x}, \mathbf{x}')\,\kappa_r(\mathbf{h}_j, \mathbf{h}_{j'})$, with $\kappa_r(\mathbf{x}, \mathbf{x}') = \kappa_{\text{Per.}}(t, s)\,\kappa_{\text{RBF}}(\mathbf{l}_{rt}, \mathbf{l}_{rs})$, where $\kappa_{\text{Per.}}(t, s)$ is a periodic kernel on a time index $t$.

F.2 Solar model variants

Selection of sparsity constraints a priori:
Given the spatial nature of covariance between sites, and where node functions align with tasks, the weight applied to each node by a given task would be expected to be a function of the (notional) spatial location of the node relative to the target task. We assume that weights are conditionally independent given the weight assigned to the task-associated node. We enforce this constraint for each row-group $i$ of $\mathbf{W}^{(n)}$:

$$\mathbf{W}^{(n)}_{ij} \perp \mathbf{W}^{(n)}_{ik} \mid \mathbf{W}^{(n)}_{ii}, \quad \forall\, j, k \neq i,\; j \neq k. \quad (18)$$

Since latent function groupings in the GGP framework can be any subset of $F$ in any order, it is only required that, for sparse inference, latent functions within each group $\mathbf{W}^{(n)}_{i\cdot}$ are ordered such that $\mathbf{W}^{(n)}_{ii}$ acts as the pivot. Within each row of weights, this gives rise to the star configuration for the undirected graphical model (as illustrated at Figure 2 in the supplement), with a single pivot weight function ($\mathbf{W}^{(n)}_{ii}$) and conditionally independent child weight functions corresponding to cross-site nodes ($\mathbf{W}^{(n)}_{ij}$, $j \neq i$).

Cross-function kernel variants:
We test three sparse forms of $\mathbf{K}^r_{hh}$, which we term (a) 'implicitly sparse' (sparsity is automatic, i.e. implicit in the model due to the separable kernel specification as discussed at §3.2.1), (b) 'explicitly sparse' (a stationary kernel that does not give rise to automatic sparsity but is used in conjunction with the constraint in (4) explicitly imposed), and (c) 'free sparsity' ($\mathbf{K}^r_{hh}$ is freely parameterized using $2Q_r - 1$ parameters for nonzero Cholesky elements and no kernel form is defined).

In the implicitly sparse case, we use a separable, dot product wavelet kernel with Ricker mother wavelet function (see [39]) plus diagonal correction as described at §3.2.1, which we assume to be shared across groups. The full kernel is given by $\kappa_{r(\text{wav.})} + \kappa_{r(\text{diag})}$ and the (full rank) sparse Cholesky can be constructed as in (5) with $\Phi_{ii} = \sqrt{\kappa_{r(\text{diag})}}$, $i \neq 1$.

For the explicit sparsity case, we use a combined radial basis, Epanechnikov kernel function with the constraint at (4) enforced by setting $\mathrm{Chol}(\mathbf{K}^r_{hh})_{j,k} = 0$, $j, k \neq 1$, $j \neq k$. Specifically, $\kappa_r(\mathbf{h}_j, \mathbf{h}_{j'}) = \kappa_{\text{RBF}}(\mathbf{h}_j, \mathbf{h}_{j'})\,\kappa_{\text{Ep.}}(\mathbf{h}_j, \mathbf{h}_{j'})$, $j, j' = 1, \ldots, P$.

F.3 Experimental Settings
All models are estimated based on the variational framework explained in §4, with indirect sampling used for both sparse and general models. Starting values for common components were set to be equal across all models, and the number of inducing points per group of latent functions is set to approximately standardize computational complexity per iteration relating to $RM$, using $M = 200$ per GGP group as the baseline.

The ELBO is iteratively optimized until its relative change over successive epochs falls below a fixed tolerance, up to a maximum of 200 epochs or a maximum of 20 hours. Optimization is performed using ADAM [19] with fixed settings for the learning rate and decay parameters $\beta_1$ and $\beta_2$. All data except time index features are normalized prior to optimization. Time and performance measures are captured every five epochs. Timed experimental results reported exclude time required to compute and store the various performance measures during optimization.

G Additional Experimental Results
Further experimental results are provided at Table 2. Representative results for forecast accuracy over estimation time for RMSE with sparse posterior are shown at Figures 3 and 4. Optimization of benchmark models is faster than GGP based models in some cases, however we also find that performance of GGP models quickly surpasses that of benchmark models. We found rankings in terms of speed at which gains are achieved to be consistent across accuracy measures and posterior specifications. In particular, LCM and GPRN achieve their gains quickly, as do explicitly sparse and free sparse GGP models, while general GGP models and implicitly sparse GGP models achieve gains more gradually.

Figure 3: RMSE over optimization time for $P = 25$ with Kronecker posterior.

Figure 4: RMSE over optimization time for $P = 50$ with Kronecker posterior.
Estimated covariance: Illustrative examples of $\mathbf{K}^r_{hh}$ estimated under the different GGP specifications with diagonal posteriors are presented at Figure 5. Heatmaps are shown for different row groups for sites in longitudinal order. As shown for the initial full GGP specification, parameters across row groups were found to be largely consistent with few exceptions (this was also true for latitudinal lengthscales). In contrast, parameters under sparse (explicit) and sparse (implicit) were found to be very adaptive, varying in lengthscale and magnitude. While the explicit sparse construction forces the model to place the most weight on nearest sites, weight centred under the wavelet kernel broadly tended to nearby sites but far less rigidly. Freely parameterized models (not shown) did not tend toward covariance structures estimated for models with explicit kernel definitions. Rather, $\mathbf{K}^r_{hh}$ in both the full and sparse free models tended to a very sparse diagonal structure.

Figure 5: Examples of $\mathbf{K}^r_{hh}$ under GGP and sparse GGP model variants for $P = 25$ (diagonal). Heatmaps of $\mathbf{K}^r_{hh}$ are shown for four row-groups in $W$ for tasks $i = 4, \ldots$, for tasks (sites) ordered by site longitude. Estimated $\mathbf{K}^r_{hh}$ shown for GGP (Panel (a)), Sparse GGP (explicit) (Panel (b)) and Sparse GGP (implicit) (Panel (c)). Heatmaps in Panel (c) are shown without the diagonal correction term.

Table 2: Forecast accuracy and variance of GGP, sparse GGP and benchmark models. Results shown for diagonal and Kronecker posteriors with $K = 1$ for Adelaide ($P = 25$) and Sydney ($P = 50$) datasets. Results are reported for best performing GPRN and LCM benchmarks (based on RMSE) and for LCM where $Q_g = P$. M-RANK is average model rank over accuracy measures (RMSE, MAE and NLPD).

P = 25 (Diagonal)         RMSE    MAE     NLPD    M-RANK   F-VAR
GGP
GGP (free)                0.345   0.215   0.378   7.0      0.116
Sparse GGP (implicit)     0.344   0.217   0.423   10.7     0.124
Sparse GGP (explicit)     0.345   0.215   0.382   7.7      0.115
Sparse GGP (free)         0.343   0.215
LCM (Qg = P)              0.371   0.235   0.553   15.0     0.181
LCM (Qg = 5)              0.366   0.239   0.481   14.3     0.151
GPRN (Qg = 2)             0.344   0.217   0.451   10.7     0.149
MTG

P = 25 (Kronecker)        RMSE    MAE     NLPD    M-RANK   F-VAR
GGP
GGP (free)                0.344
Sparse GGP (implicit)     0.342   0.213   0.414   4.7      0.120
Sparse GGP (explicit)
Sparse GGP (free)         0.344   0.216   0.378   7.0
LCM (Qg = P)              0.375   0.236   0.583   16.0     0.180
LCM (Qg = 4)              0.367   0.240   0.475   14.7     0.147
GPRN (Qg = 2)             0.342   0.214   0.446   5.7      0.150
MTG

P = 50 (Kronecker)        RMSE    MAE     NLPD    M-RANK   F-VAR
Sparse GGP (implicit)
Sparse GGP (explicit)     0.421   0.257   0.626   2.3      0.141
Sparse GGP (free)         0.423   0.258   0.625   2.7
LCM (Qg = P)              0.451   0.283   0.807   5.3      0.211
GPRN (Qg = 2)             0.428   0.263   0.664   4.0      0.185
MTG
References

[1] Mauricio Álvarez and Neil D Lawrence. Sparse convolved Gaussian processes for multi-output regression. In Neural Information Processing Systems, 2009.
[2] Mauricio A Álvarez and Neil D Lawrence. Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12(5):1459–1500, 2011.
[3] Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for vector-valued functions: A review. Foundations and Trends in Machine Learning, 4(3):195–266, March 2012.
[4] J. Antonanzas, N. Osorio, R. Escobar, R. Urraca, F.J. Martinez de Pison, and F. Antonanzas-Torres. Review of photovoltaic power forecasting. Solar Energy, 136:78–111, 2016.
[5] Edwin V Bonilla, Felix V Agakov, and Christopher KI Williams. Kernel multi-task learning using task-specific features. In AISTATS, pages 43–50, 2007.
[6] Edwin V. Bonilla, Kian Ming A. Chai, and Christopher K. I. Williams. Multi-task Gaussian process prediction. In Neural Information Processing Systems, 2008.
[7] Noel Cressie and Christopher K Wikle. Statistics for spatio-temporal data. John Wiley & Sons, 2011.
[8] Kurt Cutajar, Edwin V Bonilla, Pietro Michiardi, and Maurizio Filippone. Random feature expansions for deep Gaussian processes. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 884–893. JMLR.org, 2017.
[9] Kurt Cutajar, Michael Osborne, John Cunningham, and Maurizio Filippone. Preconditioning kernel matrices. In International Conference on Machine Learning, pages 2529–2538, 2016.
[10] Astrid Dahl and Edwin V Bonilla. Grouped Gaussian processes for solar power prediction. arXiv preprint arXiv:1806.02543, 2018.
[11] Amir Dezfouli and Edwin V Bonilla. Scalable inference for Gaussian process models with black-box likelihoods. In Neural Information Processing Systems, 2015.
[12] Trefor Evans and Prasanth Nair. Scalable Gaussian processes with grid-structured eigenfunctions (GP-GRIEF). In International Conference on Machine Learning, pages 1416–1425, 2018.
[13] Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, pages 7587–7597, 2018.
[14] Jacob R. Gardner, Geoff Pleiss, Ruihan Wu, Kilian Q. Weinberger, and Andrew Gordon Wilson. Product kernel interpolation for scalable Gaussian processes. arXiv, abs/1802.08903, 2018.
[15] Marc G Genton. Classes of kernels for machine learning: a statistics perspective. Journal of Machine Learning Research, 2(Dec):299–312, 2001.
[16] James Hensman, Nicolas Durrande, Arno Solin, et al. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18:151–1, 2017.
[17] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, 2013.
[18] James Hensman, Magnus Rattray, and Neil D. Lawrence. Fast nonparametric clustering of structured time-series. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2014.
[19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
[20] Haitao Liu, Yew-Soon Ong, Xiaobo Shen, and Jianfei Cai. When Gaussian process meets big data: A review of scalable GPs. arXiv preprint arXiv:1807.01065, 2018.
[21] Joaquin Quinonero-Candela and Carl Edward Rasmussen. Analysis of some methods for reduced rank Gaussian process regression. In Switching and Learning in Feedback Systems, pages 98–127. Springer, 2005.
[22] Joaquin Quiñonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939–1959, 2005.
[23] J. Quiñonero-Candela, C.E. Rasmussen, and C.K.I. Williams. Approximation Methods for Gaussian Process Regression, pages 203–223. Neural Information Processing. MIT Press, Cambridge, MA, USA, September 2007.
[24] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian processes for machine learning. The MIT Press, 2006.
[25] Håvard Rue, Sara Martino, and Nicolas Chopin. Approximate Bayesian inference for latent Gaussian models by using integrated nested Laplace approximations. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 71(2):319–392, 2009.
[26] Ed Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In Neural Information Processing Systems, 2006.
[27] Arno Solin and Simo Särkkä. Hilbert space methods for reduced-rank Gaussian process regression. arXiv preprint arXiv:1401.5508, 2014.
[28] Dongchu Sun and Xiaoqian Sun. Estimation of the multivariate normal precision and covariance matrices in a star-shape model. Annals of the Institute of Statistical Mathematics, 57(3):455–484, 2005.
[29] Xiaoqian Sun and Dongchu Sun. Estimation of the Cholesky decomposition of the covariance matrix for a conditional independent normal model. Statistics & Probability Letters, 73(1):1–12, 2005.
[30] Yee Whye Teh, Matthias Seeger, and Michael I. Jordan. Semiparametric latent factor models. In Artificial Intelligence and Statistics, 2005.
[31] Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, 2009.
[32] Cyril Voyant, Gilles Notton, Soteris Kalogirou, Marie-Laure Nivet, Christophe Paoli, Fabrice Motte, and Alexis Fouilloy. Machine learning methods for solar radiation forecasting: A review. Renewable Energy, 105:569–582, 2017.
[33] Ke Alexander Wang, Geoff Pleiss, Jacob R Gardner, Stephen Tyree, Kilian Q Weinberger, and Andrew Gordon Wilson. Exact Gaussian processes on a million data points. arXiv preprint arXiv:1903.08114, 2019.
[34] Joe Whittaker. Graphical models in applied multivariate statistics. Wiley Publishing, 2009.
[35] Joakim Widen, Nicole Carpman, Valeria Castellucci, David Lingfors, Jon Olauson, Flore Remouit, Mikael Bergkvist, Morten Grabbe, and Rafael Waters. Variability assessment and forecasting of renewables: A review for solar, wind, wave and tidal resources. Renewable and Sustainable Energy Reviews, 44:356–375, 2015.
[36] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.
[37] Andrew G. Wilson, David A. Knowles, and Zoubin Ghahramani. Gaussian process regression networks. In International Conference on Machine Learning, 2012.
[38] Dazhi Yang, Jan Kleissl, Christian A Gueymard, Hugo TC Pedro, and Carlos FM Coimbra. History and trends in solar irradiance and PV power forecasting: A preliminary assessment and review using text mining. Solar Energy, 2018.
[39] Li Zhang, Weida Zhou, and Licheng Jiao. Wavelet support vector machine.