Scaling up the Automatic Statistician: Scalable Structure Discovery using Gaussian Processes
Hyunjik Kim and Yee Whye Teh
University of Oxford, DeepMind {hkim,y.w.teh}@stats.ox.ac.uk
Abstract
Automating statistical modelling is a challenging problem in artificial intelligence. The Automatic Statistician takes a first step in this direction, by employing a kernel search algorithm with Gaussian Processes (GP) to provide interpretable statistical models for regression problems. However this does not scale due to its O(N³) running time for the model selection. We propose Scalable Kernel Composition (SKC), a scalable kernel search algorithm that extends the Automatic Statistician to bigger data sets. In doing so, we derive a cheap upper bound on the GP marginal likelihood that sandwiches the marginal likelihood with the variational lower bound. We show that the upper bound is significantly tighter than the lower bound and thus useful for model selection.

Automated statistical modelling is an area of research in its early stages, yet it is becoming an increasingly important problem [14]. As a growing number of disciplines use statistical analyses and models to help achieve their goals, the demand for statisticians, machine learning researchers and data scientists is at an all-time high. Automated systems for statistical modelling serve to assist such human resources, if not substitute for them where there is a shortage.

An example of a fruitful attempt at automated statistical modelling in nonparametric regression is Compositional Kernel Search (CKS) [10], an algorithm that fits a Gaussian Process (GP) to the data and automatically
chooses a suitable parametric form of the kernel. This leads to high predictive performance that matches kernels hand-selected by GP experts [27]. There also exist other approaches that tackle this model selection problem by using a more flexible kernel [2, 24, 30, 41, 43]. However the distinctive feature of Duvenaud et al. [10] is that the resulting models are interpretable; the kernels are constructed in such a way that we can use them to describe patterns in the data, and thus can be used for automated exploratory data analysis. Lloyd et al. [19] exploit this to generate natural language analyses from these kernels, a procedure that they name Automatic Bayesian Covariance Discovery (ABCD). The Automatic Statistician implements this to output a 10-15 page report when given a data set as input.

However, a limitation of ABCD is that it does not scale; due to the O(N³) time for inference in GPs, the analysis is constrained to small data sets, specialising in one-dimensional time series data. This is undesirable not only because the average size of data sets is growing fast, but also because there is potentially more information in bigger data, implying a greater need for more expressive models that can discover finer structure. This paper proposes Scalable Kernel Composition (SKC), a scalable extension of CKS, to push the boundaries of automated interpretable statistical modelling to bigger data. In summary, our work makes the following contributions:

• We propose the first scalable version of the Automatic Statistician that scales up to medium-sized data sets by reducing algorithmic complexity from O(N³) to O(N²) and enhancing parallelisability.

• We derive a novel cheap upper bound on the GP marginal likelihood, which is used in SKC with the variational lower bound [37] to sandwich the GP marginal likelihood.

• We show that our upper bound is significantly tighter than the lower bound, and plays an important role in model selection.
The Compositional Kernel Search (CKS) algorithm [10] builds on the idea that the sum and product of two positive definite kernels are also positive definite. Starting off with a set B of base kernels defined on R × R, the algorithm searches through the space of zero-mean GPs with kernels that can be expressed in terms of sums and products of these base kernels. B = {SE, LIN, PER} is used, corresponding to the squared exponential, linear and periodic kernels respectively (see Appendix C for the exact form of these base kernels). The candidate kernels thus form an open-ended space of GP models, allowing for an expressive model. Such approaches for structure discovery have also appeared in [12, 13]. A greedy search is employed to explore this space, with each kernel scored by the Bayesian Information Criterion (BIC) [31] after optimising the kernel hyperparameters by type II maximum likelihood (ML-II). (The BIC is the log marginal likelihood with a model complexity penalty; we use a definition where higher BIC means better model fit. See Appendix A.) See Appendix B for the algorithm in detail. The resulting kernel can be simplified into a sum of products of base kernels, which has the notable benefit of interpretability; a small sketch of the resulting space of composite kernels is given below.
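The following sketch (our own illustration, not the Automatic Statistician code) shows one simple way to represent this grammar of composite kernels: expressions are nested tuples whose leaves are base kernel names, and each greedy step proposes the children k + B and k × B. The names and data structures are ours; base-kernel replacement is omitted for brevity.

```python
# Illustrative sketch of the open-ended space of composite kernels that CKS searches.
# Kernel expressions are nested tuples whose leaves are base kernel names; '+' and '*'
# combine them, mirroring the fact that sums and products of positive definite kernels
# are positive definite.

BASE_KERNELS = ["SE", "LIN", "PER"]  # in general, one copy per input dimension

def expand(kernel):
    """Candidate kernels generated from `kernel` in one greedy step:
    kernel + B and kernel * B for every base kernel B."""
    candidates = []
    for base in BASE_KERNELS:
        candidates.append(("+", kernel, base))
        candidates.append(("*", kernel, base))
    return candidates

def to_string(kernel):
    if isinstance(kernel, str):
        return kernel
    op, left, right = kernel
    return f"({to_string(left)} {op} {to_string(right)})"

if __name__ == "__main__":
    # Depth-1 candidates grown from an SE kernel, e.g. (SE + PER), (SE * PER), ...
    for cand in expand("SE"):
        print(to_string(cand))
```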
In particular, note that f₁ ∼ GP(0, k₁), f₂ ∼ GP(0, k₂) ⇒ f₁ + f₂ ∼ GP(0, k₁ + k₂) for independent f₁ and f₂. So a GP whose kernel is a sum of products of kernels can be interpreted as a sum of GPs, each with structure given by a product of kernels. Each base kernel in a product then modifies the model in a consistent way. For example, multiplication by SE converts global structure into local structure since SE(x, x′) decreases exponentially with |x − x′|, and multiplication by PER is equivalent to multiplication of the modelled function by a periodic function (see Lloyd et al. [19] for detailed interpretations of different combinations). This observation is used in Automatic Bayesian Covariance Discovery (ABCD) [19], giving a natural language description of the function modelled by the composite kernel. In summary, ABCD consists of two algorithms: the compositional kernel search CKS, and the natural language translation of the kernel into a piece of exploratory data analysis.

ABCD provides a framework for a natural extension to big data settings: we only need to scale up CKS, and the natural language description of models can then be applied directly. The difficulty of this extension lies in the O(N³) time for evaluation of the GP marginal likelihood and its gradients with respect to the kernel hyperparameters.

A naïve approach is to subsample the data to reduce N, but then we may fail to capture global structure such as periodicities with long periods, or omit a set of points displaying a certain local structure. We show failure cases of random subsampling in Section 4.3. Regarding more strategic subsampling, a generic subsampling algorithm for GPs that is able to capture the aforementioned properties of the data is a challenging research problem in itself.

Alternatively, it is tempting to use either an approximate marginal likelihood or the exact marginal likelihood of an approximate model as a proxy for the exact likelihood [25, 32, 34, 37], especially with modern GP approximations scaling to large data sets with millions of data points [15, 42]. However, such scalable GP methods are limited in interpretability as they often behave very differently to the full GP, lacking guarantees that the chosen kernel faithfully reflects the actual structure in the data. In other words, the real challenge is to scale up the GP while preserving interpretability, and this is a difficult problem due to the trade-off between scalability and accuracy of GP approximations. Our work pushes the frontiers of interpretable GPs to medium-sized data (N = 10K-100K) by reducing the computational complexity of ABCD from O(N³) to O(N²). Extending this to large data sets (N = 100K-1M) is a difficult open problem.

Our approach is as follows: we provide a cheap lower bound and upper bound that sandwich the exact marginal likelihood, and we use this interval for model selection. To do so, we first give a brief overview of the relevant work on low-rank kernel approximations used for scaling up GPs, and later outline how they can be applied to obtain cheap lower and upper bounds.
The Nyström method [9, 40] selects a set of m inducing points in the input space R^D that attempt to explain all the covariance in the Gram matrix of the kernel; the kernel is evaluated for each pair of inducing points and also between the inducing points and the data, giving matrices K_mm and K_mN = K_Nm^⊤. These are used to form the Nyström approximation K̂ = K_Nm K_mm^† K_mN of K = K_NN, where † denotes the pseudo-inverse. Applying a Cholesky decomposition to K_mm, we see that the approximation admits the low-rank form Φ^⊤Φ and so allows efficient computation of determinants and inverses in O(Nm²) time (see Appendix D). We later use the Nyström approximation to give an upper bound on the exact log marginal likelihood.

The Nyström approximation arises naturally in the sparse GP literature, where certain distributions are approximated by simpler ones involving f_m, the GP evaluated at the m inducing points: the Deterministic Training Conditional (DTC) approximation [32] defines a model whose marginal likelihood is q(y) = N(y | 0, K̂ + σ²I) (y is the vector of observations), whereas the Fully Independent Conditional (FIC) approximation [34] gives q(y) = N(y | 0, K̂ + diag(K − K̂) + σ²I), correcting the Nyström approximation along the diagonal. The Partially Independent Conditional (PIC) approximation [25] further improves this by correcting the Nyström approximation on block diagonals, with blocks typically of size m × m. Note that the approximation is no longer low rank for FIC and PIC, but matrix inversion can still be computed in O(Nm²) time by Woodbury's Lemma (see Appendix D).

The variational inducing points method (VAR) [37] is rather different to DTC/FIC/PIC in that it gives the following variational lower bound on the exact log marginal likelihood:

\log \mathcal{N}(y \mid 0, \hat{K} + \sigma^2 I) - \frac{1}{2\sigma^2}\operatorname{Tr}(K - \hat{K}) \quad (1)

This lower bound is optimised with respect to the inducing points and the kernel hyperparameters, which is shown in the paper to yield tight lower bounds in O(Nm²) time for reasonable values of m. Another useful property of VAR is that the lower bound can only increase as the set of inducing points grows [21, 37]. It is also known that VAR always improves with extra computation, and that it successfully recovers the true posterior GP in most cases, contrary to other sparse GP methods [4]. Hence this is what we use in SKC to obtain a lower bound on the marginal likelihood and to optimise the hyperparameters. We use 10 random initialisations of the hyperparameters and choose the one with the highest lower bound after optimisation.

Fixing the hyperparameters to be those tuned by VAR, we seek a cheap upper bound on the exact marginal likelihood. Upper bounds and lower bounds are qualitatively different, and in general it is more difficult to obtain an upper bound than a lower bound, for the following reason: the marginal likelihood is the integral of the likelihood with respect to the prior density of the parameters. Hence to obtain a lower bound it suffices to exhibit regions in parameter space giving high likelihood. However, to obtain an upper bound one must demonstrate the absence or lack of likelihood mass outside a certain region, an arguably more difficult task. There has been some work on the subject [5, 17], but to the best of our knowledge there has not been any work on cheap upper bounds to the marginal likelihood in large-N settings. So finding an upper bound from the perspective of the marginal likelihood can be difficult.
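As a concrete illustration of the Nyström approximation and the bound of eq. (1), the sketch below (our own, not the authors' implementation) computes both with numpy for fixed hyperparameters and inducing points, using an SE kernel as a stand-in for whichever kernel is being scored. It forms dense N × N matrices for clarity; an O(Nm²) implementation would instead use the identities of Appendix D.

```python
import numpy as np

def se_kernel(x1, x2, variance=1.0, lengthscale=1.0):
    # Squared exponential kernel; stands in for the composite kernel being scored.
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def variational_lower_bound(x, y, z, sigma2, kern=se_kernel):
    """Collapsed variational bound of eq. (1):
    log N(y | 0, K_hat + sigma2*I) - Tr(K - K_hat) / (2*sigma2),
    with K_hat the Nystrom approximation built from inducing inputs z."""
    N = len(x)
    Kmm = kern(z, z) + 1e-8 * np.eye(len(z))   # jitter for numerical stability
    KmN = kern(z, x)
    K_hat = KmN.T @ np.linalg.solve(Kmm, KmN)  # Nystrom approximation of K
    K = kern(x, x)
    C = K_hat + sigma2 * np.eye(N)
    _, logdet = np.linalg.slogdet(C)
    quad = y @ np.linalg.solve(C, y)
    log_gauss = -0.5 * (logdet + quad + N * np.log(2 * np.pi))
    return log_gauss - np.trace(K - K_hat) / (2 * sigma2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.sort(rng.uniform(0, 10, 500))
    y = np.sin(x) + 0.1 * rng.standard_normal(500)
    z = x[:: len(x) // 20]                      # 20 inducing points from a strided subset
    print(variational_lower_bound(x, y, z, sigma2=0.01))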
Instead, we exploit the fact that the GP marginal likelihood has an analytic form, and treat it as a function of K. The GP log marginal likelihood is composed of two terms and a constant:

\log p(y) = \log \mathcal{N}(y \mid 0, K + \sigma^2 I) = -\tfrac{1}{2}\log\det(K + \sigma^2 I) - \tfrac{1}{2}\, y^\top (K + \sigma^2 I)^{-1} y - \tfrac{N}{2}\log(2\pi) \quad (2)

We give separate upper bounds on the negative log determinant (NLD) term and the negative inner product (NIP) term. For NLD, it has been proven that

-\tfrac{1}{2}\log\det(K + \sigma^2 I) \le -\tfrac{1}{2}\log\det(\hat{K} + \sigma^2 I) \quad (3)

a consequence of K − K̂ being a Schur complement of K and hence positive semi-definite (e.g. [3]). So the Nyström approximation plugged into the NLD term serves as an upper bound that can be computed in O(Nm²) time (see Appendix D).

As for NIP, we point out that σ² yᵀ(K + σ²I)⁻¹y = min_{f∈H} Σ_{i=1}^N (y_i − f(x_i))² + σ²‖f‖²_H, the optimal value of the objective function in kernel ridge regression, where H is the Reproducing Kernel Hilbert Space associated with k (e.g. [23]). The dual problem, whose objective function has the same optimal value, is max_{α∈R^N} −σ²[αᵀ(K + σ²I)α − 2αᵀy]. So we have the following upper bound, for all α ∈ R^N:

-\tfrac{1}{2}\, y^\top (K + \sigma^2 I)^{-1} y \le \tfrac{1}{2}\,\alpha^\top (K + \sigma^2 I)\,\alpha - \alpha^\top y \quad (4)

Note that the right-hand side is also in the form of an objective for conjugate gradients (CG) [33], hence equality is obtained at the optimum α̂ = (K + σ²I)⁻¹y. We can approach the optimum, and thereby a tighter bound, by running CG or preconditioned CG (PCG) for a fixed number of iterations to get a reasonable approximation to α̂. Each iteration of CG, and the computation of the upper bound, takes O(N²) time, but PCG is very fast even for large data sets, and using FIC/PIC as the preconditioner gives the fastest convergence in general [8].

Recall that although the lower bound takes O(Nm²) time to compute, we need m = O(N^β) for accurate approximations, where β depends on the data distribution and kernel [28]. β is usually close to 0.5, hence the lower bound is also effectively O(N²). In practice, the upper bound evaluation seems to be a little more expensive than a single lower bound evaluation, but we only need to compute the upper bound once, whereas we must evaluate the lower bound and its gradients multiple times for the hyperparameter optimisation.
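The sketch below (ours) puts eqs. (2)-(4) together: the NLD term is bounded via the Nyström log determinant and the NIP term via the dual objective evaluated after a few conjugate gradient steps. For simplicity it uses plain, unpreconditioned CG and dense matrices, whereas the paper favours PCG with a FIC/PIC preconditioner; the demo data at the bottom are synthetic.

```python
import numpy as np

def upper_bound_log_ml(K, K_hat, y, sigma2, cg_iters=50):
    """Cheap upper bound on the GP log marginal likelihood:
    NLD bounded by -1/2 log det(K_hat + sigma2*I) as in eq. (3), and
    NIP bounded by 1/2 a^T C a - a^T y as in eq. (4), with a an approximate
    solution of C a = y (C = K + sigma2*I) from a few CG iterations."""
    N = len(y)
    C = K + sigma2 * np.eye(N)

    _, logdet_hat = np.linalg.slogdet(K_hat + sigma2 * np.eye(N))
    nld_ub = -0.5 * logdet_hat

    # plain conjugate gradients on C a = y
    alpha = np.zeros(N)
    r = y - C @ alpha
    p = r.copy()
    for _ in range(cg_iters):
        Cp = C @ p
        step = (r @ r) / (p @ Cp)
        alpha = alpha + step * p
        r_new = r - step * Cp
        if np.linalg.norm(r_new) < 1e-10:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p
        r = r_new
    nip_ub = 0.5 * alpha @ (C @ alpha) - alpha @ y

    return nld_ub + nip_ub - 0.5 * N * np.log(2 * np.pi)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Phi = rng.standard_normal((30, 400))           # synthetic rank-30 kernel
    K = Phi.T @ Phi
    idx = rng.choice(400, size=60, replace=False)  # 60 "inducing" points from the data
    K_hat = K[:, idx] @ np.linalg.pinv(K[np.ix_(idx, idx)]) @ K[idx, :]
    y = rng.standard_normal(400)
    print(upper_bound_log_ml(K, K_hat, y, sigma2=0.1))
```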
Algorithm 1:
Scalable Kernel Composition (SKC)
Input: data x₁, ..., x_n ∈ R^D, y₁, ..., y_n ∈ R, base kernel set B, depth d, number of inducing points m, kernel buffer size S.
Output: k, the resulting kernel.

For each base kernel on each dimension, obtain lower and upper bounds on the BIC (the BIC interval), set k to be the kernel with the highest upper bound, and add k to the kernel buffer K. C ← ∅.
for depth = 1:d do
    From C, add to K all kernels whose intervals overlap with that of k if there are fewer than S of them, else add the kernels with the top S upper bounds.
    for k′ ∈ K do
        Add the following kernels to C and obtain their BIC intervals:
        (1) all kernels of the form k′ + B, where B is any base kernel on any dimension;
        (2) all kernels of the form k′ × B, where B is any base kernel on any dimension.
    if ∃ k* ∈ C with a higher upper bound than k then k ← k*

We later confirm in Section 4.1 that the upper bound is fast to compute relative to the lower bound optimisation. We also show empirically in Section 4.1 that the upper bound is tighter than the lower bound, and give the following sufficient condition for this to be true:

Proposition 1.
Suppose (λ̂_i)_{i=1}^N are the eigenvalues of K̂ + σ²I in descending order. If (P)CG for the NIP term converges and λ̂_N ≥ 2σ², then the upper bound is tighter than the lower bound.

Notice that λ̂_N ≥ σ² for any K̂, so the assumption is feasible. The proof is in Appendix E. We later show in Section 4 that the upper bound is not only tighter than the lower bound, but also much less sensitive to the choice of inducing points. Hence we use the upper bound to choose between kernels whose BIC intervals overlap.

We base our algorithm on two intuitive claims. First, the lower and upper bounds converge to the true log marginal likelihood for fixed hyperparameters as the number of inducing points m increases. Second, the hyperparameters optimised by VAR converge to those obtained by optimising the exact log marginal likelihood as m increases. The former is confirmed in Figure 2 as well as in other works (e.g. [4] for the lower bound), and the latter is confirmed in Appendix F.

The algorithm proceeds as follows: for each base kernel and a fixed value of m, we compute the lower and upper bounds to obtain an interval for the GP marginal likelihood, and hence for the BIC of the kernel, with its hyperparameters optimised by VAR. We rank these kernels by their intervals, using the upper bound as a tie-breaker for kernels with overlapping intervals. We then perform a semi-greedy kernel search, expanding the search tree on all (or some, controlled by the buffer size S) kernels whose intervals overlap with the top kernel at the current depth. We recurse to the next depth by computing intervals for these child kernels (parent kernel +/× base kernel), ranking them and further expanding the search tree. This is summarised in Algorithm 1 (and sketched in code below), and Figure 12 in Appendix Q is a visualisation of the tree for different values of m. Details on the optimisation and initialisation are given in Appendices H and I.

The hyperparameters found by VAR may not be the global maximisers of the exact GP marginal likelihood, but as in ABCD we can optimise the marginal likelihood with multiple random seeds and choose the local optimum closest to the global optimum. One may still question whether the hyperparameter values found by VAR agree with the structure in the data, such as period values and slopes of linear trends. We show in Sections 4.2 and 4.3 that a small m suffices for this to be the case.
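The following sketch (ours, not the authors' code) shows the interval bookkeeping of Algorithm 1 in simplified form. Here `bic_interval` is a hypothetical stand-in for fitting a kernel with VAR and returning (lower bound, upper bound) on its BIC, and `expand` generates the child kernels k + B and k × B as in the CKS grammar.

```python
def skc_search(base_kernels, bic_interval, expand, max_depth, buffer_size):
    intervals = {k: bic_interval(k) for k in base_kernels}  # kernel -> (LB, UB) on BIC
    best = max(intervals, key=lambda k: intervals[k][1])    # rank by upper bound
    frontier = dict(intervals)                               # kernels eligible for expansion
    for _ in range(max_depth):
        # kernels whose interval overlaps the current best's interval, capped at buffer_size
        overlapping = [k for k in frontier if frontier[k][1] >= intervals[best][0]]
        overlapping.sort(key=lambda k: frontier[k][1], reverse=True)
        buffer = overlapping[:buffer_size]
        # expand the buffer: children are parent + B and parent * B
        frontier = {}
        for parent in buffer:
            for child in expand(parent):
                frontier[child] = bic_interval(child)
        intervals.update(frontier)
        challenger = max(frontier, key=lambda k: frontier[k][1], default=None)
        if challenger is not None and frontier[challenger][1] > intervals[best][1]:
            best = challenger                                # upper bound breaks ties
    return best
```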
Choice of m  We can guarantee that the lower bound increases with larger m, but cannot guarantee that the upper bound decreases, since we tend to obtain better hyperparameters for higher m, which boost the marginal likelihood and hence the upper bound. We verify this in Section 4.1. Hence throughout SKC we fix m to the largest value that one can afford, so that the marginal likelihood with hyperparameters optimised by VAR is as close as possible to the marginal likelihood with optimal hyperparameters. It is natural to wonder whether an adaptive choice of m is possible, using higher m for more promising kernels to tighten their bounds. However, a fair comparison of different kernels via these lower and upper bounds requires that they share the same value of m, since using a higher m is guaranteed to give a higher lower bound.
Parallelisability  Note that SKC is extremely parallelisable across different random initialisations and different kernels at each depth, as is CKS. In fact the presence of intervals in SKC, and hence of buffers of kernels, allows further parallelisation over kernels of different depths in certain cases (see Appendix G).
Figure 1: (a) Solar, fixed inducing points. Left: log marginal likelihood (ML) for the full GP with optimised hyperparameters, the optimised VAR LB for each of 10 random initialisations per m, the exact log ML for the best hyperparameters out of the 10, and the corresponding UB. Middle: exact NLD and its UB. Right: exact NIP and its UB after m iterations of CG/PCG. (b) Solar, learned inducing points: same as (a), except that the inducing points are learned during the LB optimisation and used for the subsequent computations.

We present results for experiments showing the bounds we obtain for two small time series and a multidimensional regression data set, for which CKS is feasible. The first is the annual solar irradiance data from 1610 to 2011, with 402 observations [18]. The second is the Mauna Loa CO2 time series [36] with 689 observations. See Appendix K for plots of the time series. The multidimensional data set is the concrete compressive strength data set with 1030 observations and 8 covariates [44]. The functional form of the kernel used for each of these data sets has been found by CKS (see Figure 3). All observations and covariates have been normalised to have mean 0 and variance 1.

From the left panels of Figures 1a, 10a and 11a (the latter two can be found in Appendix Q), we see that VAR gives a LB on the optimal log marginal likelihood that improves with increasing m. The best LB is tight relative to the exact log marginal likelihoods at the hyperparameters optimised by VAR. We also see that the UB is even tighter than the LB, and increases with m as hypothesised. From the middle panels, we observe that the Nyström approximation gives a very tight UB on the NLD term. We also tried using RFF to get a stochastic UB (see Appendix P), but this is not as tight, especially for larger values of m. From the right panels, we can see that PCG with any of the three preconditioners (Nyström, FIC, PIC) gives a very tight UB on the NIP term, whereas CG may require more iterations to get tight, for example in Figures 10a and 10b in Appendix Q.

Figure 2 shows further experimental evidence, for a wider range of kernels, reinforcing the claim that the UB is tighter than the LB, as well as being much less sensitive to the choice of inducing points. This is why the UB is a much more reliable metric for model selection than the LB. Hence we use the UB for a one-off comparison of kernels when their BIC intervals overlap.

Comparing Figures 1a, 10a, 11a against 1b, 10b, 11b, learning the inducing points does not lead to a vast improvement in the VAR LB. In fact the differences are not very significant, and sometimes learning the inducing points can get the LB stuck in a bad local optimum, as indicated by the high variance of the LB in the latter three figures. Moreover the difference in computational time is significant, as we can see in Table 2 of Appendix J. Hence the computation-accuracy trade-off is best when fixing the inducing points to a random subset of the training data.

Table 2 also compares times for the different computations after fixing the inducing points. The gain from using the variational LB instead of the full GP is clear, especially for the larger data sets, and we also confirm that it is indeed the optimisation of the LB that is the bottleneck in terms of computational cost.
We also see that the NIP UB computation times are similarly fast for all m, so convergence of PCG with the PIC preconditioner happens in only a few iterations.

We compare the kernels chosen by CKS and by SKC for the three data sets. The results are summarised in Figure 3. For solar, we see that SKC successfully finds SE × PER, which is the second highest kernel for CKS, with a BIC very close to that of the top kernel. For mauna, SKC selects (SE + PER) × SE + LIN, which is third highest for CKS, again with a BIC very close to that of the top kernel. Looking at the values of the hyperparameters in the PER and LIN kernels found by SKC, 40 inducing points are sufficient for it to successfully find the correct periods and slopes in the data, reinforcing the claim that we only need a small m to find good hyperparameters (see Appendix K for details).
Figure 2: UB and LB for kernels at depth 1 on each dimension of the concrete data, while varying the inducing points with hyperparameters fixed to the optimal values for the full GP. Error bars show the mean ± one standard deviation over the sampled sets of inducing points. (Panels: LIN, PER and SE on each of the 8 input dimensions.)

[Figure 3 panels: CKS Solar; SKC Solar (m = 80, S = 2); CKS Mauna; SKC Mauna (m = 80, S = 2); CKS Concrete; SKC Concrete (m = 80, S = 1), each showing the kernels selected at successive depths.]
Figure 3: CKS and SKC results for up to depth 6. Left: BIC of the kernels chosen at each depth by CKS. Right: BIC intervals of the kernels that have been added to the buffer by SKC with m = 80. The arrow indicates the kernel chosen at the end.

For concrete, a more challenging eight-dimensional data set, we see that the kernels selected by SKC do not match those selected by CKS, but SKC still manages to find similar additive structure, such as SE1 + SE8 and SE4. Of course, the BIC intervals for kernels found by SKC are for hyperparameters found by VAR with m = 80, hence they do not necessarily contain the optimal BIC of the kernels in CKS. However, the above results show that our method is still capable of selecting appropriate kernels even for low values of m, without having to home in on the optimal BIC using high values of m. The savings in computation time are significant even for these small data sets, as shown in Table 2.

The UB has marginal benefits over the LB for small data sets, where the LB is already a fairly good estimate of the true BIC. However, this gap becomes significant as N grows and as kernels become more complex; the UB is much tighter and more stable with respect to the choice of inducing points (as shown in Figure 2), and plays a crucial role in the model selection.
Figure 4: Comparison of SKC (top) and SKC-LB (bottom), with m = 320 and S = 1, to depth 5 on the Power Plant data. The format is the same as Figure 3, but with the crosses at the true BIC instead of the bounds.
Power Plant  We first show this for SKC on the Power Plant data with 9568 observations and 4 covariates [38]. We see from Figure 4 that again the UB is much tighter than the LB, especially for the more complex kernels further down the search tree. Comparing the two plots, we also see that the use of the UB in SKC has a significant positive impact on model selection: the kernel found by SKC at depth 5 has exact BIC 1469.6, whereas the corresponding kernel found by just using the LB (SKC-LB) has exact BIC 745.5. The pathological behaviour of SKC-LB occurs at depth 3, where the (SE1 + SE2) × PER3 kernel found by SKC-LB has a higher LB than the corresponding SE1 + SE2 + SE2 kernel found by SKC. SKC-LB chooses the former kernel since it has a higher LB, which is a failure mode since the latter kernel has a significantly higher exact BIC. Due to the tightness of the UB, SKC correctly chooses the latter kernel over the former, escaping this failure mode.
CKS vs SKC runtime
We run CKS and SKC with m = 160, 320 and 640 on the Power Plant data on the same machine with the same hyperparameter settings. The runtimes up to depths 1/2 are: 28.7h/94.7h (CKS), 0.6h/2.6h (SKC, m = 160), 1.1h/4.1h (SKC, m = 320), and 4.2h/15.0h (SKC, m = 640). Again, the reduction in computation time for SKC is substantial.
Time Series  We also run SKC on medium-sized time series data to explore whether it can successfully find known structure at different scales. Such data sets occur frequently as hourly time series over several years or daily time series over centuries.
GEFCOM
First we use the energy load data set from the 2012 Global Energy Forecasting Competition [16], an hourly time series of energy load from 2004/01/01 to 2008/06/30 with 8 weeks of missing data, giving N ≈ 38,000. Notice that this data size is far beyond the scope of full GP optimisation in CKS.

Figure 5: Top: plot of the full GEFCOM data (load in kW against time). Bottom: zoom in on the first 7 days.

From the plots, we can see that there is a noisy 6-month periodicity in the time series, as well as a clear daily periodicity with peaks in the morning and evening. Despite the periodicity, there are some noticeable irregularities in the daily pattern.

Table 1: Hyperparameters of the kernels found by SKC on the GEFCOM data and on the Tidal data after normalising y. Length scales, periods and locations are converted to the original scale; σ is left as found with normalised y.

GEFCOM                                        Tidal
SE:  σ = 0.…,  l = 60.… days                  SE:  σ = 2.…,  l = 82.5 days
PER: σ = 1.…,  l = 1089 days,  p = 1.… days   PER: σ = 5.…,  l = 1026 days,  p = 0.… days
SE:  σ = 0.…,  l = 331 days                   LIN: σ = 0.…,  loc = year …
PER: σ = 0.…,  l = 170 days,   p = 174 days   PER: σ = 0.…,  l = 5974 days,  p = 0.… days
LIN: σ = 0.…,  loc = year …                   PER: σ = 0.…,  l = 338 days,   p = 14.… days

The SE × PER + SE × (PER + LIN) kernel found by SKC with m = 160 and its hyperparameters are summarised in the GEFCOM column of Table 1. Note that with only 160 inducing points, SKC has successfully found the daily periodicity (the first PER) and the 6-month periodicity (the second PER). The SE kernel in the first additive term suggests a local periodicity, exhibiting the irregular observation pattern that is repeated daily. Also note that the hyperparameter corresponding to the longer periodicity is close to, but not exactly, half a year, owing to the noisiness of this periodicity that is apparent in the data. Moreover the magnitude of the second additive term, which contains the longer periodicity, is much smaller than that of the first, explicitly showing that the second term has less effect than the first on the behaviour of the time series, hence the weaker long-term periodicity.
Running the kernel search using just the LB with the same hyperparameters as SKC, the algorithm is only able to detect the daily periodicity and misses the 6-month periodicity. This is consistent with the observation that the LB is loose compared to the UB, especially for bigger data sets, and hence fails to evaluate the kernels correctly. We also tried selecting random subsets of the data (of four different sizes) and running CKS. Even for the largest subset, the algorithm could not find the daily periodicity, and none of the four subset sizes made it past depth 4 in the kernel search, struggling to find sophisticated structure.
Tidal  We also run SKC on tidal data, namely the sea level measurements in Dover from 2013/01/03 to 2016/08/31 [1]. This is an hourly time series, with each observation taken to be the mean of four readings taken every 15 minutes, giving N ≈ 31,000.
Figure 6: Top: plot of the full tidal data (sea level elevation against time). Bottom: zoom in on the first 4 weeks.

Looking at the bottom of Figure 6, we can find clear 12-hour periodicities, with amplitudes that follow slightly noisy bi-weekly periodicities. The shorter periods, called semi-diurnal tides, are on average 12 hours 25 minutes, roughly half a day [22]. The SE × PER × LIN × PER × PER kernel found by SKC with m = 640 and its hyperparameters are summarised in the Tidal column of Table 1. The PER × PER structure precisely corresponds to the doubly periodic pattern in the data, whereby we have semi-daily periodicities with bi-weekly periodic amplitudes. The SE kernel has a length scale of 82.5 days, giving a local periodicity; this represents the irregularity of the amplitudes that can be seen in the bottom plot of Figure 6 between days 20 and 40. The LIN kernel has a large magnitude, but its effect is negligible when multiplied in, since its slope is calculated to be very close to zero (see Appendix K for the slope calculation formula); it essentially raises the magnitude of the resulting kernel and hence represents noise in the data. The remaining PER kernel is also negligible due to its long length scale and small magnitude, indicating that the amplitude of the periodicity due to this term is very small, so the term has minimal effect on the kernel.

The kernel search using just the LB fails to proceed past depth 3, since it is unable to find a kernel with a higher LB on the BIC than at the previous depth (the extra penalty incurred by increasing model complexity outweighs the increase in the LB of the log marginal likelihood). CKS on a random subset of the data similarly fails, the kernel search halting at depth 2 even for the largest random subsets we tried.

In both cases, SKC is able to detect the structure of the data and provide accurate numerical estimates of its features, all with far fewer inducing points than N, whereas the kernel search using just the LB, and CKS on a random subset of the data, both fail.

We have introduced SKC, a scalable kernel discovery algorithm that extends CKS, and hence ABCD, to bigger data sets. We have also derived a novel cheap upper bound on the GP marginal likelihood that sandwiches the marginal likelihood with the variational lower bound [37], and we use this interval in SKC to select between different kernels. The reasons for using an upper bound instead of just the lower bound for model selection are as follows: the upper bound allows for a semi-greedy approach, letting us explore a wider range of kernels and compensating for the suboptimality coming from local optima in the hyperparameter optimisation. Should we wish to restrict the range of kernels explored for computational efficiency, the upper bound is tighter and more stable than the lower bound, hence we may use it as a reliable tie-breaker for kernels with overlapping intervals. Equipped with this upper bound, our method can pinpoint global and local periodicities and linear trends in time series with tens of thousands of data points, which are well beyond the reach of its predecessor CKS.

For future work we wish to make the algorithm even more scalable: for large data sets where quadratic runtime is infeasible, we can apply stochastic variational inference for GPs [15] to optimise the lower bound, the bottleneck of SKC, using mini-batches of data. Finding an upper bound that is cheap and tight enough for model selection would then pose a challenge.
Also, one drawback of the upper bound that could perhaps be resolved is that hyperparameter tuning by optimising the upper bound is difficult (see Appendix L). Another minor avenue for future work is using more accurate estimates of the model evidence than BIC. A related paper uses the Laplace approximation instead of BIC for kernel evaluation in a similar kernel search context [20]. However the Laplace approximation adds an expensive Hessian term, for which it is unclear how one can obtain lower and upper bounds.
Acknowledgements
HK and YWT's research leading to these results has received funding from the European Research Council under the European Union's Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071. We would also like to thank Michalis Titsias for helpful discussions.
References

[1] British Oceanographic Data Centre, UK Tide Gauge Network.
[2] Francis R Bach, Gert RG Lanckriet, and Michael I Jordan. Multiple kernel learning, conic duality, and the SMO algorithm. In ICML, 2004.
[3] Rémy Bardenet and Michalis Titsias. Inference for determinantal point processes without spectral knowledge. NIPS, 2015.
[4] Matthias Stephan Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding probabilistic sparse Gaussian process approximations. NIPS, 2016.
[5] Matthew James Beal. Variational algorithms for approximate Bayesian inference. PhD thesis, University College London, 2003.
[6] Stephen Boyd and Lieven Vandenberghe. Convex optimization. Cambridge University Press, 2004.
[7] Corinna Cortes, Mehryar Mohri, and Ameet Talwalkar. On the impact of kernel approximation on learning accuracy. In AISTATS, 2010.
[8] Kurt Cutajar, Michael A. Osborne, John P. Cunningham, and Maurizio Filippone. Preconditioning kernel matrices. ICML, 2016.
[9] Petros Drineas and Michael Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. JMLR, 6:2153–2175, 2005.
[10] David Duvenaud, James Lloyd, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. Structure discovery in nonparametric regression through compositional kernel search. In ICML, 2013.
[11] Chris Fraley and Adrian E Raftery. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification, 24(2):155–181, 2007.
[12] Jacob Gardner, Chuan Guo, Kilian Weinberger, Roman Garnett, and Roger Grosse. Discovering and exploiting additive structure for Bayesian optimization. In AISTATS, 2017.
[13] Roger B. Grosse, Ruslan Salakhutdinov, William T. Freeman, and Joshua B. Tenenbaum. Exploiting compositionality to explore a large space of model structures. In UAI, 2012.
[14] Isabelle Guyon, Imad Chaabane, Hugo Jair Escalante, Sergio Escalera, Damir Jajetic, James Robert Lloyd, Núria Macià, Bisakha Ray, Lukasz Romaszko, Michèle Sebag, Alexander Statnikov, Sébastien Treguer, and Evelyne Viegas. A brief review of the ChaLearn AutoML challenge: Any-time any-dataset learning without human intervention. In Proceedings of the Workshop on Automatic Machine Learning, volume 64, pages 21–30, 2016.
[15] James Hensman, Nicolo Fusi, and Neil D Lawrence. Gaussian processes for big data. UAI, 2013.
[16] Tao Hong, Pierre Pinson, and Shu Fan. Global energy forecasting competition 2012, 2014.
[17] Chunlin Ji, Haige Shen, and Mike West. Bounded approximations for marginal likelihoods. 2010.
[18] Judith Lean, Juerg Beer, and Raymond S Bradley. Reconstruction of solar irradiance since 1610: Implications for climate change. Geophysical Research Letters, 22(23), 1995.
[19] James Robert Lloyd, David Duvenaud, Roger Grosse, Joshua Tenenbaum, and Zoubin Ghahramani. Automatic construction and natural-language description of nonparametric regression models. In AAAI, 2014.
[20] Gustavo Malkomes, Charles Schaff, and Roman Garnett. Bayesian optimization for automated model selection. In NIPS, 2016.
[21] Alexander G de G Matthews, James Hensman, Richard E Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. AISTATS, 2016.
[22] John Morrissey, James L Sumich, and Deanna R. Pinkard-Meier. Introduction To The Biology Of Marine Life. Jones & Bartlett Learning, 1996.
[23] Kevin P Murphy. Machine learning: a probabilistic perspective. MIT Press, 2012.
[24] Junier B Oliva, Avinava Dubey, Andrew G Wilson, Barnabás Póczos, Jeff Schneider, and Eric P Xing. Bayesian nonparametric kernel-learning. In AISTATS, 2016.
[25] Joaquin Quinonero-Candela and Carl Edward Rasmussen. A unifying view of sparse approximate Gaussian process regression. JMLR, 6:1939–1959, 2005.
[26] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In NIPS, 2007.
[27] Carl Edward Rasmussen and Chris Williams. Gaussian processes for machine learning. MIT Press, 2006.
[28] Alessandro Rudi, Raffaello Camoriano, and Lorenzo Rosasco. Less is more: Nyström computational regularization. In NIPS, 2015.
[29] Walter Rudin. Fourier analysis on groups. AMS, 1964.
[30] Yves-Laurent Kom Samo and Stephen Roberts. Generalized spectral kernels. arXiv preprint arXiv:1506.02236, 2015.
[31] Gideon Schwarz et al. Estimating the dimension of a model. The Annals of Statistics, 6(2):461–464, 1978.
[32] Matthias Seeger, Christopher Williams, and Neil Lawrence. Fast forward selection to speed up sparse Gaussian process regression. In AISTATS, 2003.
[33] Jonathan Richard Shewchuk. An introduction to the conjugate gradient method without the agonizing pain, 1994.
[34] Edward Snelson and Zoubin Ghahramani. Sparse Gaussian processes using pseudo-inputs. In NIPS, 2005.
[35] Arno Solin and Simo Särkkä. Explicit link between periodic covariance functions and state space models. JMLR, 2014.
[36] Pieter Tans and Ralph Keeling. NOAA/ESRL, Scripps Institution of Oceanography. ftp://ftp.cmdl.noaa.gov/ccg/co2/trends/co2_mm_mlo.txt.
[37] Michalis K Titsias. Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, 2009.
[38] Pınar Tüfekci. Prediction of full load electrical power output of a base load operated combined cycle power plant using machine learning methods. International Journal of Electrical Power & Energy Systems, 60:126–140, 2014. http://archive.ics.uci.edu/ml/datasets/combined+cycle+power+plant.
[39] Jarno Vanhatalo, Jaakko Riihimäki, Jouni Hartikainen, Pasi Jylänki, Ville Tolvanen, and Aki Vehtari. GPstuff: Bayesian modeling with Gaussian processes. JMLR, 14(Apr):1175–1179, 2013.
[40] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In NIPS, 2001.
[41] Andrew Gordon Wilson and Ryan Prescott Adams. Gaussian process kernels for pattern discovery and extrapolation. arXiv preprint arXiv:1302.4245, 2013.
[42] Andrew Gordon Wilson, Christoph Dann, and Hannes Nickisch. Thoughts on massively scalable Gaussian processes. arXiv preprint arXiv:1511.01870, 2015. http://arxiv.org/abs/1511.01870.
[43] Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, and Eric P Xing. Deep kernel learning. In AISTATS, 2016.
[44] I-C Yeh. Modeling of strength of high-performance concrete using artificial neural networks. Cement and Concrete Research, 28(12):1797–1808, 1998. https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength.

Appendix

A Bayesian Information Criterion (BIC)
The BIC is a model selection criterion that is the marginal likelihood with a model complexity penalty:
\mathrm{BIC} = \log p(y \mid \hat\theta) - \tfrac{1}{2}\, p \log N

for observations y, number of observations N, maximum likelihood estimate (MLE) of the model hyperparameters θ̂, and number of hyperparameters p. It is derived as an approximation to the log model evidence log p(y).
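A one-line helper implementing this definition (our own, for illustration), under the convention that higher BIC means better model fit:

```python
import math

def bic(log_marginal_likelihood, num_hyperparameters, num_observations):
    # BIC under the convention above: higher is better.
    return log_marginal_likelihood - 0.5 * num_hyperparameters * math.log(num_observations)
```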
B Compositional Kernel Search Algorithm

Algorithm 2:
Compositional Kernel Search Algorithm
Input: data x₁, ..., x_n ∈ R^D, y₁, ..., y_n ∈ R, base kernel set B.
Output: k, the resulting kernel.

For each base kernel on each dimension, fit a GP to the data (i.e. optimise the hyperparameters by ML-II) and set k to be the kernel with the highest BIC.
for depth = 1:T (either fix T or repeat until the BIC no longer increases) do
    Fit GPs to the following kernels and set k to be the one with the highest BIC:
    (1) all kernels of the form k + B, where B is any base kernel on any dimension;
    (2) all kernels of the form k × B, where B is any base kernel on any dimension;
    (3) all kernels where a base kernel in k is replaced by another base kernel.

C Base Kernels
LIN(x, x′) = σ² (x − l)(x′ − l)
SE(x, x′) = σ² exp(−(x − x′)² / (2l²))
PER(x, x′) = σ² exp(−2 sin²(π(x − x′)/p) / l²)
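Vectorised versions of these base kernels (a sketch of ours, assuming the standard parameterisation written above, with σ² the signal variance, l the length scale or location, and p the period):

```python
import numpy as np

def lin(x1, x2, sigma2=1.0, l=0.0):
    # linear kernel: sigma2 * (x - l)(x' - l)
    return sigma2 * np.outer(x1 - l, x2 - l)

def se(x1, x2, sigma2=1.0, l=1.0):
    # squared exponential kernel
    d = x1[:, None] - x2[None, :]
    return sigma2 * np.exp(-0.5 * (d / l) ** 2)

def per(x1, x2, sigma2=1.0, l=1.0, p=1.0):
    # periodic (exp-sine-squared) kernel
    d = x1[:, None] - x2[None, :]
    return sigma2 * np.exp(-2.0 * np.sin(np.pi * d / p) ** 2 / l ** 2)
```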
D Matrix Identities

Lemma 1 (Woodbury's Matrix Inversion Lemma).
(A + UBV)⁻¹ = A⁻¹ − A⁻¹U(B⁻¹ + VA⁻¹U)⁻¹VA⁻¹

Setting A = σ²I (Nyström), or σ²I + diag(K − K̂) (FIC), or σ²I + blockdiag(K − K̂) (PIC), with U = Φᵀ, V = Φ and B = I, we get:
(A + ΦᵀΦ)⁻¹ = A⁻¹ − A⁻¹Φᵀ(I + ΦA⁻¹Φᵀ)⁻¹ΦA⁻¹

Lemma 2 (Sylvester's Determinant Theorem).
det(I + AB) = det(I + BA) for all A ∈ R^{m×n}, B ∈ R^{n×m}

Hence:
det(σ²I + ΦᵀΦ) = (σ²)^N det(I + σ⁻²ΦᵀΦ) = (σ²)^N det(I + σ⁻²ΦΦᵀ) = (σ²)^{N−m} det(σ²I + ΦΦᵀ)
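A quick numerical check of both identities (our own sketch) for a low-rank-plus-noise matrix σ²I + ΦᵀΦ; the inverse via Woodbury and the log determinant via Sylvester each only require factorising an m × m matrix, which is what gives the O(Nm²) costs quoted in the main text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, sigma2 = 500, 20, 0.3
Phi = rng.standard_normal((m, N))

A_inv = np.eye(N) / sigma2                                   # A = sigma2 * I
small = np.eye(m) + Phi @ A_inv @ Phi.T                      # m x m matrix
woodbury_inv = A_inv - A_inv @ Phi.T @ np.linalg.solve(small, Phi @ A_inv)
direct_inv = np.linalg.inv(sigma2 * np.eye(N) + Phi.T @ Phi)
print(np.allclose(woodbury_inv, direct_inv))                 # True

sylvester_logdet = (N - m) * np.log(sigma2) + np.linalg.slogdet(sigma2 * np.eye(m) + Phi @ Phi.T)[1]
direct_logdet = np.linalg.slogdet(sigma2 * np.eye(N) + Phi.T @ Phi)[1]
print(np.allclose(sylvester_logdet, direct_logdet))          # True
```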
E Proof of Proposition 1

Proof.
If PCG converges, the upper bound for NIP is exact. We showed in Section 4.1 that this convergence happens in only a few iterations. Moreover, Cortes et al. [7] show that the lower bound for NIP can be rather loose in general. So it suffices to prove that the upper bound for NLD is tighter than the lower bound for NLD.

Let (λ_i)_{i=1}^N and (λ̂_i)_{i=1}^N be the ordered eigenvalues of K + σ²I and K̂ + σ²I respectively. Since K − K̂ is positive semi-definite (e.g. [3]), we have λ_i ≥ λ̂_i ≥ 2σ² for all i (using the assumption in the proposition). Now the slack in the upper bound is

-\tfrac{1}{2}\log\det(\hat{K} + \sigma^2 I) - \left(-\tfrac{1}{2}\log\det(K + \sigma^2 I)\right) = \tfrac{1}{2}\sum_{i=1}^N (\log\lambda_i - \log\hat\lambda_i)

and the slack in the lower bound is

-\tfrac{1}{2}\log\det(K + \sigma^2 I) - \left[-\tfrac{1}{2}\log\det(\hat{K} + \sigma^2 I) - \tfrac{1}{2\sigma^2}\operatorname{Tr}(K - \hat{K})\right] = -\tfrac{1}{2}\sum_{i=1}^N (\log\lambda_i - \log\hat\lambda_i) + \tfrac{1}{2\sigma^2}\sum_{i=1}^N (\lambda_i - \hat\lambda_i).

Now by concavity and monotonicity of log, and since λ̂_i ≥ 2σ², we have

\frac{\log\lambda_i - \log\hat\lambda_i}{\lambda_i - \hat\lambda_i} \le \frac{1}{2\sigma^2}
\;\Rightarrow\; \sum_{i=1}^N (\log\lambda_i - \log\hat\lambda_i) \le \frac{1}{2\sigma^2}\sum_{i=1}^N (\lambda_i - \hat\lambda_i)
\;\Rightarrow\; \tfrac{1}{2}\sum_{i=1}^N (\log\lambda_i - \log\hat\lambda_i) \le \frac{1}{2\sigma^2}\sum_{i=1}^N (\lambda_i - \hat\lambda_i) - \tfrac{1}{2}\sum_{i=1}^N (\log\lambda_i - \log\hat\lambda_i),

i.e. the slack in the upper bound is at most the slack in the lower bound.
F Convergence of hyperparameters from optimising lower bound to optimal hyperparameters
Note from Figure 7 that the hyperparameters found by optimising the lower bound converge to those found by the exact GP when optimising the exact marginal likelihood, giving empirical evidence for the second claim in Section 3.3.
G Parallelising SKC
SKC can be parallelised across the random hyperparameter initialisations, and also across the kernels at each depth when computing the BIC intervals. In fact, the kernel buffer makes SKC even more parallelisable: say at a certain depth two kernels remain to be optimised and evaluated before we can move on to the next depth. If the buffer size is 5, we can nonetheless move on to the next depth and grow the kernel search tree from the top 3 kernels in the buffer, without waiting for the 2 outstanding kernel evaluations to complete. This avoids the computation time wasted by idle cores waiting for all kernel evaluations to finish before moving on to the next depth of the kernel search tree. A sketch of this idea is given below.
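One possible way to realise this buffered, asynchronous expansion is sketched below; this is our own illustration, not the authors' implementation. `evaluate_kernel` and `expand` are placeholder callables (module-level functions so they can be dispatched to worker processes), and kernels are assumed to be hashable, e.g. strings.

```python
from concurrent.futures import ProcessPoolExecutor, FIRST_COMPLETED, wait

def buffered_search(seed_kernels, expand, evaluate_kernel,
                    buffer_size=5, top_k=3, max_depth=3):
    """Buffered asynchronous kernel search: once only a couple of evaluations at a
    depth are still outstanding, expand the current top of the buffer rather than idle."""
    buffer = []                                            # (score, kernel, depth), best first
    expanded = set()                                       # kernels already expanded
    with ProcessPoolExecutor() as pool:
        pending = {pool.submit(evaluate_kernel, k): (k, 0) for k in seed_kernels}
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                kernel, depth = pending.pop(fut)
                buffer.append((fut.result(), kernel, depth))
                buffer = sorted(buffer, key=lambda t: t[0], reverse=True)[:buffer_size]
            # Expand the top of the buffer once few evaluations remain outstanding.
            if len(pending) <= 2:
                for _, kernel, depth in buffer[:top_k]:
                    if kernel in expanded or depth >= max_depth:
                        continue
                    expanded.add(kernel)
                    for child in expand(kernel):
                        pending[pool.submit(evaluate_kernel, child)] = (child, depth + 1)
    return buffer
```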
H Optimisation
Since we wish to use the learned kernels for interpretation, it is important that the hyperparameters lie in a sensible region after optimisation; in other words, we wish to regularise the hyperparameters during optimisation. For example, we want the SE kernel to learn a globally smooth function with local variation. When naïvely optimising the lower bound, the length scale and the signal variance sometimes become very small, so that the SE kernel explains all the variation in the signal and ends up connecting the dots. We wish to avoid this type of behaviour. This can be achieved by placing priors on the hyperparameters and optimising the energy (the log prior added to the log marginal likelihood) instead, as well as using sensible initialisations; a minimal sketch of such an energy objective is given at the end of this section. Looking at Figure 8, we see that a strong prior with a sensible random initialisation (see Appendix I for details) gives a sensible, smoothly varying function, whereas in the three other cases the length scale and signal variance shrink to small values, causing the GP to overfit the data. Note that the weak prior is the default prior used in the GPstuff software [39].

Careful initialisation of the hyperparameters and inducing points is also very important, and can have a strong influence on the resulting optima. It is sensible to let the optimised hyperparameters of the parent kernel in the search tree be inherited and used to initialise the hyperparameters of the child. The new hyperparameters of the child are initialised with random restarts, where the variance is small enough that they lie in a sensible region, but large enough to explore a good portion of this region. As for the inducing points, we want to spread them out to capture both local and global structure. Trying both K-means and a random subset of the training data, we find that they give similar results, and so we use a random subset. We also have the option of learning the inducing points; however, this is considerably more costly and shows little improvement over fixing them, as we show in Section 4. Hence we do not learn the inducing points, but fix them to a randomly chosen set.

Consequently, for SKC we use maximum a posteriori (MAP) estimates instead of the MLE for the hyperparameters when calculating the BIC, since the priors have a noticeable effect on the optimisation. This is justified [23] and has been used for example in [11], where the authors argue that using the MLE to estimate the BIC for Gaussian mixture models can fail due to singularities and degeneracies.
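As an illustration of the "energy" objective (log marginal likelihood plus log prior), here is a small self-contained NumPy/SciPy sketch for an exact GP with an SE kernel; the particular prior and its parameters are our own choices, not those of the paper.

```python
import numpy as np
from scipy.optimize import minimize

def se_kernel(x, log_l, log_sf):
    d = x[:, None] - x[None, :]
    return np.exp(2 * log_sf) * np.exp(-d ** 2 / (2 * np.exp(2 * log_l)))

def negative_energy(params, x, y):
    """- [ log marginal likelihood + log prior ], to be minimised."""
    log_l, log_sf, log_sn = params
    K = se_kernel(x, log_l, log_sf) + np.exp(2 * log_sn) * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    log_ml = (-0.5 * y @ alpha - np.sum(np.log(np.diag(L)))
              - 0.5 * len(x) * np.log(2 * np.pi))
    # Example prior: N(0, 0.5^2) on the log length scale discourages tiny length
    # scales that would make the GP "connect the dots".
    log_prior = -0.5 * (log_l / 0.5) ** 2
    return -(log_ml + log_prior)

x = np.linspace(0, 10, 50)
y = np.sin(x) + 0.1 * np.random.default_rng(0).standard_normal(50)
map_estimate = minimize(negative_energy, x0=np.zeros(3), args=(x, y)).x
```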
I Hyperparameter initialisation and priors

Throughout, $Z \sim N(0, 1)$, and $TN(\sigma^2, I)$ is a Gaussian with mean 0 and variance $\sigma^2$, truncated to the interval $I$ and then renormalised.

Signal noise: $\sigma = 0.{\ldots} \times \exp(Z/{\ldots})$, with $p(\log \sigma) = N(0, {\ldots})$.

LIN: $\sigma = \exp(V)$ where $V \sim TN(1, [-\infty, {\ldots}])$, $l = \exp(Z)$; $p(\log \sigma) = \mathrm{logunif}$, $p(\log l) = \mathrm{logunif}$.

SE: $l = \exp(Z/{\ldots})$, $\sigma = 0.{\ldots} \times \exp(Z/{\ldots})$; $p(\log l) = N(0, {\ldots})$, $p(\log \sigma) = \mathrm{logunif}$.

PER: $p_{\min} = \log\!\big(10\,(\max(x) - \min(x))/N\big)$ (the shortest possible period is 10 time steps) and $p_{\max} = \log\!\big((\max(x) - \min(x))/5\big)$ (the longest possible period is a fifth of the range of the data set); $l = \exp(Z/{\ldots})$, $p = \exp(p_{\min} + W)$ or $\exp(p_{\max} + U)$ with probability $\tfrac{1}{2}$ each, $\sigma = 0.{\ldots} \times \exp(Z/{\ldots})$, where $W \sim TN({\ldots}, [0, \infty])$ and $U \sim TN({\ldots}, [-\infty, 0])$; $p(\log l) = t(\mu = 0, \sigma^2 = 1, \nu = 4)$, $p(\log p) = LN(p_{\min} - {\ldots}, {\ldots})$ or $LN(p_{\max} - {\ldots}, {\ldots})$ with probability $\tfrac{1}{2}$ each, $p(\log \sigma) = \mathrm{logunif}$. Here $LN(\mu, \sigma^2)$ is the log-Gaussian distribution and $t(\mu, \sigma^2, \nu)$ is the Student's t-distribution.

Figure 7: Log marginal likelihood and hyperparameter values after optimising the lower bound with an ARD kernel on a subset of the Power plant data, for different values of m, compared against the exact GP values when optimising the true log marginal likelihood.

Figure 8: GP predictions on the solar data set with the SE kernel for different priors and initialisations (strong/weak prior, random/small initialisation).

J Computation times
See Table 2. For all three data sets, the full GP optimisation time is much greater than the sum of the Var GP optimisation time and the upper bound (NLD + NIP) evaluation time for m ≤ …. Hence the savings in computation time from SKC are significant even for these small data sets. Note that we compare the lower bound optimisation time against the upper bound evaluation time, rather than the evaluation times of both, since this is what happens in SKC: the lower bound has to be optimised for each kernel, whereas the upper bound only has to be evaluated once.

K Mauna and Solar plots and hyperparameter values found by SKC
Solar
The solar data has 26 cycles over 285 years, which gives a periodicity of around 10.9615 years. Using SKC with m = 40, we find the kernel SE × PER × SE ≡ SE × PER. The value of the period hyperparameter in PER is 10.9569 years, hence SKC finds the periodicity to 3 s.f. with only 40 inducing points. The SE term converts the global periodicity into a local periodicity, with the extent of the locality governed by the length scale parameter in SE, equal to 45. This is fairly large, but smaller than the range of the domain (1610–2011), indicating that the periodicity spans a long time but is not quite global. This is most likely due to the static solar irradiance between the years 1650–1700, which adds a little noise to the periodicities.
Mauna
The annual periodicity in the data and the linear trend with positive slope are clear. Linear regression gives a slope of 1.5149. SKC with m = 40 gives the kernel SE + PER + LIN. The period hyperparameter in PER takes the value 1, hence SKC successfully finds the right periodicity. The offset $l$ and magnitude $\sigma^2$ parameters of LIN allow us to calculate the slope by the formula
\[
\sigma^2 (x - l)^\top \big[\sigma^2 (x - l)(x - l)^\top + \sigma_n^2 I\big]^{-1} y,
\]
where $\sigma_n^2$ is the noise variance in the learned likelihood. This formula is obtained from the posterior mean of the GP, which is linear in the inputs for the linear kernel. This value amounts to 1.5150, hence the slope found by SKC is accurate to 3 s.f.
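A small NumPy sketch of this slope computation (our own illustration; variable names and the synthetic data are hypothetical):

```python
import numpy as np

def lin_slope(x, y, sigma2, l, noise_var):
    """Slope of the GP posterior mean under the LIN kernel:
    sigma^2 (x - l)^T [sigma^2 (x - l)(x - l)^T + sigma_n^2 I]^{-1} y."""
    xc = x - l
    K_lin = sigma2 * np.outer(xc, xc)
    alpha = np.linalg.solve(K_lin + noise_var * np.eye(len(x)), y)
    return sigma2 * xc @ alpha

# Hypothetical check on synthetic data with true slope 1.5:
x = np.linspace(0.0, 10.0, 200)
y = 1.5 * x + 0.05 * np.random.default_rng(0).standard_normal(200)
print(lin_slope(x, y, sigma2=1.0, l=0.0, noise_var=0.05 ** 2))   # close to 1.5
```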
Table 2: Mean and standard deviation of computation times (in seconds) for full GP optimisation, Var GP optimisation (with and without learning the inducing points), and the NLD and NIP (PCG with the PIC preconditioner) upper bound evaluations, over 10 random iterations, on the Solar, Mauna and Concrete data sets for m = 10, 20, 40, 80, 160, 320.

L Optimising the upper bound
If the upper bound is tighter and more robust to the choice of inducing points, why not optimise the upper bound to find the hyperparameters? If this were possible, we could maximise it to obtain an upper bound on the exact marginal likelihood with optimal hyperparameters. In fact this holds for any analytic upper bound whose value and gradients can be evaluated cheaply. Hence for any m we could find an interval that contains the true optimised marginal likelihood, and if this interval is dominated by the interval of another kernel, we could discard that kernel with no need to evaluate the bounds for bigger values of m. We would then wish to use values of m such that we can choose the right kernel (or kernels) at each depth of the search tree with minimal computation. This gives rise to an exploitation-exploration trade-off, whereby we balance raising m, to obtain tight intervals that allow us to discard unsuitable kernels whose intervals fall strictly below those of other kernels, against quickly moving on to the next depth of the search tree to look for finer structure in the data. The search algorithm is highly parallelisable, and thus we may raise m simultaneously for all candidate kernels. At deeper levels of the search tree there may be too many candidates for simultaneous computation, in which case we may select the ones with the highest upper bounds to obtain tighter intervals. Our attempts at such an upper bound are listed below. Note the two chains of inequalities for the NLD and NIP terms:
\[
-\tfrac{1}{2}\log\det(\hat{K} + \sigma^2 I) - \tfrac{1}{2\sigma^2}\mathrm{Tr}(K - \hat{K}) \;\leq\; -\tfrac{1}{2}\log\det(K + \sigma^2 I) \;\leq\; -\tfrac{1}{2}\log\det(\hat{K} + \sigma^2 I) \tag{5}
\]
\[
-\tfrac{1}{2}\, y^\top (\hat{K} + \sigma^2 I)^{-1} y \;\leq\; -\tfrac{1}{2}\, y^\top (K + \sigma^2 I)^{-1} y \;\leq\; -\tfrac{1}{2}\, y^\top \big(\hat{K} + (\sigma^2 + \mathrm{Tr}(K - \hat{K}))\, I\big)^{-1} y \tag{6}
\]
where the first two inequalities come from [3], the third inequality is a direct consequence of $K - \hat{K}$ being positive semi-definite, and the last inequality is from Michalis Titsias' lecture slides.

Figure 9: Plots of small time series data: (a) Solar (solar irradiance against year), (b) Mauna (CO2 level against year).

Also from (4), we have that
\[
-\tfrac{1}{2}\log\det(\hat{K} + \sigma^2 I) + \tfrac{1}{2}\alpha^\top (K + \sigma^2 I)\alpha - \alpha^\top y
\]
is an upper bound for all $\alpha \in \mathbb{R}^N$. Thus one idea for obtaining a cheap upper bound on the optimised marginal likelihood was to solve the following maximin optimisation problem:
\[
\max_\theta \min_{\alpha \in \mathbb{R}^N} \; -\tfrac{1}{2}\log\det(\hat{K} + \sigma^2 I) + \tfrac{1}{2}\alpha^\top (K + \sigma^2 I)\alpha - \alpha^\top y
\]
One way to solve this cheaply would be by coordinate descent, maximising with respect to $\theta$ with $\alpha$ fixed, then minimising with respect to $\alpha$ with $\theta$ fixed. However, $\sigma^2$ tends to blow up in practice: for fixed $\alpha$ the expression is $O(-\log \sigma^2 + \sigma^2)$, hence maximising with respect to $\sigma^2$ pushes it towards infinity.

An alternative is to sum the two upper bounds above to get the upper bound
\[
-\tfrac{1}{2}\log\det(\hat{K} + \sigma^2 I) - \tfrac{1}{2}\, y^\top \big(\hat{K} + (\sigma^2 + \mathrm{Tr}(K - \hat{K}))\, I\big)^{-1} y.
\]
However, we found that maximising this bound gives quite a loose upper bound unless m = O(N). Hence this upper bound is not very useful.

M Random Fourier Features
Random Fourier Features (RFF), also known as Random Kitchen Sinks, were introduced by [26] as a low-rank approximation to the kernel matrix. The method uses the following theorem:

Theorem 3 (Bochner's Theorem [29]). A stationary kernel k(d) is positive definite if and only if k(d) is the Fourier transform of a non-negative measure.

This gives an unbiased low-rank approximation to the Gram matrix, $K = \mathbb{E}[\Phi^\top \Phi]$ with $\Phi \in \mathbb{R}^{m \times N}$; a bigger m lowers the variance of the estimate. Using this approximation, one can compute determinants and inverses in $O(Nm^2)$ time. In the context of the kernel composition of Section 2, RFFs have the nice property that samples from the spectral density of a sum or product of kernels can easily be obtained as mixtures or sums of samples from the individual kernels (see Appendix N). We use this later to give a memory-efficient upper bound on the exact log marginal likelihood in Appendix P. A minimal sketch of the basic RFF construction is given below.
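For concreteness, a minimal NumPy sketch (ours, not the authors' code) of RFF for a squared exponential kernel with unit lengthscale:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, m = 200, 1, 2000
x = rng.uniform(-3, 3, size=(N, D))

# SE kernel with unit lengthscale: k(d) = exp(-|d|^2 / 2), spectral density p(v) = N(0, I)
v = rng.standard_normal((m, D))                # spectral frequencies v_k ~ p(v)
b = rng.uniform(0.0, 2 * np.pi, size=m)        # phases b_k ~ U[0, 2*pi]
Phi = np.sqrt(2.0 / m) * np.cos(x @ v.T + b)   # N x m (the transpose of Phi in the text)

d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
K_exact = np.exp(-0.5 * d2)
K_rff = Phi @ Phi.T                            # unbiased, low-rank approximation of K
print(np.max(np.abs(K_exact - K_rff)))         # shrinks as m grows
```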
N Random Features for Sums and Products of Kernels

For RFF the kernel can be approximated by the inner product of random features given by samples from its spectral density, in a Monte Carlo approximation, as follows:
\[
k(x - y) = \int_{\mathbb{R}^D} e^{iv^\top (x - y)}\, dP(v) \propto \int_{\mathbb{R}^D} p(v)\, e^{iv^\top (x - y)}\, dv = \mathbb{E}_{p(v)}\big[e^{iv^\top x}(e^{iv^\top y})^*\big] = \mathbb{E}_{p(v)}\big[\mathrm{Re}\big(e^{iv^\top x}(e^{iv^\top y})^*\big)\big] \approx \frac{1}{m}\sum_{k=1}^m \mathrm{Re}\big(e^{iv_k^\top x}(e^{iv_k^\top y})^*\big) = \mathbb{E}_{b,v}\big[\phi(x)^\top \phi(y)\big]
\]
where $\phi(x) = \sqrt{2/m}\,\big(\cos(v_1^\top x + b_1), \ldots, \cos(v_m^\top x + b_m)\big)^\top$, with spectral frequencies $v_k$ drawn iid from $p(v)$ and $b_k$ drawn iid from $U[0, 2\pi]$.

Let $k_1, k_2$ be two stationary kernels with respective spectral densities $p_1, p_2$, so that $k_1(d) = a_1 \hat{p}_1(d)$ and $k_2(d) = a_2 \hat{p}_2(d)$, where $\hat{p}(d) := \int_{\mathbb{R}^D} p(v)\, e^{iv^\top d}\, dv$; we use this convention for the Fourier transform. Note that $a_i = k_i(0)$. Then
\[
(k_1 + k_2)(d) = a_1 \int p_1(v)\, e^{iv^\top d}\, dv + a_2 \int p_2(v)\, e^{iv^\top d}\, dv = (a_1 + a_2)\, \hat{p}_+(d)
\]
where $p_+(v) = \frac{a_1}{a_1 + a_2}\, p_1(v) + \frac{a_2}{a_1 + a_2}\, p_2(v)$, a mixture of $p_1$ and $p_2$. So to generate RFF for $k_1 + k_2$, generate $v \sim p_+$ by drawing $v \sim p_1$ with probability $\frac{a_1}{a_1 + a_2}$ and $v \sim p_2$ with probability $\frac{a_2}{a_1 + a_2}$.

Now for the product, suppose $(k_1 \cdot k_2)(d) = a_1 a_2\, \hat{p}_1(d)\,\hat{p}_2(d) = a_1 a_2\, \hat{p}_*(d)$. Then $p_*$ is the inverse Fourier transform of $\hat{p}_1 \hat{p}_2$, which is the convolution $(p_1 * p_2)(d) := \int_{\mathbb{R}^D} p_1(z)\, p_2(d - z)\, dz$. So to generate RFF for $k_1 \cdot k_2$, generate $v \sim p_*$ by drawing $v_1 \sim p_1$, $v_2 \sim p_2$ and setting $v = v_1 + v_2$.

This is not applicable to non-stationary kernels, such as the linear kernel. We deal with this problem as follows. Suppose $\phi_1, \phi_2$ are random features such that $k_1(x, x') = \phi_1(x)^\top \phi_1(x')$ and $k_2(x, x') = \phi_2(x)^\top \phi_2(x')$, with $\phi_i: \mathbb{R}^D \to \mathbb{R}^m$. It is straightforward to verify that
\[
(k_1 + k_2)(x, x') = \phi_+(x)^\top \phi_+(x'), \qquad (k_1 \cdot k_2)(x, x') = \phi_*(x)^\top \phi_*(x')
\]
where $\phi_+(\cdot) = (\phi_1(\cdot)^\top, \phi_2(\cdot)^\top)^\top$ and $\phi_*(\cdot) = \phi_1(\cdot) \otimes \phi_2(\cdot)$. However, we do not want the number of features to grow as we add or multiply kernels, since it would grow exponentially; we want to keep it at m features. So we subsample m entries from $\phi_+$ (or $\phi_*$) and scale by a factor of $\sqrt{2}$ ($\sqrt{m}$ for $\phi_*$), which still gives unbiased estimates of the kernel provided that each term of the inner product $\phi_+(x)^\top \phi_+(x')$ (or $\phi_*(x)^\top \phi_*(x')$) is an unbiased estimate of $(k_1 + k_2)(x, x')$ (or $(k_1 \cdot k_2)(x, x')$). This is how we generate random features for linear kernels combined with other stationary kernels, using the features $\phi(x) = \frac{\sigma}{\sqrt{m}}(x, \ldots, x)^\top$. A toy sketch of the sampling rules for sums and products of stationary kernels is given below.
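The following NumPy sketch (our own illustration, using two SE kernels) shows both sampling rules: for a sum, draw each frequency from one of the component spectral densities with probability proportional to its variance $a_i$; for a product, add independent frequencies drawn from each component.

```python
import numpy as np

rng = np.random.default_rng(1)
D, m = 1, 4000

def se_freqs(l, size):
    """Spectral frequencies of an SE kernel with lengthscale l: v ~ N(0, l^{-2} I)."""
    return rng.standard_normal((size, D)) / l

a1, a2, l1, l2 = 1.0, 0.5, 1.0, 0.3

# Sum k1 + k2: mixture of the two spectral densities, weighted by a_i = k_i(0)
pick = rng.random(m) < a1 / (a1 + a2)
v_sum = np.where(pick[:, None], se_freqs(l1, m), se_freqs(l2, m))

# Product k1 * k2: convolution of spectral densities, i.e. add independent frequencies
v_prod = se_freqs(l1, m) + se_freqs(l2, m)

# Check the product rule: features for k1 * k2 (variance a1 * a2)
x = np.linspace(-2, 2, 100)[:, None]
b = rng.uniform(0, 2 * np.pi, m)
Phi = np.sqrt(2 * a1 * a2 / m) * np.cos(x @ v_prod.T + b)
K_approx = Phi @ Phi.T
K_exact = a1 * a2 * np.exp(-0.5 * (x - x.T) ** 2 * (1 / l1 ** 2 + 1 / l2 ** 2))
print(np.max(np.abs(K_exact - K_approx)))      # small for large m
```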
O Spectral Density for PER

From [35], we have that the spectral density of the PER kernel is
\[
\sum_{n=-\infty}^{\infty} I_n(l^{-2})\, \exp(-l^{-2})\, \delta\!\left(v - \frac{2\pi n}{p}\right),
\]
where $I_n$ is the modified Bessel function of the first kind.

P An upper bound to NLD using Random Fourier Features
Note that the function $f(X) = -\log\det(X)$ is convex on the set of positive definite matrices [6]. Hence by Jensen's inequality we have, for $\Phi^\top \Phi$ an unbiased estimate of $K$,
\[
-\tfrac{1}{2}\log\det(K + \sigma^2 I) = \tfrac{1}{2} f(K + \sigma^2 I) = \tfrac{1}{2} f\big(\mathbb{E}[\Phi^\top \Phi + \sigma^2 I]\big) \leq \tfrac{1}{2}\,\mathbb{E}\big[f(\Phi^\top \Phi + \sigma^2 I)\big].
\]
Hence $-\tfrac{1}{2}\log\det(\Phi^\top \Phi + \sigma^2 I)$ is a stochastic upper bound on NLD that can be calculated in $O(Nm^2)$. An example of such an unbiased estimator $\Phi$ is given by RFF.
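Combining this with the determinant identity from Appendix D, the bound can be evaluated from the m × m matrix $\Phi\Phi^\top$ only. A hedged NumPy sketch, with a random matrix standing in for the RFF features of Appendix M:

```python
import numpy as np

def nld_upper_bound_rff(Phi_T, sigma2):
    """-(1/2) log det(Phi^T Phi + sigma^2 I_N) in O(N m^2), using
    det(sigma^2 I_N + Phi^T Phi) = (sigma^2)^(N - m) det(sigma^2 I_m + Phi Phi^T).
    Phi_T is the N x m matrix whose rows are the random features of each input."""
    N, m = Phi_T.shape
    small = sigma2 * np.eye(m) + Phi_T.T @ Phi_T          # m x m
    logdet = (N - m) * np.log(sigma2) + np.linalg.slogdet(small)[1]
    return -0.5 * logdet

rng = np.random.default_rng(0)
Phi_T = rng.standard_normal((500, 50)) / np.sqrt(50)      # stand-in for RFF features
print(nld_upper_bound_rff(Phi_T, sigma2=0.1))
```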
Q Further Plots

Figure 10: Same as Figures 1a and 1b but for the Mauna Loa data: negative energy (full GP, exact, Var LB, UB), NLD (exact, Nyström NLD UB, RFF NLD UB) and NIP (exact, CG, PCG Nys, PCG FIC, PCG PIC) against m, with (a) fixed inducing points and (b) learned inducing points.

Figure 11: Same as Figures 1a and 1b but for the Concrete data, with (a) fixed inducing points and (b) learned inducing points.
Figure 12: Kernel search tree for SKC on the solar data up to depth 2, showing the upper and lower bounds for different numbers of inducing points m. Panels: LIN, PER, SE, (SE)*SE, (SE)+SE, (PER)+SE, (SE)*PER, (PER)+PER, (PER)*PER, (PER)+LIN, (PER)*LIN, (LIN)+LIN, (LIN)*LIN, (LIN)+SE, (SE)*LIN.