The Gaussian Process Autoregressive Regression Model (GPAR)
James Requeima, Will Tebbutt, Wessel Bruinsma, Richard E. Turner
University of Cambridge and Invenia Labs, Cambridge, UK
{jrr41, wct23, wpb23, ret26}@cam.ac.uk
Abstract
Multi-output regression models must exploit dependencies between outputs to maximise predictive performance. The application of Gaussian processes (GPs) to this setting typically yields models that are computationally demanding and have limited representational power. We present the Gaussian Process Autoregressive Regression (GPAR) model, a scalable multi-output GP model that is able to capture nonlinear, possibly input-varying, dependencies between outputs in a simple and tractable way: the product rule is used to decompose the joint distribution over the outputs into a set of conditionals, each of which is modelled by a standard GP. GPAR's efficacy is demonstrated on a variety of synthetic and real-world problems, outperforming existing GP models and achieving state-of-the-art performance on established benchmarks.
The Gaussian process (GP) probabilistic modelling framework provides a powerful and popular approach to nonlinear single-output regression (Rasmussen and Williams, 2006). The popularity of GP methods stems from their modularity, tractability, and interpretability: it is simple to construct rich, nonlinear models by compositional covariance function design, which can then be evaluated in a principled way (e.g. via the marginal likelihood), before being interpreted in terms of their component parts. This leads to an attractive plug-and-play approach to modelling and understanding data, which is so robust that it can even be automated (Duvenaud et al., 2013; Sun et al., 2018).

Most regression problems, however, involve multiple outputs rather than a single one. When modelling such data, it is key to capture the dependencies between these outputs. For example, noise in the output space
might be correlated, or, whilst one output might depend on the inputs in a complex (deterministic) way, it may depend quite simply on other output variables. In both cases multi-output GP models are required. There is a plethora of existing multi-output GP models that can capture linear correlations between output variables if these correlations are fixed across the input space (Goovaerts, 1997; Wackernagel, 2003; Teh and Seeger, 2005; Bonilla et al., 2008; Nguyen and Bonilla, 2014; Dai et al., 2017). However, one of the main reasons for the popularity of the GP approach is that a suite of different types of nonlinear input dependencies can be modelled, and it is disappointing that this flexibility is not extended to interactions between the outputs. There are some approaches that do allow limited modelling of nonlinear output dependencies (Wilson et al., 2012; Bruinsma, 2016), but this flexibility comes from sacrificing tractability, with complex and computationally demanding approximate inference and learning schemes now required. This complexity significantly slows down the model fitting, evaluation, and improvement workflow.

What is needed is a flexible and analytically tractable modelling approach to multi-output regression that supports plug-and-play modelling and model interpretation. The Gaussian Process Autoregressive Regression (GPAR) model, introduced in Section 2, achieves these aims by taking an approach analogous to that employed by the Neural Autoregressive Density Estimator (Larochelle and Murray, 2011) for density modelling. The product rule is used to decompose the distribution of the outputs given the inputs into a set of one-dimensional conditional distributions. Critically, these distributions can be interpreted as a decoupled set of single-output regression problems, and learning and inference in GPAR therefore amount to a set of standard single-output GP regression tasks: training is closed form, fast, and amenable to standard scaling techniques. GPAR converts the modelling of output dependencies that are possibly nonlinear and input-dependent into a set of standard GP covariance function design problems, constructing expressive, jointly non-Gaussian models over the outputs. Importantly, we show how GPAR can capture nonlinear relationships between outputs as well as structured, input-dependent noise, simply through kernel hyperparameter learning. We apply GPAR to a wide variety of multi-output regression problems, achieving state-of-the-art results on five benchmark tasks.

Figure 1: Cartoon motivating a factorisation for the joint distribution $p(I(t), T(t), C(t))$.

Consider the problem of modelling the world's average CO$_2$ level $C(t)$, temperature $T(t)$, and Arctic sea ice extent $I(t)$ as a function of time $t$. By the greenhouse effect, one can imagine that the temperature $T(t)$ is a complicated, stochastic function $f_T$ of CO$_2$ and time: $T(t) = f_T(C(t), t)$. Similarly, one might hypothesise that the Arctic sea ice extent $I(t)$ can be modelled as another complicated function $f_I$ of temperature, CO$_2$, and time: $I(t) = f_I(T(t), C(t), t)$.
These functional relationships are depicted in Figure 1 and motivate the following model, where the conditionals model the postulated underlying functions $f_T$ and $f_I$:

$$p(I(t), T(t), C(t)) = p(C(t)) \, \underbrace{p(T(t) \mid C(t))}_{\text{models } f_T} \, \underbrace{p(I(t) \mid T(t), C(t))}_{\text{models } f_I}.$$

More generally, consider the problem of modelling $M$ outputs $y_{1:M}(x) = (y_1(x), \ldots, y_M(x))$ as a function of the input $x$. Applying the product rule yields

$$p(y_{1:M}(x)) = p(y_1(x)) \, \underbrace{p(y_2(x) \mid y_1(x))}_{\substack{y_2(x) \text{ as a random} \\ \text{function of } y_1(x)}} \cdots \underbrace{p(y_M(x) \mid y_{1:M-1}(x))}_{\substack{y_M(x) \text{ as a random} \\ \text{function of } y_{1:M-1}(x)}}, \qquad (1)$$

which states that $y_1(x)$, like CO$_2$, is first generated from $x$, according to some unknown, random function $f_1$; that $y_2(x)$, like temperature, is then generated from $y_1(x)$ and $x$, according to some unknown, random function $f_2$; that $y_3(x)$, like the Arctic sea ice extent, is then generated from $y_1(x)$, $y_2(x)$, and $x$, according to some unknown, random function $f_3$; et cetera:

$$y_1(x) = f_1(x), \quad f_1 \sim p(f_1),$$
$$y_2(x) = f_2(y_1(x), x), \quad f_2 \sim p(f_2),$$
$$\vdots$$
$$y_M(x) = f_M(y_{1:M-1}(x), x), \quad f_M \sim p(f_M).$$

(This requires an ordering of the outputs; we will address this point in Section 2.)

GPAR, introduced now, models these unknown functions $f_{1:M}$ with Gaussian processes (GPs). Recall that a GP $f$ over an index set $\mathcal{X}$ defines a stochastic process, or process in short, where $f(x_1), \ldots, f(x_N)$ are jointly Gaussian distributed for any $x_1, \ldots, x_N$ (Rasmussen and Williams, 2006). Marginalising out $f_{1:M}$, we find that GPAR models the conditionals in Equation (1) with Gaussian processes:

$$y_m \mid y_{1:m-1} \sim \mathcal{GP}\big(0, k_m((y_{1:m-1}(x), x), (y_{1:m-1}(x'), x'))\big). \qquad (2)$$

Although the conditionals in Equation (1) are Gaussian, the joint distribution $p(y_{1:M})$ is not; moments of the joint distribution over the outputs are generally intractable, but samples can be generated by sequentially sampling the conditionals. Figure 2b depicts the graphical model corresponding to GPAR. Crucially, the kernels $(k_m)_{m=1}^M$ may specify nonlinear, input-dependent relationships between outputs, which enables GPAR to model data where outputs inter-depend in complicated ways.

Returning to the climate modelling example, one might object that the temperature $T(t)$ at time $t$ does not just depend on the CO$_2$ level $C(t)$ at time $t$, but instead depends on the entire history of CO$_2$ levels $C$: $T(t) = f_T(C, t)$. Note: by writing $C$ instead of $C(t)$, we refer to the entire function $C$, as opposed to just its value at $t$. Similarly, one might object that the Arctic sea ice extent $I(t)$ at time $t$ does not just depend on the temperature $T(t)$ and CO$_2$ level $C(t)$ at time $t$, but instead depends on the entire history of temperatures $T$ and CO$_2$ levels $C$: $I(t) = f_I(T, C, t)$. This kind of dependency structure, where output $y_m(x)$ now depends on the entirety of all foregoing outputs $y_{1:m-1}$ instead of just their value at $x$, motivates the following generalisation of GPAR in its form of Equation (2):

$$y_m \mid y_{1:m-1} \sim \mathcal{GP}\big(0, k_m((y_{1:m-1}, x), (y_{1:m-1}, x'))\big), \qquad (3)$$

which we refer to as nonlocal GPAR, or non-instantaneous in the temporal setting. Clearly, GPAR is a special case of nonlocal GPAR.
Figure 2a depicts the graphical model corresponding to nonlocal GPAR. Nonlocal GPAR will not be evaluated experimentally, but some theoretical properties will be described.
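To make the sequential sampling of GPAR concrete, the following sketch (an illustration added here, not the authors' reference implementation) draws a joint sample from a three-output GPAR at a fixed grid of inputs; the EQ kernel, unit hyperparameters, and noise-free conditionals are simplifying assumptions.

```python
import numpy as np

def eq_kernel(A, B, scale=1.0):
    """Exponentiated quadratic (EQ) kernel matrix between the rows of A and B."""
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * sq_dists / scale ** 2)

def sample_gp(inputs, rng, jitter=1e-6):
    """Draw one sample from a zero-mean GP with an EQ kernel at `inputs` (N, D)."""
    K = eq_kernel(inputs, inputs) + jitter * np.eye(len(inputs))
    return np.linalg.cholesky(K) @ rng.standard_normal(len(inputs))

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 100)[:, None]  # (N, 1) input locations

# Sample y1, y2, y3 sequentially: the conditional GP for output m sees the
# input x together with all previously sampled outputs, as in Equation (2).
outputs = []
for m in range(3):
    inputs_m = np.concatenate([x] + [y[:, None] for y in outputs], axis=1)
    outputs.append(sample_gp(inputs_m, rng))
y1, y2, y3 = outputs
```

Each conditional is an ordinary GP whose input is the original input augmented with the outputs drawn so far, which is exactly the structure expressed by Equation (2).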
Inference and learning in GPAR. Inference and learning in GPAR are simple. Let $y_m^{(n)} = y_m(x^{(n)})$ denote an observation for output $m$.

Figure 2: Graphical models corresponding to (a) nonlocal GPAR and (b) GPAR.

Then, assuming all outputs are observed at each input, we find

$$p\big(f_{1:M} \mid (y_{1:M}^{(n)}, x^{(n)})_{n=1}^N\big) = \prod_{m=1}^M p\big(f_m \mid \underbrace{(y_m^{(n)})_{n=1}^N}_{\text{observations for } f_m}, \underbrace{(y_{1:m-1}^{(n)}, x^{(n)})_{n=1}^N}_{\text{input locations of observations}}\big), \qquad (4)$$

meaning that the posterior of $f_m$ is computed simply by conditioning $f_m$ on the observations for output $m$, located at the observations for the foregoing outputs and the input, which again is a Gaussian process; thus, like the prior, the posterior predictive density also decomposes as a product of Gaussian conditionals. The evidence $\log p\big((y_{1:M}^{(n)})_{n=1}^N \mid (x^{(n)})_{n=1}^N\big)$ decomposes similarly; as a consequence, the hyperparameters for $f_m$ can be learned simply by maximising $\log p\big((y_m^{(n)})_{n=1}^N \mid (y_{1:m-1}^{(n)}, x^{(n)})_{n=1}^N\big)$. In conclusion, inference and learning in GPAR with $M$ outputs come down to inference and learning in $M$ decoupled, one-dimensional GP regression problems. This shows that, without approximations, GPAR scales as $\mathcal{O}(M N^3)$, depending only linearly on the number of outputs $M$ rather than cubically, as is the case for general multi-output GPs, which scale as $\mathcal{O}(M^3 N^3)$.

Scaling GPAR. In these one-dimensional GP regression problems, off-the-shelf GP scaling techniques may be applied to trivially scale GPAR to large data sets. Later we utilise the variational inducing point method by Titsias (2009), which is theoretically principled (Matthews et al., 2016) and empirically accurate (Bui et al., 2016b). This method requires a set of inducing inputs to be specified for each Gaussian process. For the first output $y_1$ there is a single input $x$, which is time. We use fixed (non-optimised), regularly spaced inducing inputs, as they are known to perform well for time series (Bui and Turner, 2014). The second and following outputs $y_m$, however, require inducing points to be placed in $x$ and $y_1, \ldots, y_{m-1}$. Regular spacing can be used again for $x$, but there are choices available for $y_1, \ldots, y_{m-1}$. One approach would be to optimise these inducing input locations, but instead we use the posterior predictive means of $y_1, \ldots, y_{m-1}$ at $x$. This choice accelerates training and was found to yield good results.

Missing data. When there are missing outputs, the procedures for inference and learning described above remain valid as long as for every observation $y_m^{(n)}$ there are also observations $y_{1:m-1}^{(n)}$. We call a data set satisfying this property closed downwards. If a data set is not closed downwards, data may be imputed, e.g. using the model's posterior predictive mean, to ensure closed downwardness; inference and learning then, however, become approximate.
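The decoupling of inference and learning can be seen in code. The sketch below is added here for illustration (it uses scikit-learn's exact GP regressor rather than the authors' implementation, and assumes fully observed, closed-downwards data): it fits one single-output GP per output, with the preceding outputs appended to the inputs, and each call to fit maximises the marginal likelihood of that conditional, mirroring the evidence decomposition above.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

def fit_gpar(x, Y):
    """Fit GPAR to inputs x (N, D) and outputs Y (N, M): M decoupled GP regressions."""
    models = []
    for m in range(Y.shape[1]):
        # Output m is regressed on the input and all preceding outputs.
        X_m = np.concatenate([x, Y[:, :m]], axis=1)
        gp = GaussianProcessRegressor(kernel=RBF() + WhiteKernel(), normalize_y=True)
        gp.fit(X_m, Y[:, m])  # maximises the marginal likelihood of conditional m
        models.append(gp)
    return models

def predict_gpar(models, x_new):
    """Predict each output in turn, feeding earlier predictive means forward."""
    means = []
    for gp in models:
        X_m = np.concatenate([x_new] + [mu[:, None] for mu in means], axis=1)
        means.append(gp.predict(X_m))
    return np.stack(means, axis=1)
```

Feeding the predictive means of earlier outputs forward at test time is a simplification; exact GPAR prediction would propagate samples through the conditionals, or condition on the earlier outputs wherever they are observed.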
The key conditional independence expressed by Figure 2b that results in simple inference and learning is the following:

Theorem 1. Let a set of observations $\mathcal{D}$ be closed downwards. Then $y_{1:i} \perp \mathcal{D}_{i+1:M} \mid \mathcal{D}_{1:i}$, where $\mathcal{D}_{1:i}$ are the observations for outputs $1, \ldots, i$ and $\mathcal{D}_{i+1:M}$ for $i+1, \ldots, M$.

Theorem 1 is proved in Appendix A of the supplementary material. Note that Theorem 1 holds for every graphical model of the form of Figure 2b, meaning that the decomposition into single-output modelling problems holds for any choice of the conditionals in Equation (1), not just Gaussians.

Potential deficiencies of GPAR. GPAR has two apparent limitations. First, since the outputs from earlier dimensions are used as inputs to later dimensions, noisy outputs yield noisy inputs. One possible mitigating solution is to employ a denoising input transformation for the kernels, e.g. using the posterior predictive mean of the foregoing outputs as the input of the next covariance function. We shall refer to GPAR models employing this approach as D-GPAR. Second, one needs to choose an ordering of the outputs. Fortunately, the data often admits a natural ordering; for example, if the predictions concern a single output dimension, this output should be placed last. Alternatively, if there is no natural ordering, one can greedily optimise the evidence with respect to the ordering. This procedure considers $\tfrac{1}{2}M(M+1)$ configurations, whilst an exhaustive search would consider all $M!$ possible configurations: the best first output is selected out of all $M$ choices, which is then fixed; then the best second output is selected out of the remaining $M-1$ choices, which is then fixed; et cetera. These methods are examined in Section 6.
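The greedy ordering search can be written down in a few lines. In the sketch below, which we add for illustration, log_evidence is a hypothetical helper that returns the summed log marginal likelihood of the conditionals under a candidate ordering of the columns of Y; the loop performs the $M + (M-1) + \cdots + 1$ evaluations described above.

```python
def greedy_ordering(x, Y, log_evidence):
    """Greedily order the M outputs by maximising the evidence one position at a
    time: O(M^2) model fits instead of the M! required by exhaustive search."""
    remaining = list(range(Y.shape[1]))
    order = []
    while remaining:
        # Try appending each remaining output and keep the best extension.
        best = max(remaining, key=lambda m: log_evidence(x, Y, order + [m]))
        order.append(best)
        remaining.remove(best)
    return order
```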
The choice of kernels $k_{1:M}$ for $f_{1:M}$ is crucial to GPAR, as they determine the types of relationships between inputs and outputs that can be learned. Particular choices for $k_{1:M}$ turn out to yield models closely related to existing work. These connections are made rigorous by the nonlinear and linear equivalent models discussed in Appendix B of the supplementary material. We summarise the results here; see also Table 1.

If $k_m$ depends linearly on the foregoing outputs $y_{1:m-1}$ at a particular $x$, then a joint Gaussian distribution over the outputs is induced in the form of a multi-output GP model (Goovaerts, 1997; Stein, 1999; Wackernagel, 2003; Teh and Seeger, 2005; Bonilla et al., 2008; Nguyen and Bonilla, 2014), where latent processes are mixed together according to a matrix, called the mixing matrix, that is lower triangular (Appendix B). One may let the dependency of $k_m$ on $y_{1:m-1}$ vary with $x$, in which case the mixing matrix varies with $x$, meaning that correlations between outputs vary with $x$. This yields an instance of the Gaussian Process Regression Network (GPRN) (Wilson et al., 2012) where inference is fast and closed form. One may even let $k_m$ depend nonlinearly on $y_{1:m-1}$, which yields a particularly structured deep Gaussian process (DGP) (Damianou, 2014; Bui et al., 2016a), potentially with skip connections from the inputs (Appendix B). Note that GPAR may be interpreted as a conventional DGP where the hidden layers are directly observed and correspond to successive outputs; this connection could potentially be leveraged to bring machinery developed for DGPs to GPAR, e.g. to deal with arbitrarily missing data. One can further let $k_m$ depend on the entirety of the foregoing outputs $y_{1:m-1}$, yielding instances of nonlocal GPAR. An example of a nonlocal linear kernel is

$$k((y, x), (y', x')) = \int a(x - z, x' - z') \, y(z) \, y'(z') \, \mathrm{d}z \, \mathrm{d}z'.$$

The nonlocal linear kernel again induces a jointly Gaussian distribution over the outputs, in the form of a convolutional multi-output GP model (Álvarez et al., 2009; Álvarez and Lawrence, 2009; Bruinsma, 2016), where latent processes are convolved together according to a matrix-valued function, called the convolution matrix, that is lower triangular (Appendix B). Again, one may let the dependency of $k_m$ on the entirety of $y_{1:m-1}$ vary with $x$, in which case the convolution matrix varies with $x$, or even let $k_m$ depend nonlinearly on the entirety of $y_{1:m-1}$; an example of a nonlocal nonlinear kernel is

$$k(y, y') = \sigma^2 \exp\!\left(-\int \ell(z) \, (y(z) - y'(z))^2 \, \mathrm{d}z\right).$$

Henceforth, we shall refer to GPAR with linear dependencies between outputs as GPAR-L, GPAR with nonlinear dependencies between outputs as GPAR-NL, and a combination of the two as GPAR-L-NL.

Table 1: Classification of kernels $k_{1:M}$ for $f_{1:M}$, the resulting dependencies between outputs, and related models. Here $k_{\text{Linear}}$ refers to a linear kernel, and $k_1$ and $k_2$ to an exponentiated quadratic (EQ) or rational quadratic (RQ) kernel (Rasmussen and Williams, 2006). [1]: Goovaerts (1997); Stein (1999); Wackernagel (2003); Teh and Seeger (2005); Bonilla et al. (2008); Osborne et al. (2008); Nguyen and Bonilla (2014). [2]: Wilson et al. (2012). [3]: Damianou (2014); Bui et al. (2016a). [4]: Álvarez et al. (2009); Álvarez and Lawrence (2009); Bruinsma (2016).

  Local, linear:                    $k_1(x, x') + k_{\text{Linear}}(y(x), y(x'))$                 (Multi-output GPs [1])
  Local, linear, dep. on x:         $k_1(x, x') + k_2(x, x')\, k_{\text{Linear}}(y(x), y(x'))$    (GPRN [2])
  Local, nonlinear:                 $k_1(x, x') + k_2(y(x), y(x'))$                               (Deep GPs (DGPs) [3])
  Local, nonlinear, dep. on x:      $k_1(x, x') + k_2((x, y(x)), (x', y(x')))$                    (DGPs with input connections)
  Nonlocal, linear:                 $k_1(x, x') + k_{\text{Linear}}(y, y')$
  Nonlocal, linear, dep. on x:      $k_1(x, x') + k_2(x, x')\, k_{\text{Linear}}(y, y')$          (Convolutional MOGPs [4])
  Nonlocal, nonlinear:              $k_1(x, x') + k_2(y, y')$
  Nonlocal, nonlinear, dep. on x:   $k_1(x, x') + k_2((x, y), (x', y'))$
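As an illustration of the local kernels in Table 1 (a sketch added here, with unit hyperparameters), the GPAR-L and GPAR-NL covariances act on an input that has been augmented with the foregoing outputs:

```python
import numpy as np

def eq(a, b, scale=1.0):
    """Exponentiated quadratic kernel between two vectors a and b."""
    return np.exp(-0.5 * np.sum((a - b) ** 2) / scale ** 2)

def k_gpar_l(x, y_prev, x2, y_prev2):
    """GPAR-L: EQ kernel over the input plus a linear kernel over the foregoing
    outputs (the local linear row of Table 1)."""
    return eq(x, x2) + np.dot(y_prev, y_prev2)

def k_gpar_nl(x, y_prev, x2, y_prev2):
    """GPAR-NL: EQ kernel over the input plus an EQ kernel over the input and the
    foregoing outputs jointly (the local nonlinear, dep. on x row of Table 1)."""
    return eq(x, x2) + eq(np.concatenate([x, y_prev]), np.concatenate([x2, y_prev2]))
```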
The Gaussian Process Network (Friedman and Nachman, 2000) is similar to GPAR, but was instead developed to identify causal dependencies between variables in a probabilistic graphical models context rather than for multi-output regression. The work by Yuan (2011) also discusses a model similar to GPAR, but specifies a different generative procedure for the outputs.

The multi-fidelity modelling literature is closely related to multi-output modelling. Whereas in a multi-output regression task we predict all outputs, multi-fidelity modelling is concerned with predicting a particular high-fidelity function, incorporating information from observations at various levels of fidelity. The idea of iteratively conditioning on lower-fidelity models in the construction of higher-fidelity ones is a well-used strategy (Kennedy and O'Hagan, 2000; Le Gratiet and Garnier, 2014). The model presented by Perdikaris et al. (2017) is nearly identical to GPAR applied in the multi-fidelity framework, but applications outside this setting have not been considered.

Moreover, GPAR follows a long line of work on the family of fully visible Bayesian networks (Frey et al., 1996; Bengio and Bengio, 2000), which decompose the distribution over the observations according to the product rule (Equation (1)) and model the resulting one-dimensional conditionals. A number of approaches use neural networks for this purpose (Neal, 1992; Frey et al., 1996; Larochelle and Murray, 2011; Theis and Bethge, 2015; van den Oord et al., 2016). In particular, if the observations are real-valued, a standard architecture lets the conditionals be Gaussian with means encoded by neural networks and fixed variances. Under broad conditions, if these neural networks are replaced by Bayesian neural networks with independent Gaussian priors over the weights, we recover GPAR as the width of the hidden layers goes to infinity (Neal, 1996; Matthews et al., 2018).
GPAR is well suited for problems where there is a strong functional relationship between outputs and for problems where the observation noise is richly structured and input-dependent. In this section we demonstrate GPAR's ability to model both types of phenomena.

First, we test the ability to leverage strong functional relationships between the outputs. Consider three outputs $y_1$, $y_2$, and $y_3$, inter-depending nonlinearly:

$$y_1(x) = -\sin(10\pi(x+1))/(2x+1) - x^4 + \epsilon_1,$$
$$y_2(x) = \cos^2(y_1(x)) + \sin(3x) + \epsilon_2,$$
$$y_3(x) = y_2(x)\, y_1^2(x) + 3x + \epsilon_3,$$

where $\epsilon_1, \epsilon_2, \epsilon_3$ are i.i.d. zero-mean Gaussian noise variables. By substituting $y_1$ and $y_2$ into $y_3$, we see that $y_3$ can be expressed directly in terms of $x$, but via a complex function. The dependence of $y_3$ on $y_1$ and $y_2$, however, is much simpler. Therefore, as GPAR can exploit direct dependencies between $y_3$ and $y_1, y_2$, it should be presented with a much simplified task as compared to predicting $y_3$ from $x$ directly. Figure 3 shows plots of independent GPs (IGPs) and GPAR fit to data points from $y_1$, $y_2$, and $y_3$. Indeed, observe that GPAR is able to learn $y_2$'s dependence on $y_1$, and $y_3$'s dependence on $y_1$ and $y_2$, whereas the independent GPs struggle with the complicated structure.

Second, we test GPAR's ability to capture non-Gaussian and input-dependent noise. Consider the following three schemes in which two outputs are observed under various noise conditions: $y_1(x) = f_1(x) + \epsilon_1$ and

(1): $y_2(x) = f_2(x) + \sin^2(2\pi x)\,\epsilon_1 + \cos^2(2\pi x)\,\epsilon_2$,
(2): $y_2(x) = f_2(x) + \sin(\pi \epsilon_1) + \epsilon_2$,
(3): $y_2(x) = f_2(x) + \sin(\pi x)\,\epsilon_1 + \epsilon_2$,

where $\epsilon_1, \epsilon_2$ are i.i.d. zero-mean Gaussian noise variables, and $f_1$ and $f_2$ are complicated, nonlinear functions, given by $f_1(x) = -\sin(10\pi(x+1))/(2x+1) - x^4$ and $f_2(x) = e^{x}(\theta_1 \cos(\theta_2 \pi x) + \theta_3 \cos(\theta_4 \pi x)) + \sqrt{x}$. All three schemes have i.i.d. homoscedastic Gaussian noise in $y_1$. The noise in $y_2$, however, depends on that in $y_1$ and can be heteroscedastic. The task for GPAR is to learn the scheme's noise structure. Figure 4 visualises the noise correlations induced by the schemes and the noise structures learned by GPAR. Observe that GPAR is able to learn the various noise structures.
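For reference, a sketch of how the first synthetic data set can be generated (the noise scale, input grid, and seed here are placeholders, not the values used in the paper):

```python
import numpy as np

def synthetic_outputs(x, noise=0.05, seed=0):
    """Generate the three nonlinearly inter-dependent synthetic outputs.
    The noise scale is a placeholder chosen for illustration."""
    rng = np.random.default_rng(seed)
    e1, e2, e3 = (noise * rng.standard_normal(x.shape) for _ in range(3))
    y1 = -np.sin(10 * np.pi * (x + 1)) / (2 * x + 1) - x ** 4 + e1
    y2 = np.cos(y1) ** 2 + np.sin(3 * x) + e2
    y3 = y2 * y1 ** 2 + 3 * x + e3
    return y1, y2, y3

x = np.linspace(0, 1, 300)
y1, y2, y3 = synthetic_outputs(x)
```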
In this section we evaluate GPAR's performance and compare to other models on four standard data sets commonly used to evaluate multi-output models. We also consider a recently introduced data set from the field of Bayesian optimisation, which is a downstream application area that could benefit from GPAR. Table 2 lists the models against which we compare GPAR. We always compare against IGP and CK, ICM, SLFM, and CGP, and compare against CMOGP and GPRN if results for the considered task are available. Since CK and ICM are much simplified versions of the SLFM (Álvarez et al., 2010; Goovaerts, 1997) and CGP is an approximation to the SLFM, we sometimes omit results for CK, ICM, and CGP. Implementations can be found at https://github.com/wesselb/gpar (Python) and https://github.com/willtebbutt/GPAR.jl (Julia). Experimental details can be found in Appendix D of the supplementary material.

Table 2: Models against which GPAR is compared. [1]: Goovaerts (1997); Stein (1999); Wackernagel (2003). [2]: Teh and Seeger (2005). [3]: Nguyen and Bonilla (2014). [4]: Álvarez and Lawrence (2011); Álvarez et al. (2010). [5]: Wilson et al. (2012).

  Acronym   Model
  IGP       Independent GPs
  CK        Cokriging
  ICM       Intrinsic Coregionalisation Model [1]
  SLFM      Semi-Parametric Latent Factor Model [2]
  CGP       Collaborative Multi-Output GPs [3]
  CMOGP     Convolved Multi-Output GP Model [4]
  GPRN      GP Regression Network [5]
Electroencephalogram (EEG) data set. This data set consists of 256 voltage measurements from 7 electrodes placed on a subject's scalp whilst the subject is shown a certain image; Zhang et al. (1995) describe the data collection process in detail. (The EEG data set can be downloaded at https://archive.ics.uci.edu/ml/datasets/eeg+database.) In particular, we use frontal electrodes FZ and F1–F6 from the first trial on a control subject. The task is to predict the last 100 samples for electrodes FZ, F1, and F2, given that the first 156 samples of FZ, F1, and F2 and the whole signals of F3–F6 are observed. Performance is measured with the standardised mean squared error (SMSE), mean log loss (MLL) (Rasmussen and Williams, 2006), and training time (TT). Figure 5 visualises predictions for electrode F2, and Table 3 quantifies the results. We observe that GPAR-NL outperforms independent GPs in terms of SMSE and MLL; note that independent GPs completely fail to provide an informative prediction. Furthermore, independent GPs were trained in two seconds, and GPAR-NL took only three more seconds; in comparison, training the SLFM took 11 minutes.
Figure 3: Synthetic data set with complex output dependencies: GPAR vs independent GPs (IGP) predictions.

Figure 4: Correlation between the sample residues (deviation from the mean) for $y_1$ and $y_2$. Left, middle, and right plots correspond to schemes (1), (2), and (3) respectively. Samples are coloured according to input value $x$; that is, all samples for a particular $x$ have the same colour. If the colour pattern is preserved, then GPAR has successfully captured how the noise in $y_2$ correlates to that in $y_1$.

Table 3: Results for the EEG data set for IGP, the SLFM with four latent dimensions, and GPAR.

  Model      SMSE    MLL     TT
  IGP        1.75    2.60    2 sec
  SLFM       1.06    4.00    11 min
  GPAR-NL    0.26    1.63    5 sec
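For completeness, the two accuracy metrics can be computed as follows, assuming the standard definitions: SMSE normalises the mean squared error by the variance of the test targets, and MLL is the average negative log density of the test targets under the Gaussian predictive marginals (Rasmussen and Williams, 2006). This sketch is added here for illustration and is not the evaluation code used for the reported numbers.

```python
import numpy as np

def smse(y_true, y_pred):
    """Standardised mean squared error: MSE divided by the variance of the test targets."""
    return np.mean((y_true - y_pred) ** 2) / np.var(y_true)

def mll(y_true, mean, var):
    """Mean log loss: average negative log density of the test targets under the
    Gaussian predictive marginals N(mean, var)."""
    return np.mean(0.5 * np.log(2 * np.pi * var) + 0.5 * (y_true - mean) ** 2 / var)
```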
Jura data set. This data set comprises metal concentration measurements collected from the topsoil in a region of the Swiss Jura. (The data can be downloaded at https://sites.google.com/site/goovaertspierre/pierregoovaertswebsite/download/.) We follow the experimental protocol of Goovaerts (1997), also followed by Álvarez and Lawrence (2011): the training data comprise 259 data points distributed spatially with three output variables (nickel, zinc, and cadmium) and 100 additional data points for which only two of the three outputs (nickel and zinc) are observed. The task is to predict cadmium at the locations of those 100 additional data points. Performance is evaluated with the mean absolute error (MAE).

Table 4 shows the results. The comparatively poor performance of independent GPs highlights the importance of exploiting correlations between the mineral concentrations. Furthermore, Table 4 shows that D-GPAR-NL significantly outperforms the other models, achieving a new state of the art.

Table 4: Results for the Jura data set for IGP, cokriging (CK) and ICM with two latent dimensions, the SLFM with two latent dimensions, CMOGP, GPRN, and GPAR, reporting MAE and MAE* for each of IGP, CK†, ICM, SLFM, CMOGP†, GPRN†, GPAR-NL, and D-GPAR-NL. *Results are obtained by first log-transforming the data, then performing prediction, and finally transforming the predictions back to the original domain. †Results from Wilson (2014).
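The log-transform used for the MAE* results can be wrapped around any regressor. The sketch below is added here for illustration and assumes a generic model object with fit and predict methods (a hypothetical interface):

```python
import numpy as np

def predict_in_log_space(model, x_train, y_train, x_test):
    """Fit in log space and map the predictions back to the original domain,
    as done for the MAE* results. `model` is any regressor with fit/predict."""
    model.fit(x_train, np.log(y_train))   # concentrations are positive
    return np.exp(model.predict(x_test))  # back-transform the predictions
```

Note that exponentiating the predictive mean returns the median, rather than the mean, of the implied log-normal predictive distribution, which is a common choice when reporting MAE.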
Exchange rates data set. This data set consists of the daily exchange rate w.r.t. USD of the top ten international currencies (CAD, EUR, JPY, GBP, CHF, AUD, HKD, NZD, KRW, and MXN) and three precious metals (gold, silver, and platinum) in the year 2007. (The exchange rates data set can be downloaded at http://fx.sauder.ubc.ca.) The task is to predict CAD on days 50–100, JPY on days 100–150, and AUD on days 150–200, given that CAD is observed on days 1–49 and 101–251, JPY on days 1–99 and 151–251, and AUD on days 1–149 and 201–251, and that all other currencies are observed throughout the whole year. Performance is measured with the SMSE.

Figure 5: Predictions for electrode F2 from the EEG data set (train and test data, with IGP, SLFM, and GPAR-NL predictions).

Figure 6 visualises GPAR's prediction for this data set, and Table 5 quantifies the result. We greedily optimise the evidence w.r.t. the ordering of the outputs to determine an ordering, and we impute missing data to ensure closed downwardness of the data. Observe that GPAR significantly outperforms all other models.

Table 5: Experimental results for the exchange rates data set, reporting SMSE for IGP*, CMOGP*, CGP*, and GPAR-L-NL. *These numbers are taken from Nguyen and Bonilla (2014).
Tidal height, wind speed, and air temperature data set. This data set was collected at 5-minute intervals by four weather stations: Bramblemet, Cambermet, Chimet, and Sotonmet, all located in Southampton, UK. (The data can be downloaded from the stations' websites, e.g. http://cambermet.co.uk and http://sotonmet.co.uk.) The task is to predict the air temperature measured by Cambermet and Chimet from all other signals. Performance is measured with the SMSE. This experiment serves two purposes. First, it demonstrates that it is simple to scale GPAR to large data sets using off-the-shelf inducing point techniques for single-output GP regression. Second, it shows that scaling to large data sets enables GPAR to better learn dependencies between outputs, which, importantly, can significantly improve predictions in regions where outputs are partially observed. We utilise the variational inducing point method by Titsias (2009), as discussed in Section 2, with 10 inducing points per day. This data set is not closed downwards, so we use mean imputation when training. We use D-GPAR-L and set the temporal kernel to be a simple EQ, meaning that the model cannot make long-range predictions, but instead must exploit correlations between outputs.

Nguyen and Bonilla (2014) consider 5 days in July 2013 from this data set and predict short periods of the air temperature measured by Cambermet and Chimet using all other signals. We followed their setup and predicted the same test set, but instead trained on the whole of July. Even though the additional observations do not temporally correlate with the test periods at all, they enable GPAR to better learn the relationships between the outputs, which, unsurprisingly, significantly improved the predictions: using the whole of July, GPAR achieves a substantially lower SMSE than the SMSE they report.

The test set used by Nguyen and Bonilla (2014) is rather small, yielding high-variance test results. We therefore do not pursue further comparisons on their train–test split, but instead consider a bigger, more challenging setup: using as training data 10 days (roughly 30k points), 15 days (roughly 47k points), and the whole of July (roughly 98k points), we make predictions of 4-day periods of the air temperature measured by Cambermet and Chimet. Figure 7 visualises the test periods and GPAR's predictions for them. Despite the additional observations not correlating with the test periods, we observe clear, though diminishing, improvements in the predictions as the training data is increased.
MLP validation error data set. The final data set is the validation error of a multi-layer perceptron (MLP) on the MNIST data, trained using categorical cross-entropy, as a function of six hyperparameters: the number of hidden layers, the number of neurons per hidden layer, the dropout rate, the learning rate to use with the ADAM optimizer, the L1 weight penalty, and the L2 weight penalty. This experiment was implemented using code made available by Hernández-Lobato (2016). An improved model for the objective surface could translate directly into improved performance in Bayesian optimisation (Snoek et al., 2012), as this would enable a more informed search of the hyperparameter space.

To generate a data set, we sample 291 sets of hyperparameters randomly from a rectilinear grid and train the MLP for 21 epochs under each set of hyperparameters, recording the validation performance after 1, 5, 11, 16, and 21 epochs. We construct a training set of 175 of these hyperparameter settings and, crucially, discard roughly 30% of the validation performance results at 5 epochs at random, again discard roughly 30% of those results at 11 epochs, and so forth. The resulting data set has 175 labels after 1 epoch, 124 after 5, 88 after 11, 64 after 16, and 44 after 21, simulating the partial completion of the majority of runs. Importantly, a Bayesian optimisation system typically exploits only completed training runs to inform the objective surface, whereas GPAR can also exploit partially completed runs.

Figure 6: Visualisation of the exchange rates data set and CGP's (black) and GPAR's (green) predictions for it. GPAR's predictions are overlaid on the original figure by Nguyen and Bonilla (2014).

Figure 7: Visualisation of the air temperature data set and GPAR's prediction for it. Black circles indicate the locations of the inducing points.

Figure 8: Results (SMSE of validation error prediction) for the machine learning data set for IGP, the SLFM with two latent dimensions, and GPAR-L-NL.

The results presented in Figure 8 show the SMSE in predicting validation performance at each epoch using GPAR, the SLFM, and independent GPs on the test set, averaged over 10 seeds for the pseudo-random number generator used to select which outputs from the training set to discard. GPs trained independently to predict performance after a particular number of epochs perform worse than the SLFM and GPAR, which have both learned to exploit the extra information available. In turn, GPAR noticeably outperforms the SLFM.
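To illustrate how the progressively discarded labels yield a closed-downwards data set when the outputs are ordered by epoch, here is a small sketch added for illustration (the keep fraction and seed are placeholders):

```python
import numpy as np

def make_closed_downwards(Y, keep_fraction=0.7, seed=0):
    """Progressively discard labels for later epochs so that whenever a run is
    observed at epoch m, it is also observed at all earlier epochs.
    Y has shape (runs, epochs); discarded entries are set to NaN."""
    rng = np.random.default_rng(seed)
    Y = Y.astype(float).copy()
    observed = np.ones(Y.shape[0], dtype=bool)
    for m in range(1, Y.shape[1]):
        # Keep a random subset of the runs that are still observed at epoch m.
        still = np.flatnonzero(observed)
        keep = rng.choice(still, size=int(keep_fraction * len(still)), replace=False)
        observed = np.zeros_like(observed)
        observed[keep] = True
        Y[~observed, m:] = np.nan
    return Y
```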
This paper introduced GPAR: a flexible, fast, tractable, and interpretable approach to multi-output GP regression. GPAR can model (1) nonlinear relationships between outputs and (2) complex output noise. GPAR can scale to large data sets by trivially leveraging scaling techniques for one-dimensional GP regression (Titsias, 2009). In effect, GPAR transforms high-dimensional data modelling problems into a set of single-output modelling problems, which are the bread and butter of the GP approach. GPAR was rigorously tested on a variety of synthetic and real-world problems, consistently outperforming existing GP models for multi-output regression. An exciting future application of GPAR is to use compositional kernel search (Lloyd et al., 2014) to automatically learn and explain dependencies between outputs and inputs. Further insights into the structure of the data could be gained by decomposing GPAR's posterior over additive kernel components (Duvenaud, 2014). These two approaches could be developed into a useful tool for automatic structure discovery. Two further exciting future applications of GPAR are the modelling of environmental phenomena and improving the data efficiency of existing Bayesian optimisation tools (Snoek et al., 2012).
Acknowledgements
Richard E. Turner is supported by Google as well as EPSRC grants EP/M0269571 and EP/L000776/1.
References
Álvarez, M. and Lawrence, N. D. (2009). Sparse convolved Gaussian processes for multi-output regression. Advances in Neural Information Processing Systems, 21:57–64.
Álvarez, M. A. and Lawrence, N. D. (2011). Computationally efficient convolved multiple output Gaussian processes. Journal of Machine Learning Research, 12:1459–1500.
Álvarez, M. A., Luengo, D., and Lawrence, N. D. (2009). Latent force models. Artificial Intelligence and Statistics, 5:9–16.
Álvarez, M. A., Luengo, D., Titsias, M. K., and Lawrence, N. D. (2010). Efficient multioutput Gaussian processes through variational inducing kernels. Journal of Machine Learning Research: Workshop and Conference Proceedings, 9:25–32.
Bengio, Y. and Bengio, S. (2000). Modeling high-dimensional discrete data with multi-layer neural networks. In Advances in Neural Information Processing Systems, pages 400–406.
Bonilla, E. V., Chai, K. M., and Williams, C. K. I. (2008). Multi-task Gaussian process prediction. Advances in Neural Information Processing Systems, 20:153–160.
Bruinsma, W. P. (2016). The Generalised Gaussian Convolution Process Model. MPhil thesis, Department of Engineering, University of Cambridge.
Bui, T. D., Hernández-Lobato, D., Li, Y., Hernández-Lobato, J. M., and Turner, R. E. (2016a). Deep Gaussian processes for regression using approximate expectation propagation. arXiv preprint arXiv:1602.04133.
Bui, T. D. and Turner, R. E. (2014). Tree-structured Gaussian process approximations. Advances in Neural Information Processing Systems, 27:2213–2221.
Bui, T. D., Yan, J., and Turner, R. E. (2016b). A unifying framework for Gaussian process pseudo-point approximations using power expectation propagation. arXiv preprint arXiv:1605.07066.
Dai, Z., Álvarez, M. A., and Lawrence, N. D. (2017). Efficient modeling of latent information in supervised learning using Gaussian processes. arXiv preprint arXiv:1705.09862.
Damianou, A. (2014). Deep Gaussian Processes and Variational Propagation of Uncertainty. PhD thesis, Department of Neuroscience, University of Sheffield.
Duvenaud, D. (2014). Automatic Model Construction with Gaussian Processes. PhD thesis, Computational and Biological Learning Laboratory, University of Cambridge.
Duvenaud, D., Lloyd, J. R., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2013). Structure discovery in nonparametric regression through compositional kernel search. In International Conference on Machine Learning.
Frey, B. J., Hinton, G. E., and Dayan, P. (1996). Does the wake-sleep algorithm produce good density estimators? In Advances in Neural Information Processing Systems, pages 661–667.
Friedman, N. and Nachman, I. (2000). Gaussian process networks. In Uncertainty in Artificial Intelligence, pages 211–219. Morgan Kaufmann Publishers Inc.
Goovaerts, P. (1997). Geostatistics for Natural Resources Evaluation. Oxford University Press, 1st edition.
Hernández-Lobato, J. M. (2016). Neural networks with optimal accuracy and speed in their predictions.
Kennedy, M. C. and O'Hagan, A. (2000). Predicting the output from a complex computer code when fast approximations are available. Biometrika, 87(1):1–13.
Koller, D. and Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Larochelle, H. and Murray, I. (2011). The neural autoregressive distribution estimator. In AISTATS, volume 1, page 2.
Le Gratiet, L. and Garnier, J. (2014). Recursive co-kriging model for design of computer experiments with multiple levels of fidelity. International Journal for Uncertainty Quantification, 4(5).
Lloyd, J. R., Duvenaud, D., Grosse, R., Tenenbaum, J. B., and Ghahramani, Z. (2014). Automatic construction and natural-language description of nonparametric regression models. In Association for the Advancement of Artificial Intelligence (AAAI).
Matthews, A. G. d. G., Hensman, J., Turner, R. E., and Ghahramani, Z. (2016). On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. Artificial Intelligence and Statistics, 19.
Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., and Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. arXiv preprint arXiv:1804.11271.
Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113.
Neal, R. M. (1996). Bayesian Learning for Neural Networks, volume 118. Springer Science & Business Media.
Nguyen, T. V. and Bonilla, E. V. (2014). Collaborative multi-output Gaussian processes. Conference on Uncertainty in Artificial Intelligence, 30.
Nocedal, J. and Wright, S. J. (2006). Numerical Optimization. Springer, 2nd edition.
Osborne, M. A., Roberts, S. J., Rogers, A., Ramchurn, S. D., and Jennings, N. R. (2008). Towards real-time information processing of sensor network data using computationally efficient multi-output Gaussian processes. In Proceedings of the 7th International Conference on Information Processing in Sensor Networks, IPSN '08, pages 109–120. IEEE Computer Society.
Perdikaris, P., Raissi, M., Damianou, A., Lawrence, N. D., and Karniadakis, G. E. (2017). Nonlinear information fusion algorithms for data-efficient multi-fidelity modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Science, 473.
Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
Snoek, J., Larochelle, H., and Adams, R. P. (2012). Practical Bayesian optimization of machine learning algorithms. In Advances in Neural Information Processing Systems, pages 2951–2959.
Stein, M. (1999). Interpolation of Spatial Data. Springer-Verlag New York, 1st edition.
Sun, S., Zhang, G., Wang, C., Zeng, W., Li, J., and Grosse, R. (2018). Differentiable compositional kernel learning for Gaussian processes. International Conference on Machine Learning, 35.
Teh, Y. W. and Seeger, M. (2005). Semiparametric latent factor models. International Workshop on Artificial Intelligence and Statistics, 10.
Theis, L. and Bethge, M. (2015). Generative image modeling using spatial LSTMs. In Advances in Neural Information Processing Systems, pages 1927–1935.
Titsias, M. K. (2009). Variational learning of inducing variables in sparse Gaussian processes. Artificial Intelligence and Statistics, 12:567–574.
van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. (2016). Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798.
Wackernagel, H. (2003). Multivariate Geostatistics. Springer-Verlag Berlin Heidelberg, 3rd edition.
Wilson, A. G. (2014). Covariance Kernels for Fast Automatic Pattern Discovery and Extrapolation with Gaussian Processes. PhD thesis, University of Cambridge.
Wilson, A. G., Knowles, D. A., and Ghahramani, Z. (2012). Gaussian process regression networks. International Conference on Machine Learning, 29.
Yuan, C. (2011). Conditional multi-output regression. In International Joint Conference on Neural Networks, pages 189–196. IEEE.
Zhang, X., Begleiter, H., Porjesz, B., Wang, W., and Litke, A. (1995). Event related potentials during object recognition tasks. Brain Research Bulletin, 38(6):531–538.
A Conditional Independence in Figure 2b
In this section, we prove the key conditional independence in Figure 2b that makes GPAR work:
Theorem 2.
Let a set of observations $\mathcal{D}$ be closed downwards. Then $y_{1:i} \perp \mathcal{D}_{i+1:M} \mid \mathcal{D}_{1:i}$, where $\mathcal{D}_{1:i}$ are the observations for outputs $1, \ldots, i$ and $\mathcal{D}_{i+1:M}$ for $i+1, \ldots, M$.

To begin with, we review some basic notions concerning graphical models. Let a path be a sequence of nodes $v_1, \ldots, v_n$ from some directed graph $G$ where, for each $v_i$, $G$ either contains an edge from $v_i$ to $v_{i+1}$, or an edge from $v_{i+1}$ to $v_i$. If $G$ contains an edge from $a$ to $b$, then write $a \to b$ to mean the two-node path in which node $b$ follows node $a$. Similarly, if $G$ contains an edge from $b$ to $a$, then write $a \leftarrow b$ to mean the two-node path in which $b$ follows $a$. Write $a \rightleftharpoons b$ to mean either $a \to b$ or $a \leftarrow b$.

Definition 1 (Active Path (Definition 3.6 from Koller and Friedman (2009))). Let $P = v_1 \rightleftharpoons \cdots \rightleftharpoons v_n$ be a path in a graphical model. Let $Z$ be a subset of the variables from the graphical model. Then, call $P$ active given $Z$ if (1) for every v-structure $v_{i-1} \to v_i \leftarrow v_{i+1}$ in $P$, $v_i$ or a descendant of $v_i$ is in $Z$; and (2) no other node in $P$ is in $Z$.

Definition 2 (d-Separation (Definition 3.7 from Koller and Friedman (2009))). Let $X$, $Y$, and $Z$ be three sets of nodes from a graphical model. Then, call $X$ and $Y$ d-separated given $Z$ if no path between any $x \in X$ and $y \in Y$ is active given $Z$.

Theorem 3 (d-Separation Implies Conditional Independence (Theorem 3.3 from Koller and Friedman (2009))). Let $X$, $Y$, and $Z$ be three sets of nodes from a graphical model. If $X$ and $Y$ are d-separated given $Z$, then $X \perp Y \mid Z$.

Define the layer of a node in Figure 2b to be $\mathrm{layer}(f_i) = \mathrm{layer}(y_i(x)) = i$. We are now ready to prove Theorem 2.
Proof of Theorem 2.
For $i < j$, let $P$ be a path between any $y_i(x') \in y_{1:i}$ and $y_j(x) \in \mathcal{D}_{i+1:M}$. (Here $\mathcal{D}_{1:i} = \{y_j^{(n)} \in \mathcal{D} : j \leq i, n \leq N\}$ and $\mathcal{D}_{i+1:M} = \{y_j^{(n)} \in \mathcal{D} : j > i, n \leq N\}$.) Let $y_k(\hat{x})$ be the first node in $P$ such that $\mathrm{layer}(y_k(\hat{x})) > i$. Then $P$ contains

$$\cdots \to y_m(\hat{x}) \to y_k(\hat{x}) \rightleftharpoons \cdots$$

for some $m \leq i < k$.

If $y_k(\hat{x}) \in \mathcal{D}_{i+1:M}$, then, since $\mathcal{D}$ is closed downwards, $y_m(\hat{x}) \in \mathcal{D}_{1:i}$, meaning that $P$ is inactive.

If, on the other hand, $y_k(\hat{x}) \notin \mathcal{D}$, then, since $\mathcal{D}$ is closed downwards, $y_{k'}(\hat{x}) \notin \mathcal{D}$ for all $k' \geq k$. Therefore, $y_j(x)$ cannot be a descendant of $y_k(\hat{x})$, so $P$ must contain

$$\cdots \to y_{k'}(\hat{x}) \to y_{k''}(\hat{x}) \leftarrow f_{k''} \to \cdots$$

for some $m \leq k' < k''$, which forms a v-structure. We conclude that $P$ is inactive, because neither $y_{k''}(\hat{x})$ nor a descendant of $y_{k''}(\hat{x})$ can be in $\mathcal{D}$.

B The Nonlinear and Linear Equivalent Model
In this section, we construct equivalent models for GPAR (Lemmas 1 and 2). These models make GPAR's connection to other models in the literature explicit. To begin with, we must introduce some notation and definitions. For functions
$A, B : \mathcal{X} \times (\mathcal{Y}^M)^{\mathcal{X}} \to \mathcal{Y}^M$, define composition $\circ$ as follows: $(A \circ B)(x, y) = A(x, B(\,\cdot\,, y))$. Note that $\circ$ is well defined and associative. For a function $u : \mathcal{X} \to \mathcal{Y}^M$, denote $A \circ u : \mathcal{X} \to \mathcal{Y}^M$, $A \circ u = A(\,\cdot\,, u)$. Again, note that $(A \circ B) \circ u = A \circ (B \circ u)$. Furthermore, denote the $n$-fold composition $A \circ \cdots \circ A$ ($n$ times) by $A^n$.

Consider a function $A : \mathcal{X} \times (\mathcal{Y}^M)^{\mathcal{X}} \to \mathcal{Y}^M$ such that $A_i(x, y) : \mathcal{X} \times (\mathcal{Y}^M)^{\mathcal{X}} \to \mathcal{Y}$ depends only on $(x, y_{1:i-1})$, where $A_1 = 0$. Further let $u, y : \mathcal{X} \to \mathcal{Y}^M$, denote $\mathrm{T} f = u + A \circ f$, and denote $N$ consecutive applications of $\mathrm{T}$ by $\mathrm{T}^N$.

The expression $\mathrm{T}^{M-1} u$ will be key in constructing the equivalent models. We show that it is the unique solution of a functional equation:

Proposition 1.
The unique solution of $y = u + A \circ y$ is $y = \mathrm{T}^{M-1} u$.

Proof of Proposition 1.
First, we show that $y = u + A \circ y$ has a solution, and that this solution is unique. Because $A_i(x, y)$ depends only on $(x, y_{1:i-1})$, it holds that $y_i = u_i + A_i \circ y = u_i + A_i \circ (y_{1:i-1}, 0)$, where $(y_{1:i-1}, 0)$ represents the concatenation of $y_{1:i-1}$ and $M - i + 1$ zeros. Thus, $y_i$ can uniquely be constructed from $u_i$, $A_i$, and $y_{1:i-1}$; therefore, $y_1$ exists and is unique, so $y_2$ exists and is unique: by induction we find that $y$ exists and is unique.

Second, we show that $y = \mathrm{T}^{M-1} u$ satisfies $y = u + A \circ y = \mathrm{T} y$. To show this, we show that $(\mathrm{T}^n u)_i = (\mathrm{T}^{n-1} u)_i$ for $i = 1, \ldots, n$, for all $n$. To begin with, we show the base case, $n = 1$: $(\mathrm{T}^1 u)_1 = u_1 + A_1 \circ u = u_1 = (\mathrm{T}^0 u)_1$, since $A_1 = 0$. Finally, suppose that the claim holds for a particular $n$. We show that the claim then holds for $n + 1$: let $i \leq n + 1$. Then

$$\begin{aligned}
(\mathrm{T}^{n+1} u)_i &= u_i + A_i \circ \mathrm{T}^n u \\
&= u_i + A_i \circ \big((\mathrm{T}^n u)_{1:i-1}, (\mathrm{T}^n u)_{i:M}\big) \\
&= u_i + A_i \circ \big((\mathrm{T}^{n-1} u)_{1:i-1}, (\mathrm{T}^n u)_{i:M}\big) && \text{(by assumption)} \\
&\overset{(*)}{=} u_i + A_i \circ \big((\mathrm{T}^{n-1} u)_{1:i-1}, (\mathrm{T}^{n-1} u)_{i:M}\big) \\
&= u_i + A_i \circ \mathrm{T}^{n-1} u = (\mathrm{T}^n u)_i,
\end{aligned}$$

where $(*)$ holds because $A_i(x, y)$ depends only on $(x, y_{1:i-1})$.

In the linear case, $\mathrm{T}^{M-1} u$ turns out to greatly simplify.

Proposition 2. If $A(x, y)$ is linear in $y$, then $\mathrm{T}^{M-1} u = \big(\sum_{i=0}^{M-1} A^i\big) \circ u$.

Proof of Proposition 2. If $A(x, y)$ is linear in $y$, then one verifies that $\circ$ distributes over addition. Therefore,

$$\mathrm{T}^{M-1} u = u + A \circ \mathrm{T}^{M-2} u = u + A \circ u + A^2 \circ \mathrm{T}^{M-3} u = \cdots = u + A \circ u + \cdots + A^{M-1} \circ u.$$
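For concreteness, consider a small worked instance of Proposition 2 (an illustration added here, not taken from the paper): take $M = 3$ and let $A(x, y) = B\, y(x)$ for a strictly lower-triangular matrix $B$, so that $A$ is linear in $y$, $A_i$ depends only on $(x, y_{1:i-1})$, and $A_1 = 0$. Then

$$\mathrm{T}^{2} u = u + A \circ u + A^{2} \circ u
\quad\Longrightarrow\quad
\begin{pmatrix} y_1 \\ y_2 \\ y_3 \end{pmatrix}
= \big(I + B + B^{2}\big)
\begin{pmatrix} u_1 \\ u_2 \\ u_3 \end{pmatrix},
\qquad
B = \begin{pmatrix} 0 & 0 & 0 \\ b_{21} & 0 & 0 \\ b_{31} & b_{32} & 0 \end{pmatrix},$$

so that $y_1 = u_1$, $y_2 = u_2 + b_{21} u_1$, and $y_3 = u_3 + b_{32} u_2 + (b_{31} + b_{32} b_{21}) u_1$: a fixed, lower-triangular mixing of the latent processes, which is the structure of the classical linear multi-output GP models referenced in Table 1.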
We now use Propositions 1 and 2 to construct a nonlinear and a linear equivalent model.

Lemma 1 (Nonlinear Equivalent Model). Let $A$ be an $M$-dimensional vector-valued process over $\mathcal{X} \times (\mathcal{Y}^M)^{\mathcal{X}}$, each $A_i$ drawn from $\mathcal{GP}(0, k_{A_i})$ independently, and let $u$ be an $M$-dimensional vector-valued process over $\mathcal{X}$, each $u_i$ drawn from $\mathcal{GP}(0, k_{u_i})$ independently. Furthermore, let $A_i(x, y) : \mathcal{X} \times (\mathcal{Y}^M)^{\mathcal{X}} \to \mathcal{Y}$ depend only on $(x, y_{1:i-1})$, meaning that $k_{A_i} = k_{A_i}(x, y_{1:i-1}, x', y'_{1:i-1})$, and let $A_1 = 0$. Denote $\mathrm{T} f = u + A \circ f$, and denote $N$ consecutive applications of $\mathrm{T}$ by $\mathrm{T}^N$. Then

$$y \mid A, u = \mathrm{T}^{M-1} u \iff y_i \mid y_{1:i-1} \sim \mathcal{GP}\big(0, k_{u_i} + k_{A_i}(\,\cdot\,, y_{1:i-1}, \,\cdot\,, y_{1:i-1})\big).$$

Proof of Lemma 1.
Since $A_i(x, y)$ depends only on $(x, y_{1:i-1})$, it holds by Proposition 1 that any sample from $y \mid A, u$ satisfies $y_i = u_i + A_i \circ y$, so $y_i = u_i + A_i \circ (y_{1:i-1}, 0)$, where $(y_{1:i-1}, 0)$ represents the concatenation of $y_{1:i-1}$ and $M - i + 1$ zeros. The equivalence now follows.

Lemma 2 (Linear Equivalent Model). Suppose that $A$ was instead generated from

$$A(x, y) \mid \hat{A} = \int \hat{A}(x - z) \, y(z) \, \mathrm{d}z,$$

where $\hat{A}$ is an $(M \times M)$-matrix-valued process over $\mathcal{X}$, each $\hat{A}_{i,j}$ drawn from $\mathcal{GP}(0, k_{\hat{A}_{i,j}})$ independently if $i > j$ and $\hat{A}_{i,j} = 0$ otherwise. Then

$$y \mid A, u = \Big(\sum_{i=0}^{M-1} A^i\Big) \circ u \iff y_i \mid y_{1:i-1} \sim \mathcal{GP}\big(0, k_{u_i} + k_{A_i}(\,\cdot\,, y_{1:i-1}, \,\cdot\,, y_{1:i-1})\big), \qquad (5)$$

where

$$k_{A_i}(x, y_{1:i-1}, x', y'_{1:i-1}) = \sum_{j=1}^{i-1} \int k_{\hat{A}_{i,j}}(x - z, x' - z') \, y_j(z) \, y'_j(z') \, \mathrm{d}z \, \mathrm{d}z'.$$

Proof of Lemma 2.
First, one verifies that $A_i(x, y)$ still depends only on $(x, y_{1:i-1})$, and that $A_i(x, y)$ is linear in $y$. The result then follows from Lemma 1 and Proposition 2, where the expression for $k_{A_i}$ follows from a straightforward calculation.

As mentioned in the paper, the kernels for $f_{1:M}$ determine the types of relationships between inputs and outputs that can be learned. Lemmas 1 and 2 make this explicit: Lemma 1 shows that nonlocal GPAR can recover a model where $M$ latent GPs $u$ are repeatedly composed with another latent GP $A$, where $A$ has a particular dependency structure, and Lemma 2 shows that nonlocal GPAR can recover a model where $M$ latent GPs $u$ are linearly transformed, where the linear transform $\mathrm{T} = \sum_{i=0}^{M-1} A^i$ is lower triangular and may vary with the input.

In Lemma 2, note that it is not restrictive that $\mathrm{T}$ is lower triangular: suppose that $\mathrm{T}$ were dense. Then, letting $y \mid \mathrm{T}, u = \mathrm{T} \circ u$, $y \mid \mathrm{T}$ is jointly Gaussian. Hence $y_i \mid y_{1:i-1}, \mathrm{T}$ is a GP whose mean linearly depends upon $y_{1:i-1}$ via $\mathrm{T}$, meaning that $y_i \mid y_{1:i-1}$ is of the form of Equation (5), where $k_{u_i}$ may be more complicated. If, however, $\hat{A}(z) = \delta(z) B$ for some random $(M \times M)$-matrix $B$, each $B_{i,j}$ drawn from $\mathcal{N}(0, \sigma^2_{B_{i,j}})$ if $i > j$ and $B_{i,j} = 0$ otherwise, then it is restrictive that $\mathrm{T}$ is lower triangular: in this case, $y(x) \mid B, u = \sum_{i=0}^{M-1} B^i u(x)$. If $\mathrm{T} = \sum_{i=0}^{M-1} B^i$ were dense, then, letting $y \mid \mathrm{T}, u = \mathrm{T} u$, $y$ can be represented with Lemma 2 if and only if the covariance of $y \mid \mathrm{T}$ can be diagonalised by a constant, invertible, lower-triangular matrix. This condition does not hold in general, as Lemma 3 proves.

C Lemma 3
Call functions $k_1, \ldots, k_M : \mathcal{X} \to \mathbb{R}$ linearly independent if

$$\Big(\forall x : \sum_{i=1}^M c_i k_i(x) = 0\Big) \implies c_1 = \ldots = c_M = 0.$$

Lemma 3.
Let $k_1, \ldots, k_M : \mathcal{X} \to \mathbb{R}$ be linearly independent and arrange them in a diagonal matrix $K = \mathrm{diag}(k_1, \ldots, k_M)$. Let $B$ be an invertible $M \times M$ matrix such that its columns cannot be permuted into a triangular matrix. Then there does not exist an invertible triangular matrix $\mathrm{T}$ such that $\mathrm{T}^{-1} B K(x) B^{\mathsf{T}} \mathrm{T}^{-\mathsf{T}}$ is diagonal for all $x$.

Proof.
Suppose, on the contrary, that such a $\mathrm{T}$ does exist. Then two different rows $a_p$ and $a_q$ of $A = \mathrm{T}^{-1} B$ share nonzero elements in some columns $C$; otherwise, $A$ would have exactly one nonzero entry in every column ($A$ is invertible), so $A$ would be the product of a permutation matrix and a diagonal matrix, meaning that the columns of $B = \mathrm{T} A$ could be permuted into a triangular matrix. Now, since $\mathrm{T}^{-1} B K(x) B^{\mathsf{T}} \mathrm{T}^{-\mathsf{T}} = A K(x) A^{\mathsf{T}}$ is diagonal for all $x \in \mathcal{X}$, it holds that $\sum_i a_{p,i}\, a_{q,i}\, k_i(x) = 0$ for all $x$. Therefore, by linear independence of $k_1, \ldots, k_M$, it holds that $a_{p,i}\, a_{q,i} = 0$ for all $i$. But $a_{p,i}\, a_{q,i} \neq 0$ for every $i \in C$, which is a contradiction.

D Experimental Details
For every experiment, the form of the kernels is determined by the particular GPAR model used: GPAR-L, GPAR-NL, or GPAR-L-NL (see Table 1 in the main paper), potentially with a D- prefix to indicate the denoising procedure outlined in "Potential deficiencies of GPAR" in Section 2 of the main paper. For GPAR-NL, we always used exponentiated quadratic (EQ) kernels, except for the exchange rates experiment, where we used rational quadratic (RQ) kernels (Rasmussen and Williams, 2006). Furthermore, in every problem we simply expanded according to Equation (1) in the main paper or greedily optimised the ordering, in both cases putting the to-be-predicted outputs last. We used SciPy to optimise the hyperparameters.