Large-scale Heteroscedastic Regression via Gaussian Process
Haitao Liu, Yew-Soon Ong, Fellow, IEEE, and Jianfei Cai, Senior Member, IEEE
Abstract—Heteroscedastic regression considering the varying noises among observations has many applications in fields like machine learning and statistics. Here we focus on the heteroscedastic Gaussian process (HGP) regression which integrates the latent function and the noise function together in a unified non-parametric Bayesian framework. Though showing remarkable performance, HGP suffers from the cubic time complexity, which strictly limits its application to big data. To improve the scalability, we first develop a variational sparse inference algorithm, named VSHGP, to handle large-scale datasets. Furthermore, two variants are developed to improve the scalability and capability of VSHGP. The first is stochastic VSHGP (SVSHGP) which derives a factorized evidence lower bound, thus enhancing efficient stochastic variational inference. The second is distributed VSHGP (DVSHGP) which (i) follows the Bayesian committee machine formalism to distribute computations over multiple local VSHGP experts with many inducing points, and (ii) adopts hybrid parameters for experts to guard against over-fitting and capture local variety. The superiority of DVSHGP and SVSHGP as compared to existing scalable heteroscedastic/homoscedastic GPs is then extensively verified on various datasets.
Index Terms—Heteroscedastic GP, Large-scale, Sparse approximation, Stochastic variational inference, Distributed learning.

NOTATION
F, F_V = Lower bounds for the log model evidence log p(y)
f, g = Latent function values and log variances
f_m, g_u = Inducing variables for f and g
k_f, k_g = Kernels for f and g
K_f, K_g = Kernel matrices for f and g
m, u = Inducing sizes for f and g
m', u' = Inducing sizes for each expert in DVSHGP
m_b = Number of basis functions for GPVC [8]
m_sod = Subset size for EHSoD [13]
M = Number of VSHGP experts
M_i = i-th VSHGP expert (1 ≤ i ≤ M)
n, n_* = Training size and test size
n' = Training size for each expert in DVSHGP
X, X_* = Training and test data points
Haitao Liu is with the Rolls-Royce@NTU Corporate Lab, Nanyang Technological University, Singapore, 637460. E-mail: [email protected]
Yew-Soon Ong is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore, 639798. E-mail: [email protected]
Jianfei Cai is with the Department of Data Science & AI, Monash University, Australia. E-mail: [email protected]

y, y_* = Training and test observations
µ = Mean parameter for the latent noise function g
µ_m, Σ_m = Mean and variance for the posterior q(f_m)
µ_u, Σ_u = Mean and variance for the posterior q(g_u)
µ_f, Σ_f = Mean and variance for the posterior q(f)
µ_f^⋆, Σ_f^⋆ = Mean and variance for the optimal q⋆(f_m)
µ_g, Σ_g = Mean and variance for the posterior q(g)
Λ_nn = Variational parameters for q(g_u) in VSHGP
µ_f*^(A), µ_g*^(A) = (Aggregated) prediction mean of f and g at x_*
σ²_f*^(A), σ²_g*^(A) = (Aggregated) prediction variance of f and g at x_*
µ_*^(A), σ²_*^(A) = (Aggregated) prediction mean and variance at x_*
µ_i*, σ²_i* = Prediction mean and variance of M_i at x_*
w^f_i*, w^g_i* = Weights of the expert M_i at x_* for f and g
θ_f, θ_g = Kernel parameters for k_f and k_g
ε_i = Noise for observation y_i at point x_i (1 ≤ i ≤ n)
γ = Step parameter for optimization
B = Mini-batch set sampled from X
I. INTRODUCTION

In supervised learning, we learn a machine learning model from n training points X = {x_i ∈ R^d}_{i=1}^n and the observations y = {y(x_i) = f(x_i) + ε_i}_{i=1}^n, where f is the underlying function and ε_i is the independent noise. The non-parametric Gaussian process (GP) places a GP prior on f with the iid noise ε_i ~ N(0, σ_ε²), resulting in the standard homoscedastic GP. The homoscedasticity offers tractable GP inference to enable extensive applications in regression and classification [1], visualization [2], Bayesian optimization [3], multi-task learning [4], active learning [5], etc.

Many realistic problems, e.g., volatility forecasting [6], biophysical variables estimation [7], cosmological redshifts estimation [8], and robotics and vehicle control [9], [10], however, should consider the input-dependent noise rather than the simple constant noise in order to fit the local noise rates of complicated data distributions. In comparison to the conventional homoscedastic GP, the heteroscedastic GP (HGP) provides better quantification of different sources of uncertainty, which further brings benefits for downstream tasks, e.g., active learning, optimization and uncertainty quantification [11], [12].

To account for the heteroscedastic noise in GP, there exist two main strategies: (i) treat the GP as a black-box and interpret the heteroscedasticity using another separate model; and (ii) integrate the heteroscedasticity within the unifying GP framework. The first post-model strategy first trains a standard GP to capture the underlying function f, and then either trains another GP to take the remaining empirical variance into account [13], or uses quantile regression [14] to model the lower and upper quantiles of the variance, respectively.

In contrast, the second integration strategy provides an elegant framework for heteroscedastic regression. The simplest way to mimic variable noise is through adding independent yet different noise variances to the diagonal of the kernel matrix [15]. Goldberg et al. [16] introduced a more principled HGP which infers a mean-field GP for f(x) and an additional GP g(x) for log σ_ε²(x) jointly. Similarly, other HGPs which describe the heteroscedastic noise using the pointwise division of two GPs or the general non-linear combination of GPs have been proposed, for example in [17]–[19]. Note that unlike the homoscedastic GP, the inference in HGP is challenging since the model evidence (marginal likelihood) p(y) and the posterior p(y_*|X, y, x_*) are intractable. Hence, various approximate inference methods, e.g., Markov chain Monte Carlo (MCMC) [16], maximum a posteriori (MAP) [20]–[23], variational inference [24], [25], expectation propagation [26], [27] and Laplace approximation [28], have been used. The most accurate MCMC is time-consuming when handling large-scale datasets; the MAP is a point estimation which risks over-fitting and oscillation; the variational inference and its variants, which run fast via maximizing a tractable and rigorous evidence lower bound (ELBO), provide a trade-off.

This paper focuses on the HGP model developed in [16]. When handling n training points, the standard GP suffers from a cubic complexity O(n³) due to the inversion of an n × n kernel matrix, which makes it unaffordable for big data. Since HGP employs an additional log-GP for the noise variance, its complexity is about two times that of the standard GP. Hence, to handle large-scale datasets, which is of great demand in the era of big data, the scalability of HGP should be improved.

Recently, there has been an increasing trend on studying scalable GPs, which have two core categories: global and local approximations [29]. As the representative of global approximation, sparse approximation considers m (m ≪ n) global inducing pairs {X_m, f_m} to summarize the training data by approximating the prior [30] or the posterior [31], resulting in a complexity of O(nm²). Variants of sparse approximation have been recently proposed to handle millions of data points via distributed inference [32], [33], stochastic variational inference [34], [35], or structured inducing points [36]. The global sparse approximation, however, often faces challenges in capturing quick-varying features [37].
Differently, local approximation, which follows the idea of divide-and-conquer, first trains GP experts on local subsets and then aggregates their predictions, for example by means of the product-of-experts (PoE) [38] and the Bayesian committee machine (BCM) [39]–[41]. Hence, local approximation not only distributes the computations but also captures quick-varying features [42]. Hybrid strategies have thereafter been presented to take advantage of both global and local approximations [43]–[45].

The developments in scalable homoscedastic GPs have thus motivated us to scale up HGPs. Alternatively, we could combine the simple subset-of-data (SoD) approximation [46] with the empirical HGP [13] which trains two separate GPs for predicting the mean and variance, respectively. The empirical variance however is hard to fit since it follows an asymmetric Gaussian distribution. More reasonably, the GP using variable covariances (GPVC) [8] follows the idea of the relevance vector machine (RVM) [47] that a stationary kernel k(., .) has a positive and finite Fourier spectrum, suggesting using only m_b (m_b ≪ n) independent basis functions for approximation. Note that GPVC shares the basis functions for f and g, which however might produce distinct features. Besides, the RVM-type model usually suffers from underestimated prediction variance when leaving X [48].

Besides, there are some scalable “pseudo” HGPs [43], [49] which are not designed for such a case, but can describe the heteroscedasticity to some extent due to the factorized conditionals. For instance, the fully independent training conditional (FITC) [49] and its block version named partially independent conditional (PIC) [43] adopt a factorized training conditional to achieve a varying noise [50]. Though equipped with heteroscedastic variance, these models (i) severely underestimate the constant noise variance, (ii) sacrifice the prediction mean [50], and (iii) may produce discontinuous predictions on block boundaries [44]. Recently, stochastic/distributed variants of FITC and PIC have been developed to further improve the scalability [33], [35], [45], [51].

The high capability of describing complicated data distributions with however poor scalability motivates us to develop variational sparse HGPs which employ an additional log-GP for heteroscedasticity. Particularly, the main contributions are:

1. A variational inference algorithm for sparse HGP, named VSHGP, is developed. Specifically, VSHGP derives an analytical ELBO using m inducing points for f and u inducing points for g, resulting in a greatly reduced complexity of O(nm² + nu²). Besides, some tricks, for example re-parameterization, are used to ease the inference;
2. The stochastic variant SVSHGP is presented to further improve the scalability of VSHGP. Specifically, we derive a factorized ELBO which allows using efficient stochastic variational inference;
3. The distributed variant DVSHGP is presented for improving the scalability and capability of VSHGP. The local experts with many inducing points (i) distribute the computations for parallel computing, and (ii) employ hybrid parameters to guard against over-fitting and capture local variety;
4. Extensive experiments conducted on datasets with up to two million points reveal that the localized DVSHGP exhibits superior performance, while the global SVSHGP may sacrifice the prediction mean for capturing heteroscedastic noise.

The remainder of the article is organized as follows. Section II first develops the VSHGP model via variational inference.
Then, we present the stochastic variant in Section III and the distributed variant in Section IV to further enhance the scalability and capability, followed by extensive experiments in Section VI. Finally, Section VII provides concluding remarks.

The SVSHGP is implemented based on the GPflow toolbox [52], which benefits from parallel/GPU speedup and the automatic differentiation of Tensorflow [53]. The DVSHGP is built upon the GPML toolbox [54]. These implementations are available at https://github.com/LiuHaiTao01.
II. VARIATIONAL SPARSE HGP
A. Sparse approximation
We follow [16] to define the HGP as y(x) = f(x) + ε(x), wherein the latent function f(x) and the noise ε(x) follow

f(x) ~ GP(0, k_f(x, x')),  ε(x) ~ N(0, σ_ε²(x)).  (1)

It is observed that the input-dependent noise variance σ_ε²(x) enables describing possible heteroscedasticity. Notably, this HGP degenerates to the homoscedastic GP when σ_ε²(x) is a constant. To ensure positivity, we particularly consider the exponential form σ_ε²(x) = e^{g(x)}, wherein the latent function g(x), akin to f(x), follows an independent GP prior

g(x) ~ GP(µ, k_g(x, x')).  (2)

The only difference is that unlike the zero-mean GP prior placed on f(x), we explicitly consider a prior mean µ to account for the variability of the noise variance. The kernels k_f and k_g could be, e.g., the squared exponential (SE) kernel equipped with automatic relevance determination (ARD)

k(x, x') = σ_s² exp( −(1/2) Σ_{i=1}^d (x_i − x'_i)² / l_i² ),  (3)

where the signal variance σ_s² is an output scale, and the length-scale l_i is an input scale along the i-th dimension; see the sketch at the end of this subsection.

Given the training data D = {X, y}, the joint priors follow

p(f) = N(f | 0, K^f_nn),  p(g) = N(g | µ1, K^g_nn),  (4)

where [K^f_nn]_ij = k_f(x_i, x_j) and [K^g_nn]_ij = k_g(x_i, x_j) for 1 ≤ i, j ≤ n. Accordingly, the data likelihood becomes

p(y | f, g) = N(y | f, Σ_ε),  (5)

where the diagonal noise matrix has [Σ_ε]_ii = e^{g(x_i)}.

To scale up HGP, we follow the sparse approximation framework to introduce m inducing variables f_m ~ N(f_m | 0, K^f_mm) at the inducing points X_m for f; similarly, we introduce u inducing variables g_u ~ N(g_u | µ1, K^g_uu) at the independent X_u for g. Besides, we assume that f_m is a sufficient statistic for f, and g_u a sufficient statistic for g. As a result, we obtain the two training conditionals

p(f | f_m) = N(f | Ω^f_nm f_m, K^f_nn − Q^f_nn),
p(g | g_u) = N(g | Ω^g_nu (g_u − µ1) + µ1, K^g_nn − Q^g_nn),

where Ω^f_nm = K^f_nm (K^f_mm)⁻¹, Ω^g_nu = K^g_nu (K^g_uu)⁻¹, Q^f_nn = Ω^f_nm K^f_mn and Q^g_nn = Ω^g_nu K^g_un.

In the augmented probability space, the model evidence

p(y) = ∫ p(y | f, g) p(f | f_m) p(g | g_u) p(f_m) p(g_u) df df_m dg dg_u,

together with the posterior p(z | y) = p(y | z) p(z)/p(y) where z = {f, g, f_m, g_u}, however, is intractable. Hence, we use variational inference to derive an analytical ELBO of log p(y).

For f, we can pre-process the data to fulfill the zero-mean assumption. For g, however, it is hard to satisfy the zero-mean assumption, since we have no access to the “noise” data. Sufficient statistic here means that the variables z and f are independent given f_m, i.e., it holds that p(z | f, f_m) = p(z | f_m).
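As a concrete illustration of the SE-ARD kernel (3), the following minimal Python sketch (the function name and arguments are ours, not from the released implementations) evaluates the cross-covariance between two input sets:

import numpy as np

def se_ard_kernel(X1, X2, signal_var, lengthscales):
    """SE-ARD kernel of Eq. (3): k(x, x') = s^2 exp(-0.5 sum_i (x_i - x'_i)^2 / l_i^2).

    X1 is (n1, d), X2 is (n2, d), lengthscales is a length-d vector
    (one input scale per dimension); returns the (n1, n2) kernel matrix.
    """
    X1s = X1 / lengthscales            # scale each dimension by its length-scale
    X2s = X2 / lengthscales
    sq_dist = (np.sum(X1s**2, axis=1)[:, None]
               + np.sum(X2s**2, axis=1)[None, :]
               - 2.0 * X1s @ X2s.T)    # pairwise squared distances
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0))

# e.g., the cross-covariance K^f_nm between training inputs X and inducing
# points X_m would read: K_f_nm = se_ard_kernel(X, X_m, sigma_s2_f, l_f)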
B. Evidence lower bound

We employ the mean-field theory [55] to approximate the intractable posterior p(z | y) = p(f | f_m) p(g | g_u) p(f_m | y) p(g_u | y) as

p(z | y) ≈ q(z) = p(f | f_m) p(g | g_u) q(f_m) q(g_u),  (6)

where q(f_m) and q(g_u) are free variational distributions to approximate the posteriors p(f_m | y) and p(g_u | y), respectively. In order to push the approximation q(z) towards the exact p(z | y), we minimize their Kullback-Leibler (KL) divergence KL(q(z) || p(z | y)), which, on the other hand, is equivalent to maximizing the ELBO F, since KL(., .) ≥ 0, as

F = ∫ q(z) log [p(z, y)/q(z)] dz = log p(y) − KL(q(z) || p(z | y)).  (7)

As a consequence, instead of directly maximizing the intractable log p(y) for inference, we now seek the maximization of F w.r.t. the variational distributions q(f_m) and q(g_u).

By reformulating F we observe that

F = −KL(q(f_m) || q⋆(f_m)) + H(q(g_u)) + const.,

where H(q(g_u)) is the information entropy of q(g_u); q⋆(f_m) is the optimal distribution since it maximizes the bound F, and it satisfies, given the normalizer C,

q⋆(f_m) = [p(f_m)/C] exp( ∫ p(f | f_m) p(g | g_u) q(g_u) log p(y | f, g) df dg dg_u ).  (8)

Thereafter, by substituting q⋆(f_m) back into F, we arrive at a tighter ELBO, given q(g_u) = N(g_u | µ_u, Σ_u), as

F_V = log C − KL(q(g_u) || p(g_u))
    = log N(y | 0, Q^f_nn + R_g)   [log term]
      − (1/4) Tr[Σ_g]   [trace term of g]
      − (1/2) Tr[R_g⁻¹ (K^f_nn − Q^f_nn)]   [trace term of f]
      − KL(q(g_u) || p(g_u)),   [KL term]  (9)

where the diagonal matrix R_g ∈ R^{n×n} has [R_g]_ii = e^{[µ_g]_i − [Σ_g]_ii/2}, with the mean and variance

µ_g = Ω^g_nu (µ_u − µ1) + µ1,  (10a)
Σ_g = K^g_nn − Q^g_nn + Ω^g_nu Σ_u (Ω^g_nu)^T,  (10b)

coming from q(g) = ∫ p(g | g_u) q(g_u) dg_u which approximates p(g | y). It is observed that the analytical bound F_V depends only on q(g_u) since we have “marginalized” q(f_m) out.

Let us delve further into the terms of F_V in (9):
• The log term log N(y | 0, Σ_y), where Σ_y = Q^f_nn + R_g, is analogous to that in a standard GP. It achieves a bias-variance trade-off for both f and g by penalizing model complexity and low data likelihood [1].
• The two trace terms act as a regularization to choose good inducing sets for f and g, and guard against over-fitting. It is observed that Tr[K^g_nn − Q^g_nn] and Tr[K^f_nn − Q^f_nn] represent the total variance of the training conditionals p(g | g_u) and p(f | f_m), respectively. To maximize F_V, the traces should be very small, implying that f_m and g_u are very informative (i.e., sufficient statistics, also called variational compression [56]) for f and g. Particularly, zero traces indicate that f_m = f and g_u = g, thus recovering the variational HGP (VHGP) [24]. Besides, zero traces imply that the variances of q(g_u) equal those of p(g_u | y).
• The KL term is a constraint for rationalising q(g_u). It is observed that minimizing the traces only pushes the variances of q(g_u) towards those of p(g_u | y). To let the covariances of q(g_u) rationally approximate those of p(g_u | y), the KL term penalizes q(g_u) so that it does not deviate significantly from the prior p(g_u).
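To make the structure of (9) concrete, the following Python sketch assembles the four terms from the quantities defined above. It is a naive O(n³) illustration with hypothetical names; a practical implementation would exploit the low-rank-plus-diagonal structure of Σ_y rather than forming it densely.

import numpy as np
from scipy.stats import multivariate_normal

def vshgp_elbo(y, mu0, Kf_nn_diag, Kf_nm, Kf_mm,
               Kg_nn_diag, Kg_nu, Kg_uu, mu_u, S_u):
    """Evaluate the four terms of F_V in Eq. (9); (mu_u, S_u) parameterize q(g_u)."""
    n, u = y.size, mu_u.size
    # q(g) from q(g_u), Eqs. (10a)-(10b); only the diagonal of Sigma_g is needed
    Om_g = np.linalg.solve(Kg_uu, Kg_nu.T).T            # Omega^g_nu
    mu_g = Om_g @ (mu_u - mu0) + mu0
    Sg_diag = (Kg_nn_diag - np.sum(Om_g * Kg_nu, axis=1)
               + np.sum((Om_g @ S_u) * Om_g, axis=1))
    Rg_diag = np.exp(mu_g - 0.5 * Sg_diag)              # [R_g]_ii
    # log term: log N(y | 0, Q^f_nn + R_g)
    Qf_nn = Kf_nm @ np.linalg.solve(Kf_mm, Kf_nm.T)
    log_term = multivariate_normal.logpdf(y, mean=np.zeros(n),
                                          cov=Qf_nn + np.diag(Rg_diag))
    # the two trace (regularization) terms
    trace_g = -0.25 * np.sum(Sg_diag)
    trace_f = -0.5 * np.sum((Kf_nn_diag - np.diag(Qf_nn)) / Rg_diag)
    # KL(q(g_u) || p(g_u)) with the prior p(g_u) = N(mu0 * 1, K^g_uu)
    d = mu_u - mu0
    kl = 0.5 * (np.trace(np.linalg.solve(Kg_uu, S_u))
                + d @ np.linalg.solve(Kg_uu, d) - u
                + np.linalg.slogdet(Kg_uu)[1] - np.linalg.slogdet(S_u)[1])
    return log_term + trace_g + trace_f - kl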
C. Reparameterization and inference

In order to maximize the ELBO F_V in (9), we need to infer ω = u + u(u+1)/2 free variational parameters in µ_u and Σ_u. Assume that u = 0.1n; then ω is larger than n when the training size n > 170, leading to a high-dimensional and non-trivial optimization task.

We observe that the derivatives of F_V w.r.t. µ_u and Σ_u are

∂F_V/∂µ_u = (1/2)(Ω^g_nu)^T Λ^ab_nn 1 − (K^g_uu)⁻¹(µ_u − µ1),
∂F_V/∂Σ_u = −(1/4)(Ω^g_nu)^T (Λ^ab_nn + I) Ω^g_nu + (1/2)[Σ_u⁻¹ − (K^g_uu)⁻¹],

where the diagonal matrix Λ^ab_nn = Λ^a_nn + Λ^b_nn with Λ^a_nn = (Σ_y⁻¹ y y^T Σ_y⁻¹ − Σ_y⁻¹) ⊙ R_g, Λ^b_nn = (K^f_nn − Q^f_nn) ⊙ R_g⁻¹, and the operator ⊙ being the element-wise product. Hence, it is observed that the optimal µ_u and Σ_u⁻¹ satisfy

µ_u = 0.5 K^g_un Λ^ab_nn 1 + µ1,
Σ_u⁻¹ = 0.5 (Ω^g_nu)^T (Λ^ab_nn + I) Ω^g_nu + (K^g_uu)⁻¹.

Interestingly, we find that both the optimal µ_u and Σ_u⁻¹ depend on Λ_nn = 0.5(Λ^ab_nn + I), which is a positive semi-definite diagonal matrix; see the non-negativity proof in Appendix A. Hence, we re-parameterize µ_u and Σ_u⁻¹ in terms of Λ_nn as

µ_u = K^g_un (Λ_nn − 0.5 I) 1 + µ1,  (11a)
Σ_u⁻¹ = (K^g_uu)⁻¹ + (Ω^g_nu)^T Λ_nn Ω^g_nu.  (11b)

This re-parameterization eases the model inference by (i) reducing the number of variational parameters from ω to n, and (ii) limiting the new variational parameters Λ_nn to be non-negative, thus narrowing the search space.

So far, the bound F_V depends on the variational parameters Λ_nn, the kernel parameters θ_f and θ_g, the mean parameter µ for g, and the inducing points X_m and X_u. We maximize F_V to infer all these hyperparameters ζ = {Λ_nn, θ_f, θ_g, µ, X_m, X_u} jointly for model selection. This non-linear optimization task can be solved via conjugate gradient descent (CGD), since the derivatives of F_V w.r.t. these hyperparameters have closed forms; see Appendix B.
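The re-parameterization (11) is straightforward to code; a minimal sketch (names are illustrative), where lam holds the n non-negative diagonal entries of Λ_nn that, in practice, are stored and optimized in log form:

import numpy as np

def q_gu_from_lambda(lam, Kg_nu, Kg_uu, mu0):
    """Recover (mu_u, Sigma_u) of q(g_u) from Lambda_nn via Eqs. (11a)-(11b).

    lam is the (n,) diagonal of Lambda_nn; mu0 is the prior mean of g.
    """
    mu_u = Kg_nu.T @ (lam - 0.5) + mu0                  # (11a), 1-vectors absorbed
    Om_g = np.linalg.solve(Kg_uu, Kg_nu.T).T            # Omega^g_nu
    S_u_inv = np.linalg.inv(Kg_uu) + Om_g.T @ (lam[:, None] * Om_g)  # (11b)
    return mu_u, np.linalg.inv(S_u_inv)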
D. Predictive Distribution

The predictive distribution p(y_* | y, x_*) at the test point x_* is approximated as

q(y_*) = ∫ p(y_* | f_*, g_*) q(f_*) q(g_*) df_* dg_*.  (12)

As for q(f_*) = ∫ p(f_* | f_m) q⋆(f_m) df_m, we first calculate q⋆(f_m) = N(f_m | µ⋆_f, Σ⋆_f) from (8) as

µ⋆_f = K^f_mm K_R⁻¹ K^f_mn R_g⁻¹ y,  Σ⋆_f = K^f_mm K_R⁻¹ K^f_mm,  (13)

where K_R = K^f_mn R_g⁻¹ K^f_nm + K^f_mm. Using q⋆(f_m), we have q(f_*) = N(f_* | µ_f*, σ²_f*) with

µ_f* = k^f_*m K_R⁻¹ K^f_mn R_g⁻¹ y,  (14a)
σ²_f* = k^f_** − k^f_*m (K^f_mm)⁻¹ k^f_m* + k^f_*m K_R⁻¹ k^f_m*.  (14b)

It is interesting to find in (14b) that the correction term k^f_*m K_R⁻¹ k^f_m* contains the heteroscedasticity information from the noise term R_g. Hence, q(f_*) produces heteroscedastic variances over the input domain; see an illustrative example in Fig. 2(b). The heteroscedastic σ²_f(x_*) (i) eases the learning of g, and (ii) plays an auxiliary role, since the heteroscedasticity is mainly explained by g. Also, our VSHGP is believed to produce a better prediction mean µ_f(x_*) through the interaction between f and g in (14a).

Similarly, we have the predictive distribution q(g_*) = ∫ p(g_* | g_u) q(g_u) dg_u = N(g_* | µ_g*, σ²_g*) where, given K_Λ = K^g_un Λ_nn K^g_nu + K^g_uu,

µ_g* = k^g_*u (K^g_uu)⁻¹ (µ_u − µ1) + µ,  (15a)
σ²_g* = k^g_** − k^g_*u (K^g_uu)⁻¹ k^g_u* + k^g_*u K_Λ⁻¹ k^g_u*.  (15b)

Finally, using the posteriors q(f_*) and q(g_*), and the likelihood p(y_* | f_*, g_*) = N(y_* | f_*, e^{g_*}), we have

q(y_*) = ∫ N(y_* | µ_f*, e^{g_*} + σ²_f*) N(g_* | µ_g*, σ²_g*) dg_*,  (16)

which is intractable and non-Gaussian. However, the integral can be approximated up to several digits using Gauss-Hermite quadrature, resulting in the mean and variance as [24]

µ_* = µ_f*,  σ²_* = σ²_f* + e^{µ_g* + σ²_g*/2}.  (17)

In the final prediction variance, the variance σ²_f* represents the uncertainty about f due to data density, and it approaches zero with increasing n; the exponential term implies the intrinsic heteroscedastic noise uncertainty.

It is notable that the unifying VSHGP includes VHGP [24] and the variational sparse GP (VSGP) [31] as special cases: when f_m = f and g_u = g, VSHGP recovers VHGP; when q(g_u) = p(g_u), i.e., we are now facing a homoscedastic regression task, VSHGP degenerates to VSGP.

Overall, by introducing inducing sets for both f and g, VSHGP is equipped with the means to handle large-scale heteroscedastic regression. However, (i) the current time complexity O(nm² + nu²), which is linear with the training size, makes VSHGP still unaffordable for, e.g., millions of data points; and (ii) as a global approximation, the capability of VSHGP is limited by the small and global inducing sets. To this end, we introduce below two strategies to further improve the scalability and capability of VSHGP.

This happens when g is learned well; see the numerical experiments below.
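Before moving on, the prediction step can be illustrated as follows: a sketch (for a single test point, with hypothetical argument names) of the closed-form moments (17), plus the Gauss-Hermite approximation of the non-Gaussian density (16) when the full distribution is required:

import numpy as np

def predict_y(mu_f, var_f, mu_g, var_g, y_grid=None, n_quad=20):
    """Predictive moments of q(y*) via Eq. (17), and optionally its density
    via Gauss-Hermite quadrature on Eq. (16); inputs are the posterior
    moments of f* and g* at one test point."""
    mean_y = mu_f                                  # mu* = mu_f*
    var_y = var_f + np.exp(mu_g + 0.5 * var_g)     # sigma2* = sigma2_f* + e^{mu_g*+sigma2_g*/2}
    if y_grid is None:
        return mean_y, var_y
    # density: integrate N(y | mu_f, e^g + var_f) over g ~ N(mu_g, var_g)
    t, w = np.polynomial.hermite.hermgauss(n_quad)  # nodes/weights for weight e^{-t^2}
    g = mu_g + np.sqrt(2.0 * var_g) * t             # change of variables
    var_cond = np.exp(g) + var_f                    # (n_quad,) conditional variances
    dens = (np.exp(-0.5 * (y_grid[:, None] - mu_f)**2 / var_cond)
            / np.sqrt(2.0 * np.pi * var_cond))
    return mean_y, var_y, dens @ w / np.sqrt(np.pi)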
III. STOCHASTIC VSHGP

To further improve the scalability of VSHGP, the variational distribution q(f_m) = N(f_m | µ_m, Σ_m) is re-introduced to use the original bound F = ∫ q(z) log [p(z, y)/q(z)] dz in (7). Given the factorized likelihood p(y | f, g) = ∏_{i=1}^n p(y_i | f_i, g_i), the ELBO F is

F = Σ_{i=1}^n E_{q(f_i) q(g_i)}[log p(y_i | f_i, g_i)] − KL[q(f_m) || p(f_m)] − KL[q(g_u) || p(g_u)],  (18)

where the first expectation term is expressed as

Σ_{i=1}^n [ log N(y_i | [µ_f]_i, [R_g]_ii) − (1/4)[Σ_g]_ii − (1/2)[Σ_f R_g⁻¹]_ii ],

with µ_f = Ω^f_nm µ_m and Σ_f = K^f_nn − Q^f_nn + Ω^f_nm Σ_m (Ω^f_nm)^T. The new F is a relaxed version of F_V in (9). It is found that the derivatives satisfy

∂F/∂µ_m = (Ω^f_nm)^T R_g⁻¹ (y − Ω^f_nm µ_m) − (K^f_mm)⁻¹ µ_m,
∂F/∂Σ_m = −(1/2)(Ω^f_nm)^T R_g⁻¹ Ω^f_nm + (1/2)(Σ_m⁻¹ − (K^f_mm)⁻¹).  (19)

Letting the gradients be zero, we recover the optimal solution q⋆(f_m) in (13), indicating that F_V ≥ F with equality at q(f_m) = q⋆(f_m).

The scalability is improved by F through the first term on the right-hand side of (18), which factorizes over data points. The sum form allows using efficient stochastic gradient descent (SGD), e.g., Adam [57], with a mini-batch mode for big data. Specifically, we choose a random subset B ⊆ {1, ..., n} to have an unbiased estimation of F as

F̃ = (n/|B|) Σ_{i∈B} E_{q(f_i) q(g_i)}[log p(y_i | f_i, g_i)] − KL[q(f_m) || p(f_m)] − KL[q(g_u) || p(g_u)],  (20)

where |B| ≪ n is the mini-batch size. More efficiently, since the two variational distributions are defined in terms of KL divergence, we could optimize them along the natural gradients instead of the Euclidean gradients; see Appendix C.

Finally, the predictions of the stochastic VSHGP (SVSHGP) follow (17), with the predictions replaced as

µ_f* = k^f_*m (K^f_mm)⁻¹ µ_m,
σ²_f* = k^f_** − k^f_*m (K^f_mm)⁻¹ (K^f_mm − Σ_m)(K^f_mm)⁻¹ k^f_m*,

and

µ_g* = k^g_*u (K^g_uu)⁻¹ (µ_u − µ1) + µ,
σ²_g* = k^g_** − k^g_*u (K^g_uu)⁻¹ (K^g_uu − Σ_u)(K^g_uu)⁻¹ k^g_u*.

Compared to the deterministic VSHGP, the stochastic variant greatly reduces the time complexity from O(nm² + nu²) to O(|B|m² + |B|u² + m³ + u³), at the cost of requiring many more optimization efforts in the enlarged probabilistic space. Besides, the capability of SVSHGP, akin to VSHGP, is still limited by the finite number of global inducing points.

Compared to VSHGP, the SVSHGP cannot re-parameterize µ_u and Σ_u, and has to infer m + m(m+1)/2 more variational parameters in µ_m and Σ_m.
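Since only the first term of (20) touches the data, one stochastic step reduces to rescaling a mini-batch sum by n/|B|; a sketch, where expectation_term, kl_fm and kl_gu are hypothetical helpers standing for the three parts of (18):

import numpy as np

def minibatch_elbo(expectation_term, kl_fm, kl_gu, n, batch_idx):
    """Unbiased mini-batch estimate of the stochastic bound, Eq. (20).

    expectation_term(i) returns E_{q(f_i)q(g_i)}[log p(y_i | f_i, g_i)] for one
    training point; the data sum is rescaled by n/|B|, while the two KL terms
    are computed once per step since they do not depend on the data.
    """
    scale = n / len(batch_idx)
    data_term = scale * sum(expectation_term(i) for i in batch_idx)
    return data_term - kl_fm - kl_gu

# one SGD step samples the mini-batch uniformly from {0, ..., n-1}:
# batch_idx = np.random.choice(n, size=batch_size, replace=False)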
IV. DISTRIBUTED VSHGP

To further improve the scalability and capability of VSHGP via many inducing points, the distributed variant named DVSHGP proposes to combine VSHGP with local approximations, e.g., the Bayesian committee machine (BCM) [39], [40], to enable distributed learning and capture local variety.
A. Training experts with hybrid parameters
We first partition the training data D into M subsets D_i = {X_i, y_i}, 1 ≤ i ≤ M. Then, we train a VSHGP expert M_i on D_i by using the relevant inducing sets X_{m_i} and X_{u_i}. Particularly, to obtain computational gains, an independence assumption is posed on all the experts {M_i}_{i=1}^M such that log p(y; X, ζ) is decomposed into the sum of M individuals

log p(y; X, ζ) ≈ Σ_{i=1}^M log p(y_i; X_i, ζ_i) ≥ Σ_{i=1}^M F_{V_i},  (21)

where ζ_i is the hyperparameters in M_i, and F_{V_i} = F(q(g_{u_i})) is the ELBO of M_i. The factorization (21) calculates the inversions efficiently as (K^f_mm)⁻¹ ≈ diag[{(K^f_{m_i m_i})⁻¹}_{i=1}^M] and (K^g_uu)⁻¹ ≈ diag[{(K^g_{u_i u_i})⁻¹}_{i=1}^M].

We train these VSHGP experts with hybrid parameters. Specifically, the BCM-type aggregation requires sharing the priors p(f_*) and p(g_*) over experts. That means we should share the hyperparameters, including θ_f, θ_g and µ, across experts. These global parameters are beneficial for guarding against over-fitting [40], at the cost however of degrading the capability. Hence, we leave the variational parameters Λ_{n_i n_i} and the inducing points X_{m_i} and X_{u_i} for each expert to infer them individually. These local parameters improve capturing local variety by (i) pushing q(g_{u_i}) towards the posterior p(g_{u_i} | y_i) of M_i, and (ii) using many inducing points.

Besides, because of the local parameters, we prefer partitioning the data into disjoint experts rather than random experts like [40]. The disjoint partition using clustering techniques produces local and separate experts which are desirable for learning the relevant local parameters, as sketched below. In contrast, the random partition, which assigns points randomly to the subsets, provides global and overlapped experts for which it is difficult to estimate the local parameters well. For instance, when DVSHGP uses random experts on the toy example below, it fails to capture the heteroscedastic noise.

Finally, supposing that each expert has the same training size n' = n/M, the training complexity for an expert is O(n'm'² + n'u'²), where m' is the inducing size for f_i and u' the inducing size for g_i. Due to the M local experts, DVSHGP naturally offers parallel/distributed training, hence reducing the time complexity of VSHGP by a factor ideally close to the number of machines when m' = m and u' = u.
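A minimal sketch of the preferred disjoint partition (using scikit-learn's k-means purely for illustration; the released DVSHGP code is MATLAB-based):

import numpy as np
from sklearn.cluster import KMeans

def partition_experts(X, y, M, seed=0):
    """Disjoint k-means partition of the training data into M local subsets.

    Each subset later receives its own inducing points and variational
    parameters (the local parameters), while the kernel parameters
    (theta_f, theta_g) and the mean mu stay shared across experts.
    """
    labels = KMeans(n_clusters=M, random_state=seed, n_init=10).fit_predict(X)
    return [(X[labels == i], y[labels == i]) for i in range(M)]

# e.g., experts = partition_experts(X, y, M=5) mirrors the toy example below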
B. Aggregating experts' predictions

For each VSHGP expert M_i, we obtain the predictive distribution q_i(y_*) with the means {µ_{f_i*}, µ_{g_i*}, µ_{i*}} and variances {σ²_{f_i*}, σ²_{g_i*}, σ²_{i*}}. Thereafter, we combine the experts' predictions together to perform the final prediction by, for example, the robust BCM (RBCM) aggregation, which naturally supports distributed/parallel computing [40], [58].

The key to the success of aggregation is that we do not directly combine the experts' predictions {µ_{i*}, σ²_{i*}}_{i=1}^M. This is because (i) the RBCM aggregation of {q_i(y_*)}_{i=1}^M produces an inconsistent prediction variance with increasing n and M [41]; and (ii) the predictive distribution q_i(y_*) in (16) is non-Gaussian. To have a meaningful prediction variance, which is crucial for heteroscedastic regression, we perform the aggregation for the latent functions f and g, respectively. This is because the prediction variances of f and g approach zero with increasing n, which agrees with the property of RBCM.

We first have the aggregated predictive distribution for f_*, given the prior p(f_*) = N(f_* | 0, σ²_{f**} ≜ k^f_**), as

p_A(f_* | y, x_*) = ∏_{i=1}^M [q_i(f_*)]^{w^f_{i*}} / [p(f_*)]^{Σ_i w^f_{i*} − 1},  (22)

with the mean and variance expressed, respectively, as

µ_{fA*} = σ²_{fA*} Σ_{i=1}^M w^f_{i*} σ⁻²_{f_i*} µ_{f_i*},
σ⁻²_{fA*} = Σ_{i=1}^M w^f_{i*} σ⁻²_{f_i*} + (1 − Σ_{i=1}^M w^f_{i*}) σ⁻²_{f**},

where the weight w^f_{i*} ≥ 0 represents the contribution of M_i at x_* for f, and is defined as the difference in the differential entropy between the prior p(f_*) and the posterior q_i(f_*) as w^f_{i*} = 0.5(log σ²_{f**} − log σ²_{f_i*}). Similarly, for g, which explicitly considers a prior mean µ, the aggregated predictive distribution is

p_A(g_* | y, x_*) = ∏_{i=1}^M [q_i(g_*)]^{w^g_{i*}} / [p(g_*)]^{Σ_i w^g_{i*} − 1},  (23)

with the mean and variance expressed, respectively, as

µ_{gA*} = σ²_{gA*} [ Σ_{i=1}^M w^g_{i*} σ⁻²_{g_i*} µ_{g_i*} + (1 − Σ_{i=1}^M w^g_{i*}) σ⁻²_{g**} µ ],
σ⁻²_{gA*} = Σ_{i=1}^M w^g_{i*} σ⁻²_{g_i*} + (1 − Σ_{i=1}^M w^g_{i*}) σ⁻²_{g**},

where σ⁻²_{g**} is the prior precision of g_*, and w^g_{i*} = 0.5(log σ²_{g**} − log σ²_{g_i*}) is the weight of M_i at x_* for g. Thereafter, as shown in Fig. 1, the final predictions, akin to (17), are combined as

µ^A_* = µ_{fA*},  σ²^A_* = σ²_{fA*} + e^{µ_{gA*} + σ²_{gA*}/2}.  (24)

The hierarchical and localized computation structure enables (i) large-scale regression via distributed computations, and (ii) flexible approximation of slow-/quick-varying features by local experts and many inducing points (up to the training size n). The aggregation for f and g at a test point is sketched below.

Note that the generalized RBCM strategy [41], which provides consistent predictions, is however not favored by our DVSHGP with local parameters, since it requires (i) the experts' predictions to be Gaussian and (ii) an additional global base expert.

Fig. 1. Hierarchy of the proposed DVSHGP model.
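A sketch of the aggregation (22)-(24) at one test point; rbcm_aggregate is our illustrative helper, applied once to f (zero prior mean) and once to g (prior mean µ):

import numpy as np

def rbcm_aggregate(mus, vars_, prior_var, prior_mean=0.0):
    """(R)BCM aggregation of M experts for one latent function at a test point.

    mus/vars_ are length-M arrays of expert means/variances, prior_var is
    k(x*, x*); the weights are the differential-entropy differences between
    the prior and each expert's posterior, as in Eqs. (22)-(23).
    """
    w = 0.5 * (np.log(prior_var) - np.log(vars_))
    prec = np.sum(w / vars_) + (1.0 - np.sum(w)) / prior_var
    var_agg = 1.0 / prec
    mu_agg = var_agg * (np.sum(w * mus / vars_)
                        + (1.0 - np.sum(w)) * prior_mean / prior_var)
    return mu_agg, var_agg

# aggregate f and g separately, then combine via Eq. (24):
# mu_star = mu_fA;  var_star = var_fA + np.exp(mu_gA + 0.5 * var_gA)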
Finally, we illustrate DVSHGP on a heteroscedastic toy example expressed as

y(x) = sinc(x) + ε,  x ∈ [−10, 10],  (25)

where ε ~ N(0, σ_ε²(x)) and σ_ε(x) = 0.05 + 0.2(1 + sin(2x))/(1 + e^{−0.2x}). We draw 500 training points from (25), and use the k-means technique to partition them into five disjoint subsets. We then employ ten inducing points for both f_i and g_i of the VSHGP expert M_i, 1 ≤ i ≤ 5. Fig. 2 shows that (i) DVSHGP can efficiently employ up to 100 inducing points for modeling through five local experts, and (ii) DVSHGP successfully describes the underlying function f and the heteroscedastic log noise variance g.

Fig. 2. Illustration of DVSHGP on the toy example. The crosses marked in different colors in (a) represent the M = 5 subsets. The top circles and bottom squares marked in different colors represent the optimized positions of inducing points for {f_i}_{i=1}^M and {g_i}_{i=1}^M, respectively. The red curves present the prediction mean, whereas the black curves represent the 95% confidence interval of the prediction mean.

V. DISCUSSIONS REGARDING IMPLEMENTATION
A. Implementation of DVSHGP
As for DVSHGP, we should infer (i) the global parameters including the kernel parameters θ_f and θ_g, and the mean µ; and (ii) the local parameters including the variational parameters Λ_{n_i n_i} and the inducing parameters X_{m_i} and X_{u_i}, for the local experts {M_i}_{i=1}^M.

Notably, the variational parameters Λ_{n_i n_i} are crucial for the success of DVSHGP, since they represent the heteroscedasticity. To learn the variational parameters well, there are two issues: (i) how to initialize them and (ii) how to optimize them. As for initialization, let us focus on VSHGP, which is the foundation for the experts in DVSHGP. It is observed in (11) that Λ_nn directly determines the initialization of q(g_u) = N(g_u | µ_u, Σ_u). Given the prior g_u ~ N(g_u | µ1, K^g_uu), we intuitively place a prior mean µ on µ_u, resulting in Λ_nn = 0.5I. In contrast, if we initialize [Λ_nn]_ii with a value larger or smaller than 0.5, the cumulative term K^g_un(Λ_nn − 0.5I)1 in (11a) moves far away from zero with increasing n, leading to an improper prior mean for µ_u. As for optimization, compared to the standard GP, our DVSHGP needs to additionally infer n variational parameters and M(m' + u')d inducing parameters, which greatly enlarge the parameter space and increase the optimization difficulty. Hence, we use an alternating strategy where we first optimize the variational parameters individually to capture the heteroscedasticity roughly, followed by learning all the hyperparameters jointly; see the sketch below.

Fig. 3 depicts the inferred variational parameters varying over the training points for DVSHGP and the original VHGP [24], respectively, on the toy problem. It turns out that the variational parameters estimated by DVSHGP (i) generally agree with those of VHGP, and (ii) showcase local characteristics that are beneficial for describing local variety.

Fig. 3. The variational parameters learnt respectively by DVSHGP and VHGP on the toy problem. The crosses marked in different colors represent the variational parameters inferred for the five local experts of DVSHGP.
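A sketch of the two implementation choices above, i.e., the Λ_nn = 0.5I initialization (kept in log form for non-negativity) and the alternating optimization (the optimize wrapper is hypothetical):

import numpy as np

def init_log_lambda(n):
    # [Lambda_nn]_ii = 0.5 makes the initial mean of q(g_u) equal the prior
    # mean mu exactly, via Eq. (11a); storing logs keeps Lambda_nn non-negative.
    return np.log(0.5 * np.ones(n))

# alternating strategy (pseudo-steps; `optimize` is a hypothetical CGD wrapper):
# 1) optimize(log_lambda)                                   # heteroscedasticity first
# 2) optimize(log_lambda, theta_f, theta_g, mu, X_m, X_u)   # then all jointly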
As for SVSHGP, to effectively infer the variational parameters in q(f_m) and q(g_u), we adopt natural gradient descent (NGD), which however requires carefully tuning the step parameter γ defined in Appendix C. For the Gaussian likelihood, the optimal solution is γ = 1.0, since taking a unit step is equivalent to performing a VB update [34]. But for the stochastic case, empirical results suggest that γ should be gradually increased to some fixed value. Hence, we follow the schedule in [59]: take a small initial step γ_initial, log-linearly increase γ to its final value γ_final over five iterations, and then keep γ_final for the remaining iterations.

Thereafter, we employ a hybrid strategy, called NGD+Adam, for optimization. Specifically, we perform a step of NGD on the variational parameters with the aforementioned γ schedule, followed by a step of Adam on the remaining hyperparameters with a small fixed step.

Fig. 4 depicts the convergence histories of SVSHGP using Adam and NGD+Adam, respectively, on the toy example. We use m = u = 20 inducing points and a mini-batch size of |B| = 50. As the ground truth, the final ELBO obtained by VSHGP is provided. It is observed that (i) NGD+Adam converges faster than the pure Adam, and (ii) the stochastic optimizers finally approach the solution of the CGD used in VSHGP.
Fig. 4. Illustration of SVSHGP using Adam and NGD+Adam respectively on the toy example. The dashed line represents the final ELBO of VSHGP.
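A sketch of the step schedule for the NGD part; the endpoints g0 and g_final are placeholders following the schedule of [59], and each NGD step on the variational parameters is interleaved with an Adam step on the remaining hyperparameters:

import numpy as np

def gamma_schedule(step, g0=0.01, g_final=0.1, ramp_steps=5):
    """Log-linear ramp of the natural-gradient step from g0 to g_final over
    ramp_steps iterations, then held constant for the remaining iterations.
    The endpoint values here are assumptions, not the paper's exact settings."""
    if step >= ramp_steps:
        return g_final
    ramp = np.exp(np.linspace(np.log(g0), np.log(g_final), ramp_steps))
    return float(ramp[step])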
C. DVSHGP vs. (S)VSHGP
Compared to the global (S)VSHGP, the performance of DVSHGP is enhanced by many inducing points and the localized experts with individual variational and inducing parameters, resulting in the capability of capturing quick-varying features. To verify this, we apply DVSHGP and (S)VSHGP to the time-series solar irradiance dataset [60], which contains quick-varying and heteroscedastic features.
Fig. 5. Comparison of DVSHGP and VSHGP on the solar irradiance dataset.
In the comparison, DVSHGP employs the k-means technique to partition the 391 training points into M = 10 subsets, and uses m' = u' = 20 inducing points for each expert; (S)VSHGP employs m = u = 20 inducing points. Particularly, we initialize the length-scales in the SE kernel (3) with a pretty small value of 5.0 for k_f and k_g for (S)VSHGP on this quick-varying dataset. Fig. 5 shows that (i) DVSHGP captures the quick-varying and heteroscedastic features successfully via local experts and many inducing points; (ii) (S)VSHGP however fails due to the small set of global inducing points.

Since VSHGP and SVSHGP show similar predictions on this dataset, we only illustrate the VSHGP predictions here.
VI. NUMERICAL EXPERIMENTS

This section verifies the proposed DVSHGP and SVSHGP against existing scalable HGPs on a synthetic dataset and four real-world datasets. The comparison includes (i) GPVC [8], (ii) the distributed variant of PIC (dPIC) [33], (iii) FITC [49], and (iv) the SoD based empirical HGP (EHSoD) [13]. Besides, the comparison also employs the homoscedastic VSGP [31] and RBCM [40] to showcase the benefits brought by the consideration of heteroscedasticity.

We implement DVSHGP, FITC, EHSoD, VSGP and RBCM with the GPML toolbox [54], and implement SVSHGP with the GPflow toolbox [52]; we use the published GPVC codes available at https://github.com/OxfordML/GPz and the dPIC codes available at https://github.com/qminh93/dSGP ICML16. They are executed on a personal computer with four 3.70 GHz cores and 16 GB RAM for the synthetic and three medium-sized datasets, and on a Linux workstation with eight 3.20 GHz cores and 32 GB memory for the large-scale dataset.

All the GPs employ the SE kernel in (3). Normalization is performed for both X and y to have zero mean and unit variance before training. Finally, we use n_* test points {X_*, y_*} to assess the model accuracy by the standardized mean square error (SMSE) and the mean standardized log loss (MSLL) [1]. The SMSE quantifies the discrepancy between the predictions and the exact observations; particularly, it equals one when the model always predicts the mean of y. Moreover, the MSLL quantifies the quality of the predictive distribution, and is negative for better models; particularly, it equals zero when the model always predicts the mean and variance of y. Both criteria are sketched below.
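For reference, the two criteria can be computed as in the following sketch of the definitions in [1]; the trivial baselines built from the mean and variance of the observations pin SMSE = 1 and MSLL = 0 for a trivial model:

import numpy as np

def smse(y_true, mu_pred):
    """Standardized mean square error: the MSE normalized by the variance of
    the test observations, so always predicting the mean of y gives 1."""
    return np.mean((y_true - mu_pred)**2) / np.var(y_true)

def msll(y_true, mu_pred, var_pred, y_train):
    """Mean standardized log loss: the negative predictive log density,
    centred by the loss of a trivial Gaussian fitted to the training data,
    so better models yield negative values."""
    nll = 0.5 * np.log(2*np.pi*var_pred) + 0.5*(y_true - mu_pred)**2 / var_pred
    nll0 = (0.5 * np.log(2*np.pi*np.var(y_train))
            + 0.5*(y_true - np.mean(y_train))**2 / np.var(y_train))
    return np.mean(nll - nll0)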
A. Synthetic dataset

We employ a 2D version of the toy example (25) as y(x) = f(x) + ε(x), x ∈ [−10, 10]², with the highly nonlinear latent function f(x) = sinc(0.1 x₁x₂) and the noise ε(x) = N(0, σ_ε²(0.1 x₁x₂)). We randomly generate 10,000 training points and evaluate the model accuracy on 4,900 grid test points. We generate ten instances of the training data such that each model is repeated ten times.

We have M = 50 and m' = u' = 100 for DVSHGP, resulting in n' = 200 data points assigned to each expert; we have m_b = 300 basis functions for GPVC; we have m = 300 for SVSHGP, FITC and VSGP; we have m = 300 and M = 50 for dPIC; we have M = 50 for RBCM; finally, we train two separate GPs on a subset of size m_sod = 2,000 for EHSoD.

As for optimization, DVSHGP adopts a two-stage process: it first only optimizes the variational parameters using CGD with up to 30 line searches, and then learns all the parameters jointly using up to 70 line searches; SVSHGP trains with NGD+Adam using |B| = 1,000 over 1,000 iterations; VSGP, FITC, GPVC and RBCM use up to 100 line searches to learn the parameters; dPIC employs the default optimization settings in the published codes; and finally EHSoD uses up to 50 line searches to train the two standard GPs, respectively.

Fig. 6 depicts the modeling results of the different GPs over ten runs on the synthetic sinc2D dataset. The horizontal axis represents the sum of training and predicting time for a model. It turns out that DVSHGP, SVSHGP, dPIC, VSGP and RBCM are competitive in terms of SMSE; but DVSHGP and SVSHGP perform better in terms of MSLL due to the well estimated heteroscedastic noise.

Since the dPIC codes only provide the prediction mean, we did not offer its MSLL value as well as the estimated noise variance in the following plots.
Fig. 6. Comparison of GPs over ten runs on the synthetic dataset.

Compared to the homoscedastic VSGP and RBCM, the FITC has heteroscedastic variances, which is indicated by the lower MSLL, at the cost however of (i) sacrificing the accuracy of the prediction mean, and (ii) suffering from an invalid noise variance σ_ε². As a principled HGP, GPVC performs slightly better than FITC in terms of MSLL. Finally, the simple EHSoD has the worst SMSE; but it outperforms the homoscedastic VSGP and RBCM in terms of MSLL.

In terms of efficiency, the RBCM requires less computing time since it contains no variational/inducing parameters, thus resulting in (i) lower complexity, and (ii) early stop for optimization. This also happens for the three datasets below.

Finally, Fig. 7 depicts the prediction variances of all the GPs except dPIC in comparison to the exact σ_ε² on the synthetic dataset. It is first observed that the homoscedastic VSGP and RBCM are unable to describe the complex noise variance: they yield a nearly constant variance over the input domain. In contrast, DVSHGP, SVSHGP and GPVC capture the varying noise variance accurately by using an additional noise process g; FITC also captures the profile of the exact σ_ε², with however unstable peaks and valleys; EHSoD is found to capture a rough expression of the exact σ_ε².

B. Medium real-world datasets
This section conducts the comparison on three real-world datasets. The first is the 9D protein dataset [61] with 45,730 data points. This dataset, taken from CASP 5-9, describes the physicochemical properties of protein tertiary structure. The second is the 21D sarcos dataset [1], which relates to the inverse kinematics of a robot arm and has 48,933 data points. The third is the 3D 3droad dataset, which comprises 434,874 data points [62] extracted from a 2D road network in North Jutland, Denmark, plus elevation information.
1) The protein dataset:
For the protein dataset, we randomly choose 35,000 training points and 10,730 test points. In the comparison, we have M = 100 (i.e., n' = 350) and m' = u' = 175 for DVSHGP; we have m = 400 for SVSHGP, VSGP and FITC; we have m = 400 and M = 100 for dPIC; we have m_b = 400 for GPVC; we have M = 100 for RBCM; and finally we have m_sod = 4,000 for EHSoD. As for optimization, SVSHGP trains with NGD+Adam using |B| = 2,000 over 2,000 iterations. The optimization settings of the other GPs keep consistent with those for the synthetic dataset.

Fig. 7. The exact noise variance and the prediction variances of different heteroscedastic/homoscedastic GPs on the synthetic dataset.

The results of the different GPs over ten runs are summarized in Fig. 8. Among the HGPs, it is observed that dPIC outperforms the others in terms of SMSE, followed by DVSHGP. On the other hand, DVSHGP performs the best in terms of MSLL, followed by FITC and SVSHGP. The simple EHSoD is found to produce unstable MSLL results because of the small subset. Finally, the homoscedastic VSGP and RBCM provide mediocre SMSE and MSLL results.

Fig. 8. Comparison of GPs over ten runs on the protein dataset.

Next, Fig. 9 offers insights into the distributions of log noise variances of all the GPs except dPIC on the protein dataset for a single run. Note that (i) as homoscedastic GPs, the log noise variances of VSGP and RBCM are marked as dash and dot lines, respectively; and (ii) we plot the variance of p(f_* | D, x_*) for FITC since (a) it accounts for the heteroscedasticity and (b) the scalar noise variance σ_ε² is severely underestimated. The results in Fig. 9 indicate that the protein dataset may contain heteroscedastic noise. Besides, compared to the VSGP which uses a global inducing set, the localized RBCM provides a more compact estimation of σ_ε². This compact noise variance, which has also been observed on the two datasets below, brings lower MSLL for RBCM.

FITC estimates σ_ε² as 0.0030, while VSGP estimates it as 0.0309.

Fig. 9. The distributions of log noise variances estimated by different GPs on the protein dataset. The dash and dot lines indicate the log noise variances of VSGP and RBCM, respectively.

Furthermore, we clearly observe the interaction between f and g for DVSHGP, SVSHGP and GPVC. The small MSLL of RBCM suggests that the protein dataset may own small noise variances at some test points. Hence, the localized DVSHGP, which is enabled to capture the local variety through individual variational and inducing parameters for each expert, produces a longer tail in Fig. 9. The well estimated heteroscedastic noise in turn improves the prediction mean of DVSHGP through the interaction between f and g. In contrast, due to the limited global inducing set, the prediction mean of SVSHGP and GPVC is traded for capturing heteroscedastic noise.

Notably, the performance of sparse GPs is affected by their modeling parameters, e.g., the inducing sizes m, m', u and u', the number of basis functions m_b, and the subset size m_sod. Fig. 10(a) and (b) depict the average results of the sparse GPs over ten runs using different parameters. Particularly, we investigate the impact of the subset size n' on DVSHGP in Fig. 10(c) using m' = u' = 0.5n'. It is found that DVSHGP favours large n' (small M) and large m' and u'. Similarly, VSGP and FITC favour more inducing points. However, dPIC offers an unstable SMSE performance with increasing m; GPVC performs slightly worse with increasing m_b in terms of both SMSE and MSLL, which has also been observed in the original paper [8], and may be caused by the sharing of basis functions for f and g. Finally, EHSoD showcases poorer MSLL values for larger m_sod, because of the difficulty of approximating the empirical variances.

Fig. 10. The effect of algorithmic parameters on the performance of different sparse GPs on the protein dataset.

Fig. 11. Comparison of GPs over ten runs on the sarcos dataset.

Fig. 12. The distributions of log noise variances of different GPs on the sarcos dataset. The dash and dot lines indicate the log noise variances of VSGP and RBCM, respectively.
2) The sarcos dataset:
For the sarcos dataset, we randomly choose 40,000 training points and 8,933 test points. In the comparison, we have M = 120 (i.e., n' ≈ 333) and m' = u' = 175 for DVSHGP; we have m = 600 for SVSHGP, VSGP and FITC; we have m = 600 and M = 120 for dPIC; we have m_b = 600 for GPVC; we have M = 120 for RBCM; and finally we have m_sod = 4,000 for EHSoD. The optimization settings are the same as before.

The results of the different GPs over ten runs on the sarcos dataset are depicted in Fig. 11. Besides, Fig. 12 depicts the log noise variances of the GPs on this dataset. Different from the protein dataset, the sarcos dataset seems to have weak heteroscedastic noise across the input domain, which is verified by the facts that (i) the noise variance of DVSHGP is nearly a constant, and (ii) DVSHGP agrees with RBCM in terms of both SMSE and MSLL. Hence, all the HGPs except EHSoD perform similarly in terms of MSLL.

In addition, the weak heteroscedasticity in the sarcos dataset reveals that we can use only a few inducing points for g to speed up the inference. For instance, we retrain DVSHGP using u' = 5. This extremely small inducing set for g brings (i) much less computing time of about 350 seconds, and (ii) almost the same model accuracy, with SMSE = 0.0099 and MSLL = -2.3034.
3) The 3droad dataset:
Finally, for the 3droad dataset, we randomly choose 390,000 training points, and use the remaining 44,874 data points for testing. In the comparison, we have M = 800 (i.e., n' ≈ 490) and m' = u' = 250 for DVSHGP; we have m = 500 for SVSHGP, VSGP and FITC; we have m = 500 and M = 800 for dPIC; we have m_b = 500 for GPVC; we have M = 800 for RBCM; and finally we have m_sod = 8,000 for EHSoD. As for optimization, SVSHGP trains with NGD+Adam using |B| = 4,000; the settings of the other GPs keep the same as before.

The results of the different GPs over ten runs on the 3droad dataset are depicted in Fig. 13. It is observed that DVSHGP outperforms the others in terms of both SMSE and MSLL, followed by RBCM. For the other HGPs, especially SVSHGP and GPVC, the relatively poor noise variance (large MSLL) in turn sacrifices the accuracy of the prediction mean. Even so, the heteroscedastic noise helps SVSHGP, GPVC and FITC perform similarly to VSGP in terms of MSLL.

Fig. 13. Comparison of GPs over ten runs on the 3droad dataset.

In addition, Fig. 14 depicts the log noise variances of these GPs on the 3droad dataset. The highly accurate prediction mean of DVSHGP helps well estimate the heteroscedastic noise. It is observed that (i) the noise variances estimated by DVSHGP are more compact than those of the other HGPs; and (ii) the average noise variance agrees with that of RBCM.

Fig. 14. The distributions of log noise variances of different GPs on the 3droad dataset. The dash and dot lines indicate the log noise variances of VSGP and RBCM, respectively.

Finally, the results from the 3droad dataset together with the other two datasets indicate that:
• the well estimated noise variance of HGPs in turn improves the prediction mean via the interaction between f and g; otherwise, it may sacrifice the prediction mean;
• the heteroscedastic noise usually improves sparse HGPs over the homoscedastic VSGP in terms of MSLL.

C. Large real-world dataset
The final section evaluates the performance of the different GPs on the 11D electric dataset, which is partitioned into two million training points and 49,280 test points. The HGPs in the comparison include DVSHGP, SVSHGP, dPIC and EHSoD; GPVC and FITC are unaffordable for this massive dataset. Besides, the RBCM and the stochastic variant of VSGP, named SVGP [34], are employed for this big dataset.

The dataset is available at https://archive.ics.uci.edu/ml/index.php. Besides, the stochastic variant of FITC [35] is not included, since it does not support end-to-end training.

In the comparison, we have M = 2,000 (i.e., n' = 1,000) and m' = u' = 300 for DVSHGP; we have m = 2,000 for SVSHGP and SVGP; we have m = 2,000 and M = 2,000 for dPIC; we have M = 2,000 for RBCM; and finally we have m_sod = 15,000 for EHSoD. As for optimization, SVSHGP trains with NGD+Adam using |B| = 5,000 over 10,000 iterations; the optimization settings of the other GPs keep the same as before.

TABLE I
COMPARISON OF GPS ON THE electric DATASET

         DVSHGP   SVSHGP   dPIC     EHSoD    SVGP     RBCM
SMSE     0.0020   0.0029   0.0042   0.0103   0.0028   0.0023
MSLL     -3.4456  -3.1207  -        -1.9453  -2.8489  -3.0647
t [h]    11.05    7.44     47.22    4.72     3.55     3.97

The average results over five runs in Table I indicate that DVSHGP outperforms the others in terms of both SMSE and MSLL, followed by SVSHGP. The simple EHSoD provides the worst performance, and cannot be improved by using a larger m_sod due to the memory limit of the current infrastructure. Additionally, in terms of efficiency, we find that (i) SVSHGP is better than DVSHGP due to the parallel/GPU acceleration deployed in Tensorflow; (ii) SVGP is better than SVSHGP because of lower complexity; and (iii) the huge computing time of dPIC might be incurred by the unoptimized codes.

Further GPU speedup could be utilized for DVSHGP in Matlab.

Finally, due to the distributed framework, Fig. 15(a) depicts the total computing time of DVSHGP using different numbers of processing cores. It is observed that DVSHGP using eight cores achieves a speedup of around 3.5 in comparison to the centralized counterpart. Fig. 15(b) also exploits the performance of SVSHGP using a varying mini-batch size |B|. It is observed that (i) a small |B| significantly speeds up the model training, and (ii) different mini-batch sizes yield similar SMSE and MSLL here, because the model has been optimized over sufficient iterations.

Fig. 15. Illustration of (a) the computing time of DVSHGP vs. the number of parallel cores, and (b) the performance of SVSHGP, from left to right, using |B| = 1,000, 2,500 and 5,000 on the electric dataset.

VII. DISCUSSIONS AND CONCLUSIONS
In order to scale up the original HGP [16], we have presented distributed and stochastic variational sparse HGPs. The proposed SVSHGP improves the scalability through stochastic variational inference. The proposed DVSHGP (i) enables large-scale regression via distributed computations, and (ii) achieves high model capability via localized experts and many inducing points. We compared them to existing scalable homoscedastic/heteroscedastic GPs on a synthetic dataset and four real-world datasets. The comparative results indicated that DVSHGP exhibits superior performance in terms of both SMSE and MSLL; while due to the limited global inducing set, SVSHGP may sacrifice the prediction mean for capturing heteroscedastic noise.

Apart from our scalable HGP framework, there are some new GP paradigms developed recently from different perspectives for improving the predictive distribution. For instance, instead of directly modeling the heteroscedastic noise, we could introduce additional latent inputs to modulate the kernel [63], [64]; or we directly target the interested posterior distribution to enhance the prediction of f [65]; or we adopt highly flexible priors, e.g., implicit processes, over functions [66]; or we mix various GPs during the inference [67], [68]; or we develop specific non-stationary kernels [69]. They bring new interpretations at the cost however of losing the direct description of heteroscedasticity or raising complicated inference with high complexity. But all these paradigms together with our scalable HGPs greatly enable future exploitation of fitting the data distribution with high quality and efficiency.

Finally, our future work will consider the heteroscedasticity in the underlying function f, i.e., the non-stationarity, such as [18], [22], [67]. The integration of various kinds of heteroscedasticity is believed to improve predictions.

APPENDIX A
NON-NEGATIVITY OF Λ_nn

We know that the variational diagonal matrix Λ_nn expresses

Λ_nn = 0.5(Λ^a_nn + Λ^b_nn + I).

In order to prove the non-negativity of Λ_nn, we should figure out the non-negativity of the diagonal elements of Λ^a_nn + I and Λ^b_nn, respectively.

Firstly, the diagonal elements of Λ^b_nn write

[Λ^b_nn]_ii = [(K^f_nn − Q^f_nn) R_g⁻¹]_ii,  1 ≤ i ≤ n,

where the diagonal elements of R_g⁻¹ satisfy [R_g⁻¹]_ii = e^{[Σ_g]_ii/2 − [µ_g]_i} > 0, 1 ≤ i ≤ n; and the diagonal elements of K^f_nn − Q^f_nn are the variances of the training conditional p(f | f_m). Therefore, Λ^b_nn has non-negative diagonal elements.

Secondly, the diagonal elements of Λ^a_nn + I write, given β_n = Σ_y⁻¹ y,

[Λ^a_nn + I]_ii = [β_n β_n^T R_g − Σ_y⁻¹ R_g + I]_ii,  1 ≤ i ≤ n.

For β_n β_n^T R_g, the diagonal elements are non-negative. For I − Σ_y⁻¹ R_g, given the Cholesky decomposition K^f_Λ = L^f_Λ (L^f_Λ)^T, we have

I − Σ_y⁻¹ R_g = [R_g⁻¹ K^f_nm (K^f_Λ)⁻¹ K^f_mn R_g⁻¹] R_g = [R_g⁻¹ K^f_nm (L^f_Λ)⁻T (L^f_Λ)⁻¹ K^f_mn R_g⁻¹] R_g,

indicating that the diagonal elements must be non-negative. Hence, from the foregoing discussions, we know that Λ_nn is a non-negative diagonal matrix.

APPENDIX B
DERIVATIVES OF F_V W.R.T. HYPERPARAMETERS
Let λ n = log( Λ nn ) collect n variational parameters inthe log form for non-negativity, we have the derivatives of F V w.r.t. λ n as ∂F V ∂ λ n = Λ nn (cid:20)
12 ( Q gnn + 12 A nn ) Λ abnn + 14 A nn − A nn Λ nn − µ g + µ (cid:21) , where A nn = ( K gnu K − K gun ) (cid:12) , and the operator (cid:12) represents the element-wise power.The derivatives of F V w.r.t. the kernel parameters θ f = { θ fi } are ∂F V ∂θ fi = 12 Tr (cid:34) ( β n β T n − Σ − y + R − g ) ∂ Q fnn ∂θ fi − R − g ∂ K fnn ∂θ fi (cid:35) . Similarly, the derivatives of F V w.r.t. the kernel parameters θ g = { θ gi } are ∂F V ∂θ gi = 12 Tr (cid:20) ∂ µ g ∂θ gi ( T Λ abnn ) −
Similarly, the derivatives of $F_V$ w.r.t. the kernel parameters $\boldsymbol{\theta}_g = \{\theta^g_i\}$ are
$$\frac{\partial F_V}{\partial \theta^g_i} = \frac{1}{2}\mathrm{Tr}\left[\frac{\partial \boldsymbol{\mu}_g}{\partial \theta^g_i}\big(\mathbf{1}^T\Lambda^{ab}_{nn}\big) - \frac{1}{2}\big(\Lambda^{ab}_{nn} + \mathbf{I}\big)\frac{\partial \boldsymbol{\Sigma}_g}{\partial \theta^g_i}\right] - \frac{1}{2}\mathrm{Tr}\left[\mathbf{V}^g_{uu}\frac{\partial \mathbf{K}^g_{uu}}{\partial \theta^g_i} - (\boldsymbol{\Omega}^g_{nu})^T\Lambda_{nn}\boldsymbol{\Omega}^g_{nu}\frac{\partial \boldsymbol{\Sigma}_u}{\partial \theta^g_i} + 2\frac{\partial \boldsymbol{\mu}_u}{\partial \theta^g_i}\boldsymbol{\gamma}^T_u\right],$$
where $\boldsymbol{\gamma}_u = (\mathbf{K}^g_{uu})^{-1}(\boldsymbol{\mu}_u - \mu\mathbf{1})$ and $\mathbf{V}^g_{uu} = (\mathbf{K}^g_{uu})^{-1} - \boldsymbol{\gamma}_u\boldsymbol{\gamma}^T_u - (\mathbf{K}^g_{uu})^{-1}\boldsymbol{\Sigma}_u(\mathbf{K}^g_{uu})^{-1}$.

The derivative of $F_V$ w.r.t. the mean parameter $\mu$ of $g$ is
$$\frac{\partial F_V}{\partial \mu} = \frac{1}{2}\mathrm{Tr}(\Lambda^{ab}_{nn}).$$

Finally, we calculate the derivatives of $F_V$ w.r.t. the inducing points $\mathbf{X}_m$ and $\mathbf{X}_u$. Since the inducing points are involved in the kernel matrices, we get the derivatives $\partial\mathbf{K}^f_{nm}/\partial x^f_{ij}$, $\partial\mathbf{K}^f_{mn}/\partial x^f_{ij}$, $\partial\mathbf{K}^f_{mm}/\partial x^f_{ij}$, $\partial\mathbf{K}^g_{nu}/\partial x^g_{ij}$, $\partial\mathbf{K}^g_{un}/\partial x^g_{ij}$, and $\partial\mathbf{K}^g_{uu}/\partial x^g_{ij}$, where $x^f_{ij} = [\mathbf{X}_m]_{ij}$ and $x^g_{ij} = [\mathbf{X}_u]_{ij}$. We first obtain the derivatives of $F_V$ w.r.t. $\mathbf{X}_m$ as
$$\frac{\partial F_V}{\partial x^f_{ij}} = 2\mathrm{Tr}\left[\frac{\partial \mathbf{K}^f_{nm}}{\partial x^f_{ij}}\mathbf{A}^f_{mn}\right] + \mathrm{Tr}\left[\frac{\partial \mathbf{K}^f_{mm}}{\partial x^f_{ij}}\mathbf{A}^f_{mm}\right],$$
where $\mathbf{A}^f_{mn} = 0.5(\boldsymbol{\Omega}^f_{nm})^T(\boldsymbol{\beta}_n\boldsymbol{\beta}^T_n - \boldsymbol{\Sigma}^{-1}_y + \mathbf{R}^{-1}_g)$ and $\mathbf{A}^f_{mm} = -\mathbf{A}^f_{mn}\boldsymbol{\Omega}^f_{nm}$. Similarly, the derivatives of $F_V$ w.r.t. $\mathbf{X}_u$ write
$$\frac{\partial F_V}{\partial x^g_{ij}} = \mathrm{Tr}\left[\frac{\partial \mathbf{K}^g_{nu}}{\partial x^g_{ij}}\mathbf{A}^g_{un}\right] + \mathrm{Tr}\left[\mathbf{A}^g_{nu}\frac{\partial \mathbf{K}^g_{un}}{\partial x^g_{ij}}\right] + \mathrm{Tr}\left[\frac{\partial \mathbf{K}^g_{uu}}{\partial x^g_{ij}}\mathbf{A}^g_{uu}\right].$$
For $\mathbf{A}^g_{un}$ in $\partial F_V/\partial x^g_{ij}$, we have
$$\mathbf{A}^g_{un} = \underbrace{0.5(\boldsymbol{\Omega}^g_{nu})^T(\Lambda_{nn} - 0.5\mathbf{I})\mathbf{1}\big(\mathbf{1}^T\Lambda^{ab}_{nn}\big)}_{\mathbf{T}_1} + \underbrace{0.5\big(\mathbf{H}_{uu}\mathbf{K}^g_{un}\Lambda_{nn} - [\boldsymbol{\Omega}^{\Lambda}_{nu} + \boldsymbol{\Omega}^g_{nu}]^T(\Lambda^{ab}_{nn} + \mathbf{I})\big)}_{\mathbf{T}_2} + \underbrace{0.5\big(\mathbf{J}_{uu}\mathbf{K}^g_{un}\Lambda_{nn} - \boldsymbol{\gamma}_u\mathbf{1}^T(\Lambda_{nn} - 0.5\mathbf{I})\big)}_{\mathbf{T}_3},$$
where $\mathbf{H}_{uu} = (\boldsymbol{\Omega}^{\Lambda}_{nu})^T(\Lambda^{ab}_{nn} + \mathbf{I})\boldsymbol{\Omega}^{\Lambda}_{nu}$ and $\mathbf{J}_{uu} = \mathbf{K}^{-1}_{uu}\mathbf{K}^g_{uu}\big((\mathbf{K}^g_{uu})^{-1} - \boldsymbol{\Sigma}^{-1}_u\big)\mathbf{K}^g_{uu}\mathbf{K}^{-1}_{uu}$. For $\mathbf{A}^g_{nu}$, we have
$$\mathbf{A}^g_{nu} = 0.5(\Lambda_{nn} - 0.5\mathbf{I})\mathbf{1}\big(\mathbf{1}^T\Lambda^{ab}_{nn}\big)\boldsymbol{\Omega}^g_{nu} + \mathbf{T}^T_2 + \mathbf{T}^T_3.$$
For $\mathbf{A}^g_{uu}$, we have
$$\mathbf{A}^g_{uu} = -\mathbf{T}_1\boldsymbol{\Omega}^g_{nu} - 0.5\big((\boldsymbol{\Omega}^g_{nu})^T(\Lambda^{ab}_{nn} + \mathbf{I})\boldsymbol{\Omega}^g_{nu} - \mathbf{H}_{uu}\big) + 0.5\big(\mathbf{P}_{uu} + \mathbf{P}^T_{uu} + \mathbf{V}^g_{uu} - \mathbf{J}_{uu}\big),$$
where $\mathbf{P}_{uu} = \mathbf{K}^{-1}_{uu}\mathbf{K}^g_{uu}\big((\mathbf{K}^g_{uu})^{-1} - \boldsymbol{\Sigma}^{-1}_u\big)$.

The calculation of $\partial F_V/\partial x^f_{ij}$ and $\partial F_V/\partial x^g_{ij}$ requires a loop over the $m \times d$ and $u \times d$ parameters of the inducing points, which is quite slow for even moderate $m$, $u$ and $d$. Fortunately, the derivative $\partial\mathbf{K}/\partial x^{f(g)}_{ij}$ only has $n$ or $m$ ($u$) non-zero elements. Due to this sparsity, $\partial\mathbf{K}/\partial x^{f(g)}_{ij}$ can be computed with vectorized operations such that the derivatives w.r.t. all the inducing points can be calculated along a specific dimension, as sketched below.
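To illustrate the sparsity being exploited, consider an SE kernel: perturbing the $j$-th coordinate of the $i$-th inducing point only changes the $i$-th column of $\mathbf{K}^f_{nm}$, so the non-zero parts of all $m$ derivatives along one dimension can be assembled in a single vectorized pass. A minimal sketch under these assumptions; the kernel choice and function names are illustrative, not the paper's implementation:

```python
import numpy as np

def se_kernel(A, B, ls=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / ls**2)

def dKnm_dZ_dim(X, Z, j, ls=1.0):
    """D[a, i] = d k(x_a, z_i) / d [Z]_{ij}: the only non-zero column of each
    derivative matrix, assembled for all inducing points along dimension j."""
    K = se_kernel(X, Z, ls)
    return K * (X[:, None, j] - Z[None, :, j]) / ls**2

rng = np.random.default_rng(0)
X, Z = rng.normal(size=(100, 3)), rng.normal(size=(10, 3))
D = dKnm_dZ_dim(X, Z, j=0)  # shape (100, 10): all 10 derivatives for dim 0
```

Each trace term then reduces to a sum over a single column, e.g., $\mathrm{Tr}[\partial\mathbf{K}^f_{nm}/\partial x^f_{ij}\,\mathbf{A}^f_{mn}] = \sum_a D_{ai}[\mathbf{A}^f_{mn}]_{ia}$.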
APPENDIX C
NATURAL GRADIENTS OF $q(\mathbf{f}_m)$ AND $q(\mathbf{g}_u)$

For exponential family distributions parameterized by the natural parameters $\boldsymbol{\theta}$,¹ we update the parameters using natural gradients as
$$\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \gamma^{(t)}\mathbf{G}^{-1}_{\boldsymbol{\theta}^{(t)}}\frac{\partial F}{\partial\boldsymbol{\theta}^{(t)}} = \boldsymbol{\theta}^{(t)} - \gamma^{(t)}\frac{\partial F}{\partial\boldsymbol{\psi}^{(t)}},$$
where $F$ is the objective function, and $\mathbf{G}_{\boldsymbol{\theta}} = \partial\boldsymbol{\psi}^{(t)}/\partial\boldsymbol{\theta}^{(t)}$ is the Fisher information matrix with $\boldsymbol{\psi}$ being the expectation parameters of exponential family distributions.

¹The probability density function (PDF) of the exponential family is $p(\mathbf{x}) = h(\mathbf{x})e^{\boldsymbol{\theta}^T\mathbf{t}(\mathbf{x}) - A(\boldsymbol{\theta})}$, where $\boldsymbol{\theta}$ are the natural parameters, $h(\mathbf{x})$ is the underlying measure, $\mathbf{t}(\mathbf{x})$ is the sufficient statistic, and $A(\boldsymbol{\theta})$ is the log normalizer. Besides, the expectation parameters are defined as $\boldsymbol{\psi} = \mathbb{E}_{p(\mathbf{x})}[\mathbf{t}(\mathbf{x})]$.

For $q(\mathbf{g}_u) = \mathcal{N}(\mathbf{g}_u|\boldsymbol{\mu}_u, \boldsymbol{\Sigma}_u)$, its natural parameters $\boldsymbol{\theta}$ are partitioned into two components
$$\boldsymbol{\theta}_1 = \boldsymbol{\Sigma}^{-1}_u\boldsymbol{\mu}_u, \quad \boldsymbol{\Theta}_2 = -\frac{1}{2}\boldsymbol{\Sigma}^{-1}_u,$$
where $\boldsymbol{\theta}_1$ comprises the first $u$ elements of $\boldsymbol{\theta}$, and $\boldsymbol{\Theta}_2$ the remaining elements reshaped into a square matrix. Accordingly, the expectation parameters $\boldsymbol{\psi}$ are divided as
$$\boldsymbol{\psi}_1 = \boldsymbol{\mu}_u, \quad \boldsymbol{\Psi}_2 = \boldsymbol{\mu}_u\boldsymbol{\mu}^T_u + \boldsymbol{\Sigma}_u.$$
Thereafter, we update the natural parameters with the step $\gamma^{(t)}$ as
$$\boldsymbol{\theta}^{(t+1)}_1 = \boldsymbol{\Sigma}^{-1(t)}_u\boldsymbol{\mu}^{(t)}_u - \gamma^{(t)}\frac{\partial F}{\partial\boldsymbol{\psi}^{(t)}_1}, \quad \boldsymbol{\Theta}^{(t+1)}_2 = -\frac{1}{2}\boldsymbol{\Sigma}^{-1(t)}_u - \gamma^{(t)}\frac{\partial F}{\partial\boldsymbol{\Psi}^{(t)}_2},$$
where, by the chain rule between the expectation-parameter and moment gradients, $\partial F/\partial\boldsymbol{\psi}^{(t)}_1 = \partial F/\partial\boldsymbol{\mu}^{(t)}_u - 2(\partial F/\partial\boldsymbol{\Sigma}^{(t)}_u)\boldsymbol{\mu}^{(t)}_u$ and $\partial F/\partial\boldsymbol{\Psi}^{(t)}_2 = \partial F/\partial\boldsymbol{\Sigma}^{(t)}_u$. The derivatives $\partial F/\partial\boldsymbol{\mu}_u$ and $\partial F/\partial\boldsymbol{\Sigma}_u$ are respectively expressed as
$$\frac{\partial F}{\partial\boldsymbol{\mu}_u} = \frac{1}{2}(\boldsymbol{\Omega}^g_{nu})^T\Lambda^{a'b}_{nn}\mathbf{1} - (\mathbf{K}^g_{uu})^{-1}(\boldsymbol{\mu}_u - \mu\mathbf{1}),$$
$$\frac{\partial F}{\partial\boldsymbol{\Sigma}_u} = -\frac{1}{4}(\boldsymbol{\Omega}^g_{nu})^T(\Lambda^{a'b}_{nn} + \mathbf{I})\boldsymbol{\Omega}^g_{nu} + \frac{1}{2}\big[\boldsymbol{\Sigma}^{-1}_u - (\mathbf{K}^g_{uu})^{-1}\big],$$
where $\Lambda^{a'b}_{nn} = \Lambda^{a'}_{nn} + \Lambda^b_{nn}$, and $\Lambda^{a'}_{nn}$ is a diagonal matrix with the diagonal elements
$$[\Lambda^{a'}_{nn}]_{ii} = [\mathbf{R}^{-1}_g(\mathbf{y} - \boldsymbol{\Omega}^f_{nm}\boldsymbol{\mu}_m)(\mathbf{y} - \boldsymbol{\Omega}^f_{nm}\boldsymbol{\mu}_m)^T - \mathbf{I}]_{ii}.$$
For $q(\mathbf{f}_m) = \mathcal{N}(\mathbf{f}_m|\boldsymbol{\mu}_m, \boldsymbol{\Sigma}_m)$, the updates of $\boldsymbol{\mu}^{(t+1)}_m$ and $\boldsymbol{\Sigma}^{(t+1)}_m$ follow the foregoing steps, with the derivatives $\partial F/\partial\boldsymbol{\mu}_m$ and $\partial F/\partial\boldsymbol{\Sigma}_m$ taking (19).
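Putting the pieces together, a minimal sketch of one natural-gradient step for a Gaussian variational distribution follows. The moment gradients are generic inputs supplied by the caller, so the function name and toy inputs are illustrative only; the sign convention follows the update rule above.

```python
import numpy as np

def natural_gradient_step(mu, Sigma, dF_dmu, dF_dSigma, gamma):
    Sigma_inv = np.linalg.inv(Sigma)
    theta1 = Sigma_inv @ mu                    # natural parameter theta_1
    Theta2 = -0.5 * Sigma_inv                  # natural parameter Theta_2
    # gradients w.r.t. the expectation parameters (psi_1, Psi_2)
    dF_dpsi1 = dF_dmu - 2.0 * dF_dSigma @ mu
    dF_dPsi2 = dF_dSigma
    theta1 = theta1 - gamma * dF_dpsi1         # natural-gradient updates
    Theta2 = Theta2 - gamma * dF_dPsi2
    Sigma_new = np.linalg.inv(-2.0 * Theta2)   # map back to moment form
    mu_new = Sigma_new @ theta1
    return mu_new, Sigma_new

# toy usage with placeholder gradients of a generic objective
u = 3
mu, Sigma = np.zeros(u), np.eye(u)
mu, Sigma = natural_gradient_step(mu, Sigma, np.ones(u), -0.1 * np.eye(u), 0.1)
```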
ACKNOWLEDGMENT

This work is funded by the National Research Foundation, Singapore under its AI Singapore programme [Award No.: AISG-RP-2018-004], the Data Science and Artificial Intelligence Research Center (DSAIR) at Nanyang Technological University, and supported under the Rolls-Royce@NTU Corporate Lab.
REFERENCES

[1] C. E. Rasmussen and C. K. I. Williams, Gaussian Processes for Machine Learning. MIT Press, 2006.
[2] N. Lawrence, "Probabilistic non-linear principal component analysis with Gaussian process latent variable models," Journal of Machine Learning Research, vol. 6, no. Nov, pp. 1783–1816, 2005.
[3] B. Shahriari, K. Swersky, Z. Wang, R. P. Adams, and N. De Freitas, "Taking the human out of the loop: A review of Bayesian optimization," Proceedings of the IEEE, vol. 104, no. 1, pp. 148–175, 2016.
[4] H. Liu, J. Cai, and Y. Ong, "Remarks on multi-output Gaussian process regression," Knowledge-Based Systems, vol. 144, no. March, pp. 102–121, 2018.
[5] B. Settles, "Active learning," Synthesis Lectures on Artificial Intelligence and Machine Learning, vol. 6, no. 1, pp. 1–114, 2012.
[6] P. Kou, D. Liang, L. Gao, and J. Lou, "Probabilistic electricity price forecasting with variational heteroscedastic Gaussian process and active learning," Energy Conversion and Management, vol. 89, pp. 298–308, 2015.
[7] M. Lázaro-Gredilla, M. K. Titsias, J. Verrelst, and G. Camps-Valls, "Retrieval of biophysical parameters with heteroscedastic Gaussian processes," IEEE Geoscience and Remote Sensing Letters, vol. 11, no. 4, pp. 838–842, 2014.
[8] I. A. Almosallam, M. J. Jarvis, and S. J. Roberts, "GPz: non-stationary sparse Gaussian processes for heteroscedastic uncertainty estimation in photometric redshifts," Monthly Notices of the Royal Astronomical Society, vol. 462, no. 1, pp. 726–739, 2016.
[9] M. Bauza and A. Rodriguez, "A probabilistic data-driven model for planar pushing," in International Conference on Robotics and Automation, 2017, pp. 3008–3015.
[10] A. J. Smith, M. AlAbsi, and T. Fields, "Heteroscedastic Gaussian process-based system identification and predictive control of a quadcopter," in AIAA Atmospheric Flight Mechanics Conference, 2018, p. 0298.
[11] J. Kirschner and A. Krause, "Information directed sampling and bandits with heteroscedastic noise," in Conference on Learning Theory, 2018, pp. 358–384.
[12] A. Kendall and Y. Gal, "What uncertainties do we need in Bayesian deep learning for computer vision?" in Advances in Neural Information Processing Systems, 2017, pp. 5574–5584.
[13] S. Urban, M. Ludersdorfer, and P. Van Der Smagt, "Sensor calibration and hysteresis compensation with heteroscedastic Gaussian processes," IEEE Sensors Journal, vol. 15, no. 11, pp. 6498–6506, 2015.
[14] F. C. Pereira, C. Antoniou, J. A. Fargas, and M. Ben-Akiva, "A metamodel for estimating error bounds in real-time traffic prediction systems," IEEE Transactions on Intelligent Transportation Systems, vol. 15, no. 3, pp. 1310–1322, 2014.
[15] E. Snelson and Z. Ghahramani, "Variable noise and dimensionality reduction for sparse Gaussian processes," in Uncertainty in Artificial Intelligence, 2006, pp. 461–468.
[16] P. W. Goldberg, C. K. Williams, and C. M. Bishop, "Regression with input-dependent noise: A Gaussian process treatment," in Advances in Neural Information Processing Systems, 1998, pp. 493–499.
[17] L. Muñoz-González, M. Lázaro-Gredilla, and A. R. Figueiras-Vidal, "Divisive Gaussian processes for nonstationary regression," IEEE Transactions on Neural Networks and Learning Systems, vol. 25, no. 11, pp. 1991–2003, 2014.
[18] L. Muñoz-González, M. Lázaro-Gredilla, and A. R. Figueiras-Vidal, "Laplace approximation for divisive Gaussian processes for nonstationary regression," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 3, pp. 618–624, 2016.
[19] A. D. Saul, J. Hensman, A. Vehtari, and N. D. Lawrence, "Chained Gaussian processes," in Artificial Intelligence and Statistics, 2016, pp. 1431–1440.
[20] K. Kersting, C. Plagemann, P. Pfaff, and W. Burgard, "Most likely heteroscedastic Gaussian process regression," in International Conference on Machine Learning, 2007, pp. 393–400.
[21] N. Quadrianto, K. Kersting, M. D. Reid, T. S. Caetano, and W. L. Buntine, "Kernel conditional quantile estimation via reduction revisited," in International Conference on Data Mining, 2009, pp. 938–943.
[22] M. Heinonen, H. Mannerström, J. Rousu, S. Kaski, and H. Lähdesmäki, "Non-stationary Gaussian process regression with Hamiltonian Monte Carlo," in Artificial Intelligence and Statistics, 2016, pp. 732–740.
[23] M. Binois, R. B. Gramacy, and M. Ludkovski, "Practical heteroscedastic Gaussian process modeling for large simulation experiments," Journal of Computational and Graphical Statistics, vol. 27, no. 4, pp. 808–821, 2018.
[24] M. K. Titsias and M. Lázaro-Gredilla, "Variational heteroscedastic Gaussian process regression," in International Conference on Machine Learning, 2011, pp. 841–848.
[25] M. Menictas and M. P. Wand, "Variational inference for heteroscedastic semiparametric regression," Australian & New Zealand Journal of Statistics, vol. 57, no. 1, pp. 119–138, 2015.
[26] L. Muñoz-González, M. Lázaro-Gredilla, and A. R. Figueiras-Vidal, "Heteroscedastic Gaussian process regression using expectation propagation," in Machine Learning for Signal Processing, 2011, pp. 1–6.
[27] V. Tolvanen, P. Jylänki, and A. Vehtari, "Expectation propagation for nonstationary heteroscedastic Gaussian process regression," in Machine Learning for Signal Processing, 2014, pp. 1–6.
[28] M. Hartmann and J. Vanhatalo, "Laplace approximation and natural gradient for Gaussian process regression with heteroscedastic Student-t model," Statistics and Computing, vol. 29, no. 4, pp. 753–773, 2019.
[29] H. Liu, Y.-S. Ong, X. Shen, and J. Cai, "When Gaussian process meets big data: A review of scalable GPs," IEEE Transactions on Neural Networks and Learning Systems, pp. 1–19, 2020.
[30] J. Quiñonero-Candela and C. E. Rasmussen, "A unifying view of sparse approximate Gaussian process regression," Journal of Machine Learning Research, vol. 6, no. Dec, pp. 1939–1959, 2005.
[31] M. Titsias, "Variational learning of inducing variables in sparse Gaussian processes," in Artificial Intelligence and Statistics, 2009, pp. 567–574.
[32] Y. Gal, M. van der Wilk, and C. E. Rasmussen, "Distributed variational inference in sparse Gaussian process regression and latent variable models," in Advances in Neural Information Processing Systems, 2014, pp. 3257–3265.
[33] T. N. Hoang, Q. M. Hoang, and B. K. H. Low, "A distributed variational inference framework for unifying parallel sparse Gaussian process regression models," in International Conference on Machine Learning, 2016, pp. 382–391.
[34] J. Hensman, N. Fusi, and N. D. Lawrence, "Gaussian processes for big data," in Uncertainty in Artificial Intelligence, 2013, pp. 282–290.
[35] T. N. Hoang, Q. M. Hoang, and B. K. H. Low, "A unifying framework of anytime sparse Gaussian process regression models with stochastic variational inference for big data," in International Conference on Machine Learning, 2015, pp. 569–578.
[36] A. Wilson and H. Nickisch, "Kernel interpolation for scalable structured Gaussian processes (KISS-GP)," in International Conference on Machine Learning, 2015, pp. 1775–1784.
[37] T. D. Bui and R. E. Turner, "Tree-structured Gaussian process approximations," in Advances in Neural Information Processing Systems, 2014, pp. 2213–2221.
[38] G. E. Hinton, "Training products of experts by minimizing contrastive divergence," Neural Computation, vol. 14, no. 8, pp. 1771–1800, 2002.
[39] V. Tresp, "A Bayesian committee machine," Neural Computation, vol. 12, no. 11, pp. 2719–2741, 2000.
[40] M. P. Deisenroth and J. W. Ng, "Distributed Gaussian processes," in International Conference on Machine Learning, 2015, pp. 1481–1490.
[41] H. Liu, J. Cai, Y. Wang, and Y.-S. Ong, "Generalized robust Bayesian committee machine for large-scale Gaussian process regression," in International Conference on Machine Learning, 2018, pp. 3137–3146.
[42] H. Liu, J. Cai, Y.-S. Ong, and Y. Wang, "Understanding and comparing scalable Gaussian process regression for big data," Knowledge-Based Systems, vol. 164, pp. 324–335, 2019.
[43] E. Snelson and Z. Ghahramani, "Local and global sparse Gaussian process approximations," in Artificial Intelligence and Statistics, 2007, pp. 524–531.
[44] J. Vanhatalo and A. Vehtari, "Modelling local and global phenomena with sparse Gaussian processes," in Uncertainty in Artificial Intelligence, 2008, pp. 571–578.
[45] R. Ouyang and K. H. Low, "Gaussian process decentralized data fusion meets transfer learning in large-scale distributed cooperative perception," in AAAI Conference on Artificial Intelligence, 2018.
[46] K. Chalupka, C. K. Williams, and I. Murray, "A framework for evaluating approximation methods for Gaussian process regression," Journal of Machine Learning Research, vol. 14, no. Feb, pp. 333–350, 2013.
[47] M. E. Tipping, "Sparse Bayesian learning and the relevance vector machine," Journal of Machine Learning Research, vol. 1, no. Jun, pp. 211–244, 2001.
[48] C. E. Rasmussen and J. Quiñonero-Candela, "Healing the relevance vector machine through augmentation," in International Conference on Machine Learning, 2005, pp. 689–696.
[49] E. Snelson and Z. Ghahramani, "Sparse Gaussian processes using pseudo-inputs," in Advances in Neural Information Processing Systems, 2006, pp. 1257–1264.
[50] M. Bauer, M. van der Wilk, and C. E. Rasmussen, "Understanding probabilistic sparse Gaussian process approximations," in Advances in Neural Information Processing Systems, 2016, pp. 1533–1541.
[51] H. Yu, T. Nghia, B. K. H. Low, and P. Jaillet, "Stochastic variational inference for Bayesian sparse Gaussian process regression," in International Joint Conference on Neural Networks, 2019, pp. 1–8.
[52] A. G. de G. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, and J. Hensman, "GPflow: A Gaussian process library using TensorFlow," Journal of Machine Learning Research, vol. 18, no. 1, pp. 1299–1304, 2017.
[53] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard et al., "TensorFlow: A system for large-scale machine learning," in USENIX Symposium on Operating Systems Design and Implementation, 2016, pp. 265–283.
[54] C. E. Rasmussen and H. Nickisch, "Gaussian processes for machine learning (GPML) toolbox," Journal of Machine Learning Research, vol. 11, no. Nov, pp. 3011–3015, 2010.
[55] S. Sun, "A review of deterministic approximate inference techniques for Bayesian machine learning," Neural Computing and Applications, vol. 23, no. 7-8, pp. 2039–2050, 2013.
[56] J. Hensman and N. D. Lawrence, "Nested variational compression in deep Gaussian processes," arXiv preprint arXiv:1412.1370, 2014.
[57] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[58] B. Ingram and D. Cornford, "Parallel geostatistics for sparse and dense datasets," in GeoENV VII – Geostatistics for Environmental Applications. Springer, 2010, pp. 371–381.
[59] H. Salimbeni, S. Eleftheriadis, and J. Hensman, "Natural gradients in practice: Non-conjugate variational inference in Gaussian process models," in International Conference on Artificial Intelligence and Statistics, 2018, pp. 689–697.
[60] J. Hensman, N. Durrande, and A. Solin, "Variational Fourier features for Gaussian processes," Journal of Machine Learning Research, vol. 18, pp. 1–52, 2018.
[61] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017. [Online]. Available: http://archive.ics.uci.edu/ml
[62] M. Kaul, B. Yang, and C. S. Jensen, "Building accurate 3D spatial networks to enable next generation intelligent transportation systems," in International Conference on Mobile Data Management, vol. 1, 2013, pp. 137–146.
[63] C. Wang and R. M. Neal, "Gaussian process regression with heteroscedastic or non-Gaussian residuals," arXiv preprint arXiv:1212.6246, 2012.
[64] V. Dutordoir, H. Salimbeni, J. Hensman, and M. Deisenroth, "Gaussian process conditional density estimation," in Advances in Neural Information Processing Systems, 2018, pp. 2385–2395.
[65] M. Jankowiak, G. Pleiss, and J. R. Gardner, "Sparse Gaussian process regression beyond variational inference," arXiv preprint arXiv:1910.07123, 2019.
[66] C. Ma, Y. Li, and J. M. Hernandez-Lobato, "Variational implicit processes," in International Conference on Machine Learning, 2019, pp. 4222–4233.
[67] T. Nguyen and E. Bonilla, "Fast allocation of Gaussian process experts," in International Conference on Machine Learning, 2014, pp. 145–153.
[68] D. Wu and J. Ma, "A two-layer mixture model of Gaussian process functional regressions and its MCMC EM algorithm," IEEE Transactions on Neural Networks and Learning Systems, vol. 29, no. 10, pp. 4894–4904, 2018.
[69] S. Remes, M. Heinonen, and S. Kaski, "Non-stationary spectral kernels," in Advances in Neural Information Processing Systems, 2017, pp. 4642–4651.
Haitao Liu received the Ph.D. degree from the School of Energy and Power Engineering, Dalian University of Technology, Dalian, China, in 2016. He is currently a Research Fellow with the Rolls-Royce@NTU Corporate Laboratory, Nanyang Technological University, Singapore. His current research interests include multi-task learning, large-scale Gaussian process, active learning, and optimization.
Yew-Soon Ong (M'99-SM'12-F'18) received the Ph.D. degree in artificial intelligence in complex design from the University of Southampton, U.K., in 2003. He is a President's Chair Professor in Computer Science at the Nanyang Technological University (NTU), and holds the position of Chief Artificial Intelligence Scientist at the Agency for Science, Technology and Research, Singapore. At NTU, he serves as Director of the Data Science and Artificial Intelligence Research Center and Director of the Singtel-NTU Cognitive & Artificial Intelligence Joint Lab. His research interest is in artificial and computational intelligence. He is the founding Editor-in-Chief of the IEEE Transactions on Emerging Topics in Computational Intelligence, Technical Editor-in-Chief of Memetic Computing, and Associate Editor of the IEEE Transactions on Neural Networks and Learning Systems, the IEEE Transactions on Cybernetics, the IEEE Transactions on Evolutionary Computation, and others. He has received several IEEE outstanding paper awards, and is listed as a Thomson Reuters highly cited researcher and among the World's Most Influential Scientific Minds.