[PDF] Heterogeneous Multi-output Gaussian Process Prediction

Abstract

We present a novel extension of multi-output Gaussian processes for handling heterogeneous outputs. We assume that each output has its own likelihood function and use a vector-valued Gaussian process prior to jointly model the parameters in all likelihoods as latent functions. Our multi-output Gaussian process uses a covariance function with a linear model of coregionalisation form. Assuming conditional independence across the underlying latent functions together with an inducing variable framework, we are able to obtain tractable variational bounds amenable to stochastic variational inference. We illustrate the performance of the model on synthetic data and two real datasets: a human behavioral study and a demographic high-dimensional dataset.

Full PDF

HHeterogeneous Multi-output Gaussian ProcessPrediction

Pablo Moreno-Muñoz Antonio Artés-Rodríguez Mauricio A. Álvarez Dept. of Signal Theory and Communications, Universidad Carlos III de Madrid, Spain Dept. of Computer Science, University of Shefﬁeld, UK {pmoreno,antonio}@tsc.uc3m.es , [email protected] Abstract

We present a novel extension of multi-output Gaussian processes for handlingheterogeneous outputs. We assume that each output has its own likelihood functionand use a vector-valued Gaussian process prior to jointly model the parametersin all likelihoods as latent functions. Our multi-output Gaussian process uses acovariance function with a linear model of coregionalisation form. Assumingconditional independence across the underlying latent functions together with aninducing variable framework, we are able to obtain tractable variational boundsamenable to stochastic variational inference. We illustrate the performance of themodel on synthetic data and two real datasets: a human behavioral study and ademographic high-dimensional dataset.

Multi-output Gaussian processes (MOGP) generalise the powerful Gaussian process (GP) predictivemodel to the vector-valued random ﬁeld setup (Alvarez et al., 2012). It has been experimentallyshown that by simultaneously exploiting correlations between multiple outputs and across the inputspace, it is possible to provide better predictions, particularly in scenarios with missing or noisy data(Bonilla et al., 2008; Dai et al., 2017).The main focus in the literature for MOGP has been on the deﬁnition of a suitable cross-covariancefunction between the multiple outputs that allows for the treatment of outputs as a single GP witha properly deﬁned covariance function (Alvarez et al., 2012). The two classical alternatives todeﬁne such cross-covariance functions are the linear model of coregionalisation (LMC) (Journeland Huijbregts, 1978) and process convolutions (Higdon, 2002). In the former case, each outputcorresponds to a weighted sum of shared latent random functions. In the latter, each output ismodelled as the convolution integral between a smoothing kernel and a latent random functioncommon to all outputs. In both cases, the unknown latent functions follow Gaussian process priorsleading to straight-forward expressions to compute the cross-covariance functions among differentoutputs. More recent alternatives to build valid covariance functions for MOGP include the workby Ulrich et al. (2015) and Parra and Tobar (2017), that build the cross-covariances in the spectraldomain.Regarding the type of outputs that can be modelled, most alternatives focus on multiple-outputregression for continuous variables. Traditionally, each output is assumed to follow a Gaussianlikelihood where the mean function is given by one of the outputs of the MOGP and the variance inthat distribution is treated as an unknown parameter. Bayesian inference is tractable for these models.In this paper, we are interested in the heterogeneous case for which the outputs are a mix of continuous,categorical, binary or discrete variables with different likelihood functions. a r X i v : . [ s t a t . M L ] J a n here have been few attempts to extend the MOGP to other types of likelihoods. For example, Skolidisand Sanguinetti (2011) use the outputs of a MOGP for jointly modelling several binary classiﬁcationproblems, each of which uses a probit likelihood. They use an intrinsic coregionalisation model(ICM), a particular case of LMC. Posterior inference is perfomed using expectation-propagation (EP)and variational mean ﬁeld. Both Chai (2012) and Dezfouli and Bonilla (2015) have also used ICMfor modeling a single categorical variable with a multinomial logistic likelihood. The outputs of theICM model are used as replacements for the linear predictors in the softmax function. Chai (2012)derives a particular variational bound for the marginal likelihood and computes Gaussian posteriordistributions; and Dezfouli and Bonilla (2015) introduce an scalable inference procedure that uses amixture of Gaussians to approximate the posterior distribution using automated variational inference(AVI) (Nguyen and Bonilla, 2014a) that requires sampling from univariate Gaussians.For the single-output GP case, the usual practice for handling non-Gaussian likelihoods has been re-placing the parameters or linear predictors of the non-Gaussian likelihood by one or more independentGP priors . Since computing posterior distributions becomes intractable, different alternatives havebeen offered for approximate inference. An example is the Gaussian heteroscedastic regression modelwith variational inference (Lázaro-Gredilla and Titsias, 2011), Laplace approximation (Vanhataloet al., 2013); and stochastic variational inference (SVI) (Saul et al., 2016). This last reference uses thesame idea for modulating the parameters of a Student -t likelihood, a log-logistic distribution, a betadistribution and a Poisson distribution. The generalised Wishart process (Wilson and Ghahramani,2011) is another example where the entries of the scale matrix of a Wishart distribution are modulatedby independent GPs.Our main contribution in this paper is to provide an extension of multiple-output Gaussian processesfor prediction in heterogeneous datasets. The key principle in our model is to use the outputs ofa MOGP as the latent functions that modulate the parameters of several likelihood functions, onelikelihood function per output. We tackle the model’s intractability using variational inference.Furthermore, we use the inducing variable formalism for MOGP introduced by Alvarez and Lawrence(2009) and compute a variational bound suitable for stochastic optimisation as in Hensman et al.(2013). We experimentally provide evidence of the beneﬁts of simultaneously modeling heteroge-neous outputs in different applied problems. Our model can be seen as a generalisation of Saulet al. (2016) for multiple correlated output functions of an heterogeneous nature. Our Python im-plementation follows the spirit of Hadﬁeld et al. (2010), where the user only needs to specify a listof likelihood functions likelihood_list = [Bernoulli(), Poisson(), HetGaussian()] ,where HetGaussian refers to the heteroscedastic Gaussian distribution, and the number of latentparameter functions per likelihood is assigned automatically.

Consider a set of output functions Y = { y d ( x ) } Dd =1 , with x ∈ R p , that we want to jointly modelusing Gaussian processes. Traditionally, the literature has considered the case for which each y d ( x ) is continuous and Gaussian distributed. In this paper, we are interested in the heterogeneous case forwhich the outputs in Y are a mix of continuous, categorical, binary or discrete variables with severaldifferent distributions. In particular, we will assume that the distribution over y d ( x ) is completelyspeciﬁed by a set of parameters θ d ( x ) ∈ X J d , where we have a generic X domain for the parametersand J d is the number of parameters thet deﬁne the distribution. Each parameter θ d,j ( x ) ∈ θ d ( x ) is a non-linear transformation of a Gaussian process prior f d,j ( x ) , this is, θ d,j ( x ) = g d,j ( f d,j ( x )) ,where g d,j ( · ) is a deterministic function that maps the GP output to the appropriate domain for theparameter θ d,j .To make the notation concrete, let us assume an heterogeneous multiple-output problem for which D = 3 . Assume that output y ( x ) is binary and that it will be modelled using a Bernoulli distribution.The Bernoulli distribution uses a single parameter (the probability of success), J = 1 , restricted tovalues in the range [0 , . This means that θ ( x ) = θ , ( x ) = g , ( f , ( x )) , where g , ( · ) could bemodelled using the logistic sigmoid function σ ( z ) = 1 / (1 + exp( − z )) that maps σ : R → [0 , .Assume now that the second output y ( x ) corresponds to a count variable that can take values y ( x ) ∈ N ∪ { } . The count variable can be modelled using a Poisson distribution with a singleparameter (the rate), J = 1 , restricted to the positive reals. This means that θ ( x ) = θ , ( x ) = g , ( f , ( x )) where g , ( · ) could be modelled as an exponential function g , ( · ) = exp( · ) to ensurestrictly positive values for the parameter. Finally, y ( x ) is a continuous variable with heteroscedastic2oise. It can be modelled using a Gaussian distribution where both the mean and the variance arefunctions of x . This means that θ ( x ) = [ θ , ( x ) θ , ( x )] (cid:62) = [ g , ( f , ( x )) g , ( f , ( x ))] (cid:62) ,where the ﬁrst function is used to model the mean of the Gaussian, and the second function is usedto model the variance. Therefore, we can assume the g , ( · ) is the identity function and g , ( · ) is afunction that ensures that the variance takes strictly positive values, e.g. the exponential function.Let us deﬁne a vector-valued function y ( x ) = [ y ( x ) , y ( x ) , · · · , y D ( x )] (cid:62) . We assumethat the outputs are conditionally independent given the vector of parameters θ ( x ) =[ θ ( x ) , θ ( x ) , · · · , θ D ( x )] (cid:62) , deﬁned by specifying the vector of latent functions f ( x ) =[ f , ( x ) , f , ( x ) , · · · f ,J ( x ) , f , ( x ) , f , ( x ) , · · · , f D,J D ( x )] (cid:62) ∈ R J × , where J = (cid:80) Dd =1 J d , p ( y ( x ) | θ ( x )) = p ( y ( x ) | f ( x )) = D (cid:89) d =1 p ( y d ( x ) | θ d ( x )) = D (cid:89) d =1 p ( y d ( x ) | (cid:101) f d ( x )) , (1)where we have deﬁned (cid:101) f d ( x ) = [ f d, ( x ) , · · · , f d,J d ( x )] (cid:62) ∈ R J d × , the set of latent functions thatspecify the parameters in θ d ( x ) . Notice that J ≥ D . This is, there is not always a one-to-onemap from f ( x ) to y ( x ) . Most previous work has assumed that D = 1 , and that the correspondingelements in θ d ( x ) , this is, the latent functions in (cid:101) f ( x ) = [ f , ( x ) , · · · , f ,J ( x )] (cid:62) are drawn fromindependent Gaussian processes. Important exceptions are Chai (2012) and Dezfouli and Bonilla(2015), that assumed a categorical variable y ( x ) , where the elements in (cid:101) f ( x ) were drawn froman intrinsic coregionalisation model. In what follows, we generalise these models for D > andpotentially heterogeneuos outputs y d ( x ) . We will use the word “output” to refer to the elements y d ( x ) and “latent parameter function” (LPF) or “parameter function” (PF) to refer to f d,j ( x ) . Our main departure from previous work is in modeling of f ( x ) using a multi-parameter Gaussianprocess that allows correlations for the parameter functions f d,j ( x ) . We will use a linear modelof corregionalisation type of covariance function for expressing correlations between functions f d,j ( x ) , and f d (cid:48) ,j (cid:48) ( x (cid:48) ) . The particular construction is as follows. Consider an additional set ofindependent latent functions U = { u q ( x ) } Qq =1 that will be linearly combined to produce J LPFs { f d,j ( x ) } J d ,Dj =1 ,d =1 . Each latent function u q ( x ) is assummed to be drawn from an independent GPprior such that u q ( · ) ∼ GP (0 , k q ( · , · )) , where k q can be any valid covariance function, and the zeromean is assumed for simplicity. Each latent parameter f d,j ( x ) is then given as f d,j ( x ) = Q (cid:88) q =1 R q (cid:88) i =1 a id,j,q u iq ( x ) , (2)where u iq ( x ) are IID samples from u q ( · ) ∼ GP (0 , k q ( · , · )) and a id,j,q ∈ R . The mean functionfor f d,j ( x ) is zero and the cross-covariance function k f d,j f d (cid:48) ,j (cid:48) ( x , x (cid:48) ) = cov[ f d,j ( x ) , f d (cid:48) ,j (cid:48) ( x (cid:48) )] is equal to (cid:80) Qq =1 b q ( d,j ) , ( d (cid:48) ,j (cid:48) ) k q ( x , x (cid:48) ) , where b q ( d,j ) , ( d (cid:48) ,j (cid:48) ) = (cid:80) R q i =1 a id,j,q a id (cid:48) ,j (cid:48) ,q . Let us deﬁne X = { x n } Nn =1 ∈ R N × p as a set of common input vectors for all outputs y d ( x ) . Although, thepresentation could be extended for the case of a different set of inputs per output. Let us alsodeﬁne f d,j = [ f d,j ( x ) , · · · , f d,j ( x N )] (cid:62) ∈ R N × ; (cid:101) f d = [ f (cid:62) d, · · · f (cid:62) d,J d ] (cid:62) ∈ R J d N × , and f =[ (cid:101) f (cid:62) · · · (cid:101) f (cid:62) D ] (cid:62) ∈ R JN × . The generative model for the heterogeneous MOGP is as follows. We sample f ∼ N ( , K ) , where K is a block-wise matrix with blocks given by { K f d,j f d (cid:48) ,j (cid:48) } D,D,J d ,J d (cid:48) d =1 ,d (cid:48) =1 ,j =1 ,j (cid:48) =1 . Inturn, the elements in K f d,j f d (cid:48) ,j (cid:48) are given by k f d,j f d (cid:48) ,j (cid:48) ( x n , x m ) , with x n , x m ∈ X . For the particularcase of equal inputs X for all LPF, K can also be expressed as the sum of Kronecker products K = (cid:80) Qq =1 A q A (cid:62) q ⊗ K q = (cid:80) Qq =1 B q ⊗ K q , where A q ∈ R J × R q has entries { a id,j,q } D,J d ,R q d =1 ,j =1 ,i =1 and B q has entries { b q ( d,j ) , ( d (cid:48) ,j (cid:48) ) } D,D,J d ,J d (cid:48) d =1 ,d (cid:48) =1 ,j =1 ,j (cid:48) =1 . The matrix K q ∈ R N × N has entries given by k q ( x n , x m ) for x n , x m ∈ X . Matrices B q ∈ R J × J are known as the coregionalisation matrices .Once we obtain the sample for f , we evaluate the vector of parameters θ = [ θ (cid:62) · · · θ (cid:62) D ] (cid:62) , where θ d = (cid:101) f d . Having speciﬁed θ , we can generate samples for the output vector y = [ y (cid:62) · · · y (cid:62) D ] (cid:62) ∈X DN × , where the elements in y d are obtained by sampling from the conditional distributions3 ( y d ( x ) | θ d ( x )) . To keep the notation uncluttered, we will assume from now that R q = 1 , meaningthat A q = a q ∈ R J × , and the corregionalisation matrices are rank-one. In the literature such modelis known as the semiparametric latent factor model (Teh et al., 2005). Given an heterogeneous dataset D = { X , y } , we would like to compute the posterior distributionfor p ( f |D ) , which is intractable in our model. In what follows, we use similar ideas to Alvarezand Lawrence (2009); Álvarez et al. (2010) that introduce the inducing variable formalism forcomputational efﬁciency in MOGP. However, instead of marginalising the latent functions U toobtain a variational lower bound, we keep their presence in a way that allows us to apply stochasticvariational inference as in Hensman et al. (2013); Saul et al. (2016). A key idea to reduce computational complexity in Gaussian process models is to introduce auxiliaryvariables or inducing variables . These variables have been used already in the context of MOGP(Alvarez and Lawrence, 2009; Álvarez et al., 2010) . A subtle difference from the single outputcase is that the inducing variables are not taken from the same latent process, say f ( x ) , but fromthe latent processes U used also to build the model for multiple outputs. We will follow the sameformalism here. We start by deﬁning the set of M inducing variables per latent function u q ( x ) as u q = [ u q ( z ) , · · · , u q ( z M )] (cid:62) , evaluated at a set of inducing inputs Z = { z m } Mm =1 ∈ R M × p . Wealso deﬁne u = [ u (cid:62) , · · · , u (cid:62) Q ] (cid:62) ∈ R QM × . For simplicity in the exposition, we have assumed thatall the inducing variables, for all q , have been evaluated at the same set of inputs Z . Instead ofmarginalising { u q ( x ) } Qq from the model in (2), we explicitly use the joint Gaussian prior p ( f , u ) = p ( f | u ) p ( u ) . Due to the assumed independence in the latent functions u q ( x ) , the distribution p ( u ) factorises as p ( u ) = (cid:81) Qq =1 p ( u q ) , with u q ∼ N ( , K q ) , where K q ∈ R M × M has entries k q ( z i , z j ) with z i , z j ∈ Z . Notice that the dimensions of K q are different to the dimensions of K q in section2.1. The LPFs f d,j are conditionally independent given u , so we can write the conditional distribution p ( f | u ) as p ( f | u ) = D (cid:89) d =1 J d (cid:89) j =1 p ( f d,j | u ) = D (cid:89) d =1 J d (cid:89) j =1 N (cid:16) f d,j | K f d,j u K − uu u , K f d,j f d,j − K f d,j u K − uu K (cid:62) f d,j u (cid:17) , where K uu ∈ R QM × QM is a block-diagonal matrix with blocks given by K q and K f d,j u ∈ R N × QM is the cross-covariance matrix computed from the cross-covariances between f d,j ( x ) and u q ( z ) . Theexpression for this cross-covariance function can be obtained from (2) leading to k f d,j u q ( x , z ) = a d,j,q k q ( x , z ) . This form for the cross-covariance between the LPF f d,j ( x ) and u q ( z ) is a keydifference between the inducing variable methods for the single-output GP case and the MOGP case. Exact posterior inference is intractable in our model due to the presence of an arbitrary number ofnon-Gaussian likelihoods. We use variational inference to compute a lower bound L for the marginallog-likelihood log p ( y ) , and for approximating the posterior distribution p ( f , u |D ) . FollowingÁlvarez et al. (2010), the posterior of the LPFs f and the latent functions u can be approximated as p ( f , u | y , X ) ≈ q ( f , u ) = p ( f | u ) q ( u ) = D (cid:89) d =1 J d (cid:89) j =1 p ( f d,j | u ) Q (cid:89) q =1 q ( u q ) , where q ( u q ) = N ( u q | µ u q , S u q ) are Gaussian variational distributions whose parameters { µ u q , S u q } Qq =1 must be optimised. Building on previous work by Saul et al. (2016); Hensmanet al. (2015), we derive a lower bound that accepts any log-likelihood function that can be modulatedby the LPFs f . The lower bound L for log p ( y ) is obtained as follows log p ( y ) = log (cid:90) p ( y | f ) p ( f | u ) p ( u ) d f d u ≥ (cid:90) q ( f , u ) log p ( y | f ) p ( f | u ) p ( u ) q ( f , u ) d f d u = L .

4e can further simplify L to obtain L = (cid:90) (cid:90) p ( f | u ) q ( u ) log p ( y | f ) d f d u − Q (cid:88) q =1 KL (cid:0) q ( u q ) || p ( u q ) (cid:1) = (cid:90) (cid:90) D (cid:89) d =1 J d (cid:89) j =1 p ( f d,j | u ) q ( u ) log p ( y | f ) d u d f − Q (cid:88) q =1 KL (cid:0) q ( u q ) || p ( u q ) (cid:1) , where KL is the Kullback-Leibler divergence. Moreover, the approximate marginal posterior for f d,j is q ( f d,j ) = (cid:82) p ( f d,j | u ) q ( u ) d u , leading to q ( f d,j ) = N (cid:16) f d,j | K f d,j u K − uu µ u , K f d,j f d,j + K f d,j u K − uu ( S u − K uu ) K − uu K (cid:62) f d,j u (cid:17) , where µ u = [ µ (cid:62) u , · · · , µ (cid:62) u Q ] (cid:62) and S u is a block-diagonal matrix with blocks given by S u q .The expression for log p ( y | f ) factorises, according to (1): log p ( y | f ) = (cid:80) Dd =1 log p ( y d | (cid:101) f d ) = (cid:80) Dd =1 log p ( y d | f d, , · · · , f d,J d ) . Using this expression for log p ( y | f ) leads to the following expres-sion for the bound D (cid:88) d =1 E q ( f d, ) ··· q ( f d,Jd ) (cid:2) log p ( y d | f d, , · · · , f d,J d ) (cid:3) − Q (cid:88) q =1 KL (cid:0) q ( u q ) || p ( u q ) (cid:1) . When D = 1 in the expression above, we recover the bound obtained in Saul et al. (2016). Tomaximize this lower bound, we need to ﬁnd the optimal variational parameters { µ u q } Qq =1 and { S u q } Qq =1 . We represent each matrix S u q as S u q = L u q L (cid:62) u q and, to ensure positive deﬁniteness for S u q , we estimate L u q instead of S u q . Computation of the posterior distributions over f d,j can be doneanalytically. There is still an intractability issue in the variational expectations on the log-likelihoodfunctions. Since we construct these bounds in order to accept any possible data type, we need ageneral way to solve these integrals. One obvious solution is to apply Monte Carlo methods, howeverit would be slow both maximising the lower bound and updating variational parameters by samplingthousands of times (for approximating expectations) at each iteration. Instead, we address thisproblem by using Gaussian-Hermite quadratures as in Hensman et al. (2015); Saul et al. (2016). Stochastic Variational Inference.

The conditional expectations in the bound above are also validacross data observations so that we can express the bound as D (cid:88) d =1 N (cid:88) n =1 E q ( f d, ( x n )) ··· q ( f d,Jd ( x n )) (cid:2) log p ( y d ( x n ) | f d, ( x n ) , · · · , f d,J d ( x n )) (cid:3) − Q (cid:88) q =1 KL (cid:0) q ( u q ) || p ( u q ) (cid:1) . This functional form allows the use of mini-batches of smaller sets of training samples, performingthe optimization process using noisy estimates of the global objective gradient in a similar fashionto Hoffman et al. (2013); Hensman et al. (2013, 2015); Saul et al. (2016) . This scalable boundmakes our multi-ouput model applicable to large heterogenous datasets. Notice that computationalcomplexity is dominated by the inversion of K uu with a cost of O ( QM ) and products like K fu with a cost of O ( JN QM ) . Hyperparameter learning.

Hyperparameters in our model include Z , { B q } Qq =1 , and { γ q } Qq =1 , thehyperparameters associated to the covariance functions { k q ( · , · ) } Qq =1 . Since the variational distribu-tion q ( u ) is sensitive to changes of the hyperparameters, we maximize the variational parameters for q ( u ) , and the hyperparameters using a Variational EM algorithm (Beal, 2003) when employing thefull dataset, or the stochastic version when using mini-batches (Hoffman et al., 2013). Consider a set of test inputs X ∗ . Assuming that p ( u | y ) ≈ q ( u ) , the predictive distribution p ( y ∗ ) canbe approximated as p ( y ∗ | y ) ≈ (cid:82) p ( y ∗ | f ∗ ) q ( f ∗ ) d f ∗ , where q ( f ∗ ) = (cid:82) p ( f ∗ | u ) q ( u ) d u . Computingthe expression q ( f ∗ ) = (cid:81) Dd =1 (cid:81) J d j =1 q ( f d,j, ∗ ) involves evaluating K f d,j, ∗ u at X ∗ . As in the case of5he lower bound, the integral above is intractable for the non-Gaussian likelihoods p ( y ∗ | f ∗ ) . We canonce again make use of Monte Carlo integration or quadratures to approximate the integral. Simplerintegration problems are obtained if we are only interested in the predictive mean, E [ y ∗ ] , and thepredictive variance, var[ y ∗ ] . The most closely related works to ours are Skolidis and Sanguinetti (2011), Chai (2012), Dezfouli andBonilla (2015) and Saul et al. (2016). We are different from Skolidis and Sanguinetti (2011) becausewe allow more general heterogeneous outputs beyond the speciﬁc case of several binary classiﬁcationproblems. Our inference method also scales to large datasets. The works by Chai (2012) and Dezfouliand Bonilla (2015) do use a MOGP, but they only handle a single categorical variable. Our inferenceapproach scales when compared to the one in Chai (2012) and it is fundamentally different to the onein Dezfouli and Bonilla (2015) since we do not use AVI. Our model is also different to Saul et al.(2016) since we allow for several dependent outputs,

D > , and our scalable approach is more akinto applying SVI to the inducing variable approach of Álvarez et al. (2010).More recenty, Vanhatalo et al. (2018) used additive multi-output GP models to account for inter-dependencies between counting and binary observations. They use the Laplace approximation forapproximating the posterior distribution. Similarly, Pourmohamad and Lee (2016) perform combinedregression and binary classiﬁcation with a multi-output GP learned via sequential Monte Carlo.Nguyen and Bonilla (2014b) also uses the same idea from Álvarez et al. (2010) to provide scalabilityfor multiple-output GP models conditioning the latent parameter functions f d,j ( x ) on the inducingvariables u , but only considers the multivariate regression case.It is also important to mention that multi-output Gaussian processes have been considered as alterna-tive models for multi-task learning (Alvarez et al., 2012). Multi-task learning also addresses multipleprediction problems together within a single inference framework. Most previous work in this areahas focused on problems where all tasks are exclusively regression or classiﬁcation problems. Whentasks are heterogeneous, the common practice is to introduce a regularizer per data type in a globalcost function (Zhang et al., 2012; Han et al., 2017). Usually, these cost functions are compoundedby additive terms, each one referring to every single task, while the correlation assumption amongheterogeneous likelihoods is addressed by mixing regularizers in a global penalty term (Li et al.,2014) or by forcing different tasks to share a common mean (Ngufor et al., 2015). Another naturalway of treating both continuous and discrete tasks is to assume that all of them share a common inputset that varies its inﬂuence on each output. Then, by sharing a jointly sparsity pattern, it is possibleto optimize a global cost function with a single regularization parameter on the level of sparsity(Yang et al., 2009). There have also been efforts for modeling heterogeneous data outside the labelof multi-task learning including mixed graphical models (Yang et al., 2014), where varied types ofdata are assumed to be combinations of exponential families, and latent feature models (Valera et al.,2017) with heterogeneous observations being mappings of a set of Gaussian distributed variables. In this section, we evaluate our model on different heterogeneous scenarios . To demonstrate itsperformance in terms of multi-output learning, prediction and scalability, we have explored severalapplications with both synthetic and real data. For all the experiments, we consider an RBF kernel foreach covariance function k q ( · , · ) and we set Q = 3 . For standard optimization we used the LBFGS-Balgorithm. When SVI was needed, we considered ADADELTA included in the climin library, and a mini-batch size of samples in every output. All performance metrics are given in terms of thenegative log-predictive density (NLPD) calculated from a test subset and applicable to any type oflikelihood. Further details about experiments are included in the appendix. Missing Gap Prediction:

In our ﬁrst experiment, we evaluate if our model is able to predictobservations in one output using training information from another one. We setup a toy problemwhich consists of D = 2 heterogeneous outputs, where the ﬁrst function y ( x ) is real and y ( x ) binary. Assumming that heterogeneous outputs do not share a common input set, we observe The code is publicly available in the repository github.com/pmorenoz/HetMOGP/ . . . . . . . . . − − − Real Input R e a l O u t pu t Output 1: Gaussian Regression (a) . . . . . . . . . . . . . Real Input B i n a r y O u t pu t Output 2: Binary Classiﬁcation (b) . . . . . . . . . . . . . Real Input B i n a r y O u t pu t Single Output: Binary Classiﬁcation (c)

Figure 1: Comparison between multi-output and single-output performance for two heterogeneoussets of observations. a) Fitted function and uncertainty for the ﬁrst output. It represents the meanfunction parameter µ ( x ) for a Gaussian distribution with σ = 1 . b) Predictive output function forbinary inputs. Blue curve is the ﬁtting function for training data and the red one corresponds topredicting from test inputs (true test binary outputs in red too). c) Same output as Figure 1(b) buttraining an independent Chained GP only in the single binary output (GP binary classiﬁcation). N = 600 and N = 500 samples respectively. All inputs are uniformly distributed in the inputrange [0 , , and we generate a gap only in the set of binary observations by removing N test = 150 samples in the interval [0 . , . . Using the remaining points from both outputs for training, we ﬁttedour MOGP model. In Figures 1(a,b) we can see how uncertainty in binary test predictions is reducedby learning from the ﬁrst output. In contrast, Figure 1(c) shows wider variance in the predictedparameter when it is trained independently. For the multi-output case we obtained a NLPD value fortest data of . ± . × − while in the single-output case the NLPD was . ± . × − . Human Behavior Data:

In this experiment, we are interested in modeling human behavior inpsychiatric patients. Previous work by Soleimani et al. (2018) already explores the application ofscalable MOGP models to healthcare for reliable predictions from multivariate time-series. Ourdata comes from a medical study that asked patients to download a monitoring app (EB2) ontheir smartphones. The system captures information about mobility, communication metadata andinteractions in social media. The work has a particular interest in mental health since shifts ormisalignments in the circadian feature of human behavior (24h cycles) can be interpreted as earlysigns of crisis. Table 1: Behavior Dataset Test-NLPD ( × − ) Bernoulli Heteroscedastic Bernoulli Global

HetMOGP . ± .

21 6 . ± . . ± . . ± . ChainedGP . ± .

30 7 . ± . . ± . . ± . M o nd a y T u e s d a y W e dn e s d a y T hu r s d a y F r i d a y S a t u r d a y Sund a y . . . . O u t pu t : B i n a r y Presence/Absence at Home M o nd a y T u e s d a y W e dn e s d a y T hu r s d a y F r i d a y S a t u r d a y Sund a y − − O u t pu t : L og - d i s t a n ce Distance from Home (Km) M o nd a y T u e s d a y W e dn e s d a y T hu r s d a y F r i d a y S a t u r d a y Sund a y . . . . O u t pu t : B i n a r y Use/non-use of Whatsapp

Figure 2: Results for multi-output modeling of human behavior. After training, all output predictionsshare a common (daily) periodic pattern.In particular, we obtained a binary indicator variable of presence/absence at home by monitoring latitude - longitude and measuring its distance from the patient’s home location within a 50m radiusrange. Then, using the already measured distances, we generated a mobility sequence with alllog-distance values. Our last output consists of binary samples representing use/non-use of the This smartphone application can be found at . hatsapp application in the smartphone. At each monitoring time instant, we used its differentialdata consumption to determine use or non-use of the application. We considered an entire week inseconds as the input domain, normalized to the range [0 , .In Figure (2), after training on N = 750 samples, we ﬁnd that the circadian feature is mainlycontained in the ﬁrst output. During the learning process, this periodicity is transferred to the otheroutputs through the latent functions improving the performance of the entire model. Experimentally,we tested that this circadian pattern was not captured in mobility and social data when training outputsindependently. In Table 1 we can see prediction metrics for multi-output and independent prediction. London House Price Data:

Based on the large scale experiments in Hensman et al. (2013),we obtained the complete register of properties sold in the Greater London County during 2017( ). We preprocessed it totranslate all property addresses to latitude - longitude points. For each spatial input, we consideredtwo observations, one binary and one real. The ﬁrst one indicates if the property is or is not a ﬂat(zero would mean detached, semi-detached, terraced, etc.. ), and the second one the sale price ofhouses. Our goal is to predict features of houses given a certain location in the London area. We useda training set of N = 20 , samples, , for test predictions and M = 100 inducing points.Table 2: London Dataset Test-NLPD ( × − ) Bernoulli Heteroscedastic Global

HetMOGP . ± .

46 10 . ± .

64 16 . ± . ChainedGP . ± .

25 10 . ± .

03 17 . ± . -0.51 -0.34 -0.17 -0.0 0.16 0.3351.2951.3751.4551.5351.6151.69 Longitude L a t i t ud e Property Type FlatOther -0.51 -0.34 -0.17 -0.0 0.16 0.3351.2951.3751.4551.5351.6151.69

Longitude L a t i t ud e Sale Price £ £ £ £ £ -0.51 -0.34 -0.17 -0.0 0.16 0.3351.2951.3751.4551.5351.6151.69 Longitude L a t i t ud e Log-price Variance

Figure 3: Results for spatial modeling of heterogeneous data. (Top row) of training samples forthe Greater London County. Binary outputs are the type of property sold in and real ones areprices included in sale contracts. (Bottom row) Test prediction curves for N test = 2 , inputs.Results in Figure (3) show a portion of the entire heterogeneous dataset and its test predictioncurves. We obtained a global NLPD score of . ± . using the MOGP and . ± . in theindependent outputs setting (both × − ). There is an improvement in performance when trainingour multi-output model even in large scale datasets. See Table (2) for scores per each output. High Dimensional Input Data:

In our last experiment, we tested our MOGP model for the ar-rhythmia dataset in the UCI repository ( http://archive.ics.uci.edu/ml/ ). We use a datasetof dimensionality p = 255 and samples that we divide in training, validation and test sets8more details are in the appendix). We use our model for predicting a binary output (gender) and acontinuous output (logarithmic age) and we compared against independent Chained GPs per output.The binary output is modelled as a Bernoulli distribution and the continuous one as a Gaussian. Weobtained an average NLPD value of . for both multi-output and independent output modelswith a slight difference in the standard deviation. In this paper we have introduced a novel extension of multi-output Gaussian Processes for handlingheterogeneous observations. Our model is able to work on large scale datasets by using sparseapproximations within stochastic variational inference. Experimental results show relevant improve-ments with respect to independent learning of heterogeneous data in different scenarios. In futurework it would be interesting to employ convolutional processes (CPs) as an alternative to build themulti-output GP prior. Also, instead of typing hand-made deﬁnitions of heterogeneous likelihoods,we may consider to automatically discover them (Valera and Ghahramani, 2017) as an input block ina pipeline setup of our tool.

Acknowledgments

The authors want to thank Wil Ward for his constructive comments and Juan José Giraldo for hisuseful advice about SVI experiments and simulations. We also thank Alan Saul and David Ramírezfor their recommendations about scalable inference and feedback on the equations. We are grateful toEero Siivola and Marcelo Hartmann for sharing their Python module for heterogeneous likelihoodsand to Francisco J. R. Ruiz for his illuminating help about the stochastic version of the VEM algorithm.Also, we would like to thank Juan José Campaña for his assistance on the London House Pricedataset. Pablo Moreno-Muñoz acknowledges the support of his doctoral FPI grant BES2016-077626and was also supported by Ministerio de Economía of Spain under the project Macro-ADOBE(TEC2015-67719-P), Antonio Artés-Rodríguez acknowledges the support of projects ADVENTURE(TEC2015-69868-C2-1-R), AID (TEC2014-62194-EXP) and CASI-CAM-CM (S2013/ICE-2845).Mauricio A. Álvarez has been partially ﬁnanced by the Engineering and Physical Research Council(EPSRC) Research Projects EP/N014162/1 and EP/R034303/1.

References

M. Alvarez and N. D. Lawrence. Sparse convolved Gaussian processes for multi-output regression. In

NIPS 21 ,pages 57–64, 2009.M. Álvarez, D. Luengo, M. Titsias, and N. Lawrence. Efﬁcient multioutput Gaussian processes throughvariational inducing kernels. In

AISTATS , pages 25–32, 2010.M. A. Alvarez, L. Rosasco, N. D. Lawrence, et al. Kernels for vector-valued functions: A review.

Foundationsand Trends in Machine Learning , 4(3):195–266, 2012.M. J. Beal. Variational algorithms for approximate Bayesian inference.

Ph. D. Thesis, University CollegeLondon , 2003.E. V. Bonilla, K. M. Chai, and C. Williams. Multi-task Gaussian process prediction. In

NIPS 20 , pages 153–160,2008.K. M. A. Chai. Variational multinomial logit Gaussian process.

Journal of Machine Learning Research , 13:1745–1808, 2012.Z. Dai, M. A. Álvarez, and N. Lawrence. Efﬁcient modeling of latent information in supervised learning usingGaussian processes. In

NIPS 30 , pages 5131–5139, 2017.A. Dezfouli and E. V. Bonilla. Scalable inference for Gaussian process models with black-box likelihoods. In

NIPS 28 , pages 1414–1422. 2015.J. D. Hadﬁeld et al. MCMC methods for multi-response generalized linear mixed models: the MCMCglmm Rpackage.

Journal of Statistical Software , 33(2):1–22, 2010.H. Han, A. K. Jain, S. Shan, and X. Chen. Heterogeneous face attribute estimation: A deep multi-task learningapproach.

IEEE transactions on pattern analysis and machine intelligence , 2017. . Hensman, N. Fusi, and N. D. Lawrence. Gaussian processes for big data. In Uncertainty in ArtiﬁcialIntelligence , pages 282–290, 2013.J. Hensman, A. G. d. G. Matthews, and Z. Ghahramani. Scalable variational Gaussian process classiﬁcation. In

AISTATS , pages 351–360, 2015.D. M. Higdon. Space and space-time modelling using process convolutions. In

Quantitative methods for currentenvironmental issues , pages 37–56, 2002.M. D. Hoffman, D. M. Blei, C. Wang, and J. Paisley. Stochastic variational inference.

Journal of MachineLearning Research , 14(1):1303–1347, 2013.A. G. Journel and C. J. Huijbregts.

Mining Geostatistics . Academic Press, London, 1978.M. Lázaro-Gredilla and M. Titsias. Variational heteroscedastic Gaussian process regression. In

ICML , pages841–848, 2011.S. Li, Z.-Q. Liu, and A. B. Chan. Heterogeneous multi-task learning for human pose estimation with deepconvolutional neural network. In

CVPR , pages 482–489, 2014.C. Ngufor, S. Upadhyaya, D. Murphree, N. Madde, D. Kor, and J. Pathak. A heterogeneous multi-task learningfor predicting RBC transfusion and perioperative outcomes. In

AIME , pages 287–297. Springer, 2015.T. V. Nguyen and E. V. Bonilla. Automated variational inference for Gaussian process models. In

NIPS 27 ,pages 1404–1412. 2014a.T. V. Nguyen and E. V. Bonilla. Collaborative multi-output Gaussian processes. In

UAI , 2014b.G. Parra and F. Tobar. Spectral mixture kernels for multi-output Gaussian processes. In

NIPS 30 , 2017.T. Pourmohamad and H. K. H. Lee. Multivariate stochastic process models for correlated responses of mixedtype.

Bayesian Analysis , 11(3):797–820, 2016.A. D. Saul, J. Hensman, A. Vehtari, and N. D. Lawrence. Chained Gaussian processes. In

AISTATS , pages1431–1440, 2016.G. Skolidis and G. Sanguinetti. Bayesian multitask classiﬁcation with Gaussian process priors.

IEEE Transactionson Neural Networks , 22(12), 2011.H. Soleimani, J. Hensman, and S. Saria. Scalable joint models for reliable uncertainty-aware event prediction.

IEEE Transactions on Pattern Analysis and Machine Intelligence , 40(8):1948–1963, 2018.Y. Teh, M. Seeger, and M. Jordan. Semiparametric latent factor models. In

AISTATS , pages 333–340, 2005.K. R. Ulrich, D. E. Carlson, K. Dzirasa, and L. Carin. GP kernels for cross-spectrum analysis. In

NIPS 28 , 2015.I. Valera and Z. Ghahramani. Automatic discovery of the statistical types of variables in a dataset. In

ICML ,pages 3521–3529, 2017.I. Valera, M. F. Pradier, M. Lomeli, and Z. Ghahramani. General latent feature models for heterogeneous datasets. arXiv preprint arXiv:1706.03779 , 2017.J. Vanhatalo, J. Riihimäki, J. Hartikainen, P. Jylänki, V. Tolvanen, and A. Vehtari. GPstuff: Bayesian modelingwith Gaussian processes.

Journal of Machine Learning Research , 14(1):1175–1179, 2013.J. Vanhatalo, M. Hartmann, and L. Veneranta. Joint species distribution modeling with additive multivariateGaussian process priors and heteregenous data. arXiv preprint arXiv:1809.02432 , 2018.A. G. Wilson and Z. Ghahramani. Generalised Wishart processes. In

UAI , pages 736–744, 2011.E. Yang, P. Ravikumar, G. I. Allen, Y. Baker, Y.-W. Wan, and Z. Liu. A general framework for mixed graphicalmodels. arXiv preprint arXiv:1411.0288 , 2014.X. Yang, S. Kim, and E. P. Xing. Heterogeneous multitask learning with joint sparsity constraints. In

NIPS 22 ,pages 2151–2159, 2009.D. Zhang, D. Shen, A. D. N. Initiative, et al. Multi-modal multi-task learning for joint prediction of multipleregression and classiﬁcation variables in Alzheimer’s disease.

NeuroImage , 59(2):895–907, 2012., 59(2):895–907, 2012.