MAGMA: Inference and Prediction with Multi-Task Gaussian Processes
Arthur Leroy
Université de Paris, CNRS, MAP5 UMR 8145, F-75006 Paris — [email protected]
Pierre Latouche
Université de Paris, CNRS, MAP5 UMR 8145, F-75006 Paris — [email protected]
Benjamin Guedj
Inria, France and University College London, United Kingdom — [email protected]
Servane Gey
Université de Paris, CNRS, MAP5 UMR 8145, F-75006 Paris — [email protected]

July 22, 2020
We investigate the problem of multiple time series forecasting, with the objective of improving multiple-step-ahead predictions. We propose a multi-task Gaussian process framework to simultaneously model batches of individuals with a common mean function and a specific covariance structure. This common mean is defined as a Gaussian process for which the hyper-posterior distribution is tractable. Therefore an EM algorithm can be derived for simultaneous hyper-parameter optimisation and hyper-posterior computation. Unlike previous approaches in the literature, we account for uncertainty and handle uncommon grids of observations while maintaining explicit formulations, by modelling the mean process in a non-parametric probabilistic framework. We also provide predictive formulas integrating this common mean process. This approach greatly improves the predictive performance far from observations, where information shared across individuals provides a relevant prior mean. Our overall algorithm is called Magma (standing for Multi tAsk Gaussian processes with common MeAn), and is publicly available as an R package. The quality of the mean process estimation, predictive performances, and comparisons to alternatives are assessed in various simulated scenarios and on real datasets.
Keywords:
Multi-task learning, Gaussian process, EM algorithm, Common mean process, Functional data analysis.
Gaussian processes (GPs) are a powerful tool, widely used in machine learning (Bishop, 2006; Rasmussen and Williams, 2006). The classic context of regression aims at inferring the underlying mapping function associating input to output data. In a probabilistic framework, a typical strategy is to assume that this function is drawn from a prior GP. Doing so, we may enforce some properties of the function solely by characterising the mean and covariance function of the process, the latter often being associated with a specific kernel. This covariance function plays a central role, and GPs are an example of kernel methods. We refer to Álvarez et al. (2012) for a comprehensive review. The mean function is generally set to 0 for all entries, assuming that the covariance structure already integrates the desired relationship between observed data and prediction targets. In this paper, we consider a multi-task learning framework with a series of Gaussian processes sharing a common mean function. We demonstrate that modelling this function can be key to obtaining relevant predictions.
Related work
A major drawback of GPs lies in the O(N³) computational cost of the training step, where N denotes the number of observations in the training sample. Many approaches to mitigate this problem with sparse approximations have been proposed in the last two decades. One of the most popular methods can be found in Snelson and Ghahramani (2006), introducing elegant ideas to select pseudo-inputs, and a subsequent review came in Quiñonero-Candela et al. (2007). Titsias (2009) proposed to use variational inference for sparse GPs, and Hensman et al. (2013) extended the idea to larger data sets, whereas Banerjee et al. (2013) used linear projections onto low-dimensional subspaces. Besides, some state-of-the-art approximations have been theoretically studied in Bauer et al. (2016). Another approach to deal with numerical issues has recently been proposed in Wilson et al. (2020), to sample efficiently from GPs in MCMC algorithms. Bijl et al. (2015) proposed an online version of some of the sparse approximations mentioned above, while Clingerman and Eaton (2017) and Moreno-Muñoz et al. (2019) developed continual learning methods for multi-task GPs.

The multi-task framework consists in using data from several tasks (or batches of individuals) to improve the learning or predictive capacities compared to an isolated model. It was introduced by Caruana (1997) and then adapted in many fields of machine learning. GP versions of such models were introduced by Schwaighofer et al. (2004), who proposed an EM algorithm for learning. Similar techniques can be found in Shi et al. (2005). Meanwhile, Yu et al. (2005) offered an extensive study of the relationships between the linear model and GPs to develop a multi-task GP formulation. However, since the introduction in Bonilla et al. (2008) of the idea of two matrices modelling covariance between inputs and tasks respectively, the term multi-task Gaussian process has mostly referred to the choice made regarding the covariance structure. Some further developments were discussed by Hayashi et al. (2012), Rakitsch et al. (2013) and Zhu and Sun (2014). Let us also mention the work of Swersky et al. (2013) on Bayesian hyper-parameter optimisation in such models. Real applications were tackled by similar models in Williams et al. (2009) and Alaa and van der Schaar (2017).

As we focus on multi-task time series forecasting, there is an immediate connection to the study of multiple curves, or functional data analysis (FDA). As initially proposed in Rice and Silverman (1991), it is possible to model and learn mean and covariance structures simultaneously in this context. We also refer to the monographs (Ramsay and Silverman, 2005; Ferraty and Vieu, 2006). In particular, these books introduced several usual ways to model a set of functional objects in frequentist frameworks, for example by using a decomposition in a basis of functions (such as B-splines, wavelets, or Fourier bases). Subsequently, some Bayesian alternatives were developed in Thompson and Rosen (2008) and Crainiceanu and Goldsmith (2010).

Our contributions
Our aim is to define a multi-task GP framework with a common mean process, allowing reliable probabilistic forecasts even in multiple-step-ahead problems, or for sparsely observed individuals. For this purpose, (i) we introduce a GP model where the specific covariance structure of each individual is defined through a kernel and its associated set of hyper-parameters, whereas a mean function µ overcomes the weaknesses of classic GPs in making predictions far from observed data. To account for its uncertainty, we propose to define the common mean process µ as a GP as well. (ii) We derive an algorithm called Magma (available as an R package at https://github.com/ArthurLeroy/MAGMA) to compute µ's hyper-posterior distribution together with the estimation of hyper-parameters in an EM fashion, and discuss its computational complexity. (iii) We enrich Magma with explicit formulas to make predictions for a new, partially observed, individual. The hyper-posterior distribution of µ provides a prior belief on what we expect to observe before seeing any of the new individual's data, as an already-informed process integrating both trend and uncertainty coming from other individuals. (iv) We illustrate the performance of our method on synthetic and two real-life datasets, and obtain state-of-the-art results compared to alternative approaches.

Outline
The paper is organised as follows. We introduce our multi-task Gaussian process model in Sec. 2, along with notation. Sec. 3 is devoted to the inference procedure, with an Expectation-Maximisation (EM) algorithm to estimate the Gaussian process hyper-parameters. We leverage this strategy in Sec. 4 and derive a prediction algorithm. In Sec. 5, we analyse and discuss the computational complexity of both the inference and prediction procedures. Our methodology is illustrated in Sec. 6, with a series of experiments on both synthetic and real-life datasets, and a comparison to competing state-of-the-art algorithms. On those tasks, we provide empirical evidence that our algorithm outperforms other approaches. Sec. 7 draws perspectives for future work, and we defer all proofs of original results claimed in the paper to Sec. 8.

Figure 1: Graphical model of dependencies between variables in the multi-task Gaussian process model.
While GPs can be used for many types of data, their continuous nature makes them particularly well suited to studying temporal phenomena. Throughout, we use the term individual as a synonym of task or batch, and adopt the notation and vocabulary of time series to remain consistent with the real dataset application we provide in Sec. 6.5, which addresses the forecasting of young swimmers' performances. These time series are considered as pointwise observations of functions that we try to reconstruct thanks to the following generative model.

We are provided with functional data coming from M ∈ I different individuals, where I ⊂ N. For each individual i, we observe a set of inputs and outputs {(t_i^1, y_i(t_i^1)), ..., (t_i^{N_i}, y_i(t_i^{N_i}))}, where N_i is the number of data points for the i-th individual. Since many objects are defined for all individuals, we shorten our notation as follows: for any object x existing for all i, we denote {x_i}_i = {x_1, ..., x_M}. Moreover, as we work in a temporal context, the inputs {t_i^k}_{i,k} are referred to as timestamps. In the specific case where all individuals are observed at the same timestamps, we call the grid of observations common. On the contrary, a grid of observations is uncommon if the timestamps are different in number and/or location among the individuals. Some convenient notation:

• t_i = {t_i^1, ..., t_i^{N_i}}, the set of timestamps for the i-th individual,
• y_i = y_i(t_i), the vector of outputs for the i-th individual,
• t = ∪_{i=1}^{M} t_i, the pooled set of timestamps among individuals,
• N = card(t), the total number of observed timestamps.

We suppose that the functional data come from the sum of a mean process, common to all individuals, and an individual-specific centred process. To clarify the relationships in the generative model, we illustrate our graphical model in Fig. 1. Letting T be the input space, our model is

y_i(t) = µ(t) + f_i(t) + ε_i(t),  t ∈ T,  i = 1, ..., M,   (1)

where µ(·) ~ GP(m(·), K_{θ₀}(·, ·)) is the common mean process and f_i(·) ~ GP(0, Σ_{θ_i}(·, ·)) the individual-specific process. Moreover, the error term is supposed to be ε_i(·) ~ GP(0, σ_i² I). The following notation is used for the parameters:

• K_{θ₀}(·, ·), a covariance function of hyper-parameters θ₀,
• ∀i, Σ_{θ_i}(·, ·), a covariance function of hyper-parameters θ_i,
• σ_i² ∈ R⁺, the noise term for individual i,
• ∀i, Ψ_{θ_i, σ_i²}(·, ·) = Σ_{θ_i}(·, ·) + σ_i² Id,
• Θ = {θ₀, {θ_i}_i, {σ_i²}_i}, the set of all hyper-parameters of the model.

We also assume that

• {f_i}_i are independent,
• {ε_i}_i are independent,
• ∀i, µ and f_i are independent.

It follows that {y_i | µ}_{i=1,...,M} are independent from one another, and for all i ∈ I:

y_i(·) | µ(·) ~ GP(µ(·), Ψ_{θ_i, σ_i²}(·, ·)).   (2)

Although this model is based on infinite-dimensional GPs, the inference is conducted on a finite grid of observations. According to the aforementioned notation, we observe {(t_i, y_i)}_i, and the corresponding likelihoods are Gaussian:

y_i | µ(t_i) ~ N(y_i; µ(t_i), Ψ_{θ_i, σ_i²}^{t_i}),   (3)

where Ψ_{θ_i, σ_i²}^{t_i} = Ψ_{θ_i, σ_i²}(t_i, t_i) = [Ψ_{θ_i, σ_i²}(k, ℓ)]_{k,ℓ ∈ t_i} is an N_i × N_i covariance matrix.
Since the t_i might differ among individuals, we also need to evaluate µ on the pooled grid t:

µ(t) ~ N(µ(t); m(t), K_{θ₀}^t),   (4)

where K_{θ₀}^t = K_{θ₀}(t, t) = [K_{θ₀}(k, ℓ)]_{k,ℓ ∈ t} is an N × N covariance matrix.

An alternative hypothesis consists in considering the hyper-parameters {θ_i}_i and {σ_i²}_i equal for all individuals. We call this hypothesis Common HP in Sec. 6. This particular case models a context where the individuals represent different trajectories of the same process, whereas different hyper-parameters indicate different covariance structures and thus a more flexible model. For the sake of generality, the remainder of the paper is written with the θ_i and σ_i² notation when there is no difference in the procedure. Moreover, the model above and the subsequent algorithm may use any covariance function parametrised by a finite (usually small) set of hyper-parameters. For example, a common kernel in the GP literature is the Exponentiated Quadratic kernel (also sometimes called the Squared Exponential or Radial Basis Function kernel). It depends only on two hyper-parameters θ = {v, ℓ} and is defined as:

k_EQ(x, x') = v exp( −(x − x')² / (2ℓ²) ).   (5)

The Exponentiated Quadratic kernel is simple and enjoys useful smoothness properties. This is the kernel we use in our implementation (see Sec. 6 for details). Note that there is a rich literature on kernel choice, construction and properties, which is beyond the scope of the paper: we refer to Rasmussen and Williams (2006) or Duvenaud (2014) for comprehensive studies.
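For concreteness, the following minimal Python sketch (our own illustration, not the MAGMA R package) evaluates the Exponentiated Quadratic kernel of Eq. (5) on grids of timestamps and builds the individual covariance Ψ_{θ_i, σ_i²}; the function names and default values are assumptions made for the example.

```python
import numpy as np

def eq_kernel(x, x_prime, v=1.0, ell=1.0):
    """Exponentiated Quadratic kernel of Eq. (5): v * exp(-(x - x')^2 / (2 * ell^2))."""
    diff = np.subtract.outer(np.asarray(x, float), np.asarray(x_prime, float))
    return v * np.exp(-diff**2 / (2.0 * ell**2))

def psi_matrix(t, v, ell, sigma2):
    """Individual covariance Psi_{theta_i, sigma_i^2} = Sigma_{theta_i} + sigma_i^2 * Id on grid t."""
    t = np.asarray(t, float)
    return eq_kernel(t, t, v, ell) + sigma2 * np.eye(len(t))

# Example: covariance of y_i on its own grid of timestamps.
t_i = np.linspace(0.0, 10.0, 30)
Psi_i = psi_matrix(t_i, v=2.0, ell=1.5, sigma2=0.1)
```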
Several approaches to learn hyper-parameters for Gaussian processes have been proposed in the literature; we refer to Rasmussen and Williams (2006) for a comprehensive study. One classical approach, called empirical Bayes (Casella, 1985), is based on the maximisation of an explicit likelihood to estimate the hyper-parameters. This procedure avoids sampling from intractable distributions, which usually results in additional computational cost and complicates practical use for moderate to large sample sizes. However, since the likelihood of the model depends on µ, we cannot maximise it directly. Therefore, we propose an EM algorithm (see the pseudocode in Algorithm 1) to learn the hyper-parameters Θ. The procedure alternately computes the hyper-posterior distribution p(µ | {y_i}_i, Θ̂) with the current hyper-parameters, and then optimises Θ according to this hyper-posterior distribution. This EM algorithm converges to local maxima (Dempster et al., 1977), typically in a handful of iterations.
E step
For the sake of simplicity, we assume in this section that for all i, j, t_i = t_j = t, i.e. the individuals are observed on a common grid of timestamps. The E step then consists in computing the hyper-posterior distribution of µ(t).

Proposition 3.1.
Assume the hyper-parameters Θ̂ are known from initialisation or estimated from a previous M step. The hyper-posterior distribution of µ remains Gaussian:

p(µ(t) | {y_i}_i, Θ̂) = N(µ(t); m̂(t), K̂^t),   (6)

with

• K̂^t = ( (K_{θ̂₀}^t)⁻¹ + Σ_{i=1}^{M} (Ψ_{θ̂_i, σ̂_i²}^t)⁻¹ )⁻¹,
• m̂(t) = K̂^t ( (K_{θ̂₀}^t)⁻¹ m(t) + Σ_{i=1}^{M} (Ψ_{θ̂_i, σ̂_i²}^t)⁻¹ y_i ).

Proof.
We omit specifying timestamps in what follows, since each process is evaluated on t.

p(µ | {y_i}_i, Θ̂) ∝ p({y_i}_i | µ, Θ̂) p(µ | Θ̂)
                  ∝ { ∏_{i=1}^{M} p(y_i | µ, θ̂_i, σ̂_i²) } p(µ | θ̂₀)
                  ∝ { ∏_{i=1}^{M} N(y_i; µ, Ψ_{θ̂_i, σ̂_i²}) } N(µ; m, K_{θ̂₀}).

The term L = −2 log p(µ | {y_i}_i, Θ̂) may then be written, up to an additive constant C, as

L = Σ_{i=1}^{M} (y_i − µ)ᵀ Ψ_{θ̂_i, σ̂_i²}⁻¹ (y_i − µ) + (µ − m)ᵀ K_{θ̂₀}⁻¹ (µ − m) + C
  = Σ_{i=1}^{M} ( µᵀ Ψ_{θ̂_i, σ̂_i²}⁻¹ µ − 2 µᵀ Ψ_{θ̂_i, σ̂_i²}⁻¹ y_i ) + µᵀ K_{θ̂₀}⁻¹ µ − 2 µᵀ K_{θ̂₀}⁻¹ m + C
  = µᵀ ( K_{θ̂₀}⁻¹ + Σ_{i=1}^{M} Ψ_{θ̂_i, σ̂_i²}⁻¹ ) µ − 2 µᵀ ( K_{θ̂₀}⁻¹ m + Σ_{i=1}^{M} Ψ_{θ̂_i, σ̂_i²}⁻¹ y_i ) + C.

Identifying the terms of this quadratic form with those of a Gaussian log-density, we get the desired result.

Let us stress here that the above result assumes common timestamps among individuals, which is a simplified setting. We provide a generalisation of this proposition in Sec. 4: Proposition 4.1 holds with uncommon grids of timestamps t_i.

The maximisation step depends on the assumptions of the generative model, resulting in two versions of the EM algorithm (the E step is common to both; the branching point is here).
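Before turning to the M step, here is a minimal NumPy sketch of the E step of Proposition 3.1 on a common grid. It is an illustration written under the notation above, not the package implementation; names are ours.

```python
import numpy as np

def e_step(t, y_list, m_prior, K_theta0, Psi_list):
    """Hyper-posterior of mu(t) (Proposition 3.1) on a common grid t.

    t        : (N,) common grid of timestamps
    y_list   : list of M vectors y_i of length N
    m_prior  : (N,) prior mean m(t)
    K_theta0 : (N, N) prior covariance K_{theta_0}(t, t)
    Psi_list : list of M (N, N) matrices Psi_{theta_i, sigma_i^2}(t, t)
    """
    K_inv = np.linalg.inv(K_theta0)
    Psi_inv = [np.linalg.inv(P) for P in Psi_list]
    # Posterior covariance: (K^-1 + sum_i Psi_i^-1)^-1
    K_hat = np.linalg.inv(K_inv + sum(Psi_inv))
    # Posterior mean: K_hat (K^-1 m + sum_i Psi_i^-1 y_i)
    m_hat = K_hat @ (K_inv @ m_prior
                     + sum(P_inv @ y for P_inv, y in zip(Psi_inv, y_list)))
    return m_hat, K_hat
```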
M step: different hyper-parameters

Assuming each individual has its own set of hyper-parameters {θ_i, σ_i²}, the M step is given by the following.

Proposition 3.2.
Assume p(µ | {y_i}_i) = N(µ(t); m̂(t), K̂^t), given by a previous E step. For a set of hyper-parameters Θ = {θ₀, {θ_i}_i, {σ_i²}_i}, optimal values are given by

Θ̂ = argmax_Θ E_{µ | {y_i}_i} [ log p({y_i}_i, µ(t) | Θ) ],

inducing M + 1 independent maximisation problems:

θ̂₀ = argmax_{θ₀} L_t( m̂(t); m(t), K_{θ₀}^t ),
(θ̂_i, σ̂_i²) = argmax_{θ_i, σ_i²} L_{t_i}( y_i; m̂(t), Ψ_{θ_i, σ_i²}^{t_i} ),  ∀i,

where L_t(x; m, S) = log N(x; m, S) − ½ Tr( K̂^t S⁻¹ ).

Proof.
One simply has to distribute the conditional expectation in order to obtain the right likelihood to maximise, and then notice that the function can be written as a sum of M + 1 terms that are independent with respect to the hyper-parameters. Moreover, by rearranging, one can observe that each independent term is the sum of a Gaussian log-likelihood and a correction trace term. See Sec. 8 for details.
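As an illustration, the following sketch implements the corrected likelihood L_t (as reconstructed above, i.e. a Gaussian log-density minus a trace correction) that each M-step maximisation targets. The EQ covariance construction and the log-scale parametrisation are our illustrative choices, not necessarily the package's.

```python
import numpy as np
from scipy.stats import multivariate_normal

def penalised_log_lik(x, m, S, K_hat):
    """L_t(x; m, S) = log N(x; m, S) - 0.5 * tr(K_hat S^{-1}),
    the corrected Gaussian log-likelihood appearing in Propositions 3.2 and 3.3."""
    log_pdf = multivariate_normal.logpdf(x, mean=m, cov=S)
    correction = 0.5 * np.trace(K_hat @ np.linalg.inv(S))
    return log_pdf - correction

def objective_i(params, t_i, y_i, m_hat_ti, K_hat_ti):
    """Negative objective for one individual's (theta_i, sigma_i^2), to be minimised
    (e.g. with scipy.optimize.minimize); params are on the log scale for positivity."""
    v, ell, sigma2 = np.exp(params)
    # Psi_{theta_i, sigma_i^2}^{t_i} with an EQ kernel plus noise on the diagonal.
    d = np.subtract.outer(np.asarray(t_i, float), np.asarray(t_i, float))
    S = v * np.exp(-d**2 / (2.0 * ell**2)) + sigma2 * np.eye(len(t_i))
    return -penalised_log_lik(y_i, m_hat_ti, S, K_hat_ti)
```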
M step: common hyper-parameters
Alternatively, assuming all individuals share the same set of hyper-parameters {θ, σ²}, the M step is given by the following.

Proposition 3.3.
Assume p(µ | {y_i}_i) = N(µ(t); m̂(t), K̂^t), given by a previous E step. For a set of hyper-parameters Θ = {θ₀, θ, σ²}, optimal values are given by

Θ̂ = argmax_Θ E_{µ | {y_i}_i} [ log p({y_i}_i, µ(t) | Θ) ],

inducing two independent maximisation problems:

θ̂₀ = argmax_{θ₀} L_t( m̂(t); m(t), K_{θ₀}^t ),
(θ̂, σ̂²) = argmax_{θ, σ²} L_M(θ, σ²),

where L_M(θ, σ²) = Σ_{i=1}^{M} L_{t_i}( y_i; m̂(t), Ψ_{θ, σ²}^{t_i} ).

Proof.
We use the same strategy as for Proposition 3.2; see Sec. 8 for details.

In both cases, explicit gradients of the likelihoods to maximise are available, facilitating the optimisation with gradient-based methods.
To implement the EM algorithm described above, several constants must be (appropriately) initialised:

• m(·), the mean parameter of the hyper-prior distribution of the mean process µ(·). A somewhat classical choice in GP is to set its value to a constant, typically 0 in the absence of external knowledge. Notice that, in our multi-task framework, the influence of m(·) in the hyper-posterior computation decreases quickly as M grows.
• Initial values for the kernel parameters θ₀ and {θ_i}_i. Those strongly depend on the chosen kernel and its properties. We advise initiating θ₀ and {θ_i}_i with close values, as a too large difference might induce a nearly singular covariance matrix and result in numerical instability. In such a pathological regime, the influence of a specific individual tends to overtake the others in the computation of µ's hyper-posterior distribution.
• Initial values for the variances of the error terms {σ_i²}_i. This choice mostly depends on the context and properties of the dataset. We suggest avoiding initial values more than an order of magnitude away from the variability of the data. In particular, a too high value might result in a model mostly capturing noise.

As a final note, let us stress that the EM algorithm depends on the initialisation and is only guaranteed to converge to local maxima of the likelihood function (Krishnan and McLachlan, 1997). Several strategies have been considered in the literature to tackle this issue, such as simulated annealing and the use of multiple initialisations (Biernacki et al., 2003). In this paper, we choose the latter option.

We wrap up this section with the pseudocode of the EM component of our complete algorithm, which we call
Magma (standing for Multi tAsk Gaussian processes with common MeAn). The corresponding code is available at https://github.com/ArthurLeroy/MAGMA.

Algorithm 1
Magma: EM component
Initialise m and Θ = {θ₀, {θ_i}_i, {σ_i²}_i}.
while not converged do
    E step: compute the hyper-posterior distribution p(µ | {y_i}_i, Θ) = N(m̂, K̂).
    M step: estimate the hyper-parameters by maximising Θ̂ = argmax_Θ E_{µ | {y_i}_i} [ log p(µ, {y_i}_i | Θ) ].
end while
return Θ̂, m̂, K̂.
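Putting the previous sketches together, an illustrative EM loop mirroring Algorithm 1 could look as follows. It reuses e_step and penalised_log_lik from the sketches above; the fixed number of iterations (instead of a convergence test) and the Nelder–Mead optimiser are simplifying assumptions, not the package's exact implementation.

```python
import numpy as np
from scipy.optimize import minimize

def magma_em_sketch(t, y_list, m_prior, make_cov, theta0, thetas, n_iter=25):
    """Toy EM loop in the spirit of Algorithm 1, on a common grid t.

    make_cov(t, params) -> positive-definite covariance on grid t for a parameter
                           vector (e.g. EQ kernel, plus a noise/jitter term for Psi_i).
    """
    for _ in range(n_iter):
        # E step: hyper-posterior p(mu | {y_i}, Theta_hat) with current hyper-parameters.
        Psi_list = [make_cov(t, th) for th in thetas]
        m_hat, K_hat = e_step(t, y_list, m_prior, make_cov(t, theta0), Psi_list)

        # M step: one corrected-likelihood maximisation for theta_0 ...
        neg0 = lambda p: -penalised_log_lik(m_hat, m_prior, make_cov(t, np.exp(p)), K_hat)
        theta0 = np.exp(minimize(neg0, np.log(theta0), method="Nelder-Mead").x)

        # ... and one per individual for (theta_i, sigma_i^2) (Proposition 3.2).
        new_thetas = []
        for y_i, th in zip(y_list, thetas):
            neg_i = lambda p: -penalised_log_lik(y_i, m_hat, make_cov(t, np.exp(p)), K_hat)
            new_thetas.append(np.exp(minimize(neg_i, np.log(th), method="Nelder-Mead").x))
        thetas = new_thetas
    return theta0, thetas, m_hat, K_hat
```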
Let us stress that, even though we focus on prediction purposes in this paper, the output of the EM algorithm already provides results on related FDA problems. The generative model in Yang et al. (2016) describes a Bayesian framework that resembles ours to smooth multiple curves simultaneously. However, modelling the variance structure with an inverse-Wishart process forces the use of an MCMC algorithm for inference, or the introduction of a more tractable approximation in Yang et al. (2017).
One can think of learning through Magma followed by single-task GP regression on each individual as an empirical Bayes counterpart to their approach. Meanwhile, µ's hyper-posterior distribution also provides a probabilistic estimation of a mean curve from a set of functional data. The closest method to our approach can be found in Shi et al. (2007) and the following book, Shi and Choi (2011), although in several aspects the authors dealt with more general features such as multidimensional or non-functional inputs. The authors also work in the context of a multi-task GP model, and one can retrieve the idea of defining a mean function µ to overcome the weaknesses of classic GPs in making predictions far from observed data. Since their model uses B-splines to estimate this mean function, thanks to information from multiple individuals, this method only works if all individuals share the same grid of observation, and does not account for uncertainty over µ.

Once the hyper-parameters of the model have been learned, we can focus on our main goal: prediction at new timestamps. Since
Θ̂ is known and, for the sake of concision, we omit conditioning on Θ̂ in the sequel. Note that there are two cases for prediction (referred to as Type I and Type II in Shi and Cheng, 2014, Section 3.2.1), depending on whether or not we observe some data for the new individual we wish to predict on. We denote by the index ∗ a new individual for whom we want to make a prediction at timestamps t_p. If there are no available data for this individual, we have no ∗-specific information, and the prediction is merely given by p(µ(t_p) | {y_i}_i). This quantity may be considered as the 'generic' (or Type II) prediction according to the trained model, and only informs us through the mean process. Computing p(µ(t_p) | {y_i}_i) is also one of the steps leading to the prediction for a partially observed new individual (Type I). The latter being the most compelling case, we consider Type II prediction as a particular case of the full Type I procedure, described below.

If we observe {t_∗, y_∗(t_∗)} for the new individual, the multi-task GP prediction is obtained by computing the posterior distribution p(y_∗(t_p) | y_∗(t_∗), {y_i}_i). Note that the conditioning is taken over y_∗(t_∗), as in any GP regression, but also over {y_i}_i, which is specific to our multi-task setting. Computing this distribution requires the following steps.

1. Choose a grid of prediction t_p and define the pooled vector of timestamps t_{p∗}.
2. Compute the hyper-posterior distribution of µ at t_{p∗}: p(µ(t_{p∗}) | {y_i}_i).
3. Compute the prior distribution p(y_∗(t_{p∗}) | {y_i}_i).
4. Compute the hyper-parameters θ_∗ of the new individual's covariance matrix (optional).
5. Compute the posterior distribution p(y_∗(t_p) | y_∗(t_∗), {y_i}_i).

As mentioned above, we observe a new individual at timestamps t_∗. The GP regression consists in arbitrarily choosing a vector t_p of timestamps on which we wish to make a prediction. Since a GP is an infinite-dimensional object, we can pick a finite-dimensional vector at any new location. Then, we define new notation for the pooled vector of timestamps t_{p∗} = [t_p; t_∗], which serves as a working grid to define the prior and posterior distributions involved in the prediction process. One can note that, although not mandatory in theory, it is often a good idea to include the observed timestamps of the training individuals, t, within t_{p∗}, since they match locations containing information for the mean process to 'help' the prediction. In particular, if t_{p∗} = t, the computation of µ's hyper-posterior distribution is not necessary, since p(µ(t) | {y_i}_i) has previously been obtained with the EM algorithm. However, in general, it is necessary to compute the hyper-posterior p(µ(t_{p∗}) | {y_i}_i) at the new timestamps. The idea remains similar to the E step described above, and we obtain the following result.

Proposition 4.1.
Let t_{p∗} be a vector of timestamps of size Ñ. The hyper-posterior distribution of µ remains Gaussian:

p(µ(t_{p∗}) | {y_i}_i) = N(µ(t_{p∗}); m̂(t_{p∗}), K̂_{p∗}),   (7)

with:

K̂_{p∗} = ( K̃⁻¹ + Σ_{i=1}^{M} Ψ̃_i⁻¹ )⁻¹,
m̂(t_{p∗}) = K̂_{p∗} ( K̃⁻¹ m(t_{p∗}) + Σ_{i=1}^{M} Ψ̃_i⁻¹ ỹ_i ),

where we used the shorthand notation:

• K̃ = K_{θ̂₀}(t_{p∗}, t_{p∗}) (an Ñ × Ñ matrix),
• ỹ_i = ( 1_{[t ∈ t_i]} × y_i(t) )_{t ∈ t_{p∗}} (an Ñ-dimensional vector),
• Ψ̃_i = [ 1_{[t, t' ∈ t_i]} × Ψ_{θ̂_i, σ̂_i²}(t, t') ]_{t, t' ∈ t_{p∗}} (an Ñ × Ñ matrix).
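A minimal sketch of Proposition 4.1 in code is given below, under the convention that Ψ̃_i⁻¹ stands for the inverse computed on the observed block and padded with zeros elsewhere (which is the quantity the formulas effectively use); the helper name and the indexing via searchsorted are our assumptions.

```python
import numpy as np

def hyperposterior_on_pooled_grid(t_pooled, m_prior, K_prior, t_list, y_list, Psi_list):
    """Hyper-posterior of mu on an arbitrary working grid t_{p*} (Proposition 4.1).

    t_pooled : (Ntilde,) sorted working grid containing every t_i
    m_prior  : (Ntilde,) prior mean m(t_{p*})
    K_prior  : (Ntilde, Ntilde) prior covariance K_{theta_0}(t_{p*}, t_{p*})
    t_list   : list of individual grids t_i (each a subset of t_pooled)
    y_list   : list of observed outputs y_i on t_i
    Psi_list : list of covariance matrices Psi_i on t_i
    """
    n = len(t_pooled)
    K_inv = np.linalg.inv(K_prior)
    precision_sum = np.zeros((n, n))
    weighted_sum = np.zeros(n)
    for t_i, y_i, Psi_i in zip(t_list, y_list, Psi_list):
        # Positions of this individual's timestamps inside the pooled grid.
        idx = np.searchsorted(t_pooled, t_i)
        P_i = np.linalg.inv(Psi_i)
        # Zero-padded precision (the role played by Psi_tilde_i^{-1}) and padded y_i.
        precision_sum[np.ix_(idx, idx)] += P_i
        weighted_sum[idx] += P_i @ y_i
    K_hat = np.linalg.inv(K_inv + precision_sum)
    m_hat = K_hat @ (K_inv @ m_prior + weighted_sum)
    return m_hat, K_hat
```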
Proof. The sketch of the proof is similar to that of Proposition 3.1 in the E step. The only technicality consists in dealing carefully with the dimensions of the vectors and matrices involved and, whenever relevant, in defining augmented versions of y_i and Ψ_{θ̂_i, σ̂_i²} with 0 elements at the positions of timestamps unobserved for the i-th individual. Note that if we pick a vector t_{p∗} including only some of the timestamps from t_i, the information coming from y_i at the remaining timestamps is ignored. We defer the details to Sec. 8.

According to our generative model, given the mean process, any new individual ∗ is modelled as:

y_∗(·) | µ(·) ~ GP( µ(·), Ψ_{θ_∗, σ_∗²}(·, ·) ).   (8)

Therefore, for any finite-dimensional vector of timestamps, and in particular for t_{p∗}, p(y_∗(t_{p∗}) | µ(t_{p∗})) is a multivariate Gaussian distribution. Moreover, from this distribution and µ's hyper-posterior, we can work out the multi-task prior distribution over y_∗(t_{p∗}).

Proposition 4.2.
For a set of timestamps t_{p∗}, the multi-task prior distribution of y_∗ is given by

p(y_∗(t_{p∗}) | {y_i}_i) = N( y_∗(t_{p∗}); m̂(t_{p∗}), K̂_{p∗} + Ψ_{θ_∗, σ_∗²}^{t_{p∗}} ).   (9)

Proof.
Let P denote the multi-task prior distribution for the new individual at timestamps t_{p∗}. To compute this prior, we need to integrate p(y_∗ | µ, {y_i}_i) over the mean process µ, whereas the multi-task aspect remains through the conditioning over {y_i}_i. We omit writing the timestamps, using the simplified notation µ and y_∗ instead of µ(t_{p∗}) and y_∗(t_{p∗}), respectively. We first use the assumption that {y_i | µ}_{i ∈ {1,...,M}} ⊥⊥ y_∗ | µ, i.e., the individuals are independent conditionally on µ. Then, one can notice that the two distributions involved within the integral are Gaussian, which leads to an explicit Gaussian target distribution after integration:

P = p(y_∗ | {y_i}_i)
  = ∫ p(y_∗, µ | {y_i}_i) dµ
  = ∫ p(y_∗ | µ, {y_i}_i) p(µ | {y_i}_i) dµ
  = ∫ p(y_∗ | µ) p(µ | {y_i}_i) dµ,

where p(y_∗ | µ) = N(y_∗; µ, Ψ_{θ_∗, σ_∗²}^{t_{p∗}}) and p(µ | {y_i}_i) = N(µ; m̂, K̂_{p∗}). This convolution of two Gaussians remains Gaussian (Bishop, 2006, Chapter 2.3.3). In what follows, the subscript of an expectation indicates the distribution under which it is taken (e.g. E_{y_∗ | {y_i}_i}[y_∗] = ∫ y_∗ p(y_∗ | {y_i}_i) dy_∗). The mean parameter is then given by

E_{y_∗ | {y_i}_i}[y_∗] = ∫ y_∗ p(y_∗ | {y_i}_i) dy_∗
                      = ∫ ( ∫ y_∗ p(y_∗ | µ) dy_∗ ) p(µ | {y_i}_i) dµ
                      = E_{µ | {y_i}_i}[ E_{y_∗ | µ}[y_∗] ]
                      = E_{µ | {y_i}_i}[µ]
                      = m̂.

Following the same idea, the second-order moment is given by

E_{y_∗ | {y_i}_i}[y_∗ y_∗ᵀ] = E_{µ | {y_i}_i}[ E_{y_∗ | µ}[y_∗ y_∗ᵀ] ]
                           = E_{µ | {y_i}_i}[ V_{y_∗ | µ}[y_∗] + E_{y_∗ | µ}[y_∗] E_{y_∗ | µ}[y_∗]ᵀ ]
                           = Ψ_{θ_∗, σ_∗²} + E_{µ | {y_i}_i}[µ µᵀ]
                           = Ψ_{θ_∗, σ_∗²} + V_{µ | {y_i}_i}[µ] + E_{µ | {y_i}_i}[µ] E_{µ | {y_i}_i}[µ]ᵀ
                           = Ψ_{θ_∗, σ_∗²} + K̂_{p∗} + m̂ m̂ᵀ,

hence

V_{y_∗ | {y_i}_i}[y_∗] = E_{y_∗ | {y_i}_i}[y_∗ y_∗ᵀ] − E_{y_∗ | {y_i}_i}[y_∗] E_{y_∗ | {y_i}_i}[y_∗]ᵀ = Ψ_{θ_∗, σ_∗²} + K̂_{p∗}.

Note that y_∗(·) | {y_i}_i is not a GP, although its finite-dimensional evaluations above remain Gaussian. The covariance structure cannot be expressed as a kernel that could be directly evaluated on any vector: the process is known as a degenerate GP. In practice, however, this does not bear much consequence, as an arbitrary vector of timestamps τ can still be chosen; we then compute the hyper-posterior p(µ(τ) | {y_i}_i), which yields the Gaussian distribution p(y_∗(τ) | {y_i}_i) as above. For the sake of simplicity, we now rename the covariance matrix of the prior distribution as

K̂_{p∗} + Ψ_{θ_∗, σ_∗²}^{t_{p∗}} = Γ_{p∗} = [ Γ_pp  Γ_p∗ ; Γ_∗p  Γ_∗∗ ],   (10)

where Γ_{x,x'} = Γ(t_x, t_{x'}) = K̂(t_x, t_{x'}) + Ψ_{θ_∗, σ_∗²}(t_x, t_{x'}), for any t_x, t_{x'} ⊂ T.

When we collect data points for a new individual, as in the single-task GP setting, we need to learn the hyper-parameters of its covariance function before making predictions. A salient fact of our multi-task approach is that we include this step in the prediction process, for the two following reasons. First, the model is already trained for individuals i = 1, ..., M, and this training is general and independent of the future individual ∗ or of the choice of prediction timestamps.
Since learning these new hyper-parameters requires knowledge of µ(t_{p∗}), and thus of the prediction timestamps, we cannot compute them beforehand. Secondly, learning these hyper-parameters with the empirical Bayes approach only requires the maximisation of a Gaussian likelihood, which is negligible in computing time compared to the previous EM algorithm. As for a single-task GP, we have the following estimates for the hyper-parameters:

Θ̂_∗ = argmax_{Θ_∗} p( y_∗(t_∗) | {y_i}_i, Θ_∗ ) = argmax_{Θ_∗} N( y_∗(t_∗); m̂(t_∗), Γ_∗∗^{Θ_∗} ).

Note that this step is optional depending on the model: in the common hyper-parameters model (i.e. (θ, σ²) = (θ_i, σ_i²)), any new individual shares the same hyper-parameters and we already have Θ̂_∗ = (θ̂_∗, σ̂_∗²) = (θ̂, σ̂²) from the EM algorithm.

We can write the prior distribution, separating observed and prediction timestamps, as:

p(y_∗(t_{p∗}) | {y_i}_i) = p( y_∗(t_p), y_∗(t_∗) | {y_i}_i )   (11)
                        = N( y_∗(t_{p∗}); m̂(t_{p∗}), Γ_{p∗} )   (12)
                        = N( [y_∗(t_p); y_∗(t_∗)]; [m̂(t_p); m̂(t_∗)], [ Γ_pp  Γ_p∗ ; Γ_∗p  Γ_∗∗ ] ).   (13)

The conditional distribution remains Gaussian (Bishop, 2006), and the predictive distribution is given by

p( y_∗(t_p) | y_∗(t_∗), {y_i}_i ) = N( y_∗(t_p); µ̂_p, Γ̂_p ),   (14)

where:

• µ̂_p = m̂(t_p) + Γ_p∗ Γ_∗∗⁻¹ ( y_∗(t_∗) − m̂(t_∗) ),
• Γ̂_p = Γ_pp − Γ_p∗ Γ_∗∗⁻¹ Γ_∗p.

Computational complexity is of paramount importance for GPs, as it grows quickly with large datasets. The classical cost to train a GP is O(N³), and O(N²) for prediction (Rasmussen and Williams, 2006), where N is the number of data points (see the aforementioned references in Sec. 1 for sparse approximations). Since Magma uses information from M individuals, each of them providing N_i observations, these quantities determine the overall complexities of the algorithm. Recalling that N is the number of distinct timestamps (i.e. N ≤ Σ_{i=1}^{M} N_i), the training complexity is O(M × N_i³ + N³) (i.e. the complexity of each EM iteration). As usual with GPs, the cubic costs come from the inversion of the corresponding matrices, and here the constant is proportional to the number of iterations of the EM algorithm. The dominating term in this expression depends on the value of M relative to N. For a large number of individuals with many common timestamps (M N_i³ ≳ N³), the first term dominates. For diverse timestamps among individuals (M N_i³ ≲ N³), the second term becomes the primary burden, as in any GP problem. During the prediction step, the re-computation of µ's hyper-posterior implies the inversion of an Ñ × Ñ matrix (Ñ being the dimension of t_{p∗}), which has an O(Ñ³) complexity, while the final prediction is O(N_∗³). In practice, the most computationally expensive steps can be performed in advance to allow quick on-the-fly predictions when collecting new data. If we observe the training dataset once and pre-compute the hyper-posterior of µ on a fine grid on which to predict later, the immediate computational cost for each new individual is identical to that of single-task GP regression.
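To wrap up the prediction procedure, the conditioning of Eq. (14) can be sketched as follows, given the blocks of Γ_{p∗} computed on the working grid; this is an illustrative helper with names of our choosing, not the package's API.

```python
import numpy as np

def multitask_gp_predict(m_hat_p, m_hat_obs, Gamma_pp, Gamma_po, Gamma_oo, y_obs):
    """Predictive distribution of Eq. (14): condition the multi-task prior on the
    new individual's observations.

    m_hat_p  : hyper-posterior mean of mu at the prediction timestamps t_p
    m_hat_obs: hyper-posterior mean of mu at the observed timestamps t_*
    Gamma_pp, Gamma_po, Gamma_oo : blocks of Gamma_{p*} = K_hat_{p*} + Psi_{theta_*, sigma_*^2}
    y_obs    : observed outputs y_*(t_*)
    """
    solve = np.linalg.solve
    mu_pred = m_hat_p + Gamma_po @ solve(Gamma_oo, y_obs - m_hat_obs)
    cov_pred = Gamma_pp - Gamma_po @ solve(Gamma_oo, Gamma_po.T)
    return mu_pred, cov_pred
```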
We evaluate our Magma algorithm on synthetic data and on two real datasets. Classical GP regression on each single task separately is used as the baseline alternative for predictions. While it is not expected to perform well on the datasets used, the comparison highlights the interest of multi-task approaches. To our knowledge, the only alternative to Magma is the GPFDA algorithm from Shi et al. (2007) and Shi and Choi (2011), described in Sec. 3.4, and the associated R package GPFDA, which is applied on the examples. Throughout the section, the standard Exponentiated Quadratic kernel (see Eq. (5)) is used both for simulating the data and for the covariance structures in the three algorithms. Hence, each kernel is associated with θ = {v, ℓ}, v, ℓ ∈ R⁺, a set of, respectively, variance and length-scale hyper-parameters. Each simulated dataset has been drawn from the sampling scheme below (a code sketch of this scheme is given below):

1. Define a random working grid t ⊂ [0, 10] of N = 200 timestamps, and a number M of individuals.
2. Define the prior mean for µ as a linear function m(t) = at + b, ∀t ∈ t, where a and b are drawn uniformly, with b ∈ [0, 10].
3. Draw uniformly the hyper-parameters of µ's kernel: θ₀ = {v₀, ℓ₀}.
4. Draw µ(t) ~ N(m(t), K_{θ₀}^t).
5. For all i = 1, ..., M, draw uniformly θ_i = {v_i, ℓ_i} and σ_i².
6. For all i = 1, ..., M, draw a subset t_i ⊂ t of N_i = 30 timestamps uniformly at random, and draw y_i ~ N(µ(t_i), Ψ_{θ_i, σ_i²}^{t_i}).

This procedure provides a synthetic dataset {t_i, y_i}_i and its associated mean process µ(t). Those quantities are used to train the model, make predictions with each algorithm, and then compute errors in the estimation of µ and in the forecasts.
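A compact sketch of this sampling scheme follows. The uniform ranges whose bounds could not be recovered above (for a, the kernel hyper-parameters and the noise variances) are filled in with assumed values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def eq_cov(t, v, ell):
    d = np.subtract.outer(t, t)
    return v * np.exp(-d**2 / (2.0 * ell**2))

# Steps 1-4: working grid, linear prior mean, mu's hyper-parameters, then draw mu.
t = np.sort(rng.uniform(0, 10, size=200))
a, b = rng.uniform(-2, 2), rng.uniform(0, 10)             # range of a is an assumption
m_t = a * t + b
v0, ell0 = rng.uniform(1, np.e**2, size=2)                # assumed hyper-parameter ranges
mu = rng.multivariate_normal(m_t, eq_cov(t, v0, ell0) + 1e-6 * np.eye(len(t)))

# Steps 5-6: individual hyper-parameters, sub-grids of N_i = 30 timestamps, outputs y_i.
M, N_i = 20, 30
data = []
for _ in range(M):
    v_i, ell_i = rng.uniform(1, np.e**2, size=2)          # assumed ranges
    sigma2_i = rng.uniform(0.05, 0.5)                     # assumed noise range
    idx = np.sort(rng.choice(len(t), size=N_i, replace=False))
    t_i = t[idx]
    Psi_i = eq_cov(t_i, v_i, ell_i) + sigma2_i * np.eye(N_i)
    y_i = rng.multivariate_normal(mu[idx], Psi_i)
    data.append((t_i, y_i))
```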
We recall that the Magma algorithm enables two different settings depending on the model's assumption over the hyper-parameters (HP), and we refer to them as Common HP and Different HP in the following. In order to test these two contexts, differentiated datasets have been generated, by drawing Common HP data or Different HP data for each individual at step 5. We previously presented the idea of the model used in GPFDA and, although the algorithm has many features (in particular regarding the type and number of input variables), it is not yet usable when timestamps differ among individuals. Therefore, two frameworks are considered, Common grid and Uncommon grid, to take this specification into account. Thus, the comparison between the different methods can only be performed on data generated under the settings Common HP and Common grid, and the effect of the different settings on Magma is analysed separately. Moreover, without additional knowledge, the initialisation of the prior mean function m(·) is set to 0 for each algorithm. Except in the experiments where the influence of the number of individuals is analysed, the generic value is M = 20. In the case of prediction on unobserved timestamps for a new individual, the first 20 data points are used as observations, and the remaining 10 are taken as test values.
To illustrate the multi-task approach of Magma, Fig. 2 displays a comparison between single-GP regression and Magma on a simple example, from a dataset simulated according to the scheme above. Given the observed data (in black), values on a thin grid of unobserved timestamps are predicted and compared, in particular, with the true test values (in red). As expected, GP regression provides a good fit close to the data points and then dives rapidly towards the prior 0 with increasing uncertainty. Conversely, although the initialisation of the prior mean was also 0 in Magma, the hyper-posterior distribution of µ (dashed line) is estimated thanks to all individuals in the training dataset. This process acts as an informed prior helping the GP prediction for the new individual, even far from its own observations. More precisely, three phases can be distinguished according to the level of information coming from the data: in the first one, close to the observed data (t ∈ [1, 7]), both methods behave similarly, with a slightly wider credible interval for Magma, which is logical since the prediction also takes uncertainty over µ into account (see Eq. (9)); in the second one, on intervals of unobserved timestamps containing data points from the training dataset (t ∈ [0, 1] ∪ [7, 10]), the prediction is guided by the information coming from the other individuals through µ. In this context, the mean trajectory remains coherent and the uncertainty increases only slightly. In the third case, where no observations are available, either from the new individual or from the training dataset (t ∈ [10, 12]), the prediction behaves as expected, with a slow drift towards the prior mean 0 and sharply increasing variance. Overall, the multi-task framework provides reliable probabilistic predictions on a wider range of timestamps, potentially outside of the usual scope of GPs.

Figure 2: Prediction curves (blue) for a new individual with associated 95% credible intervals (grey) for GP regression (left) and Magma (right). The dashed line represents the mean function of the mean process's hyper-posterior p(µ | {y_i}_i). Observed data points are in black, and testing data points are in red. The colourful background points are the observations from the training dataset, each colour corresponding to a different individual.

Table 1: Average MSE (sd) and average CI coverage (sd) on 100 runs for GP, GPFDA and Magma. (⋆: 99.6 (2.8); the measure of uncertainty from the GPFDA package is not a genuine credible interval.)

                  Prediction                       Estimation of µ
                  MSE            CI                MSE         CI
Magma             18.7 (31.4)    93.8 (13.5)       1.3 (2)     94.3 (11.3)
GPFDA             31.8 (49.4)    90.4 (18.1)       2.4 (3.6)   ⋆
GP                87.5 (151.9)   74.0 (32.7)       –           –
We confront the performance of Magma to the alternatives in several situations and for different datasets. In the first place, classical GP regression (GP), GPFDA and Magma are compared through their performance in prediction and in estimation of the true mean process µ. In the prediction context, the performances are evaluated according to the following indicators:

• the mean squared error (MSE), which compares the predicted values to the true test values at the 10 last timestamps:

  (1/10) Σ_{k=21}^{30} ( y_i^pred(t_i^k) − y_i^true(t_i^k) )²,
• the ratio of CI coverage, i.e. the percentage of unobserved data points effectively lying within the 95% credible interval defined from the predictive posterior distribution p(y_∗(t_p) | y_∗(t_∗), {y_i}_i):

  100 × (1/10) Σ_{k=21}^{30} 1{ y_i^true(t_i^k) ∈ CI_95% }.

The ratio of CI coverage gives an approximation of the reliability of the predictive variance and should be as close to 95% as possible. Other values would indicate a tendency to underestimate or overestimate the uncertainty. Let us recall that GPFDA uses B-splines to estimate the mean process and does not account for uncertainty, contrary to a probabilistic framework such as Magma. However, a measure of uncertainty based on an empirical variance estimated from the training curves is proposed (see Shi and Cheng, 2014, Section 3.2.1). In practice, this measure constantly overestimates the true variance, and the CI coverage is generally equal or close to 100%.

In the estimation context, the performances are evaluated thanks to another MSE, which compares the estimations to the true values of µ at all timestamps:

  (1/M) Σ_{i=1}^{M} (1/N) Σ_{k=1}^{N} ( µ^pred(t_i^k) − µ^true(t_i^k) )².

Figure 3: MSE with respect to the number M of training individuals (100 runs in each case). Left: prediction error on the 10 testing points. Right: estimation error of the true mean process µ.

Figure 4: MSE prediction error on the 10 last testing points with respect to the increasing number N of observed timestamps, among the first 20 points (100 runs in each case).

Table 1 presents the results obtained over 100 datasets, where the model is trained on M = 20 individuals, each of them observed on N = 30 common timestamps. As expected, both multi-task methods lead to better results than GP. However, Magma outperforms GPFDA, both in the estimation of µ and in prediction performance. In terms of error as well as in uncertainty quantification, Magma provides more accurate results, in particular with a CI coverage close to the 95% expected value. Each method presents a quite high standard deviation of the MSE in prediction, which is due to some datasets with particularly difficult values to predict, although most of the cases lead to small errors. This behaviour is reasonably expected, since the forecast of 10 timestamps ahead might sometimes be tricky. It can also be noticed in Fig. 3 that Magma consistently provides lower errors as well as less of the pathological behaviour that may sometimes occur with the B-splines modelling used in GPFDA.

To highlight the effect of the number of individuals M on the performance, Fig. 3 provides the same 100-run trial as previously, for different values of M. The boxplots exhibit, for each method, the behaviour of the prediction and estimation MSE as information is added to the training dataset. Let us mention the absence of discernible changes once M grows beyond a few tens of individuals. The right panel shows that the estimation of µ benefits from additional training individuals, leading to shallow errors for high values of M, in particular for Magma. Meanwhile, the left panel exhibits reasonably unchanged prediction performance with respect to the values of M, except for some random fluctuations. This property is expected for GP regression, since no external information is used from the training dataset in this context. For both multi-task algorithms though, the estimation of µ improves the prediction by one order of magnitude below the typical errors, even with only a few training individuals. Furthermore, since a new individual behaves independently through f_∗, it is natural for a 10-points-ahead forecast to present intrinsic variations, despite an adequate estimation of the shared mean process.
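The two prediction indicators can be computed in a few lines, sketched below for a single testing individual; the 1.96 z-value for the 95% credible interval is the standard Gaussian quantile, and the function names are ours.

```python
import numpy as np

def prediction_mse(y_pred, y_true):
    """Mean squared error over the held-out timestamps (here the 10 last points)."""
    return np.mean((np.asarray(y_pred) - np.asarray(y_true))**2)

def ci95_coverage(y_true, mean_pred, var_pred):
    """Percentage of held-out points falling inside the pointwise 95% credible interval."""
    half_width = 1.96 * np.sqrt(np.asarray(var_pred))
    inside = np.abs(np.asarray(y_true) - np.asarray(mean_pred)) <= half_width
    return 100.0 * np.mean(inside)
```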
To illustrate the advantage of multi-task methods, even for M = 20, we display in Fig. 4 the evolution of the MSE according to the number N of timestamps assumed to be observed for the new individual on which we make predictions. These predictions are still computed on the last 10 timestamps, although in this experiment we only observe the first 5, 10, 15, or 20 timestamps, in order to vary the volume of information and the distance from training observations to targets. We observe in Fig. 4 that, as expected in a GP framework, the closer the observations are to the targets, the better the results. However, for multi-task approaches, and in particular for Magma, the prediction remains consistently adequate even with few observations. Once more, sharing information across individuals significantly helps the prediction, even for small values of M or few observed data.

As we previously discussed, different settings are available for
Magma according to the nature of the data and the model hypotheses. First, the Common grid setting corresponds to cases where all individuals share the same timestamps, whereas Uncommon grid is used otherwise. Moreover, Magma enables considering either identical hyper-parameters for all individuals or specific ones, as previously discussed in Sec. 2.2. To evaluate the effect of the different settings, performances in prediction and in µ's estimation are evaluated in the following cases in Table 2:

• Common HP, when data are simulated with a common set of hyper-parameters for all individuals, and Proposition 3.3 is used for inference in Magma,
• Different HP, when data are simulated with its own set of hyper-parameters for each individual, and Proposition 3.2 is used for inference in Magma,
• Common HP on different HP data, when data are simulated with its own set of hyper-parameters for each individual, and Proposition 3.3 is used for inference in Magma.
Note that the first line (Common grid / Common HP) of Table 2 is identical to the corresponding results in Table 1, providing reference values, significantly better than for the other methods. The results obtained in Table 2 indicate that the Magma performance is not significantly altered by the settings used, or by the nature of the simulated data. In order to confirm the robustness of the method, the setting Common HP was applied to data generated by drawing different values of the hyper-parameters for each individual (Different HP data). In this case, performances in prediction and in the estimation of µ are slightly deteriorated, although Magma still provides quite reliable forecasts. This experiment also highlights a particularity of the Different HP setting: looking at the estimation of µ performance, we observe a significant decrease in the CI coverage, due to numerical instability in some pathological cases. Numerical issues, in particular during matrix inversions, are classical problems in the GP literature and, because of the potentially large number of different hyper-parameters to train, the probability for at least one of them to lead to a nearly singular matrix increases. In this case, one individual might overwhelm the others in the computation of µ's hyper-posterior (see Proposition 4.1), and thus lead to an underestimated posterior variance. This problem does not occur in the Common HP setting, since sharing the same hyper-parameters prevents the associated covariance matrices from running over each other. Thus, except if one specifically wants to smooth multiple curves presenting really different behaviours, keeping Common HP as the default setting appears to be a reasonable choice. Let us notice that the estimation of µ is slightly better for common than for uncommon grids, since the estimation problem on the union of different timestamps is generally more difficult. However, this feature only depends on the nature of the data.

Table 2: Average MSE (sd) and average CI coverage (sd) on 100 runs for the different settings of Magma.

                                        Prediction                   Estimation of µ
                                        MSE          CI              MSE        CI
Common HP            Common grid       18.7 (31.4)  93.8 (13.5)     1.3 (2)    94.3 (11.3)
                     Uncommon grid     19.2 (43)    94.6 (13.1)     2.9 (2.6)  93.6 (9.2)
Different HP         Common grid       19.9 (54.7)  91.6 (17.8)     0.5 (0.4)  70.8 (24.3)
                     Uncommon grid     14.5 (22.4)  89.1 (17.9)     2.5 (4.5)  81.1 (15.9)
Common HP on
different HP data    Common grid       21.7 (36)    91 (19.8)       1.5 (1.2)  91.1 (13)
                     Uncommon grid     18.1 (33)    92.5 (15.9)     3.2 (4.5)  93.4 (9.8)
The counterpart of the more accurate and general results provided by Magma is a natural increase in running time. Table 3 exhibits the raw and relative training times for GPFDA and Magma (prediction times are negligible and comparable in both cases), with varying values of M on a Common grid of N = 30 timestamps. The algorithms were run on a laptop with a dual-core processor at 2.90 GHz and 8 GB of RAM. The reported computing times are in seconds, and for small to moderate datasets the procedures ran in a few minutes to a few hours. The difference between the two algorithms is due to GPFDA modelling µ as a deterministic function through B-splines smoothing, whereas Magma accounts for uncertainty. The ratio of computing times between the two methods tends to decrease as M increases, and stabilises around 2 for higher numbers of training individuals. This behaviour comes from the E step of Magma, which is incompressible and rather insensitive to the value of M. Roughly speaking, one needs to pay twice the computing price of GPFDA for Magma to provide (significantly) more accurate predictions and uncertainty over µ.

Table 4 provides the running times of Magma according to its different settings, with M = 20. Because the complexity is linear in M in each case, the ratio of running times would remain roughly similar no matter the value of M. Prediction time appears negligible compared to training time, and generally takes less than one second. Besides, the Different HP setting increases the running time, since in this context M maximisations (instead of one for Common HP) are required at each EM iteration. In this case, the prediction also takes slightly longer because of the need to optimise hyper-parameters for the new individual. Although the nature of the grid of timestamps does not matter in itself, a key limitation lies in the dimension N of the pooled set of timestamps, which tends to get bigger when individuals have different timestamps from one another.

Table 3: Average (sd) training time (in seconds) for Magma and GPFDA for different numbers M of individuals in the training dataset (M = 5, 10, 50, 100). The relative running time between Magma and GPFDA is provided on the line Ratio.

Table 4: Average (sd) training and prediction time (in seconds) for different settings of Magma.

                                   Train          Predict
Common HP         Common grid     12.6 (3.5)     0.1 (0)
                  Uncommon grid   16.5 (11.4)    0.2 (0.1)
Different HP      Common grid     42.6 (20.5)    0.6 (0.1)
                  Uncommon grid   40.2 (17)      0.6 (0.1)
Data and problem setting
We consider the problem of performance prediction in competition for French swimmers. The French Swimming Federation (FFN) provided us with an anonymised dataset compiling the age and results of its members between 2000 and 2016. For each competitor, the race times are registered for competitions of 100m freestyle (50m swimming pool). The database contains results from 1731 women and 7876 men, each of them providing an average of 22.2 data points (min = 15, max = 61) and 12 data points (min = 5, max = 57), respectively. In the following, the age of the i-th swimmer is considered as the input variable (timestamp t) and the performance (in seconds) on a 100m freestyle as the output (y_i(t)). For reasons of confidentiality and property, the raw dataset cannot be published. The analysis focuses on the youth period, from 10 to 20 years, where the progression is the most noticeable. In order to get relevant time series, we retained only individuals having a sufficient number of data points over the considered time period.

For a young swimmer observed during their first years of competition, we aim at modelling their progression curve and making predictions on their future performances in the subsequent years. Since we consider a decision-making problem involving irregular time series, the GP probabilistic framework is a natural choice to work with. Thereby, assuming that each swimmer in the database is a realisation y_i defined as previously, we expect Magma to provide multi-task predictions for a new young swimmer, benefiting from the information of other swimmers already observed at older ages. To study such a modelling, and validate its efficiency in practice, we split the individuals into training and testing datasets with respective sizes:

• M_train^F = 1039, for the female training set,
• M_test^F = 692, for the female testing set,
• M_train^M = 4726, for the male training set,
• M_test^M = 3150, for the male testing set.

Inference on the hyper-parameters is performed on the training dataset in both cases. Considering the different timestamps and the relative monotony of the progression curves, the settings Uncommon grid / Common HP have been used for Magma. The overall training lasted around 2 hours with the same hardware configuration as for the simulations. To compute the MSE and the CI coverage, the data points of each individual in the testing set have been split into observed and testing timestamps. Since each individual has a different number of data points, the first 80% of timestamps are taken as observed, while the remaining 20% are considered as testing timestamps. Magma's predictions are compared with the true values of y_i at the testing timestamps. As previously, both GP and Magma have been initialised with a constant 0 mean function. Initial values for the hyper-parameters are also similar for all i: θ₀^ini = θ_i^ini = (e, e) and σ_i^{2,ini} = 0.4. Those values are the defaults in Magma and remain adequate in the context of these datasets.
Results and interpretation
The overall performance and comparison are summarised in Table 5.
Table 5: Average MSE (sd) and average CI coverage (sd) for prediction on the French swimmer testing datasets.

                      MSE            CI Cover
Women    Magma
         GP           25.3 (97.6)    72.7 (37.1)
Men      Magma
         GP           22.1 (94.3)    78.2 (30.4)
We observe that Magma still provides excellent results in this context, and naturally outperforms the predictions provided by single GP regression. The progression curves present relatively monotonic variations, thus avoiding the pathological behaviours that could occur with synthetic data, and the MSE in prediction remains very low. The CI coverage sticks close to the 95% expected value for Magma, indicating an adequate quantification of uncertainty. To illustrate these results, an example is displayed in Fig. 5 for both men and women. For a randomly chosen testing individual, we plot its predicted progression curve (in blue), where we used its first 15 data points as observations (in black), while the remaining true data points (in red) are displayed for comparison purposes. As previously observed in the simulation study, the simple GP quickly drifts towards the prior 0 mean as soon as data lack. However, for both men and women, the Magma predictions remain close to the true data, which also lie within the 95% credible interval. Even for long-term forecasts, where the mean prediction curve tends to overlap the mean process (dashed line), the true data remain in our range of uncertainty, as the credible interval widens far from observations. For clarity, we displayed only a few individuals from the training dataset (colourful points) in the background. The mean process (dashed line) seems to represent the main trend of progression among swimmers correctly, even though we cannot numerically compare µ to any real-life analogous quantity. From a more sport-related perspective, we can note that both genders present similar patterns of progression. However, while performances are roughly similar in mean trend before the age of 14, they start to differentiate afterwards and then converge to average times with approximately a 5-second gap. Interestingly, the difference between world records in the 100m freestyle for men and women is currently 4.8 seconds (46.91 versus 51.71). These results, obtained under reasonable hypotheses on several hundreds of swimmers, seem to indicate that Magma would give quite reliable predictions for a new young swimmer. Furthermore, the uncertainty provided through the predictive posterior distribution offers an adequate degree of caution in a decision-making process.

Figure 5: Prediction curves (blue) for a testing individual with associated 95% credible intervals (grey) for GP regression (left) and Magma (right), for both women (top) and men (bottom). The dashed lines represent the mean functions of the hyper-posterior mean process µ | {y_i}_i. Observed data points are in black, and testing data points are in red. The colourful background points are observations from the training dataset, each colour corresponding to a different individual.
We have introduced a unified non-parametric multi-task framework integrating a mean Gaussian process prior in the context of GP regression. While we believe that this process is an interesting object in itself, it also allows individuals to borrow information from each other, and thus provides more accurate predictions, even far from data points. Furthermore, our method accounts for uncertainty in the mean process and remains applicable regardless of the observational grid of the data. Both on simulated and real-life datasets, we exhibited the adequacy of such an approach, and studied some of its properties and possible settings.
Magma outperforms the alternatives in the estimation of the mean process as well as in prediction, and gives a reliable quantification of uncertainty. We also displayed evidence of its predictive efficiency for real-life problems and provided some insights on the practical interpretation of the mean process.

Interestingly, despite the extensive literature on these aspects of GPs, our model does not yet include sparse approximations or online extensions. While these aspects are beyond the scope of the present paper, we aim to integrate such existing approaches into our model to widen its applicability. The combination of the covariance structures used in classical multi-task GP (Bonilla et al., 2008; Hensman et al., 2013) with the common mean process we introduced would also open a promising path for future work. Another possible avenue is an adaptation to the classification context, as presented in Rasmussen and Williams (2006, Chapter 3). Besides, this work leaves the door open to improvement, as we only tackled the problem of unidimensional regression: enabling either multidimensional or mixed types of inputs, as in Shi and Choi (2011), would be of interest. To conclude, the hypothesis of a unique underlying mean process might be considered too restrictive for some datasets, and enabling cluster-specific mean processes would be a relevant extension.
The proof below details the calculus of µ's hyper-posterior distribution, involved in the E step of the EM algorithm and during the prediction process. Although the main idea is similar to the proof given for common timestamps, some care is needed when working in the general case. Note that the proof of Proposition 3.1 is a particular case of the proof below, where τ = t exactly (with τ the set of timestamps on which the hyper-posterior is to be computed). Moreover, in order to keep an analytical expression for µ's hyper-posterior distribution, we discard the superfluous information contained in {y_i}_i at timestamps on which the hyper-posterior is not to be computed. Hence, the proof below assumes that the remaining data points are observed on subsets {τ_i}_i of τ.

Proof of Proposition 3.1 and Proposition 4.1.
Let τ be a finite vector of timestamps, and {τ_i}_i such that ∀ i = 1, ..., M, τ_i ⊂ τ. We define the convenient notation:

• µ_τ = µ(τ),
• m_τ = m(τ),
• µ_{τ_i} = µ(τ_i), ∀ i = 1, ..., M,
• y_i^{τ_i} = y_i(τ_i), ∀ i = 1, ..., M,
• Ψ_i = Ψ_{θ_i, σ_i}(τ_i, τ_i), ∀ i = 1, ..., M,
• K = K_{θ_0}(τ, τ).

Moreover, for a covariance matrix C and u, v ∈ τ, we write [C]^{-1}_{uv} for the element of the inverse matrix at the row associated with timestamp u and the column associated with timestamp v. We also omit the conditionings over \(\widehat{\Theta}\), τ_i and τ to maintain simple expressions. By construction of the models, we have:

\[
p(\mu_\tau \mid \{y_i^{\tau_i}\}_i) \propto p(\{y_i^{\tau_i}\}_i \mid \mu_\tau)\, p(\mu_\tau)
\propto \left\{ \prod_{i=1}^{M} p(y_i^{\tau_i} \mid \mu_{\tau_i}) \right\} p(\mu_\tau)
\propto \left\{ \prod_{i=1}^{M} \mathcal{N}\left(y_i^{\tau_i}; \mu_{\tau_i}, \Psi_i\right) \right\} \mathcal{N}\left(\mu_\tau; m_\tau, K\right).
\]

The term \(L = -2 \log p(\mu_\tau \mid \{y_i^{\tau_i}\}_i)\) associated with the hyper-posterior remains quadratic, and we may find the corresponding Gaussian parameters by identification:

\[
\begin{aligned}
L &= \sum_{i=1}^{M} (y_i^{\tau_i} - \mu_{\tau_i})^\intercal \Psi_i^{-1} (y_i^{\tau_i} - \mu_{\tau_i}) + C_1 + (\mu_\tau - m_\tau)^\intercal K^{-1} (\mu_\tau - m_\tau) + C_2 \\
&= \mu_\tau^\intercal K^{-1} \mu_\tau + \sum_{i=1}^{M} \mu_{\tau_i}^\intercal \Psi_i^{-1} \mu_{\tau_i} - 2\left( \mu_\tau^\intercal K^{-1} m_\tau + \sum_{i=1}^{M} \mu_{\tau_i}^\intercal \Psi_i^{-1} y_i^{\tau_i} \right) + C_3 \\
&= \sum_{u \in \tau} \sum_{v \in \tau} \mu(u)\, [K]^{-1}_{uv}\, \mu(v) + \sum_{i=1}^{M} \sum_{u \in \tau_i} \sum_{v \in \tau_i} \mu(u)\, [\Psi_i]^{-1}_{uv}\, \mu(v) \\
&\quad - 2 \sum_{u \in \tau} \sum_{v \in \tau} \mu(u)\, [K]^{-1}_{uv}\, m(v) - 2 \sum_{i=1}^{M} \sum_{u \in \tau_i} \sum_{v \in \tau_i} \mu(u)\, [\Psi_i]^{-1}_{uv}\, y_i(v) + C_4,
\end{aligned}
\]

where we entirely decomposed the vector-matrix products. We factorise the expression according to the common timestamps between τ_i and τ. Since for all i, τ_i ⊂ τ, let us introduce the dummy indicator function \(\mathbb{1}_{\tau_i}(u, v) = \mathbb{1}_{\{u, v \in \tau_i\}}\) to write:

\[
\sum_{i=1}^{M} \sum_{u \in \tau_i} \sum_{v \in \tau_i} A(u, v) = \sum_{i=1}^{M} \sum_{u \in \tau} \sum_{v \in \tau} \mathbb{1}_{\tau_i}\, A(u, v) = \sum_{u \in \tau} \sum_{v \in \tau} \sum_{i=1}^{M} \mathbb{1}_{\tau_i}\, A(u, v).
\]

Subsequently, we can gather the sums such that

\[
\begin{aligned}
L &= \sum_{u \in \tau} \sum_{v \in \tau} \left( \mu(u) [K]^{-1}_{uv} \mu(v) + \sum_{i=1}^{M} \mathbb{1}_{\tau_i} \mu(u) [\Psi_i]^{-1}_{uv} \mu(v) - 2\, \mu(u) [K]^{-1}_{uv} m(v) - 2 \sum_{i=1}^{M} \mathbb{1}_{\tau_i} \mu(u) [\Psi_i]^{-1}_{uv} y_i(v) \right) + C \\
&= \sum_{u \in \tau} \sum_{v \in \tau} \left( \mu(u) \Big( [K]^{-1}_{uv} + \sum_{i=1}^{M} \mathbb{1}_{\tau_i} [\Psi_i]^{-1}_{uv} \Big) \mu(v) - 2\, \mu(u) \Big( [K]^{-1}_{uv} m(v) + \sum_{i=1}^{M} \mathbb{1}_{\tau_i} [\Psi_i]^{-1}_{uv} y_i(v) \Big) \right) + C \\
&= \mu_\tau^\intercal \Big( K^{-1} + \sum_{i=1}^{M} \tilde{\Psi}_i^{-1} \Big) \mu_\tau - 2\, \mu_\tau^\intercal \Big( K^{-1} m_\tau + \sum_{i=1}^{M} \tilde{\Psi}_i^{-1} \tilde{y}_i^{\tau} \Big) + C,
\end{aligned}
\]

where the y_i and Ψ_i are completed by zeros:

• \(\tilde{y}_i^{\tau} = \mathbb{1}_{\tau_i}\, y_i(\tau)\),
• \(\big[\tilde{\Psi}_i\big]^{-1}_{uv} = \mathbb{1}_{\tau_i}\, [\Psi_i]^{-1}_{uv}, \ \forall u, v \in \tau\).

By identification of the quadratic form, we reach

\[
p(\mu_\tau \mid \{y_i^{\tau_i}\}_i) = \mathcal{N}\big( \mu_\tau;\, \widehat{m}(\tau),\, \widehat{K} \big),
\]

with:

• \(\widehat{K} = \Big( K^{-1} + \sum_{i=1}^{M} \tilde{\Psi}_i^{-1} \Big)^{-1}\),
• \(\widehat{m}(\tau) = \widehat{K} \Big( K^{-1} m_\tau + \sum_{i=1}^{M} \tilde{\Psi}_i^{-1} \tilde{y}_i^{\tau} \Big)\).
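As an illustration, the closed-form expressions for \(\widehat{K}\) and \(\widehat{m}(\tau)\) above reduce to a few lines of linear algebra. The R sketch below is our own illustrative implementation, not the Magma package API: the zero-padding defining \(\tilde{\Psi}_i\) is achieved by adding each block \(\Psi_i^{-1}\) at the positions of τ_i within the full grid τ, which are assumed to be provided as index vectors.

hyperposterior <- function(K, m, Psi_list, y_list, idx_list) {
  # K: prior covariance K_theta0(tau, tau); m: prior mean m(tau)
  # Psi_list[[i]]: Psi_{theta_i, sigma_i}(tau_i, tau_i); y_list[[i]]: y_i(tau_i)
  # idx_list[[i]]: positions of tau_i within tau (zero-padding of the tilde quantities)
  prec <- solve(K)                     # K^{-1}
  weighted <- as.vector(solve(K, m))   # K^{-1} m
  for (i in seq_along(Psi_list)) {
    idx <- idx_list[[i]]
    Psi_inv <- solve(Psi_list[[i]])
    prec[idx, idx] <- prec[idx, idx] + Psi_inv                           # + tilde_Psi_i^{-1}
    weighted[idx] <- weighted[idx] + as.vector(Psi_inv %*% y_list[[i]])  # + tilde_Psi_i^{-1} tilde_y_i
  }
  K_hat <- solve(prec)                 # (K^{-1} + sum_i tilde_Psi_i^{-1})^{-1}
  m_hat <- as.vector(K_hat %*% weighted)
  list(mean = m_hat, cov = K_hat)
}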
Proof of Proposition 3.2 and Proposition 3.3.

Since the central part of the proof is similar for both propositions, we detail the calculus denoting Θ = {θ_0, {θ_i}_i, {σ_i}_i} for generality, and dissociate the two cases only when necessary. Before considering the maximisation, we notice that the joint density can be developed as:

\[
\begin{aligned}
\mathcal{L} &= p(\{y_i\}_i, \mu(t) \mid \Theta) \\
&= p(\{y_i\}_i \mid \mu(t), \Theta)\, p(\mu(t) \mid \Theta) \\
&= \prod_{i=1}^{M} \big\{ p(y_i \mid \mu(t_i), \theta_i, \sigma_i) \big\}\, p(\mu(t) \mid \theta_0) \\
&= \prod_{i=1}^{M} \big\{ \mathcal{N}\big(y_i; \mu(t_i), \Psi_{\theta_i, \sigma_i}^{t_i}\big) \big\}\, \mathcal{N}\big(\mu(t); m(t), K_{\theta_0}^{t}\big).
\end{aligned}
\]

The expectation is taken over p(µ(t) | {y_i}_i), though we simply write E for convenience. We have:

\[
\begin{aligned}
f(\Theta) &= \mathbb{E}\left[ \log \mathcal{L} \right] \\
&= -\frac{1}{2}\, \mathbb{E}\Bigg[ (\mu(t) - m(t))^\intercal {K_{\theta_0}^{t}}^{-1} (\mu(t) - m(t)) + \sum_{i=1}^{M} (y_i - \mu(t_i))^\intercal {\Psi_{\theta_i, \sigma_i}^{t_i}}^{-1} (y_i - \mu(t_i)) \\
&\qquad\qquad - \log \left| {K_{\theta_0}^{t}}^{-1} \right| - \sum_{i=1}^{M} \log \left| {\Psi_{\theta_i, \sigma_i}^{t_i}}^{-1} \right| \Bigg] + C_1.
\end{aligned}
\]
Lemma 8.1. Let \(X \in \mathbb{R}^N\) be a Gaussian random vector, \(X \sim \mathcal{N}(m, K)\), \(b \in \mathbb{R}^N\), and \(S\) an \(N \times N\) covariance matrix. Then:

\[
E = \mathbb{E}_X\left[ (X - b)^\intercal S^{-1} (X - b) \right] = (m - b)^\intercal S^{-1} (m - b) + \mathrm{Tr}\left( K S^{-1} \right).
\]

Proof of Lemma 8.1.

\[
\begin{aligned}
E &= \mathbb{E}_X\left[ \mathrm{Tr}\left( S^{-1} (X - b)(X - b)^\intercal \right) \right] \\
&= \mathrm{Tr}\left( S^{-1}\, \mathbb{V}_X(X - b) \right) + \mathrm{Tr}\left( S^{-1} (m - b)(m - b)^\intercal \right) \\
&= (m - b)^\intercal S^{-1} (m - b) + \mathrm{Tr}\left( K S^{-1} \right).
\end{aligned}
\]

Since X and b play symmetrical roles in the calculus of the conditional expectation, we can apply the lemma regardless of the position of µ in the M + 1 quadratic terms involved. Applying Lemma 8.1 to the previous expression of f(Θ), we obtain:

\[
\begin{aligned}
f(\Theta) = -\frac{1}{2} \Bigg[ &\,(\widehat{m}(t) - m(t))^\intercal {K_{\theta_0}^{t}}^{-1} (\widehat{m}(t) - m(t)) + \sum_{i=1}^{M} (y_i - \widehat{m}(t_i))^\intercal {\Psi_{\theta_i, \sigma_i}^{t_i}}^{-1} (y_i - \widehat{m}(t_i)) \\
&+ \mathrm{Tr}\left( \widehat{K}^{t}\, {K_{\theta_0}^{t}}^{-1} \right) + \sum_{i=1}^{M} \mathrm{Tr}\left( \widehat{K}^{t_i}\, {\Psi_{\theta_i, \sigma_i}^{t_i}}^{-1} \right) - \log \left| {K_{\theta_0}^{t}}^{-1} \right| - \sum_{i=1}^{M} \log \left| {\Psi_{\theta_i, \sigma_i}^{t_i}}^{-1} \right| + C_2 \Bigg].
\end{aligned}
\]

We recall that, at the M step, \(\widehat{m}(t)\) is a known constant, computed at the previous E step. Thus, we identify the characteristic expression of several Gaussian log-likelihoods and the associated correction trace terms. Moreover, each set of hyper-parameters is only involved in independent terms of the function to maximise. Hence, the global maximisation problem can be separated into several maximisations of sub-functions according to the hyper-parameters being optimised. Regardless of additional assumptions, the hyper-parameter θ_0, controlling the covariance matrix of the mean process, appears in a function which is exactly a Gaussian log-likelihood \(\mathcal{N}\big(\widehat{m}(t); m(t), K_{\theta_0}^{t}\big)\), added to the corresponding trace term \(-\frac{1}{2}\mathrm{Tr}\big( \widehat{K}^{t} {K_{\theta_0}^{t}}^{-1} \big)\). This function can be maximised independently of the other parameters, giving the first part of the results in Proposition 3.2 and Proposition 3.3.

Although the idea is analogous for the remaining hyper-parameters, we have to discriminate according to the assumption made on the model. If each individual is supposed to have its own set {θ_i, σ_i}, which can thus be optimised independently of the observations and hyper-parameters of the other individuals, we identify a sum of M Gaussian log-likelihoods \(\mathcal{N}\big(y_i; \widehat{m}(t_i), \Psi_{\theta_i, \sigma_i}^{t_i}\big)\) and the corresponding trace terms \(-\frac{1}{2}\mathrm{Tr}\big( \widehat{K}^{t_i} {\Psi_{\theta_i, \sigma_i}^{t_i}}^{-1} \big)\). This results in M independent maximisation problems on the corresponding functions, proving Proposition 3.2. Conversely, if we assume that all individuals share their hyper-parameters, i.e. {θ, σ} = {θ_i, σ_i}, ∀ i, we can no longer divide the problem into M sub-maximisations, and the whole sum over individuals should be optimised using the observations from all individuals. This case corresponds to the second part of Proposition 3.3.
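To make the separation of the maximisation step concrete, the sub-function associated with θ_0 (a Gaussian log-likelihood of \(\widehat{m}(t)\) plus its trace correction term) can be maximised numerically. The following R sketch is our own illustration, assuming an exponentiated quadratic kernel parameterised on the log scale and a generic optimiser; it is not the optimisation routine implemented in the Magma package. The sub-functions for each {θ_i, σ_i} (Proposition 3.2), or for shared hyper-parameters (Proposition 3.3), can be handled in exactly the same way, replacing \(\widehat{m}(t)\), m(t) and \(\widehat{K}^{t}\) with y_i, \(\widehat{m}(t_i)\) and \(\widehat{K}^{t_i}\).

kernel_theta <- function(t_grid, theta) {
  # theta = (log variance, log length-scale); small jitter added for numerical stability
  exp(theta[1]) * exp(-0.5 * outer(t_grid, t_grid, "-")^2 / exp(theta[2])^2) +
    1e-6 * diag(length(t_grid))
}

# Negative of the sub-function to maximise in theta_0 at the M step (up to constants):
# (m_hat - m)^T K^{-1} (m_hat - m) + log|K| + Tr(K^{-1} K_hat), halved
neg_objective_theta0 <- function(theta, t_grid, m_hat, m, K_hat) {
  K <- kernel_theta(t_grid, theta)
  diff <- m_hat - m
  quad <- sum(diff * solve(K, diff))
  logdet <- as.numeric(determinant(K, logarithm = TRUE)$modulus)
  trace_term <- sum(diag(solve(K, K_hat)))
  0.5 * (quad + logdet + trace_term)
}

# theta0_hat <- optim(c(0, 0), neg_objective_theta0,
#                     t_grid = t_grid, m_hat = m_hat, m = m, K_hat = K_hat)$par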
Availability of data

The synthetic data and table of results are available at https://github.com/ArthurLeroy/MAGMA/tree/master/Simulations
Code availability
The R code associated with the present work is available at https://github.com/ArthurLeroy/MAGMA

Bibliography

Ahmed M. Alaa and Mihaela van der Schaar. Bayesian Inference of Individualized Treatment Effects using Multi-task Gaussian Processes. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 3424–3432. Curran Associates, Inc., 2017.

Mauricio A. Álvarez, Lorenzo Rosasco, and Neil D. Lawrence. Kernels for Vector-Valued Functions: A Review. Foundations and Trends in Machine Learning, 4(3):195–266, 2012. doi: 10.1561/2200000036.

Anjishnu Banerjee, David B. Dunson, and Surya T. Tokdar. Efficient Gaussian process regression for large datasets. Biometrika, 100(1):75–89, 2013. doi: 10.1093/biomet/ass068.

Matthias Bauer, Mark van der Wilk, and Carl Edward Rasmussen. Understanding Probabilistic Sparse Gaussian Process Approximations. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 1533–1541. Curran Associates, Inc., 2016.

C. Biernacki, G. Celeux, and G. Govaert. Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics and Data Analysis, 41(3-4):561–575, 2003.

Hildo Bijl, Jan-Willem van Wingerden, Thomas B. Schön, and Michel Verhaegen. Online sparse Gaussian process regression using FITC and PITC approximations. IFAC-PapersOnLine, 48(28):703–708, 2015. doi: 10.1016/j.ifacol.2015.12.212.

Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York, 2006.

Edwin V. Bonilla, Kian M. Chai, and Christopher Williams. Multi-task Gaussian Process Prediction. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, Advances in Neural Information Processing Systems 20, pages 153–160. Curran Associates, Inc., 2008.

Rich Caruana. Multitask Learning. Machine Learning, 28(1):41–75, July 1997. doi: 10.1023/A:1007379606734.

George Casella. An Introduction to Empirical Bayes Data Analysis. The American Statistician, 39(2):83–87, 1985. doi: 10.2307/2682801.

Christopher Clingerman and Eric Eaton. Lifelong Learning with Gaussian Processes. In Michelangelo Ceci, Jaakko Hollmén, Ljupčo Todorovski, Celine Vens, and Sašo Džeroski, editors, Machine Learning and Knowledge Discovery in Databases, volume 10535, pages 690–704. Springer International Publishing, Cham, 2017. doi: 10.1007/978-3-319-71246-8_42.

Ciprian M. Crainiceanu and A. Jeffrey Goldsmith. Bayesian Functional Data Analysis Using WinBUGS. Journal of Statistical Software, 32(11), 2010.

A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum Likelihood from Incomplete Data via the EM Algorithm. Journal of the Royal Statistical Society. Series B (Methodological), 39(1):1–38, 1977.

David Duvenaud. Automatic Model Construction with Gaussian Processes. Thesis, University of Cambridge, November 2014.

Frédéric Ferraty and Philippe Vieu. Nonparametric Functional Data Analysis: Theory and Practice. Springer Science & Business Media, 2006.

Kohei Hayashi, Takashi Takenouchi, Ryota Tomioka, and Hisashi Kashima. Self-measuring Similarity for Multi-task Gaussian Process. Transactions of the Japanese Society for Artificial Intelligence, 27(3):103–110, 2012. doi: 10.1527/tjsai.27.103.

James Hensman, Nicolo Fusi, and Neil D. Lawrence. Gaussian Processes for Big Data. arXiv:1309.6835 [cs, stat], September 2013.

Thriyambakam Krishnan and G. J. McLachlan. The EM Algorithm and Extensions. John Wiley, New York, 1997.

Pablo Moreno-Muñoz, Antonio Artés-Rodríguez, and Mauricio A. Álvarez. Continual Multi-task Gaussian Processes. arXiv:1911.00002 [cs, stat], 2019.

J. Quiñonero-Candela, C. E. Rasmussen, and C. K. I. Williams. Approximation Methods for Gaussian Process Regression. MIT Press, 2007.

Barbara Rakitsch, Christoph Lippert, Karsten Borgwardt, and Oliver Stegle. It is all in the noise: Efficient multi-task Gaussian process inference with structured residuals. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1466–1474. Curran Associates, Inc., 2013.

James O. Ramsay and Bernard W. Silverman. Functional Data Analysis. Springer, 2005.

Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine Learning. Adaptive Computation and Machine Learning. MIT Press, Cambridge, MA, 2006.

John A. Rice and B. W. Silverman. Estimating the Mean and Covariance Structure Nonparametrically When the Data are Curves. Journal of the Royal Statistical Society. Series B (Methodological), 53(1):233–243, 1991.

Anton Schwaighofer, Volker Tresp, and Kai Yu. Learning Gaussian Process Kernels via Hierarchical Bayes. In NIPS, 2004.

J. Q. Shi, B. Wang, R. Murray-Smith, and D. M. Titterington. Gaussian Process Functional Regression Modeling for Batch Data. Biometrics, 63(3):714–723, 2007. doi: 10.1111/j.1541-0420.2007.00758.x.

Jian Qing Shi and Yafeng Cheng. Gaussian Process Function Data Analysis R Package 'GPFDA'. Manual of the GPFDA Package, 2014.

Jian Qing Shi and Taeryon Choi. Gaussian Process Regression Analysis for Functional Data. CRC Press, 2011.

J. Q. Shi, R. Murray-Smith, and D. M. Titterington. Hierarchical Gaussian process mixtures for regression. Statistics and Computing, 15(1):31–41, 2005. doi: 10.1007/s11222-005-4787-7.

Edward Snelson and Zoubin Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In NIPS, 2006.

Kevin Swersky, Jasper Snoek, and Ryan P. Adams. Multi-Task Bayesian Optimization. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 2004–2012. Curran Associates, Inc., 2013.

Wesley K. Thompson and Ori Rosen. A Bayesian Model for Sparse Functional Data. Biometrics, 64(1):54–63, 2008. doi: 10.1111/j.1541-0420.2007.00829.x.

Michalis K. Titsias. Variational Learning of Inducing Variables in Sparse Gaussian Processes. In AISTATS, 2009.

Christopher Williams, Stefan Klanke, Sethu Vijayakumar, and Kian M. Chai. Multi-task Gaussian Process Learning of Robot Inverse Dynamics. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, Advances in Neural Information Processing Systems 21, pages 265–272. Curran Associates, Inc., 2009.

James T. Wilson, Viacheslav Borovitskiy, Alexander Terenin, Peter Mostowsky, and Marc Peter Deisenroth. Efficiently sampling functions from Gaussian process posteriors. In Proceedings of the 37th International Conference on Machine Learning (ICML), 2020.

Jingjing Yang, Hongxiao Zhu, Taeryon Choi, and Dennis D. Cox. Smoothing and Mean–Covariance Estimation of Functional Data with a Bayesian Hierarchical Model. Bayesian Analysis, 11(3):649–670, 2016. doi: 10.1214/15-BA967.

Jingjing Yang, Dennis D. Cox, Jong Soo Lee, Peng Ren, and Taeryon Choi. Efficient Bayesian hierarchical functional data analysis with basis function approximations using Gaussian-Wishart processes. Biometrics, 73(4):1082–1091, 2017. doi: 10.1111/biom.12705.

Kai Yu, Volker Tresp, and Anton Schwaighofer. Learning Gaussian Processes from Multiple Tasks. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 1012–1019, New York, NY, USA, 2005. ACM. doi: 10.1145/1102351.1102479.

Jiang Zhu and Shiliang Sun. Multi-task Sparse Gaussian Processes with Improved Multi-task Sparsity Regularization. In