No-regret Algorithms for Multi-task Bayesian Optimization
Sayak Ray Chowdhury, Aditya Gopalan
Indian Institute of Science
August 21, 2020
Abstract
We consider multi-objective optimization (MOO) of an unknown vector-valued function in the non-parametric Bayesian optimization (BO) setting, with the aim of learning points on the Pareto front of the objectives. Most existing BO algorithms do not model the fact that the multiple objectives, or equivalently tasks, can share similarities, and even the few that do lack rigorous, finite-time regret guarantees that explicitly capture inter-task structure. In this work, we address this problem by modelling inter-task dependencies using a multi-task kernel and develop two novel BO algorithms based on random scalarizations of the objectives. Our algorithms employ vector-valued kernel regression as a stepping stone and belong to the upper confidence bound class of algorithms. Under a smoothness assumption that the unknown vector-valued function is an element of the reproducing kernel Hilbert space associated with the multi-task kernel, we derive worst-case regret bounds for our algorithms that explicitly capture the similarities between tasks. We numerically benchmark our algorithms on both synthetic and real-life MOO problems, and show the advantages offered by learning with multi-task kernels.
Bayesian optimization (BO) is a popular approach for optimizing a black-box function with expensive, noisy evaluations, and has been extensively applied in applications such as hyper-parameter tuning (Snoek et al., 2012), sensor selection (Garnett et al., 2010) and synthetic gene design (Gonzalez et al., 2015). In many practical scenarios, one is required to optimize multiple objectives together, and moreover, these objectives can be conflicting in nature. For example, consider drug discovery, where each function evaluation is a costly laboratory experiment and its output is a measurement of both the potency and the side-effects of a candidate drug (Paria et al., 2019). These two objectives are typically conflicting, since one would like to maximize the potency of the drug while also keeping its side-effects to a minimum. Other examples include trade-offs such as bias and variance, accuracy and calibration (Guo et al., 2017), and accuracy and fairness (Zliobaite, 2015). These problems can be framed as optimizing a vector-valued function $f = (f_1, \ldots, f_n)$, where each component is a real-valued function and corresponds to a particular objective or task. Since one often cannot optimize all $f_i$'s simultaneously, most multi-objective optimization (MOO) approaches aim to recover the set of Pareto optimal points, where intuitively a point is Pareto optimal if there is no way to improve on all objectives simultaneously (Knowles, 2006; Ponweiser et al., 2008). Popular BO strategies in this regard include predictive entropy search (Hernández-Lobato et al., 2016), max-value entropy search (Belakaria et al., 2019), Pareto active learning (Zuluaga et al., 2013), expected hypervolume improvement (Emmerich and Klinkenberg, 2008), sequential uncertainty reduction (Picheny, 2015) and scalarization-based approaches (Roijers et al., 2013). Random scalarizations, in particular, have been shown to be flexible enough to model user preferences in capturing the whole or a part of the Pareto front (Paria et al., 2019).

Most multi-objective BO approaches maintain $n$ different Gaussian processes (GPs) (Rasmussen, 2003), one for each task or objective $f_i$. However, in general, the tasks share some underlying structure and cannot be treated as unrelated objects. By making use of this structure, one might benefit significantly from learning the tasks simultaneously as opposed to learning them independently. For example, consider predicting consumer preferences simultaneously based on their past history (Evgeniou et al., 2005). Each task is to learn the preference of a particular consumer, and the tasks are related since people with similar tastes tend to buy similar items. Other examples include simultaneous estimation of many related indicators in economic forecasting (Greene, 2003) and predicting tumour behaviour from multiple related diseases (Rifkin et al., 2003). However, assuming similarities in a set of tasks and blindly learning them together can be detrimental (Caruana, 1997). Hence, it is important to have a model that benefits learning when the tasks are related and does not hurt performance when the tasks are unrelated. This can be achieved by maintaining a multi-task GP over $f$, which directly induces correlations between tasks (Bonilla et al., 2008). In the context of BO, Swersky et al.
(2013) empirically demonstrate the utility of this model in a number of applications, and Astudillo and Frazier (2019) provide an asymptotic convergence analysis under a special setting of composite objective functions and noise-free evaluations. However, a formal finite-time regret analysis showing the effectiveness of multi-task GPs over independent GPs in the context of noisy MOO has not been rigorously pursued. Against this backdrop, we make the following contributions:

• We develop two novel BO algorithms, multi-task kernelized bandits (MT-KB) and multi-task budgeted kernelized bandits (MT-BKB), that are based on random scalarizations and can leverage similarities between tasks to optimize them more efficiently.

• Our algorithms use vector-valued kernel ridge regression as a building block and follow the general template of the upper-confidence-bound class of algorithms. Also, MT-BKB is the first algorithm that employs the
Nyström approximation in the context of multi-task kernels.

• Under the assumption that the objective function has smoothness compatible with a joint kernel on its domain and components, we derive (scalarization-induced) regret bounds for our algorithms that explicitly capture the inter-task structure. These are the first worst-case (frequentist) regret bounds for multi-objective BO, and are proved by deriving a novel concentration inequality for the estimate of the vector-valued objective function, which might be of independent interest.

• Finally, our algorithms are simple to implement when the kernel decouples between tasks and domain, and we report numerical results on synthetic as well as real-world datasets, for which the algorithms are seen to perform favourably.
Related work.
In the field of geostatistics (Wackernagel, 2013), and more recently in supervised learning (Liu et al., 2018), multi-task GPs and associated kernels have gained a lot of traction. A lot of work has also been done on vector-valued learning with kernel methods (Micchelli and Pontil, 2005; Baldassarre et al., 2012; Grünewälder et al., 2012), and this paper complements that literature by considering an online learning setting. A simple version of multi-objective black-box optimization, in the form of online learning in finite multi-armed bandits (MABs), has been considered in (Drugan and Nowe, 2013; Drugan and Nowé, 2014). This paper, in effect, generalizes these works to the more challenging setting of infinite-armed bandits, which has been studied extensively in the single-task setting (Srinivas et al., 2010; Chowdhury and Gopalan, 2017; Scarlett et al., 2017).
We consider the problem of maximizing a vector-valued function $f(x) = [f_1(x), \ldots, f_n(x)]^\top$ over a compact domain $\mathcal{X} \subset \mathbb{R}^d$. (We denote by $[n]$ the set $\{1, 2, \ldots, n\}$.) At each round $t$, a learner queries $f$ at a single point $x_t \in \mathcal{X}$ and observes a noisy output $y_t = f(x_t) + \varepsilon_t$, where $\varepsilon_t \in \mathbb{R}^n$ is a zero-mean sub-Gaussian random vector conditioned on $\mathcal{F}_{t-1}$, the $\sigma$-algebra generated by the random variables $\{x_s, \varepsilon_s\}_{s=1}^{t-1}$ and $x_t$. By this we mean that there exists a $\sigma \geq 0$ such that, for all $\alpha \in \mathbb{R}^n$ and all $t \geq 1$,
\[ \mathbb{E}\big[\exp(\alpha^\top \varepsilon_t) \,\big|\, \mathcal{F}_{t-1}\big] \;\leq\; \exp\big(\sigma^2 \|\alpha\|_2^2 / 2\big). \]
The query point $x_t$ at round $t$ is chosen causally, depending upon the history $\{(x_s, y_s)\}_{s=1}^{t-1}$ of queries and outputs available up to round $t-1$. Since one cannot optimize all $f_i$'s simultaneously, the learner's aim is to find the set of Pareto-optimal points, denoted by $\mathcal{X}_f$. A point $x$ is said to be Pareto dominated by $x'$ if $f(x) \prec f(x')$, where for any $u, v \in \mathbb{R}^n$, $u \prec v$ denotes that $u_i \leq v_i$ for all $i \in [n]$ and $u_j < v_j$ for some $j \in [n]$. A point is Pareto optimal if it is not Pareto dominated by any other point, i.e., $x \in \mathcal{X}_f$ if $f(x) \nprec f(x')$ for all $x' \neq x$. The Pareto front of $f$ is denoted by $f(\mathcal{X}_f)$, where for any set $\mathcal{A}$, $f(\mathcal{A}) := \{f(x) \mid x \in \mathcal{A}\}$.

Random scalarizations and Pareto optimality
A common approach to solving the multi-objective optimization problem is to convert the objective vectors $f(x)$ into single-objective scalars using a scalarization function $s_\lambda : \mathbb{R}^n \to \mathbb{R}$, parameterized by a weight vector $\lambda \in \Lambda \subset \mathbb{R}^n$. Similar to Paria et al. (2019), we assume random scalarizations, i.e., access to a (known) distribution $P_\lambda$ with its support on $\Lambda$. Thus, instead of maximizing a single scalarized objective, we aim to maximize over a set of scalarized objectives induced by $P_\lambda$. (Note that a Dirac-delta distribution $P_\lambda$, which puts all of its mass on a single point, yields a deterministic scalarization.) We also assume that, for all $\lambda$, the scalarization function $s_\lambda$ is $L_\lambda$-Lipschitz in the $\ell_2$-norm, i.e., for all $u, v \in \mathbb{R}^n$, $|s_\lambda(u) - s_\lambda(v)| \leq L_\lambda \|u - v\|_2$. Commonly used scalarization functions include the linear scalarization $s_\lambda(y) = \sum_{i=1}^n \lambda_i y_i$ and the Chebyshev scalarization $s_\lambda(y) = \min_{i \in [n]} \lambda_i (y_i - z_i)$, where $z \in \mathbb{R}^n$ is a reference point and $\lambda$ lies in the set $\Lambda = \{\lambda \succ 0 : \|\lambda\|_1 = 1\}$ (Nakayama et al., 2009). Apart from being Lipschitz, another important property these scalarizations have is monotonicity in all coordinates, i.e., $s_\lambda(u) < s_\lambda(v)$ whenever $u \prec v$. Monotonicity ensures that $x^\star_\lambda := \operatorname{argmax}_{x \in \mathcal{X}} s_\lambda(f(x))$, the maximizer of the scalarized objective, is a Pareto optimal point: otherwise, if $f(x^\star_\lambda) \prec f(x)$ for some $x \neq x^\star_\lambda$, we would have $s_\lambda(f(x^\star_\lambda)) < s_\lambda(f(x))$, yielding a contradiction. Therefore, $P_\lambda$ defines a probability distribution over the Pareto optimal set $\mathcal{X}_f$, and thus, in turn, over the Pareto front $f(\mathcal{X}_f)$. Hence, the distribution $P_\lambda$ provides the flexibility to sample from the entire Pareto front or a part of it, depending on the application (Paria et al., 2019).

Performance metric
Given a budget of $T$ rounds, our goal is to find a set $\mathcal{X}_T := \{x_1, \ldots, x_T\} \subset \mathcal{X}$ such that $f(\mathcal{X}_T)$ well approximates the high-probability regions of the Pareto front $f(\mathcal{X}_f)$. This can be achieved, as shown in Paria et al. (2019), by minimizing the Bayes regret, defined as $R_B(T) = \mathbb{E}_{P_\lambda}[r_\lambda(T)]$, where
\[ r_\lambda(T) = s_\lambda\big(f(x^\star_\lambda)\big) - \max_{x \in \mathcal{X}_T} s_\lambda(f(x)). \]
To see this, note that achieving a low Bayes regret requires $r_\lambda(T)$ to be low for all $\lambda \in \Lambda$ that have high mass under $P_\lambda$. Now, by definition, $r_\lambda(T) = 0$ if $x^\star_\lambda \in \mathcal{X}_T$, and also, by monotonicity, $x^\star_\lambda \in \mathcal{X}_f$. Then, by Lipschitz continuity, a low value of $R_B(T)$ will essentially imply that $f(\mathcal{X}_T)$ "spans" the high-probability regions of $f(\mathcal{X}_f)$. A more classical performance measure is the (scalarized) cumulative regret
\[ R_C(T) = \sum_{t=1}^{T} \mathbb{E}_{\lambda_t \sim P_\lambda}\big[ s_{\lambda_t}\big(f(x^\star_{\lambda_t})\big) - s_{\lambda_t}(f(x_t)) \big], \]
where each $\lambda_t$ is independent and $P_\lambda$-distributed. If $\Lambda$ is a bounded set and the scalarization $s_\lambda$ is also Lipschitz in $\lambda$, then one can show that $R_B(T) \leq \frac{1}{T} R_C(T) + o(1)$ (Paria et al., 2019). A sub-linear growth of $R_C(T)$ with $T$ then implies that $R_B(T) \to 0$ as $T \to \infty$.

Regularity assumptions
Attaining non-trivial regret bounds is impossible in general for arbitrary vector-valued functions $f$; thus some regularity assumptions are in order. We call a mapping $\Gamma : \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{n \times n}$ a multi-task kernel on $\mathcal{X}$ if $\Gamma(x, x')^\top = \Gamma(x', x)$ for any $x, x' \in \mathcal{X}$, and it is positive definite, i.e., for any $m \in \mathbb{N}$, $\{x_i\}_{i=1}^m \subseteq \mathcal{X}$ and $\{y_i\}_{i=1}^m \subseteq \mathbb{R}^n$, it holds that
\[ \sum_{i,j=1}^{m} y_i^\top \Gamma(x_i, x_j)\, y_j \;\geq\; 0. \]
(In its more general form, this definition can be lifted from $\mathbb{R}^n$ to any arbitrary Hilbert space $\mathcal{H}$ (Caponnetto et al., 2008).) Given a continuous (relative to the induced matrix norm) multi-task kernel $\Gamma$ on $\mathcal{X}$, there exists a unique (modulo an isometry) reproducing kernel Hilbert space (RKHS) of vector-valued continuous functions $g : \mathcal{X} \to \mathbb{R}^n$ with $\Gamma$ as its reproducing kernel (Carmeli et al., 2010). We denote this RKHS by $\mathcal{H}_\Gamma(\mathcal{X})$, with corresponding inner product $\langle \cdot, \cdot \rangle_\Gamma$. Then, for every $x \in \mathcal{X}$, there exists a bounded linear operator $\Gamma_x : \mathbb{R}^n \to \mathcal{H}_\Gamma(\mathcal{X})$ such that
\[ \forall x' \in \mathcal{X},\ \Gamma(x, x') = \Gamma_x^\top \Gamma_{x'} \quad \text{and} \quad \forall g \in \mathcal{H}_\Gamma(\mathcal{X}),\ g(x) = \Gamma_x^\top g. \]
Here, $\Gamma_x^\top$ denotes the adjoint of $\Gamma_x$ (with a slight abuse of notation), and it is the unique operator satisfying
\[ \forall g \in \mathcal{H}_\Gamma(\mathcal{X}),\ \forall y \in \mathbb{R}^n,\ \langle \Gamma_x^\top g, y \rangle = \langle g, \Gamma_x y \rangle_\Gamma. \]
We assume that the objective function $f$ is an element of the RKHS $\mathcal{H}_\Gamma(\mathcal{X})$ and that its norm in $\mathcal{H}_\Gamma(\mathcal{X})$ is bounded, i.e., there exists a $b < \infty$ such that $\|f\|_\Gamma \leq b$. This is a measure of smoothness of $f$, since, by the reproducing property,
\[ \forall x, x' \in \mathcal{X},\ \|f(x) - f(x')\|_2 \;\leq\; \|f\|_\Gamma \, \|\Gamma_x - \Gamma_{x'}\|, \]
where $\|\Gamma_x\| := \sup_{\|y\|_2 \leq 1} \|\Gamma_x y\|_\Gamma$ denotes the operator norm. Further, we assume that there exists a $\kappa < \infty$ such that $\|\Gamma(x, x)\| \leq \kappa$ for all $x \in \mathcal{X}$. Note that in the single-task setting ($n = 1$), the kernel $\Gamma$ is scalar-valued and the RKHS $\mathcal{H}_\Gamma(\mathcal{X})$ consists of real-valued functions. In this case, the bounded-norm assumption holds for stationary kernels, e.g., the squared exponential (SE) kernel and the Matérn kernel (Srinivas et al., 2010; Chowdhury and Gopalan, 2017).
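To make the definition concrete, here is a minimal Python sketch (our own illustration, not code from the paper) of a separable multi-task kernel of the form $\Gamma(x, x') = k(x, x')\,B$, which returns an $n \times n$ matrix for every pair of inputs and is a valid multi-task kernel whenever the scalar kernel $k$ is positive definite and $B$ is p.s.d. The function names and the lengthscale value are placeholders; the choice of $B$ below (a convex combination of the identity and the all-ones matrix) anticipates the examples discussed in the next paragraph.

```python
import numpy as np

def se_kernel(x, x_prime, lengthscale=0.2):
    """Scalar squared exponential kernel k(x, x'); the lengthscale is illustrative."""
    d = np.asarray(x, float) - np.asarray(x_prime, float)
    return float(np.exp(-np.dot(d, d) / (2.0 * lengthscale ** 2)))

def make_icm_kernel(k, B):
    """Multi-task kernel Gamma(x, x') = k(x, x') * B, an (n x n) matrix per input pair."""
    def Gamma(x, x_prime):
        return k(x, x_prime) * B
    return Gamma

# Example: n = 3 tasks whose similarity is controlled by omega in [0, 1].
n, omega = 3, 0.5
B = omega * np.eye(n) + (1.0 - omega) * np.ones((n, n)) / n
Gamma = make_icm_kernel(se_kernel, B)
# Symmetry check: Gamma(x, x')^T = Gamma(x', x).
assert np.allclose(Gamma(0.3, 0.7), Gamma(0.7, 0.3).T)
```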
Examples of multi-task (MT) kernels
It is possible to construct MT kernels usingscalar kernels k : X × X → R + . Evgeniou et al. (2005) consider the kernel Γ( x, x (cid:48) ) = k ( x, x (cid:48) ) ( ωI n + (1 − ω )1 n /n ) , where I n is the n × n identity matrix, n is the n × n all-one matrix and ω ∈ [0 , is aparameter that governs the similarity level between components of f . The choice ω = 1 corresponds to assuming that all tasks are unrelated and possible similarity among themis not exploited. Conversely, ω = 0 is equivalent to assuming that all tasks are identicaland can be explained by the same function. Swersky et al. (2013) consider a moregeneral class of kernels known as the intrinsic coregionalization model (ICM), whichincludes the aforementioned kernel as a special case. The kernels are of the form Γ( x, x (cid:48) ) = k ( x, x (cid:48) ) B , B is an n × n p.s.d. matrix that encodes the inter-task structure. This class ofkernels is called separable since it allows to decouple the contribution of input andoutput in the covariance structure (Alvarez et al., 2011). We consider stationary scalarkernels k with unit variances – to avoid redundancy in the parameterization – since thevariances can be captured fully by B (Bonilla et al., 2008). The main advantage ofICM is that one can use the eigen-system of B to define a new coordinate system where Γ becomes block diagonal, reducing the computational burden to a great extent. Thediagonal MT kernel Γ( x, x (cid:48) ) = diag ( k ( x, x (cid:48) ) , . . . , k n ( x, x (cid:48) )) has the same advantage,but corresponds to treating each task independently using different scalar kernels k j .However, in general, a MT kernel will not be diagonal, and moreover cannot be reducedto a diagonal one by linearly transforming the output space. For example, it is impossibleto reduce the kernel Γ( x, x (cid:48) ) = (cid:80) Mj =1 k j ( x, x (cid:48) ) B j , M (cid:54) = 1 , to a diagonal one, unlessall the n × n matrices B j are simultaneously diagonalizable (Caponnetto et al., 2008). We follow the general template of upper confidence bound (UCB) class of BO algorithms(Srinivas et al., 2010; Chowdhury and Gopalan, 2017) suitably adapted to the multi-tasksetting. At each round t , we randomly sample a weight vector λ t from the distribution P λ , and compute a multi-task acquisition function u t : X → R to act as an UCB forthe unknown function f , based on the random scalarization s λ t . Whenever u t ( x ) is avalid UCB, i.e., s λ t ( f ( x )) (cid:54) u t ( x ) , and it converges to s λ t ( f ( x )) “sufficiently" fast,then selecting candidates that are optimal with respect to u t leads to low (scalarized)regret, i.e., the scalarized objective s λ t ( f ( x t )) at x t ∈ argmax x ∈X u t ( x ) tends to s λ t (cid:0) f ( x (cid:63)λ t ) (cid:1) as t increases. The intuition behind our approach, at a high level, is thatthe set f ( X t ) tends to the high probability regions of the Pareto front as t increases.It now remains to design a principled multi-task acquisition function u t based on thescalarization s λ t , and in what follows, we shall describe two algorithms for that. Given the data { ( x i , y i ) } ti =1 ⊂ X × R n , we first aim to find an estimate of f by solvinga vector-valued regression problem: min f ∈H Γ ( X ) t (cid:88) i =1 (cid:107) y i − f ( x i ) (cid:107) + η (cid:107) f (cid:107) , where η > is a regularizing parameter. Micchelli and Pontil (2005) show that thesolution of this minimization problem can be written as µ t = t (cid:88) i =1 Γ x i α i . 
Here, $\{\alpha_i\}_{i=1}^t \subseteq \mathbb{R}^n$ is the unique solution of the linear system of equations
\[ \sum_{i=1}^{t} \big(\Gamma(x_j, x_i) + \eta\, \delta_{j,i} I_n\big)\, \alpha_i = y_j, \quad 1 \leq j \leq t, \]
where $\delta_{j,i}$ denotes the Kronecker delta. Now, by the reproducing property, we have
\[ \mu_t(x) = \Gamma_x^\top \mu_t = G_t(x)^\top (G_t + \eta I_{nt})^{-1} Y_t, \]
where the kernel matrix $G_t = [\Gamma(x_i, x_j)]_{i,j=1}^t$ is a $t \times t$ block matrix with each block being an $n \times n$ matrix (so that $G_t$ is an $nt \times nt$ matrix), $Y_t = [y_1^\top, \ldots, y_t^\top]^\top$ is an $nt \times 1$ vector with the outputs concatenated, and $G_t(x) = [\Gamma(x, x_1)^\top, \ldots, \Gamma(x, x_t)^\top]^\top$ is an $nt \times n$ matrix. Notice that $G_t(x)$ can be interpreted as an embedding of a point $x$ supported over the points $x_1, \ldots, x_t$ observed so far. Now, if an arm $x$ is sufficiently unexplored, the estimate $\mu_t(x)$ will, in general, have high variance. One natural way of specifying the uncertainty around $\mu_t(x)$ is via the following multi-task kernel:
\[ \Gamma_t(x, x') = \Gamma(x, x') - G_t(x)^\top (G_t + \eta I_{nt})^{-1} G_t(x'), \quad x, x' \in \mathcal{X}. \qquad (1) \]
To see this, we draw a connection to multi-task Gaussian processes (MT-GPs) (Liu et al., 2018). Let $f \sim \mathcal{GP}(0, \Gamma)$ be a sample from a zero-mean MT-GP with covariance function $\Gamma$ (i.e., $\mathbb{E}[f_i(x)] = 0$ and $\mathbb{E}[f_i(x) f_j(x')] = \Gamma(x, x')_{ij}$ for all $i, j \in [n]$ and $x, x' \in \mathcal{X}$), and assume that the observation noise vectors $\{\varepsilon_t\}_{t \geq 1}$ are independent and $\mathcal{N}(0, \eta I_n)$ distributed. Then the posterior distribution of $f$ conditioned on the data $\{(x_i, y_i)\}_{i=1}^t$ is also an MT-GP with mean $\mu_t$ and covariance $\Gamma_t$, yielding a natural uncertainty model. Now, inspired by the optimism-in-the-face-of-uncertainty principle, we compute the acquisition function for the next round as
\[ u_{t+1}(x) = s_{\lambda_{t+1}}(\mu_t(x)) + L_{\lambda_{t+1}}\, \beta_t\, \|\Gamma_t(x, x)\|^{1/2}, \qquad (2) \]
where $L_\lambda$ is the Lipschitz constant of the scalarization $s_\lambda$. As a result, selecting the arm $x_{t+1}$ with the highest $u_{t+1}$ inherently trades off exploitation, i.e., picking points with high (scalarized) reward $s_{\lambda_{t+1}}(\mu_t(x))$, with exploration, i.e., picking points with high uncertainty $\|\Gamma_t(x, x)\|^{1/2}$. The parameter $\beta_t$ balances these two objectives and needs to be tuned properly to guarantee low regret. The pseudo-code of MT-KB is given in Algorithm 1.

Computational complexity
Maximizing the acquisition function $u_t(x)$ over $\mathcal{X}$ is in general NP-hard even for a single task, since it is a highly non-convex function. To simplify the exposition, in what follows we will assume that an efficient oracle to optimize $u_t(x)$, such as DIRECT (Brochu et al., 2010), is provided to us, so that the per-step cost comes only from computing $u_t(x)$. Now, the cost of computing $u_t(x)$ is dominated by the cost of inverting the $nt \times nt$ kernel matrix, and thus in principle scales as $O(n^3 t^3)$. We note that the cubic dependency on time $t$ is present even in the single-task ($n = 1$) setting (Shahriari et al., 2015), and in this case MT-KB in fact reduces to the GP-UCB algorithm (Srinivas et al., 2010). This cost can be reduced to $O(n^2 t^2)$ using Schur's complement, but at an additional storage cost of $O(n^2 t^2)$.

Algorithm 1 Multi-task kernelized bandits (MT-KB)
Require:
Kernel Γ , distribution P λ , scalarization s λ , time budget T , parameters η , { β t } T − t =0 Initialize µ ( x ) = 0 and Γ ( x, x (cid:48) ) = Γ( x, x (cid:48) ) for round t = 1 , , , . . . , T do Sample weight vector λ t ∼ P λ Compute acquisition function u t ( x ) = s λ t ( µ t − ( x )) + L λ t β t − (cid:107) Γ t − ( x, x ) (cid:107) / Select point x t ∈ argmax x ∈X u t ( x ) Get vector-valued output y t = f ( x t ) + ε t Compute G t ( x ) = (cid:2) Γ( x , x ) (cid:62) , . . . , Γ( x t , x ) (cid:62) (cid:3) (cid:62) , G t = [Γ( x i , x j )] ti,j =1 , Y t = (cid:2) y (cid:62) , . . . , y (cid:62) t (cid:3) (cid:62) Update µ t ( x ) = G t ( x ) (cid:62) ( G t + ηI nt ) − Y t Γ t ( x, x ) = Γ( x, x ) − G t ( x ) (cid:62) ( G t + ηI nt ) − G t ( x ) end forRemark 1 The diagonal MT kernel Γ( x, x (cid:48) ) = diag ( k ( x, x (cid:48) ) , . . . , k n ( x, x (cid:48) )) corre-sponds to treating each task independently and the problem reduces to inverting n kernel matrices yielding a per-step cost of O ( nt ) for MT-KB. This is similar to theprior works (Hernández-Lobato et al., 2016; Paria et al., 2019; Belakaria et al., 2019)which assume that each task f i is sampled independently from the scalar Gaussianprocess GP (0 , k i ) . One common approach to improve computational scalability in kernel methods is theNyström approximation (Drineas and Mahoney, 2005), which restricts the embeddings G t ( x ) and the kernel matrix G t to be supported on a subset (dictionary) D t of selectedpoints. However, this can lead to sub-optimal choices and large regret if D t is notsufficiently accurate. This brings about a trade-off between larger and more accuratedictionaries, or smaller and more efficient ones. The BKB algorithm solves this forsingle-task BO (Calandriello et al., 2019). We now generalize BKB for multiple tasksto improve over the O ( n t ) cost of MT-KB. The central idea behind this algorithm is to evaluate an approximate acquisition function ˜ u t ( x ) , which remains a valid UCB over the scalarized function s λ t ( f ( x )) and at thesame time is sufficiently close to u t ( x ) to ensure low regret. Given the data { ( x i , y i ) } ti =1 ,we start with an empty dictionary D t = ∅ and iterate over the set { x , . . . , x t } toupdate D t as follows. For each candidate x i , we compute an inclusion probability p t,i , and add x i to D t with probabability p t,i . The inclusion probabilities p t,i needto be set suitably so that the dictionary is small enough without compromising on8ts accuracy. Once the sampling is over, let D t be given by the set { x i , . . . , x i mt } ,where m t is the size of D t and i j ∈ [ t ] for each j ∈ [ m t ] . Given the dictionary D t , let ˜ G t ( x ) = (cid:2) Γ( x i , x ) (cid:62) / √ p t,i , . . . , Γ( x i mt , x ) (cid:62) / √ p t,i mt (cid:3) (cid:62) be the nm t × n embeddingof x supported over all points in D t and ˜ G t = (cid:2) Γ( x i u , x i v ) / √ p t,i u p t,i v (cid:3) m t u,v =1 bethe corresponding nm t × nm t kernel matrix, properly reweighted by the inclusionprobabilities. Then we compute the Nyström embeddings as ˜Φ t ( x ) = (cid:0) ˜ G / t (cid:1) + ˜ G t ( x ) , where ( · ) + denotes the pseudo-inverse. 
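The following Python sketch (our own illustration; the helper names and structure are ours, not the paper's code) shows how the reweighted dictionary and the Nyström embedding $\tilde{\Phi}_t(x) = (\tilde{G}_t^{1/2})^+ \tilde{G}_t(x)$ described above could be computed for a generic multi-task kernel. The pseudo-inverse of the p.s.d. square root is computed via an eigen-decomposition.

```python
import numpy as np

def sample_dictionary(points, var_norms, q, rng):
    """Include x_i with probability p_i = min(q * ||Gamma_{t-1}(x_i, x_i)||, 1),
    mirroring the inclusion rule of MT-BKB; returns the dictionary and its probabilities."""
    dictionary, probs = [], []
    for x, v in zip(points, var_norms):
        p = min(q * v, 1.0)
        if rng.random() < p:
            dictionary.append(x)
            probs.append(p)
    return dictionary, probs

def psd_sqrt_pinv(M, tol=1e-10):
    """Pseudo-inverse of the p.s.d. square root M^{1/2}."""
    w, V = np.linalg.eigh(M)
    inv_sqrt = np.where(w > tol, 1.0 / np.sqrt(np.clip(w, tol, None)), 0.0)
    return (V * inv_sqrt) @ V.T

def nystrom_embedding(Gamma, dictionary, probs, x):
    """Embedding Phi_t(x) = (G_t^{1/2})^+ G_t(x), with kernel blocks reweighted
    by 1/sqrt(p_i); the result has shape (n * m_t, n)."""
    G = np.block([[Gamma(xu, xv) / np.sqrt(pu * pv)
                   for xv, pv in zip(dictionary, probs)]
                  for xu, pu in zip(dictionary, probs)])
    Gx = np.vstack([Gamma(xi, x) / np.sqrt(pi) for xi, pi in zip(dictionary, probs)])
    return psd_sqrt_pinv(G) @ Gx
```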
We now use these embeddings to approximate µ t and Γ t as ˜ µ t ( x ) = ˜Φ t ( x ) (cid:62) ( ˜ V t + ηI nm t ) − t (cid:88) s =1 ˜Φ t ( x s ) y s , ˜Γ t ( x, x (cid:48) ) = Γ( x, x (cid:48) ) − ˜Φ t ( x ) (cid:62) ˜Φ t ( x (cid:48) ) + η ˜Φ t ( x ) (cid:62) ( ˜ V t + ηI nm t ) − ˜Φ t ( x (cid:48) ) , where ˜ V t = (cid:80) ts =1 ˜Φ t ( x s ) ˜Φ t ( x s ) (cid:62) is an nm t × nm t matrix. Finally, similar to (2), wecompute the acquisition function for the next round as ˜ u t +1 ( x ) = s λ t +1 (˜ µ t ( x )) + L λ t +1 ˜ β t (cid:107) ˜Γ t ( x, x ) (cid:107) / , with ˜ β t governing the exploration-exploitation tradeoff. The inclusion probabilitiesfor the next round are computed as p t +1 ,i = min (cid:8) q (cid:107) ˜Γ t ( x i , x i ) (cid:107) , (cid:9) , where q (cid:62) isa parameter trading-off the size of the dictionary and accuracy of the approximation.We note here that constructing D t based on approximate posterior variance samplingis well-studied for scalar kernels (Alaoui and Mahoney, 2015), and in this work, weintroduce it for the first time for MT kernels. The pseudo-code of MT-BKB is given inAlgorithm 2. Computational complexity
Computing the dictionary involves a linear search overall selected points while the inclusion probabilities are computed already at the previousround, and thus requires O ( t ) time per step. The Nyström embeddings ˜Φ t ( x ) can becomputed in O ( n m t ) time, since an inversion of the matrix ˜ G t is required. By usingthese embeddings, ˜ V t can now be computed and inverted in O ( n m t t ) and O ( n m t ) time, respectively. Since, in general, m t (cid:54) t , the total per step cost of computing theacquisition function ˜ u t ( x ) is now O ( n m t t ) as opposed to the O ( n t ) cost of MT-KB.The computational advantage of MT-BKB is clearly visible when the dictionary size m t is near constant at every step, i.e., when m t = ˜ O (1) , where ˜ O ( · ) hides constant and log factors. We shall see in Section 4.1 that this holds, for example, for the intrinsiccoregionalization model (ICM) with the squared exponential kernel in its scalar part. The computational cost of our algorithms can be greatly reduced for ICM kernels Γ( x, x (cid:48) ) = k ( x, x (cid:48) ) B . Let { ξ i } ni =1 be the eigenvalues of B with corresponding orthonor-mal eigenvectors { u i } ni =1 . We then have the kernel matrix G t = (cid:80) ni =1 ξ i K t ⊗ u i u (cid:62) i and9 lgorithm 2 Multi-task budgeted kernelized bandits (MT-BKB)
Require:
Kernel Γ , distribution P λ , scalarization s λ , time budget T , parameters η , q , { ˜ β t } T − t =0 Initialize ˜ µ ( x ) = 0 and ˜Γ ( x, x (cid:48) ) = Γ( x, x (cid:48) ) for round t = 1 , , , . . . , T do Sample weight vector λ t ∼ P λ Compute acquisition function ˜ u t ( x ) = s λ t (˜ µ t − ( x )) + L λ t ˜ β t − (cid:107) ˜Γ t − ( x, x ) (cid:107) / Select point x t ∈ argmax x ∈X ˜ u t ( x ) Get vector-valued output y t = f ( x t ) + ε t Initialize dictionary D t = ∅ for i = 1 , , , . . . , t do Set inclusion probability p t,i = min (cid:110) q (cid:107) ˜Γ t − ( x i , x i ) (cid:107) , (cid:111) Draw z t,i ∼ Bernoulli ( p t,i ) if z t,i = 1 then Update D t = D t ∪ { x i } end ifend for Set m t = |D t | , enumerate D t = { x i , . . . , x i mt } and compute ˜ G t ( x ) = (cid:34) √ p t,i Γ( x i , x ) (cid:62) , . . . , √ p t,i mt Γ( x i mt , x ) (cid:62) (cid:35) (cid:62) , ˜ G t = (cid:20) √ p t,i u p t,i v Γ( x i u , x i v ) (cid:21) m t u,v =1 Find Nyström embeddings ˜Φ t ( x ) = (cid:16) ˜ G / t (cid:17) + ˜ G t ( x ) Compute ˜ V t = (cid:80) ts =1 ˜Φ t ( x s ) ˜Φ t ( x s ) (cid:62) and update ˜ µ t ( x ) = ˜Φ t ( x ) (cid:62) ( ˜ V t + ηI nm t ) − (cid:88) ts =1 ˜Φ t ( x s ) y s ˜Γ t ( x, x ) = Γ( x, x ) − ˜Φ t ( x ) (cid:62) ˜Φ t ( x ) + η ˜Φ t ( x ) (cid:62) ( ˜ V t + ηI nm t ) − ˜Φ t ( x ) end for the output vector Y t = (cid:80) ni =1 Y it ⊗ u i , where ⊗ denotes the Kronecker product, K t =[ k ( x i , x j )] ti,j =1 is the kernel matrix of the scalar kernel k and Y it = [ y (cid:62) u i , . . . , y (cid:62) t u i ] (cid:62) .Plugging these into (3.1) and (1), and using properties of Kronecker product, we nowobtain µ t ( x ) = n (cid:88) i =1 ξ i k t ( x ) (cid:62) ( ξ i K t + ηI t ) − Y it u i , (cid:107) Γ t ( x, x ) (cid:107) = max (cid:54) i (cid:54) n ξ i (cid:0) k ( x, x ) − ξ i k t ( x ) (cid:62) ( ξ i K t + ηI t ) − k t ( x ) (cid:1) , where k t ( x ) = [ k ( x , x ) , . . . , k ( x t , x )] (cid:62) . We see that the eigen-decomposition of B needs to be computed only once at the beginning and then, in the new coordinate system,we essentially have to solve n independent problems. Specifically, at round t , we10eed to project the vector-valued output y t to all coordinates and compute n matrix-vector multiplications of size t . However, since the kernel matrix K t is rescaled by theeigenvalues ξ i , we have to perform only one t × t inversion. Hence, the per-step timecomplexity of MT-KB is now O (cid:0) n + ( n + t ) t (cid:1) as opposed to O ( n t ) for generalMT kernels. Similarly, the per-step cost of MT-BKB can be substantially improved to O (cid:0) n + ( n + m t ) m t t (cid:1) from the O ( n m t t ) cost in general. Therefore, the kernels ofthis form allow for a near-linear (in time t ) per-step cost of MT-BKB at the price of theeigen-decomposition of B . (We defer the details to appendix A.) We now present the first theoretical result of this work, a concentration inequality forthe estimate of the unknown multi-task objective function f , which is then used to provethe regret bounds for our algorithms. (Complete proofs of all results presented in thissection are deferred to the appendix.) Theorem 1 (Multi-task concentration inequality)
Let $f \in \mathcal{H}_\Gamma(\mathcal{X})$ and let the noise vectors $\{\varepsilon_t\}_{t \geq 1}$ be $\sigma$-sub-Gaussian. Then, for any $\eta > 0$ and $\delta \in (0, 1]$, with probability at least $1 - \delta$, the following holds uniformly over all $x \in \mathcal{X}$ and $t \geq 1$:
\[ \|f(x) - \mu_t(x)\|_2 \;\leq\; \Big( \|f\|_\Gamma + \frac{\sigma}{\sqrt{\eta}} \sqrt{2 \log(1/\delta) + \log\det\big(I_{nt} + \eta^{-1} G_t\big)} \Big)\, \|\Gamma_t(x, x)\|^{1/2}. \]
The significance of this bound can be better understood by studying the log-determinant term, and for this we again draw a connection to MT-GPs. If $f \sim \mathcal{GP}(0, \Gamma)$ and $\varepsilon_t \sim \mathcal{N}(0, \eta I_n)$ i.i.d., then the mutual information between $f$ and the outputs $Y_t$ is exactly $\frac{1}{2}\log\det(I_{nt} + \eta^{-1} G_t)$, and it is a measure of the reduction in uncertainty or, equivalently, the information gain about $f$. Note that while we use GPs to describe the uncertainty in estimating the unknown function $f$, the bound is frequentist and does not need any Bayesian assumption about $f$. Similar to the single-task setting (Durand et al., 2018), the bound is proved by deriving a new self-normalized concentration inequality for martingales in the $\ell_2$ space. We note here that Astudillo and Frazier (2019) consider the much simpler setting of noise-free outputs, and their bound can be re-derived as a special case of Theorem 1.
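As an illustration of how such a bound would be evaluated in practice, the sketch below (our own, hedged reconstruction; it assumes the $\sigma/\sqrt{\eta}$ form of the radius as written above) computes the confidence width from the observed block kernel matrix $G_t$ and the posterior kernel $\Gamma_t(x, x)$.

```python
import numpy as np

def confidence_width(norm_f, sigma, eta, delta, G_t, Gamma_t_xx):
    """Right-hand side of the concentration bound:
    (||f||_Gamma + (sigma/sqrt(eta)) * sqrt(2 log(1/delta) + log det(I + G_t/eta)))
      * ||Gamma_t(x, x)||^(1/2), with ||.|| the spectral norm."""
    nt = G_t.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(nt) + G_t / eta)   # log det(I + eta^{-1} G_t)
    radius = norm_f + (sigma / np.sqrt(eta)) * np.sqrt(2.0 * np.log(1.0 / delta) + logdet)
    return radius * np.sqrt(np.linalg.norm(Gamma_t_xx, 2))
```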
Remark 2
The multi-task kernel Γ can be seen as a scalar kernel, Γ( x, x (cid:48) ) ij = k (( x, i ) , ( x (cid:48) , j )) , i, j ∈ [ n ] , and G t as an nt × nt kernel matrix of k evaluated atpoints ( x s , i ) , s ∈ [ t ] , i ∈ [ n ] . In this case, one can use Chowdhury and Gopalan(2017, Theorem 2) to derive concentration bounds for each task f i separately andcombine them together to obtain a result similar to Theorem 1 but with a notable change– (cid:107) Γ t ( x, x ) (cid:107) being replaced by trace (Γ t ( x, x )) . Thus, in general, we prove a tighterconcentration inequality which eventually leads to a O ( √ n ) factor saving in the finalregret bound. Theorem 1 can even be generalized to the regime of infinite-task learning (Kadri et al., 2016; Brault et al.,2019), where the observations lie in a Hilbert space H , and thus can be of independent interest. The onlytechnical assumption that one will need is that the multi-task kernel Γ( x, x ) has a finite trace, which triviallyholds in the finite-task setting. .1 Regret bounds Theorem 1 allows for a principled way to tune the confidence radii (i.e., β t and ˜ β t ) of ouralgorithms and achieve low regret. We now present the regret bound of MT-KB, which,to the best of our knowledge, is the first frequentist regret guarantee for multi-task BOunder any general MT kernel. Theorem 2 (Cumulative regret of MT-KB)
Let $f \in \mathcal{H}_\Gamma(\mathcal{X})$ with $\|f\|_\Gamma \leq b$, and let $\|\Gamma(x, x)\| \leq \kappa$ for all $x \in \mathcal{X}$. Let the scalarization function $s_\lambda$ be $L_\lambda$-Lipschitz with $L_\lambda \leq L$ for all $\lambda \in \Lambda$, and let the noise vectors $\{\varepsilon_t\}_{t \geq 1}$ be $\sigma$-sub-Gaussian. Then, for any $\eta > 0$ and $\delta \in (0, 1]$, MT-KB with
\[ \beta_t = b + \frac{\sigma}{\sqrt{\eta}} \sqrt{2 \log(1/\delta) + \sum_{s=1}^{t} \log\det\big(I_n + \eta^{-1} \Gamma_{s-1}(x_s, x_s)\big)} \]
enjoys, with probability at least $1 - \delta$, the regret bound
\[ R_C^{\text{MT-KB}}(T) \;\leq\; L \Big( b + \frac{\sigma}{\sqrt{\eta}} \sqrt{2 \log(1/\delta) + \gamma_{nT}(\Gamma, \eta)} \Big) \sqrt{(1 + \kappa/\eta)\, T \sum\nolimits_{t=1}^{T} \|\Gamma_t(x_t, x_t)\|}, \]
where $\gamma_{nT}(\Gamma, \eta) := \max_{\mathcal{X}_T \subset \mathcal{X}} \log\det\big(I_{nT} + \eta^{-1} G_T\big)$ denotes the maximum information gain.

Theorem 2, along with the upper bound $\sum_{t=1}^{T} \|\Gamma_t(x_t, x_t)\| \leq \eta\, \gamma_{nT}(\Gamma, \eta)$, yields the more compact regret bound $\tilde{O}\big(b \sqrt{T \gamma_{nT}(\Gamma, \eta)} + \gamma_{nT}(\Gamma, \eta) \sqrt{T}\big)$. We note here that the bound for the single-task case (Chowdhury and Gopalan, 2017) can be recovered by setting $n = 1$. Furthermore, since the single-task bound is known to be tight up to a poly-logarithmic factor (Scarlett et al., 2017), we believe our bound is also tight in terms of its dependence on $T$. Now, we instantiate Theorem 2 for the special case of separable kernels, to point out the novel insights and improvements that our analysis unearths compared to existing work.

Lemma 1 (Inter-task structure in regret bound)
Let $B$ be an $n \times n$ p.s.d. matrix and $\Gamma(x, x') = k(x, x')\,B$. Let $\|\Gamma(x, x)\| \leq \kappa$ and $k(x, x) = 1$ for all $x \in \mathcal{X}$. Then the following holds:
\[ \gamma_{nT}(\Gamma, \eta) \;\leq \sum_{i \in [n]:\, \xi_i > 0} \gamma_T(k, \eta/\xi_i), \qquad \sum_{t=1}^{T} \|\Gamma_t(x_t, x_t)\| \;\leq\; \eta \max\{\kappa, 1\}\, \gamma_T(k, \eta), \]
where $\xi_1, \ldots, \xi_n$ are the eigenvalues of $B$ and $\gamma_T(k, \alpha) := \max_{\mathcal{X}_T \subset \mathcal{X}} \log\det\big(I_T + \alpha^{-1} K_T\big)$, $\alpha > 0$, is the maximum information gain associated with the scalar kernel $k$.

Lemma 1, along with Theorem 2, leads to a regret bound that explicitly encodes the amount of similarity between tasks in terms of the spectral properties of $B$. For example, consider the case $B = \omega I_n + (1 - \omega)\mathbf{1}_n \mathbf{1}_n^\top / n$, $\omega \in [0, 1]$, which has one eigenvalue equal to $1$ and all others equal to $\omega$. In this case, we obtain $\gamma_{nT}(\Gamma, \eta) \leq \gamma_T(k, \eta) + (n - 1)\, \gamma_T(k, \eta/\omega)$. Now $\gamma_T(k, \eta/\omega)$ is an increasing function of $\omega$, and in fact $\gamma_T(k, \eta/\omega) = 0$ when $\omega = 0$. Hence, a low value of $\omega$, i.e., a high amount of similarity between tasks, yields a low cumulative regret, and vice-versa. (A numerical example is shown in Figure 1 (a) using the squared exponential kernel as $k$.) Moreover, for the two extreme cases of $\omega = 0$ (all tasks identical) and $\omega = 1$ (all tasks unrelated), the regret bounds are $\tilde{O}(\gamma_T(k, \eta) \sqrt{T})$ and $\tilde{O}(\gamma_T(k, \eta) \sqrt{nT})$, respectively. The bounds clearly assert that similar objectives can be learnt much faster together than by learning them separately. To the best of our knowledge, this intuitive but important observation is not captured by any of the existing regret analyses (Zuluaga et al., 2013; Paria et al., 2019; Belakaria et al., 2019).

Remark 3
Existing works model each task independently by means of a diagonal multi-task kernel $\Gamma(x, x') = \operatorname{diag}(k_1(x, x'), \ldots, k_n(x, x'))$ and prove regret bounds for this special setting. In contrast, Theorem 2 is applicable to any general multi-task kernel, and in the special case of a diagonal kernel it yields, along with Lemma 1, a regret bound of $\tilde{O}(\max_{i} \gamma_T(k_i, \eta) \sqrt{nT})$. This bound, together with the discussion above, suggests that while MT-KB exploits similarities between tasks efficiently, its performance does not suffer when the tasks are unrelated. Another important point to note here is that we analyze the frequentist (worst-case) regret, which is a stronger notion of regret than the Bayesian one (defined as the expected cumulative regret under a prior distribution on $f$) considered in previous works (Paria et al., 2019; Belakaria et al., 2019).

We now present regret and complexity guarantees for MT-BKB, which, to the best of our knowledge, are the first of their kind for multi-task BO under kernel or GP approximation.
Theorem 3 (Analysis of MT-BKB)
For any η > , ε ∈ (0 , and δ ∈ (0 , , let ρ = (1 + ε ) / (1 − ε ) and q = 6 ρ log(4 T /δ ) /ε . Then, under the same hypothesis asTheorem 2, if we run MT-BKB with ˜ β t = b (cid:0) / √ − ε (cid:1) + σ √ η (cid:118)(cid:117)(cid:117)(cid:116) /δ ) + ρ t (cid:88) s =1 log det (cid:0) I n + η − ˜Γ s − ( x s , x s ) (cid:1) , then, with probability at least − δ , the following holds: R MT-BKB C ( T ) (cid:54) ρ / R MT-KB C ( T ) , ∀ t ∈ [ T ] , m t (cid:54) ρq (1 + κ/η ) (cid:88) ts =1 (cid:107) Γ s ( x s , x s ) (cid:107) . Theorem 3 shows that MT-BKB can achieve an order-wise similar regret scaling asMT-KB (up to a constant factor), but only at a fraction of the computational cost. To seethis, we again consider the kernel Γ( x, x (cid:48) ) = k ( x, x (cid:48) ) B . In this case, Theorem 3 andLemma 1 together imply that the dictionary size m t is ˜ O ( γ t ( k, η )) . Now γ t is itself13ounded for specific scalar kernels k , e.g., it is O (cid:0) (ln t ) d (cid:1) for the squared exponentialkernel (Srinivas et al., 2010), yielding m t to be ˜ O (1) . This leads to a near-linear (intime t ) per-step cost for MT-BKB compared to the cubic cost for MT-KB. Further, it isworth noting that MT-BKB can adapt to any desired accuracy level ε of the Nyströmapproximation. A low value of ε corresponds to high desired accuracy and MT-BKBadapts to it by inducing more and more points in the dictionary, yielding accurateembeddings and thus, in turn, low regret. Conversely, if one is willing to compromise onthe accuracy (given by a high value of ε ), then MT-BKB can greatly reduce the size ofthe dictionary, yielding a low time complexity. The analysis follows in the footsteps ofCalandriello et al. (2019), but is carefully generalized to consider multi-task kernels. Theregret bound is crucially achieved by showing that Γ t ( x, x ) /ρ (cid:22) ˜Γ t ( x, x ) (cid:22) ρ Γ t ( x, x ) ,i.e., MT-BKB’s variance estimates are always almost close to the exact ones ( A (cid:23) B denotes that the matrix A − B is p.s.d.). This not only helps us avoid variance starvationwhich is known to happen with classical sparse GP approximations (Wang et al., 2018),but also, allows us to set ˜ β t efficiently and in a data-adaptive way. In order to investigate the practical benefits offered by learning with multi-task ker-nels, we compare MT-KB and MT-BKB with single-task algorithms that enjoy regretguarantees under RKHS smoothness assumptions. Specifically, we consider GP-UCB(Chowdhury and Gopalan, 2017) and its Nyström approximation BKB (Calandrielloet al., 2019) as baselines, where each task is learnt independently and inter-task structureis not exploited. We call these baselines independent task kernelized bandits (IT-KB) and budgeted kernelized bandits (IT-BKB) , respectively. Whenever the objective is notexplicitly generated from an RKHS, we also compare with MOBO (Paria et al., 2019),which has better regret performance than other methods (Knowles, 2006; Ponweiseret al., 2008; Emmerich and Klinkenberg, 2008; Hernández-Lobato et al., 2016) thatmodel each task with an independent GP. In all simulations, we set η = 0 . , δ = 0 . and ε = 0 . , and use the Chebyshev scalarization. Similar to (Paria et al., 2019), wesample from P λ as λ = α/ (cid:107) α (cid:107) , where α i = (cid:107) u (cid:107) /u i , i ∈ [ n ] , and u is sampleduniformly from [0 , n . We compare the algorithms on the following MOO problemsand plot mean and standard deviation (over independent trials) of the time-averagecumulative regret T R C ( T ) in Fig. 
1: (b)-(f). (More details in appendix E.) RKHS function
We generate a vector-valued RKHS element as $f(\cdot) = \sum_{i} \Gamma(\cdot, x_i)\, c_i$, where the domain $\mathcal{X}$ is a fine uniform net of the interval $[0, 1]$, each $x_i \in \mathcal{X}$ and each $c_i$ is uniformly sampled from $[-1, 1]^n$. We consider the ICM kernel $\Gamma(x, x') = k(x, x')\,B$, adopting an SE kernel for its scalar part, and set $B = A^\top A$, where the elements of the $n \times n$ matrix $A$ are uniformly sampled from $[0, 1]$. We set $\kappa$ as the largest eigenvalue of $B$ and bound the RKHS norm of $f$ using $b = \max_x \|f(x)\| / \kappa$. The noise vectors are taken i.i.d. $\mathcal{N}(0, \sigma^2 I_n)$. We compare the algorithms for $n = 2$ and $n = 20$ tasks, and observe that learning with MT kernels is much faster than learning the tasks independently, even more so when the number of tasks is higher (Fig. 1: (b), (c)).
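For concreteness, here is a minimal sketch of how such a synthetic vector-valued RKHS element could be generated (our own illustration; the number of anchor points, the lengthscale and the noise level are placeholders, since the exact values are not recoverable from the text above).

```python
import numpy as np

def make_rkhs_function(k, B, anchors, coeffs):
    """f(.) = sum_i Gamma(., x_i) c_i with the ICM kernel Gamma(x, x') = k(x, x') B."""
    def f(x):
        return sum(k(x, xi) * (B @ ci) for xi, ci in zip(anchors, coeffs))
    return f

rng = np.random.default_rng(0)
n, n_anchors = 2, 5                                        # placeholder sizes
A = rng.uniform(0.0, 1.0, size=(n, n))
B = A.T @ A                                                # task-similarity matrix B = A^T A
domain = np.linspace(0.0, 1.0, 101)                        # a fine net of [0, 1]
se = lambda x, y: np.exp(-(x - y) ** 2 / (2 * 0.2 ** 2))   # SE kernel, illustrative lengthscale
anchors = rng.choice(domain, size=n_anchors, replace=False)
coeffs = rng.uniform(-1.0, 1.0, size=(n_anchors, n))
f = make_rkhs_function(se, B, anchors, coeffs)
y = f(0.37) + rng.normal(0.0, 0.1, size=n)                 # one noisy vector-valued observation
```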
Figure 1: (a) Regret performance of MT-KB under varying inter-task similarities; (b)-(f) comparison of average cumulative regret of MT-KB and MT-BKB with IT-KB, IT-BKB and MOBO on different MOO problems: (b) RKHS function (2 tasks), (c) RKHS function (20 tasks), (d) perturbed sine function, (e) shifted Branin-Hoo, (f) sensor measurements.
Perturbed sine function
We study a setting similar to (Baldassarre et al., 2012),where X is an . -net of the interval [0 , and we have n = 4 tasks. Each task is givenby a function f i ( x ) = sin(2 πx ) + 0 . f pert i ( x ) corrupted by Gaussian noise of variance . . Each perturbation function f pert i is a weighted sum of three Gaussians of width . centered at x = 0 . , x = 0 . and x = 0 . , where task-specific weights arecarefully chosen in order to yield tasks that are related by the common function, but alsohave local differences. We use the kernel Γ ω ( x, x (cid:48) ) = k ( x, x (cid:48) ) ( ωI n + (1 − ω )1 n /n ) that imposes a common similarity among all components and results are shown for ω = 0 . (Fig. 1: (d)). Shifted Branin-Hoo
The Branin-Hoo function, defined over a subset of $\mathbb{R}^2$, is a common benchmark for BO (Jones, 2001). We consider shifted Branin-Hoo functions as related tasks, where the $i$-th task is a translation of the function by $i\%$ along either axis, and run the algorithms with the kernel $\Gamma_\omega$ for a fixed $\omega$ (Fig. 1: (e)).

Sensor measurements
We take temperature, light and humidity measurements from 54 sensors collected in the Intel Berkeley lab (Srinivas et al., 2010) in the context of MOO. We have three tasks, one for each variable, and each task $f_i(x)$ is given by the empirical mean of the readings recorded at the sensor placed at location $x$. We take the remaining readings to estimate an ICM kernel and run our algorithms with this kernel. Specifically, for its scalar part, we fit an SE kernel on sensor locations, and for its matrix part, we estimate inter-task similarities as $B = \frac{1}{m} R^\top K^{-1} R$, where $m$ denotes the number of readings, $R$ is an $m \times 3$ matrix of readings for all tasks and $K$ is the $m \times m$ Gram matrix of the SE kernel. The idea is to de-correlate $R$ with $K^{-1}$ first so that only the correlation with respect to $B$ is left (see the sketch after the concluding paragraph below). Further, we compute the empirical variance of the sensor readings for each task and take the largest of those as $\sigma^2$. We see that the regret performance of MT-KB and MT-BKB is much better than that of IT-KB, IT-BKB and MOBO, which do not use the inter-task structure in the form of the matrix $B$ (Fig. 1: (f)).

To the best of our knowledge, we prove the first rigorous regret bounds for multi-task Bayesian optimization that capture inter-task dependencies. We have demonstrated the shortcoming of modelling each task independently without making use of task similarities, and developed algorithms using multi-task kernels, which perform well in practice. We believe that our regret bounds are tight in terms of dependence on the time horizon. However, whether the dependence on the inter-task structure is optimal or not remains an important open question. It would also be interesting to see whether our multi-task concentration can be applied to several other interesting settings, for example optimizing under heavy-tailed corruptions (Chowdhury and Gopalan, 2019a), with a batch of inputs (Desautels et al., 2014), learning with kernel mean embeddings (Chowdhury et al., 2020), and modelling the transition structure of a Markov decision process (Chowdhury and Gopalan, 2019b), to name a few.
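A sketch of the coregionalization-matrix estimate used in the sensor experiment above (our reading of the formula; in particular the $1/m$ normalization is an assumption, and the commented-out helper for fitting the SE kernel on sensor locations is hypothetical).

```python
import numpy as np

def estimate_coregionalization(R, K):
    """Estimate the task-similarity matrix of the ICM kernel as B = (1/m) R^T K^{-1} R,
    where R is an (m x n) matrix of readings for the n tasks and K is the (m x m)
    Gram matrix of the scalar kernel; K^{-1} de-correlates the rows of R first."""
    m = R.shape[0]
    return R.T @ np.linalg.solve(K, R) / m

# Hypothetical usage: R holds temperature/light/humidity readings, and K would come
# from an SE kernel fit on the sensor locations (fit_se_gram is not defined here).
# K = fit_se_gram(sensor_locations)
# B = estimate_coregionalization(R, K)
```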
References
Ahmed Alaoui and Michael W Mahoney. Fast randomized kernel ridge regression withstatistical guarantees. In
Advances in Neural Information Processing Systems , pages775–783, 2015.Mauricio A Alvarez, Lorenzo Rosasco, and Neil D Lawrence. Kernels for vector-valuedfunctions: A review. arXiv preprint arXiv:1106.6251 , 2011.Raul Astudillo and Peter Frazier. Bayesian optimization of composite functions. In
International Conference on Machine Learning , pages 354–363, 2019.Luca Baldassarre, Lorenzo Rosasco, Annalisa Barla, and Alessandro Verri. Multi-outputlearning via spectral filtering.
Machine Learning, 87(3):259–301, 2012.
Syrine Belakaria, Aryan Deshwal, and Janardhan Rao Doppa. Max-value entropy search for multi-objective Bayesian optimization. In
Advances in Neural InformationProcessing Systems , pages 7823–7833, 2019.Edwin V Bonilla, Kian M Chai, and Christopher Williams. Multi-task gaussian processprediction. In
Advances in neural information processing systems , pages 153–160,2008.Romain Brault, Alex Lambert, Zoltan Szabo, Maxime Sangnier, and Florence d’AlcheBuc. Infinite task learning in rkhss. In
The 22nd International Conference on ArtificialIntelligence and Statistics , pages 1294–1302, 2019.Eric Brochu, Vlad M Cora, and Nando De Freitas. A tutorial on bayesian optimizationof expensive cost functions, with application to active user modeling and hierarchicalreinforcement learning. arXiv preprint arXiv:1012.2599 , 2010.Daniele Calandriello, Luigi Carratino, Alessandro Lazaric, Michal Valko, and LorenzoRosasco. Gaussian process optimization with adaptive sketching: Scalable and noregret.
In Conference on Learning Theory , 2019.Andrea Caponnetto, Charles A Micchelli, Massimiliano Pontil, and Yiming Ying.Universal multi-task kernels.
Journal of Machine Learning Research , 9(Jul):1615–1646, 2008.Claudio Carmeli, Ernesto De Vito, Alessandro Toigo, and Veronica Umanitá. Vectorvalued reproducing kernel hilbert spaces and universality.
Analysis and Applications ,8(01):19–61, 2010.Rich Caruana. Multitask learning.
Machine learning , 28(1):41–75, 1997.S. R. Chowdhury, Rafael dos Santos de Oliveira, and F. Ramos. Active learning ofconditional mean embeddings via bayesian optimisation. 2020.Sayak Ray Chowdhury and Aditya Gopalan. On kernelized multi-armed bandits. In
Proceedings of the 34th International Conference on Machine Learning-Volume 70 ,pages 844–853. JMLR. org, 2017.Sayak Ray Chowdhury and Aditya Gopalan. Bayesian optimization under heavy-tailedpayoffs. In
Advances in Neural Information Processing Systems , pages 13790–13801,2019a.Sayak Ray Chowdhury and Aditya Gopalan. Online learning in kernelized markovdecision processes. In
The 22nd International Conference on Artificial Intelligenceand Statistics , pages 3197–3205, 2019b.Thomas Desautels, Andreas Krause, and Joel W Burdick. Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization.
Journal of Machine Learning Research, 15:3873–3923, 2014.
Petros Drineas and Michael W Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153–2175, 2005.
Madalina M Drugan and Ann Nowe. Designing multi-objective multi-armed bandits algorithms: A study. In
The 2013 International Joint Conference on Neural Networks(IJCNN) , pages 1–8. IEEE, 2013.Madalina M Drugan and Ann Nowé. Scalarization based pareto optimal set of armsidentification algorithms. In , pages 2690–2697. IEEE, 2014.Audrey Durand, Odalric-Ambrym Maillard, and Joelle Pineau. Streaming kernelregression with provably adaptive mean, variance, and regularization.
The Journal ofMachine Learning Research , 19(1):650–683, 2018.Michael Emmerich and Jan-willem Klinkenberg. The computation of the expectedimprovement in dominated hypervolume of pareto front approximations.
Rapporttechnique, Leiden University , 34:7–3, 2008.Theodoros Evgeniou, Charles A Micchelli, and Massimiliano Pontil. Learning multipletasks with kernel methods.
Journal of machine learning research , 6(Apr):615–637,2005.R. Garnett, M. A. Osborne, and S. J. Roberts. Bayesian optimization for sensor setselection. In
Proceedings of the 9th ACM/IEEE International Conference on Informa-tion Processing in Sensor Networks , IPSN ’10, pages 209–219, New York, NY, USA,2010. ACM.Javier Gonzalez, Joseph Longworth, David C James, and Neil D Lawrence. Bayesianoptimization for synthetic gene design. arXiv preprint arXiv:1505.01627 , 2015.William H Greene.
Econometric analysis . Pearson Education India, 2003.Steffen Grünewälder, Guy Lever, Luca Baldassarre, Sam Patterson, Arthur Gretton, andMassimilano Pontil. Conditional mean embeddings as regressors. In
Proceedings ofthe 29th International Coference on International Conference on Machine Learning ,pages 1803–1810, 2012.Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q Weinberger. On calibration of modernneural networks. In
Proceedings of the 34th International Conference on MachineLearning-Volume 70 , pages 1321–1330. JMLR. org, 2017.Daniel Hernández-Lobato, Jose Hernandez-Lobato, Amar Shah, and Ryan Adams.Predictive entropy search for multi-objective bayesian optimization. In
InternationalConference on Machine Learning , pages 1492–1501, 2016.Donald R Jones. A taxonomy of global optimization methods based on response surfaces.
Journal of Global Optimization, 21(4):345–383, 2001.
Hachem Kadri, Emmanuel Duflos, Philippe Preux, Stéphane Canu, Alain Rakotomamonjy, and Julien Audiffren. Operator-valued kernels for learning from functional response data.
The Journal of Machine Learning Research , 17(1):613–666, 2016.Joshua Knowles. Parego: a hybrid algorithm with on-line landscape approximation forexpensive multiobjective optimization problems.
IEEE Transactions on EvolutionaryComputation , 10(1):50–66, 2006.Haitao Liu, Jianfei Cai, and Yew-Soon Ong. Remarks on multi-output gaussian processregression.
Knowledge-Based Systems , 144:102–121, 2018.Charles A Micchelli and Massimiliano Pontil. On learning vector-valued functions.
Neural computation , 17(1):177–204, 2005.Hirotaka Nakayama, Yeboon Yun, and Min Yoon.
Sequential approximate multiobjectiveoptimization using computational intelligence . Springer Science & Business Media,2009.Biswajit Paria, Kirthevasan Kandasamy, and B. Póczos. A flexible framework formulti-objective bayesian optimization using random scalarizations. In
UAI , 2019.Victor Picheny. Multiobjective optimization using gaussian process emulators viastepwise uncertainty reduction.
Statistics and Computing , 25(6):1265–1280, 2015.Wolfgang Ponweiser, Tobias Wagner, Dirk Biermann, and Markus Vincze. Multiobjec-tive optimization on a limited budget of evaluations using model-assisted S -metricselection. In International Conference on Parallel Problem Solving from Nature ,pages 784–794. Springer, 2008.Carl Edward Rasmussen. Gaussian processes in machine learning. In
Summer Schoolon Machine Learning , pages 63–71. Springer, 2003.Ryan Rifkin, Sayan Mukherjee, Pablo Tamayo, Sridhar Ramaswamy, Chen-HsiangYeang, Michael Angelo, Michael Reich, Tomaso Poggio, Eric S Lander, Todd RGolub, et al. An analytical method for multiclass molecular cancer classification.
Siam Review , 45(4):706–723, 2003.Diederik M Roijers, Peter Vamplew, Shimon Whiteson, and Richard Dazeley. A surveyof multi-objective sequential decision-making.
Journal of Artificial IntelligenceResearch , 48:67–113, 2013.Jonathan Scarlett, Ilija Bogunovic, and Volkan Cevher. Lower bounds on regret fornoisy gaussian process bandit optimization. In
Conference on Learning Theory , pages1723–1742, 2017.Bobak Shahriari, Kevin Swersky, Ziyu Wang, Ryan P Adams, and Nando De Freitas.Taking the human out of the loop: A review of bayesian optimization.
Proceedings of the IEEE, 104(1):148–175, 2015.
Jasper Snoek, Hugo Larochelle, and Ryan P Adams. Practical Bayesian optimization of machine learning algorithms. In
Advances in neural information processing systems ,pages 2951–2959, 2012.Niranjan Srinivas, Andreas Krause, Sham Kakade, and Matthias Seeger. Gaussianprocess optimization in the bandit setting: no regret and experimental design. In
Proceedings of the 27th International Conference on International Conference onMachine Learning , pages 1015–1022. Omnipress, 2010.Kevin Swersky, Jasper Snoek, and Ryan P Adams. Multi-task bayesian optimization. In
Advances in neural information processing systems , pages 2004–2012, 2013.Hans Wackernagel.
Multivariate geostatistics: an introduction with applications .Springer Science & Business Media, 2013.Zi Wang, Clement Gehring, Pushmeet Kohli, and Stefanie Jegelka. Batched large-scalebayesian optimization in high-dimensional spaces. In
International Conference onArtificial Intelligence and Statistics , pages 745–754, 2018.Indre Zliobaite. On the relation between accuracy and fairness in binary classification. arXiv preprint arXiv:1505.05723 , 2015.Marcela Zuluaga, Guillaume Sergent, Andreas Krause, and Markus Püschel. Activelearning for multi-objective optimization. In
International Conference on Machine Learning, pages 462–470, 2013.

Appendix
A Computational complexity under ICM kernels
In this section, we describe the time complexities of MT-KB and MT-BKB for theintrinsic coregionalization model (ICM) Γ( x, x (cid:48) ) = k ( x, x (cid:48) ) B . As discussed earlier, weassume that an efficient oracle to optimize the acquisition function is provided to us, andthe per step cost comes only from computing it. To this end, we first describe simplifiedmodel updates under ICM kernel using the eigen-system of B and then detail out thetime required for computing the updates. We note here that the eigen decomposition,which is O ( n ) , needs to be computed only once at the beginning and can be used atevery step of the algorithms. Per-step complexity of MT-KB
Let B = (cid:80) ni =1 ξ i u i u (cid:62) i denotes the eigen decompo-sition of the positive semi-definite matrix B . Then, Γ( x, x ) = (cid:80) ni =1 ξ i k ( x, x ) u i u (cid:62) i .From the definition of the Kronecker product, we now have G t = (cid:80) ni =1 ξ i K t ⊗ u i u (cid:62) i and G t ( x ) = (cid:80) ni =1 ξ i k t ( x ) ⊗ u i u (cid:62) i , where K t = [ k ( x i , x j )] ti,j =1 and k t ( x ) =[ k ( x , x ) , . . . , k ( x t , x )] (cid:62) . Since { u i } ni =1 yields an orthonormal basis of R n , the output y t ∈ R n can be written as y t = (cid:80) ni =1 y (cid:62) t u i · u i . We then have Y t = (cid:80) ni =1 Y it ⊗ u i ,where Y it = (cid:2) y (cid:62) u i , . . . , y (cid:62) t u i (cid:3) (cid:62) . We also note that I nt = (cid:80) ni =1 I t ⊗ u i u (cid:62) i , and, there-fore G t + ηI nt = (cid:80) ni =1 ( ξ i K t + ηI t ) ⊗ u i u (cid:62) i . Now, let K t = (cid:80) tj =1 α j w j w (cid:62) j denotesthe eigen decomposition of the (positive semi-definite) kernel matrix K t . We then have G t + ηI nt = n (cid:88) i =1 t (cid:88) j =1 ( ξ i α j + η ) w j w (cid:62) j ⊗ u i u (cid:62) i = n (cid:88) i =1 t (cid:88) j =1 ( ξ i α j + η )( w j ⊗ u i )( w j ⊗ u i ) (cid:62) . (3)By the properties of tensor product ( w j ⊗ u i ) (cid:62) ( w j (cid:48) ⊗ u i (cid:48) ) = ( w (cid:62) j w j (cid:48) ) · ( u (cid:62) i u i (cid:48) ) , whichis equal to if i = i (cid:48) , j = j (cid:48) , and is equal to otherwise. Therefore, (3) denotes theeigen decomposition of G t + ηI nt . Hence ( G t + ηI nt ) − = n (cid:88) i =1 t (cid:88) j =1 ξ i α j + η w j w (cid:62) j ⊗ u i u (cid:62) i = n (cid:88) i =1 ( ξ i K t + ηI t ) − ⊗ u i u (cid:62) i . (4)By the orthonormality of { u i } ni =1 and the mixed product property of Kronecker product,we now obtain ( G t + ηI nt ) − Y t = (cid:80) ni =1 ( ξ i K t + ηI t ) − Y it ⊗ u i , and thus, in turn, µ t ( x ) = G t ( x ) (cid:62) ( G t + ηI nt ) − Y t = n (cid:88) i =1 ξ i k t ( x ) (cid:62) ( ξ i K t + ηI t ) − Y it · u i . (5)Similarly, we get G t ( x ) (cid:62) ( G t + ηI nt ) − G t ( x ) = (cid:80) ni =1 ξ i k t ( x ) (cid:62) ( ξ i K t + ηI t ) − k t ( x ) · u i u (cid:62) i and therefore, (cid:107) Γ t ( x, x ) (cid:107) = max (cid:54) i (cid:54) n ξ i (cid:0) k ( x, x ) − ξ i k t ( x ) (cid:62) ( ξ i K t + ηI t ) − k t ( x ) (cid:1) . (6)21et us now discuss the time required to compute µ t ( x ) and (cid:107) Γ t ( x, x ) (cid:107) . Given the eigendecomposition, updating { Y it } ni =1 re-using those already computed at the previous steprequires projecting the current output y t onto all coordinates, and thus, takes O ( n ) time. Now, since the kernel matrix K t is rescaled by the eigenvalues ξ i , we can find theeigen decomposition of K t once and reuse those to compute { ( ξ i K t + ηI t ) − } ni =1 in O ( t ) time. Next, computing n matrix-vector multiplications and vector inner productsof the form k t ( x ) (cid:62) ( ξ i K t + ηI t ) − k t ( x ) and k t ( x ) (cid:62) ( ξ i K t + ηI t ) − Y it take O ( nt ) time. Finally, the sum in (5) and the max in (6) can be computed in O ( n ) and O ( n ) time, respectively. Therefore, the overall cost to compute µ t ( x ) and (cid:107) Γ t ( x, x ) (cid:107) are O (cid:0) n + nt + t (cid:1) = O (cid:0) n + t ( n + t ) (cid:1) . Per-step complexity of MT-BKB
Per-step complexity of MT-BKB

Let $\tilde\varphi_t(x) = \big(\tilde K_t^{1/2}\big)^{+} \tilde k_t(x) \in \mathbb{R}^{m_t}$ denote the Nyström embedding of the scalar kernel $k$, where $\tilde k_t(x) = \big[ p_{t,i_1}^{-1/2} k(x_{i_1},x), \ldots, p_{t,i_{m_t}}^{-1/2} k(x_{i_{m_t}},x) \big]^\top$ and $\tilde K_t = \big[ (p_{t,i_u} p_{t,i_v})^{-1/2} k(x_{i_u},x_{i_v}) \big]_{u,v=1}^{m_t}$. Then the eigen decomposition $B = \sum_{i=1}^n \xi_i u_i u_i^\top$ yields $\tilde G_t = \sum_{i=1}^n \xi_i \tilde K_t \otimes u_i u_i^\top$ and $\tilde G_t(x) = \sum_{i=1}^n \xi_i \tilde k_t(x) \otimes u_i u_i^\top$. A similar argument as in (3) and (4) now implies $\big(\tilde G_t^{1/2}\big)^{+} = \sum_{i=1}^n \frac{1}{\sqrt{\xi_i}} \big(\tilde K_t^{1/2}\big)^{+} \otimes u_i u_i^\top$. Therefore, the Nyström embedding for the multi-task kernel $\Gamma$ can be computed from the embedding for the scalar kernel $k$ as
\[
\tilde\Phi_t(x) = \big(\tilde G_t^{1/2}\big)^{+} \tilde G_t(x) = \sum_{i=1}^n \sqrt{\xi_i}\, \big(\tilde K_t^{1/2}\big)^{+} \tilde k_t(x) \otimes u_i u_i^\top = \sum_{i=1}^n \sqrt{\xi_i}\, \tilde\varphi_t(x) \otimes u_i u_i^\top.
\]
We now have
\[
\tilde V_t = \sum_{s=1}^t \tilde\Phi_t(x_s) \tilde\Phi_t(x_s)^\top = \sum_{s=1}^t \sum_{i=1}^n \xi_i\, \tilde\varphi_t(x_s)\tilde\varphi_t(x_s)^\top \otimes u_i u_i^\top = \sum_{i=1}^n \xi_i\, \tilde v_t \otimes u_i u_i^\top,
\]
where $\tilde v_t = \sum_{s=1}^t \tilde\varphi_t(x_s)\tilde\varphi_t(x_s)^\top$. A similar argument as in (3) and (4) then implies
\[
(\tilde V_t + \eta I_{nm_t})^{-1} = \sum_{i=1}^n (\xi_i \tilde v_t + \eta I_{m_t})^{-1} \otimes u_i u_i^\top.
\]
We further have
\[
\sum_{s=1}^t \tilde\Phi_t(x_s) y_s = \sum_{s=1}^t \sum_{i=1}^n \sqrt{\xi_i}\, \cdot y_s^\top u_i \cdot \tilde\varphi_t(x_s) \otimes u_i = \sum_{i=1}^n \sqrt{\xi_i} \bigg( \sum_{s=1}^t y_s^\top u_i \cdot \tilde\varphi_t(x_s) \bigg) \otimes u_i.
\]
Similar to (5), we therefore obtain
\[
\tilde\mu_t(x) = \sum_{i=1}^n \xi_i\, \tilde\varphi_t(x)^\top (\xi_i \tilde v_t + \eta I_{m_t})^{-1} \bigg( \sum_{s=1}^t y_s^\top u_i \cdot \tilde\varphi_t(x_s) \bigg) \cdot u_i. \tag{7}
\]
We now note that $\tilde\Phi_t(x)^\top \tilde\Phi_t(x) = \sum_{i=1}^n \xi_i\, \tilde\varphi_t(x)^\top \tilde\varphi_t(x) \cdot u_i u_i^\top$. Similar to (6), we then obtain
\[
\big\|\tilde\Gamma_t(x,x)\big\| = \max_{1 \le i \le n} \xi_i \Big( k(x,x) - \tilde\varphi_t(x)^\top \tilde\varphi_t(x) + \eta\, \tilde\varphi_t(x)^\top (\xi_i \tilde v_t + \eta I_{m_t})^{-1} \tilde\varphi_t(x) \Big). \tag{8}
\]
We now discuss the time required to compute the scalar kernel embedding $\tilde\varphi_t(x)$. Sampling the dictionary $\mathcal{D}_t$, as we reuse the variances from the previous round, takes $O(t)$ time. We then compute the embedding $\tilde\varphi_t(x)$ in $O(m_t^3 + m_t^2)$ time, which corresponds to an inversion of $\tilde K_t^{1/2}$ and a matrix-vector product of dimension $m_t$, the size of the dictionary. Given the embedding function, let us now find the time required to compute $\tilde\mu_t(x)$ and $\|\tilde\Gamma_t(x,x)\|$. We first construct the matrix $\tilde v_t$ from scratch using all the points selected so far, which takes $O(m_t^2 t)$ time. Then the inverses $\{(\xi_i \tilde v_t + \eta I_{m_t})^{-1}\}_{i=1}^n$ can be computed in $O(m_t^3)$ time and the matrix-vector multiplications $\{(\xi_i \tilde v_t + \eta I_{m_t})^{-1} \tilde\varphi_t(x)\}_{i=1}^n$ in $O(nm_t^2)$ time. Similar to MT-KB, projecting the current output onto every direction takes $O(n^2)$ time. The projections can then be used to compute the $n$ vectors of the form $\sum_{s=1}^t y_s^\top u_i \cdot \tilde\varphi_t(x_s)$ in $O(nm_t t)$ time. Finally, the $n$ vector inner products of dimension $m_t$ can be computed in $O(nm_t)$ time. Therefore, the overall cost to compute (7) and (8) is $O(n^2 + nm_t t + nm_t^2 + m_t^3 + m_t^2 t) = O\big(n^2 + m_t t (n + t)\big)$, since the dictionary size $m_t \le t$.
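As a concrete illustration of the embedding step above (again a minimal sketch under hypothetical names, assuming the $p_{t,i_j}^{-1/2}$ reweighting of the dictionary points), the $m_t$-dimensional scalar embedding $\tilde\varphi_t(x)$ can be computed as follows; (7) and (8) then involve only $m_t$-dimensional quantities.

```python
import numpy as np

def nystrom_embedding(X_dict, p_dict, k, x, tol=1e-12):
    """Minimal sketch (our own illustration, hypothetical names) of the
    scalar-kernel Nystrom embedding used inside MT-BKB.
    X_dict : (m_t, d) dictionary points
    p_dict : (m_t,) inclusion probabilities of those points
    k      : scalar kernel callable returning a Gram matrix
    x      : (d,) query point
    """
    w = 1.0 / np.sqrt(p_dict)                               # importance weights p_{t,i_j}^{-1/2}
    K_dd = (w[:, None] * k(X_dict, X_dict)) * w[None, :]    # weighted dictionary Gram matrix \tilde K_t
    k_d = w * k(X_dict, x[None, :]).ravel()                 # weighted kernel vector \tilde k_t(x)
    lam, V = np.linalg.eigh(K_dd)                           # eigen decomposition for (\tilde K_t^{1/2})^+
    inv_sqrt = np.where(lam > tol, 1.0 / np.sqrt(np.clip(lam, tol, None)), 0.0)
    return V @ (inv_sqrt * (V.T @ k_d))                     # \tilde varphi_t(x) in R^{m_t}
```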
B Multi-task concentration

We first introduce some notation. For any two Hilbert spaces $\mathcal{G}$ and $\mathcal{H}$ with respective inner products $\langle \cdot,\cdot \rangle_{\mathcal{G}}$ and $\langle \cdot,\cdot \rangle_{\mathcal{H}}$, we denote by $\mathcal{L}(\mathcal{G},\mathcal{H})$ the space of all bounded linear operators from $\mathcal{G}$ to $\mathcal{H}$, with the operator norm $\|A\| := \sup_{\|g\|_{\mathcal{G}} \le 1} \|Ag\|_{\mathcal{H}}$. We also denote, for any $A \in \mathcal{L}(\mathcal{G},\mathcal{H})$, by $A^\top$ its adjoint, which is the unique operator such that $\langle A^\top h, g \rangle_{\mathcal{G}} = \langle h, Ag \rangle_{\mathcal{H}}$ for all $g \in \mathcal{G}$, $h \in \mathcal{H}$. In the case $\mathcal{G} = \mathcal{H}$, we write $\mathcal{L}(\mathcal{H}) = \mathcal{L}(\mathcal{H},\mathcal{H})$. We now review the following lemma (Rasmussen, 2003) about operators, which we will use several times.

Lemma 2 (Operator identities)
Let $A \in \mathcal{L}(\mathcal{G},\mathcal{H})$. Then, for any $\eta > 0$, the following hold:
\[
(A^\top A + \eta I)^{-1} A^\top = A^\top (A A^\top + \eta I)^{-1}, \qquad
I - A^\top (A A^\top + \eta I)^{-1} A = \eta\, (A^\top A + \eta I)^{-1}.
\]
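For intuition, both identities can be checked numerically in finite dimensions; the snippet below is our own sanity check (not part of the paper), with a random matrix standing in for the operator $A$.

```python
import numpy as np

# Finite-dimensional sanity check of the two operator identities in Lemma 2,
# with a random 5x3 matrix A standing in for an operator in L(G, H).
rng = np.random.default_rng(0)
A, eta = rng.normal(size=(5, 3)), 0.7
I3, I5 = np.eye(3), np.eye(5)

lhs1 = np.linalg.solve(A.T @ A + eta * I3, A.T)          # (A^T A + eta I)^{-1} A^T
rhs1 = A.T @ np.linalg.inv(A @ A.T + eta * I5)           # A^T (A A^T + eta I)^{-1}
assert np.allclose(lhs1, rhs1)

lhs2 = I3 - A.T @ np.linalg.inv(A @ A.T + eta * I5) @ A  # I - A^T (A A^T + eta I)^{-1} A
rhs2 = eta * np.linalg.inv(A.T @ A + eta * I3)           # eta (A^T A + eta I)^{-1}
assert np.allclose(lhs2, rhs2)
```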
We now present the main result of this appendix, which is stated and proved using the feature map of the multi-task kernel.

Feature map of multi-task kernel
We assume the multi-task kernel $\Gamma$ to be continuous relative to the operator norm on $\mathcal{L}(\mathbb{R}^n)$, the space of bounded linear operators from $\mathbb{R}^n$ to itself. Then the RKHS $\mathcal{H}_\Gamma(\mathcal{X})$ associated with the kernel $\Gamma$ is a subspace of the space of continuous functions from $\mathcal{X}$ to $\mathbb{R}^n$, and hence $\Gamma$ is a Mercer kernel (Carmeli et al., 2010). Let $\mu$ be a probability measure on the (compact) set $\mathcal{X}$. Since $\Gamma$ is a Mercer kernel on $\mathcal{X}$ and $\sup_{x \in \mathcal{X}} \|\Gamma(x,x)\| < \infty$, the RKHS $\mathcal{H}_\Gamma(\mathcal{X})$ is a subspace of $L^2(\mathcal{X},\mu;\mathbb{R}^n)$, the Banach space of measurable functions $g : \mathcal{X} \to \mathbb{R}^n$ such that $\int_{\mathcal{X}} \|g(x)\|^2\, d\mu(x) < \infty$, with norm $\|g\|_{L^2} = \big( \int_{\mathcal{X}} \|g(x)\|^2\, d\mu(x) \big)^{1/2}$. Since $\Gamma(x,x) \in \mathcal{L}(\mathbb{R}^n)$ is a compact operator (an operator $A \in \mathcal{L}(\mathcal{H})$ is said to be compact if the image of each bounded set under $A$ is relatively compact), by the Mercer theorem for multi-task kernels (Carmeli et al., 2010) there exists an at most countable sequence $\{(\psi_i,\nu_i)\}_{i \in \mathbb{N}}$ such that
\[
\Gamma(x,x') = \sum_{i=1}^{\infty} \nu_i\, \psi_i(x) \psi_i(x')^\top
\quad \text{and} \quad
\|g\|_\Gamma^2 = \sum_{i=1}^{\infty} \frac{\langle g, \psi_i \rangle_{L^2}^2}{\nu_i}, \quad g \in L^2(\mathcal{X},\mu;\mathbb{R}^n),
\]
where $\nu_i \ge 0$ for all $i$, $\lim_{i \to \infty} \nu_i = 0$, and $\{\psi_i : \mathcal{X} \to \mathbb{R}^n\}_{i \in \mathbb{N}}$ is an orthonormal basis of $L^2(\mathcal{X},\mu;\mathbb{R}^n)$. In particular, $g \in \mathcal{H}_\Gamma(\mathcal{X})$ if and only if $\|g\|_\Gamma < \infty$. Note that $\{\sqrt{\nu_i}\,\psi_i\}_{i \in \mathbb{N}}$ is an orthonormal basis of $\mathcal{H}_\Gamma(\mathcal{X})$. Then we can represent the objective function $f \in \mathcal{H}_\Gamma(\mathcal{X})$ as
\[
f = \sum_{i=1}^{\infty} \theta_i^\star \sqrt{\nu_i}\, \psi_i
\]
for some $\theta^\star := (\theta_1^\star, \theta_2^\star, \ldots) \in \ell^2$, the Hilbert space of square-summable sequences of real numbers, such that $\|f\|_\Gamma = \|\theta^\star\|_2 := \big( \sum_{i=1}^{\infty} |\theta_i^\star|^2 \big)^{1/2} < \infty$. We now define a feature map $\Phi : \mathcal{X} \to \mathcal{L}(\mathbb{R}^n, \ell^2)$ of the multi-task kernel $\Gamma$ by
\[
\Phi(x) y := \big( \sqrt{\nu_1}\, \psi_1(x)^\top y,\; \sqrt{\nu_2}\, \psi_2(x)^\top y,\; \ldots \big), \quad \forall\, x \in \mathcal{X},\ y \in \mathbb{R}^n.
\]
We then have $f(x) = \Phi(x)^\top \theta^\star$ and $\Gamma(x,x') = \Phi(x)^\top \Phi(x')$ for all $x, x' \in \mathcal{X}$.

Martingale control in $\ell^2$ space

Let us define $S_t = \sum_{s=1}^t \Phi(x_s) \varepsilon_s$, where $\varepsilon_1, \ldots, \varepsilon_t$ are the random noise vectors in $\mathbb{R}^n$. Now consider $\mathcal{F}_{t-1}$, the $\sigma$-algebra generated by the random variables $\{x_s, \varepsilon_s\}_{s=1}^{t-1}$ and $x_t$ (we ignore issues of measurability here). Observe that $S_t$ is $\mathcal{F}_t$-measurable and $\mathbb{E}\big[ S_t \,\big|\, \mathcal{F}_{t-1} \big] = S_{t-1}$. The process $\{S_t\}_{t \ge 1}$ is thus a martingale with values in the $\ell^2$ space. We now define a map $\Phi_{X_t} : \ell^2 \to \mathbb{R}^{nt}$ by
\[
\Phi_{X_t} \theta := \big[ (\Phi(x_1)^\top \theta)^\top, \ldots, (\Phi(x_t)^\top \theta)^\top \big]^\top, \quad \forall\, \theta \in \ell^2.
\]
We also let $V_t := \Phi_{X_t}^\top \Phi_{X_t}$ be a map from $\ell^2$ to itself and $I$ be the identity operator in $\ell^2$. In Lemma 3, we measure the deviation of $S_t$ by the norm weighted by $(V_t + \eta I)^{-1}$, which is itself derived from $S_t$. Lemma 3 is the multi-task generalization of the result of Durand et al. (2018), and we recover their result in the single-task setting ($n = 1$).

Lemma 3 (Self-normalized martingale control)  Let the noise vectors $\{\varepsilon_t\}_{t \ge 1}$ be $\sigma$-sub-Gaussian. Then, for any $\eta > 0$ and $\delta \in (0,1]$, with probability at least $1 - \delta$, the following holds uniformly over all $t \ge 1$:
\[
\|S_t\|_{(V_t + \eta I)^{-1}} \le \sigma \sqrt{2 \log(1/\delta) + \log\det\big( I + \eta^{-1} V_t \big)}.
\]

Proof
For any sequence of real numbers $\theta = (\theta_1, \theta_2, \ldots)$ such that $\big\| \sum_{i=1}^{\infty} \theta_i \sqrt{\nu_i}\,\psi_i(x) \big\| < \infty$, let us define $\Phi(x)^\top\theta := \sum_{i=1}^{\infty} \theta_i \sqrt{\nu_i}\,\psi_i(x)$ and
\[
M_t^\theta = \prod_{s=1}^t D_s^\theta, \qquad D_s^\theta = \exp\bigg( \frac{\varepsilon_s^\top \Phi(x_s)^\top\theta}{\sigma} - \frac{1}{2}\big\|\Phi(x_s)^\top\theta\big\|^2 \bigg).
\]
Since the noise vectors $\{\varepsilon_t\}_{t\ge 1}$ are conditionally $\sigma$-sub-Gaussian, i.e.,
\[
\forall\,\alpha\in\mathbb{R}^n,\ \forall\,t\ge 1,\quad \mathbb{E}\big[ \exp(\varepsilon_t^\top\alpha)\,\big|\,\mathcal{F}_{t-1} \big] \le \exp\big( \sigma^2\|\alpha\|^2/2 \big),
\]
we have $\mathbb{E}\big[ D_t^\theta \,\big|\,\mathcal{F}_{t-1}\big] \le 1$ and hence $\mathbb{E}\big[ M_t^\theta \,\big|\,\mathcal{F}_{t-1}\big] \le M_{t-1}^\theta$. Therefore, it is immediate that $\{M_t^\theta\}_{t=0}^\infty$ is a non-negative super-martingale and in fact satisfies $\mathbb{E}\big[M_t^\theta\big] \le 1$. Now, let $\tau$ be a stopping time with respect to the filtration $\{\mathcal{F}_t\}_{t=0}^\infty$. By the convergence theorem for non-negative super-martingales, $M_\infty^\theta = \lim_{t\to\infty} M_t^\theta$ is almost surely well-defined, and thus $M_\tau^\theta$ is well-defined as well, irrespective of whether $\tau < \infty$ or not. Let $Q_t^\theta = M_{\min\{\tau,t\}}^\theta$ be a stopped version of $\{M_t^\theta\}_t$. Then, by Fatou's lemma,
\[
\mathbb{E}\big[M_\tau^\theta\big] = \mathbb{E}\Big[ \liminf_{t\to\infty} Q_t^\theta \Big] \le \liminf_{t\to\infty} \mathbb{E}\big[Q_t^\theta\big] = \liminf_{t\to\infty} \mathbb{E}\Big[ M_{\min\{\tau,t\}}^\theta \Big] \le 1, \tag{9}
\]
since the stopped super-martingale $\big\{ M_{\min\{\tau,t\}}^\theta \big\}_{t\ge 0}$ is also a super-martingale. Let $\mathcal{F}_\infty$ be the $\sigma$-algebra generated by $\{\mathcal{F}_t\}_{t=0}^\infty$, and let $\Theta = (\Theta_1, \Theta_2, \ldots)$, $\Theta_i \sim \mathcal{N}(0, 1/\eta)$, be an infinite i.i.d. Gaussian sequence independent of $\mathcal{F}_\infty$. Since $\Gamma(x,x) \in \mathcal{L}(\mathbb{R}^n)$ has finite trace, we have
\[
\mathbb{E}\,\bigg\| \sum_{i=1}^{\infty} \Theta_i \sqrt{\nu_i}\,\psi_i(x) \bigg\|^2 = \frac{1}{\eta} \sum_{i=1}^{\infty} \nu_i \|\psi_i(x)\|^2 = \frac{1}{\eta}\,\mathrm{trace}\big(\Gamma(x,x)\big) < \infty.
\]
Therefore $\big\|\sum_{i=1}^{\infty}\Theta_i\sqrt{\nu_i}\,\psi_i(x)\big\| < \infty$ almost surely and thus $M_t^\Theta$ is well-defined. Now, thanks to the sub-Gaussian property, $\mathbb{E}\big[M_t^\Theta \,\big|\, \Theta\big] \le 1$ almost surely, and thus $\mathbb{E}\big[M_t^\Theta\big] \le 1$ for all $t$. Let $M_t := \mathbb{E}\big[ M_t^\Theta \,\big|\, \mathcal{F}_\infty \big]$ be a mixture of the non-negative super-martingales $M_t^\Theta$. Then $\{M_t\}_{t=0}^\infty$ is also a non-negative super-martingale adapted to the filtration $\{\mathcal{F}_t\}_{t=0}^\infty$. Hence, by a similar argument as in (9), $M_\tau$ is almost surely well-defined and $\mathbb{E}[M_\tau] = \mathbb{E}\big[M_\tau^\Theta\big] \le 1$. Let us now compute the mixture martingale $M_t$. We first note for any $\theta \in \ell^2$ that $M_t^\theta = \exp\big( \langle \theta, S_t\rangle/\sigma - \frac{1}{2}\|\theta\|_{V_t}^2 \big)$. The difficulty, however, lies in handling the possibly infinite dimension. To this end, we follow Durand et al. (2018) and consider the first $d$ dimensions for each $d \in \mathbb{N}$. Let $\Theta^d$ denote the restriction of $\Theta$ to the first $d$ components, so that $\Theta^d \sim \mathcal{N}(0, \frac{1}{\eta} I_d)$. Similarly, let $S_{t,d}$, $V_{t,d}$ and $M_{t,d}$ denote the corresponding restrictions of $S_t$, $V_t$ and $M_t$, respectively. Following the steps from Chowdhury and Gopalan (2017), we then obtain that
\[
M_{t,d} = \frac{\det(\eta I_d)^{1/2}}{(2\pi)^{d/2}} \int_{\mathbb{R}^d} \exp\Big( \langle \alpha, S_{t,d}\rangle/\sigma - \tfrac{1}{2}\|\alpha\|_{V_{t,d}}^2 \Big) \exp\Big( -\tfrac{\eta}{2}\|\alpha\|^2 \Big)\, d\alpha = \frac{1}{\det(I_d + \eta^{-1}V_{t,d})^{1/2}} \exp\Big( \tfrac{1}{2\sigma^2} \|S_{t,d}\|^2_{(V_{t,d}+\eta I_d)^{-1}} \Big).
\]
Note that $M_{\tau,d}$ is also almost surely well-defined and $\mathbb{E}[M_{\tau,d}] \le 1$ for all $d \in \mathbb{N}$. We now fix a $\delta \in (0,1]$.
An application of Markov's inequality and Fatou's lemma then yields
\[
\mathbb{P}\bigg[ \|S_\tau\|^2_{(V_\tau+\eta I)^{-1}} > 2\sigma^2 \log\Big( \frac{\det(I+\eta^{-1}V_\tau)^{1/2}}{\delta} \Big) \bigg] = \mathbb{P}\Bigg[ \frac{\exp\big( \frac{1}{2\sigma^2}\|S_\tau\|^2_{(V_\tau+\eta I)^{-1}} \big)}{\frac{1}{\delta}\det(I+\eta^{-1}V_\tau)^{1/2}} > 1 \Bigg] = \mathbb{P}\Bigg[ \lim_{d\to\infty} \frac{\exp\big( \frac{1}{2\sigma^2}\|S_{\tau,d}\|^2_{(V_{\tau,d}+\eta I_d)^{-1}} \big)}{\frac{1}{\delta}\det(I_d+\eta^{-1}V_{\tau,d})^{1/2}} > 1 \Bigg] \le \mathbb{E}\Bigg[ \lim_{d\to\infty} \frac{\exp\big( \frac{1}{2\sigma^2}\|S_{\tau,d}\|^2_{(V_{\tau,d}+\eta I_d)^{-1}} \big)}{\frac{1}{\delta}\det(I_d+\eta^{-1}V_{\tau,d})^{1/2}} \Bigg] \le \delta \lim_{d\to\infty} \mathbb{E}[M_{\tau,d}] \le \delta.
\]
We now define a random stopping time $\tau$, following Chowdhury and Gopalan (2017), by
\[
\tau = \min\bigg\{ t \ge 1 : \|S_t\|^2_{(V_t+\eta I)^{-1}} > 2\sigma^2 \log\Big( \frac{\det(I+\eta^{-1}V_t)^{1/2}}{\delta} \Big) \bigg\}.
\]
We then have
\[
\mathbb{P}\bigg[ \exists\, t \ge 1 : \|S_t\|^2_{(V_t+\eta I)^{-1}} > 2\sigma^2 \log\Big( \frac{\det(I+\eta^{-1}V_t)^{1/2}}{\delta} \Big) \bigg] = \mathbb{P}[\tau < \infty] \le \delta,
\]
which concludes the proof.

B.1 Concentration bound for the estimate (Proof of Theorem 1)

We first reformulate $\mu_t(x)$ in terms of the feature map $\Phi(x)$ as
\begin{align*}
\mu_t(x) &= G_t(x)^\top (G_t + \eta I_{nt})^{-1} Y_t = \Phi(x)^\top \Phi_{X_t}^\top \big( \Phi_{X_t}\Phi_{X_t}^\top + \eta I_{nt} \big)^{-1} Y_t = \Phi(x)^\top \big( \Phi_{X_t}^\top\Phi_{X_t} + \eta I \big)^{-1} \Phi_{X_t}^\top Y_t \\
&= \Phi(x)^\top (V_t+\eta I)^{-1} \sum_{s=1}^t \Phi(x_s) y_s = \Phi(x)^\top (V_t+\eta I)^{-1} \sum_{s=1}^t \Phi(x_s)\big( f(x_s) + \varepsilon_s \big) = \Phi(x)^\top (V_t+\eta I)^{-1} \sum_{s=1}^t \Phi(x_s)\big( \Phi(x_s)^\top\theta^\star + \varepsilon_s \big) \\
&= \Phi(x)^\top\theta^\star - \eta\,\Phi(x)^\top (V_t+\eta I)^{-1}\theta^\star + \Phi(x)^\top (V_t+\eta I)^{-1} S_t = f(x) + \Phi(x)^\top (V_t+\eta I)^{-1} (S_t - \eta\theta^\star),
\end{align*}
where the third step follows from Lemma 2. We now obtain, from the definition of the operator norm,
\begin{align*}
\|f(x) - \mu_t(x)\| &\le \big\| \Phi(x)^\top (V_t+\eta I)^{-1/2} \big\| \, \big\| (V_t+\eta I)^{-1/2}(S_t - \eta\theta^\star) \big\| \le \big\| (V_t+\eta I)^{-1/2}\Phi(x) \big\| \Big( \|S_t\|_{(V_t+\eta I)^{-1}} + \eta \|\theta^\star\|_{(V_t+\eta I)^{-1}} \Big) \\
&\le \big\| \Phi(x)^\top (V_t+\eta I)^{-1}\Phi(x) \big\|^{1/2} \Big( \|S_t\|_{(V_t+\eta I)^{-1}} + \eta^{1/2}\|f\|_\Gamma \Big),
\end{align*}
where the last step is controlled as $\|\theta^\star\|_{(V_t+\eta I)^{-1}} \le \eta^{-1/2}\|\theta^\star\|_2 = \eta^{-1/2}\|f\|_\Gamma$. A simple application of Lemma 2 now yields
\[
\eta\,\Phi(x)^\top(V_t+\eta I)^{-1}\Phi(x) = \eta\,\Phi(x)^\top(\Phi_{X_t}^\top\Phi_{X_t}+\eta I)^{-1}\Phi(x) = \Phi(x)^\top\Phi(x) - \Phi(x)^\top\Phi_{X_t}^\top(\Phi_{X_t}\Phi_{X_t}^\top+\eta I_{nt})^{-1}\Phi_{X_t}\Phi(x) = \Gamma(x,x) - G_t(x)^\top(G_t+\eta I_{nt})^{-1}G_t(x) = \Gamma_t(x,x). \tag{10}
\]
We then have $\big\|\Phi(x)^\top(V_t+\eta I)^{-1}\Phi(x)\big\|^{1/2} = \eta^{-1/2}\|\Gamma_t(x,x)\|^{1/2}$. We conclude the proof from Lemma 3 and by using Sylvester's identity to get
\[
\det\big( I + \eta^{-1}V_t \big) = \det\big( I + \eta^{-1}\Phi_{X_t}^\top\Phi_{X_t} \big) = \det\big( I_{nt} + \eta^{-1}\Phi_{X_t}\Phi_{X_t}^\top \big) = \det\big( I_{nt} + \eta^{-1}G_t \big). \tag{11}
\]

C Regret analysis of MT-KB
C.1 Properties of predictive variance
Lemma 4 (Sum of predictive variances)
For any $\eta > 0$ and $t \ge 1$,
\[
\frac{1}{\eta} \sum_{s=1}^t \mathrm{trace}\big( \Gamma_s(x_s,x_s) \big) \;\le\; \log\det\big( I_{nt} + \eta^{-1} G_t \big) \;=\; \sum_{s=1}^t \log\det\big( I_n + \eta^{-1} \Gamma_{s-1}(x_s,x_s) \big).
\]

Proof
For the first part, we observe from (10) that
\begin{align*}
\frac{1}{\eta}\sum_{s=1}^t \mathrm{trace}\big(\Gamma_s(x_s,x_s)\big) &= \sum_{s=1}^t \mathrm{trace}\big( \Phi(x_s)^\top(V_s+\eta I)^{-1}\Phi(x_s) \big) = \sum_{s=1}^t \mathrm{trace}\big( (V_s+\eta I)^{-1}\Phi(x_s)\Phi(x_s)^\top \big) \\
&= \sum_{s=1}^t \mathrm{trace}\Big( (V_s+\eta I)^{-1}\big( (V_s+\eta I) - (V_{s-1}+\eta I) \big) \Big) \le \sum_{s=1}^t \log\bigg( \frac{\det(V_s+\eta I)}{\det(V_{s-1}+\eta I)} \bigg) \\
&= \log\det\big( I + \eta^{-1}V_t \big) = \log\det\big( I_{nt} + \eta^{-1}G_t \big).
\end{align*}
Here, the last equality follows from (11). The inequality follows from the fact that for two positive definite operators $A$ and $B$ such that $A - B$ is positive semi-definite, $\mathrm{trace}\big(A^{-1}(A-B)\big) \le \log\big( \det(A)/\det(B) \big)$ (Calandriello et al., 2019). For the second part, we obtain from Schur's determinant identity that
\begin{align*}
\det\big( I_{nt} + \eta^{-1}G_t \big) &= \det\big( I_{n(t-1)} + \eta^{-1}G_{t-1} \big) \times \det\Big( I_n + \eta^{-1}\Gamma(x_t,x_t) - \eta^{-2}\,G_{t-1}(x_t)^\top\big( I_{n(t-1)} + \eta^{-1}G_{t-1} \big)^{-1} G_{t-1}(x_t) \Big) \\
&= \det\big( I_{n(t-1)} + \eta^{-1}G_{t-1} \big)\, \det\big( I_n + \eta^{-1}\Gamma_{t-1}(x_t,x_t) \big) = \cdots = \prod_{s=1}^t \det\big( I_n + \eta^{-1}\Gamma_{s-1}(x_s,x_s) \big).
\end{align*}
We conclude the proof by applying logarithms on both sides.
Lemma 5 (Predictive variance geometry)
Let $\|\Gamma(x,x)\| \le \kappa$. Then, for any $\eta > 0$ and $t \ge 1$,
\[
\Gamma_t(x,x) \preceq \Gamma_{t-1}(x,x) \preceq (1+\kappa/\eta)\,\Gamma_t(x,x).
\]

Proof  Let us define $\bar V_t := V_t + \eta I$ for all $t \ge 0$. We then have from (10) that
\begin{align*}
\Gamma_t(x,x) &= \eta\,\Phi(x)^\top \bar V_t^{-1}\Phi(x) = \eta\,\Phi(x)^\top\big( \bar V_{t-1} + \Phi(x_t)\Phi(x_t)^\top \big)^{-1}\Phi(x) \\
&= \eta\,\Phi(x)^\top \bar V_{t-1}^{-1}\Phi(x) - \eta\,\Phi(x)^\top \bar V_{t-1}^{-1}\Phi(x_t)\Big( I_n + \Phi(x_t)^\top \bar V_{t-1}^{-1}\Phi(x_t) \Big)^{-1}\Phi(x_t)^\top \bar V_{t-1}^{-1}\Phi(x) \\
&= \Gamma_{t-1}(x,x) - \eta^{-1}\,\Gamma_{t-1}(x_t,x)^\top\big( I_n + \eta^{-1}\Gamma_{t-1}(x_t,x_t) \big)^{-1}\Gamma_{t-1}(x_t,x) \preceq \Gamma_{t-1}(x,x).
\end{align*}
Here, in the third step we have used the Sherman-Morrison formula, and in the last step we have used the positive semi-definite property of multi-task kernels. To prove the second part, we first note that
\[
\frac{1}{\eta}\,\Gamma_t(x,x) = \Phi(x)^\top\big( \bar V_{t-1} + \Phi(x_t)\Phi(x_t)^\top \big)^{-1}\Phi(x) = \Phi(x)^\top \bar V_{t-1}^{-1/2}\Big( I + \bar V_{t-1}^{-1/2}\Phi(x_t)\Phi(x_t)^\top \bar V_{t-1}^{-1/2} \Big)^{-1} \bar V_{t-1}^{-1/2}\Phi(x). \tag{12}
\]
Further, since $\|\Gamma(x,x)\| \le \kappa$, we have $\lambda_{\max}(\Gamma(x,x)) \le \kappa$, and hence
\[
\Gamma_t(x,x) \preceq \Gamma_{t-1}(x,x) \preceq \Gamma_{t-2}(x,x) \preceq \cdots \preceq \Gamma_0(x,x) = \Gamma(x,x) \preceq \kappa I_n. \tag{13}
\]
Since $\bar V_{t-1}^{-1/2}\Phi(x_t)\Phi(x_t)^\top \bar V_{t-1}^{-1/2}$ and $\Phi(x_t)^\top \bar V_{t-1}^{-1}\Phi(x_t)$ have the same set of non-zero eigenvalues, we now obtain from (13) that $\bar V_{t-1}^{-1/2}\Phi(x_t)\Phi(x_t)^\top \bar V_{t-1}^{-1/2} \preceq \frac{\kappa}{\eta} I$. Then (12) implies that
\[
\Gamma_t(x,x) \succeq \eta\,\Phi(x)^\top \bar V_{t-1}^{-1}\Phi(x) \big/ (1+\kappa/\eta) = \Gamma_{t-1}(x,x) / (1+\kappa/\eta),
\]
which completes the proof.

C.2 Regret bound for MT-KB (Proof of Theorem 2)
Since the scalarization function $s_\lambda$ is $L_\lambda$-Lipschitz in the $\ell_2$ norm, we have $|s_{\lambda_t}(f(x)) - s_{\lambda_t}(\mu_{t-1}(x))| \le L_{\lambda_t}\|f(x) - \mu_{t-1}(x)\|$. Since $\mu_0(x) = 0$, $\Gamma_0(x,x) = \Gamma(x,x)$ and $\|f\|_\Gamma \le b$, we have
\[
\|f(x) - \mu_0(x)\| = \big\|\Gamma_x^\top f\big\| \le \|f\|_\Gamma\,\|\Gamma_x\| = \|f\|_\Gamma\,\big\|\Gamma_x^\top\Gamma_x\big\|^{1/2} \le b\,\|\Gamma_0(x,x)\|^{1/2}.
\]
Then, from Theorem 1 and Lemma 4, the following holds with probability at least $1-\delta$: $\forall\,t\ge 1$, $\forall\,x\in\mathcal{X}$,
\[
|s_{\lambda_t}(f(x)) - s_{\lambda_t}(\mu_{t-1}(x))| \le L_{\lambda_t}\,\beta_{t-1}\,\|\Gamma_{t-1}(x,x)\|^{1/2}, \tag{14}
\]
where $\beta_t = b + \frac{\sigma}{\sqrt{\eta}}\sqrt{2\log(1/\delta) + \sum_{s=1}^t \log\det\big( I_n + \eta^{-1}\Gamma_{s-1}(x_s,x_s) \big)}$, $t \ge 0$. We can now upper bound the instantaneous regret at time $t \ge 1$ as
\begin{align*}
r_t^\lambda(x_t) := s_{\lambda_t}\big( f(x^\star_{\lambda_t}) \big) - s_{\lambda_t}(f(x_t)) &\le s_{\lambda_t}\big( \mu_{t-1}(x^\star_{\lambda_t}) \big) + L_{\lambda_t}\beta_{t-1}\big\|\Gamma_{t-1}(x^\star_{\lambda_t},x^\star_{\lambda_t})\big\|^{1/2} - s_{\lambda_t}(f(x_t)) \\
&\le s_{\lambda_t}(\mu_{t-1}(x_t)) + L_{\lambda_t}\beta_{t-1}\|\Gamma_{t-1}(x_t,x_t)\|^{1/2} - s_{\lambda_t}(f(x_t)) \le 2 L_{\lambda_t}\beta_{t-1}\|\Gamma_{t-1}(x_t,x_t)\|^{1/2}.
\end{align*}
Here, in the first and third steps we have used (14); the second step follows from the choice of $x_t$. Since $\beta_t$ is monotonically increasing in $t$ and $L_{\lambda_t} \le L$ for all $t$, we have
\[
\sum_{t=1}^T r_t^\lambda(x_t) \le 2L\beta_T \sum_{t=1}^T \|\Gamma_{t-1}(x_t,x_t)\|^{1/2} \le 2L\beta_T \sqrt{ (1+\kappa/\eta)\,T \sum_{t=1}^T \|\Gamma_t(x_t,x_t)\| },
\]
where the last step is due to the Cauchy-Schwarz inequality and Lemma 5. We now obtain from Lemma 4 that $\beta_T \le b + \frac{\sigma}{\sqrt{\eta}}\sqrt{2\log(1/\delta) + \gamma_{nT}(\Gamma,\eta)}$. We conclude the proof by taking an expectation over $\{\lambda_i\}_{i=1}^T \sim P_\lambda$.
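For concreteness, the selection rule invoked in the second step above, $x_t \in \mathrm{argmax}_{x}\, s_{\lambda_t}(\mu_{t-1}(x)) + L_{\lambda_t}\beta_{t-1}\|\Gamma_{t-1}(x,x)\|^{1/2}$, can be sketched as follows (a hypothetical illustration under our own naming; it takes the posterior routine, e.g. the `mt_kb_posterior` helper sketched earlier, as a callable and searches a finite candidate set in place of the acquisition oracle).

```python
import numpy as np

def mt_kb_select(candidates, posterior, beta, L, lam):
    """Sketch (hypothetical names) of the scalarized UCB rule used above:
    x_t maximizes s_lambda(mu_{t-1}(x)) + L * beta_{t-1} * ||Gamma_{t-1}(x, x)||^{1/2}.
    `posterior(x)` should return the pair (mu_{t-1}(x), ||Gamma_{t-1}(x, x)||);
    `candidates` is a finite grid standing in for the acquisition oracle and
    s_lambda is taken to be the linear scalarization with weights `lam`."""
    scores = []
    for x in candidates:
        mu, var_norm = posterior(x)
        scores.append(float(np.dot(lam, mu)) + L * beta * np.sqrt(var_norm))
    return candidates[int(np.argmax(scores))]
```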
C.3 Inter-task structure in regret for separable kernels (Proof of Lemma 1)

For separable multi-task kernels $\Gamma(x,x') = k(x,x')B$, the kernel matrix is given by $G_T = K_T \otimes B$, where $K_T$ is the kernel matrix corresponding to the scalar kernel $k$ and $\otimes$ denotes the Kronecker product. Let $\{\alpha_t\}_{t=1}^T$ denote the eigenvalues of $K_T$. Then the eigenvalues of $G_T$ are given by $\alpha_t\xi_i$, $1 \le t \le T$, $1 \le i \le n$, where the $\xi_i$ are the eigenvalues of $B$. We now have
\[
\log\det\big( I_{nT} + \eta^{-1}G_T \big) = \sum_{t=1}^T\sum_{i=1}^n \log\big( 1 + \alpha_t\xi_i/\eta \big) = \sum_{i\in[n]:\,\xi_i>0}\ \sum_{t=1}^T \log\big( 1 + \alpha_t\xi_i/\eta \big) = \sum_{i\in[n]:\,\xi_i>0} \log\det\big( I_T + (\eta/\xi_i)^{-1} K_T \big).
\]
Taking the supremum over all possible subsets $X_T$ of $\mathcal{X}$, we then obtain that $\gamma_{nT}(\Gamma,\eta) \le \sum_{i\in[n]:\,\xi_i>0} \gamma_T(k,\eta/\xi_i)$.

To prove the second part, we use the feature representation of the scalar kernel $k$. To this end, we let $\varphi : \mathcal{X} \to \ell^2$ be a feature map of the scalar kernel $k$, so that $k(x,x') = \varphi(x)^\top\varphi(x')$ for all $x,x'\in\mathcal{X}$. We now define a map $\varphi_{X_t} : \ell^2 \to \mathbb{R}^t$ by $\varphi_{X_t}\theta := \big[ \varphi(x_1)^\top\theta, \ldots, \varphi(x_t)^\top\theta \big]^\top$, $\forall\,\theta\in\ell^2$. We also let $v_t := \varphi_{X_t}^\top\varphi_{X_t}$ be a map from $\ell^2$ to itself. For any $\alpha > 0$, we then obtain from Lemma 2 that
\[
\alpha\,\varphi(x)^\top(v_t+\alpha I)^{-1}\varphi(x) = \alpha\,\varphi(x)^\top(\varphi_{X_t}^\top\varphi_{X_t}+\alpha I)^{-1}\varphi(x) = \varphi(x)^\top\varphi(x) - \varphi(x)^\top\varphi_{X_t}^\top(\varphi_{X_t}\varphi_{X_t}^\top+\alpha I_t)^{-1}\varphi_{X_t}\varphi(x) = k(x,x) - k_t(x)^\top(K_t+\alpha I_t)^{-1}k_t(x),
\]
where $k_t(x) = [k(x_1,x),\ldots,k(x_t,x)]^\top$ and $K_t = [k(x_i,x_j)]_{i,j=1}^t$. We then have from (6) that
\[
\|\Gamma_t(x,x)\| = \max_{1\le i\le n} \xi_i \bigg( k(x,x) - k_t(x)^\top\Big( K_t + \frac{\eta}{\xi_i} I_t \Big)^{-1} k_t(x) \bigg) = \max_{1\le i\le n} \xi_i \cdot \frac{\eta}{\xi_i}\,\varphi(x)^\top\Big( v_t + \frac{\eta}{\xi_i} I \Big)^{-1}\varphi(x) \le \eta\,\varphi(x)^\top\Big( v_t + \frac{\eta}{\kappa} I \Big)^{-1}\varphi(x).
\]
Here, in the last step we have used that $\xi_i \le \kappa$ for all $i \in [n]$, which holds from our hypothesis $\|\Gamma(x,x)\| \le \kappa$ and $k(x,x) = 1$. We now observe that $\big( v_t + \frac{\eta}{\kappa} I \big)^{-1} \preceq (v_t+\eta I)^{-1}$ for $\kappa \le 1$ and $\big( v_t + \frac{\eta}{\kappa} I \big)^{-1} \preceq \kappa\,(v_t+\eta I)^{-1}$ for $\kappa \ge 1$. Therefore,
\[
\|\Gamma_t(x,x)\| \le \eta\max\{\kappa,1\}\,\varphi(x)^\top(v_t+\eta I)^{-1}\varphi(x).
\]
A simple application of Lemma 4 for $n=1$ and $\Gamma(\cdot,\cdot) = k(\cdot,\cdot)$ now yields
\[
\sum_{t=1}^T \|\Gamma_t(x_t,x_t)\| \le \eta\max\{\kappa,1\} \sum_{t=1}^T \varphi(x_t)^\top(v_t+\eta I)^{-1}\varphi(x_t) \le \eta\max\{\kappa,1\}\,\log\det\big( I_T + \eta^{-1}K_T \big) \le \eta\max\{\kappa,1\}\,\gamma_T(k,\eta),
\]
which completes the proof.
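The Kronecker-structure identity underlying the first part of Lemma 1 is easy to verify numerically; the following check (ours, not the authors') confirms that $\log\det(I_{nT} + \eta^{-1}K_T\otimes B) = \sum_{i:\xi_i>0}\log\det(I_T + (\xi_i/\eta)K_T)$ for random positive semi-definite $K_T$ and $B$.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n, eta = 6, 3, 0.5
A = rng.normal(size=(T, T)); K_T = A @ A.T             # a PSD stand-in for the scalar kernel matrix
C = rng.normal(size=(n, n)); B = C @ C.T               # a PSD task-similarity matrix
G_T = np.kron(K_T, B)                                  # kernel matrix of the separable kernel k * B

lhs = np.linalg.slogdet(np.eye(n * T) + G_T / eta)[1]  # log det(I_nT + G_T / eta)
xi = np.linalg.eigvalsh(B)
rhs = sum(np.linalg.slogdet(np.eye(T) + (x / eta) * K_T)[1] for x in xi if x > 1e-12)
assert np.isclose(lhs, rhs)
```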
C.4 Inter-task structure in regret for sum of separable kernels

We now present a generalization of Lemma 1 for multi-task kernels of the form $\Gamma(x,x') = \sum_{j=1}^M k_j(x,x')B_j$. This class of kernels is called the sum of separable (SoS) kernels and includes the diagonal kernel $\Gamma(x,x') = \mathrm{diag}\big( k_1(x,x'),\ldots,k_n(x,x') \big)$ as a special case.

Lemma 6 (Inter-task structure in regret for SoS kernels)
Let $\Gamma(x,x') = \sum_{j=1}^M k_j(x,x')B_j$ with each $B_j \in \mathbb{R}^{n\times n}$ positive semi-definite. Then the following hold:
\[
\gamma_{nT}(\Gamma,\eta) \le \sum_{j=1}^M \rho_{B_j}\max\{\xi_{B_j},1\}\,\gamma_T(k_j,\eta), \qquad \sum_{t=1}^T \|\Gamma_t(x_t,x_t)\| \le \eta\sum_{j=1}^M \max\{\xi_{B_j},1\}\,\gamma_T(k_j,\eta),
\]
where $\rho_{B_j}$ and $\xi_{B_j}$ denote the rank and the maximum eigenvalue of $B_j$, respectively, and $\gamma_T(k_j,\eta)$ is the maximum information gain corresponding to the scalar kernel $k_j$. Moreover, if $\Gamma(x,x') = \mathrm{diag}\big( k_1(x,x'),\ldots,k_n(x,x') \big)$ and each $k_j$ is a stationary kernel, then
\[
\gamma_{nT}(\Gamma,\eta) \le \sum_{j=1}^n \gamma_T(k_j,\eta), \qquad \sum_{t=1}^T \|\Gamma_t(x_t,x_t)\| \le \eta \max_{1\le j\le n} \gamma_T(k_j,\eta).
\]

Proof
For each scalar kernel $k_j$, we let $\varphi_j : \mathcal{X} \to \ell^2$ be a feature map, so that $k_j(x,x') = \varphi_j(x)^\top\varphi_j(x')$. We now define the feature map
$\Phi : \mathcal{X} \to \mathcal{L}(\mathbb{R}^n,\ell^2)$ of the multi-task kernel $\Gamma(x,x') = \sum_{j=1}^M k_j(x,x')B_j$ by
\[
\Phi(x)y := \Big( \varphi_1(x) \otimes B_1^{1/2}y,\ \ldots,\ \varphi_M(x) \otimes B_M^{1/2}y \Big), \quad \forall\,x\in\mathcal{X},\ y\in\mathbb{R}^n,
\]
with the inner product
\[
\Phi(x)^\top\Phi(x') := \sum_{j=1}^M \Big( \varphi_j(x)\otimes B_j^{1/2} \Big)^\top\Big( \varphi_j(x')\otimes B_j^{1/2} \Big) = \sum_{j=1}^M \varphi_j(x)^\top\varphi_j(x')\cdot B_j.
\]
We then have
\[
V_t := \sum_{s=1}^t \Phi(x_s)\Phi(x_s)^\top = \sum_{s=1}^t\sum_{j=1}^M \varphi_j(x_s)\varphi_j(x_s)^\top \otimes B_j = \sum_{j=1}^M v_{t,j}\otimes B_j,
\]
where $v_{t,j} := \sum_{s=1}^t \varphi_j(x_s)\varphi_j(x_s)^\top$. We further obtain from (10) that
\[
\Gamma_t(x,x) = \sum_{j=1}^M \eta\,\Big( \varphi_j(x)\otimes B_j^{1/2} \Big)^\top \bigg( \sum_{j'=1}^M v_{t,j'}\otimes B_{j'} + \eta I \bigg)^{-1} \Big( \varphi_j(x)\otimes B_j^{1/2} \Big).
\]
Now each $B_j$ is a positive semi-definite matrix and so is $v_{t,j}\otimes B_j$. Hence, for all $j\in[M]$, $\big( \sum_{j'=1}^M v_{t,j'}\otimes B_{j'} + \eta I \big)^{-1} \preceq \big( v_{t,j}\otimes B_j + \eta I \big)^{-1}$. Therefore,
\[
\Gamma_t(x,x) \preceq \sum_{j=1}^M \eta\,\Big( \varphi_j(x)\otimes B_j^{1/2} \Big)^\top \big( v_{t,j}\otimes B_j + \eta I \big)^{-1} \Big( \varphi_j(x)\otimes B_j^{1/2} \Big) = \sum_{j=1}^M \Gamma_{t,j}(x,x), \tag{15}
\]
where $\Gamma_{t,j}(x,x) := \eta\,\big( \varphi_j(x)\otimes B_j^{1/2} \big)^\top\big( v_{t,j}\otimes B_j + \eta I \big)^{-1}\big( \varphi_j(x)\otimes B_j^{1/2} \big)$. Now, let $(\xi_{j,i},u_{j,i})$ denote the $i$-th eigenpair of $B_j$. A similar argument as in (4) then yields
\[
\big( v_{t,j}\otimes B_j + \eta I \big)^{-1} = \sum_{i=1}^n \big( \xi_{j,i}v_{t,j} + \eta I \big)^{-1}\otimes u_{j,i}u_{j,i}^\top.
\]
We then have, from the mixed-product property of the Kronecker product and the orthonormality of $\{u_{j,i}\}_{i=1}^n$, that
\[
\Gamma_{t,j}(x,x) = \sum_{i=1}^n \eta\,\xi_{j,i}\,\varphi_j(x)^\top\big( \xi_{j,i}v_{t,j}+\eta I \big)^{-1}\varphi_j(x)\cdot u_{j,i}u_{j,i}^\top = \sum_{i=1}^n \eta\,\varphi_j(x)^\top\Big( v_{t,j} + \frac{\eta}{\xi_{j,i}} I \Big)^{-1}\varphi_j(x)\cdot u_{j,i}u_{j,i}^\top.
\]
Since $\big( v_{t,j} + \frac{\eta}{\xi_{j,i}} I \big)^{-1} \preceq (v_{t,j}+\eta I)^{-1}$ for $\xi_{j,i}\le 1$ and $\big( v_{t,j} + \frac{\eta}{\xi_{j,i}} I \big)^{-1} \preceq \xi_{j,i}(v_{t,j}+\eta I)^{-1}$ for $\xi_{j,i}\ge 1$, we now have
\[
\mathrm{trace}\big(\Gamma_{t,j}(x,x)\big) \le \eta \sum_{i\in[n]:\,\xi_{j,i}>0} \max\{\xi_{j,i},1\}\,\varphi_j(x)^\top(v_{t,j}+\eta I)^{-1}\varphi_j(x) \le \eta\,\rho_{B_j}\max\{\xi_{B_j},1\}\,\varphi_j(x)^\top(v_{t,j}+\eta I)^{-1}\varphi_j(x).
\]
Similarly,
\[
\|\Gamma_{t,j}(x,x)\| \le \eta \max_{1\le i\le n} \max\{\xi_{j,i},1\}\,\varphi_j(x)^\top(v_{t,j}+\eta I)^{-1}\varphi_j(x) \le \eta\max\{\xi_{B_j},1\}\,\varphi_j(x)^\top(v_{t,j}+\eta I)^{-1}\varphi_j(x).
\]
Let $K_{T,j} = [k_j(x_p,x_q)]_{p,q=1}^T$ denote the kernel matrix corresponding to the scalar kernel $k_j$. An application of Lemma 4 for $n=1$ and $\Gamma(\cdot,\cdot) = k_j(\cdot,\cdot)$ now yields
\[
\sum_{t=1}^T \mathrm{trace}\big(\Gamma_{t,j}(x_t,x_t)\big) \le \eta\,\rho_{B_j}\max\{\xi_{B_j},1\}\,\log\det\big( I_T + \eta^{-1}K_{T,j} \big) \quad\text{and}\quad \sum_{t=1}^T \|\Gamma_{t,j}(x_t,x_t)\| \le \eta\max\{\xi_{B_j},1\}\,\log\det\big( I_T + \eta^{-1}K_{T,j} \big).
\]
We then have from (15) and Lemma 4 that
\[
\log\det\big( I_{nT} + \eta^{-1}G_T \big) = \frac{1}{\eta}\sum_{t=1}^T \mathrm{trace}\big(\Gamma_t(x_t,x_t)\big) \le \frac{1}{\eta}\sum_{j=1}^M\sum_{t=1}^T \mathrm{trace}\big(\Gamma_{t,j}(x_t,x_t)\big) \le \sum_{j=1}^M \rho_{B_j}\max\{\xi_{B_j},1\}\,\log\det\big( I_T + \eta^{-1}K_{T,j} \big).
\]
Taking the supremum over all possible subsets $X_T$ of $\mathcal{X}$, we now obtain that $\gamma_{nT}(\Gamma,\eta) \le \sum_{j=1}^M \rho_{B_j}\max\{\xi_{B_j},1\}\,\gamma_T(k_j,\eta)$. We further have from (15) that
\[
\sum_{t=1}^T \|\Gamma_t(x_t,x_t)\| \le \sum_{j=1}^M\sum_{t=1}^T \|\Gamma_{t,j}(x_t,x_t)\| \le \eta\sum_{j=1}^M \max\{\xi_{B_j},1\}\,\gamma_T(k_j,\eta),
\]
which proves the first two claims. For the diagonal kernel, we have $M = n$ and each $B_j$ is a diagonal matrix with $1$ in the $j$-th diagonal entry and $0$ in all others. In this case, we have
\[
\Gamma_t(x,x) = \eta\sum_{j=1}^n \varphi_j(x)^\top(v_{t,j}+\eta I)^{-1}\varphi_j(x)\cdot B_j.
\]
We then have from Lemma 4 that
\[
\log\det\big( I_{nT} + \eta^{-1}G_T \big) = \frac{1}{\eta}\sum_{t=1}^T \mathrm{trace}\big(\Gamma_t(x_t,x_t)\big) = \sum_{t=1}^T\sum_{j=1}^n \varphi_j(x_t)^\top(v_{t,j}+\eta I)^{-1}\varphi_j(x_t)\cdot\mathrm{trace}(B_j) = \sum_{j=1}^n\sum_{t=1}^T \varphi_j(x_t)^\top(v_{t,j}+\eta I)^{-1}\varphi_j(x_t) = \sum_{j=1}^n \log\det\big( I_T + \eta^{-1}K_{T,j} \big).
\]
Taking the supremum over all possible subsets $X_T$ of $\mathcal{X}$, we now obtain that $\gamma_{nT}(\Gamma,\eta) \le \sum_{j=1}^n \gamma_T(k_j,\eta)$. We further have
\[
\|\Gamma_t(x,x)\| = \max_{1\le j\le n} \eta\,\varphi_j(x)^\top(v_{t,j}+\eta I)^{-1}\varphi_j(x).
\]
Let $j^\star(x) = \mathrm{argmax}_{1\le j\le n}\, k_j(x,x)$. Since each $k_j$ is stationary, i.e., $k_j(x,x') = k_j(x-x')$, $j^\star(x)$ is independent of $x$, and we write $j^\star = j^\star(x)$ for all $x$. Then it can be easily checked that $\|\Gamma_t(x,x)\| = \eta\,\varphi_{j^\star}(x)^\top(v_{t,j^\star}+\eta I)^{-1}\varphi_{j^\star}(x)$.
We now obtain from Lemma 4 that
\[
\sum_{t=1}^T \|\Gamma_t(x_t,x_t)\| = \eta\sum_{t=1}^T \varphi_{j^\star}(x_t)^\top(v_{t,j^\star}+\eta I)^{-1}\varphi_{j^\star}(x_t) = \eta\,\log\det\big( I_T + \eta^{-1}K_{T,j^\star} \big) \le \eta\max_{1\le j\le n}\gamma_T(k_j,\eta),
\]
which completes the proof of the second part.

D Analysis of MT-BKB
Trading-off approximation accuracy and size
Given a dictionary $\mathcal{D}_t = \{x_{i_1},\ldots,x_{i_{m_t}}\}$, we define a map $\Phi_{\mathcal{D}_t} : \ell^2 \to \mathbb{R}^{nm_t}$ by
\[
\Phi_{\mathcal{D}_t}\theta := \Big[ p_{t,i_1}^{-1/2}\big(\Phi(x_{i_1})^\top\theta\big)^\top, \ldots, p_{t,i_{m_t}}^{-1/2}\big(\Phi(x_{i_{m_t}})^\top\theta\big)^\top \Big]^\top, \quad \forall\,\theta\in\ell^2, \tag{16}
\]
where $p_{t,i_j} = \min\big\{ q\,\|\tilde\Gamma_{t-1}(x_{i_j},x_{i_j})\|, 1 \big\}$ for all $j \in [m_t]$.

Lemma 7 (Approximation properties)
For any $T \ge 1$, $\varepsilon \in (0,1)$ and $\delta \in (0,1)$, set $\rho = \frac{1+\varepsilon}{1-\varepsilon}$ and $q = \frac{\rho\ln(2T/\delta)}{\varepsilon^2}$. Then, for any $\eta > 0$, with probability at least $1-\delta$, the following hold uniformly over all $t \in [T]$:
\[
(1-\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} - \varepsilon\eta I \preceq \Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} \preceq (1+\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} + \varepsilon\eta I, \qquad m_t \le \rho q\,(1+\kappa/\eta)\sum_{s=1}^t \|\Gamma_s(x_s,x_s)\|.
\]

Proof
Let $S_t$ be an $nt \times nt$ block-diagonal matrix with $i$-th diagonal block $[S_t]_i = p_{t,i}^{-1/2} I_n$ if $x_i \in \mathcal{D}_t$, and $[S_t]_i = 0$ if $x_i \notin \mathcal{D}_t$, $1 \le i \le t$. We then have $\Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} = \Phi_{X_t}^\top S_t^\top S_t\Phi_{X_t}$. The proof can now be completed by following Calandriello et al. (2019, Theorem 1).
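For illustration, the dictionary update that Lemma 7 analyses can be sketched as follows (our own hypothetical code): each previously queried point is retained independently with probability $p_{t,i} = \min\{q\,\|\tilde\Gamma_{t-1}(x_i,x_i)\|, 1\}$, and the retained probabilities are kept for the $p_{t,i}^{-1/2}$ reweighting in (16).

```python
import numpy as np

def resample_dictionary(X, q, var_norms, rng=np.random.default_rng()):
    """Sketch (our own, hypothetical names) of the MT-BKB dictionary update.
    X         : (t, d) points queried so far
    q         : approximation trade-off parameter
    var_norms : (t,) values ||tilde Gamma_{t-1}(x_i, x_i)|| reused from the previous round
    Returns the retained points and their inclusion probabilities."""
    p = np.minimum(q * np.asarray(var_norms), 1.0)   # p_{t,i} = min(q ||tilde Gamma_{t-1}(x_i, x_i)||, 1)
    keep = rng.random(len(p)) < p                    # independent Bernoulli inclusion
    return X[keep], p[keep]
```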
Remark 4  Note that although tuning the approximation trade-off parameter $q$ requires knowledge of the time horizon $T$ in advance, Lemma 7 is quite robust to uncertainty about $T$. If the horizon is not known, then after the $T$-th step one can increase $q$ according to the new desired horizon and update the dictionary with this new value of $q$. Combining this with a standard doubling trick preserves the approximation properties (Calandriello et al., 2019).

Approximating the confidence set
We now focus on the dictionary $\mathcal{D}_t$ chosen by MT-BKB at each step and discuss a principled approach to compute the approximations $\tilde\mu_t(x)$ and $\tilde\Gamma_t(x,x)$. To this end, we let
\[
P_t = \Phi_{\mathcal{D}_t}^\top\big( \Phi_{\mathcal{D}_t}\Phi_{\mathcal{D}_t}^\top \big)^{+}\Phi_{\mathcal{D}_t} \tag{17}
\]
denote the symmetric orthogonal projection operator onto the subspace of $\mathcal{L}(\mathbb{R}^n,\ell^2)$ spanned by $\Phi(x_{i_1}),\ldots,\Phi(x_{i_{m_t}})$. We also let $\hat\Phi_t(x) = P_t\Phi(x)$ denote the projection of $\Phi(x)$. We now define a map $\hat\Phi_{X_t} : \ell^2 \to \mathbb{R}^{nt}$ by
\[
\hat\Phi_{X_t}\theta := \Big[ \big( \hat\Phi_t(x_1)^\top\theta \big)^\top, \ldots, \big( \hat\Phi_t(x_t)^\top\theta \big)^\top \Big]^\top, \quad \forall\,\theta\in\ell^2.
\]
We then have $\hat\Phi_{X_t} = \Phi_{X_t}P_t$ and $\hat\Phi_{X_t}\hat\Phi_{X_t}^\top = \Phi_{X_t}P_t\Phi_{X_t}^\top$.

Lemma 8 (Approximation as given by projection)
Let $\hat V_t := \hat\Phi_{X_t}^\top\hat\Phi_{X_t}$. Then, for any $\eta > 0$ and $t \ge 1$, the following hold:
\[
\tilde\mu_t(x) = \Phi(x)^\top\big( \hat V_t + \eta I \big)^{-1}\sum_{s=1}^t \hat\Phi_t(x_s)y_s, \qquad \tilde\Gamma_t(x,x) = \eta\,\Phi(x)^\top\big( \hat V_t + \eta I \big)^{-1}\Phi(x).
\]

Proof  We first note that $\tilde\Phi_t(x)^\top\tilde\Phi_t(x') = \tilde G_t(x)^\top\tilde G_t^{+}\tilde G_t(x') = \Phi(x)^\top P_t\Phi(x')$. We now define an $nt\times nm_t$ matrix $\tilde\Phi_{X_t} = \big[ \tilde\Phi_t(x_1),\ldots,\tilde\Phi_t(x_t) \big]^\top$. We then have
\[
\tilde\Phi_{X_t}\tilde\Phi_t(x) = \Phi_{X_t}P_t\Phi(x) = \hat\Phi_{X_t}\Phi(x), \qquad \tilde\Phi_{X_t}\tilde\Phi_{X_t}^\top = \Phi_{X_t}P_t\Phi_{X_t}^\top = \hat\Phi_{X_t}\hat\Phi_{X_t}^\top, \tag{18}
\]
where $P_t$ is the projection operator defined in (17). We also have $\tilde V_t := \sum_{s=1}^t\tilde\Phi_t(x_s)\tilde\Phi_t(x_s)^\top = \tilde\Phi_{X_t}^\top\tilde\Phi_{X_t}$. Therefore,
\begin{align*}
\tilde\mu_t(x) &= \tilde\Phi_t(x)^\top\big( \tilde\Phi_{X_t}^\top\tilde\Phi_{X_t} + \eta I_{nm_t} \big)^{-1}\sum_{s=1}^t\tilde\Phi_t(x_s)y_s = \tilde\Phi_t(x)^\top\big( \tilde\Phi_{X_t}^\top\tilde\Phi_{X_t} + \eta I_{nm_t} \big)^{-1}\tilde\Phi_{X_t}^\top Y_t = \tilde\Phi_t(x)^\top\tilde\Phi_{X_t}^\top\big( \tilde\Phi_{X_t}\tilde\Phi_{X_t}^\top + \eta I_{nt} \big)^{-1} Y_t \\
&= \Phi(x)^\top\hat\Phi_{X_t}^\top\big( \hat\Phi_{X_t}\hat\Phi_{X_t}^\top + \eta I_{nt} \big)^{-1} Y_t = \Phi(x)^\top\big( \hat\Phi_{X_t}^\top\hat\Phi_{X_t} + \eta I \big)^{-1}\hat\Phi_{X_t}^\top Y_t = \Phi(x)^\top\big( \hat V_t + \eta I \big)^{-1}\sum_{s=1}^t\hat\Phi_t(x_s)y_s,
\end{align*}
where in the third and fifth steps we have used Lemma 2, and in the fourth step we have used (18). Further,
\begin{align*}
\tilde\Gamma_t(x,x) &= \Gamma(x,x) - \tilde\Phi_t(x)^\top\tilde\Phi_t(x) + \eta\,\tilde\Phi_t(x)^\top\big( \tilde\Phi_{X_t}^\top\tilde\Phi_{X_t} + \eta I_{nm_t} \big)^{-1}\tilde\Phi_t(x) = \Gamma(x,x) - \tilde\Phi_t(x)^\top\Big( I_{nm_t} - \eta\big( \tilde\Phi_{X_t}^\top\tilde\Phi_{X_t} + \eta I_{nm_t} \big)^{-1} \Big)\tilde\Phi_t(x) \\
&= \Gamma(x,x) - \tilde\Phi_t(x)^\top\tilde\Phi_{X_t}^\top\big( \tilde\Phi_{X_t}\tilde\Phi_{X_t}^\top + \eta I_{nt} \big)^{-1}\tilde\Phi_{X_t}\tilde\Phi_t(x) = \Phi(x)^\top\Phi(x) - \Phi(x)^\top\hat\Phi_{X_t}^\top\big( \hat\Phi_{X_t}\hat\Phi_{X_t}^\top + \eta I_{nt} \big)^{-1}\hat\Phi_{X_t}\Phi(x) \\
&= \Phi(x)^\top\Big( I - \hat\Phi_{X_t}^\top\big( \hat\Phi_{X_t}\hat\Phi_{X_t}^\top + \eta I_{nt} \big)^{-1}\hat\Phi_{X_t} \Big)\Phi(x) = \eta\,\Phi(x)^\top\big( \hat\Phi_{X_t}^\top\hat\Phi_{X_t} + \eta I \big)^{-1}\Phi(x) = \eta\,\Phi(x)^\top\big( \hat V_t + \eta I \big)^{-1}\Phi(x),
\end{align*}
where in the third and sixth steps we have used Lemma 2, and in the fourth step we have used (18).

Lemma 9 (Multi-task concentration under Nyström approximation)
Let $f \in \mathcal{H}_\Gamma(\mathcal{X})$ and the noise vectors $\{\varepsilon_t\}_{t\ge 1}$ be $\sigma$-sub-Gaussian. Further, for any $\eta > 0$, $\varepsilon \in (0,1)$ and $t \ge 1$, let $(1-\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} - \varepsilon\eta I \preceq \Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} \preceq (1+\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} + \varepsilon\eta I$. Then, for any $\delta \in (0,1]$, with probability at least $1-\delta$, the following holds uniformly over all $x\in\mathcal{X}$ and $t\ge 1$:
\[
\|f(x) - \tilde\mu_t(x)\| \le \bigg( c_\varepsilon\|f\|_\Gamma + \frac{\sigma}{\sqrt{\eta}}\sqrt{2\log(1/\delta) + \log\det\big( I_{nt} + \eta^{-1}G_t \big)} \bigg) \big\|\tilde\Gamma_t(x,x)\big\|^{1/2},
\]
where $c_\varepsilon = 1 + \frac{1}{\sqrt{1-\varepsilon}}$.

Proof  Let us first define $\tilde\alpha_t(x) := \Phi(x)^\top\big( \hat V_t + \eta I \big)^{-1}\sum_{s=1}^t\hat\Phi_t(x_s)f(x_s)$, where $\hat V_t = \hat\Phi_{X_t}^\top\hat\Phi_{X_t}$. We now note that $f(x) = \Phi(x)^\top\theta^\star$ and $\tilde\alpha_t(x) = \Phi(x)^\top\big( \hat V_t + \eta I \big)^{-1}\hat\Phi_{X_t}^\top\Phi_{X_t}\theta^\star$ for some $\theta^\star \in \ell^2$ such that $\|f\|_\Gamma = \|\theta^\star\|_2$. We then have
\begin{align*}
\|f(x) - \tilde\alpha_t(x)\| &= \Big\| \Phi(x)^\top\Big( \theta^\star - \big( \hat V_t + \eta I \big)^{-1}\hat\Phi_{X_t}^\top\Phi_{X_t}\theta^\star \Big) \Big\| \le \Big\| \Phi(x)^\top\big( \hat V_t + \eta I \big)^{-1/2} \Big\|\, \Big\| \theta^\star - \big( \hat V_t + \eta I \big)^{-1}\hat\Phi_{X_t}^\top\Phi_{X_t}\theta^\star \Big\|_{(\hat V_t+\eta I)} \\
&= \Big\| \Phi(x)^\top\big( \hat V_t + \eta I \big)^{-1}\Phi(x) \Big\|^{1/2} \Big\| \big( \hat V_t + \eta I \big)\theta^\star - \hat\Phi_{X_t}^\top\Phi_{X_t}\theta^\star \Big\|_{(\hat V_t+\eta I)^{-1}} = \eta^{-1/2}\big\|\tilde\Gamma_t(x,x)\big\|^{1/2}\, \Big\| \eta\theta^\star - \hat\Phi_{X_t}^\top\big( \Phi_{X_t} - \hat\Phi_{X_t} \big)\theta^\star \Big\|_{(\hat V_t+\eta I)^{-1}} \\
&\le \eta^{-1/2}\big\|\tilde\Gamma_t(x,x)\big\|^{1/2} \Big( \eta\,\|\theta^\star\|_{(\hat V_t+\eta I)^{-1}} + \big\| \hat\Phi_{X_t}^\top\Phi_{X_t}(I-P_t)\theta^\star \big\|_{(\hat V_t+\eta I)^{-1}} \Big) \\
&\le \Big( \|\theta^\star\| + \eta^{-1/2}\Big\| \big( \hat V_t + \eta I \big)^{-1/2}\hat\Phi_{X_t}^\top\Phi_{X_t}(I-P_t)\theta^\star \Big\| \Big) \big\|\tilde\Gamma_t(x,x)\big\|^{1/2}.
\end{align*}
Here we have used Lemma 8 (to identify $\eta\,\Phi(x)^\top(\hat V_t+\eta I)^{-1}\Phi(x)$ with $\tilde\Gamma_t(x,x)$) and $\hat\Phi_{X_t} = \Phi_{X_t}P_t$, where $P_t$ is the projection operator defined in (17). The last step is controlled as $\|\theta^\star\|_{(\hat V_t+\eta I)^{-1}} \le \eta^{-1/2}\|\theta^\star\|$.
We now have
\[
\Big\| \big( \hat V_t + \eta I \big)^{-1/2}\hat\Phi_{X_t}^\top\Phi_{X_t}(I-P_t)\theta^\star \Big\| \le \Big\| \big( \hat V_t + \eta I \big)^{-1/2}\hat\Phi_{X_t}^\top \Big\|\, \big\| \Phi_{X_t}(I-P_t) \big\|\, \|\theta^\star\| \le \big\| \Phi_{X_t}(I-P_t)\Phi_{X_t}^\top \big\|^{1/2}\, \|\theta^\star\|,
\]
where we have used that $\big\| \big( \hat V_t + \eta I \big)^{-1/2}\hat\Phi_{X_t}^\top \big\|^2 = \big\| \hat\Phi_{X_t}\big( \hat\Phi_{X_t}^\top\hat\Phi_{X_t} + \eta I \big)^{-1}\hat\Phi_{X_t}^\top \big\| \le 1$ and $(I-P_t)^2 = I-P_t$. We now observe from Lemma 2 and our hypothesis $(1-\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} - \varepsilon\eta I \preceq \Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} \preceq (1+\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} + \varepsilon\eta I$ that
\[
I - P_t \preceq I - \Phi_{\mathcal{D}_t}^\top\big( \Phi_{\mathcal{D}_t}\Phi_{\mathcal{D}_t}^\top + \eta I_{nm_t} \big)^{-1}\Phi_{\mathcal{D}_t} = \eta\,\big( \Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} + \eta I \big)^{-1} \preceq \frac{\eta}{1-\varepsilon}\big( \Phi_{X_t}^\top\Phi_{X_t} + \eta I \big)^{-1},
\]
and therefore,
\[
\big\| \Phi_{X_t}(I-P_t)\Phi_{X_t}^\top \big\|^{1/2} \le \sqrt{\frac{\eta}{1-\varepsilon}}\, \big\| \Phi_{X_t}\big( \Phi_{X_t}^\top\Phi_{X_t} + \eta I \big)^{-1}\Phi_{X_t}^\top \big\|^{1/2} \le \sqrt{\frac{\eta}{1-\varepsilon}}.
\]
Putting it all together, we now have
\[
\|f(x) - \tilde\alpha_t(x)\| \le \|\theta^\star\|\bigg( 1 + \frac{1}{\sqrt{1-\varepsilon}} \bigg)\big\|\tilde\Gamma_t(x,x)\big\|^{1/2} = c_\varepsilon\,\|f\|_\Gamma\,\big\|\tilde\Gamma_t(x,x)\big\|^{1/2}, \tag{19}
\]
where we have used that $\|\theta^\star\|_2 = \|f\|_\Gamma$ and $c_\varepsilon = 1 + \frac{1}{\sqrt{1-\varepsilon}}$. We further obtain from Lemma 8 that
\[
\|\tilde\mu_t(x) - \tilde\alpha_t(x)\| = \bigg\| \Phi(x)^\top\big( \hat V_t + \eta I \big)^{-1}\sum_{s=1}^t\hat\Phi_t(x_s)\big( y_s - f(x_s) \big) \bigg\| \le \Big\| \Phi(x)^\top\big( \hat V_t + \eta I \big)^{-1/2} \Big\|\, \bigg\| \sum_{s=1}^t\hat\Phi_t(x_s)\varepsilon_s \bigg\|_{(\hat V_t+\eta I)^{-1}} = \eta^{-1/2}\big\|\tilde\Gamma_t(x,x)\big\|^{1/2}\, \big\| \hat\Phi_{X_t}^\top E_t \big\|_{(\hat V_t+\eta I)^{-1}},
\]
where $E_t = \big[ \varepsilon_1^\top,\ldots,\varepsilon_t^\top \big]^\top$ denotes the $nt\times 1$ vector formed by concatenating the noise vectors $\varepsilon_i$, $1\le i\le t$.
We now have
\begin{align*}
\big\| \hat\Phi_{X_t}^\top E_t \big\|^2_{(\hat V_t+\eta I)^{-1}} &= E_t^\top\hat\Phi_{X_t}\big( \hat\Phi_{X_t}^\top\hat\Phi_{X_t} + \eta I \big)^{-1}\hat\Phi_{X_t}^\top E_t = E_t^\top\Big( I_{nt} - \eta\big( \hat\Phi_{X_t}\hat\Phi_{X_t}^\top + \eta I_{nt} \big)^{-1} \Big) E_t \\
&\le E_t^\top\Big( I_{nt} - \eta\big( \Phi_{X_t}\Phi_{X_t}^\top + \eta I_{nt} \big)^{-1} \Big) E_t = E_t^\top\Phi_{X_t}\big( \Phi_{X_t}^\top\Phi_{X_t} + \eta I \big)^{-1}\Phi_{X_t}^\top E_t = \big\| \Phi_{X_t}^\top E_t \big\|^2_{(V_t+\eta I)^{-1}},
\end{align*}
where in the second and fourth steps we have used Lemma 2, and in the third step we have used $\hat\Phi_{X_t}\hat\Phi_{X_t}^\top = \Phi_{X_t}P_t\Phi_{X_t}^\top \preceq \Phi_{X_t}\Phi_{X_t}^\top$. We then have
\[
\|\tilde\mu_t(x) - \tilde\alpha_t(x)\| \le \eta^{-1/2}\bigg\| \sum_{s=1}^t\Phi(x_s)\varepsilon_s \bigg\|_{(V_t+\eta I)^{-1}}\big\|\tilde\Gamma_t(x,x)\big\|^{1/2} = \eta^{-1/2}\,\|S_t\|_{(V_t+\eta I)^{-1}}\,\big\|\tilde\Gamma_t(x,x)\big\|^{1/2}, \tag{20}
\]
where $S_t := \sum_{s=1}^t\Phi(x_s)\varepsilon_s$. Combining (19) and (20), we now obtain
\[
\|f(x) - \tilde\mu_t(x)\| \le \|f(x) - \tilde\alpha_t(x)\| + \|\tilde\alpha_t(x) - \tilde\mu_t(x)\| \le \Big( c_\varepsilon\|f\|_\Gamma + \eta^{-1/2}\|S_t\|_{(V_t+\eta I)^{-1}} \Big)\big\|\tilde\Gamma_t(x,x)\big\|^{1/2}.
\]
We now conclude the proof using Lemma 3.
Preventing variance starvation
We now show that an accurate dictionary helps us avoid variance starvation in the Nyström approximation.

Lemma 10 (Predictive variance control)
For any $\eta > 0$ and $\varepsilon \in (0,1)$, let $\rho = (1+\varepsilon)/(1-\varepsilon)$ and $(1-\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} - \varepsilon\eta I \preceq \Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} \preceq (1+\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} + \varepsilon\eta I$. Then
\[
\frac{1}{\rho}\,\Gamma_t(x,x) \preceq \tilde\Gamma_t(x,x) \preceq \rho\,\Gamma_t(x,x).
\]

Proof
We first note that $\hat\Phi_{X_t}^\top\hat\Phi_{X_t} = P_t\Phi_{X_t}^\top\Phi_{X_t}P_t$, where $P_t$ is the projection operator defined in (17). Then our hypothesis $(1-\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} - \varepsilon\eta I \preceq \Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} \preceq (1+\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} + \varepsilon\eta I$ can be reformulated as
\[
\frac{1}{1+\varepsilon}\,P_t\Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t}P_t - \frac{\varepsilon\eta}{1+\varepsilon}\,P_t \;\preceq\; \hat\Phi_{X_t}^\top\hat\Phi_{X_t} \;\preceq\; \frac{1}{1-\varepsilon}\,P_t\Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t}P_t + \frac{\varepsilon\eta}{1-\varepsilon}\,P_t.
\]
Since, by definition, $P_t\Phi_{\mathcal{D}_t}^\top = \Phi_{\mathcal{D}_t}^\top$ and $P_t \preceq I$, we have
\[
\frac{1}{1+\varepsilon}\,\Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} - \frac{\varepsilon\eta}{1+\varepsilon}\,I \;\preceq\; \hat\Phi_{X_t}^\top\hat\Phi_{X_t} \;\preceq\; \frac{1}{1-\varepsilon}\,\Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} + \frac{\varepsilon\eta}{1-\varepsilon}\,I,
\]
and thus, in turn,
\[
\frac{1}{1+\varepsilon}\big( \Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} + \eta I \big) \;\preceq\; \hat\Phi_{X_t}^\top\hat\Phi_{X_t} + \eta I \;\preceq\; \frac{1}{1-\varepsilon}\big( \Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} + \eta I \big).
\]
We now obtain from our hypothesis that
\[
\frac{1-\varepsilon}{1+\varepsilon}\big( \Phi_{X_t}^\top\Phi_{X_t} + \eta I \big) \;\preceq\; \hat\Phi_{X_t}^\top\hat\Phi_{X_t} + \eta I \;\preceq\; \frac{1+\varepsilon}{1-\varepsilon}\big( \Phi_{X_t}^\top\Phi_{X_t} + \eta I \big).
\]
This further implies that
\[
\frac{1-\varepsilon}{1+\varepsilon}\,\Phi(x)^\top(V_t+\eta I)^{-1}\Phi(x) \;\preceq\; \Phi(x)^\top\big( \hat V_t + \eta I \big)^{-1}\Phi(x) \;\preceq\; \frac{1+\varepsilon}{1-\varepsilon}\,\Phi(x)^\top(V_t+\eta I)^{-1}\Phi(x),
\]
which completes the proof.

D.1 Regret bound and dictionary size for MT-BKB (Proof of Theorem 3)
Since the scalarization function $s_\lambda$ is $L_\lambda$-Lipschitz in the $\ell_2$ norm, we have $|s_{\lambda_t}(f(x)) - s_{\lambda_t}(\tilde\mu_{t-1}(x))| \le L_{\lambda_t}\|f(x) - \tilde\mu_{t-1}(x)\|$. Since $\tilde\mu_0(x) = 0$, $\tilde\Gamma_0(x,x) = \Gamma(x,x)$ and $\|f\|_\Gamma \le b$, we have
\[
\|f(x) - \tilde\mu_0(x)\| = \big\|\Gamma_x^\top f\big\| \le \|f\|_\Gamma\,\|\Gamma_x\| = \|f\|_\Gamma\,\big\|\Gamma_x^\top\Gamma_x\big\|^{1/2} \le b\,\big\|\tilde\Gamma_0(x,x)\big\|^{1/2}.
\]
Since $\log(1+ax) \le a\log(1+x)$ holds for any $a \ge 1$ and $x \ge 0$, we obtain from Lemma 4 and Lemma 10 that
\[
\log\det\big( I_{nt} + \eta^{-1}G_t \big) = \sum_{s=1}^t \log\det\big( I_n + \eta^{-1}\Gamma_{s-1}(x_s,x_s) \big) \le \rho\sum_{s=1}^t \log\det\big( I_n + \eta^{-1}\tilde\Gamma_{s-1}(x_s,x_s) \big), \tag{21}
\]
where $\rho = \frac{1+\varepsilon}{1-\varepsilon}$. Let us now assume, for any $t \ge 1$, that
\[
(1-\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} - \varepsilon\eta I \preceq \Phi_{\mathcal{D}_t}^\top\Phi_{\mathcal{D}_t} \preceq (1+\varepsilon)\Phi_{X_t}^\top\Phi_{X_t} + \varepsilon\eta I. \tag{22}
\]
Then, from (21) and Lemma 9, the following holds with probability at least $1-\delta/2$: $\forall\,t\ge 1$, $\forall\,x\in\mathcal{X}$,
\[
|s_{\lambda_t}(f(x)) - s_{\lambda_t}(\tilde\mu_{t-1}(x))| \le L_{\lambda_t}\,\tilde\beta_{t-1}\,\big\|\tilde\Gamma_{t-1}(x,x)\big\|^{1/2}, \tag{23}
\]
where $\tilde\beta_t = c_\varepsilon b + \frac{\sigma}{\sqrt{\eta}}\sqrt{2\log(2/\delta) + \rho\sum_{s=1}^t \log\det\big( I_n + \eta^{-1}\tilde\Gamma_{s-1}(x_s,x_s) \big)}$, $t \ge 0$, and $c_\varepsilon = 1 + \frac{1}{\sqrt{1-\varepsilon}}$. We can now upper bound the instantaneous regret at time $t \ge 1$ as
\begin{align*}
r_t^\lambda(x_t) := s_{\lambda_t}\big( f(x^\star_{\lambda_t}) \big) - s_{\lambda_t}(f(x_t)) &\le s_{\lambda_t}\big( \tilde\mu_{t-1}(x^\star_{\lambda_t}) \big) + L_{\lambda_t}\tilde\beta_{t-1}\big\|\tilde\Gamma_{t-1}(x^\star_{\lambda_t},x^\star_{\lambda_t})\big\|^{1/2} - s_{\lambda_t}(f(x_t)) \\
&\le s_{\lambda_t}(\tilde\mu_{t-1}(x_t)) + L_{\lambda_t}\tilde\beta_{t-1}\big\|\tilde\Gamma_{t-1}(x_t,x_t)\big\|^{1/2} - s_{\lambda_t}(f(x_t)) \le 2 L_{\lambda_t}\tilde\beta_{t-1}\big\|\tilde\Gamma_{t-1}(x_t,x_t)\big\|^{1/2}.
\end{align*}
Here, in the first and third steps we have used (23); the second step follows from the choice of $x_t$. Since $\tilde\beta_t$ is monotonically increasing in $t$ and $L_{\lambda_t} \le L$ for all $t$, we now have
\[
\sum_{t=1}^T r_t^\lambda(x_t) \le 2L\tilde\beta_T \sum_{t=1}^T \big\|\tilde\Gamma_{t-1}(x_t,x_t)\big\|^{1/2} \le 2L\tilde\beta_T \sqrt{ \rho T \sum_{t=1}^T \|\Gamma_{t-1}(x_t,x_t)\| } \le 2L\tilde\beta_T \sqrt{ \rho (1+\kappa/\eta) T \sum_{t=1}^T \|\Gamma_t(x_t,x_t)\| },
\]
where the second-to-last step is due to the Cauchy-Schwarz inequality and Lemma 10, and the last step is due to Lemma 5. A similar argument as in (21) now yields
\[
\sum_{t=1}^T \log\det\big( I_n + \eta^{-1}\tilde\Gamma_{t-1}(x_t,x_t) \big) \le \rho\sum_{t=1}^T \log\det\big( I_n + \eta^{-1}\Gamma_{t-1}(x_t,x_t) \big) = \rho\,\log\det\big( I_{nT} + \eta^{-1}G_T \big) \le \rho\,\gamma_{nT}(\Gamma,\eta).
\]
We then have $\tilde\beta_T \le c_\varepsilon b + \frac{\sigma}{\sqrt{\eta}}\sqrt{2\log(2/\delta) + \rho^2\gamma_{nT}(\Gamma,\eta)}$. Setting $q = \frac{\rho\ln(4T/\delta)}{\varepsilon^2}$, we now have from Lemma 7 that, with probability at least $1-\delta/2$, uniformly across all $t \in [T]$, the dictionary size satisfies $m_t \le \rho q\,(1+\kappa/\eta)\sum_{s=1}^t\|\Gamma_s(x_s,x_s)\|$ and (22) holds. Taking an expectation over $\{\lambda_i\}_{i=1}^T \sim P_\lambda$ and using a union bound argument, we then obtain, with probability at least $1-\delta$, the cumulative regret
\[
R_C^{\text{MT-BKB}}(T) \le 2L\bigg( c_\varepsilon b + \frac{\sigma}{\sqrt{\eta}}\sqrt{2\log(2/\delta) + \rho^2\gamma_{nT}(\Gamma,\eta)} \bigg) \sqrt{ \rho (1+\kappa/\eta) T \sum_{t=1}^T \|\Gamma_t(x_t,x_t)\| }.
\]
We conclude the proof by noting that $\rho = \frac{1+\varepsilon}{1-\varepsilon} > 1$ and $c_\varepsilon = 1 + \frac{1}{\sqrt{1-\varepsilon}} \le 2\rho$.

E Additional details on experiments
Cumulative regret using linear scalarization
We sample from $P_\lambda$ as $\lambda = u/\|u\|$, where $u$ is uniformly sampled from $[0,1]^n$ (a minimal sampling sketch is given after Figure 2). We plot the time-average cumulative regret $\frac{1}{T}R_C(T)$ in Figure 2.

Figure 2: Comparison of time-average cumulative regret of MT-KB and MT-BKB with IT-KB, IT-BKB and MOBO using linear scalarization. (a) RKHS function, (b) Perturbed sine function, (c) Sensor measurements.
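A minimal sketch of the weight sampling used above (our own illustration; we assume the normalization is in the $\ell_1$ norm so that the weights lie on the probability simplex):

```python
import numpy as np

def sample_linear_weights(n, rng=np.random.default_rng()):
    """Draw scalarization weights: u ~ Uniform([0, 1]^n), lambda = u / ||u||_1."""
    u = rng.random(n)
    return u / u.sum()

lam = sample_linear_weights(3)   # e.g. n = 3 tasks
print(lam, lam.sum())            # non-negative weights summing to 1
```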
Comparison of Bayes regret
We compare the Bayes regret $R_B(T)$ of MT-KB and MT-BKB with the independent-task benchmarks IT-KB, IT-BKB and MOBO using Chebyshev scalarization in Figure 3.

Figure 3: Comparison of Bayes regret of MT-KB and MT-BKB with IT-KB, IT-BKB and MOBO using Chebyshev scalarization. (a) RKHS function, (b) Perturbed sine function, (c) Sensor measurements.
Comments on parameters used
We set the confidence radii (i.e., $\beta_t$ and $\tilde\beta_t$) of MT-KB and MT-BKB exactly as given in Theorem 2 and Theorem 3, respectively. Similarly, for IT-KB and IT-BKB we use the respective choices of radii given in Chowdhury and Gopalan (2017) and Calandriello et al. (2019) in the context of single-task BO, and suitably blow them up by a $\sqrt{n}$ factor to account for the $n$ tasks. For MOBO, we use the UCB acquisition function and set the radius as specified in (?). To make the comparison uniform across all experiments, we do not tune any hyper-parameter for any algorithm, and for a particular hyper-parameter we always use the same value in all algorithms. The hyper-parameter choices are specified in Section 5. We believe, though, that careful tuning of hyper-parameters might lead to better performance in practice.

A note on the sensor data
The data was collected at 30-second intervals for 5 consecutive days, starting Feb. 28th, 2004, from 54 sensors deployed in the Intel Berkeley Research lab. We previously downloaded the data from the webpage http://db.csail.mit.edu/labdata/labdata.