Estimating Quality in Multi-Objective Bandits Optimization
Audrey Durand and Christian Gagné
Computer Vision and Systems Laboratory
Université Laval, Québec (QC), Canada
April 24, 2017
Abstract
Many real-world applications are characterized by a number of conflicting performance measures. As optimizing in a multi-objective setting leads to a set of non-dominated solutions, a preference function is required for selecting the solution with the appropriate trade-off between the objectives. The question is: how good do estimations of these objectives have to be in order for the solution maximizing the preference function to remain unchanged? In this paper, we introduce the concept of preference radius to characterize the robustness of the preference function and provide guidelines for controlling the quality of estimations in the multi-objective setting. More specifically, we provide a general formulation of multi-objective optimization under the bandits setting. We show how the preference radius relates to the optimality gap, and we use this concept to provide a theoretical analysis of the Thompson sampling algorithm from multivariate normal priors. We finally present experiments to support the theoretical results and highlight the fact that one cannot simply scalarize multi-objective problems into single-objective problems.
Introduction

Multi-objective optimization (MOO) [4] is a topic of great importance for real-world applications. Indeed, optimization problems are often characterized by a number of conflicting, even contradictory, performance measures relevant to the task at hand. For example, when deciding on the healthcare treatment to follow for a given sick patient, a trade-off must be made between the efficiency of the treatment at healing the sickness, the side effects of the treatment, and the treatment cost. MOO is often tackled by combining the objectives into a single measure (a.k.a. scalarization). Such approaches are said to be a priori, as the preferences over the objectives are defined before carrying out the optimization itself. The challenge lies in determining the appropriate scalarization function to use and its parameterization. Another way to conduct MOO consists in learning the optimal trade-offs (the so-called Pareto-optimal set). Once the optimization is completed, techniques from the field of multi-criteria decision-making are applied to help the user select the final solution from the Pareto-optimal set. These a posteriori techniques may require a huge number of evaluations to obtain a reliable estimation of the objective values over all potential solutions. Indeed, the Pareto-optimal set can be quite large, encompassing a majority, if not all, of the potential solutions. In this work, we tackle the MOO problem where the scalarization function exists a priori, but might be unknown, in which case a user can act as a black box for articulating preferences. Integrating the user into the learning loop, she can provide feedback by selecting her preferred choice given a set of options, the scalarization function lying in her head.

More specifically, we consider problems where outcomes are stochastic and costly to evaluate (e.g., involving a human in the loop). The challenge is therefore to identify the best solutions given random observations sampled from different (unknown) density distributions. We formulate this problem as multi-objective bandits, where we aim at finding the solution that maximizes the preference function while maximizing the performance of the solutions evaluated during the optimization. The Thompson sampling (TS) [8] technique is a typical approach for bandits problems, where potential solutions are tried based on a Bayesian posterior over their expected outcome. Here we consider TS from multivariate normal (MVN) priors for multi-objective bandits. We introduce the concept of preference radius, providing the tolerance range over objective value estimations such that the best option given the preference function remains unchanged. We use this concept to provide a theoretical analysis of TS from MVN priors. Finally, we perform empirical experiments to support the theoretical results and also highlight the importance of tackling multi-objective bandits problems as such, instead of scalarizing them under the traditional bandits setting.

Multi-objective bandits

A multi-objective bandits problem is described by a (finite) set of actions A, also referred to as the design space, each of which is associated with a d-dimensional expected outcome µ_a = (µ_{a,1}, …, µ_{a,d}) ∈ X ⊆ R^d. For simplicity, we assume that the objective space is X = [0, 1]^d. In this episodic game, an agent interacts with an environment characterized by a preference function f. The agent iteratively chooses to perform an action a(t) and obtains a noisy observation z(t) of µ_{a(t)}.
An algorithm for a multi-objective bandits problem is a (possibly randomized) method for choosing which action to play next, given a history of previous choices and obtained outcomes, H_t = {a(s), z(s)}_{s=1}^{t−1}. Let O = argmax_{a∈A} f(µ_a) and let ⋆ ∈ O denote the optimal action. The optimality gap ∆_a = f(µ⋆) − f(µ_a) measures the expected loss of playing action a instead of the optimal action. The agent's goal is to design an algorithm with low expected (cumulative) regret, also known as the scalarized regret [5]:

R(T) = Σ_{t=1}^T ( f(µ⋆) − f(µ_{a(t)}) ) = Σ_{t=1}^T Σ_{a∈A} P[a(t) = a] ∆_a.   (1)

This quantity measures the expected performance of the algorithm compared to the expected performance of an optimal algorithm given knowledge of the outcome distributions, i.e., always sampling from the distribution whose expectation maximizes f. Typically, we assume that the algorithm maintains one estimate θ_a(t) per action a at time t. Let O(t) = argmax_{a∈A} f(θ_a(t)) denote the set of actions with an estimate maximizing f. The algorithm faces a trade-off between playing an action a(t) ∈ O(t) and choosing to gather an additional sample from a relatively unexplored action in order to improve its estimate. Alg. 1 describes this multi-objective bandits problem.

Notation: scalars are written unbolded and vectors are boldfaced. The operators +, −, ×, and ÷ applied to a vector v = (v₁, …, v_d) and a scalar s correspond to the operation between each item of v and s, e.g., v + s = (v₁ + s, …, v_d + s). Applied to two vectors v = (v₁, …, v_d) and u = (u₁, …, u_d), they correspond to itemwise operations, e.g., v + u = (v₁ + u₁, …, v_d + u_d).

Algorithm 1 Multi-objective bandits setting
On each episode t ≥ 1:
1. The agent chooses an action a(t) to play given O(t).
2. The agent observes z(t) = µ_{a(t)} + ξ(t), where the ξ(t) are i.i.d. random vectors.
3. The agent updates its estimates.

Figure 1: Example of dominated (black) and non-dominated (white) options.

In many situations, the environment providing the preference function is a person; let us call her the expert user. Unfortunately, people are generally unable to scalarize their choices and preferences, and therefore cannot explicitly provide their preference function. However, given several options, users can tell which one(s) they prefer (that is, O(t)), and can thus be used as a black box providing feedback in the learning loop.
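To make the protocol of Alg. 1 concrete, here is a minimal Python sketch of the episodic loop. The three-action environment, the Gaussian noise model, and the greedy action selection are illustrative assumptions only; a real algorithm would also handle the exploration trade-off discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-action environment with d = 2 objectives (rows are mu_a).
mu = np.array([[0.6, 0.4],
               [0.3, 0.7],
               [0.5, 0.5]])

def f(x):
    # Example preference function: linear scalarization with equal weights.
    return 0.5 * x[0] + 0.5 * x[1]

estimates = np.zeros_like(mu)   # theta_a(t); here, plain empirical means
counts = np.zeros(len(mu))

for t in range(1000):
    # 1. Choose a(t) given O(t) (greedy here, for illustration).
    a = int(np.argmax([f(theta) for theta in estimates]))
    # 2. Observe z(t) = mu_{a(t)} + xi(t) with i.i.d. noise xi(t).
    z = mu[a] + rng.normal(0.0, 0.1, size=2)
    # 3. Update the estimates.
    counts[a] += 1
    estimates[a] += (z - estimates[a]) / counts[a]
```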
Pareto-optimality

Given two d-dimensional options x = (x₁, …, x_d) and y = (y₁, …, y_d), x is said to dominate, or Pareto-dominate, y (denoted x ⪰ y) if and only if x_i > y_i for at least one i and x_i ≥ y_i otherwise. The dominance is strict (denoted x ≻ y) if and only if x_i > y_i for all i = 1, …, d. Finally, the two vectors are incomparable (denoted x ∥ y) if x ⋡ y and y ⋡ x. Pareto-optimal options represent the best compromises amongst the objectives and are the only options that need to be considered in an application. We say that these options constitute the Pareto front P = {a ∈ A : ∄ b ∈ A such that µ_b ⪰ µ_a}. Fig. 1 shows an example of dominated and non-dominated expected outcomes in a d = 2 objective space. A user facing a multi-criteria decision-making problem must select her preferred non-dominated option; dominated options are obviously discarded by default.
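The dominance relations and the Pareto front translate directly into code. A minimal sketch (the function names and example outcomes are ours, not from the paper):

```python
import numpy as np

def dominates(x, y):
    """x Pareto-dominates y: no worse on every objective, better on at least one."""
    return bool(np.all(x >= y) and np.any(x > y))

def pareto_front(means):
    """Indices of the actions whose expected outcome no other action dominates."""
    return [a for a, mu_a in enumerate(means)
            if not any(dominates(mu_b, mu_a)
                       for b, mu_b in enumerate(means) if b != a)]

means = np.array([[0.2, 0.8], [0.5, 0.5], [0.4, 0.4], [0.8, 0.1]])
print(pareto_front(means))  # [0, 1, 3]: action 2 is dominated by action 1
```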
Related Works

The multi-objective bandits problem has already been addressed in the a posteriori setting, where the goal is to discover the whole Pareto front for a posteriori decision making [5, 9]. This is different from the a priori optimization problem tackled here. Algorithms in the a posteriori setting aim to simultaneously minimize the Pareto-regret and the unfairness metrics. Also known as the ε-distance [6], the Pareto-regret associated with playing action a is the minimum value ε_a such that µ_a + ε_a is not dominated by any other action. In other words, any action standing on the front is considered equally good by the expert user. This is like considering that O = P, which corresponds to the preference function f(µ⋆) = 1, f(µ_a) = 1 − ε_a, such that ∆_a = ε_a. Note that any algorithm optimizing a single objective could minimize the Pareto-regret regardless of the other objectives. This is addressed by the unfairness metric, which measures the disparity in the number of plays of non-dominated actions, the idea being to force algorithms to explore the whole Pareto front evenly.

In MOO settings [10], the goal is to identify the Pareto-optimal set P without evaluating all actions. The quality of a solution S is typically given by the hypervolume error V(P) − V(S), where V(P) is the volume enclosed between the origin and {µ_a}_{a∈P} (and similarly for S). However, the hypervolume error gives no information about the quality of the estimation of actions. Identifying the Pareto front alone does not guarantee that the actions are well estimated and, therefore, that an expert user choice based on these estimations would lead to the right choice.

Preference radius

Let θ_a(t) denote the estimate associated with action a on episode t and let P(t) = {a ∈ A : ∄ b ∈ A such that θ_b(t) ≻ θ_a(t)} denote the estimated Pareto front given these options. By definition, the optimal options satisfy O(t) ⊆ P(t). Let B(c, r) = {x ∈ X : |x_i − c_i| < r, i = 1, …, d} denote a ball of center c and radius r. In order to characterize the difficulty of a multi-objective bandits setting, we introduce the following quantity.
Definition 1. For each action a ∈ A, we define the preference radius ρ_a as any radius such that, if θ_a(t) ∈ B(µ_a, ρ_a) for all actions, then ∃⋆ ∈ O : ⋆ ∈ O(t) and a ∉ O(t) for all a ∉ O.

The radii correspond to the robustness of the preference function, that is, to which extent actions can be poorly estimated simultaneously before the set of optimal options changes. The radius ρ_a is directly linked to the gap ∆_a = f(µ⋆) − f(µ_a). For a suboptimal action, a large radius indicates that this action is far from being optimal. Also, the preference radii of suboptimal actions depend on the preference radius of the optimal action(s): larger optimal-action radii imply smaller radii for suboptimal actions. Note that if all action estimates stand in their preference balls, being greedy is optimal.
Let α₁, …, α_d ∈ [0, 1] denote weights such that Σ_{i=1}^d α_i = 1. The weighted L_p metric f(x) = (Σ_{i=1}^d α_i x_i^p)^{1/p} with p ≥ 1 is known as the linear scalarization when p = 1 and as the Chebyshev scalarization when p = ∞. The following examples show the link between the preference radii and the gap for these two common functions.

Example 1 (Linear). The linear scalarization function is given by

f(x) = Σ_{i=1}^d α_i x_i.

Consider the optimal action ⋆ and the suboptimal action a. By definition of the preference radii, we have that

min_{θ⋆ ∈ B(µ⋆, ρ⋆)} f(θ⋆) > max_{θ_a ∈ B(µ_a, ρ_a)} f(θ_a)
Σ_{i=1}^d (α_i µ_{⋆,i} − α_i ρ⋆) > Σ_{i=1}^d (α_i µ_{a,i} + α_i ρ_a)
f(µ⋆) − ρ⋆ > f(µ_a) + ρ_a
∆_a > ρ⋆ + ρ_a.

Figure 2: Examples of preference radii around the optimal (white) and suboptimal (black) actions given a linear preference function.

Fig. 2 shows examples of preference radii with a linear preference function.
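As a worked instance with made-up numbers: take d = 2, α = (0.5, 0.5), µ⋆ = (0.8, 0.6), and µ_a = (0.5, 0.5). Then f(µ⋆) = 0.5 · 0.8 + 0.5 · 0.6 = 0.7, f(µ_a) = 0.5, and ∆_a = 0.2, so any radii with ρ⋆ + ρ_a < 0.2 (e.g., ρ⋆ = ρ_a = 0.09) guarantee that the true optimal action remains the estimated best option, even when every estimate is off by up to its radius.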
Example 2 (Chebyshev). The Chebyshev scalarization [3] function is given by

f(x) = max_{1≤i≤d} α_i x_i.

Consider the optimal and suboptimal actions ⋆ and a, and let

i⋆ = argmax_{1≤i≤d} α_i (µ_{⋆,i} − ρ⋆),   i_a = argmax_{1≤i≤d} α_i (µ_{a,i} + ρ_a).

By definition of the preference radii, we have that

min_{θ⋆ ∈ B(µ⋆, ρ⋆)} f(θ⋆) > max_{θ_a ∈ B(µ_a, ρ_a)} f(θ_a)
max_{1≤i≤d} α_i (µ_{⋆,i} − ρ⋆) > max_{1≤i≤d} α_i (µ_{a,i} + ρ_a)
α_{i⋆} µ_{⋆,i⋆} − α_{i⋆} ρ⋆ > α_{i_a} µ_{a,i_a} + α_{i_a} ρ_a
f(µ⋆) − α_{i⋆} ρ⋆ > f(µ_a) + α_{i_a} ρ_a
∆_a > α_{i⋆} ρ⋆ + α_{i_a} ρ_a.

The difficulty here is that i⋆ and i_a depend on ρ⋆ and ρ_a, respectively. In a 2-objective setting, we can define

τ⋆ = (α₁ µ_{⋆,1} − α₂ µ_{⋆,2}) / (α₁ − α₂),   τ_a = (α₁ µ_{a,1} − α₂ µ_{a,2}) / (α₁ − α₂)

as thresholds that determine which objective attains the maximum: i⋆ switches from one objective to the other as ρ⋆ crosses τ⋆, and similarly i_a switches as ρ_a crosses τ_a.

Figure 3: Examples of preference radii around the optimal action (white) and suboptimal actions (black) given a Chebyshev preference function.

Fig. 3 shows examples of preference radii with a Chebyshev preference function.
Outside L_p metrics, other scalarization functions are often based on constraints. For example, using the ε-constraint scalarization technique, a user assigns a constraint to every objective except a target objective ℓ. All options that fail to respect one of the constraints receive a value of 0, while the options that respect all constraints get a value of x_ℓ. The following example shows the relation between the preference radius and the gap given a preference function articulated as an ε-constraint scalarization.

Example 3 (ε-constraint). The ε-constraint function is given by

f(x) = x_ℓ if x_i ≥ ε_i for all i ∈ {1, …, d}, i ≠ ℓ, and f(x) = 0 otherwise.

Consider the optimal and suboptimal actions ⋆ and a. By definition of the preference radii, we have that

ρ⋆ ≤ min_{1≤i≤d, i≠ℓ} (µ_{⋆,i} − ε_i).

We decompose ρ_a = ρ_a′ + ρ_a″, where ρ_a′ = max{0, max_{1≤i≤d, i≠ℓ} (ε_i − µ_{a,i})} denotes the radius required in order for action a to respect the constraints, that is, to obtain a positive preference value, and ρ_a″ denotes the leftover leading to a gap reduction. Finally, we have that

µ_{⋆,ℓ} − ρ⋆ > µ_{a,ℓ} + ρ_a′ + ρ_a″   and   ∆_a > ρ⋆ + ρ_a.

Figure 4: Examples of preference radii around the optimal (white) and suboptimal (black) actions given two different configurations of ε-constraint: (a) ℓ = 1; (b) ℓ = 2.

Fig. 4 shows examples of preference radii with ε-constraint preference functions.
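A compact sketch of the three preference-function families in Python. The helper names, weights, and constraint levels are illustrative assumptions:

```python
import numpy as np

def linear(x, alpha):
    # Example 1: weighted sum with weights summing to 1.
    return float(np.dot(alpha, x))

def chebyshev(x, alpha):
    # Example 2: weighted maximum over the objectives.
    return float(np.max(np.asarray(alpha) * np.asarray(x)))

def eps_constraint(x, eps, target):
    # Example 3: value of the target objective if every other objective i
    # satisfies x[i] >= eps[i]; 0 as soon as one constraint is violated.
    ok = all(x[i] >= eps[i] for i in range(len(x)) if i != target)
    return float(x[target]) if ok else 0.0

mu_star, mu_a = [0.8, 0.6], [0.5, 0.5]
alpha = [0.5, 0.5]
print(linear(mu_star, alpha) - linear(mu_a, alpha))       # linear gap: 0.2
print(chebyshev(mu_star, alpha), chebyshev(mu_a, alpha))  # 0.4 vs 0.25
print(eps_constraint(mu_star, eps=[0.7, 0.0], target=1))  # 0.6 (constraint met)
print(eps_constraint(mu_a, eps=[0.7, 0.0], target=1))     # 0.0 (0.5 < 0.7)
```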
Algorithm 2 Thompson sampling from MVN priors
for all episodes t ≥ 1 do
   for all actions a ∈ A do
      sample θ_a(t) ~ N_d( (I_d + N_a(t)I_d)^{-1} N_a(t) µ̂_a(t), (I_d + N_a(t)I_d)^{-1} )
   end for
   O(t) = argmax_{a∈A} f(θ_a(t))
   play a(t) ∈ O(t) and observe z(t)
end for

The Thompson sampling (TS) [8] algorithm maintains a posterior distribution π_a(t) on the mean µ_a given a prior and the history of observations H_t. On each episode t, one option θ_a(t) is sampled from each posterior distribution π_a(t), and the algorithm selects a(t) ∈ O(t); recall that O(t) = argmax_{a∈A} f(θ_a(t)). Therefore P[a(t) = a] is proportional to the posterior probability that a maximizes the preference function given the history H_t. Let N_a(t) = Σ_{s=1}^{t−1} I[a(s) = a] denote the number of times action a has been played up to episode t. Also let

µ̂_a(t) = (1/N_a(t)) Σ_{s=1,…,t−1 : a(s)=a} z(s)   and   Σ̂_a(t) = (1/(N_a(t) − 1)) Σ_{s=1,…,t−1 : a(s)=a} (z(s) − µ̂_a(t))(z(s) − µ̂_a(t))ᵀ.

Let Σ₀ and µ₀ denote the priors. For MVN priors, the posterior over µ_a is given by the MVN distribution N_d(µ̃_a(t), Σ̃_a(t)), where

Σ̃_a(t) = (Σ₀^{-1} + N_a(t) Σ_a^{-1})^{-1}   and   µ̃_a(t) = Σ̃_a(t) (Σ₀^{-1} µ₀ + N_a(t) Σ_a^{-1} µ̂_a(t))

for a known covariance matrix Σ_a. Since assuming that Σ_a is known might be unrealistic in practice, one can consider the non-informative covariance Σ_a = I_d. With the non-informative priors µ₀ = 0_{d×1} and Σ₀ = I_d, where 0_{d×1} indicates a d-element column vector of zeros and I_d the d × d identity matrix, this corresponds to a direct extension of the one-dimensional TS from Gaussian priors [2]. Alg. 2 shows the resulting TS procedure from MVN priors.
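A compact Python sketch of Alg. 2 in the non-informative case µ₀ = 0_{d×1} and Σ₀ = Σ_a = I_d, where the posterior reduces to N_d(N_a(t)µ̂_a(t)/(N_a(t)+1), I_d/(N_a(t)+1)). The simulated environment, noise level, and function names are our own assumptions:

```python
import numpy as np

def ts_mvn(mu, f, T=10_000, noise=0.1, seed=0):
    """Thompson sampling from MVN priors (Alg. 2), non-informative case.

    mu: (N, d) array of true expected outcomes (used only to simulate z(t));
    f:  preference function mapping a d-vector to a scalar.
    """
    rng = np.random.default_rng(seed)
    n_actions, d = mu.shape
    counts = np.zeros(n_actions)       # N_a(t)
    means = np.zeros((n_actions, d))   # empirical means mu_hat_a(t)
    regret = 0.0
    best = max(f(m) for m in mu)
    for t in range(T):
        # Sample theta_a(t) ~ N_d(N_a mu_hat_a / (N_a + 1), I_d / (N_a + 1)).
        scale = np.sqrt(1.0 / (counts + 1.0))[:, None]
        theta = (counts[:, None] * means) / (counts[:, None] + 1.0) \
                + scale * rng.standard_normal((n_actions, d))
        a = int(np.argmax([f(th) for th in theta]))  # a(t) in O(t)
        z = mu[a] + rng.normal(0.0, noise, size=d)   # observe z(t)
        counts[a] += 1
        means[a] += (z - means[a]) / counts[a]
        regret += best - f(mu[a])
    return regret

# Usage on a hypothetical 2-objective instance:
mu = np.array([[0.6, 0.4], [0.3, 0.7], [0.5, 0.5]])
print(ts_mvn(mu, lambda x: 0.5 * x[0] + 0.5 * x[1]))
```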
Proposition 1. Assuming σ-sub-Gaussian noise with σ ≤ 1/(4d), the expected regret of TS from MVN priors (Alg. 2) is bounded by

R(T) ≤ Σ_{a∈A, a≠⋆} [ 2(C(d) + 4d)(1 + σ)² ∆_a ln(dT∆_a²)/ρ⋆² + 4/∆_a + 2∆_a ln(dT∆_a²)/(ρ_a − r_a)² + 2σ² ∆_a ln(dT∆_a²)/r_a² ],

where ρ⋆ and ρ_a are preference radii, r_a < ρ_a, and C(d) is a constant such that e^{−√i/(√(2πd) ln i)^d} ≤ d/i² for all i ≥ C(d) (see Remark 1).
Theorem 1. Assume either a linear (Ex. 1), Chebyshev (Ex. 2), or ε-constraint (Ex. 3) preference function. Assuming σ-sub-Gaussian noise with σ ≤ 1/(4d), the expected regret of TS from MVN priors (Alg. 2) is bounded by

R(T) ≤ Σ_{a∈A, a≠⋆} [ (8C(d) + 24d + 18 + 72σ²)(1 + σ)² ln(dT∆_a²)/∆_a + 4/∆_a ],

where C(d) is a constant such that e^{−√i/(√(2πd) ln i)^d} ≤ d/i² for all i ≥ C(d) (see Remark 1). This regret bound is of order O(√(dNT) ln d + √(dNT ln N)), where N = |A|. More specifically, for d ≤ ln N, it is of order O(√(dNT ln N)).
Remark 1. A suitable finite constant C(d) exists for any d ∈ ℕ: for d = 1, 2, 3, and so on, C(d) can be taken as a fixed power of e that grows with the dimension d.

For d = 1, the order of the regret bound given by Theorem 1 matches the order of the regret bound for TS from Gaussian priors in the single-objective bandits setting [2], assuming an objective space X = [0, 1]^d. This means that the more dimensions we have, the less noise we can bear in order for these bounds to hold, given the provided analysis.

Analysis

In this section we start by proving Prop. 1, which provides a regret bound for TS with MVN priors that is independent of the preference function. Then we use the relations between the gap and the preference radius in three preference function families to obtain Theorem 1.

Proof of Prop. 1
The following analysis extends the work for the 1-dimensional setting [2] to the d-dimensional setting. We rewrite Eq. 1 as

R(T) = Σ_{a∈A, a≠⋆} ∆_a Σ_{t=1}^T P[a(t) = a],

where we control Σ_{t=1}^T P[a(t) = a]. The proof relies on several facts (see Appendix A) that extend Chernoff's inequalities and (anti-)concentration bounds from the 1-dimensional setting to the d-dimensional setting using the concepts of Pareto-domination and preference radius. We introduce the following quantities and events to control the quality of mean estimations and the quality of samples.

Definition 2 (Quantities r_a). For each suboptimal action a, we choose a quantity r_a < ρ_a, where ρ_a is a preference radius. By definition of the preference radii, we have µ_a ≺ µ_a + r_a ≺ µ_a + ρ_a. Recall that f(x) < f(y) if x ≺ y. Hence we have f(µ_a) < f(µ_a + r_a) < f(µ_a + ρ_a) < f(µ⋆ − ρ⋆).

Definition 3 (Events E^µ_a(t), E^θ_a(t)). For each suboptimal action a, define E^µ_a(t) as the event that µ̂_a(t) ≺ µ_a + r_a, and define E^θ_a(t) as the event that θ_a(t) ≺ µ_a + ρ_a. More specifically, they are the events that suboptimal action a is well estimated and well sampled, respectively.

Definition 4 (Filtration F_t). Define the filtration F_t = {a(s), z(s)}_{s=1,…,t−1}.

For suboptimal action a, we decompose

Σ_{t=1}^T P[a(t) = a] = Σ_{t=1}^T P[a(t) = a, E^µ_a(t), E^θ_a(t)]  (A)
                      + Σ_{t=1}^T P[a(t) = a, E^µ_a(t), ¬E^θ_a(t)]  (B)
                      + Σ_{t=1}^T P[a(t) = a, ¬E^µ_a(t)]  (C)

and control each part separately. In (A), a is played while being well estimated and well sampled. We control this by bounding poor estimation and poor samples for the optimal action. In (B), a is played while being well estimated but poorly sampled. We control this using Gaussian concentration inequalities. In (C), a is played while being poorly estimated. We control this using Chernoff inequalities. Gathering the following results together and summing over all suboptimal actions, we obtain Prop. 1.

By definition of TS, for suboptimal a to be played on episode t, we must (at least) have f(θ_a(t)) ≥ f(θ⋆(t)). By definition of event E^θ_a(t) and the preference radii, we have f(θ_a(t)) < f(θ⋆(t)) if θ⋆(t) ≻ µ⋆ − ρ⋆. Let τ_k denote the time step at which action ⋆ is selected for the kth time, with τ₀ = 0. Note that for any action a, τ_k > T for k > N_a(T) and τ_T ≥ T. Then

(A) = Σ_{t=1}^T P[a(t) = a, E^µ_a(t), E^θ_a(t) | F_t]
    ≤ Σ_{t=1}^T P[f(θ_a(t)) > f(θ⋆(t)), E^µ_a(t), E^θ_a(t) | F_t]
    ≤ Σ_{t=1}^T P[θ⋆(t) ⊁ µ⋆ − ρ⋆ | F_t]
    ≤ Σ_{k=0}^L E[ Σ_{t=τ_k+1}^{τ_{k+1}} I[θ⋆(t) ⊁ µ⋆ − ρ⋆] ] + Σ_{t=τ_{L+1}}^T P[θ⋆(t) ⊁ µ⋆ − ρ⋆, N⋆(t) > L | F_t].   (2)

The second inequality uses the fact that the sampling of θ⋆(t) is independent of the events E^µ_a(t) and E^θ_a(t).
The last inequality uses the observation that P[θ⋆(t) ⊁ µ⋆ − ρ⋆ | F_t] is fixed given F_t and that it changes only when π⋆(t) changes, that is, only when action ⋆ is played. The first sum counts the number of episodes required before action ⋆ has been played L times. The second counts the number of episodes where ⋆ is badly sampled after having been played L times. We use the following lemma to control the first summation; see Appendix B.

Lemma 1 (Based on Lemma 6 from [2]). Let τ_k denote the time of the kth selection of action ⋆. Then, for any d ∈ ℕ and σ ≤ 1/(4d),

E[ Σ_{t=τ_k+1}^{τ_{k+1}} P[θ⋆(t) ⊁ µ⋆ − ρ⋆ | F_t] ] ≤ C(d) + 4d,

where C(d) is a constant such that e^{−√i/(√(2πd) ln i)^d} ≤ d/i² for all i ≥ C(d).

Now we bound the second summation in Eq. 2 by controlling the probability of poorly sampling θ⋆(t) when N⋆(t) > L. Let E⋆(t) denote the event that µ̂⋆(t) ≻ µ⋆ − σρ⋆/(1 + σ). Then we have

P[θ⋆(t) ⊁ µ⋆ − ρ⋆, N⋆(t) > L | F_t]
≤ P[θ⋆(t) ⊁ µ̂⋆(t) − ρ⋆/(1 + σ), E⋆(t), N⋆(t) > L | F_t] + P[¬E⋆(t), N⋆(t) > L | F_t]
≤ P[θ⋆(t) ∉ B(µ̂⋆(t), ρ⋆/(1 + σ)), E⋆(t), N⋆(t) > L | F_t] + P[µ̂⋆(t) ∉ B(µ⋆, σρ⋆/(1 + σ)), N⋆(t) > L | F_t]
≤ (d/2) e^{−Lρ⋆²/(2(1+σ)²)} + 2d e^{−Lρ⋆²/(2(1+σ)²)}.

The last inequality uses Facts 1 and 2. With L = 2(1 + σ)² ln(dT∆_a²)/ρ⋆² we obtain

P[θ⋆(t) ⊁ µ⋆ − ρ⋆, N⋆(t) > L | F_t] ≤ 5/(2T∆_a²).   (3)

We use Lem. 1 and Eq. 3 in Eq. 2 to obtain

(A) ≤ (2C(d) + 8d)(1 + σ)² ln(dT∆_a²)/ρ⋆² + 5/(2∆_a²)

for σ ≤ 1/(4d), where C(d) is as above.

We control the probability of badly sampling suboptimal action a given that it has been played at least L times. Recall that filtration F_t is such that E^µ_a(t) holds. To that extent we decompose

(B) = Σ_{t=1}^T P[a(t) = a, ¬E^θ_a(t), E^µ_a(t), N_a(t) ≤ L | F_t] + Σ_{t=1}^T P[a(t) = a, ¬E^θ_a(t), E^µ_a(t), N_a(t) > L | F_t]
    ≤ E[ Σ_{t=1}^T I[a(t) = a, N_a(t) ≤ L] ] + Σ_{t=1}^T P[θ_a(t) ⊀ µ_a + ρ_a, N_a(t) > L | F_t]
    ≤ L + Σ_{t=1}^T P[θ_a(t) ⊀ µ̂_a(t) + (ρ_a − r_a), N_a(t) > L | F_t]
    ≤ L + T (d/2) e^{−L(ρ_a − r_a)²/2}.

The first inequality uses the observation that P[a(t) = a | F_t] is fixed given F_t and the definition of event E^θ_a(t). The second inequality uses the fact that event E^µ_a(t) holds. The last inequality uses Fact 2. With L = 2 ln(dT∆_a²)/(ρ_a − r_a)² we obtain

(B) ≤ 2 ln(dT∆_a²)/(ρ_a − r_a)² + 1/(2∆_a²).

Similarly to what has been done previously with (B), we can control the probability of badly estimating suboptimal action a given that it has been played at least L times. Then we have

(C) ≤ Σ_{t=1}^T P[a(t) = a, ¬E^µ_a(t), N_a(t) ≤ L | F_t] + Σ_{t=1}^T P[a(t) = a, ¬E^µ_a(t), N_a(t) > L | F_t]
    ≤ E[ Σ_{t=1}^T I[a(t) = a, N_a(t) ≤ L] ] + Σ_{t=1}^T P[¬E^µ_a(t), N_a(t) > L]
    ≤ L + T d e^{−Lr_a²/(2σ²)}.
The second inequality uses the observation that P[a(t) = a | F_t] is fixed given F_t. The last inequality uses Fact 1. With L = 2σ² ln(dT∆_a²)/r_a² we obtain

(C) ≤ 2σ² ln(dT∆_a²)/r_a² + 1/∆_a².

Proof of Theorem 1

By definition of the preference radii, given a linear (Ex. 1), Chebyshev (Ex. 2), or ε-constraint preference function (Ex. 3), one can take ρ⋆ = ρ_a = ∆_a/2 and r_a = ∆_a/6. Using these values in Prop. 1, we obtain Theorem 1:

R(T) ≤ Σ_{a∈A, a≠⋆} [ (8C(d) + 24d + 18 + 72σ²)(1 + σ)² ln(dT∆_a²)/∆_a + 4/∆_a ].

Let ∆_a = δ_a √(dN ln N / T), for δ_a ∈ (0, √(T/(dN ln N))]. The regret is bounded by

R(T) ≤ (8C(d) + 24d + 18 + 72σ²)(1 + σ)² √(NT) ln(d²N ln N) / (δ_a √(d ln N)) + 4√(NT) / (δ_a √(d ln N))

with σ ≤ 1/(4d), that is, of order O(√(dNT) ln d + √(dNT ln N)). More specifically, for d ≤ ln N, the regret bound is of order O(√(dNT ln N)).

Experiments

Given that the preference function is known a priori, one might be tempted to formalize the problem under the traditional, single-objective, bandits setting. This would correspond to optimizing over the expected value of the preference function, E[f(z(t)) | a(t) = a], instead of f(µ_a). In the following experiments, we compare the performance of the TS algorithm from MVN priors (Alg. 2) in the multi-objective bandits scheme (Alg. 1) with the one-dimensional TS from Gaussian priors [2] applied to the multi-objective bandits problem formalized under the traditional bandits setting (Alg. 3).

Algorithm 3 Thompson sampling from Gaussian priors [2]
for all episodes t ≥ 1 do
   for all actions a ∈ A do
      sample θ_a(t) ~ N( N_a(t)µ̂_a(t)/(N_a(t) + 1), 1/(N_a(t) + 1) )
   end for
   play a(t) = argmax_{a∈A} θ_a(t) and observe f(z(t))
end for

We randomly generate a 10-action setting with d = 2 objectives, such that the objective space is X = [0, 1]². We consider settings where outcomes are sampled from multivariate normal distributions with covariance

Σ_a = [0.10  0.05; 0.05  0.10]   for all a ∈ A,

and from multi-Bernoulli distributions. A sample z ~ B_d(µ) from a d-dimensional multi-Bernoulli distribution with mean µ is such that z_i ~ B(µ_i). Experiments are conducted using the linear preference function f(x) = 0.4x₁ + 0.6x₂, x ∈ X, and the ε-constraint preference function

f(x) = x₂ if x₁ ≥ 0.5, and f(x) = 0 otherwise,   x ∈ X.
Table 1: Expected outcomes µ_a for all actions, with the associated preference value f(µ_a) and gap ∆_a under each preference function. Entries marked · are unavailable.

µ_a        | f(µ_a) Linear | f(µ_a) ε-constraint | ∆_a Linear | ∆_a ε-constraint
(·, 0.46)  | 0.50          | 0.46                | 0.17       | 0.26
(·, 0.26)  | 0.46          | 0.26                | 0.21       | 0.46
(·, 0.79)  | 0.61          | 0.00                | 0.06       | 0.72
(·, 0.50)  | 0.56          | 0.50                | 0.11       | 0.22
(·, 0.42)  | 0.54          | 0.42                | 0.13       | 0.29
(·, 0.72)  | 0.65          | 0.72                | 0.02       | 0.00
(·, 0.62)  | 0.57          | 0.00                | 0.10       | 0.72
(·, 0.84)  | 0.56          | 0.00                | 0.11       | 0.72
(·, ·)     | 0.67          | ·                   | 0.00       | ·
(·, 0.44)  | 0.51          | 0.44                | 0.16       | 0.28

Figure 5: Expected outcomes for optimal (white) and suboptimal (black) actions: (a) linear with α₁ = 0.4, α₂ = 0.6; (b) ε-constraint with ℓ = 2, ε₁ = 0.5. The dotted line shows the preference function (left) and the ε constraint (right).

Tab. 1 gives the expected outcomes for all actions along with the associated preference value and gap given the preference function. Fig. 5 shows the expected outcomes and illustrates the preference functions. We observe that the optimal action is different for the two preference functions. Each experiment is conducted over 10,000 episodes and repeated 100 times.
Repetitions have been made such that the noise ξ(t) is the same for all tested approaches on a given repetition; therefore we can compare the performance of the different approaches on the same repetition. The goal is to minimize the cumulative regret (Eq. 1).

Fig. 6 shows the cumulative regret of TS from MVN priors and of TS from Gaussian priors (in the traditional bandits formulation) for both outcome distributions and both preference functions. We observe that the cumulative regret growth rate for TS from MVN priors appears to match the order of the provided theoretical bounds (Theorem 1). Results also show that, though it might be appealing to address a multi-objective problem as a single-objective bandits problem, it is not a good idea. Consider the ε-constraint preference function used in this experiment: it evaluates to 0 if z₁(t) < 0.5 and to z₂(t) otherwise. With multi-Bernoulli outcomes, for example, this means that P[f(z(t)) = 1] = µ_{a(t),1} µ_{a(t),2}. Given that, argmax_{a∈A} f(µ_a) ≠ argmax_{a∈A} E[f(z(t)) | a(t) = a]. Since the action considered as optimal in the single-objective formulation is not the same as the optimal action in the multi-objective problem, TS with Gaussian priors converges to the wrong action, hence the linear regret.

Figure 6: Cumulative regret over episodes for the tested outcome distributions and preference functions: (a) multi-Bernoulli, linear; (b) multi-Bernoulli, ε-constraint; (c) multivariate normal, linear; (d) multivariate normal, ε-constraint. Fat lines indicate the average over repetitions and dotted lines indicate each individual repetition.
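The argument above is easy to check numerically. A sketch with hypothetical Bernoulli means, using the experiment's ε-constraint form f(x) = x₂ when x₁ ≥ 0.5 and 0 otherwise:

```python
import numpy as np

mu = np.array([[0.6, 0.7],   # action 0: f(mu) = 0.7, E[f(z)] = 0.6 * 0.7 = 0.42
               [0.9, 0.5]])  # action 1: f(mu) = 0.5, E[f(z)] = 0.9 * 0.5 = 0.45

def f(x):
    return x[1] if x[0] >= 0.5 else 0.0

f_mu = [f(m) for m in mu]           # multi-objective objective values
ef_z = [m[0] * m[1] for m in mu]    # E[f(z)] under multi-Bernoulli outcomes:
                                    # f(z) = 1 iff z1 = 1 and z2 = 1
print(np.argmax(f_mu), np.argmax(ef_z))  # 0 vs 1: the scalarized problem
                                         # has a different optimal action
```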
Conclusion

In this work, we have addressed the online multi-objective optimization problem under the multi-objective bandits setting. Unlike previous formulations, we work in the a priori setting, where there exists a preference function to be maximized. However, acting in the proposed setting would not require the preference function to be known: it would be sufficient for an expert user to pick her preferred estimate among a set of options, with no requirement of providing an actual, real-valued, evaluation of each option. We have introduced the concept of preference radius to characterize the difficulty of a multi-objective setting through the robustness of the preference function to the quality of the available estimations. We have shown how this measure relates to the gap between the optimal action and the action recommended by a learning algorithm. We have used this new concept to provide a theoretical analysis of the Thompson sampling algorithm from multivariate normal priors in the multi-objective setting. More specifically, we were able to provide regret bounds for three families of preference functions. Empirical experiments confirmed the expected behavior of the multi-objective Thompson sampling in terms of cumulative regret growth. Results also highlight the important fact that one cannot simply reduce a multi-objective setting to a traditional, single-objective, setting, since this might cause a change in the optimal action. Future work includes the application of the proposed approach to a real-world problem.
Acknowledgements

This work was supported through funding from NSERC (Canada). We also thank Julien-Charles Lévesque for insightful comments and Annette Schwerdtfeger for proofreading.
References

[1] M. Abramowitz and I. A. Stegun. Handbook of Mathematical Functions: with Formulas, Graphs, and Mathematical Tables, volume 55. Courier Corporation, 1964.
[2] S. Agrawal and N. Goyal. Further optimal regret bounds for Thompson sampling. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pages 99–107, 2013.
[3] V. J. Bowman Jr. On the relationship of the Tchebycheff norm and the efficient frontier of multiple-criteria objectives. In Multiple Criteria Decision Making, pages 76–86. 1976.
[4] C. A. C. Coello, G. B. Lamont, D. A. Van Veldhuizen, D. E. Goldberg, and J. R. Koza. Evolutionary Algorithms for Solving Multi-Objective Problems. 2nd edition, 2007. ISBN 9780387310299.
[5] M. M. Drugan and A. Nowe. Designing multi-objective multi-armed bandits algorithms: A study. In Proceedings of the International Joint Conference on Neural Networks (IJCNN), 2013.
[6] M. Laumanns, L. Thiele, K. Deb, and E. Zitzler. Combining convergence and diversity in evolutionary multiobjective optimization. Evolutionary Computation, 10(3):263–282, 2002.
[7] P. Rigollet. 18.S997 High-Dimensional Statistics, Chapter 1, Spring 2015. MIT OpenCourseWare, Massachusetts Institute of Technology. https://ocw.mit.edu/courses/mathematics/18-s997-high-dimensional-statistics-spring-2015/lecture-notes/MIT18_S997S15_Chapter1.pdf (accessed March 15, 2017). License: Creative Commons BY-NC-SA.
[8] W. R. Thompson. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3):285–294, 1933.
[9] S. Q. Yahyaa, M. M. Drugan, and B. Manderick. Thompson sampling in the adaptive linear scalarized multi objective multi armed bandit. In Proceedings of the 7th International Conference on Agents and Artificial Intelligence (ICAART), pages 55–65, 2015.
[10] M. Zuluaga, G. Sergent, A. Krause, and M. Püschel. Active learning for multi-objective optimization. In Proceedings of the 30th International Conference on Machine Learning (ICML), pages 462–470, 2013.
A Technical Tools
Fact 1 (d-dimensional Chernoff). Let X₁, …, X_N be i.i.d. σ-sub-Gaussian random variables such that E[X₁] = µ. Let µ̂_N = (1/N) Σ_{i=1}^N X_i. Then, as shown by [7], for any a ≥ 0,

P[|µ̂_N − µ| ≥ a] ≤ 2e^{−Na²/(2σ²)}.

Now consider the multivariate setting where X₁, …, X_N are i.i.d. d-dimensional σ-sub-Gaussian variables such that E[X₁] = µ and µ̂_N = (1/N) Σ_{i=1}^N X_i. Then for any a ≥ 0,

P[µ̂_N ⪰ µ + a] = P[(µ̂_{N,1} ≥ µ₁ + a) ∧ ⋯ ∧ (µ̂_{N,d} ≥ µ_d + a)] ≤ e^{−dNa²/(2σ²)},
P[µ̂_N ⋠ µ + a] ≤ P[(µ̂_{N,1} ≥ µ₁ + a) ∨ ⋯ ∨ (µ̂_{N,d} ≥ µ_d + a)] ≤ d e^{−Na²/(2σ²)},
P[µ̂_N ∉ B(µ, a)] ≤ P[(|µ̂_{N,1} − µ₁| ≥ a) ∨ ⋯ ∨ (|µ̂_{N,d} − µ_d| ≥ a)] ≤ 2d e^{−Na²/(2σ²)}.

Fact 2 (d-dimensional Gaussian concentration). Let X be a Gaussian random variable with mean µ and standard deviation σ. The following concentration inequality is derived [2] from [1] for z ≥ 1:

P[|X − µ| > zσ] ≤ (1/2) e^{−z²/2}.

Now consider the multivariate setting where X denotes a d-dimensional Gaussian random variable with mean µ and diagonal covariance Σ. Then for z ≥ 1,

P[X ≻ µ + z√diag(Σ)] = P[(X₁ > µ₁ + zσ₁) ∧ ⋯ ∧ (X_d > µ_d + zσ_d)] ≤ ((1/2) e^{−z²/2})^d,
P[X ⊀ µ + z√diag(Σ)] ≤ P[(X₁ ≥ µ₁ + zσ₁) ∨ ⋯ ∨ (X_d ≥ µ_d + zσ_d)] ≤ (d/2) e^{−z²/2},
P[X ∉ B(µ, z√diag(Σ))] ≤ P[(|X₁ − µ₁| ≥ zσ₁) ∨ ⋯ ∨ (|X_d − µ_d| ≥ zσ_d)] ≤ (d/2) e^{−z²/2}.

Fact 3 (d-dimensional Gaussian anti-concentration). Let X be a Gaussian random variable with mean µ and standard deviation σ. The following anti-concentration inequality is derived [2] from [1] for z ≥ 1:

P[X > µ + zσ] ≥ (1/√(2π)) · z/(z² + 1) · e^{−z²/2}.

Now consider the multivariate setting where X denotes a d-dimensional Gaussian random variable with mean µ and diagonal covariance Σ. Then for z ≥ 1,

P[X ≻ µ + z√diag(Σ)] = P[(X₁ > µ₁ + zσ₁) ∧ ⋯ ∧ (X_d > µ_d + zσ_d)] ≥ ((1/√(2π)) · z/(z² + 1) · e^{−z²/2})^d.
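As a quick sanity check (a simulation sketch, not part of the analysis), the d-dimensional Chernoff ball bound of Fact 1 can be probed empirically; the parameters below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
d, N, sigma, a, trials = 3, 50, 0.2, 0.15, 20_000

# Empirical frequency of the mean estimate leaving the ball B(mu, a).
mu = np.full(d, 0.5)
samples = mu + sigma * rng.standard_normal((trials, N, d))  # Gaussian is sigma-sub-Gaussian
mu_hat = samples.mean(axis=1)
freq = np.mean(np.any(np.abs(mu_hat - mu) >= a, axis=1))

bound = 2 * d * np.exp(-N * a**2 / (2 * sigma**2))
print(freq, bound)  # the frequency should sit below the union bound
```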
B Proof of Lemma 1

Proof.
Let Θ_j denote a N_d(µ̂⋆(τ_j + 1), (I_d + N⋆(τ_j + 1) I_d)^{-1})-distributed multivariate normal random variable. Let G_j be a geometric variable denoting the number of consecutive independent trials until Θ_j ≻ µ⋆ − ρ⋆. Then observe that

E[ Σ_{t=τ_k+1}^{τ_{k+1}} P[θ⋆(t) ⊁ µ⋆ − ρ⋆ | F_t] ] ≤ E[G_j] = Σ_{i=1}^∞ P[G_j ≥ i].

We want to bound the expected value of G_j by a constant for all j. Consider any integer i ≥ 1, let z = √(ln i / d), and let MAX_i denote the maximum preference of i independent samples of Θ_j. We abbreviate µ̂⋆(τ_j + 1) as µ̂⋆ and N⋆(τ_j + 1) as N⋆ in the following. Then

P[G_j < i] ≥ P[MAX_i ≻ µ⋆ − ρ⋆]
          ≥ P[MAX_i ≻ µ̂⋆ + z/√(N⋆) | µ̂⋆ + z/√(N⋆) ⪰ µ⋆ − ρ⋆] · P[µ̂⋆ + z/√(N⋆) ⪰ µ⋆ − ρ⋆].

Using Fact 3, this gives

P[MAX_i ≻ µ̂⋆ + z/√(N⋆) | µ̂⋆ + z/√(N⋆) ⪰ µ⋆ − ρ⋆]
≥ 1 − (1 − ((1/√(2π)) · z/(z² + 1) · e^{−z²/2})^d)^i
= 1 − (1 − ((1/√(2π)) · √(ln i / d)/(ln i / d + 1) · i^{−1/(2d)})^d)^i
≥ 1 − (1 − (1/(√(2πd) · i^{1/(2d)} · ln i))^d)^i
≥ 1 − e^{−√i/(√(2πd) ln i)^d},

where the second inequality uses that ln(i)/d + 1 < i and the last inequality uses that 1 − x < e^{−x}. Also, using Fact 1, we have

P[µ̂⋆ ⪰ µ⋆ − z/√(N⋆)] ≥ 1 − d e^{−z²/(2σ²)} = 1 − d i^{−1/(2dσ²)}.

Substituting, we obtain

P[G_j < i] ≥ (1 − e^{−√i/(√(2πd) ln i)^d}) · (1 − d i^{−1/(2dσ²)}) ≥ 1 − d i^{−1/(2dσ²)} − e^{−√i/(√(2πd) ln i)^d}

and

E[G_j] = Σ_{i≥1} (1 − P[G_j < i]) ≤ Σ_{i≥1} ( d i^{−1/(2dσ²)} + e^{−√i/(√(2πd) ln i)^d} ) ≤ C(d) + 2d Σ_{i≥1} i^{−1/(2dσ²)},

where C(d) is such that e^{−√i/(√(2πd) ln i)^d} ≤ d i^{−1/(2dσ²)} for i ≥ C(d). We observe that σ ≤ 1/(4d) ensures 1/(2dσ²) ≥ 2, so that Σ_{i≥1} i^{−1/(2dσ²)} ≤ Σ_{i≥1} i^{−2} ≤ 2, and therefore E[G_j] ≤ C(d) + 4d, which proves the lemma.