Warm-starting Contextual Bandits: Robustly Combining Supervised and Bandit Feedback
Chicheng Zhang, Alekh Agarwal, Hal Daumé III, John Langford, Sahand N Negahban
Abstract
We investigate the feasibility of learning from a mix of both fully-labeled supervised data and contextual bandit data. We specifically consider settings in which the underlying learning signal may be different between these two data sources. Theoretically, we state and prove no-regret algorithms for learning that is robust to misaligned cost distributions between the two sources. Empirically, we evaluate some of these algorithms on a large selection of datasets, showing that our approach is both feasible and helpful in practice.
1. Introduction
In many real-world settings, a system must learn from multiple types of feedback; we consider the specific setting of learning jointly from fully labeled "supervised" examples and from online-feedback "contextual bandit" (abbrev. CB) examples. For instance, in a system that chooses personalized content to display on a webpage, an expert may be able to provide an initial set of fully labeled examples to get a system started. After deployment, however, the system can only measure its performance (e.g., dwell time) on the content it displays and not on other (counterfactual) options. In an automated translation system, professional translators can provide initial translations to seed a system, but the system may be able to further improve its performance based on, e.g., user satisfaction measures (Sokolov et al., 2015; Nguyen et al., 2017).

(Affiliations: Microsoft Research; University of Maryland; Yale University. Correspondence to: Chicheng Zhang.)

In both these settings (content display and translation), we desire an approach that is able to use the fully supervised expert data to "warm-start" a system, which later learns from CB feedback (Auer et al., 2002b; Langford & Zhang, 2007; Chu et al., 2011; Dudik et al., 2011; Agrawal & Goyal, 2013; Agarwal et al., 2014). Doing so has the added advantage of
ensuring that such a system does not need to suffer too much error in an initial exploration phase, which may be necessary in user-facing systems or in error- or safety-critical settings (Tewari & Murphy, 2017). However, it is generally unreasonable to assume that the expert supervision and the CB feedback in such settings are perfectly aligned: the "best" decision according to an expert may not necessarily match a user's choice. We need algorithms that operate well even in the case of unknown degrees of misalignment; we introduce a hypothesis class-specific notion of cost similarity used in our analysis, but not in our algorithms (§2). We also highlight how simple strategies for combining the two sources without robustness to misalignment can perform significantly worse than learning from the ground truth source alone (§2.1).

Furthermore, different applications can differ in terms of which source (supervised or CB) is considered "ground truth". For example, while the CB feedback from users is the better signal about their preferences in content personalization (§3), the expert translations provide the ground truth in the translation setting, for which user satisfaction is an imperfect proxy (§4). We develop algorithms for both settings, which effectively "search" for a good balance between fitting the CB feedback and the supervised labels. In both cases, we provide regret bounds showing the value of the complementary data sources, dependent on their cost discrepancy and respective sample sizes. Importantly, our theory shows that our methods perform close to an oracle that knows the similarity of the two sources beforehand and uses it to optimally weight their examples, with a small additional penalty from searching for this weighting.

Empirically, we perform experiments based on fully-labeled examples from which CB feedback is simulated.

(Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).)
We focus on the setting when CB data is ground truth and the supervised warm-start might have differing levels of bias. In an experimental study over hundreds of datasets (§5), we demonstrate the efficacy of our algorithm. As a snapshot, Figure 1 shows the empirical cumulative distribution functions (CDFs) of algorithms across a number of experimental conditions, where each (x, y) value on a curve indicates that there is a y fraction of experimental conditions where the normalized error of a method is below x (see §5 for a formal definition of normalized error). The plot aggregates across settings where the CB and supervised signals are perfectly aligned as well as where they are not. Overall, our main algorithm, namely ARRoW-CB with |Λ| = 8, outperforms all baselines in this aggregated summary, in particular beating the two other algorithms (ARRoW-CB with |Λ| = 2 and Sim-Bandit) that leverage both the CB and supervised sources. More detailed results are presented in §5.

Figure 1: Empirical CDFs of the performance of different methods across a number of datasets and experimental conditions (see §5 for descriptions of all algorithms, settings, and the aggregation method). Our method ARRoW-CB has a parameter |Λ|, and we evaluate it with |Λ| set to 8 and 2; Sim-Bandit is a baseline also leveraging the warm-start; Sup-Only and Majority learn using only the supervised source, and Bandit-Only using only the CB source. All CB methods use the ε-greedy strategy with ε = 0.0125.
Relation to prior work.
Theoretical studies of domain adaptation (Ben-David et al., 2010; Mansour et al., 2009) and of learning from multiple sources (Crammer et al., 2008) are the closest prior works. In those works, all data sources provide the same supervised feedback, rather than the supervised/CB modality we investigate here, in which the two sources carry very different information per sample. Another related line of work is on "safe" CB learning (Kazerouni et al., 2017; Sun et al., 2017), which maintains performance better than a baseline policy at all times and is somewhat related to our supervised ground truth setting. However, these works do not study the distributional mismatch concerns central to our work. Finally, there is a substantial literature on active learning from different data sources (Donmez & Carbonell, 2008; Urner et al., 2012; Yan et al., 2011; Malago et al., 2014; Zhang & Chaudhuri, 2015; Yan et al., 2018), combining multiple labeling oracles of varying quality. The CB setting studied here has important differences from active learning, and the techniques do not carry over directly.

(Sup-Only and Majority do not explore or update on CB examples; we plot the average costs of their policies over all CB examples.)
2. Notation and Problem Specification
We begin with some notation. For an event A, I(A) = 1 if A is true, and 0 otherwise. Denote by [K] the set {1, 2, ..., K}. We use 1_K to denote the all-ones vector in R^K and ∆_{K−1} for the (K−1)-dimensional probability simplex. In this paper, we study the problem of cost-sensitive interactive learning from multiple data sources. Specifically, we consider distributions over cost-sensitive examples (x, c), where x ∈ X is a context and c is a cost vector in [0, 1]^K, K being the number of actions (or "classes"). There are two distributions, D_s (supervised) and D_b (CB), which have identical marginals over the context x but different conditional distributions over cost vectors given x. We use the notation c^b (resp. c^s) to denote the cost vector c drawn from D_b (resp. D_s), to avoid writing D_b and D_s as subscripts in expectations. The interaction between the learner and the environment is described as follows:

Warm-start:
The learner receives S, a dataset of n_s fully supervised examples drawn i.i.d. from D_s.

Interaction:
For t = 1, 2, ..., n_b, the environment draws (x_t, c^b_t) ∼ D_b and reveals x_t to the learner, based on which the learner chooses a (possibly random) action a_t ∈ [K] and observes c^b_t(a_t), but not the cost of any other action.

In this paper, we focus on two learning settings: the CB ground truth setting and the supervised ground truth setting. In the CB ground truth setting (resp. supervised ground truth setting), the goal of the learner is to optimize the costs drawn from distribution D_b (resp. D_s). To help make decisions, the learner is given a finite policy class Π that contains policies π: X → [K]. (More generally, π: X → ∆_{K−1}, and π(x) is the distribution over actions given x.) The performance of the algorithm is measured by its regret to the retrospective-best policy in Π. We consider two notions of regret over the sequence ⟨x_t⟩_{t=1}^{n_b}, based on whether we consider the CB costs (c^b) or the supervised costs (c^s) as the ground truth:

CB: R^b(⟨x_t, a_t⟩_{t=1}^{n_b}) = Σ_{t=1}^{n_b} E[c^b(a_t) | x_t] − min_{π∈Π} Σ_{t=1}^{n_b} E[c^b(π(x_t)) | x_t],   (1)

supervised: R^s(⟨x_t, a_t⟩_{t=1}^{n_b}) = Σ_{t=1}^{n_b} E[c^s(a_t) | x_t] − min_{π∈Π} Σ_{t=1}^{n_b} E[c^s(π(x_t)) | x_t].   (2)

In the content recommendation example (CB ground truth), x_t encodes a user profile and the system predicts which articles (a_t) to display. Here, c^b can be the negative dwell-time of users and c^s is the annotation of editors, which can have disagreements with c^b. The learner aims to optimize the dwell time over all displayed articles.

In the translation example (supervised ground truth), x_t encodes the text to be translated and a_t encodes its translation.
Here, the learner aims to minimize errors against the expert translation (c^s) on the x_t's, despite the fact that the system never sees these costs in its interaction phase. Note that the learner only observes the user-feedback costs (c^b) in this interaction phase, which are imperfect proxies for c^s; the only direct observations of c^s are on the warm-start examples. Nevertheless, we seek to optimize the accuracy of the translations given to the users, and hence regret is still measured over the interaction phase.

The utility of non-ground-truth examples is different in the two learning settings. In the CB ground truth setting, relying on the CB examples alone is sufficient to ensure vanishing regret asymptotically; the supervised warm-start primarily helps with a smaller regret in the initial phases of learning. On the other hand, in the supervised ground truth setting, the CB examples can have an asymptotically meaningful effect on the regret: for instance, if D_s = D_b, then utilizing CB examples can lead to a vanishing regret, whereas using the supervised examples alone cannot.

We can leverage examples from a different source only when the cost structures are at least somewhat related. Therefore, we introduce a measure of similarity of two distributions over cost-sensitive examples.

Definition 1. D′ is said to be (α, ∆)-similar to D with respect to Π if, for any policy π,

E_{D′} c(π(x)) − E_{D′} c(π′^*(x)) ≥ α (E_D c(π(x)) − E_D c(π^*(x))) − ∆,

where π^* = argmin_{π∈Π} E_D c(π(x)) and π′^* = argmin_{π∈Π} E_{D′} c(π(x)).

If we have a larger α and a smaller ∆, examples from D′ are more useful for learning under D. Prior similarity notions, such as in Ben-David et al. (2010), roughly assume a bound on max_{π∈Π} |E_D[c(π(x))] − E_{D′}[c(π(x))]|. The one-sided bound on regret in our definition (instead of an absolute-value bound) and the additional scaling factor α yield additional flexibility.
Note that Definition 1 is only used in our analysis; our algorithms do not require knowledge of α and ∆. We give a more general condition which implies (α, ∆)-similarity, along with several examples, in Appendix B.

Finally, we define some additional notation. In the t-th interaction round, our algorithms compute ĉ_t, an estimate of the unobserved vector c^b_t. We use E_S to denote sample averages on S, and abbreviate E_{S_t} by E_t, where S_t = {(x_τ, ĉ_τ)}_{τ=1}^{t} is the log of the CB examples up to time t.

The settings we have described so far might appear deceptively simple: it should be easy to include some additional supervised examples, which contain more feedback, into a CB algorithm. We now illustrate the difficulty of this task when the two distributions D_s and D_b are misaligned. Consider the special case of 2-armed bandits (CB with a dummy context), where the CB source is the ground truth. D_s and D_b are deterministic, with costs (0.5, 0.5 + ∆/2) and (0.5, 0.5 − ∆/2) for the two arms respectively, so that they are (1, ∆)-similar. Suppose we see n_s = Ω(1/∆³) examples in the warm start, and use them to initialize the means and confidence intervals on each arm to run the UCB algorithm (Auer et al., 2002a). Proposition 1 in Appendix C shows that the optimal arm according to D_b, which is arm 2, is never played for the first O(exp(1/∆)) rounds, incurring regret Ω(∆ exp(1/∆)). So for any sufficiently small ∆, the regret is strictly larger than that of a UCB algorithm which ignores the warm start and incurs at most Õ(1/∆) regret. On the other hand, if D_b = D_s, then the UCB strategy described above incurs no regret.

What we observe here is a failure to compete simultaneously with two baselines: naively warm-starting by weighting examples from the two sources equally, or just ignoring the supervised source entirely. This extreme failure case shows that an arbitrary low-regret CB algorithm cannot handle biased warm-start data without extra care. Using additional randomization can help, but is not adequate by itself, as we will see in our theory and experiments. We will next describe an algorithm that competes not just with these two baselines, but with many possible weightings of the two sources.
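The failure mode above is easy to reproduce numerically. Below is a minimal simulation (our own illustrative sketch, not the construction or code behind Proposition 1; the function name and the particular constants are ours) of cost-based UCB on the two-armed instance, with the empirical means and counts seeded from a biased warm start:

```python
import math

def ucb_pulls_of_best_arm(delta, n_warm, horizon):
    """UCB on a 2-armed bandit with deterministic costs, warm-started
    from the supervised source D_s = (0.5, 0.5 + delta/2) while the
    true bandit source is D_b = (0.5, 0.5 - delta/2); arm 2 (index 1)
    is optimal under D_b but looks worse after the warm start."""
    counts = [n_warm, n_warm]          # warm-start observations per arm
    means = [0.5, 0.5 + delta / 2]     # empirical means seeded from D_s
    best_arm_pulls = 0
    for t in range(1, horizon + 1):
        # Pick the arm with the smallest lower confidence bound on cost.
        lcb = [means[a] - math.sqrt(2 * math.log(t + 2 * n_warm) / counts[a])
               for a in (0, 1)]
        a = 0 if lcb[0] <= lcb[1] else 1
        cost = [0.5, 0.5 - delta / 2][a]   # deterministic D_b costs
        counts[a] += 1
        means[a] += (cost - means[a]) / counts[a]
        best_arm_pulls += (a == 1)
    return best_arm_pulls

# A heavily biased warm start locks UCB onto the wrong arm.
print(ucb_pulls_of_best_arm(delta=0.2, n_warm=2000, horizon=5000))  # -> 0
```

With this biased warm start, the D_b-optimal arm is never pulled over the whole horizon, whereas with a negligible warm start (e.g., `n_warm=1`) the same rule concentrates on the good arm after a logarithmic number of mistakes.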
3. Contextual Bandit Ground Truth Setting
In this section, we study the setting where D_b, the distribution over CB examples, is considered the ground truth, as in the content recommendation example. Recall that in this setting one could ignore the supervised warm-start examples and still achieve vanishing regret; the main goal here is to show that using the warm-start data can help further reduce the regret, especially in the early stages of learning.

The key challenge in designing an algorithm for the CB ground truth setting is understanding how to effectively combine two data sources which might have unknown differences in their distributions. For the simpler supervised learning setting, Proposition 4 in Appendix I shows that it suffices to always use one of the two sources, depending on the bias and the relative number of examples. This has two caveats though: the bias is not known in practice, and completely ignoring one data source is obviously wasteful when the two sources are identical. We choose instead to consider cost minimization on a dataset where the two sources are combined with different weights, and seek to learn these weights adaptively.

With these insights, we return to the actual problem setting of warm-starting a CB learner with supervised examples. Our algorithm for this setting is presented in Algorithm 1. The main idea is to minimize the empirical risk on a weighted dataset containing examples from the two sources. Our algorithm picks the mixture weighting by online model selection over a set of weighting parameters Λ, where we use the ground truth CB data at each time step to evaluate which λ ∈ Λ has the best performance so far. For each λ ∈ Λ, we estimate a π^λ ∈ Π as the empirical risk minimizer (ERM) for the λ-mixture between CB and supervised examples. We focus on the simplest ε-greedy algorithm for CBs, leaving similar modifications to more advanced CB algorithms for future work. So long as {0, 1} ⊆ Λ, Algorithm 1 allows for relying on one source alone, while using a larger set Λ significantly improves its empirical performance (see §5).

Algorithm 1: Adaptive Reweighting for Robustly Warm-starting Contextual Bandits (ARRoW-CB)
Require: supervised dataset S from D_s of size n_s, number of interaction rounds n_b, exploration probability ε, weighted combination parameters Λ, policy class Π.
1: for t = 1, 2, ..., n_b do
2:   Observe instance x_t from D_b.
3:   Define p_t := (1 − ε) (1/(t−1)) Σ_{τ=1}^{t−1} π^{λ_t}_τ(x_t) + (ε/K) 1_K for t ≥ 2, and p_t := (1/K) 1_K for t = 1.
4:   Predict a_t ∼ p_t, and receive feedback c^b_t(a_t).
5:   Define the inverse propensity score (IPS) cost vector ĉ_t(a) := (c^b_t(a_t) / p_{t,a_t}) I(a = a_t), for a ∈ [K].
6:   For every λ ∈ Λ, train π^λ_t by minimizing over π ∈ Π:
       λ Σ_{τ=1}^{t−1} ĉ_τ(π(x_τ)) + (1 − λ) Σ_{(x,c^s)∈S} c^s(π(x)).   (3)
7:   Set λ_{t+1} ← argmin_{λ∈Λ} Σ_{τ=1}^{t} ĉ_τ(π^λ_τ(x_τ)).
8: end for

We need some additional notation to present our regret bound.
(If we approximate the computation of the best policy in Step 6 using an online oracle as in prior work (Agarwal et al., 2014; Langford & Zhang, 2007), then the entire algorithm can be implemented in a streaming fashion, since Step 7 for selecting the best λ also uses an online estimate, à la Blum et al. (1999), for each λ, as opposed to a holdout estimate for the current policy π^λ_t.)

We define V_t(λ), which governs the deviation of λ-weighted empirical costs for all policies in Π, as

V_t(λ) := 2 √( (λ² Kt/ε + (1 − λ)² n_s) ln(n_b |Π| / δ) ) + (λ K/ε + (1 − λ)) ln(n_b |Π| / δ),

and G_t, which bounds the excess cost of the ERM solution using weighted combination parameter λ, as

G_t(λ, α, ∆) := ( (1 − λ) n_s ∆ + 2 V_t(λ) ) / ( λ t + (1 − λ) n_s α ).

We prove the following theorem in Appendix E.
Theorem 1.
Suppose D_s is (α, ∆)-similar to D_b. Then for any δ < 1/e, with probability 1 − δ, the average CB regret of Algorithm 1 can be bounded as:

(1/n_b) R^b(⟨x_t, a_t⟩_{t=1}^{n_b}) ≤ ε + 3 √( ln(n_b |Π| / δ) / n_b ) + 32 √( K ln(n_b |Λ| / δ) / (n_b ε) ) + min_{λ∈Λ} (ln(e n_b) / n_b) Σ_{t=1}^{n_b} G_t(λ, α, ∆).   (4)

The bound (4) consists of several intuitive terms. The first term, ε, comes from uniform exploration; the second term is from the deviation of costs under D_b. The next term is the average regret incurred in performing model selection for λ; in our experiments |Λ| = 8, so it can be thought of as Õ(√(K / (n_b ε))). The final term, involving a minimum over λ's, effectively finds the weighted combination which minimizes a bias-variance tradeoff in combining the two sources. Here the bias is controlled by ∆, and in place of variance we use V_t(λ) for high-probability results. Contrasting with learning from CB examples alone, we replace a √(K ln(|Π|/δ) / (n_b ε)) term with the middle term, which is independent of ln |Π|, and the average of the G_t's, which can be much smaller in favorable cases, as we discuss below.

Identical distributions:
A very friendly setting has D_s = D_b, corresponding to (1, 0)-similarity. Since the theorem holds with a minimum over all λ's in the set Λ, we can pick a specific value λ° of our choice. One choice of λ°, motivated by prior work (Ben-David et al., 2010), is to pick it such that λ° / (1 − λ°) = ε/K, to equalize the variance of the two sources, meaning each supervised example is worth K/ε CB examples. This setting of λ° = ε / (K + ε) yields

G_t(λ°, 1, 0) = O( √( K ln(n_b |Π| / δ) / (εt + K n_s) ) + K ln(n_b |Π| / δ) / (εt + K n_s) ).

That is, after t CB samples, the effective sample size is n_s + tε/K.

Comparison with no warm start:
Whenever 1 ∈ Λ, the minimum over λ ∈ Λ in Theorem 1 can be bounded by its value at λ = 1, which corresponds to ignoring the warm-start examples and using bandit examples alone. For this special case, we have the following corollary.

Corollary 1. Under the conditions of Theorem 1, suppose that 1 ∈ Λ. Then for any δ < 1/e, with probability 1 − δ,

(1/n_b) R^b(⟨x_t, a_t⟩_{t=1}^{n_b}) ≤ ε + O( √( ln(n_b |Π| |Λ| / δ) / (n_b ε) ) ).

The corollary follows from using the value of G_t(1, α, ∆) along with some algebra, and shows that the regret of ARRoW-CB is never worse than that of using the bandit source alone, up to a term scaling as ln |Λ|. In particular, the usual choice of ε = O((n_b)^{−1/3}) implies an O((n_b)^{2/3}) regret bound. Since a small value of |Λ| suffices in our experiments, this is a negligible cost for robustness to arbitrary bias in the warm-start examples. Similarly, comparing to λ = 0 lets us obtain a comparison against using the warm start alone, up to a model selection penalty, when 0 ∈ Λ. The minimization over a richer set of λ's leaves room for further improvements, as shown in the case of D_s = D_b above (which used a different setting of λ). Further improvements are also possible in the algorithm by using different λ values after each round, which is not captured in the theory here.
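To make the moving parts of Algorithm 1 concrete, here is a minimal sketch over an explicit finite policy class, using the "last policy" variant from our experiments (acting ε-greedily on the current ERM policy for the currently selected λ, rather than on an average of past policies). All function and variable names are ours, and the brute-force minimization over `policies` stands in for a cost-sensitive oracle:

```python
import random

def arrow_cb(policies, stream, sup_data, Lambda, eps, K, seed=0):
    """Sketch of ARRoW-CB with a brute-force ERM over a finite policy
    class. `stream` yields (x_t, c_b_t) with full CB cost vectors, but
    the learner only ever looks at the coordinate it played."""
    rng = random.Random(seed)
    logged = []                              # (x_tau, IPS cost vector)
    pv_cost = {l: 0.0 for l in Lambda}       # progressive-validation cost
    erm = {l: policies[0] for l in Lambda}   # current ERM per lambda
    lam, total, n = Lambda[0], 0.0, 0
    for t, (x, c_b) in enumerate(stream, start=1):
        # eps-greedy distribution around the current policy (uniform at t=1).
        if t == 1:
            p = [1.0 / K] * K
        else:
            p = [eps / K] * K
            p[erm[lam](x)] += 1.0 - eps
        a = rng.choices(range(K), weights=p)[0]
        cost = c_b[a]                        # bandit feedback only
        total, n = total + cost, n + 1
        # Inverse propensity scoring: unbiased estimate of the cost vector.
        c_hat = [0.0] * K
        c_hat[a] = cost / p[a]
        # Charge each lambda's pre-update policy on the fresh example.
        for l in Lambda:
            pv_cost[l] += c_hat[erm[l](x)]
        logged.append((x, c_hat))
        # Weighted ERM on the combined dataset, one policy per lambda.
        for l in Lambda:
            def risk(pi, l=l):
                bandit = sum(ch[pi(xx)] for xx, ch in logged)
                sup = sum(cs[pi(xx)] for xx, cs in sup_data)
                return l * bandit + (1 - l) * sup
            erm[l] = min(policies, key=risk)
        # Online model selection: lambda with the best estimated costs.
        lam = min(Lambda, key=lambda l: pv_cost[l])
    return total / n, erm[lam]
```

On a toy stream where one action always has cost 0 and the warm start agrees, the average cost should quickly approach the ε exploration floor; setting Λ = {0, 1} recovers the two single-source baselines.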
4. Supervised Ground Truth Setting
In §3, we developed an algorithm and proved regret bounds for combining supervised and CB feedback in the case where the CB cost is considered the ground truth. In this section, we consider the reverse setting, where the supervised source constitutes the ground truth, recalling the motivating example of an automated translation system from the introduction. Here, we wish to leverage the CB examples for learning the best policy relative to the distribution D_s.

Note that this setting is qualitatively different, since we only have a fixed number n_s of ground-truth examples, while the number of CB examples grows over time. If we assign relative weights to individual supervised and CB examples as in Algorithm 1, the CB examples will eventually outweigh the supervised ones for any λ > 0, which is not desirable when the supervised source is the ground truth. In Algorithm 2, we address this problem by first computing the average costs of every policy on the supervised and CB examples separately, and then choosing a policy that minimizes a weighted combination of these averages. As a consequence, the relative weight of each CB example diminishes as their number grows, with the overall bias incurred from the CB source staying bounded.

Another difference between Algorithm 2 and Algorithm 1 is that, as opposed to using the CB examples collected online, we use subsets of the warm-start examples to guide the selection of the weighted combination parameter λ. To this end, we introduce an epoch structure in the algorithm. In particular, at each epoch e, λ_e and the π^λ_e's are updated exactly once, where a separate validation set is used to pick λ. In addition, we play with uniform randomization around the most recent policy, as opposed to a running average of all policies trained so far, an outcome of using a separate validation set (line 12 of Algorithm 2) instead of progressive validation (line 7 of Algorithm 1).
Since the exploration policy at the next epoch depends on the previous validation set, we must use a "fresh" validation set at each epoch. Avoiding this splitting is an interesting question for future work.

Algorithm 2: Combining contextual bandit and supervised data when the supervised source is the ground truth
Require: supervised dataset S from D_s of size n_s, number of interaction rounds n_b, exploration probability ε, weighted combination parameters Λ, policy class Π.
1: Let E = ⌈log₂ n_b⌉ be the number of epochs.
2: Define t_e = min(2^e, n_b) for e ≥ 1, and t_0 = 0.
3: Partition S into E + 1 equally sized sets S^tr, S^val_1, ..., S^val_E.
4: for e = 1, 2, ..., E do
5:   for t = t_{e−1} + 1, t_{e−1} + 2, ..., t_e do
6:     Observe instance x_t from D_b.
7:     Define p_t := (1 − ε) π^{λ_{e−1}}_{e−1}(x_t) + (ε/K) 1_K for e ≥ 2, and p_t := (1/K) 1_K for e = 1.
8:     Predict a_t ∼ p_t and receive feedback c^b_t(a_t).
9:     Define the IPS cost vector ĉ_t(a) := (c^b_t(a_t) / p_{t,a_t}) I(a = a_t), for a ∈ [K].
10:  end for
11:  For each λ ∈ Λ, train π^λ_e := argmin_{π∈Π} λ E_{t_e} ĉ(π(x)) + (1 − λ) E_{S^tr} c^s(π(x)).
12:  Set λ_e ← argmin_{λ∈Λ} E_{S^val_e} c^s(π^λ_e(x)).
13: end for

For the main result, we need the following notation for the deviation of λ-weighted empirical costs, where E = ⌈log₂ n_b⌉ is the total number of epochs:

W_t(λ) := 2 √( (λ² K/(tε) + (1 − λ)² (E + 1)/n_s) ln(E |Π| / δ) ) + (λ K/(tε) + (1 − λ)(E + 1)/n_s) ln(E |Π| / δ).
Suppose that D_b is (α, ∆)-similar to D_s. Then for any δ < 1/e, with probability 1 − δ, the average supervised regret of Algorithm 2 can be bounded as:

(1/n_b) R^s(⟨x_t, a_t⟩_{t=1}^{n_b}) ≤ ε + 3 √( ln(|Π| / δ) / n_b ) + √( (E + 1) ln(E |Λ| / δ) / n_s ) + min_{λ∈Λ} (1/n_b) Σ_{t=1}^{n_b} ( λ∆ + 2 W_t(λ) ) / ( (1 − λ) + λα ).   (5)

The first term is the cost of exploration, while the second is the gap between the conditional and unconditional expectations over costs in defining the regret. The third term captures the complexity of model selection, while the final term is the performance upper bound for the best λ in our weighted combination set Λ. As before, this significantly improves upon the O(√( ln(|Π|/δ) / n_s )) bound from using supervised examples alone whenever the two sources have sufficient similarity. The proof can be found in Appendix F.

Identical distributions:
When D_s = D_b, which implies that D_s is (1, 0)-similar to D_b, the single choice of λ° = n_b ε / (n_s K + n_b ε) will ensure that the last term in Equation (5) is at most

Õ( √( K ln(E |Π| / δ) / (K n_s + n_b ε) ) + K ln(E |Π| / δ) / (K n_s + n_b ε) )

(see Proposition 2 in Appendix G). That is, after n_b CB samples, the effective sample size is at most n_s + n_b ε / K.
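The doubling epoch schedule and the data splitting that Algorithm 2 relies on can be sketched as follows (a minimal sketch under our own naming; the per-epoch weighted ERM and λ selection are elided):

```python
import math

def epoch_schedule(n_b):
    """Epoch boundaries for Algorithm 2: E = ceil(log2 n_b) epochs, where
    epoch e covers rounds t in (t_{e-1}, t_e] with t_e = min(2**e, n_b)."""
    E = max(1, math.ceil(math.log2(n_b)))
    bounds = [0] + [min(2 ** e, n_b) for e in range(1, E + 1)]
    return [(bounds[e - 1] + 1, bounds[e]) for e in range(1, E + 1)]

def split_supervised(S, n_b):
    """Partition the warm-start set S into one training fold plus one
    fresh validation fold per epoch (E + 1 roughly equal folds)."""
    k = max(1, math.ceil(math.log2(n_b))) + 1
    folds = [S[i::k] for i in range(k)]
    return folds[0], folds[1:]          # S_tr, [S_val_1, ..., S_val_E]

print(epoch_schedule(10))               # -> [(1, 2), (3, 4), (5, 8), (9, 10)]
```

Because λ_e is chosen on S^val_e, and the exploration policy of epoch e + 1 depends on that choice, each epoch consumes its own validation fold, matching the "fresh validation set" requirement discussed above.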
5. Experiments
Experimentally, we focus on the question of learning with the CB costs as the ground truth (§3). Our experiments seek to address the following questions: (a) How much benefit does a small amount of supervised warm start provide? (b) How much benefit does the bandit feedback provide? (c) How robust is our algorithm under a realistic mismatch in cost structures? (d) How robust is our algorithm under adversarial cost structures (the "safety" question)?

We consider the following set of approaches:

- Bandit-Only: a baseline that only uses CB examples.
- Majority: always predicts an action in argmin_{a∈[K]} E_{(x,c)∼D_b}[c(a)], independent of the context, without exploration.
- Sup-Only: a baseline that uses the best policy on supervised examples, without exploration.
- Sim-Bandit: a baseline that runs the CB algorithm on the warm-start examples as well, providing the cost for the chosen action only (from the supervised set), and then continues on the remaining CB examples.
- ARRoW-CB with a set Λ of eight values of λ that includes 0, ζ, and 1 (abbrev. ARRoW-CB with |Λ| = 8), where ζ = ε / (K + ε); this is chosen because ζ is an approximate minimizer of G_t(λ, 1, 0), and the Λ used aims to ensure that min_{λ∈Λ} G_t(λ, α, ∆) is close to min_{λ∈[0,1]} G_t(λ, α, ∆) (see Prop. 3). For computational considerations, we use the last policy π^{λ_t}_t rather than the averaged policy (1/(t−1)) Σ_{τ=1}^{t−1} π^{λ_t}_τ in line 3 of Algorithm 1.
- ARRoW-CB with Λ = {0, 1} (abbrev. ARRoW-CB with |Λ| = 2): as argued in Proposition 3, choosing λ in {0, 1} also approximately minimizes G_t(λ, α, ∆).

In subsequent discussions, if not explicitly mentioned, ARRoW-CB refers to ARRoW-CB with |Λ| = 8.

All the algorithms (other than Sup-Only and Majority, which do not explore) use ε-greedy exploration, with most of the results presented using ε = 0.0125. We additionally present results for larger values of ε (such as ε = 0.1) in Appendix J. In general, the increased uniform exploration for larger ε leads to some performance penalty in the CB algorithms relative to Sup-Only when the bias is small. However, the added exploration gives robustness to large bias, as the bias is readily detected in more adversarial noise settings.
Datasets.
We compare these approaches on 524 binary and multiclass classification datasets from Bietti et al. (2018), which in turn are from openml.org. For each dataset, we use the multiclass label to generate the cost vectors c^b and c^s. That is, given an example (x, y) ∈ X × [K], c^b(a) = I(a ≠ y); the generation of c^s under the different noise models is described below. We vary the number of warm-start examples and CB examples as follows: for a dataset of size n, we choose the number of warm-start examples n_s and the number of CB examples n_b from a grid of four fixed fractions of n each. Define the warm-start ratio as the ratio n_b / n_s. We group different settings of (dataset, n_s, n_b) by n_b / n_s, so that a separate plot is generated for each value of the ratio in a fixed set R. We filter out the settings where n_s is below 100.

Evaluation Criteria.
For each (dataset, n_s, n_b) combination c, we compute e_{c,a}, the average cost of algorithm a on the CB examples. Because the range of e_{c,a} can vary significantly over different settings c, we normalize these to yield the normalized error of an algorithm on a dataset:

err_{c,a} := (e_{c,a} − e*_c) / (max_b e_{c,b} − e*_c),

where e*_c is the error achieved by a fully supervised one-versus-all learning algorithm trained on all the examples with the original labels in this dataset. Lower normalized error indicates better performance. We plot the cumulative distribution function (CDF) of the normalized errors for each algorithm. That is, for an algorithm a, at each point x, the y value is the fraction of c's such that err_{c,a} ≤ x. In general, a high CDF value at a small x indicates that the algorithm is performing well over a large number of (dataset, n_s, n_b) combinations.

In some of the plots, when investigating the effect of a particular type or level of noise, we find it useful to aggregate the plots further over all warm-start ratios in creating the CDF; this aggregation is done by pointwise averaging of the individual CDFs.

Comparison with baselines using both sources.
We present the CDFs of all algorithms under various noise models in Figure 2, with detailed results for individual noise levels, warm-start ratios, and different ε values in Appendix J. In Figure 2, we aggregate over warm-start ratios as described earlier. We can see from the figures that ARRoW-CB's CDFs (approximately) dominate those of Sim-Bandit and ARRoW-CB with |Λ| = 2, which use weightings of 0.5 and the best of {0, 1}, respectively. These gains highlight the importance of being more careful about selecting a good weighting, despite the earlier intuition from Proposition 4. We see that there is a potentially added benefit of using different λ's in different phases of learning, which might even outperform the best setting in hindsight.

Results for aligned cost structures.
In Fig. 2a, we consider the setting c_s = c_b. Here, ARRoW-CB's CDF dominates all other algorithms other than Sup-Only. For Sup-Only, the warm-start policy is used greedily with no exploration, making it a very strong baseline when there is no bias. We observe that ARRoW-CB uses the warm-start much more effectively than both the Sim-Bandit and ARRoW-CB with |Λ| = 2 baselines. Our next experiments consider a uniform-at-random (UAR) noise setting, where the supervised data is unbiased (with respect to c_b) but has higher variance. In particular, for every example (x, c_b), with probability 1 − p we set c_s = c_b, and with probability p we set c_s as the classification error against a uniformly random label. From Claim 2 in the appendix, D_s is (1 − p, 0)-similar to D_b. We plot the CDFs of the algorithms in the case p = 0.5 in Fig. 2d. The ordering of the CDFs stays the same, with Sup-Only less dominant (unsurprisingly), and with the gaps between methods reduced with the reduced utility of the warm-start data.

Figure 2: Comparison of all algorithms in the CB ground-truth setting using the empirical CDF of the normalized performance scores (legend: ARRoW-CB with |Λ| = 8, ARRoW-CB with |Λ| = 2, Sup-Only, Bandit-Only, Sim-Bandit, Majority; axes: normalized error vs. cumulative frequency). Left: unbiased warm-start examples with noiseless (top) and UAR with probability 0.5 (bottom) costs on the warm-start examples. Middle: extreme noise rate using the CYC noise type with probability 1.0; all CB algorithms use ε = 0.0125 for exploration (top) and ε = 0.1 (bottom). Right: moderate and potentially helpful noise rates; the corruption added to the warm-start examples is of type CYC (top) and MAJ (bottom) respectively, with probability 0.25.

Figure 3: Effect of varying warm-start ratios for MAJ noise with p = 0.25. The warm-start ratios vary from 2.875 (left) and 23 (middle) to 184 (right). Each CDF aggregates over all conditions of this noise type with the same warm-start ratio.
Results with adversarial noise.
We next conduct an experiment where c_s and c_b are highly misaligned, in order to understand how robust ARRoW-CB is to adversarial conditions. We consider the cycling noise model (CYC), where we set the supervised costs to be "off-by-one" from the CB costs. Specifically, if c_b declares action a to be the zero-cost action, then, with probability p, c_s corrupts the costs so that action (a + 1) mod K becomes the zero-cost action. From Claim 3 in the appendix, D_s is (1, p)-similar to D_b. The CDF results for this experiment are in Figs. 2b and 2e (p = 1.0) for ε = 0.0125 and ε = 0.1, and in Fig. 2c (p = 0.25) for ε = 0.0125, respectively. Again, ARRoW-CB is dominant amongst the methods which use both sources. In the case of p = 1.0, Bandit-Only performs the best, as the warm-start examples are misleading. ARRoW-CB performs slightly worse (Fig. 2b) due to the model-selection overhead, as discussed following Theorem 1. This gap is reduced when we increase the ε value in ε-greedy to 0.1 (Fig. 2e). In the case of p = 0.25 (Fig. 2c), ARRoW-CB outperforms all the methods, showing that it can utilize warm-start examples even if they are moderately biased.
Results with majority noise.
Finally, we consider a noise model that replaces the ground-truth label with the majority label, roughly modeling a "lazy annotator" who occasionally defaults to the most frequent class. For the majority noise model (MAJ), with probability 1 − p we set c_s = c_b, and with probability p we set c_s to a cost vector that has a zero for the most frequent label in the dataset and one elsewhere. From Claim 3 in the appendix, D_s is (1, p)-similar to D_b. The CDFs for this setting are shown in Figure 2f, where we again see ARRoW-CB dominating all the baselines (similar to Figure 2c).

In sum, we observe that ARRoW-CB is the only method which is the best, or close to it, across all the noise regimes; no other approach is consistently strong. In practical scenarios, where the extent of bias in the warm-start is difficult or costly to ascertain, this robust performance of ARRoW-CB is extremely desirable. If we have some prior information about the noise level, it is prudent to prefer a smaller ε when we expect low noise (to compete well with Sup-Only), while a larger ε is preferred in high-noise situations (to quickly detect the extent of bias). While we present aggregates over warm-start ratios here, plots for each combination of noise type, level and warm-start ratio for three values of ε are shown in Appendix J.
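To make the three corruption models concrete, here is a minimal sketch of how a supervised cost vector c_s could be generated from a CB cost vector c_b (names are ours, and the 0-1 cost vectors are one plausible instantiation; the released scripts may differ in detail):

```python
import random

def corrupt(c_b, noise, p, K, majority_label=0, rng=random):
    """Return a supervised cost vector c_s given the CB cost vector c_b.
    noise: 'UAR' (0-1 cost against a uniformly random label),
           'CYC' (the action after the zero-cost action becomes zero-cost),
           'MAJ' (zero cost on the dataset's most frequent label)."""
    if rng.random() >= p:          # with probability 1 - p, keep c_s = c_b
        return list(c_b)
    if noise == 'UAR':             # classification error vs. a random label
        y = rng.randrange(K)
        return [0.0 if a == y else 1.0 for a in range(K)]
    if noise == 'CYC':             # (a* + 1) mod K becomes the zero-cost action
        a_star = min(range(K), key=lambda a: c_b[a])
        y = (a_star + 1) % K
        return [0.0 if a == y else 1.0 for a in range(K)]
    if noise == 'MAJ':             # "lazy annotator" defaults to majority class
        return [0.0 if a == majority_label else 1.0 for a in range(K)]
    raise ValueError(noise)
```

For example, with p = 1.0 and CYC noise, every warm-start example's zero-cost action is shifted by one, which is exactly the fully misleading regime of Figs. 2b and 2e.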
Effect of warm-start ratio.
In Fig. 3, we pick a moderate noise setting and study the ordering of the different methods as the number of warm-start examples n_s increases relative to n_b. We see ARRoW-CB outperforming all methods. Sup-Only is strong on the left for a small ratio (2.875), while Bandit-Only does well at the other extreme (184), and ARRoW-CB consistently outperforms both the baselines combining the two sources.
Overall.
Overall, we see that effectively using warm-start examples can certainly improve the performance of CB approaches. ARRoW-CB provides a way to do this in a robust manner, consistently outperforming most baselines. This is best evidenced in Figure 1, which further aggregates performance across the following 10 noise conditions on the warm-start examples: noiseless, and {UAR, CYC, MAJ} corruptions with probability p in {0.25, 0.5, 1.0}.
6. Discussion and Future Work
In this paper, we study the question of incorporating multiple data sources in CB settings. We see that even in simple cases, obvious techniques do not work robustly, and some care is required to handle biases from the non-ground-truth source. Building on our results, there are several natural avenues for future work. Making a similar modification to more advanced exploration algorithms (e.g., Agrawal & Goyal, 2013; Agarwal et al., 2014) is significantly more challenging. This falls into the general category of selecting the best from an ensemble of CB algorithms (where the ensemble corresponds to different weightings of the supervised and the CB examples). In ε-greedy, the policy training corresponds to training the CB algorithm on reweighted data, while the model selection over λ induces the action distribution. While the first step is typically straightforward even for other CB algorithms, finding an action distribution which looks good at the current round, while allowing the CB algorithms for different λ values to make subsequent updates, is significantly harder (for instance, when using a UCB-style strategy, each λ value might suggest a completely different action and expect reward feedback about it). A possible approach is to employ ideas from the Corral algorithm (Agarwal et al., 2017), but the cost of model selection is then linear instead of logarithmic in |Λ|, and the approach is somewhat data-inefficient due to restarts. More ambitiously, it is desirable for the schedule of supervised and CB examples not to be fixed in a warm-start fashion, but to be based on active querying, such as by sending uncertain examples to a labeler for full supervision. Studying this, and considering broader sources of feedback, are both interesting directions for future research.

Acknowledgments
We thank Alberto Bietti for kindly sharing the scripts for the experiments performed by Bietti et al. (2018), and for helping to get the experiments running. We also thank the anonymous reviewers for their helpful feedback.
References
Agarwal, A., Hsu, D., Kale, S., Langford, J., Li, L., and Schapire, R. Taming the monster: A fast and simple algorithm for contextual bandits. In International Conference on Machine Learning, pp. 1638–1646, 2014.

Agarwal, A., Luo, H., Neyshabur, B., and Schapire, R. E. Corralling a band of bandit algorithms. In COLT, 2017.

Agrawal, S. and Goyal, N. Thompson sampling for contextual bandits with linear payoffs. In Proceedings of the 30th International Conference on Machine Learning, ICML 2013, pp. 127–135, 2013. URL http://jmlr.org/proceedings/papers/v28/agrawal13.html.

Auer, P., Cesa-Bianchi, N., and Fischer, P. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002a.

Auer, P., Cesa-Bianchi, N., Freund, Y., and Schapire, R. E. The nonstochastic multiarmed bandit problem. SIAM Journal on Computing, 32(1):48–77, 2002b. URL https://doi.org/10.1137/S0097539701398375.

Ben-David, S., Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., and Vaughan, J. W. A theory of learning from different domains. Machine Learning, 79(1-2):151–175, 2010. URL https://doi.org/10.1007/s10994-009-5152-4.

Beygelzimer, A., Langford, J., and Zadrozny, B. Weighted one-against-all. In AAAI, 2005.

Beygelzimer, A., Langford, J., Li, L., Reyzin, L., and Schapire, R. Contextual bandit algorithms with supervised learning guarantees. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 19–26, 2011.

Bietti, A., Agarwal, A., and Langford, J. A contextual bandit bake-off. arXiv preprint arXiv:1802.04064, 2018.

Blum, A., Kalai, A., and Langford, J. Beating the hold-out: Bounds for k-fold and progressive cross-validation. In COLT, 1999.

Chu, W., Li, L., Reyzin, L., and Schapire, R. E. Contextual bandits with linear payoff functions. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, pp. 208–214, 2011.

Crammer, K., Kearns, M., and Wortman, J. Learning from multiple sources. Journal of Machine Learning Research, 9:1757–1774, 2008.

Donmez, P. and Carbonell, J. G. Proactive learning: cost-sensitive active learning with multiple imperfect oracles. In Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp. 619–628. ACM, 2008.

Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

Dudik, M., Hsu, D., Kale, S., Karampatziakis, N., Langford, J., Reyzin, L., and Zhang, T. Efficient optimal learning for contextual bandits. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence, 2011.

Karampatziakis, N. and Langford, J. Online importance weight aware updates. In Proceedings of the Twenty-Seventh Conference on Uncertainty in Artificial Intelligence, UAI'11, pp. 392–399. AUAI Press, 2011. URL http://dl.acm.org/citation.cfm?id=3020548.3020594.

Kazerouni, A., Ghavamzadeh, M., Abbasi, Y., and Roy, B. V. Conservative contextual linear bandits. In Advances in Neural Information Processing Systems 30, pp. 3913–3922, 2017.

Langford, J. and Zhang, T. The epoch-greedy algorithm for contextual multi-armed bandits. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pp. 817–824. Curran Associates Inc., 2007.

Malago, L., Cesa-Bianchi, N., and Renders, J. Online active learning with strong and weak annotators. In NIPS Workshop on Learning from the Wisdom of Crowds, 2014.

Mansour, Y., Mohri, M., and Rostamizadeh, A. Domain adaptation: Learning bounds and algorithms. In COLT, 2009.

Nguyen, K., Daumé III, H., and Boyd-Graber, J. Reinforcement learning for bandit neural machine translation with simulated human feedback. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2017. URL http://hal3.name/docs/.

Ross, S., Mineiro, P., and Langford, J. Normalized online learning. In UAI, 2013.

Sokolov, A., Riezler, S., and Urvoy, T. Bandit structured prediction for learning from partial feedback in statistical machine translation. In MT Summit, 2015.

Sun, W., Dey, D., and Kapoor, A. Safety-aware algorithms for adversarial contextual bandit. In Proceedings of the 34th International Conference on Machine Learning, ICML 2017, pp. 3280–3288, 2017. URL http://proceedings.mlr.press/v70/sun17a.html.

Tewari, A. and Murphy, S. A. From ads to interventions: Contextual bandits in mobile health. In Mobile Health, pp. 495–517. Springer, 2017.

Urner, R., David, S. B., and Shamir, O. Learning from weak teachers. In Artificial Intelligence and Statistics, pp. 1252–1260, 2012.

Yan, S., Chaudhuri, K., and Javidi, T. Active learning with logged data. In ICML, 2018.

Yan, Y., Rosales, R., Fung, G., and Dy, J. G. Active learning from crowds. In ICML, 2011.

Yu, B. Assouad, Fano, and Le Cam. In Festschrift for Lucien Le Cam, pp. 423–435. Springer, 1997.

Zhang, C. and Chaudhuri, K. Active learning from weak and strong labelers. In Advances in Neural Information Processing Systems, pp. 703–711, 2015.
A. Additional Experimental Details
We run our experiments using Vowpal Wabbit (VW; http://hunch.net/~vw/). In all our algorithms, we consider a scorer function class F that contains functions f mapping (x, a) to estimated cost values. The policy class Π induced by F is defined as Π := {π_f : f ∈ F}, where π_f(x) = argmin_{a ∈ [K]} f(x, a). In general, our algorithms do not learn with respect to Π directly; instead, at each time t, they find some scorer function f_t and use its induced policy π_{f_t} ∈ Π to perform exploration and exploitation.

In all CB learning algorithms (Bandit-Only and Sim-Bandit), we use the ε-greedy exploration strategy with ε = 0.0125 in most of the results for the main text; results for two other ε values are in Appendix J. We use the importance-weighted regression (IWR) algorithm (Bietti et al., 2018) to compute the cost regressors f_t ∈ F. The function class we use consists of linear functions: f_w(x, a) = ⟨w_a, x⟩. In the supervised learning algorithm (Sup-Only), we use the cost-sensitive one-against-all algorithm (Beygelzimer et al., 2005) to train a cost regressor f using all the warm-start examples, and make no updates in the interaction stage. In the Majority algorithm, we simply predict using the majority class in the dataset and make no updates in the interaction stage.
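For intuition, the induced policy and the ε-greedy action draw described above can be sketched as follows (linear scorers; the helper names are ours, not VW's):

```python
import random

def pi_f(w, x, K):
    """Induced policy: pick the action whose linear scorer <w_a, x>
    predicts the lowest cost. w is a list of K weight vectors."""
    scores = [sum(wi * xi for wi, xi in zip(w[a], x)) for a in range(K)]
    return min(range(K), key=lambda a: scores[a])

def epsilon_greedy_draw(w, x, K, eps, rng=random):
    """With probability 1 - eps play the induced policy's action, otherwise
    explore uniformly. Return the action and its probability, which an
    IWR-style update would use as the importance weight's denominator."""
    greedy = pi_f(w, x, K)
    a = rng.randrange(K) if rng.random() < eps else greedy
    p = (1 - eps) * (a == greedy) + eps / K
    return a, p
```

Recording the action probability p at draw time is what allows unbiased cost estimates to be formed later from bandit feedback.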
Note that there is no exploration in Sup-Only and Majority. For ARRoW-CB, we also use the same set of ε values as Bandit-Only and Sim-Bandit. At time t, and for each λ in Λ, we approximate π_t^λ as follows. Instead of optimizing the objective function in Equation (3), we start by finding an approximate optimizer of the following objective function:

$$f_t^\lambda = \operatorname*{argmin}_{f \in F}\; (1-\lambda) \sum_{(x,c) \in S} \sum_{a=1}^{K} \big(f(x,a) - c(a)\big)^2 + \lambda \sum_{\tau=1}^{t} \frac{1}{p_{\tau,a_\tau}} \big(f(x_\tau, a_\tau) - c_\tau(a_\tau)\big)^2,$$

and then take π_t^λ = π_{f_t^λ}. For computational efficiency, we choose not to find the exact empirical cost minimizer f_t^λ at each time. Instead, we use a variant of online gradient descent in VW with adaptive (Duchi et al., 2011), normalized (Ross et al., 2013) and importance-weight-aware updates (Karampatziakis & Langford, 2011) on the objective function. Also for computational efficiency, in ARRoW-CB, at time t ≥ 2, instead of mixing the average policy of {π_τ^λ}_{τ=1}^{t−1} with ε-uniform exploration, we predict by mixing the most recent policy π_t^λ with ε-uniform exploration.

We vary the learning rates of all algorithms over a grid of nine values. For each algorithm and dataset/setting combination, we first compute the average cost in the interaction stage, and then select the learning rate that minimizes this average cost.

The source code for running the experiments is available at https://github.com/zcc1307/warmcb_scripts.

B. Similarity Conditions between Cost-sensitive Distributions
In this section, we introduce an alternative, more intuitive definition of similarity between two distributions of cost-sensitive examples. We then show that this notion is stronger than Definition 1.
Definition 2.
Distribution D_1 is strongly (α, ∆)-similar to D_2 with respect to policy class Π if there exists a joint distribution D over (x, c^+, c^-) ∈ X × [0,1]^K × [−1,1]^K such that E_D[c^+(a) + c^-(a) | x] = E_{D_1}[c(a) | x] for all a ∈ [K] and x ∈ X, and there exists a policy π^* ∈ argmin_{π ∈ Π} E_{D_2}[c(π(x))] such that both of the following hold:

1. For any policy π in Π, E_D[c^+(π(x))] − E_D[c^+(π^*(x))] ≥ α (E_{D_2}[c(π(x))] − E_{D_2}[c(π^*(x))]).

2. The expected magnitude of c^- is bounded: for any policy π, |E_D[c^-(π(x))]| ≤ ∆/2.

In the definition above, we use +/− superscripts to indicate the utility of c^+ and c^-, which give a decomposition of the cost structure generated from D_1. Intuitively, c^+ is the component useful for learning under D_2, as low excess cost under c^+ implies low excess cost under D_2. In contrast, c^- may or may not have this property. We have the following lemma, upper bounding a policy's excess cost on D_2 in terms of its excess cost on D_1:

Lemma 1.
Suppose D_1 and D_2 are two distributions over cost-sensitive examples. If D_1 is strongly (α, ∆)-similar to D_2 with respect to policy class Π, then D_1 is (α, ∆)-similar to D_2 with respect to policy class Π.

Proof.
Suppose (x, c^+, c^-) is the decomposition of (x, c) that satisfies Definition 2. We have that

$$E_{D_1}[c(\pi(x))] - E_{D_1}[c(\pi^*(x))] = \big(E_D[c^+(\pi(x))] - E_D[c^+(\pi^*(x))]\big) + \big(E_D[c^-(\pi(x))] - E_D[c^-(\pi^*(x))]\big) \ge \alpha\big(E_{D_2}[c(\pi(x))] - E_{D_2}[c(\pi^*(x))]\big) - \Delta,$$

where the inequality follows from using the respective items in Definition 2 to lower bound each term (observing that E_D[c^-(π(x))] − E_D[c^-(π^*(x))] ≥ −∆/2 − ∆/2 = −∆). The lemma follows.

We now present a few examples that satisfy strong similarity.

Example 1: labeling the lowest-cost action.
Consider a setting where eliciting the full costs from an expert for warm-start examples is prohibitively expensive, but the expert can label the least costly action. Let D_2 be the CB source and let π^* be the best policy under D_2. Define D_1 by first sampling (x, c) ∼ D_2 and returning (x, c̃), where c̃(a) = I(a ≠ a^*) for some a^* ∈ argmin_a c(a). Define c^+(a) = I(a ≠ π^*(x)) and c^- = c̃ − c^+.

Claim 1. D_1 is (1, ∆)-strongly similar to D_2, with ∆ = 2 P(c^b(π^*(x)) ≥ min_{a ≠ π^*(x)} c^b(a)).

Proof. On one hand, for any policy π, E_D c^+(π(x)) − E_D c^+(π^*(x)) = P(π(x) ≠ π^*(x)) ≥ E_{D_2} c(π(x)) − E_{D_2} c(π^*(x)). On the other hand, observe that, given a cost-sensitive example (x, c) and the policy π^*, if c(π^*(x)) < min_{a ≠ π^*(x)} c(a), then π^*(x) = a^*, in which case c̃ = c^+ and c^- = 0. This implies that P(c^- ≠ 0) ≤ P(c(π^*(x)) ≥ min_{a ≠ π^*(x)} c(a)) = ∆/2. Therefore, for all policies π, E|c^-(π(x))| ≤ P(c^- ≠ 0) = ∆/2.

Given a multiclass label y, define its induced zero-one cost vector c_y ∈ [0,1]^K as follows: c_y(a) = 0 if a = y, and c_y(a) = 1 otherwise. From the definition of c_y, it can be seen that, for any policy π, E c_y(π(x)) = P(π(x) ≠ y).

Example 2: Uniform-at-random corruption.
Suppose D is a joint distribution over multiclass examples (x, y), and the corrupted label ỹ has the following conditional distribution given (x, y): for all a in [K], P(ỹ = a | (x, y)) = (1 − p) I(a = y) + p/K. Define D_1 and D_2 as the joint distributions of (x, c_ỹ) and (x, c_y), respectively.

Claim 2.
Suppose D_1 and D_2 are defined as above. Then D_1 is (1 − p, 0)-strongly similar to D_2.

Proof. Suppose π^* ∈ argmin_{π ∈ Π} E_{D_2} c(π(x)) is an optimal policy with respect to D_2. For any policy π, by the definition of ỹ and c_ỹ, we have that

$$E\, c_{\tilde y}(\pi(x)) = P(\pi(x) \ne \tilde y) = (1-p)\, P(\pi(x) \ne y) + p\,\frac{K-1}{K}.$$
Therefore, for any policy π, the following identity holds:

$$E\, c_{\tilde y}(\pi(x)) - E\, c_{\tilde y}(\pi^*(x)) = (1-p)\big(E\, c_y(\pi(x)) - E\, c_y(\pi^*(x))\big).$$

Therefore, taking c^+ = c_ỹ, c^- = 0, α = 1 − p and ∆ = 0, it can easily be checked that the conditions of Definition 2 are satisfied.

Example 3: general corruption.
Suppose D is a joint distribution over multiclass examples (x, y), and the corrupted label ỹ's conditional distribution given (x, y) has the following property: P(ỹ = y | (x, y)) ≥ 1 − p. Define D_1 and D_2 as the joint distributions of (x, c_ỹ) and (x, c_y), respectively.

Claim 3.
Suppose D_1 and D_2 are defined as above. Then D_1 is (1, p)-strongly similar to D_2.

Proof. Suppose π^* ∈ argmin_{π ∈ Π} E_{D_2} c(π(x)) is an optimal policy with respect to D_2. For every x, define deterministically c^+ = E[c_y | x] and c^- = E[c_ỹ | x] − E[c_y | x]. We have that E[c^+ + c^- | x] = E[c_ỹ | x] by the definitions of c^+ and c^-. In addition, by the construction of c^+, we immediately have that E c^+(π(x)) − E c^+(π^*(x)) = E c_y(π(x)) − E c_y(π^*(x)). What remains is to bound E c^-(π(x)). By the definitions of c_y and c_ỹ, we have that |E c^-(π(x))| = |P(π(x) ≠ ỹ) − P(π(x) ≠ y)| ≤ P(y ≠ ỹ). By the assumption on the conditional distribution of ỹ given (x, y), the right-hand side is at most p. The claim follows.

C. Proof Showing the Failure of Equal Data Weighting
In this section, we formalize the example presented in §2.1. To recall, this is a 2-armed bandit setting (i.e., contextual bandits with a dummy context), where the policy class is Π := {π_1, π_2}, and π_i maps any context to action i, for i = 1, 2. In addition, D_s (resp. D_b) is the Dirac measure on (x_0, c^s) (resp. (x_0, c^b)), and the respective cost vectors are c^s = (0.5, 0.5 + ∆/2) and c^b = (0.5, 0.5 − ∆/2). We consider the following algorithm, which directly extends the UCB1 algorithm (Auer et al., 2002a) by additionally using the warm-start examples to estimate the mean costs of the two actions. Note that, as it is minimizing its cumulative cost, the algorithm computes lower confidence bounds of the costs and selects the minimum, which is equivalent to computing upper confidence bounds of the rewards and selecting the maximum.
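A minimal simulation of this warm-started UCB variant (our own sketch of Algorithm 3 below, using the deterministic two-action costs from the example; all names are ours) makes the failure mode easy to observe: the misleading warm-start keeps the algorithm on the suboptimal action for a long stretch before it ever tries the better one.

```python
import math

def warm_ucb(c_s, c_b, n_s, n_b):
    """Sketch of Algorithm 3: UCB1 over 2 deterministic actions, with n_s
    fully supervised warm-start examples folded into each action's empirical
    mean cost. Returns how many times each action was taken."""
    counts = [0, 0]
    sums = [n_s * c_s[0], n_s * c_s[1]]   # supervised examples cover both actions
    for t in range(1, n_b + 1):
        lcb = [sums[i] / (n_s + counts[i])
               - math.sqrt(math.log(t + 1) / (n_s + counts[i]))
               for i in range(2)]
        a = 0 if lcb[0] <= lcb[1] else 1  # pick the smaller lower confidence bound
        counts[a] += 1
        sums[a] += c_b[a]                 # bandit feedback for the chosen action
    return counts

# supervised source claims action 2 is worse; the bandit truth says it is better
delta = 0.4
counts = warm_ucb([0.5, 0.5 + delta / 2], [0.5, 0.5 - delta / 2],
                  n_s=50, n_b=2000)
```

With larger n_s or ∆ the initial stretch of suboptimal pulls grows roughly like exp(∆² n_s / 4), matching Proposition 1 below.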
Algorithm 3
A variant of the UCB1 algorithm that accounts for warm-start examples
Require:
Supervised examples S = {(x_0, c^s)} of size n_s, and the number of interaction rounds n_b.
for t = 1, 2, ..., n_b do
  For i = 1, 2, define n_{i,t−1} = Σ_{s=1}^{t−1} I(a_s = i).
  For i = 1, 2, compute the empirical mean cost of action i:
  $$\hat\mu_{i,t} = \frac{\sum_{(x_0, c^s) \in S} c^s(i) + \sum_{s=1}^{t-1} I(a_s = i)\, c^b_s(i)}{n_s + n_{i,t-1}}.$$
  For i = 1, 2, compute lcb_{i,t} = μ̂_{i,t} − √(ln t / (n_s + n_{i,t−1})).
  Take action a_t = argmin_{i ∈ {1,2}} lcb_{i,t}.
  Observe c^b_t(a_t).
end for

We have the following proposition, giving a lower bound on the regret of Algorithm 3 under the above settings of D_s and D_b.

Proposition 1.
Suppose D_s and D_b are defined as above, and that Algorithm 3 is run with n_s warm-start examples drawn from D_s and n_b ≥ exp(∆² n_s / 4) interaction rounds. Then Algorithm 3 incurs a regret of Ω(∆ exp(∆² n_s / 4)).

Proof. Suppose that, after t − 1 rounds of the interaction phase, Algorithm 3 has taken action i a total of t_i times, for i = 1, 2. As the cost vectors are deterministic, we can calculate the lower confidence bound estimates for the two actions in closed form:

$$\mathrm{lcb}_{t,1} = 0.5 - \sqrt{\frac{\ln(t_1+t_2+1)}{n_s+t_1}} \quad\text{and}\quad \mathrm{lcb}_{t,2} = 0.5 + \frac{\Delta}{2}\cdot\frac{n_s-t_2}{n_s+t_2} - \sqrt{\frac{\ln(t_1+t_2+1)}{n_s+t_2}}.$$

Now, let us consider the first time we play action 2 in the interaction phase. At this point, t_2 is still 0, and t_1 must satisfy lcb_{t,2} ≤ lcb_{t,1}, that is,

$$\frac{\Delta}{2} - \sqrt{\frac{\ln(t_1+1)}{n_s}} \le -\sqrt{\frac{\ln(t_1+1)}{n_s+t_1}}.$$

The above condition implies that √(ln(t_1+1)/n_s) ≥ ∆/2, or equivalently t_1 ≥ exp(∆² n_s / 4) − 1. Denote T := exp(∆² n_s / 4) − 1. Therefore, after n_b ≥ exp(∆² n_s / 4) rounds of interaction, the regret of Algorithm 3 can be lower bounded by:

$$\sum_{t=1}^{n_b}\big(c^b(a_t) - c^b(2)\big) \ge \sum_{t=1}^{T-1}\big(c^b(1) - c^b(2)\big) = \frac{\Delta}{2}\cdot(T-1) = \Omega\big(\Delta \exp(\Delta^2 n_s / 4)\big).$$

In Algorithm 3, we used only ln(t+1) in the numerator of the confidence term, when an alternative might be to use ln(t + n_s + 1). However, it is easily checked that, after this simple modification, a similar exponential regret lower bound for Algorithm 3 can be proved (with the definition of T changed to T := exp(∆² n_s / 4) − n_s − 1).

D. Concentration Inequalities
We use a version of Freedman’s inequality from (Beygelzimer et al., 2011).
Lemma 2 (Freedman’s inequality) . Let X , . . . , X n be a martingale difference sequence adapted to filtration {B i } ni =0 ,and | X i | ≤ M almost surely for all i . Let V = (cid:80) ni =1 E [ X i |B i − ] be the cumulative conditional variance. Then, withprobability − δ , (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) n (cid:88) i =1 X i (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:114) V ln 2 δ + M ln 2 δ . E. Proof of Theorem 1
We begin with some additional notation used in the analysis. Throughout this section, let π^* = argmin_{π ∈ Π} E c^b(π(x)) denote the optimal policy in Π with respect to D_b. Recall that, for a policy π, we define

$$\hat E_t[\hat c(\pi(x))] := \frac{1}{t}\sum_{s=1}^{t} \hat c_s(\pi(x)) \qquad\text{and}\qquad \hat E_S[c^s(\pi(x))] := \frac{1}{n_s}\sum_{(x,c) \in S} c(\pi(x)).$$

For λ in [0, 1], we additionally define the λ-weighted empirical cost of π as

$$\hat L_{\lambda,t}(\pi) = \frac{\lambda t\, \hat E_t[\hat c(\pi(x))] + (1-\lambda) n_s\, \hat E_S[c^s(\pi(x))]}{\lambda t + (1-\lambda) n_s},$$

and its expectation

$$L_{\lambda,t}(\pi) = \frac{\lambda t\, E[c^b(\pi(x))] + (1-\lambda) n_s\, E[c^s(\pi(x))]}{\lambda t + (1-\lambda) n_s}.$$

Observe that $\hat L_{1,t}(\pi) = \hat E_t[\hat c(\pi(x))]$ is π's empirical cost on the first t CB examples and L_{1,t}(π) = E c^b(π(x)); likewise, $\hat L_{0,t}(\pi) = \hat E_S[c^s(\pi(x))]$ is π's empirical cost on the n_s supervised examples and L_{0,t}(π) = E c^s(π(x)). Denote by $\hat\Gamma_{\lambda,t}(\pi) = (\lambda t + (1-\lambda) n_s)\hat L_{\lambda,t}(\pi)$ the unnormalized λ-weighted empirical cost of π, and by Γ_{λ,t}(π) = (λt + (1−λ)n_s) L_{λ,t}(π) its expectation.

Denote by (x_{n_b+1}, c^s_{n_b+1}), ..., (x_{n_b+n_s}, c^s_{n_b+n_s}) an enumeration of the elements of S. Define the filtration {B_t}_{t=0}^{n_b+n_s} as follows: B_0 is the trivial σ-algebra, and

$$\mathcal B_t = \begin{cases} \sigma\big((x_{n_b+1}, c^s_{n_b+1}), \ldots, (x_{n_b+t}, c^s_{n_b+t})\big), & t \in \{1, \ldots, n_s\},\\ \sigma\big(S, (x_1, \hat c_1), \ldots, (x_{t-n_s}, \hat c_{t-n_s})\big), & t \in \{n_s+1, \ldots, n_s+n_b\}. \end{cases}$$

For the reader's convenience, we also recall our earlier notation:

$$V_t(\lambda) = 2\sqrt{\Big(\lambda^2 \frac{Kt}{\epsilon} + (1-\lambda)^2 n_s\Big)\ln\frac{8 n_b|\Pi|}{\delta}} + \Big(\lambda\frac{K}{\epsilon} + (1-\lambda)\Big)\ln\frac{8 n_b|\Pi|}{\delta}, \qquad G_t(\lambda, \alpha, \Delta) = \frac{(1-\lambda) n_s \Delta + 2 V_t(\lambda)}{\lambda t + (1-\lambda) n_s \alpha}.$$

In addition, denote $\bar G_t(\lambda, \alpha, \Delta) := \min(1, G_t(\lambda, \alpha, \Delta))$.

Proof of Theorem 1.
Define the event I as: for all π in Π,

$$\bigg|\frac{1}{n_b}\sum_{t=1}^{n_b}\big(E[c^b_t(\pi(x_t)) \mid x_t] - E\, c^b(\pi(x))\big)\bigg| \le \sqrt{\frac{\ln\frac{8|\Pi|}{\delta}}{2 n_b}}.$$

By Hoeffding's inequality and a union bound, I happens with probability at least 1 − δ/4. Specifically, on event I, for every policy π in Π, as π^* is the policy in Π that minimizes E c^b(π(x)), we have

$$E\, c^b(\pi^*(x)) - \frac{1}{n_b}\sum_{t=1}^{n_b} E[c^b(\pi(x_t)) \mid x_t] \le E\, c^b(\pi(x)) - \frac{1}{n_b}\sum_{t=1}^{n_b} E[c^b(\pi(x_t)) \mid x_t] \le \sqrt{\frac{\ln\frac{8|\Pi|}{\delta}}{2 n_b}}.$$

Therefore,

$$E\, c^b(\pi^*(x)) - \min_{\pi \in \Pi}\frac{1}{n_b}\sum_{t=1}^{n_b} E[c^b(\pi(x_t)) \mid x_t] \le \sqrt{\frac{\ln\frac{8|\Pi|}{\delta}}{2 n_b}}. \tag{6}$$

Recall that the randomized policy π_t : X → ∆^{K−1} is defined as $\pi_t(x) = \frac{1-\epsilon}{t-1}\sum_{\tau=1}^{t-1}\pi^{\lambda_\tau}_\tau(x) + \frac{\epsilon}{K}\mathbf 1_K$ for t ≥ 2, and π_1(x) = (1/K)·1_K for all x. With a slight abuse of notation, denote $E\, c^b(\pi_t(x)) := E_{(x,c^b),\, a \sim \pi_t(x)}\, c^b(a)$, and observe that E[E[c^b(a_t) | x_t] | B_{n_s+t−1}] = E c^b(π_t(x)). Define the event J as:

$$\bigg|\frac{1}{n_b}\sum_{t=1}^{n_b} E[c^b(a_t) \mid x_t] - \frac{1}{n_b}\sum_{t=1}^{n_b} E\, c^b(\pi_t(x))\bigg| \le \sqrt{\frac{2\ln\frac{8}{\delta}}{n_b}}. \tag{7}$$

By Azuma's inequality, J happens with probability at least 1 − δ/4.

Denote by E the event that the events E_t, F_t defined in Lemmas 3 and 5 (both given below) and I, J hold simultaneously for all t. By a union bound over all the E_t's, F_t's and I, J, the event E happens with probability at least 1 − δ. We henceforth condition on E happening. Consider t in {2, ..., n_b}.
We now give an upper bound on the expected excess cost of using the randomized prediction π_t. Observe that, by the definition of π_t,

$$E\, c^b(\pi_t(x)) = \frac{1-\epsilon}{t-1}\sum_{\tau=1}^{t-1} E\, c^b(\pi^{\lambda_\tau}_\tau(x)) + \frac{\epsilon}{K}\sum_{a=1}^{K} E\, c^b(a) \le \frac{1}{t-1}\sum_{\tau=1}^{t-1} E\, c^b(\pi^{\lambda_\tau}_\tau(x)) + \epsilon, \tag{8}$$

so it suffices to upper bound $\frac{1}{t-1}\sum_{\tau=1}^{t-1} E\, c^b(\pi^{\lambda_\tau}_\tau(x))$. By Lemma 5 below, we have

$$\frac{1}{t-1}\sum_{\tau=1}^{t-1} E\, c^b(\pi^{\lambda_\tau}_\tau(x)) - \min_{\lambda \in \Lambda}\frac{1}{t-1}\sum_{\tau=1}^{t-1} E\, c^b(\pi^{\lambda}_\tau(x)) \le 16\sqrt{\frac{K\ln\frac{8 n_b|\Lambda|}{\delta}}{(t-1)\epsilon}}. \tag{9}$$

In the above inequality, we can bound the individual summands of the second term by Lemma 3: $E\, c^b(\pi^\lambda_\tau(x)) - E\, c^b(\pi^*(x)) \le \bar G_{\tau-1}(\lambda, \alpha, \Delta)$. Therefore, rewriting Equation (9), we get that

$$\frac{1}{t-1}\sum_{\tau=1}^{t-1} E\, c^b(\pi^{\lambda_\tau}_\tau(x)) - \min_{\lambda \in \Lambda}\frac{1}{t-1}\sum_{\tau=1}^{t-1}\big(E\, c^b(\pi^*(x)) + \bar G_{\tau-1}(\lambda, \alpha, \Delta)\big) \le 16\sqrt{\frac{K\ln\frac{8 n_b|\Lambda|}{\delta}}{(t-1)\epsilon}}.$$

Combining the above with Equation (8), we get that

$$E\, c^b(\pi_t(x)) - E\, c^b(\pi^*(x)) \le \epsilon + \min_{\lambda \in \Lambda}\frac{1}{t-1}\sum_{\tau=1}^{t-1}\bar G_{\tau-1}(\lambda, \alpha, \Delta) + 16\sqrt{\frac{K\ln\frac{8 n_b|\Lambda|}{\delta}}{(t-1)\epsilon}}.$$

Summing the above inequality over t in {2, ..., n_b}, and using that E c^b(π_1(x)) − E c^b(π^*(x)) ≤ 1, we get

$$\sum_{t=1}^{n_b}\big[E\, c^b(\pi_t(x)) - E\, c^b(\pi^*(x))\big] \le n_b\epsilon + \sum_{t=2}^{n_b}\min_{\lambda \in \Lambda}\frac{1}{t-1}\sum_{\tau=1}^{t-1}\bar G_{\tau-1}(\lambda, \alpha, \Delta) + 16\sum_{t=2}^{n_b}\sqrt{\frac{K\ln\frac{8 n_b|\Lambda|}{\delta}}{(t-1)\epsilon}} + 1.$$

Observe that, for any functions $\{N_t(\cdot)\}_{t=1}^{n_b}$, we have $\sum_{t=1}^{n_b}\min_{\lambda \in \Lambda} N_t(\lambda) \le \min_{\lambda \in \Lambda}\big[\sum_{t=1}^{n_b} N_t(\lambda)\big]$. We further collect the coefficients on the $\bar G_\tau(\lambda, \alpha, \Delta)$ terms and use the upper bound $\sum_{t=\tau+1}^{n_b}\frac{1}{t-1} \le \ln(e n_b)$, valid for all τ ≥ 0, getting

$$\sum_{t=1}^{n_b}\big[E\, c^b(\pi_t(x)) - E\, c^b(\pi^*(x))\big] \le n_b\epsilon + \ln(e n_b)\cdot\min_{\lambda \in \Lambda}\sum_{\tau=0}^{n_b-1}\bar G_\tau(\lambda, \alpha, \Delta) + 16\sum_{t=2}^{n_b}\sqrt{\frac{K\ln\frac{8 n_b|\Lambda|}{\delta}}{(t-1)\epsilon}} + 1. \tag{10}$$

Therefore, we get:

$$\begin{aligned} R^b\big(\langle x_t, a_t\rangle_{t=1}^{n_b}\big) &\le \frac{1}{n_b}\sum_{t=1}^{n_b}\big[E\, c^b(\pi_t(x)) - E\, c^b(\pi^*(x))\big] + \sqrt{\frac{2\ln\frac{8|\Pi|}{\delta}}{n_b}}\\ &\le \epsilon + \frac{\ln(e n_b)}{n_b}\min_{\lambda \in \Lambda}\sum_{\tau=0}^{n_b-1}\bar G_\tau(\lambda, \alpha, \Delta) + \frac{16}{n_b}\sum_{t=2}^{n_b}\sqrt{\frac{K\ln\frac{8 n_b|\Lambda|}{\delta}}{(t-1)\epsilon}} + \sqrt{\frac{2\ln\frac{8|\Pi|}{\delta}}{n_b}} + \frac{1}{n_b}\\ &\le \epsilon + \frac{\ln(e n_b)}{n_b}\min_{\lambda \in \Lambda}\sum_{t=1}^{n_b}\bar G_t(\lambda, \alpha, \Delta) + \frac{16}{n_b}\sum_{t=2}^{n_b}\sqrt{\frac{K\ln\frac{8 n_b|\Lambda|}{\delta}}{(t-1)\epsilon}} + \sqrt{\frac{2\ln\frac{8|\Pi|}{\delta}}{n_b}} + \frac{\ln(e^2 n_b)}{n_b}\\ &\le \epsilon + 3\sqrt{\frac{\ln\frac{8 n_b|\Pi|}{\delta}}{n_b}} + 32\sqrt{\frac{K\ln\frac{8 n_b|\Lambda|}{\delta}}{n_b\epsilon}} + \frac{\ln(e n_b)}{n_b}\min_{\lambda \in \Lambda}\sum_{t=1}^{n_b}\bar G_t(\lambda, \alpha, \Delta), \end{aligned}$$

where the first inequality is from Equations (6) and (7) after dividing both sides by n_b; the second inequality is from Equation (10); the third inequality is from the fact that $\bar G_0(\lambda, \alpha, \Delta) \le 1$; and the fourth inequality is from algebra and our assumption that δ < 1/e. The theorem follows.

The following lemma upper bounds the excess cost of $\pi^\lambda_t = \operatorname*{argmin}_{\pi \in \Pi}\hat L_{\lambda,t-1}(\pi)$.

Lemma 3.
For every $t \in \{2,\dots,n_b\}$, there exists an event $E_t$ with probability at least $1 - \frac{\delta}{4 n_b}$, such that the following holds for all $\lambda$ in $\Lambda$:
\[ \mathbb{E} c_b(\pi^\lambda_t(x)) - \mathbb{E} c_b(\pi^*(x)) \le \bar{G}_{t-1}(\lambda,\alpha,\Delta). \]
Proof.
Define event $E_t$ as: for all $\pi$ in $\Pi$,
\[
\Big| \big[\lambda(t-1)\,\mathbb{E} c_b(\pi(x)) + (1-\lambda) n_s\,\mathbb{E} c_s(\pi(x))\big] - \big[\lambda(t-1)\,\hat{\mathbb{E}}_{t-1}\hat{c}(\pi(x)) + (1-\lambda) n_s\,\hat{\mathbb{E}}_{S} c^s(\pi(x))\big] \Big| \le 2\sqrt{\Big(\frac{\lambda^2 K(t-1)}{\epsilon} + (1-\lambda)^2 n_s\Big)\ln\frac{8 n_b|\Pi|}{\delta}} + \Big(\frac{\lambda K}{\epsilon} + (1-\lambda)\Big)\ln\frac{8 n_b|\Pi|}{\delta}. \tag{11}
\]
In other words,
\[ \big| \Gamma_{\lambda,t-1}(\pi) - \hat{\Gamma}_{\lambda,t-1}(\pi) \big| \le V_{t-1}(\lambda). \tag{12} \]
For a fixed $\pi$, applying Lemma 2 with $X_i = (1-\lambda)\big(c^s_{n_b+i}(\pi(x_{n_b+i})) - \mathbb{E} c_s(\pi(x))\big)$ for $i$ in $\{1,\dots,n_s\}$, $X_i = \lambda\big(\hat{c}_{i-n_s}(\pi(x_{i-n_s})) - \mathbb{E} c_b(\pi(x))\big)$ for $i$ in $\{n_s+1,\dots,n_s+t-1\}$, and $M = (1-\lambda) + \frac{\lambda K}{\epsilon}$, and noting that $|X_i| \le M$ almost surely for all $i$, $\mathbb{E}[X_i^2 \mid \mathcal{B}_{i-1}] \le (1-\lambda)^2$ for $i$ in $\{1,\dots,n_s\}$, and $\mathbb{E}[X_i^2 \mid \mathcal{B}_{i-1}] \le \lambda^2\,\mathbb{E}\big[\tfrac{1}{p_{i-n_s,\pi(x_{i-n_s})}} \mid \mathcal{B}_{i-1}\big] \le \frac{\lambda^2 K}{\epsilon}$ for $i$ in $\{n_s+1,\dots,n_s+t-1\}$, we get that Equation (11) holds for $\pi$ with probability $1 - \frac{\delta}{4 n_b|\Pi|}$. Therefore, by a union bound over all $\pi$ in $\Pi$, $E_t$ happens with probability at least $1 - \frac{\delta}{4 n_b}$. We henceforth condition on $E_t$ happening.

By the optimality of $\pi^\lambda_t$, $\hat{L}_{\lambda,t-1}(\pi^\lambda_t) \le \hat{L}_{\lambda,t-1}(\pi^*)$. Equivalently, $\hat{\Gamma}_{\lambda,t-1}(\pi^\lambda_t) \le \hat{\Gamma}_{\lambda,t-1}(\pi^*)$. Combining with Equation (12) applied to $\pi^\lambda_t$ and $\pi^*$, we get that $\Gamma_{\lambda,t-1}(\pi^\lambda_t) - \Gamma_{\lambda,t-1}(\pi^*) \le 2 V_{t-1}(\lambda)$.

Using the $(\alpha,\Delta)$-similarity of $D_s$ to $D_b$, and Lemma 4 below, we get that
\[ \big(\mathbb{E} c_b(\pi^\lambda_t(x)) - \mathbb{E} c_b(\pi^*(x))\big)\big(\lambda(t-1) + (1-\lambda) n_s\alpha\big) \le 2 V_{t-1}(\lambda) + (1-\lambda) n_s\Delta. \]
Therefore, by the definition of $G_t(\lambda,\alpha,\Delta)$, we have that $\mathbb{E} c_b(\pi^\lambda_t(x)) - \mathbb{E} c_b(\pi^*(x)) \le G_{t-1}(\lambda,\alpha,\Delta)$. Combining the above with the fact that $\mathbb{E} c_b(\pi^\lambda_t(x)) - \mathbb{E} c_b(\pi^*(x)) \le 1$, the lemma follows.

Lemma 4. If $D_s$ is $(\alpha,\Delta)$-similar to $D_b$, then for any policy $\pi$,
\[ \big(\mathbb{E} c_b(\pi(x)) - \mathbb{E} c_b(\pi^*(x))\big)\big(\lambda t + (1-\lambda) n_s\alpha\big) \le \big(\Gamma_{\lambda,t}(\pi) - \Gamma_{\lambda,t}(\pi^*)\big) + (1-\lambda) n_s\Delta. \]
Proof.
Using the definition of $\Gamma_{\lambda,t}$, we have that
\[ \Gamma_{\lambda,t}(\pi) - \Gamma_{\lambda,t}(\pi^*) = \lambda t\big(\mathbb{E} c_b(\pi(x)) - \mathbb{E} c_b(\pi^*(x))\big) + (1-\lambda) n_s\big(\mathbb{E} c_s(\pi(x)) - \mathbb{E} c_s(\pi^*(x))\big). \]
Applying Lemma 1, the right hand side is at least
\[ \lambda t\big(\mathbb{E} c_b(\pi(x)) - \mathbb{E} c_b(\pi^*(x))\big) + (1-\lambda) n_s\Big(\alpha\big(\mathbb{E} c_b(\pi(x)) - \mathbb{E} c_b(\pi^*(x))\big) - \Delta\Big). \]
The lemma follows immediately by algebra.

We also bound the cost overhead of selecting $\lambda_t$ from $\Lambda$, compared to using the best $\lambda$ in hindsight. Define the progressive validation error using $\{\pi^\lambda_\tau\}_{\tau\le t}$ as $\hat{C}_{\lambda,t} = \sum_{\tau=1}^{t}\hat{c}_\tau(\pi^\lambda_\tau(x_\tau))$, and its expectation as $C_{\lambda,t} = \sum_{\tau=1}^{t}\mathbb{E} c_b(\pi^\lambda_\tau(x))$.

Lemma 5.
For every $t \in \{2,\dots,n_b\}$, there exists an event $F_t$ with probability at least $1 - \frac{\delta}{4 n_b}$, such that $\lambda_t$ has the following property:
\[ \sum_{\tau=1}^{t-1}\mathbb{E} c_b(\pi^{\lambda_t}_\tau(x)) - \min_{\lambda\in\Lambda}\sum_{\tau=1}^{t-1}\mathbb{E} c_b(\pi^{\lambda}_\tau(x)) \le 16\sqrt{\frac{K(t-1)}{\epsilon}\ln\frac{8 n_b|\Lambda|}{\delta}}. \]
Proof.
Define event $F_t$ as: for all $\lambda$ in $\Lambda$,
\[ \big| C_{\lambda,t-1} - \hat{C}_{\lambda,t-1} \big| \le 4\sqrt{\frac{K(t-1)}{\epsilon}\ln\frac{8 n_b|\Lambda|}{\delta}} + \frac{K}{\epsilon}\ln\frac{8 n_b|\Lambda|}{\delta}. \]
In other words,
\[ \Big| \sum_{\tau=1}^{t-1}\mathbb{E} c_b(\pi^\lambda_\tau(x)) - \sum_{\tau=1}^{t-1}\hat{c}_\tau(\pi^\lambda_\tau(x_\tau)) \Big| \le 4\sqrt{\frac{K(t-1)}{\epsilon}\ln\frac{8 n_b|\Lambda|}{\delta}} + \frac{K}{\epsilon}\ln\frac{8 n_b|\Lambda|}{\delta}. \tag{13} \]
For a fixed $\lambda$, by Lemma 2, taking $X_i = 0$ for $i$ in $[n_s]$, $X_i = \hat{c}_{i-n_s}(\pi^\lambda_{i-n_s}(x_{i-n_s})) - \mathbb{E} c_b(\pi^\lambda_{i-n_s}(x))$ for $i$ in $\{n_s+1,\dots,n_s+t-1\}$, and $M = \frac{K}{\epsilon}$, and noting that $|X_i| \le M$ almost surely for all $i$, $\mathbb{E}[X_i^2 \mid \mathcal{B}_{i-1}] = 0$ for $i$ in $[n_s]$, and $\mathbb{E}[X_i^2 \mid \mathcal{B}_{i-1}] \le \mathbb{E}\big[\tfrac{1}{p_{i-n_s,\pi(x_{i-n_s})}} \mid \mathcal{B}_{i-1}\big] \le \frac{K}{\epsilon}$ for $i$ in $\{n_s+1,\dots,n_s+t-1\}$, we get that Equation (13) holds with probability $1 - \frac{\delta}{4 n_b|\Lambda|}$. By a union bound over all $\lambda$ in $\Lambda$, the probability of $F_t$ is at least $1 - \frac{\delta}{4 n_b}$. We henceforth condition on $F_t$ happening.

By the optimality of $\lambda_t$, we know that for all $\lambda$ in $\Lambda$, $\hat{C}_{\lambda_t,t-1} \le \hat{C}_{\lambda,t-1}$. Combining the above inequality with Equation (13) applied on $\lambda_t$ and $\lambda$, we get that
\[ C_{\lambda_t,t-1} - C_{\lambda,t-1} \le 8\sqrt{\frac{K(t-1)}{\epsilon}\ln\frac{8 n_b|\Lambda|}{\delta}} + \frac{2K}{\epsilon}\ln\frac{8 n_b|\Lambda|}{\delta}. \]
In addition, observe that $C_{\lambda_t,t-1} - C_{\lambda,t-1} \le t-1$, as $c_b \in [0,1]^K$ with probability 1. Combining the above facts with Lemma 6 below (applied with $a = t-1$ and $b = \frac{64 K}{\epsilon}\ln\frac{8 n_b|\Lambda|}{\delta}$), we have that
\[ C_{\lambda_t,t-1} - C_{\lambda,t-1} \le \min\Big( 8\sqrt{\tfrac{K(t-1)}{\epsilon}\ln\tfrac{8 n_b|\Lambda|}{\delta}} + \tfrac{2K}{\epsilon}\ln\tfrac{8 n_b|\Lambda|}{\delta},\ t-1 \Big) \le 16\sqrt{\frac{K(t-1)}{\epsilon}\ln\frac{8 n_b|\Lambda|}{\delta}}. \]
In other words,
\[ \sum_{\tau=1}^{t-1}\mathbb{E} c_b(\pi^{\lambda_t}_\tau(x)) - \sum_{\tau=1}^{t-1}\mathbb{E} c_b(\pi^{\lambda}_\tau(x)) \le 16\sqrt{\frac{K(t-1)}{\epsilon}\ln\frac{8 n_b|\Lambda|}{\delta}}. \]
As the above holds for any $\lambda$ in $\Lambda$, the lemma follows.

Lemma 6.
For any positive real numbers $a, b > 0$, we have $\min(\sqrt{ab} + b,\ a) \le 2\sqrt{ab}$.
Proof.
The lemma follows from the straightforward calculation below:
\[ \min(\sqrt{ab} + b,\ a) \le \min(\sqrt{ab},\ a) + \min(b,\ a) \le \sqrt{ab} + \sqrt{ab} = 2\sqrt{ab}. \]

F. Proof of Theorem 2
We begin with some notation used in our analysis. Throughout this section, we let $\pi^* = \operatorname{argmin}_{\pi\in\Pi}\mathbb{E} c_s(\pi(x))$, the optimal policy in $\Pi$ with respect to $D_s$. Define
\[ H_t(\lambda,\alpha,\Delta) = \frac{\lambda\Delta + 2 W_t(\lambda)}{(1-\lambda) + \lambda\alpha}. \]
For a policy $\pi$, $\lambda$ in $[0,1]$ and $t_e$ for $e \in \{1,2,\dots,E\}$, define the $\lambda$-weighted empirical cost of $\pi$ as
\[ \hat{M}_{\lambda,t_e}(\pi) = \lambda\,\hat{\mathbb{E}}_{t_e}\hat{c}(\pi(x)) + (1-\lambda)\,\hat{\mathbb{E}}_{S_{\mathrm{tr}}} c^s(\pi(x)) \]
and its expectation
\[ M_{\lambda,t_e}(\pi) = \lambda\,\mathbb{E} c_b(\pi(x)) + (1-\lambda)\,\mathbb{E} c_s(\pi(x)). \]
For convenience, we define $\tilde{n}_s = n_s/(E+1)$. Denote by $(x_{n_b+1}, c^s_{n_b+1}),\dots,(x_{n_b+\tilde{n}_s}, c^s_{n_b+\tilde{n}_s})$ an enumeration of the elements of $S_{\mathrm{tr}}$. Define the filtration $\{\mathcal{B}_t\}_{t=0}^{n_b+\tilde{n}_s}$ as follows: $\mathcal{B}_0$ is the trivial $\sigma$-algebra, and
\[ \mathcal{B}_t = \begin{cases} \sigma\big((x_{n_b+1}, c^s_{n_b+1}),\dots,(x_{n_b+t}, c^s_{n_b+t})\big), & t \in \{1,\dots,\tilde{n}_s\}, \\ \sigma\big(S_{\mathrm{tr}}, (x_1,\hat{c}_1),\dots,(x_{t-\tilde{n}_s},\hat{c}_{t-\tilde{n}_s})\big), & t \in \{\tilde{n}_s+1,\dots,\tilde{n}_s+n_b\}. \end{cases} \]
For the reader's convenience, we also recall our earlier notation:
\[ W_t(\lambda) = 2\sqrt{\Big(\frac{\lambda^2 K}{t\epsilon} + \frac{(1-\lambda)^2(E+1)}{n_s}\Big)\ln\frac{8 E|\Pi|}{\delta}} + \Big(\frac{\lambda K}{t\epsilon} + \frac{(1-\lambda)(E+1)}{n_s}\Big)\ln\frac{8 E|\Pi|}{\delta}. \]
Proof of Theorem 2.
Define event $I$ as: for all $\pi$ in $\Pi$,
\[ \Big| \frac{1}{n_b}\sum_{t=1}^{n_b}\mathbb{E}[c^s(\pi(x_t)) \mid x_t] - \mathbb{E} c_s(\pi(x)) \Big| \le \sqrt{\frac{2\ln\frac{8|\Pi|}{\delta}}{n_b}}. \tag{14} \]
Intuitively, under this event our regret measure for a policy is close to its expected value. By Hoeffding's inequality and a union bound, $I$ happens with probability at least $1 - \frac{\delta}{4}$. Specifically, on event $I$, as $\pi^*$ is the policy in $\Pi$ that minimizes $\mathbb{E} c_s(\pi(x))$, we have that for all $\pi$ in $\Pi$,
\[ \mathbb{E} c_s(\pi^*(x)) - \frac{1}{n_b}\sum_{t=1}^{n_b}\mathbb{E}[c^s(\pi(x_t)) \mid x_t] \le \mathbb{E} c_s(\pi(x)) - \frac{1}{n_b}\sum_{t=1}^{n_b}\mathbb{E}[c^s(\pi(x_t)) \mid x_t] \le \sqrt{\frac{2\ln\frac{8|\Pi|}{\delta}}{n_b}}. \]
Therefore,
\[ \mathbb{E} c_s(\pi^*(x)) - \min_{\pi\in\Pi}\frac{1}{n_b}\sum_{t=1}^{n_b}\mathbb{E}[c^s(\pi(x_t)) \mid x_t] \le \sqrt{\frac{2\ln\frac{8|\Pi|}{\delta}}{n_b}}. \tag{15} \]
Recall that the randomized policy $\pi_t: \mathcal{X}\to\Delta^{K-1}$ is defined as
\[ \pi_t(x) = (1-\epsilon)\,\pi^{\lambda_e}_e(x) + \frac{\epsilon}{K}\mathbf{1}_K \tag{16} \]
for $e$ in $\{1,2,\dots,E-1\}$, $t \in (t_e, t_{e+1}]$, and $\pi_t(x) = \frac{1}{K}\mathbf{1}_K$ for all $t$ in $(t_0, t_1]$. (Note that previously we used a weight of $\lambda$ per example, whereas now it is $\lambda$ per source, implying that the per-example weights are $\lambda/t_e$ and $(1-\lambda)/\tilde{n}_s$ after $t_e$ CB examples.) With a slight abuse of notation, denote by $\mathbb{E} c_s(\pi_t(x)) := \mathbb{E}_{(x,c^s),\, a\sim\pi_t(x)} c^s(a)$. Observe that for $t$ in $(t_e, t_{e+1}]$, $\mathbb{E}\big[\mathbb{E}[c^s(a_t) \mid x_t] \mid \mathcal{B}_{\tilde{n}_s+t-1}\big] = \mathbb{E} c_s(\pi_t(x))$. Define event $J$ as:
\[ \Big| \frac{1}{n_b}\sum_{t=1}^{n_b}\mathbb{E}[c^s(a_t) \mid x_t] - \frac{1}{n_b}\sum_{t=1}^{n_b}\mathbb{E} c_s(\pi_t(x)) \Big| \le \sqrt{\frac{2\ln\frac{8}{\delta}}{n_b}}. \tag{17} \]
By Azuma's inequality, $J$ happens with probability at least $1 - \frac{\delta}{4}$.

Denote by $\mathcal{E}$ the event that the events $E_e$, $F_e$ defined in Lemmas 7 and 8 (both given below) and $I$, $J$ hold simultaneously for all $e$. By a union bound over all the $E_e$, $F_e$'s and $I$, $J$, event $\mathcal{E}$ happens with probability at least $1 - \delta$.
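The per-source weighting in $\hat{M}_{\lambda,t_e}$ can be sketched numerically. In the snippet below (illustrative values only; `ips_costs` stands in for the inverse-propensity-scored costs $\hat{c}_t(\pi(x_t))$ on the bandit rounds and `sup_costs` for the supervised costs $c^s(\pi(x))$ on $S_{\mathrm{tr}}$), each source contributes one empirical average, weighted $\lambda$ and $1-\lambda$ respectively:

```python
import numpy as np

def weighted_cost(lmbda, ips_costs, sup_costs):
    """lambda-weighted empirical cost M-hat: lambda times the average
    IPS-estimated bandit cost plus (1 - lambda) times the average
    supervised cost (one weight per source, as in the note above)."""
    return lmbda * np.mean(ips_costs) + (1 - lmbda) * np.mean(sup_costs)

# Toy IPS estimates: observed cost / propensity on the chosen arm, zero
# on rounds where the policy's arm was not chosen (values are made up).
ips_costs = np.array([0.0, 2.5, 0.8, 0.0])
sup_costs = np.array([0.2, 0.4, 0.1, 0.3, 0.5])

for lmbda in (0.0, 0.5, 1.0):
    m = weighted_cost(lmbda, ips_costs, sup_costs)
    # A convex combination always lies between the two source averages.
    assert min(ips_costs.mean(), sup_costs.mean()) <= m <= max(ips_costs.mean(), sup_costs.mean())
```

At $\lambda = 1$ only the bandit data matters and at $\lambda = 0$ only the supervised data does, which is why searching over $\lambda \in \Lambda$ interpolates between the two sources.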
We henceforth condition on $\mathcal{E}$ happening. Consider $e$ in $\{1,2,\dots,E-1\}$ and $t$ in $(t_e, t_{e+1}]$. We now upper bound the expected excess cost of using the randomized prediction $\pi_t$ as defined in Equation (16). By the definition of $\pi_t$, we get
\[ \mathbb{E} c_s(\pi_t(x)) = (1-\epsilon)\,\mathbb{E} c_s(\pi^{\lambda_e}_e(x)) + \frac{\epsilon}{K}\sum_{a=1}^{K}\mathbb{E} c_s(a) \le \mathbb{E} c_s(\pi^{\lambda_e}_e(x)) + \epsilon. \tag{18} \]
Now, combining the above inequality with Lemmas 7 and 8, we have that for all $t$ in $(t_e, t_{e+1}]$,
\[ \mathbb{E} c_s(\pi_t(x)) \le \min_{\lambda\in\Lambda}\mathbb{E} c_s(\pi^\lambda_e(x)) + \epsilon + \sqrt{\frac{2\ln\frac{8 E|\Lambda|}{\delta}}{\tilde{n}_s}} \le \mathbb{E} c_s(\pi^*(x)) + \epsilon + \sqrt{\frac{2\ln\frac{8 E|\Lambda|}{\delta}}{\tilde{n}_s}} + \min_{\lambda\in\Lambda} H_{t_e}(\lambda,\alpha,\Delta). \]
Summing the above inequality over all $t = t_1+1,\dots,n_b$, grouping by epoch $e$, and using the fact that $0 \le \mathbb{E} c_s(\pi_t(x)) \le 1$ for $t \le t_1 = 2$, we have
\[ \sum_{t=t_1+1}^{n_b}\mathbb{E} c_s(\pi_t(x)) \le n_b\,\mathbb{E} c_s(\pi^*(x)) + 2 + n_b\epsilon + (n_b-2)\sqrt{\frac{2\ln\frac{8 E|\Lambda|}{\delta}}{\tilde{n}_s}} + \sum_{e=1}^{E-1}\sum_{t=t_e+1}^{t_{e+1}}\min_{\lambda\in\Lambda} H_{t_e}(\lambda,\alpha,\Delta). \]
Dividing both sides by $n_b$ and some algebra yields
\[ \frac{1}{n_b}\sum_{t=1}^{n_b}\mathbb{E} c_s(\pi_t(x)) \le \frac{2}{n_b} + \mathbb{E} c_s(\pi^*(x)) + \epsilon + \sqrt{\frac{2\ln\frac{8 E|\Lambda|}{\delta}}{\tilde{n}_s}} + \frac{1}{n_b}\sum_{e=1}^{E-1}\sum_{t=t_e+1}^{t_{e+1}}\min_{\lambda\in\Lambda} H_{t_e}(\lambda,\alpha,\Delta). \]
Note that for every $t$ in $[t_e+1, t_{e+1}]$, as $t \le t_{e+1} = 2 t_e$, $W_t(\lambda) \ge \frac{1}{2} W_{t_e}(\lambda)$, hence $H_t(\lambda,\alpha,\Delta) \ge \frac{1}{2} H_{t_e}(\lambda,\alpha,\Delta)$. Therefore, the right hand side can be further upper bounded by
\[ \frac{2}{n_b} + \mathbb{E} c_s(\pi^*(x)) + \epsilon + \sqrt{\frac{2\ln\frac{8 E|\Lambda|}{\delta}}{\tilde{n}_s}} + \frac{2}{n_b}\sum_{e=1}^{E-1}\sum_{t=t_e+1}^{t_{e+1}}\min_{\lambda\in\Lambda} H_t(\lambda,\alpha,\Delta) \le \frac{2}{n_b} + \mathbb{E} c_s(\pi^*(x)) + \epsilon + \sqrt{\frac{2\ln\frac{8 E|\Lambda|}{\delta}}{\tilde{n}_s}} + \min_{\lambda\in\Lambda}\frac{2}{n_b}\sum_{t=1}^{n_b} H_t(\lambda,\alpha,\Delta),
\]
where the inequality is from the fact that for any set of functions $\{N_t(\cdot)\}_{t=1}^{n_b}$, $\sum_{t=1}^{n_b}\min_{\lambda\in\Lambda} N_t(\lambda) \le \min_{\lambda\in\Lambda}\big[\sum_{t=1}^{n_b} N_t(\lambda)\big]$. To summarize, we have that
\[ \frac{1}{n_b}\sum_{t=1}^{n_b}\mathbb{E} c_s(\pi_t(x)) - \mathbb{E} c_s(\pi^*(x)) \le \frac{2}{n_b} + \epsilon + \sqrt{\frac{2\ln\frac{8 E|\Lambda|}{\delta}}{\tilde{n}_s}} + \min_{\lambda\in\Lambda}\frac{2}{n_b}\sum_{t=1}^{n_b} H_t(\lambda,\alpha,\Delta). \]
Combining with Equations (15) and (17) and using some algebra, we have
\[ \frac{1}{n_b}\sum_{t=1}^{n_b}\mathbb{E}[c^s(a_t) \mid x_t] - \min_{\pi\in\Pi}\frac{1}{n_b}\sum_{t=1}^{n_b}\mathbb{E}[c^s(\pi(x_t)) \mid x_t] \le \epsilon + 3\sqrt{\frac{2\ln\frac{8|\Pi|}{\delta}}{n_b}} + \sqrt{\frac{2\ln\frac{8 E|\Lambda|}{\delta}}{\tilde{n}_s}} + \min_{\lambda\in\Lambda}\frac{2}{n_b}\sum_{t=1}^{n_b} H_t(\lambda,\alpha,\Delta). \]
The theorem follows from the definition of $\tilde{n}_s$.

Lemma 7.
For every $e$, there exists an event $E_e$ with probability at least $1 - \frac{\delta}{4E}$, on which for all $\lambda$ in $\Lambda$, the excess cost of $\pi^\lambda_e$ can be bounded as:
\[ \mathbb{E} c_s(\pi^\lambda_e(x)) - \mathbb{E} c_s(\pi^*(x)) \le H_{t_e}(\lambda,\alpha,\Delta). \]
Proof.
We first show the following concentration inequality: with probability $1 - \frac{\delta}{4E}$, for all $\pi$ in $\Pi$,
\[ \big| \hat{M}_{\lambda,t_e}(\pi) - M_{\lambda,t_e}(\pi) \big| \le W_{t_e}(\lambda). \tag{19} \]
To show the above statement, in light of the definitions of $\hat{M}$, $M$ and $W$, it suffices to show that for every $\pi$ in $\Pi$, with probability $1 - \frac{\delta}{4E|\Pi|}$, we have
\[ \Big| \sum_{t=1}^{t_e}\frac{\lambda}{t_e}\big[\hat{c}_t(\pi(x_t)) - \mathbb{E} c_b(\pi(x))\big] + \sum_{(x,c^s)\in S_{\mathrm{tr}}}\frac{1-\lambda}{\tilde{n}_s}\big[c^s(\pi(x)) - \mathbb{E} c_s(\pi(x))\big] \Big| \le 2\sqrt{\Big(\frac{\lambda^2 K}{t_e\epsilon} + \frac{(1-\lambda)^2}{\tilde{n}_s}\Big)\ln\frac{8 E|\Pi|}{\delta}} + \Big(\frac{\lambda K}{t_e\epsilon} + \frac{1-\lambda}{\tilde{n}_s}\Big)\ln\frac{8 E|\Pi|}{\delta}. \tag{20} \]
For a fixed $\pi$, applying Lemma 2 with $X_i = \frac{1-\lambda}{\tilde{n}_s}\big(c^s_{n_b+i}(\pi(x_{n_b+i})) - \mathbb{E} c_s(\pi(x))\big)$ for $i$ in $\{1,\dots,\tilde{n}_s\}$, $X_i = \frac{\lambda}{t_e}\big(\hat{c}_{i-\tilde{n}_s}(\pi(x_{i-\tilde{n}_s})) - \mathbb{E} c_b(\pi(x))\big)$ for $i$ in $\{\tilde{n}_s+1,\dots,\tilde{n}_s+t_e\}$, and $M = \frac{1-\lambda}{\tilde{n}_s} + \frac{\lambda K}{t_e\epsilon}$, and noting that $|X_i| \le M$ almost surely for all $i$, $\mathbb{E}[X_i^2 \mid \mathcal{B}_{i-1}] \le \big(\frac{1-\lambda}{\tilde{n}_s}\big)^2$ for $i$ in $\{1,\dots,\tilde{n}_s\}$, and $\mathbb{E}[X_i^2 \mid \mathcal{B}_{i-1}] \le \big(\frac{\lambda}{t_e}\big)^2\,\mathbb{E}\big[\tfrac{1}{p_{i-\tilde{n}_s,\pi(x_{i-\tilde{n}_s})}} \mid \mathcal{B}_{i-1}\big] \le \frac{\lambda^2 K}{t_e^2\epsilon}$ for $i$ in $\{\tilde{n}_s+1,\dots,\tilde{n}_s+t_e\}$, we get that Equation (20) holds with probability $1 - \frac{\delta}{4E|\Pi|}$. Therefore, by a union bound over all $\pi$ in $\Pi$, Equation (19) holds for all $\pi$ simultaneously with probability $1 - \frac{\delta}{4E}$.

As $\pi^\lambda_e$ minimizes $\hat{M}_{\lambda,t_e}(\pi)$ over $\Pi$, we have that $\hat{M}_{\lambda,t_e}(\pi^\lambda_e) \le \hat{M}_{\lambda,t_e}(\pi^*)$. Combining this fact with Equation (19) on $\pi^\lambda_e$ and $\pi^*$, we get
\[ M_{\lambda,t_e}(\pi^\lambda_e) - M_{\lambda,t_e}(\pi^*) \le 2 W_{t_e}(\lambda). \]
Observe that Lemma 1 and the definition of $M$ imply that the left hand side of the above is at least
\[ \big(\lambda\alpha + (1-\lambda)\big)\big(\mathbb{E} c_s(\pi^\lambda_e(x)) - \mathbb{E} c_s(\pi^*(x))\big) - \lambda\Delta. \]
The lemma statement follows by straightforward algebra and the definition of $H_t(\cdot,\cdot,\cdot)$.

Lemma 8.
For every $e$, there exists an event $F_e$ with probability at least $1 - \frac{\delta}{4E}$, on which the policy $\pi^{\lambda_e}_e$ satisfies
\[ \mathbb{E} c_s(\pi^{\lambda_e}_e(x)) - \min_{\lambda\in\Lambda}\mathbb{E} c_s(\pi^\lambda_e(x)) \le \sqrt{\frac{2\ln\frac{8 E|\Lambda|}{\delta}}{\tilde{n}_s}}. \]
Proof.
Given any $\lambda$ in $\Lambda$, as $S^{\mathrm{val}}_e$ is a sample independent of $\pi^\lambda_e$, we have by Hoeffding's inequality that with probability $1 - \frac{\delta}{4E|\Lambda|}$,
\[ \big| \hat{\mathbb{E}}_{S^{\mathrm{val}}_e} c^s(\pi^\lambda_e(x)) - \mathbb{E} c_s(\pi^\lambda_e(x)) \big| \le \sqrt{\frac{\ln\frac{8 E|\Lambda|}{\delta}}{2\tilde{n}_s}}. \tag{21} \]
By a union bound, with probability $1 - \frac{\delta}{4E}$, Equation (21) holds for all $\lambda$ in $\Lambda$ simultaneously. Observe that by the optimality of $\lambda_e$, for all $\lambda$ in $\Lambda$, we have that $\hat{\mathbb{E}}_{S^{\mathrm{val}}_e} c^s(\pi^{\lambda_e}_e(x)) \le \hat{\mathbb{E}}_{S^{\mathrm{val}}_e} c^s(\pi^\lambda_e(x))$. Combining with Equation (21) on $\lambda$ and $\lambda_e$, we get
\[ \mathbb{E} c_s(\pi^{\lambda_e}_e(x)) \le \mathbb{E} c_s(\pi^\lambda_e(x)) + \sqrt{\frac{2\ln\frac{8 E|\Lambda|}{\delta}}{\tilde{n}_s}}. \]
The lemma follows as the above inequality holds for every $\lambda$ in $\Lambda$.

G. Fixed choice of λ in Algorithm 2

Recall that
\[ H_t(\lambda,\alpha,\Delta) = \frac{\lambda\Delta + 2 W_t(\lambda)}{(1-\lambda) + \lambda\alpha}, \qquad W_t(\lambda) = 2\sqrt{\Big(\frac{\lambda^2 K}{t\epsilon} + \frac{(1-\lambda)^2(E+1)}{n_s}\Big)\ln\frac{8 E|\Pi|}{\delta}} + \Big(\frac{\lambda K}{t\epsilon} + \frac{(1-\lambda)(E+1)}{n_s}\Big)\ln\frac{8 E|\Pi|}{\delta}. \]
In addition, recall that $\frac{2}{n_b}\sum_{t=1}^{n_b} H_t(\lambda,\alpha,\Delta)$ is the last term of the regret bound in Equation (5). We have the following proposition, showing that a single choice of $\lambda$ ensures a small upper bound on this term when $\alpha = 1$ and $\Delta = 0$.

Proposition 2.
Suppose $\lambda^\star = \frac{n_b\epsilon}{n_s K + n_b\epsilon}$. Then
\[ \frac{2}{n_b}\sum_{t=1}^{n_b} H_t(\lambda^\star, 1, 0) = \tilde{O}\Big( \sqrt{\frac{K\ln\frac{8 E|\Pi|}{\delta}}{K n_s + n_b\epsilon}} + \frac{K\ln\frac{8 E|\Pi|}{\delta}}{K n_s + n_b\epsilon} \Big). \]
Proof.
We have the following:
\begin{align*}
\frac{2}{n_b}\sum_{t=1}^{n_b} H_t(\lambda^\star, 1, 0) &= \frac{4}{n_b}\sum_{t=1}^{n_b} W_t(\lambda^\star) \\
&= \frac{4}{n_b}\sum_{t=1}^{n_b}\Big[ 2\sqrt{\Big(\frac{(\lambda^\star)^2 K}{t\epsilon} + \frac{(1-\lambda^\star)^2(E+1)}{n_s}\Big)\ln\frac{8 E|\Pi|}{\delta}} + \Big(\frac{\lambda^\star K}{t\epsilon} + \frac{(1-\lambda^\star)(E+1)}{n_s}\Big)\ln\frac{8 E|\Pi|}{\delta} \Big] \\
&\le 8\sqrt{\Big(\frac{(\lambda^\star)^2 K(\ln n_b + 1)}{n_b\epsilon} + \frac{(1-\lambda^\star)^2(E+1)}{n_s}\Big)\ln\frac{8 E|\Pi|}{\delta}} + 4\Big(\frac{\lambda^\star K(\ln n_b + 1)}{n_b\epsilon} + \frac{(1-\lambda^\star)(E+1)}{n_s}\Big)\ln\frac{8 E|\Pi|}{\delta} \\
&= \tilde{O}\Big( \sqrt{\Big(\frac{(\lambda^\star)^2 K}{n_b\epsilon} + \frac{(1-\lambda^\star)^2}{n_s}\Big)\ln\frac{8 E|\Pi|}{\delta}} + \Big(\frac{\lambda^\star K}{n_b\epsilon} + \frac{1-\lambda^\star}{n_s}\Big)\ln\frac{8 E|\Pi|}{\delta} \Big) \\
&= \tilde{O}\Big( \sqrt{\frac{K\ln\frac{8 E|\Pi|}{\delta}}{K n_s + n_b\epsilon}} + \frac{K\ln\frac{8 E|\Pi|}{\delta}}{K n_s + n_b\epsilon} \Big),
\end{align*}
where the first equality is from the fact that $H_t(\lambda^\star, 1, 0) = 2 W_t(\lambda^\star)$; the second equality is from the definition of $W_t(\lambda)$; the inequality is by Jensen's inequality together with $\sum_{t=1}^{n_b}\frac{1}{t} \le \ln n_b + 1$; the third equality is by dropping logarithmic factors; the last equality is by plugging in the choice of $\lambda^\star$.

H. Approximate optimality of Λ = {0, 1}

In this section, we show Proposition 3, which justifies that using $\Lambda = \{0, 1\}$ achieves a near-optimal regret bound in Theorem 1. Recall that
\[ V_t(\lambda) = 2\sqrt{\Big(\frac{\lambda^2 K t}{\epsilon} + (1-\lambda)^2 n_s\Big)\ln\frac{8 n_b|\Pi|}{\delta}} + \Big(\frac{\lambda K}{\epsilon} + (1-\lambda)\Big)\ln\frac{8 n_b|\Pi|}{\delta}, \qquad G_t(\lambda,\alpha,\Delta) = \frac{(1-\lambda) n_s\Delta + 2 V_t(\lambda)}{\lambda t + (1-\lambda) n_s\alpha}. \]

Proposition 3. $\min_{\lambda\in\{0,1\}} G_t(\lambda,\alpha,\Delta) \le \sqrt{2}\,\min_{\lambda\in[0,1]} G_t(\lambda,\alpha,\Delta)$.

Proof.
For any $\lambda$ in $[0,1]$, we give a lower bound on $V_t(\lambda)$. Note that by the fact that $\sqrt{a+b} \ge \frac{1}{\sqrt{2}}(\sqrt{a} + \sqrt{b})$, we have that
\[ V_t(\lambda) \ge \frac{1}{\sqrt{2}}\big[(1-\lambda) V_t(0) + \lambda V_t(1)\big]. \tag{22} \]
Therefore,
\begin{align*}
G_t(\lambda,\alpha,\Delta) &\ge \frac{(1-\lambda) n_s\Delta + 2\cdot\frac{1}{\sqrt{2}}\big[(1-\lambda) V_t(0) + \lambda V_t(1)\big]}{\lambda t + (1-\lambda) n_s\alpha} \\
&\ge \frac{1}{\sqrt{2}}\cdot\frac{(1-\lambda) n_s\Delta + 2\big[(1-\lambda) V_t(0) + \lambda V_t(1)\big]}{\lambda t + (1-\lambda) n_s\alpha} \\
&\ge \frac{1}{\sqrt{2}}\min\Big( \frac{n_s\Delta + 2 V_t(0)}{n_s\alpha},\ \frac{2 V_t(1)}{t} \Big) = \frac{1}{\sqrt{2}}\min_{\lambda\in\{0,1\}} G_t(\lambda,\alpha,\Delta),
\end{align*}
where the first inequality is from Equation (22), the second inequality is by algebra, and the third inequality is by the quasi-concavity of the preceding ratio with respect to $\lambda$, together with the fact that the minimum of a quasi-concave function over a convex set is attained at its boundary. As the above holds for any $\lambda$ in $[0,1]$, the proposition follows.

I. Combining Two Sources in Supervised Learning
Proposition 4.
For every policy class $\Pi$ of VC dimension $d$, every $\Delta \in [0, \frac14]$, and $m, n \ge d$: for any algorithm that outputs a policy $\hat{\pi}$ based on $n$ examples from $D_1$ and $m$ examples from $D_2$, there exists a pair of distributions $(D_1, D_2)$ such that $D_2$ is $(1,\Delta)$-similar to $D_1$, and
\[ \mathbb{E}\big[ \mathbb{E}_{D_1} c(\hat{\pi}(x)) - \min_{\pi\in\Pi}\mathbb{E}_{D_1} c(\pi(x)) \big] \ge \frac{1}{16}\min\Big( \sqrt{d/m} + 4\Delta,\ \sqrt{d/n} \Big), \]
where the outer expectation is over the draws from $D_1$ and $D_2$, and the algorithm's randomness.

The lower bound proof follows a similar strategy as in the classical classification setting, with slight modifications. In order to prove the bound we make use of Assouad's Lemma. The statement below follows exactly from Yu (Yu, 1997).
Theorem 3 (Assouad's Lemma). Let $d \ge 1$ be an integer and let $\mathcal{F}_d = \{P_\tau \mid \tau \in \{-1,+1\}^d\}$ be a class of $2^d$ probability measures indexed by binary strings of length $d$. Write $\tau \sim \tau'$ if $\tau$ and $\tau'$ differ in only one coordinate, and write $\tau \sim_j \tau'$ when that coordinate is the $j$th. Suppose that there are $d$ pseudo-distances on $\mathcal{D}$ such that for any $x, y \in \mathcal{D}$,
\[ \rho(x, y) = \sum_{j=1}^{d}\rho_j(x, y), \]
and further that, if $\tau \sim_j \tau'$,
\[ \rho_j(\theta(P_\tau), \theta(P_{\tau'})) \ge \alpha. \]
Then for any estimator $\hat{\theta}$,
\[ \max_\tau \mathbb{E}_\tau\,\rho(\hat{\theta}, \theta(P_\tau)) \ge \frac{d\,\alpha}{2}\min\{\|P_\tau \wedge P_{\tau'}\| : \tau \sim \tau'\}. \]

Proof of Proposition 4.
Since the VC dimension of $\Pi$ is $d$, there exists a set $A = \{x_1,\dots,x_d\}$ such that for any binary sequence $\tau \in \{-1,+1\}^d$, there exists some function $\pi$ in $\Pi$ such that for all $i \in \{1,\dots,d\}$, $\pi(x_i) = \tau_i$. We now define two distributions for a binary classification problem, but express them as joint distributions over $(x, c)$ in order to be consistent with the rest of the paper, with the understanding that $c$ will have a zero-one cost structure (where $c = (c(-1), c(+1))$ represents the costs of predicting labels $-1$ and $+1$). The two distributions $D^1_\tau$ and $D^2_\tau$ are each uniform over the $x_i$, and the conditional distributions over costs are given by
\[ D^1_\tau\big((1,0) \mid x = x_i\big) = \tfrac12 + \tau_i\epsilon, \qquad D^1_\tau\big((0,1) \mid x = x_i\big) = \tfrac12 - \tau_i\epsilon, \]
and
\[ D^2_\tau\big((1,0) \mid x = x_i\big) = \tfrac12 + \tau_i\epsilon - \tau_i\beta, \qquad D^2_\tau\big((0,1) \mid x = x_i\big) = \tfrac12 - \tau_i\epsilon + \tau_i\beta, \]
where $\epsilon \in [0, \frac14]$ and $\beta \in [0, \frac{\Delta}{2}]$ are free parameters to be determined later. Clearly $D^2_\tau$ is $(1,\Delta)$-strongly similar to $D^1_\tau$, and therefore $D^2_\tau$ is $(1,\Delta)$-similar to $D^1_\tau$. This can be seen by taking $c_+ = \mathbb{E}_{D^1_\tau}[c \mid x]$ and $c_- = \mathbb{E}_{D^2_\tau}[c \mid x] - \mathbb{E}_{D^1_\tau}[c \mid x]$, and observing that $|\mathbb{E}\, c_-(\pi(x))| \le \beta \le \Delta/2$. Define $P_\tau$ as the product distribution of $n$ copies of $D^1_\tau$ and $m$ copies of $D^2_\tau$, which is the joint distribution of the input examples to the algorithm. In addition, define $\theta(P_\tau) = \tau$, and $\rho_j(\tau, \tau') = I(\tau_j \ne \tau'_j)$. Therefore, $\rho(\tau, \tau') = \sum_{j=1}^{d}\rho_j(\tau, \tau') = \sum_{j=1}^{d} I(\tau_j \ne \tau'_j)$ is the Hamming distance between $\tau$ and $\tau'$. Suppose the algorithm returns a policy $\hat{\pi}$; then there exists a binary sequence $\hat{\tau}$ such that $\hat{\pi}(x_i) = \hat{\tau}_i$ for all $i$. Observe that for any $\tau$, by the definition of $D^1_\tau$,
\[ \mathbb{E}_{D^1_\tau} c(\hat{\pi}(x)) - \min_{\pi\in\Pi}\mathbb{E}_{D^1_\tau} c(\pi(x)) = \frac{2\epsilon}{d}\,\rho(\hat{\tau}, \tau). \]
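The identity just stated can be checked directly: predicting action $a$ at $x_i$ incurs expected cost $\frac12 - a\,\tau_i\epsilon$, so each disagreeing coordinate contributes excess cost $2\epsilon$, averaged over the uniform choice of $x_i$. The sketch below (with hypothetical values of $d$, $\epsilon$, and random bit strings; `eps0` plays the role of $\epsilon$) verifies this:

```python
import numpy as np

rng = np.random.default_rng(1)
d, eps0 = 8, 0.1
tau = rng.choice([-1, 1], size=d)      # true bit string indexing D^1_tau
tau_hat = rng.choice([-1, 1], size=d)  # the learner's predictions

# E[c(a) | x_i] = 1/2 - a * tau_i * eps0 for actions a in {-1, +1},
# and x is uniform over {x_1, ..., x_d}.
cost_hat = np.mean(0.5 - tau_hat * tau * eps0)
cost_opt = np.mean(0.5 - tau * tau * eps0)   # predicting tau_i is optimal
hamming = np.sum(tau != tau_hat)

# Excess cost = (2 * eps0 / d) * Hamming(tau_hat, tau), as in the text.
assert np.isclose(cost_hat - cost_opt, 2 * eps0 / d * hamming)
```

This is what lets the proof translate a Hamming-distance lower bound from Assouad's Lemma into an excess-cost lower bound.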
By Assouad's Lemma (Theorem 3), with the $\rho$ defined above and $\alpha = 1$, we have that there exists some $\tau$ in $\{-1,+1\}^d$ such that
\[ \mathbb{E}\,\rho(\hat{\tau}, \tau) \ge \frac{d}{2}\min\big\{\|P_\tau \wedge P_{\tau'}\| : \tau \sim \tau'\big\}. \]
This immediately implies that, for any algorithm that returns $\hat{\pi}$, there exists some $\tau$ such that
\[ \mathbb{E}_{P_\tau}\big[ \mathbb{E}_{D^1_\tau} c(\hat{\pi}(x)) - \min_{\pi\in\Pi}\mathbb{E}_{D^1_\tau} c(\pi(x)) \big] \ge \epsilon\,\min\big\{\|P_\tau \wedge P_{\tau'}\| : \tau \sim \tau'\big\}. \]
What remains is to bound $\|P_\tau \wedge P_{\tau'}\|$ for binary sequences differing in one coordinate. Recall that $\|P_\tau \wedge P_{\tau'}\| = 1 - \|P_\tau - P_{\tau'}\|_{\mathrm{TV}}$. We will bound $\|P_\tau - P_{\tau'}\|_{\mathrm{TV}} \le H(P_\tau, P_{\tau'})$ using the Hellinger distance, with the convention $H^2(P_\tau, P_{\tau'}) = \sum_z\big(\sqrt{P_\tau(z)} - \sqrt{P_{\tau'}(z)}\big)^2$. We recall that $P_\tau$ is in fact the product distribution of $n$ copies of $D^1_\tau$ and $m$ copies of $D^2_\tau$. For product measures,
\[ H^2(P_\tau, P_{\tau'}) \le \sum_{i=1}^{n} H^2(D^1_\tau, D^1_{\tau'}) + \sum_{i=1}^{m} H^2(D^2_\tau, D^2_{\tau'}), \]
so we need to bound the Hellinger distance for the unbiased and biased distributions.

Bounding the Hellinger Distance
We have
\begin{align*}
H^2(D^1_\tau, D^1_{\tau'}) &= \sum_{i=1}^{d}\sum_{c\in\{(0,1),(1,0)\}}\Big( \sqrt{D^1_\tau(x_i, c)} - \sqrt{D^1_{\tau'}(x_i, c)} \Big)^2 \\
&= \frac{1}{d}\sum_{c\in\{(0,1),(1,0)\}}\Big( \sqrt{D^1_\tau(c \mid x_i)} - \sqrt{D^1_{\tau'}(c \mid x_i)} \Big)^2 \\
&= \frac{1}{d}\,H^2\Big( B\big(\tfrac12+\epsilon\big),\ B\big(\tfrac12-\epsilon\big) \Big) \le \frac{8\epsilon^2}{d},
\end{align*}
where the first equality is by the definition of $H$; the second equality is from the fact that there is exactly one $i$ such that $\sum_{c}\big(\sqrt{D^1_\tau(x_i, c)} - \sqrt{D^1_{\tau'}(x_i, c)}\big)^2$ is nonzero, as $\tau \sim \tau'$; in the right hand side of the third equality, $B(p)$ is the Bernoulli distribution with mean parameter $p$. Similarly,
\[ H^2(D^2_\tau, D^2_{\tau'}) \le \frac{8(\epsilon-\beta)^2}{d}. \]
Hence,
\[ H^2(P_\tau, P_{\tau'}) \le \frac{8}{d}\big( n\epsilon^2 + m(\epsilon-\beta)^2 \big). \]
Therefore, we have
\[ \max_\tau \mathbb{E}_{P_\tau}\big[ \mathbb{E}_{D^1_\tau} c(\hat{\pi}(x)) - \min_{\pi\in\Pi}\mathbb{E}_{D^1_\tau} c(\pi(x)) \big] \ge \epsilon\Big[ 1 - \sqrt{\tfrac{8}{d}\big( n\epsilon^2 + m(\epsilon-\beta)^2 \big)} \Big]. \tag{23} \]
We now consider two separate cases regarding the settings of $m$, $n$, $d$ and $\Delta$.

Case 1: $\sqrt{d/n} \le \sqrt{d/m} + 4\Delta$. In this case, we let $\epsilon = \frac18\sqrt{d/n}$ and $\beta = \max\big(0, \frac18(\sqrt{d/n} - \sqrt{d/m})\big) \in [0, \frac{\Delta}{2}]$. This gives that the right hand side of Equation (23) is at least $\frac18\sqrt{d/n}\cdot\frac12 = \frac{1}{16}\sqrt{d/n}$.

Case 2: $\sqrt{d/n} > \sqrt{d/m} + 4\Delta$. In this case, we let $\epsilon = \frac18\sqrt{d/m} + \frac{\Delta}{2}$ and $\beta = \frac{\Delta}{2}$. This gives that the right hand side of Equation (23) is at least $\big(\frac18\sqrt{d/m} + \frac{\Delta}{2}\big)\cdot\frac12 = \frac{1}{16}\big(\sqrt{d/m} + 4\Delta\big)$.

In summary, for any choice of $m, n, d, \Delta$ with $m, n \ge d$ and $\Delta \in [0, \frac14]$, we can find a pair of distributions $(D^1_\tau, D^2_\tau)$ such that
\[ \mathbb{E}_{P_\tau}\big[ \mathbb{E}_{D^1_\tau} c(\hat{\pi}(x)) - \min_{\pi\in\Pi}\mathbb{E}_{D^1_\tau} c(\pi(x)) \big] \ge \frac{1}{16}\min\Big( \sqrt{d/n},\ \sqrt{d/m} + 4\Delta \Big). \]
The proposition follows.
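The Bernoulli Hellinger bound used in the case analysis above, $H^2\big(B(\frac12+\epsilon), B(\frac12-\epsilon)\big) \le 8\epsilon^2$ for $\epsilon \in [0, \frac14]$ (under the convention $H^2(P,Q) = \sum_z(\sqrt{P(z)}-\sqrt{Q(z)})^2$; the constant 8 is one sufficient choice), can be checked numerically:

```python
import numpy as np

def hellinger_sq(p, q):
    """Squared Hellinger distance between Bernoulli(p) and Bernoulli(q),
    with the convention H^2(P, Q) = sum_z (sqrt(P(z)) - sqrt(Q(z)))^2."""
    return (np.sqrt(p) - np.sqrt(q)) ** 2 + (np.sqrt(1 - p) - np.sqrt(1 - q)) ** 2

# Check H^2(B(1/2 + eps), B(1/2 - eps)) <= 8 * eps^2 over a grid of eps.
for eps in np.linspace(0.0, 0.25, 26):
    h2 = hellinger_sq(0.5 + eps, 0.5 - eps)
    assert h2 <= 8 * eps ** 2 + 1e-12
```

The same bound applied with margin $\epsilon - \beta$ gives the estimate for the biased source $D^2_\tau$.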
J. Additional Experimental Results
We give a collection of cumulative distribution functions (CDFs) of our algorithms, evaluated against different exploration parameters $\epsilon$, noise settings, and warm-start ratios, below:

1. In Figures 4 to 9, we present CDFs where all CB algorithms use $\epsilon$-greedy with the first exploration parameter setting.
2. In Figures 10 to 15, we present CDFs where all CB algorithms use $\epsilon$-greedy with the second exploration parameter setting.
3. In Figures 16 to 21, we present CDFs where all CB algorithms use $\epsilon$-greedy with the third exploration parameter setting.

The general trends in this more detailed comparison are similar to those observed in Section 5. For less noisy settings and small warm-start ratios, Sup-Only is a particularly difficult baseline, since it performs no exploration. With extreme noise, Bandit-Only is the best, as the supervised examples are misleading. ARRoW-CB competes well at both extremes, while outperforming all methods in several regimes. Importantly, ARRoW-CB always beats the other methods that attempt to leverage both sources of data, and avoids the significant performance hit from relying on the wrong data source in either of the two extreme cases.
[Figures 4 to 21 each show a grid of CDF plots. In every panel, the x axis represents scores and the y axis the CDF values, and the compared algorithms are ARRoW-CB with $|\Lambda|=8$, ARRoW-CB with $|\Lambda|=2$, Sup-Only, Bandit-Only, Sim-Bandit, and Majority. Rows correspond to noise settings (Noiseless, and UAR/MAJ/CYC corruption at various probabilities $p$, up to $p = 1$); columns correspond to warm-start ratios.]

Figure 4: Cumulative distribution functions (CDFs) for all evaluated algorithms, against different noise settings and warm-start ratios in $\{2.875, 5.75, 11.5\}$, for the first $\epsilon$-greedy exploration parameter.

Figure 5: The same comparison for warm-start ratios in $\{23.0, 46.0, 92.0\}$.

Figure 6: The same comparison for warm-start ratio $184.0$.

Figures 7 to 9: The same three warm-start ratio groups, for the remaining noise settings (including noising probability $p = 1$).

Figures 10 to 15: The same plots as Figures 4 to 9, with the second $\epsilon$-greedy exploration parameter.

Figures 16 to 21: The same plots as Figures 4 to 9, with the third $\epsilon$-greedy exploration parameter.