Online Learning via Offline Greedy Algorithms: Applications in Market Design and Optimization
Rad Niazadeh, Negin Golrezaei, Joshua Wang, Fransisca Susan, Ashwinkumar Badanidiyuru
aa r X i v : . [ c s . L G ] F e b Online Learning via Offline Greedy Algorithms:Applications in Market Design and Optimization
Rad Niazadeh
Chicago Booth School of Business, Operations Management, [email protected]
Negin Golrezaei
MIT Sloan School of Management, Operations Management, [email protected]
Joshua Wang
Google Research Mountain View, [email protected]
Fransisca Susan
MIT Sloan School of Management, Operations Management, [email protected]
Ashwinkumar Badanidiyuru
Google Research Mountain View, [email protected]
Motivated by online decision-making in time-varying combinatorial environments, we study the problemof transforming offline algorithms to their online counterparts. We focus on offline combinatorial problemsthat are amenable to a constant factor approximation using a greedy algorithm that is robust to localerrors. For such problems, we provide a general framework that efficiently transforms offline robust greedyalgorithms to online ones using Blackwell approachability. We show that the resulting online algorithms have O ( √ T ) (approximate) regret under the full information setting. We further introduce a bandit extension ofBlackwell approachability that we call Bandit Blackwell approachability. We leverage this notion to transformgreedy robust offline algorithms into a O ( T / ) (approximate) regret in the bandit setting. Demonstratingthe flexibility of our framework, we apply our offline-to-online transformation to several problems at theintersection of revenue management, market design, and online optimization, including product rankingoptimization in online platforms, reserve price optimization in auctions, and submodular maximization. Weshow that our transformation, when applied to these applications, leads to new regret bounds or improvesthe current known bounds. Key words : Blackwell approachability, Offline-to-online, No-regret, Submodular maximization, Productranking, Reserve price optimization.
1. Introduction
We study the problem of designing efficient no-regret—also known as vanishing regret —online learn-ing algorithms in complex real-world environments, where the underlying decision-making process iscombinatorial in nature. In such environments, a decision-maker (learner) needs to experiment withexponentially many options whose rewards exhibit non-trivial and non-linear structures. Exploit- iazadeh et al.: Online Learning via Offline Greedy ing such structures to design efficient online learning algorithms is challenging as the underlyingoffline problems can indeed be NP-hard. Such offline problems can only admit approximation algo-rithms. Therefore, any efficient online learning algorithm can only hope to obtain vanishing regretwith respect to an in-hindsight approximately optimal benchmark. This motivates our key researchquestions: How can one transform existing approximation algorithms for NP-hard offline problems to van-ishing regret learning algorithms for a wide range of combinatorial environments? Can we effi-ciently exploit the combinatorial reward structure to eliminate the necessity of experimentingwith exponentially many arms?
To answer these questions, we consider an adversarial online learning setting. In every round t ,the learner takes an action by choosing a (feasible) point z t among possibly exponentially manychoices, and receives a reward of f t ( z t ) . The adversarially chosen reward function f t ∈ F , whichis unknown to the learner at the time of action, can be non-linear in action z t . We are interestedin settings where the offline problem is NP-hard, and amenable to a γ -approximation algorithm,where γ ∈ (0 , . In the offline problem, the reward function f ∈ F is fully known, and the goal isto choose a feasible point z that maximizes the obtained reward f ( z ) .We focus on the prevalent class of offline approximation algorithms with a greedy nature. Roughlyspeaking, such approximation algorithms build up a solution stage by stage, choosing the nextstage that offers the most local improvement with respect to a metric. We require the greedyapproximation algorithms to be robust to local errors in every stage; for details, see Definition 8.Several combinatorial problems studied in operations research and computer science, ranging fromclassic submodular maximization problems to more recently studied optimization problems relatedto market design and revenue management, admit such robust greedy approximation algorithms.For details, see Section 6.The problem of transforming offline problems to online learning algorithm is studied byKalai and Vempala (2005) and Dudík et al. (2017) when the learner can solve the offline problemefficiently. However, the approaches in these works fail when the learner only has access to anapproximate solutions to the offline problem. This drawback is alleviated by Kakade et al. (2009)who study the offline to online transformation when (i) the offline problem is NP-hard but amenableto approximation, and (ii) the reward function is linear in the learner’s action. While the proposedapproach in Kakade et al. (2009) provides a general purpose offline-to-online blackbox reduction, itcrucially uses the linearity of the reward function. As a result, it cannot be applied to our settingswith nonlinear reward functions. We highlight that as shown by Hazan and Koren (2016), for a Our framework can also be applied to polynomially solvable problems. For these problems, the approximation factor γ = 1 . iazadeh et al.: Online Learning via Offline Greedy general offline problem, there may not exist an efficient offline-to-online transformation, justifyingour assumption on the type of approximation algorithms we study.We now summarize our main contributions. A framework for offline-to-online transformation.
We design a unified framework to trans-form robust greedy approximation algorithms to efficient online learning algorithms when the rewardfunctions are not necessarily linear. We consider two online learning settings: full information and bandit . In the full information setting, the learner observes function f t after taking action z t , andin the bandit setting, the learner only observes the obtained reward f t ( z t ) .For both settings, our proposed transformation relies on the celebrated Blackwell approacha-bility theorem due to Blackwell (1956). The Blackwell approachability theorem is concerned witha two-player repeated game with a vector payoff, and presents a strategy under which the time-averaged vector payoff approaches some target set S that satisfies certain properties. As it is shownin Abernethy et al. (2011), there is a strong connection between Blackwell approachability anddesigning vanishing regret learning algorithms. In fact, for online linear optimization, they showthat any strategy/algorithm for Blackwell approachability can be transformed to a vanishing regretlearning algorithm and vice versa. Online learning algorithms using Blackwell strategies.
In this work, as one of our maincontributions, we show that the transformation of Blackwell strategies to online vanishing regretalgorithms is also possible for combinatorial non-linear learning settings whose underlying offlineproblem is NP-hard and admits a robust greedy γ -approximation algorithm. Specifically, we showthat if the offline problem is Blackwell reducible (see Definitions 10 and 12), then we can design avanishing regret learning algorithm by running a Blackwell algorithm for each stage (subproblem)of the offline greedy algorithm. In every round, these Blackwell algorithms are run sequentially tobuild up the learner’s action stage by stage. This allows the Blackwell algorithms to communicatewith each other in a specific pattern dictated by the offline greedy algorithm. Thanks to suchcommunication between Blackwell algorithms and the robustness of the offline greedy algorithm tolocal errors, the resulting online algorithm has a vanishing γ -regret. In fact, for the full informationsetting, we show that this transformation leads to an algorithm with O ( N √ T ) γ -regret, where N is the number of subproblems in the offline algorithm.The bandit setting turns out to be much trickier as the Blackwell algorithms cannot all obtaintheir desired feedback to update their course of actions over time. To resolve this, we introduce anovel bandit version of the sequential Blackwell game, that we call bandit Blackwell . In this version,the player/algorithm does not obtain any feedback on his payoff unless he agrees to pay a certain Our regret bounds also depend on the diameter of vector payoff of the Blackwell games and their dimension; seeTheorems 2 and 3. iazadeh et al.:
Online Learning via Offline Greedy cost. When the player agrees to pay such a cost, an extra “exploration” will be done, and he obtainsan unbiased estimator of his payoff. Surprisingly, we show that in the bandit Blackwell sequentialgames, approachability is feasible. We further give a tight lower bound on the rate of convergencefor bandit Blackwell sequential games.Leveraging our notions of bandit Blackwell sequential games and approachability, we present anoffline-to-online transformation in which N bandit Blackwell algorithms communicate with eachother to build up a solution. To mimic the extra exploration step of bandit Blackwell games, we showhow this communication can be interrupted in a controlled way when one of the bandit algorithmsrequests acquiring feedback. We also show how the required unbiased estimator of the vector payoffcan be constructed. These pieces together give us the final bandit offline-to-online transformation.We show that this transformation leads to an online algorithm with O ( N T / ) γ -regret. Applications.
Finally, to demonstrate the generality and effectiveness of our framework, we applyour offline-to-online transformation to several problems at the intersection of revenue management,market design, and online optimization that have been proposed and studied in the literature. Inparticular, we consider problems of (i) optimizing product ranking, (ii) optimizing personalizedreserve prices in second price auctions, and (iii) Submodular Maximization (SM) in discrete andcontinuous domains (see Table 1). We show that in most cases, our transformations lead to new orimproved regret bounds. We emphasize that applications presented in this work are only selectivesamples of applications that can fit to our framework. The fact that we can obtain improved boundsfor these well-studied problems highlight the generality of the framework and its potentials to beapplied to other problems in the operations research domain, or even other domains of interest.In the following, we discuss our bounds in detail.
Product ranking optimization.
Online marketplaces have the opportunity of optimizingthe ranking of displayed products in order to improve revenue, shape the demand, and reduceusers’ search cost (see, for example, Athey and Ellison (2011), Kempe et al. (2003), Ursu (2016),Aouad and Segev (2015), and Derakhshan et al. (2018)). Inspired by this, we study the productranking problem in the online adversarial setting with the objective of maximizing user engagement.To express user engagement as a function of the ranking over the products, we use the model pro-posed by Asadpour et al. (2020), which is a generalization of the model presented by Ferreira et al.(2019). Under this model, the offline ranking problem can be written as maximizing sequentialsubmodular functions; see Section 6.1 for the definition of the sequential submodular functions. Byapplying our framework to this problem, we get O ( n √ T log n ) -regret and O (cid:0) n / (log n ) / T / (cid:1) -regret in full information and bandit settings, respectively. We note that our work is the first onethat studies the product ranking problem under the aforementioned model in an online adversarial iazadeh et al.: Online Learning via Offline Greedy setting. The offline PAC learning problem, which resembles aspects of the online learning in thestochastic setting, is studied by Ferreira et al. (2019) for a special case of our model. Optimizing personalized reserve prices.
Second price auctions with reserve prices are preva-lent in many marketplaces including online advertising markets, making them objects of bothwide practical relevance and scientific interest (see, for example, Hartline and Roughgarden (2009),Cesa-Bianchi et al. (2014), Beyhaghi et al. (2018), Roughgarden and Wang (2019), Golrezaei et al.(2019)). We study the online problem of optimizing personalized reserve prices, where buyers’ val-uations are chosen adversarially in every round. By applying our framework to this problem, weachieve O ( n √ T log T ) − regret in the full-information setting and O ( n / T / (log nT ) / ) − regretin the bandit setting. Our results match the previous bound for the full-information setting byRoughgarden and Wang (2019) who apply a slight variant of the Follow-the-Perturbed-Leader algo-rithm of Kalai and Vempala (2005) every round for each bidder; the bandit setting had not beenstudied prior to our work. We should note that in the special case with symmetric buyers and uni-form reserve prices (also known as anonymous reserve auction, cf. Alaei et al. (2019)), minimizingregret under stochastic bandit setting is studied in Cesa-Bianchi et al. (2014), in which they obtain ˜ O ( n √ T ) regret bound. Here, the offline problem of finding the uniform optimal reserve can be solvedexactly in polynomial time. Submodular maximization problems.
Many optimization problems that arise in the realworld, including revenue management problems, can be expressed as maximizing a submodularfunction. The notion of submodularity is commonly used to describe the diminishing return prop-erty in discrete and continuous domains. Examples include the welfare maximization problem (e.g.,Dobzinski and Schapira (2006) and Vondrák (2008)), capital budgeting with risk-averse investors(e.g., Weingartner (1967) and Ahmed and Atamtürk (2011)), and the problem of maximizing influ-ence through the network (e.g., Kempe et al. (2003) and Mossel and Roch (2010)).We apply our framework to the adversarial online submodular maximization problem. For theonline problem of maximizing non-monotone set submodular functions without any constraints, wetransform a variation of the bi-greedy offline algorithm by Buchbinder and Feldman (2018) usingour framework and obtain O ( nT / ) − regret in the full-information setting, matching the previousbound by Roughgarden and Wang (2018) who also take advantages of the bi-greedy offline algorithmof Buchbinder and Feldman (2018). Here, n is the number of coordinates. For the bandit setting,our transformation yields O ( nT / ) − regret. To the best of our knowledge, this is the first regretbound for the bandit setting of this challenging problem. PAC stands for probably approximately correct. iazadeh et al.:
Online Learning via Offline Greedy For the online problem of maximizing continuous submodular functions without any constraints,we transform a variation of the continuous bi-greedy algorithm by Niazadeh et al. (2018) and obtain O ( n √ T log T ) − regret in the online full-information setting. For the bandit setting, we obtain O ( nT / (log T ) / ) − regret when the continuous submodular functions is weak-DR. Our resultsfor weak-DR submodular functions trivially yield results for strong-DR submodular functions. Wehighlight that the notion of weak-DR submodularity is equivalent to continuous submodularityand is easier to satisfy than strong-DR submodularity, which additionally requires coordinate-wiseconcavity; see the definition of weak-DR and strong-DR submodular functions in Section 2. Our workis the first one that designs online algorithms for weak-DR submodular functions. Furthermore, ourbounds improve the previous bounds for strong-DR submodular functions by Thang and Srivastav(2019), which are O ( T / ) − regret and O ( T / ) − regret for the full-information and banditsettings, respectively. The aforementioned regret bounds are obtained using a variation of the Frank-Wolfe algorithm.For the online problem of maximizing set monotone submodular functions with cardinally con-straints with size k , we transform the offline greedy algorithm by Nemhauser et al. (1978), which isa (1 − /e ) − approximation, to yield O ( k √ T log n ) (1 − /e ) − regret in the online full-informationsetting, matching the bound by Streeter and Golovin (2008) who use a variation of the EXP3algorithm. Furthermore, our framework gives O ( kn (log n ) / T / ) (1 − /e ) − regret in the ban-dit setting, improving the previous bound of O ( k ( n log n ) / T / (log T ) ) (1 − /e ) − regret byStreeter and Golovin (2008, 2007) in the opaque feedback model, which is the limited feedbackmodel that is analog to our bandit feedback model under exploration. See Section 5 for more details. While the closely related work has already been discussed, in this section, broader related work willbe discussed.
Combinatorial learning.
Our work is related to the literature on online combinatorial learning.While in our work, we study the design of efficient online learning algorithms for combinatorialproblems whose loss function is not necessarily linear in the chosen action, the work on combinatoriallearning focuses on linear loss functions; see, for example, Abernethy et al. (2008), Uchiya et al.(2010), Cesa-Bianchi and Lugosi (2012), Audibert et al. (2014), Chen et al. (2013), Combes et al.(2015), and Zimmert et al. (2019). Here, the learner’s loss is a inner product of a d -dimensionalaction z t and loss vector a t . This line of work examines both the full-information and bandit settings.In the full-information setting, the learner observes the loss vector a t , while in the bandit setting,only the loss a Tt z t is observable. The standard exponentially weighted average forecaster obtains a We omit the dependence on the Lipschitz constant here. iazadeh et al.:
Online Learning via Offline Greedy Table 1 Our results for selective applications of our framework, compared to previously known results.
Online Full-Information Setting Online Bandit SettingApplication Approx Our γ -Regret The Best Our γ -Regret The BestFactor ( γ ) Bound Prior Bound Bound Prior BoundProduct Ranking Problem / - - O ( n √ T log n ) O (cid:16) n / (log n ) / T / (cid:17) Reserve Price Optimization / O ( n √ T log T ) ∗ - O ( n √ T log T ) O (cid:16) n / T / (log nT ) / (cid:17) Monotone Set SM − /e O (cid:0) k √ T log n (cid:1) † O (cid:16) k ( n log n ) / T / (log T ) (cid:17) † with Cardinality Constraints O ( k √ T log n ) O (cid:16) kn / (log n ) / T / (cid:17) Non-Monotone Set / O ( n √ T ) ‡ -SM Functions O ( n √ T ) O (cid:16) nT / (cid:17) Non-monotone Continuous / γ = 1 / , O ( T / ) § γ = 1 / , O ( T / ) § SM (Strong-DR) Functions O ( n √ T log T ) O ( nT / (log T ) / ) Non-monotone Continuous / - -SM (Weak-DR) Functions O ( n √ T log T ) O ( nT / (log T ) / ) ∗ Roughgarden and Wang (2019) † Streeter and Golovin (2008); ‡ Roughgarden and Wang(2018); § Thang and Srivastav (2019); tight O (cid:16) m q T log dm (cid:17) regret in the full-information setting, where m is the maximum ℓ -norm ofaction vectors (Audibert et al. (2014)). The state-of-the-art regret bound for the bandit setting is O (cid:16)q dm T log dm (cid:17) , as reported in several papers (Bubeck et al. (2012), Cesa-Bianchi and Lugosi(2012), Hazan and Karnin (2016)). Our framework achieves matching regret with respect to T inthe full-information setting without requiring the loss function to be linear. We get a worse regret(proportional to T / ) for the bandit setting to account for the non-linearity in loss functions. Online adversarial submodular optimization.
In the previous section, we review some of thework that is closely related to our results on maximizing submodular functions. Here, we reviewother work that studies the problem of maximizing submodular functions in an online adversarialsetting. Chen et al. (2018, 2019) use Frank-Wolfe method to design low-regret learning algorithmsfor maximizing monotone continuous strong-DR submodular functions with matroid constraints.Chen et al. (2018) (respectively Chen et al. (2019)) assume that the algorithm can access to T / exact (respectively T / stochastic) gradient evaluations in every round and design an algorithmwhose (1 − /e ) -regret is O ( √ T ) . The results of Chen et al. (2018, 2019) were later improved byZhang et al. (2019a) who design another Frank-Wolfe inspired learning algorithm that accesses toone stochastic gradient in each round and obtain O ( T / ) (1 − /e ) -regret. Zhang et al. (2019a)further present a learning algorithm in the bandit setting for the problem of maximizing monotone The dependency on the number of elements n is not well specified in this work. iazadeh et al.: Online Learning via Offline Greedy continuous strong-DR submodular functions subject to matroid constraints. Their algorithm obtain O ( T / ) ( − /e )-regret. In contrast, we do not impose the monotonicity condition when consid-ering continuous submodular functions and hence instead of having the approximation factor of − /e , we have an approximation factor of / . Note that / is a tight approximation ratio (unlessRP=NP). Our O (cid:0) n √ T log T (cid:1) -regret in the full-information setting and O (cid:0) nT / (log T ) / (cid:1) -regret in the bandit setting for non-monotone weak-DR submodular maximization imply the samebounds for non-monotone strong-DR submodular functions. Online stochastic submodular optimization.
Designing learning algorithms for maximizing stochas-tic monotone continuous strong-DR submodular functions has been studied in Hassani et al. (2017),Mokhtari et al. (2018), Hassani et al. (2019), and Zhang et al. (2019b). The best result for this set-ting is by Zhang et al. (2019b) who obtain O ( √ T ) (1 − /e ) -regret using a stochastic variant of theFrank-Wolfe method. Their algorithm also implies the same regret bound for monotone set sub-modular maximization, which matches our regret bound for maximizing monotone set submodularfunction in the adversarial setting. Blackwell approachability.
Several aspects of Blackwell sequential game, including the design ofefficient algorithms for Blackwell game with various information feedback structures, and the alter-native conditions for approachability, have been studied in the literature. In terms of feedbackstructures, the original Blackwell game develops efficient projection algorithm for games that returnthe adversary’s moves on each round. Mannor et al. (2011) develop simple and efficient algorithmsfor a variant of Blackwell game where on each round, the player only obtains a random signal whosedistribution depends on the action of the player and the adversary (as opposed to the action of theadversary). This variant is called Blackwell approachability with partial monitoring, and is furtherstudied in Mannor et al. (2014) and Kwon and Perchet (2017). In terms of equivalent conditionsfor approachability, aside from the original halfspace-satisfiability condition for approachability inBlackwell (1956), alternative conditions for approachability, including the response-satisfiability cri-teria that we use in this paper, can be found in Lehrer (2003), Vieille (1992), Spinat (2002), Milman(2006), and Even-Dar et al. (2009).Blackwell approachability has also been proven to be a quintessential tool in construct-ing online learning algorithms in various applications, as shown in Even-Dar et al. (2009),Mannor and Shimkin (2006), and Bernstein and Shimkin (2013). However, most applications do notinvolve NP-hard combinatorial problems, and use the best-fixed action in hindsight (no approxima-tion factor) as the benchmark for regret. Furthermore, they only create one Blackwell instance oneach round. In contrast, we create multiple Blackwell instances on each round because the problems All these results (Chen et al. (2018), Chen et al. (2019), Zhang et al. (2019a)) can be extended for monotone setsubmodular functions using rounding method and multi-linear extension as a bridge. iazadeh et al.:
Online Learning via Offline Greedy we consider have combinatorial nature and can only be solved efficiently in multiple stages. Fur-thermore, since we are solving NP-hard combinatorial problems with an intractable offline problem,we use a γ -approximation benchmark in our regret. Organization
In Section 2, we present the offline optimization problem, adversarial online learning framework,Blackwell sequential games, and definition of set and continuous submodular functions. Section3 presents the greedy approximation algorithm for the offline problem. In Sections 4 and 5, wepresent our offline-to-online transformation in the full information and bandit settings, respectively.Section 6 provides our regret bounds for the product ranking problem, optimizing reserve prices,and maximizing submodular functions. We conclude in Section 7.
2. Preliminaries and Notations
In this section, we formulate our adversarial online learning framework for approximation algorithms.We then give an overview of Blackwell approachability (Blackwell 1956), an important technical toolthat we use in this paper. We also provide a brief recap of a few definitions and results concerningmaximization of submodular functions, a canonical application for demonstrating our techniques.
Let F be a space of functions defined over a (discrete or continuous) domain D . Assume that F is closed under addition, i.e., for any two functions f , f ∈ F , we have f + f ∈ F . In the offlineoptimization problem , the problem of interest is finding a point z ∗ ∈ D such that z ∗ ∈ arg max z ∈C f ( z ) , (1)where f : D → [0 , , which belongs to F , is the objective function, and C ⊆ D is the feasible region. We further denote the optimal objective value of problem (1) by
OPT ; that is,
OPT = max z ∈C f ( z ) .We focus on maximization problems in this paper, but our techniques and results can easily beextended to minimization problems as well.We consider offline problems that are NP-hard to solve exactly, and at the same time are amenableto a γ -approximation algorithm for some constant γ ∈ (0 , . Definition 1 ( γ -approximation offline algorithm). An offline algorithm A for prob-lem (1) is a polynomial time γ -approximation algorithm if for every f ∈ F returns a feasible (possiblyrandomized) point ˆ z ∈ C in polynomial time in the size of the algorithm’s input, so that E [ f ( ˆ z )] ≥ γ · OPT . Here, the expectation is with respect to the randomness in algorithm A . The constant γ ∈ (0 , isreferred to as the approximation factor of algorithm A . For maximization problems, which are the focus of this paper, we only need our functions to be upper bounded bya constant. However, for simplicity, we assume that our functions are upper bounded by one. iazadeh et al.:
Online Learning via Offline Greedy Framework.
In the adversarial online learning version of problem (1), there is a learner, denotedby ALG, who plays T rounds of a sequential game against an adversary, denoted by ADV. In eachround t ∈ [ T ] , ADV picks a function f t ∈ F and simultaneously ALG takes an action by picking afeasible point z t ∈ C . Then, ALG obtains a reward equal to f t ( z t ) and receives a feedback concerningthis round. We highlight that unlike the offline optimization Problem (1), the unknown function f t is not observable to ALG when it chooses action z t , and he only knows that f t belongs to F .Furthermore, ALG picks his action at time t only given the feedback of previous rounds , , . . . , t − ,and in that sense, ALG is an online learner. ALG’s goal is to pick points { z t } Tt =1 given the feedbackof each round to maximize the accumulated reward P Tt =1 f t ( z t ) against a worst-case adversary ADV.In this paper, for the sake of brevity and simplicity, we limit our focus to worst-case obliviousadversaries, i.e., adversaries that pick the sequence f , f , . . . , f T upfront. Feedback structures.
We consider two feedback structures: (i) full information feedback , whereALG observes the entire function f t after choosing z t , and (ii) bandit feedback , where ALG onlyobserves the quantity f t ( z t ) after choosing z t . Let φ t be the feedback that ALG receives after picking z t . Then, ALG’s next action z t +1 is a function of the history H t , where H t , { ( z , φ ) , . . . , ( z t , φ t ) } .More formally, any learning algorithm ALG can be described as mappings { π ( t ) ALG } Tt =1 , where each π ( t ) ALG maps the history H t − to action z t for any t ∈ [ T ] . The mapping π ( t ) ALG can be either determin-istic or randomized.
Benchmarks and regret.
We would like to design polynomial-time online learning algorithms foroffline problems that are NP-hard to solve exactly. Thus, we use the adapted notion of approximateregret to quantify the performance of an online algorithm. This notion is the regret with respect to γ fraction of the objective value at the best in-hindsight point. The notion of γ -regret, which is formallydefined below, is common in the literature, see, for example, Kakade et al. 2009, Dudík et al. 2017,and Roughgarden and Wang 2018. At a high level, our goal is to take an efficient γ -approximationoffline algorithm, and transform it to an online algorithm ALG with a sublinear γ -regret. Definition 2 ( γ -regret). Let σ = { ( z t , f t ) } Tt =1 be a sequence of strategies realized by onlinelearner ALG and adversary ADV. Then, for any such σ and γ ∈ (0 , , γ -regret ( σ ) is defined as γ -regret ( σ ) , γ · max z ∈C T X t =1 f t ( z ) − T X t =1 f t ( z t ) . With a slight abuse of the notation, we denote the worst-case expected approximate regret of ALGagainst any (oblivious) adversary ADV as follows: γ -regret ( ALG ) , max { f t } Tt =1 n E [ γ -regret ( σ )] : σ = { ( z t , f t ) } Tt =1 , z t = ALG’s strategy at time t ∈ [ T ] o , where the expectation is with respect to any randomness in ALG. iazadeh et al.: Online Learning via Offline Greedy To transform offline approximation algorithms to efficient online learning algorithms, we take advan-tage of
Blackwell sequential games . A Blackwell sequential game is a repeated two-player gamecharacterized by a tuple ( X , Y , p ) . In this repeated game, X and Y are both compact convex setsrepresenting the players’ action spaces, and p : X × Y → R d is a biaffine vector payoff function. Moreover, parameter d ∈ N is known as the dimension of the Blackwell sequential game. The vec-tor payoff function p is assumed to be known by both players. The game is played in T rounds.Each round involves player 1 choosing an action x t ∈ X and player 2 choosing an action y t ∈ Y simultaneously. Both actions may depend on the observed history (( x , y ) , · · · , ( x t − , y t − )) . Thispair of actions produces the vector payoff p ( x t , y t ) . The objective of player 1 is to ensure that thetime-averaged payoff approaches a closed and convex target set S ⊆ R d , and the objective of player2 is to prevent this from happening. Definition 3 (Blackwell approachabilty).
In the Blackwell sequential game ( X , Y , p ) , atarget set S is g ( T ) -approachable if there exists a player 1 strategy such that for every player 2’sstrategy, the resulting sequence of actions satisfies d ∞ T T X t =1 p ( x t , y t ) , S ! ≤ g ( T ) , where for any vector w ∈ R d and set S ⊆ R d , d ∞ ( w , S ) , inf v ∈ S k w − v k ∞ is the ℓ ∞ -distance ofvector w from set S .In this paper, we focus on the ℓ ∞ norm rather than the usual ℓ norm since it is more suitablefor our applications. Our bounds on the approachability term g ( T ) will depend on the scale of theproblem, and more formally on the diameter D ( p ) of the payoff function p , defined asD ( p ) , max x ∈X , y ∈Y k p ( x , y ) k ∞ . (2)Ideally, player 1 aims to develop a strategy so that the term g ( T ) in Definition 3 converges to as T converges to + ∞ , and hence would be able to approach the target set S asymptotically. However,not every closed and convex target set S is approachable. To help with characterizing which setsare approachable, we additionally define the concept of response-satisfiablity . Definition 4 (Response-Satisfiable).
A closed and convex target set S is response-satisfiable in the Blackwell sequential game ( X , Y , p ) if for every player 2’s action y ∈ Y , there existsa player 1’s action x ∈ X such that the vector payoff falls into the target set, that is p ( x , y ) ∈ S . Function p ( · , · ) is biaffine if for any x ∈ X , p ( x , · ) is affine and for any y ∈ Y , p ( · , y ) is affine. iazadeh et al.: Online Learning via Offline Greedy Blackwell’s landmark result (Blackwell 1956) is an equivalence of (asymptotic) approachabilityand response-satisfiablity. We extend this result in the following theorem.
Theorem 1.
A closed and convex target set S is O ( D ( p ) log( d ) / T − / ) -approachable in theBlackwell sequential game ( X , Y , p ) if and only if the set S is response-satisfiable, where D ( p ) ,defined in Equation (2) , is the ℓ ∞ diameter of the payoff function p , and d is the dimension of thegame. We present a detailed proof of Theorem 1 in Section A.2 in the appendix, which is an adapta-tion of the original result of Blackwell (1956). The main difference between the Blackwell’s originalresult and Theorem 1 is how the distance between the average payoff and set S is computed. WhileBlackwell uses norm 2, we apply norm infinity. To account for this difference, we use the equiv-alence between Blackwell approachability and online linear optimization (Abernethy et al. 2011).This equivalence allows us to apply regret bounds for the latter problem that uses an arbitrarynorm to find new bounds for the approachability problem. The regret bounds (on online linearoptimization) can then be obtained via using Follow-the-Regularized-Leader (Shalev-Shwartz et al.2012) or Online Mirror Descent (Bubeck et al. 2015) algorithms.We finish this subsection by important remarks regarding our treatment of the Blackwellapproachability. Remark 1.
As our goal is designing polynomial-time online learning algorithms, we further usealgorithmic results in Even-Dar et al. 2009, and Abernethy et al. (2011) due to the equivalencebetween Blackwell approachability and full information adversarial online linear optimization. Theseresults provide a polynomial-time approachable online algorithm satisfying the bound in Theorem 1,given access to a separation oracle for the closed and convex set S . From this point on, when set S is response-satisfiable, we assume access to such an online algorithm that uses a separation oraclefor the convex set S in a blackbox fashion. Remark 2.
Another upshot of the above line of research on the equivalence between Blackwellapproachability and full information online linear optimization is that an algorithm for player toapproach set S might only have access to the realized vector payoffs ( p ( x , y ) , . . . , p ( x t − , y t − )) in round t , rather than the entire history (( x , y ) , . . . , ( x t − , y t − )) , and this is indeed without loss There are other equivalent structural criteria for approachability similar to response-satisfiability; see Section A.1in the appendix for a list of these conditions. Given the separation oracle for convex set S , the running-time should be polynomial in d , T , and the number ofbits required to encode X . We are also considering a computational model where either the realized vector payoff isgiven as feedback at the end of each round, or the vector payoff function p can be evaluated efficiently at any givenpair of actions ( x , y ) . iazadeh et al.: Online Learning via Offline Greedy of generality for obtaining the optimal bound of Theorem 1 (Abernethy et al. 2011). We relax thisassumption in our “bandit Blackwell sequential game”, where we assume player 1 can only sometimeshave access to an unbiased estimator of the realized vector payoff; see Section 5.1 for the definitionand more details. A particular class of NP-hard optimization problems we study concern maximizing set or continuous submodular functions . Definition 5 (Set submodularity).
A set function f : 2 [ n ] → [0 , is submodular if for all S, T ⊆ [ n ] , f ( S ∪ T ) + f ( S ∩ T ) ≤ f ( S ) + f ( T ) . Similarly, the concept of submodularity can be extended from subset lattice (above definition) toany discrete or continuous lattice. In particular, by considering the positive orthant cone lattice, weobtain the following definition for the continuous variant of set submodularity.
Definition 6 (Continuous submodularity).
A continuous multivariate function f :[0 , n → [0 , is submodular if for all x , y ∈ [0 , n ,f ( x ∨ y ) + f ( x ∧ y ) ≤ f ( x ) + f ( y ) , where ∨ and ∧ are coordinate-wise max and min operations. As an equivalent definition (Bian et al.2016), f is continuous submodular if for all i ∈ [ n ] , z ∈ [0 , , x − i (cid:22) y − i ∈ [0 , n − , and δ ≥ ,f ( z + δ, x − i ) − f ( z, x − i ) ≥ f ( z + δ, y − i ) − f ( z, y − i ) . The above class of continuous functions is also referred to as the weak-Diminishing Return (weak-DR) submodular in the literature (cf. Bian et al. 2016, Niazadeh et al. 2018, Soma and Yoshida2018). We further consider a special subclass of these functions satisfying concavity along eachcoordinate, referred to as the strong-Diminishing Return (strong-DR).
Definition 7 (Strong-DR continuous Submodularity).
A continuous multivariate func-tion f : [0 , n → [0 , is strong-DR submodular if for all i ∈ [ n ] , x (cid:22) y ∈ [0 , n , and δ ≥ ,f ( x i + δ, x − i ) − f ( x ) ≥ f ( y i + δ, y − i ) − f ( y ) , where x − i (resp. y − i ) is an ( n − -dimensional vector with all coordinate values of x (resp. y )except i , and x (cid:22) y if and only if ∀ j ∈ [ n ] : x j ≤ y j .The problem of maximizing monotone set submodular functions under a cardinality constraintadmits a classic greedy algorithm by Nemhauser et al. 1978 that achieves a tight approximationfactor of γ = 1 − /e . For unconstrained non-monotone set submodular maximization problem, the iazadeh et al.: Online Learning via Offline Greedy double greedy algorithm by Buchbinder et al. 2015 achieves a tight approximation factor of γ = 1 / .For maximizing non-monotone weak-DR continuous submodular functions within unit hyper cube [0 , n (or more generally any box constraint of the form × ni =1 [ a i , b i ] ), the continuous bi-greedyalgorithm by Niazadeh et al. 2018 achieves a tight approximation factor of γ = 1 / . For the specialcase of strong-DR, they also propose a variation of the continuous bi-greedy that is provably a faster / -approximation algorithm. See Section 6.3 for more details.
3. Approximation Algorithms for the Offline Problem: Iterative Greedy
As stated earlier, we are interested in transforming a γ -approximation algorithm for the offlineproblem (1) to an online learning algorithm, so that the worst-case γ -regret is sublinear in thenumber of rounds T . We consider a general class of algorithms for obtaining such an approximationguarantee, named Iterative Greedy (IG) algorithms. In an algorithm in this class, roughly speaking,a sequence of locally optimal decisions with respect to a specific metric (which we elaborate onmore later) leads to picking the final point. This point then provably provides an approximationguarantee with respect to the global optimal solution of problem (1).Formally, consider the following abstract skeleton. Suppose that we have N subproblems indexedby i ∈ [ N ] . The algorithm starts from an initial feasible point z (0) ∈ C . It then goes over the subprob-lems in the increasing order of their indices. The goal of each subproblem i is to return a new feasiblepoint z ( i ) ∈ C given the output of the previous subproblem, i.e., z ( i − . The algorithm finishes byreturning the point z ( N ) . Now, each subproblem i performs two steps:1. Local optimization: We associate a space of update parameters Θ ⊆ R d param to each subproblem.Given the previous point z ( i − and the objective function f , the goal of this step is to find a locally optimal update parameter θ ( i ) ∈ Θ that satisfies: Payoff ( θ ( i ) , z ( i − , f ) ≥ , where Payoff : Θ × D × F → R d payoff denotes the parameter vector payoff function .2. Local update: Given the update parameter θ ( i ) and z ( i − , this step returns the next point z ( i ) = Local-update ( θ ( i ) , z ( i − ) ∈ C . Notably, we allow
Local-update : Θ × D → ∆( C ) to incorporate randomness, and therefore z ( i ) can be potentially a randomized point.The above procedure is summarized in Algorithm 1. Remark 3.
To simplify the notation, we only consider symmetric subproblems in this section,i.e., all of the subproblems have the same update parameter spaces, local optimization steps, etc. Insome of our applications in Section 6, we need slightly different subproblems for different i = 1 , . . . , N .Our method directly extends to that case by having index-dependent subproblems. iazadeh et al.: Online Learning via Offline Greedy Algorithm 1:
Offline-IG ( C , F , D , Θ) Meta Input:
Feasible region C , function space F , defined over domain D , parameter space Θ ⊆ R d param , and parameter vector payoff function Payoff : Θ × D × F → R d payoff . Input: function f ∈ F . Output: feasible point z ∈ C .Initialize z (0) ∈ C ; for subproblem i = 1 to N do Choose update parameter θ ( i ) so that Payoff ( θ ( i ) , z ( i − , f ) ≥ ;Set z ( i ) ← Local-update ( θ ( i ) , z ( i − ) ;Return the final point z ← z ( N ) . Example 1.
As a simple running example, consider the problem of maximizing a monotonesubmodular set function f : 2 [ n ] → [0 , subject to the cardinality constraint k , and the classic (1 − e )-approximation greedy algorithm for this problem (Nemhauser et al. 1978). This algorithmstarts from the empty set and picks elements greedily based on their marginal value to the currentset. This problem is an example of problem (1) where D = { , } n , C = { z ∈ { , } n : z · n ≤ k } and F is the space of all monotone submodular set functions. Here, n is the all-ones vector withsize n . The greedy algorithm is an instance of Algorithm 1 with Θ = ∆([ n ]) , which is the set of allpossible probability distributions over n elements, and N = k subproblems, one for each iterationof the greedy algorithm. To describe each subproblem, for θ ∈ Θ , z ∈ D , and f ∈ F , ∀ j ∈ [ n ] : [ Payoff ( θ , z , f )] j = θ T y − [ y ] j , where [ · ] j denotes the j th coordinate value of a vector and y , [ f ( z ∪ { j } ) − f ( z )] j ∈ [ n ] is the marginalobjective value of adding element j to z . Moreover, Local-update ( θ , z ) samples an element i ∗ ∼ θ ,where θ ∈ ∆([ n ]) is a probability distribution over n elements, and returns z ∪ { i ∗ } . Note that Payoff ( θ , z , f ) ≥ guarantees θ to only have positive mass on elements with maximum marginalvalue with respect to the point z . We focus on IG algorithms that (i) provide a worst-case multiplicative approximation guarantee forproblem (1), and (ii) have a local optimization step that is robust to small errors, i.e., if we replacethe locally optimal decisions with almost locally optimal ones, the final point still remains to beapproximately optimal (with the same approximation factor), but up to a small additive error. Thefollowing definition formalizes this robustness notion.
Definition 8 ( ( γ, δ ) -robust approximation). An instance of Algorithm 1 is a ( γ, δ ) -robustapproximation algorithm for γ ∈ (0 , and δ > , if it satisfies the following properties: iazadeh et al.: Online Learning via Offline Greedy
1. Algorithm 1 is a γ -approximation offline algorithm as in Definition 1,2. Supposed that we replace θ ( i ) with ˜ θ ( i ) for every subproblem i = 1 , . . . , N . Then, if ∀ j ∈ [ d payoff ] : h Payoff ( ˜ θ ( i ) , z ( i − , f ) i j + ǫ ≥ , then we should have: ∀ ˆ z ∈ C : E [ f ( z )] ≥ γ · f ( ˆ z ) − δN ǫ , where ǫ > and [ · ] j denotes the j th coordinate value of a vector.For our purpose, we actually need a stronger version of this robustness property. This propertyessentially concerns multiple runs of the offline algorithm on a group of functions in F , i.e., { f t } t ∈ [ T ] ,producing a sequence of feasible points z t ∈ C for t ∈ [ T ] , and then guarantees a robust approximationfor the summation function, i.e., P t ∈ [ T ] f t ( z ) , against errors that are small on-average over theseruns by the sequence { z t } Tt =1 . This property is satisfied in all of the applications that motivate ourwork, in particular in various set and continuous submodular maximization problems we study inSection 6, and in both reserve price optimization and product ranking problems. Definition 9 (Extended ( γ, δ ) -robust approximation). An instance of Algorithm 1 is anextended ( γ, δ ) -robust approximation algorithm for γ ∈ (0 , and δ > , if for any sequence offunctions f , f , . . . f T ∈ F the following property is satisfied:• Suppose that z t is the output of Algorithm 1 on function f t for t ∈ [ T ] when θ ( i ) t (i.e., the choiceof parameter for subproblem i of run t ) is replaced with ˜ θ ( i ) t for t ∈ [ T ] and i ∈ [ N ] . Then, if ∀ j ∈ [ d payoff ] : " T X t =1 Payoff ( ˜ θ ( i ) t , z ( i − t , f t ) j + h ( T ) ≥ , we should have ∀ ˆ z ∈ C : T X t =1 E [ f t ( z t )] ≥ γ · T X t =1 f t ( ˆ z ) − δN h ( T ) . Here, z ( i ) t is the output of subproblem i ∈ [ N ] for run t ∈ [ T ] , h ( · ) : N → R + , and [ · ] j denotesthe j th coordinate value of a vector.When there is only one run of the function (i.e., T = 1 ), the extended ( γ, δ ) -robust approxi-mation guarantee boils down to the weaker ( γ, δ ) -robust approximation guarantee in Definition 8.We finish this section by revisiting our running example and demonstrating the (extended) robustapproximation property in this example. Example 1 (continued).
By digging deeper in the original analysis of the greedy algo-rithm (Nemhauser et al. 1978), we show that the greedy algorithm satisfies the extended ( γ, δ ) -robust approximation property for γ = 1 − e and δ = 1 . iazadeh et al.: Online Learning via Offline Greedy Suppose that z ∗ = { a , . . . , a k } is the optimal solution of the offline problem; that is, z ∗ =arg max z ∈{ , } n : z · n ≤ k P Tt =1 f t ( z ) . Further, let z ( i ) t be the solution returned by the i th subproblemof the greedy algorithm when the objective function is f t . Then, for every i ∈ [ k ] , T X t =1 f t ( z ∗ ) − T X t =1 f t ( z ( i − t ) (1) ≤ T X t =1 f t ( z ∗ ∪ z ( i − t ) − T X t =1 f t ( z ( i − t )= T X t =1 k X j =1 (cid:16) f t ( z ( i − t ∪ { a , . . . , a j } ) − f t ( z ( i − t ∪ { a , . . . , a j − } ) (cid:17) (2) ≤ k X j =1 T X t =1 (cid:16) f t ( z ( i − t ∪ { a j } ) − f t ( z ( i − t ) (cid:17) (3) = k X j =1 T X t =1 (cid:18) h ˜ θ ( i ) t , y ( i − t i − h Payoff ( ˜ θ ( i ) t , z ( i − t , f t ) i a j (cid:19) = k X j =1 T X t =1 n X j =1 [ ˜ θ ( i ) t ] j (cid:16) f t ( z ( i − t ∪ { j } ) − f t ( z ( i − t ) (cid:17) − h Payoff ( ˜ θ ( i ) t , z ( i − t , f t ) i a j ! = k · T X t =1 E h f t ( z ( i ) t ) − f t ( z ( i − t ) i − k X j =1 T X t =1 h Payoff ( ˜ θ ( i ) t , z ( i − t , f t ) i a j (4) ≤ k · T X t =1 E h f t ( z ( i ) t ) − f t ( z ( i − t ) i + kh ( T ) , where y ( i ) t , h f t ( z ( i ) t ∪ { j } ) − f t ( z ( i ) t ) i j ∈ [ n ] . In the above chain of inequalities, inequality (1) holdsbecause function f t is monotone, inequality (2) holds due to submodularity of functions { f t } Tt =1 ,equality (3) holds because of the definition of the payoff vector in Example 1, and inequality (4)holds because of the condition in Definition 9. By rearranging the terms and taking expectations,we have: T X t =1 E h f t ( z ∗ ) − f t ( z ( i ) t ) i ≤ (1 − k ) T X t =1 E h f t ( z ∗ ) − f t ( z ( i − t ) i + h ( T ) . By recursing the above inequality for i = 1 , . . . , k , and rearranging the terms, we finally have: T X t =1 E [ f t ( z t )] ≥ (1 − (1 − k ) k ) T X t =1 f t ( z ∗ ) − h ( T ) k X i =1 (1 − k ) i − ≥ (1 − e ) T X t =1 f t ( z ∗ ) − kh ( T ) . (cid:4) Not all greedy algorithms have robust guarantees. Example 2 of Section B in the appendix showswhy, e.g., Dijkstra’s algorithm for the shortest path problem, is not robust to local errors. We use binary indicator vectors and sets interchangably in this paper. iazadeh et al.:
Online Learning via Offline Greedy
4. Online Algorithm under Full Information Feedback Structure
In this section, we show how to transform an offline IG algorithm (Algorithm 1) to an online learningalgorithm with a small approximate regret whenever it (i) is an extended robust approximationalgorithm (Definition 9), and (ii) satisfies an extra condition that we call
Blackwell reduciblity . Wefirst introduce this condition. Then, with the help of the Blackwell approachability (Theorem 1), wepropose a meta full information online learning algorithm as our offline-to-online transformation.
The crux of our technique to transform an offline IG algorithm to an online learning algorithm isthe possibility of reducing the local optimization step of Algorithm 1 to an approachable instanceof the Blackwell sequential game as in Section 2.3.
Definition 10 (Blackwell reduciblity).
An instance
Offline-IG ( C , F , D , Θ) of Algo-rithm 1 is Blackwell reducible if there exists an instance ( X , Y , p ) of the Blackwell sequential game(with a biaffine vector payoff function p ) and a mapping AdvB : D × F → Y called synthetic Blackwelladversary function , such that:1. The player 1’s action space X is equal to the parameter space Θ in Algorithm 1; i.e., X = Θ ,and for any θ ∈ Θ , z ∈ D , f ∈ F , we have Payoff ( θ , z , f ) = p ( θ , AdvB ( z , f )) .2. The set S , { u ∈ R d payoff : [ u ] j ≥ , j ∈ [ d payoff ] } is response-satisfiable (Definition 4). Example 1 (continued).
The greedy algorithm of Nemhauser et al. (1978) is Blackwellreducible. Consider an instance ( X , Y , p ) of Blackwell where X = Θ = ∆([ n ]) and Y = [0 , n . Thesynthetic Blackwell adversary function is AdvB ( z , f ) = [ f ( z ∪ { j } ) − f ( z )] j ∈ [ n ] , and the biaffineBlackwell vector payoff function is p ( θ , y ) = θ T y n − y . Recall that n is all-ones n -dimensionalvector. Furthermore, set S is response-satisfiable because for every player 2’s action y ∈ Y , playing θ = e j ∗ with j ∗ = argmax j ∈ [ n ] y j implies that p ( θ , y ) ≥ . If the offline algorithm (Algorithm 1) is Blackwell reducible, then one can think of the followingapproach to transform it into an online learning algorithm: associate an instance of the Blackwellsequential game to each subproblem i following the Blackwell reducibility, and then running N parallel online approachable algorithms for these Blackwell instances to find a sequence of assign-ments of the update parameter of each subproblem i over time. We further need to show how tosynchronize these parallel runs through a proper communication between them, so as to constructa sequence of feasible solutions z , . . . , z T guaranteeing a small approximate regret. Note that Y = [0 , n because f : 2 [ n ] → [0 , is monotone non-decreasing. iazadeh et al.: Online Learning via Offline Greedy Recall that our goal in the offline problem is to solvethe optimization problem max z ∈C f ( z ) , where f ∈ F . The offline problem admits a polynomial timeIG γ -approximation algorithm, Offline-IG ( C , F , D , Θ) , presented in Algorithm 1. This algorithmsolves N subproblems sequentially, building the solution step by step. In step/subproblem i of thisalgorithm, we first update parameters θ ( i ) ∈ Θ ⊆ R d param using the previous point z ( i − , and thenreturn the next point z ( i ) to feed to the next subproblem. The algorithm finishes by returning thefinal point z ( N ) .As stated earlier, we assume that the offline problem is Blackwell reducible; that is, we can definethe Blackwell instance ( X , Y , p ) and synthetic Blackwell adversary function AdvB : D × F → Y thatsatisfy the conditions in Definition 10. Although this definition might seem technical, verifying itfor many offline algorithms is indeed straightforward; see Section 6.For the online version of the above offline algorithm, the meta input is feasible region C , functionspace F , which is defined over domain D , and parameter space Θ ⊆ R d param . We further considerhaving access to an online Blackwell algorithm AlgB, player 1’s strategy in the above Blackwellsequential game ( X , Y , p ) , where such algorithm (i) ensures that the distance between the averagevector payoff T P Tt =1 p ( x t , y t ) and set S goes to zero with rate g ( T ) = O (cid:18) D ( p ) q log( d ) T (cid:19) againstany adversarial player 2’s strategy (see Theorem 1), and (ii) can be implemented in polynomialtime having access to a separation oracle for the convex set S . As stated earlier, the existence ofsuch an algorithm follows from the work of Even-Dar et al. 2009 and Abernethy et al. (2011); seeRemark 1. We consider N parallel copies of this algorithm, one for each subproblem i ∈ [ N ] . It isalso important to note that in our application, set S is the positive orthant, for which a polynomialtime separation oracle exists. Our algorithm that takes advantage of N parallel copies of the onlineBlackwell algorithm is summarized in Algorithm 2.Let AlgB ( i ) be the copy of the above online Blackwell algorithm associated to subproblem i ∈ [ N ] .This copy handles the local optimization step of subproblem i in the Offline-IG ( C , F , D , Θ) inevery round t ∈ [ T ] without knowing function f t . Consider the decision-making process of thisonline algorithm in round t . The inputs prior to this round are all the update parameters of thesubproblem i in the first t − rounds, i.e., θ ( i )1 , . . . , θ ( i ) t − , and the realized vector payoffs of the first t − rounds against player 2 in the Blackwell sequential game associated to subproblem i , i.e., p ( θ ( i )1 , y ( i )1 ) , . . . , p ( θ ( i ) t − , y ( i ) t − ) . We consider a particular player 2 for this Blackwell sequential game.More explicitly, the synthetic adversary function AdvB, which is part of our reduction, plays therole of player 2 in any round t , i.e., y ( i ) t = AdvB ( z ( i − t , f t ) . Given the input prior to time t , AlgB ( i ) returns the new update parameter θ ( i ) t .After the online Blackwell algorithm AlgB ( i ) returns the update parameter θ ( i ) t , we returnthe point z ( i ) t by calling the Local-update function in the offline algorithm, i.e., we set z ( i ) t to iazadeh et al.: Online Learning via Offline Greedy Local-update ( θ ( i ) t , z ( i − t ) . Observe that the point returned by the subproblem i , i.e., z ( i ) t , dependson the point returned by the previous subproblem z ( i − t . This highlights that while each onlineBlackwell algorithm is responsible for one subproblem, they communicate with each other to buildthe final solution, where this communication is structured by the offline algorithm through the Local-update function. After obtaining the point z ( i ) t , we move to subproblem i + 1 .Finally note that simulating the actions of our particular player 2 to determine the realized vectorpayoffs of each round, and computing/sending this feedback at the end of each round to AlgB ( i ) (asplayer 1) in a computationally efficient manner, require the following:• Knowing the point z ( i − t picked by subproblem i − at time t : This is possible as we go overour subproblems in the order i = 1 , . . . , N in each round t .• Knowing the function f t : This is possible because here we study the full information feedbackstructure, where under this structure we have access to f t after we choose point z t = z ( N ) t .• Being able to compute the realized vector payoff p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) efficiently given θ ( i ) t , f t , and z ( i − t . This is possible as this quantity is equal to Payoff ( θ ( i ) t , z ( i − t , f t ) , which can beevaluated in polynomial time as Offline-IG ( C , F , D , Θ) is a polynomial time algorithm. The following theorem, which bounds the regret of our algorithm, isthe main result of this section.
Theorem 2 (Full information offline-to-online transformation) . Suppose that aninstance of the algorithm
Offline-IG ( C , F , D , Θ) for the offline problem (1) satisfies the followingproperties:• It is an extended ( γ, δ ) -robust approximation for γ ∈ (0 , and δ ∈ R + , as in Definition 9.• It is Blackwell reducible, that is, we can define the Blackwell sequential game ( X , Y , p ) andsynthetic Blackwell adversary function AdvB : D × F → Y that satisfy the conditions in Defini-tion 10.Consider the full-information adversarial online learning version of the problem (1) , and let AlgB bea polynomial time Blackwell algorithm for ( X , Y , p ) as in Remark 1. Then, for this online problem, Online-IG ( C , F , D , Θ , AlgB ) runs in polynomial time and satisfies the following γ -regret bound: γ -regret ( Online-IG ( C , F , D , Θ , AlgB )) ≤ O (cid:18) D ( p ) N δ q log( d payoff ) T (cid:19) , where N is the number of subproblems, d payoff is the dimension of vector payoffs, and D ( p ) , definedin Equation (2) , is the ℓ ∞ -diameter of the vector payoff space.Proof of Theorem 2. Consider a subproblem i ∈ [ N ] . Let S be the d payoff -dimensional positiveorthant; see the Blackwell reducibility definition and its associated approachable set S in Defini-tion 10. Because S is response-satsifiable and projection onto S can be done in polynomial-time, iazadeh et al.: Online Learning via Offline Greedy Algorithm 2:
Full-information Online Learning Meta-algorithm (
Online-IG ) Meta Input:
Feasible region C , function space F , defined over domain D , and parameterspace Θ ⊆ R d param . Offline algorithm and reduction gadgets:
An instance
Offline-IG ( C , F , D , Θ) ofAlgorithm 1, the Blackwell instance ( X , Y , p ) and synthetic Blackwell adversary functionAdvB : D × F → Y as this offline algorithm is Blackwell reducible (Definition 10) .
Input:
Number of rounds T ; access to a Blackwell online algorithm AlgB . Output:
Points z , z , . . . , z T ∈ C .Initialize N parallel instances { AlgB ( i ) } Ni =1 of the online algorithm AlgB ; for round t = 1 to T do Initialize z (0) t ∈ C ; for subproblem i = 1 to N do Choose update parameter θ ( i ) t by querying online algorithm AlgB ( i ) given the updateparameters and vector payoffs prior to round t in the Blackwell sequential game ofsubproblem i , that is, θ ( i ) t ← AlgB ( i ) (cid:16) θ ( i )1 , . . . , θ ( i ) t − , p ( θ ( i )1 , y ( i )1 ) , . . . , p ( θ ( i ) t − , y ( i ) t − ) (cid:17) ;Set z ( i ) t ← Local-update ( θ ( i ) t , z ( i − t ) ∈ C ; end Play the final point z t ← z ( N ) t ;< Full information feedback: adversary reveals function f t ∈ F > ; for i = 1 to N do Give feedback p ( θ ( i ) t , y ( i ) t ) ← Payoff ( θ ( i ) t , z ( i − t , f t ) to the Blackwell AlgorithmAlgB ( i ) (as the vector payoff of round t against player 2) ; // Note that y ( i ) t = AdvB ( z ( i − t , f t ) for player 2 implicitly, although we do not need to evaluate AdvB tocompute this action explicitly. endend there exists a polynomial-time online algorithm AlgB (with N parallel copies { AlgB ( i ) } Ni =1 ) thatguarantees Blackwell approachability for the Blackwell instance corresponding to subproblem i with g ( T ) = O (cid:18) D ( p ) q log( d payoff ) T (cid:19) , based on Theorem 1. Therefore, we have: d ∞ T T X t =1 p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) , S ! = d ∞ T T X t =1 p (cid:16) θ ( i ) t , y ( i ) t (cid:17) , S ! ≤ g ( T ) . Because the target set S is the positive orthant, we have d ∞ T T X t =1 p (cid:16) θ ( i ) t , AdvB ( z i − t , f t ) (cid:17) , S ! ≤ g ( T ) ⇐⇒ ∀ j : " T X t =1 p (cid:16) θ ( i ) t , AdvB ( z i − t , f t ) (cid:17) j ≥ − T g ( T ) iazadeh et al.: Online Learning via Offline Greedy Because of Blackwell reduciblity,
Payoff (cid:16) θ ( i ) t , z ( i − t , f t (cid:17) = p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) . Therefore, ∀ j ∈ [ d payoff ] : " T X t =1 Payoff (cid:16) θ ( i ) t , z ( i − t , f t (cid:17) j ≥ − T g ( T ) . (3)Finally, because Algorithm 1 is an extended ( γ, δ ) -robust approximation (see Definition 9), fromEquation (3) we have: T X t =1 E [ f t ( z t )] ≥ γ · T X t =1 f t ( z ∗ ) − δN T g ( T ) = γ · T X t =1 f t ( z ∗ ) − O (cid:18) δN D ( p ) q log( d payoff ) T (cid:19) , which finishes the proof. Here, z ∗ is the optimal in-hindsight feasible solution, i.e., z ∗ = argmax z ∈C P Tt =1 f t ( z ) . (cid:4) We finish this section by reviewing our running example (Example 1) and mentioning the regretbound we get as a direct corollary of Theorem 2.
Example 1 (continued).
The greedy algorithm in Nemhauser et al. (1978) is an extended (1 − e , -robust approximation algorithm and Blackwell reducible. It has N = k subproblems, the ℓ ∞ diameter of the payoff space is D = 1 , and the dimension of vector payoffs is d = n . Therefore,by invoking Algorithm 2 given any Blackwell algorithm satisfying the approachability bound inTheorem 1, we obtain the following bound: (cid:18) − e (cid:19) -Regret ( Algorithm 2 ) ≤ O ( k p log( n ) T ) , which exactly matches the bound known in Streeter and Golovin (2008) for the same problem.
5. Online Algorithm under Bandit Information Feedback Structure
So far, we presented a framework to transform an offline iterative greedy algorithm to its onlinecounterpart under the full information feedback structure. While the full information setting pro-vides the theoretical foundations for the rest of our results, from an application point of view, it isless motivated. In almost all applications of our framework in revenue management and online deci-sion making (e.g., product ranking problem and reserve price optimization), assuming the learnerhas full information feedback is rather a strong assumption.In this section, we seek to relax this assumption, and try to understand if our framework canbe extended to the more challenging bandit feedback structure setting. Under the bandit feedbackstructure, at the end of each round t , the learner faces an additional challenge: he only has accessto f t ( z t ) , rather than the entire function f t like in the full information setting. Such a feedbackstructure prevents the online Blackwell algorithms AlgB ( i ) to receive the feedback they require. iazadeh et al.: Online Learning via Offline Greedy To overcome this challenge, we first consider a stylized bandit variation of the sequential Blackwellgame. We characterize a new notion of approachability that we call bandit Blackwell approachability and provide an algorithm achieving the information-theoretic tight approachability bound for thisproblem. This algorithm uses an algorithm for the full information version of the Blackwell sequentialgame in a blackbox fashion.We then introduce the extra ingredient that is needed for our bandit transformation, whichis the possibility of creating an unbiased estimator for the vector payoff of the Blackwell gamesassociated with different subproblems. Putting all these pieces together, we propose a bandit onlinelearning algorithm with the help of our bandit Blackwell approachability. We highlight that thisapproach essentially uses the unbiased estimators to obtain bandit-style feedback for the onlinelearning problems of each subproblem, leading to an efficient overall bandit learning algorithm witha sublinear γ -regret. In the bandit online learning version of problem (1), an online algorithm can only see the value ofthe function at the particular point that is picked in that round. Therefore, in our transformation,multiple online Blackwell algorithms compete over a single piece of information in order to estimatethe vector payoffs, where estimating the vector payoff of a Blackwell algorithm can be typically doneby taking a costly “exploration” move, tailored to that algorithm.With the goal of properly modeling this paradigm at a lower level, we propose the notion of a bandit Blackwell sequential game , characterized by the extended tuple ( X , Y , p , ˆ p ) . In this variant,player 1 makes an additional decision in each round: whether to explore or not. Only if player 1chooses to explore in round t , do they receive the unbiased estimator ˆ p ( x t , y t ) whose expectation isthe vector payoff for that round p ( x t , y t ) . However, player 1 is punished by an additive cost D ( p ) .If player 1 refrains from exploration, they neither receive any feedback nor any punishment. Player1’s new goal is to minimize the distance from the time-averaged payoff to the target set S plus theirtime-averaged exploration penalty. Definition 11 (Bandit Blackwell approachability).
A closed convex target set S is g ( T ) -bandit-approachable in the bandit Blackwell sequential game ( X , Y , p , ˆ p ) if there exists a ban-dit player ’s strategy such that for every player 2’s strategy, the resulting sequence of actions satisfy d ∞ T T X t =1 p ( x t , y t ) , S ! + E (cid:20) T D ( p ) · ( ) (cid:21) ≤ g ( T ) , where ( T is information-theoreticallytight. See Section C.3 for details. iazadeh et al.: Online Learning via Offline Greedy Theorem 3.
A closed convex set S is O (cid:16) D ( p ) / D ( ˆ p ) / (log d ) / T − / (cid:17) -bandit-approachablein the bandit Blackwell sequential game ( X , Y , p , ˆ p ) if and only if S is response-satisfiable in theBlackwell game ( X , Y , p ) . In particular, when S is response satisfiable, the online algorithm AlgBB(Algorithm 3) achieves this approachability bound in polynomial time, given access to a separationoracle for S .Proof sketch of Theorem 3. To see the only if direction of the first part of the theorem, banditBlackwell approachability implies Blackwell approachability. Specifically, if d ∞ T T X t =1 p ( x t , y t ) , S ! + E (cid:20) T C · ( ) (cid:21) ≤ O (cid:16) D ( p ) / D ( ˆ p ) / (log d ) / T − / (cid:17) , then we must have d ∞ (cid:16) T P Tt =1 p ( x t , y t ) , S (cid:17) ≤ O (cid:16) D ( p ) / D ( ˆ p ) / (log d ) / T − / (cid:17) , and hence this ℓ ∞ -distance is vanishing as T → + ∞ . This, in turn, implies that the target set S is response satisfi-able (see Theorem 1). Note that while Theorem 1 is stated for a specific g ( T ) , the only if directionof this theorem holds for any vanishing approachability bound (Blackwell 1956).To see the if direction and the second part of the theorem, we consider a simple algorithm thatuses a (full information) Blackwell algorithm AlgB as a blackbox. We pick an algorithm AlgB thatsatisfies the approachability bound of Theorem 1, and can obtain this bound in polynomial timegiven a separation oracle for S ; see Remark 1. At the beginning of each round, our bandit algorithmplays the last suggested action by AlgB. It then explores randomly with probability q by flippingan independent coin. Based on the outcome of the coin, it either updates the state of AlgB usingthe unbiased payoff feedback it gets (exploration) and queries AlgB for suggesting a new action tofollow, or decides not to explore with probability − q and refrains the state of AlgB. These stepsare summarized in Algorithm 3.As for the running time, the above algorithm will run in polynomial time given a separationoracle for S based on Remark 1. As for the approachability bound, at a high level, if we imaginethat unbiased payoffs are the actual payoffs in the Blackwell game, then the expected distance oftime-averaged unbiased vector payoff from S is roughly equal to the same quantity for only roundsthat we explore. There are qT such rounds in expectation. Therefore, the expected distance is upperbounded by O ( D ( ˆ p ) (log( d )) / q − / T − / ) due to the approachability of AlgB for this imaginaryBlackwell sequential game (Theorem 1). Also, the algorithm gets penalized on average by O ( D ( p ) q ) due to exploring. Taking expectation to replace unbiased estimators with the actual payoffs andbalancing the two terms in regret by setting q = D ( p ) − / D ( ˆ p ) / (log d ) / T − / gives the finalbound. See Section C.1 in the appendix for a detailed proof with a more involved argument. (cid:4) iazadeh et al.: Online Learning via Offline Greedy Algorithm 3:
Bandit Blackwell Online Algorithm (AlgBB)
Meta Input:
Parameter q ∈ [0 , , bandit Blackwell sequential game ( X , Y , p , ˆ p ) . Input:
Number of rounds T , blackbox access to full information online algorithm AlgB forthe Blackwell sequential game ( X , Y , p ) , achieving approachability bound of Theorem 1 . Output:
Actions { x t } t ∈ [ T ] and binary signals { π t } t ∈ [ T ] , where x t ∈ X , and π t ∈ { Yes , No } for any t ∈ [ T ] .Initialize x new by sending the initial query to AlgB ; for round t = 1 to T do Play the action x t ← x new ; Set π t to be YES with probability q , and No withprobability − q ; if π t = Yes then
Obtain ˆ p ( x t , y t ) and send ˆ p ( x t , y t ) /q as feedback to AlgB ; // AlgB gets a new feedbackin each exploration round, i.e., round t where π t = Yes . Update x new by querying AlgB given the actions and realized unbiased estimatorvector payoffs in exploration rounds prior to round t + 1 , i.e., x new ← AlgB ( { ( x τ , ˆ p ( x τ , y τ ) : τ ≤ t, π τ = Yes } ) ; Remark 4.
Our notion of bandit Blackwell approachability and the algorithm that achieves thetight bound (Algorithm 3) bear some resemblance to the ǫ -greedy algorithm in the classic banditsetting, where in every round of this algorithm, we decide whether or not to explore, and when weexplore in a round we assume we suffer from the maximum possible regret in this round. Remark 5.
The vanilla version of AlgBB needs to tune exploration probability q based on thehorizon T to obtain the bound in Theorem 3. However, by using the standard doubling trick inonline learning (e.g., see Bubeck et al. (2015)) in a blackbox fashion, one can boost Algorithm 3 towork for unknown but bounded T : the new algorithm starts with a guess for horizon (e.g., T = 1 )and sets q according to this guess. Each time it reaches the guessed horizon, it doubles its guess, andrestarts by tuning a new value for q and initializing again. The doubling trick is a well-known ideain the online learning literature that can be traced back to the classic work of Auer et al. (2002).We refer the reader to aforementioned work, and omit the details here for brevity. Similar to our full information offline-to-online transformation, which gave us algorithm
Online-IG in Section 4, we transform an offline IG algorithm to a bandit online learning algorithm by associ-ating an instance of the bandit Blackwell sequential game to each subproblem i ∈ [ N ] of the offlinealgorithm. That is, we crucially rely on a reduction from the local optimization step of each sub-problem in Algorithm 1 to an approachable instance of the bandit Blackwell sequential game as inDefinition 11. Such a reduction is possible if the offline algorithm is Bandit Blackwell reducible ; seethe following definition. iazadeh et al.:
Online Learning via Offline Greedy Definition 12 (Bandit Blackwell Reducibility).
An instance
Offline-IG ( C , F , D , Θ) of Algorithm 1 is bandit Blackwell reducible if there is an instance ( X , Y , p , ˆ p ) of bandit Blackwellsequential game (Section 5.1) and an exploration sampling device ExpS : Θ × D → ∆ (cid:0) R d payoff × C (cid:1) ,such that:1. Offline-IG ( C , F , D , Θ) is Blackwell reducible as in Definition 10, using the Blackwell sequen-tial game ( X , Y , p ) (with biaffine p ) and the synthetic Blackwell adversary function AdvB.2. If y = AdvB ( z , f ) for some f ∈ F , z ∈ D , then ˆ p ( θ , y ) = f ( z exp ) w exp for all θ ∈ Θ , where ( w exp , z exp ) ∼ ExpS ( θ , z ) . Otherwise, ˆ p ( θ , y ) = p ( θ , y ) .3. The above ˆ p is an unbiased estimator for the actual vector payoff, i.e., for all θ ∈ Θ , y ∈ Y : E [ ˆ p ( θ , y )] = p ( θ , y ) .4. The exploration sampling device ExpS ( θ , z ) returns its samples ( w exp , z exp ) in polynomial time.To better understand the bandit Blackwell reducibility, we revisit our running example. Example 1 (continued).
The greedy algorithm of Nemhauser et al. (1978) is also banditBlackwell reducible. As stated in Section 4, this algorithm is Blackwell reducible. Recall that in thisexample, the biaffine Blackwell payoff is p ( θ , y ) = θ T y n − y , where n is all ones n -dimensionalvector. We will construct an exploration sampling device ExpS that returns ( w exp , z exp ) such that if ∀ θ ∈ Θ , we have y = AdvB ( z , f ) for some f ∈ F , z ∈ D , we set ˆ p ( θ , y ) = f ( z exp ) w exp and we musthave E [ ˆ p ( θ , y )] = p ( θ , y ) . The exploration sampling device ExpS works as follows. Given a point z ∈ C (which represents a set of elements) and parameter θ ∈ Θ , it draws j ∼ Uniform { , . . . , n } andreturns (i) w exp = n ( θ j n − e j ) , (ii) z exp = z ∪ { j } . Now, ˆ p is an unbiased estimator of p , because: E [ ˆ p ( θ , AdvB ( z , f ))] = E [ f ( z exp ) w exp ]= E [ n ( θ j f ( z ∪ { j } ) n − f ( z ∪ { j } ) e j )]= n X j ∈ [ n ] θ j f ( z ∪ { j } ) − [ f ( z ∪ { } ) , . . . , f ( z ∪ { n } )] T = n X j ∈ [ n ] θ j ( f ( z ∪ { j } ) − f ( z )) − [ f ( z ∪ { } ) , . . . , f ( z ∪ { n } )] T + f ( z ) n = θ T y n − y = p ( θ , AdvB ( z , f )) , where y , [ f ( z ∪ { j } ) − f ( z )] j =1 , ,...,n = AdvB ( z , f ) . Here, the fourth equation holds because P j ∈ [ n ] θ j = 1 . Observe that the exploration sampling device ExpS has an intuitive interpretation,at every round, it randomly picks one of the elements j ∈ [ n ] , and evaluates the marginal benefit ofadding element j to z . iazadeh et al.: Online Learning via Offline Greedy When the offline algorithm (Algorithm 1) is bandit Black-well reducible (Definition 12), we can employ a similar offline-to-online transformation mentionedin Section 4. However, instead of associating an instance of the Blackwell game to each subproblem,we associate an instance of the bandit Blackwell game. To obtain unbiased estimators for the vectorpayoffs of these bandit Blackwell instances, we rely on the exploration sampling devices that arepromised by Definition 12. This sampling device allows us to strike a balance between explorationand exploitation in all of the online bandit Blackwell games. We formalize this transformation ofthe offline algorithm to an online bandit algorithm called
Bandit-IG in Algorithm 4.Suppose that the offline algorithm
Offline-IG ( C , F , D , Θ) is given. For the particular banditBlackwell sequential game ( X , Y , p , ˆ p ) coming from Definition 12, we use AlgBB (Algorithm 3) todetermine the strategy of player 1. Such an online bandit Blackwell algorithm as player 1 ensuresthat the distance between the average vector payoff T P Tt =1 p ( x t , y t ) and set S plus the explorationpenalty goes to zero with rate g ( T ) = O ( D ( p ) / D ( ˆ p ) / (log d ) / T − / ) ; see Theorem 3.We dedicate a copy of the above algorithm AlgBB ( i ) to each subproblem i ∈ [ N ] . We queryalgorithms AlgBB ( i ) in the increasing order of their index i . Consider the online bandit Blackwellalgorithm AlgBB ( i ) , and assume that in round t , we query this algorithm. The algorithm returnstwo outputs: the update parameter θ ( i ) t and a binary signal π ( i ) t ∈ { Yes , No } . If π ( i ) t = Yes , thealgorithm explores: it samples ( w ( i ) t, exp , z ( i ) t, exp ) from the exploration sampling device ExpS ( θ ( i ) t , z ( i − t ) .Note that the exploration sampling device uses the update parameter θ ( i ) t and the point returned bythe previous subproblem z ( i − t . This indeed allows the subproblems to communicate with each otherduring exploration. The algorithm then plays z t = z ( i ) t, exp and provides the payoff vector feedback ˆ p ( i ) t = f t ( z t ) w ( i ) t, exp to AlgBB ( i ) . This feedback is only used by the online bandit Blackwell algorithmAlgBB ( i ) , not the rest of N − bandit Blackwell algorithms. We highlight that if AlgBB ( i ) decides toexplore in round t , the rest of bandit Blackwell algorithms will not be queried. Finally, if π ( i ) t = No ,the algorithm exploits: it returns point z ( i ) t = Local-update ( θ ( i ) t , z ( i − t ) . Again observe that duringexploitation, subproblem i also communicates with subproblem i − through using z ( i − t . Theorem 4 bounds the regret of the
Bandit-IG algorithm. The proofis deferred to Section C.2 in the appendix.
Theorem 4 (Bandit information offline-to-online transformation) . Suppose that aninstance of
Offline-IG ( C , F , D , Θ) for the offline problem (1) satisfies the following properties:• It is an extended ( γ, δ ) -robust approximation for γ ∈ (0 , and δ > , as in Definition 9.• It is bandit Blackwell reducible; that is, we can define the bandit Blackwell sequential game ( X , Y , p , ˆ p ) and exploration sampling device ExpS : Θ × D → ∆ (cid:0) R d payoff × C (cid:1) that satisfy thecontions in Definition 12. iazadeh et al.: Online Learning via Offline Greedy Algorithm 4:
Bandit Online Learning Meta-algorithm (
Bandit-IG ) Meta Input:
Feasible region C , function space F , defined over domain D , parameter space Θ ⊆ R d param . Offline algorithm and reduction gadgets:
An instance
Offline-IG ( C , F , D , Θ) ofAlgorithm 1; this algorithm is bandit Blackwell reducible as in Definition 12, using thebandit Blackwell instance ( X , Y , p , ˆ p ) and exploration sampling deviceExpS : Θ × D× → ∆ (cid:0) R d payoff × C (cid:1) . Input:
Number of rounds T ; access to a bandit Blackwell online algorithm AlgBB . Output:
Points z , z , . . . , z T ∈ C .Initialize N parallel instances { AlgBB ( i ) } Ni =1 of the online algorithm AlgBB ; for round t = 1 to T do Initialize z (0) t ∈ C ; for subproblem i = 1 to N do Choose the update parameter θ ( i ) t ∈ Θ and exploration signal π ( i ) t ∈ { Yes , No } byquerying online algorithm AlgBB ( i ) given the update parameters and vector payoffs ˆ p of exploration rounds prior to round t in the bandit Blackwell sequential game ofsubproblem i , that is (cid:16) θ ( i ) t , π ( i ) t (cid:17) ← AlgBB ( i ) (cid:16) θ ( i )1 , . . . , θ ( i ) t − , { ˆ p ( θ ( i ) τ , y ( i ) τ ) } τ ≤ t − π ( i ) τ = Yes (cid:17) ; if π ( i ) t = Yes , then Sample ( w ( i ) t, exp , z ( i ) t, exp ) from the exploration sampling device ExpS ( θ ( i ) t , z ( i − t ) ;Play the exploration point z t ← z ( i ) t, exp ;< Bandit information feedback: observe f t ( z t ) > ;Give payoff vector feedback ˆ p ( i ) t = f t ( z t ) · w ( i ) t, exp to AlgBB ( i ) ; Skip immediately tothe beginning of the next round t + 1 ;Set z ( i ) t ← Local-update ( θ ( i ) t , z ( i − t ) ;Play the final point z t ← z ( N ) t and receive bandit feedback f t ( z t ) , and ignore it. Consider the bandit-information adversarial online learning version of problem (1) , and let AlgBBbe a polynomial time bandit Blackwell algorithm for ( X , Y , p , ˆ p ) as in Theorem 3. Then, for thisonline problem, Bandit-IG ( C , F , D , Θ , AlgBB ) runs in polynomial time and satisfies the following γ -regret bound: γ -regret ( Bandit-IG ( C , F , D , Θ , AlgBB )) ≤ O (cid:16) D ( p ) / D ( ˆ p ) / N δ (log( d payoff )) / T / (cid:17) , where N is the number of subproblems and d payoff is the dimension of vector payoffs. iazadeh et al.: Online Learning via Offline Greedy We finish this section by wrapping up our running example (Example 1) and mentioning thebandit regret bound we get as a direct corollary of Theorem 4.
Example 1 (finished).
The greedy algorithm in Nemhauser et al. (1978) satisfies (1 − e , -robust approximation and is bandit Blackwell reducible. It has N = k subproblems and ℓ ∞ diameterof ˆ p is D ( ˆ p ) = O ( n ) . Therefore, by invoking Algorithm 4 given any bandit Blackwell algorithmsatisfying the approachability bound in Theorem 3, we obtain the following bandit regret bound: (cid:18) − e (cid:19) -regret ( Algorithm 4 ) ≤ O ( kn / (log n ) / T / ) , which in turn, by noting that k can be as large as n , gives us an immediate improvement over regretbound of O (cid:0) k ( n log n ) / T / (log T ) (cid:1) in Streeter and Golovin (2007, 2008).
6. Applications to Revenue Management and Combinatorial Optimization
We have already showed how to fit monotone submodular maximization into our framework throughExample 1. In this section, we apply our framework to three other selected problems: prod-uct ranking through sequential submodular maximization, personalized reserve price optimizationin second-price auction, and non-monotone submodular maximization. Our framework results inimproved/new regret bounds in all mentioned applications for both full-information and bandit set-tings. We should emphasize that our framework is quite general and would potentially capture othergreedy-solvable/approximate problems in revenue management and combinatorial optimization.
Problem definition.
In the
Product Ranking Problem , a platform aims to characterize aranking of n items, where a ranking is a permutation π over the items. Here, items on positions withlower indices have more visibility. The goal of the platform is to maximize its user engagement (alsoknown as market share), which is the probability that a consumer does not leave the platform withouttaking a desired action. This action can be a click, purchase, or even installing an application.For the sake of presentation, assume that the desired action is clicking on an item. We considerthe model proposed by Asadpour et al. (2020), which is inspired by an earlier model proposed inFerreira et al. (2019). In this model, a consumer u is characterized by a patience level θ u togetherwith a monotone non-decreasing submodular set function κ u : 2 [ n ] −→ [0 , . A consumer of type ( θ u , κ u ) , when offered a ranked list of products π = ([ π ] , [ π ] , . . . , [ π ] n ) , inspects the first θ u productsand clicks with probability κ u ( { [ π ] , . . . , [ π ] θ u } ) . The platform knows the distribution G from which u is selected. The goal is to pick a permutation π maximizing the probability of click E u ∼G [ κ u ( { [ π ] , . . . , [ π ] θ u } )] . For a wide range of choice models in the literature, the probability of a purchase from an offeredset S can be described using a monotone submodular function κ u . This includes multinomial logit,nested logit, and paired combinatorial logit models. See Kök et al. (2008) for details on these models. iazadeh et al.: Online Learning via Offline Greedy Product ranking problem as sequential submodular maximization.
A slight reformulation of theabove model casts the product ranking problem as a special case of a class of optimization problemsover permutations called sequential submodular maximization (Asadpour et al. 2020). We define thesequential submodular maximization problem as follows. Given a sequence of monotone submodularset functions { f ( · ) , . . . , f n ( · ) } , and a sequence of non-negative weights λ = ( λ , . . . , λ n ) , we aim tofind a ranking π that maximizes n X i =1 λ i f i ( { [ π ] , . . . , [ π ] i } ) , where [ π ] i denotes the item on the i th position of ranking π. In the aforementioned choice model, forall i ∈ [ n ] , we have f i ( S ) , E u ∼G [ κ u ( S ) | θ u = i ] , representing the probability of clicks functions, and λ i , P u ∼G ( θ u = i ) , representing the probability that a consumer has patience level i . The probabilitythat a consumer clicks on at least one product when offered a ranked set of products π is then f ( π ) , λ f ( { [ π ] } ) + λ f ( { [ π ] , [ π ] } ) + . . . + λ n f n ( { [ π ] , . . . , [ π ] n } ) , where f i ’s are monotone submodular functions and λ i ’s are non-negative. To simplify the analysis,notice that while f is a function of a set of ranked/ordered items, f i is a function of a set that hasat most i items for each i ∈ [ n ] . Online problem.
In the offline setting, the platform knows G , which translates to knowing theprobability of clicks { f ( · ) , . . . , f n ( · ) } and the probability distribution of the patience level λ =( λ , . . . , λ n ) . We study the online user-engagement-maximization ranking problem where on everyround t, a distribution over patience level λ t and the expected probability of click function f t , whichis made of { f t, ( · ) , . . . , f t,n } , are chosen adversarially. The platform, whose goal is to maximize itsuser-engagement, chooses a ranking π t without observing λ t and f t . After choosing the ranking, theplatform observes the function f t in the full-information setting. In the bandit setting, the platform only observes whether or not the consumer clicks on at least one item, but not which item wasclicked on. To the best of our knowledge, the online adversarial version of this problem has not beenstudied before, neither under full information nor bandit setting.Asadpour et al. (2020) showed the offline problem of sequential submodular maximization isNP-hard. They also proposed an optimal (1 − /e ) -approximation algorithm, and a simple -approximation greedy algorithm for this offline problem. Notably, Ferreira et al. (2019) studied aspecial case of the above model for a particular choice model, where consumers click on itemsindependently with given probabilities. They proposed the same simple − approximation greedyalgorithm, and a "learning-then-earning" algorithm for the offline PAC-learning version of theirproblem where a learner has access to samples from user choices. iazadeh et al.:
Online Learning via Offline Greedy Offline algorithm.
In this paper, rather than the optimal algorithm, we also focus on this greedyalgorithm that achieves − approximation, and transform it into an online adversarial learningalgorithm. Our offline algorithm is presented in Algorithm 5. The input to this algorithm is asequential submodular function f : Π → [0 , , where Π is the set of ranking permutations of n items,i.e., Π = { , , . . . , n } n . In this case, having [ π ] i = 0 represents putting no item at position i andmultiple positions can display the same item for simplicity. In this problem, both the domain and thefeasible region are D = C = Π . Let S i denote the set of subsets of [ n ] that consist of at most i items,i.e., S i = { S ⊆ [ n ] | | S | ≤ i } . We have f ( π ) = P nj =1 λ j f j ( { [ π ] , . . . , [ π ] j } ) where f i is a monotonesubmodular function that takes an element of S i as input and returns a probability in [0 , , i.e., f i : S i → [0 , . Algorithm 5, taken from Asadpour et al. (2020) and Ferreira et al. (2019), is a greedyalgorithm with n subproblems, where each subproblem corresponds to a position on the ranking.The algorithm starts filling up the positions from the top and for each position i, it chooses theitem that has the highest marginal probability of click. The update π ( i ) ← π ( i − + z ( i ) e i representsthe action of adding item z ( i ) to position i. Algorithm 5:
Greedy for Sequential Submodular Maximization (Asadpour et al. 2020)
Input:
A sequential submodular function f , which can be represented using a sequence ofmonotone submodular functions { f i ( · ) } i ∈ [ n ] and a sequence of non-negative weights λ . Output:
Ranking π ∈ Π .Set initial ranking π (0) ← n . for position i = 1 , , . . . , n do Local Optimization StepChoose z ( i ) ∈ arg max z ∈ [ n ] P nj = i λ j f j (cid:0) { [ π ( i − ] , . . . , [ π ( i − ] i − , z } (cid:1) − P nj = i λ j f j ( { [ π ( i − ] , . . . , [ π ( i − ] i − } ) .Local Update StepSet π ( i ) ← π ( i ) + z ( i ) e i . return π ← π ( n ) . We cast Algorithm 5 as an instance of
Offline-IG (Algorithm 1). The parameter space is
Θ = ∆([ n ]) and d param = n . Moreover, in subproblem i the algorithm picks the distribution θ ( i ) overitems so that the resulting vector payoff lands in set S . In this language, set S is the n -dimensionalpositive orthant and the vector payoff function is: ∀ j ∈ [ n ] : (cid:2) Payoff ( θ ( i ) , π ( i − , f ) (cid:3) j = θ T y ( i ) − [ y ( i ) ] j , iazadeh et al.: Online Learning via Offline Greedy where y ( i ) , " n X a = i λ a f a ( { [ π ( i − ] , . . . , [ π ( i − ] i − , j } ) − n X a = i λ a f a ( { [ π ( i − ] , . . . , [ π ( i − ] i − } ) j ∈ [ n ] = (cid:2) f ( π ( i − + j e i ) − f ( π ( i − ) (cid:3) j ∈ [ n ] , is the marginal objective value of putting item j on the i th position. Note that any θ ( i ) for which the vector payoff is in S is indeed a distribution over items z ( i ) such that z ( i ) ∈ arg max z ∈ [ n ] P nj = i λ j f j (cid:0) { [ π ( i − ] , . . . , [ π ( i − ] i − , z } (cid:1) − P nj = i λ j f j ( { [ π ( i − ] , . . . , [ π ( i − ] i − } ) . Theorem 5 (Online learning for sequential submodular maximization) . Let n be thenumber of items. For the problem of maximizing sequential submodular functions in the online fullinformation setting, there exists a learning algorithm that obtains O (cid:0) n √ T log n (cid:1) -regret, where T is the number of rounds. Furthermore, in the online bandit setting, there exists a learning algorithmthat obtains O (cid:16) n / (log n ) / T / (cid:17) -regret, where T is the number of rounds. For this problem,the benchmark in the regret bounds is max π ∈ Π P Tt =1 f t ( π ) . Theorem 5 and the following corollary are proved in Section D using our offline-to-online trans-formations presented in Section 4 and Section 5.
Corollary 1 (Online learning for product ranking) . Let n be the number of items. Forthe problem of product ranking optimization to maximize user engagement using the model fromAsadpour et al. (2020), there exists a learning algorithm that obtains O (cid:0) n √ T log n (cid:1) − regret in thefull-information setting and O (cid:16) n / (log n ) / T / (cid:17) − regret in the bandit setting, where T is thenumber of consumers. As an implication, the same problem for the consumer choice model fromFerreira et al. (2019) also has a learning algorithm that obtains O (cid:0) n √ T log n (cid:1) − regret in the fullinformation setting and O (cid:16) n / (log n ) / T / (cid:17) − regret in the bandit setting. For both problems,the benchmark in the regret bounds is max π ∈ Π P Tt =1 f t ( π ) . Problem definition.
In the
Maximizing Multiple Reserves (MMR) prob-lem (Roughgarden and Wang 2019, Derakhshan et al. 2019), a seller wants to sell an item to one of n bidders. Each bidder i has a private value v i for the item. The seller runs a second-price auctionwith personalized reserves r ; the winner is the bidder with the highest bid/valuation among thebidders whose bids exceed/clear their reserve prices. The winner pays the minimum bid that theycould have won with, which is the maximum between their reserve price and the second-highestbid that cleared its reserve price. The seller wishes to maximize their revenue. Since second price auctions are truthful, we will use bids and valuations interchangeably. iazadeh et al.:
Online Learning via Offline Greedy Online problem.
We are interested in the seller’s problem in the online full information and banditsettings. In both settings, each round t ∈ [ T ] involves the seller choosing a set of reserves r t and theadversary choosing a valuation profile v t . In the online full information setting, the seller observesthe valuation profile and gets credit for the resulting revenue. In the online bandit setting, the sellerobserves just the resulting revenue and does not observe the bidders’ valuations or even the identityof the winner. The seller’s goal is to minimize the difference between his average revenue and thebest average revenue in hindsight for a fixed set of reserves r ∗ . To the best of our knowledge, thebandit setting has not been studied in the literature. However, the full information setting of thisproblem is studied by Roughgarden and Wang (2019), where they present a learning algorithm with O (cid:0) n √ T log T (cid:1) -regret, which we improve upon. Offline non-batch vs batch problem.
We start with formulating the offline non-batch problem.Let R = { ρ , ρ , . . . , ρ m } be the set of feasible reserve prices, where |R| = m , and they are sorted: ρ < ρ < · · · < ρ m . For the offline (non-batch) problem, let f : R n × [0 , n → [0 , be theseller’s revenue function: f ( r , v ) = max { [ v ] ˆ j , [ r ] j ∗ } for some v . Here, j ∗ and ˆ j are the highestand second-highest bidders among those who cleared their bids, with ties broken arbitrarily: j ∗ ∈ arg max j ∈ [ n ]:[ v ] j ≥ [ r ] j { [ v ] j } and ˆ j ∈ arg max j ∈ [ n ]:[ v ] j ≥ [ r ] j ,j = j ∗ { [ v ] j } . If no bidder clears their reserve,then we say [ r ] j ∗ and [ v ] ˆ j are both zero. Similarly, if only one bidder clears their reserve, we say [ v ] ˆ j iszero. Moreover, F is the space of all such revenue functions: F = { f | ∃ v ∈ [0 , n such that f ( r , v ) =max { [ v ] ˆ j , [ r ] j ∗ } . In the offline (non-batch) problem, the goal is to solve max r ∈R n f ( r , v ) for an inputvaluation profile v ∈ [0 , n . In the optimization problem, the domain D we consider and the feasibleregion C are both R n .The aforementioned offline problem can be solved efficiently. Note that in the offline problem, theseller who has access to the valuations of the bidders in one auction needs to optimize personalizedreserve prices. It is then obvious that in the offline setting, the best action is to set reserve pricesof all the bidders to zero, except the bidder with the highest bid; for this bidder, his reserve priceis set to his valuation. Then, one may wonder why for the online version of this offline (non-batch)problem, which is not even NP-hard, we characterize -regret, rather than -regret. The reasonis that Roughgarden and Wang (2019) show that the full information online setting is at least ashard as the offline batch problem, which is APX-hard. In the offline batch problem, the seller hasaccess to the valuation profiles in m auctions and would like to determine a single vector of reserveprices r that maximizes revenue across all the m auctions. Considering the hardness of the offlinebatch problem, to solve the offline (non-batch) problem, we use a slight variation of the algorithm ofRoughgarden and Wang (2019). This variation is stated in Algorithm 6, which obtains a fractionof the optimal revenue similar to the original algorithm. See section E.1 for a discussion on themajor differences between this variation and the original one. iazadeh et al.: Online Learning via Offline Greedy Algorithm 6:
Greedy Algorithm for Discretized MMR (Roughgarden and Wang 2019)
Input:
Valuation profile v . Output:
Reserve prices r ∈ R n .Set initial reserves r (0) ← n . for bidder i = 1 , , . . . , n do Define revenue-from-reserves function q ( i ) : R → [0 , as q ( i ) ( r ) equals r if i has thehighest valuation (ties broken arbitrarily) and r ∈ [[ v ] i ′ , [ v ] i ] where i ′ has thesecond-highest valuation, and otherwise.Local Optimization StepChoose z ( i ) ∈ arg max r ∈R q ( i ) ( r ) .// In this case θ ( i ) ∈ ∆( R ) is the distribution that always returns z ( i ) . Local Update StepSet r ( i ) ← r ( i − + z ( i ) e i . return r ∼ Uniform { n , r ( n ) } .Offline algorithm. We now briefly discuss Algorithm 6 and show how to cast it as an instance of
Offline-IG (Algorithm 1). This greedy algorithm has n subproblems, where in each subproblem,reserve price of a bidder i is set using our revenue-from-reserves function q . At the end, the algorithmrandomly returns either the all-zeros reserve vector n or the crafted reserve vector denoted by r ( n ) , where the former yields revenue equal to the second-highest valuation and the latter yieldsrevenue of at least q ( j ∗ ) ( z ( j ∗ ) ) ; see the definition of q ( · ) in the algorithm. By definition of the revenuefunction, the optimal reserves obtain their revenue via one of these two cases, i.e., f ( r ∗ , v ) ≤ max { [ v ] ˆ j , q ( j ∗ ) ([ r ] j ∗ ) } ≤ [ v ] ˆ j + q ( j ∗ ) ([ r ] j ∗ ) ≤ f ( n , v ) + f ( r ( n ) , v ) = 2 E [ f ( r , v )] , where the expectation is taken with respect to the randomness in the algorithm. This impliesthat our algorithm is indeed a -approximation. Stated in the language of our Algorithm 1, ourlocal updates manage to guarantee that Payoff ( θ ( i ) , r ( i − , v ) is in the positive orthant, where the(asymmetric) vector payoff function Payoff returns an m -dimensional point whose j th coordinatevalue is the expected difference between the expected value of picking a reserve according to θ ( i ) and that of picking ρ j : (cid:2) Payoff ( θ ( i ) , r ( i − , v ) (cid:3) j , E z ′ ∼ θ ( i ) (cid:2) q ( i ) ( z ′ ) − q ( i ) ( ρ j ) (cid:3) . The following theorem shows that using our framework, the greedy Algorithm 6 can be trans-formed to polynomial-time online learning algorithms under both full information and bandit feed-back structures. Note that here,
Payoff ( θ ( i ) , r ( i − , v ) is not a function of r ( i − . iazadeh et al.: Online Learning via Offline Greedy Theorem 6 (Online learning for maximizing multiple reserves) . Let R = { ρ , . . . , ρ m } be the set of possible reserve prices and n be the number of bidders. Assume that the maximumvaluation is normalized to one. Then, for the problem of maximizing personalized reserve prices inthe online full information setting, there exists a learning algorithm that obtains O (cid:16) nT / log / m (cid:17) -regret, where T is the number of auctions. Furthermore, in the online bandit setting, there exists alearning algorithm that obtains O (cid:16) nm / T / log / m (cid:17) -regret, where T is the number of auctions.Here, the benchmark in the regret bounds is max r ∈R n P Tt =1 f ( r , v t ) . The proof of Theorem 6 is presented in Section E.2 in the appendix. At a high level, to prove thistheorem, we first show that Algorithm 6 is an extended ( , ) -robust approximation algorithm. Wethen confirm that this algorithm is bandit Blackwell reducible. To do so, we construct an unbiasedestimator of the revenue-from-reserves function q ( i ) . This unbiased estimator allows us to build anexplore sampling device per Definition 12. Having verified these two main properties, we then invokeTheorems 2 and 4 to get the final regret bounds.The following corollary considers a stronger benchmark than the one we considered earlier. Thisbenchmark allows the reserve prices to be any number in [0 , n , i.e., the regret is computed against max r ∈ [0 , n P Tt =1 f ( r , v t ) , rather than max r ∈R n P Tt =1 f ( r , v t ) . The corollary then confirms theexistence of learning algorithms with sublinear regret bounds against this stronger benchmark. SeeSection E.3 in the appendix for proof. Corollary 2.
Let n be the number of bidders. Assume that the maximum valuation is normalizedto one. Then, for the problem of maximizing personalized reserve prices in the online full informationsetting, there exists a learning algorithm that obtains O (cid:16) nT / log / T (cid:17) -regret, where T is thenumber of auctions. Furthermore, in the online bandit setting, there exists a learning algorithm thatobtains O (cid:16) n / T / log / ( nT ) (cid:17) -regret, where T is the number of auctions. Here, the benchmarkin the regret bounds is max r ∈ [0 , n P Tt =1 f ( r , v t ) . Problem definition.
Consider the
Non-monotone submodular maximization (NSM) prob-lem, for both set and continuous functions, as defined in Section 2.4. For set functions, our goal isto maximize a non-monotone submodular set function without any constraints, and for continuousfunctions, our goal is to maximize a non-monotone continuous submodular function, either weak-DR(Definition 6) or strong-DR (Definition 7), over the unit hypercube [0 , n .For set functions, the offline algorithm of Buchbinder et al. 2015 gives a / -approximation fac-tor, which is known to be the best possible approximation factor with polynomial query calls to thefunction (Feige et al. 2011). For the continuous case, under both weak-DR and strong-DR submod-ularity, the offline algorithm of Niazadeh et al. 2018 gives a / -approximation factor for Lipschitz iazadeh et al.: Online Learning via Offline Greedy continuous functions, which again achieves the best possible approximation factor with polynomialquery calls to the function.To have a unified offline problem and algorithm capturing both of the above variations, we firstconsider a slight reformulation where a continuous (weak-DR) submodular function is restrictedto a discrete domain R n instead of [0 , n . Here, R = { ρ , ρ , . . . , ρ m } is the finite set of possiblecoordinate values, where |R| = m and ρ < ρ < · · · < ρ m are real numbers. Note that R = { , } when we focus on set functions. For Lipschitz continuous functions, one should think of R n as an ǫ -net that discretizes the function with O ( ǫ ) additive error due to Lipschitzness.Given this unified setting, we essentially consider discrete functions f : R n → [0 , that satisfya discrete version of (weak-DR) submodularity. This property is exactly the same as continuoussubmodularity in Definition 6, with a slight modification that we only consider points x ∈ R n . Givensuch a function, our goal in the offline problem is to solve the optimization problem max z ∈R n f ( z ) .We refer to this problem as discretized submodular maximization . Note that this problem is aninstance of problem (1), where both D and the feasible region C are R n , and our function class isthe class of submodular functions f described above.Inspired by the algorithms in Buchbinder et al. (2015) and Niazadeh et al. (2018), we then presenta unified offline algorithm (which essentially is an adaptation of the algorithm in Niazadeh et al.(2018) restricted to the discrete domain R n ) with the same / -approximation factor for the pro-posed unified offline problem. This is presented in Algorithm 7. We then transform this offlinealgorithm to online full-information and bandit learning algorithms using our framework. Offline algorithm.
Algorithm 7 is a modified version of the continuous randomized bi-greedy algo-rithm by Niazadeh et al. (2018). The difference between Algorithm 7 and the continuous randomizedbi-greedy algorithm is discussed in Section F.1 in the appendix. Throughout this section, we use thenotation ( z ′ , z − i ) to denote the point constructed by taking z and replacing its i th coordinate valuewith z ′ , and f ( z ′ , z − i ) to denote the function evaluated at the corresponding point. The algorithmkeeps track of two points: lower bound z ¯ ( i ) and upper bound ¯ z ( i ) , where initially, z ¯ (0) = ( ρ , . . . , ρ ) ,and ¯ z (0) = ( ρ m , . . . , ρ m ) . The lower and upper bounds get updated as the algorithm goes through n subproblems. In subproblem i , the algorithm decides about the i th coordinate: it sets the i th coordi-nate to z ′ i , where z ′ i is drawn from distribution θ ( i ) ∈ ∆( R ) . Here, this distribution is chosen in a wayto satisfy the following condition E z ′ ∼ θ ( i ) (cid:2) α ( i ) ( z ′ ) + β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ′ ) (cid:3) ≥ for all ˆ z ∈ R . Notethat α ( i ) ( z ′ ) = f ( z ′ , z ¯ ( i − − i ) − f ( z ¯ ( i − ) is the marginal value of increasing the value of i th -coordinatefrom ρ to z ′ when the rest of coordinates are z ¯ ( i − − i , and similarly β ( i ) ( z ′ ) = f ( z ′ , ¯ z ( i − − i ) − f ( ¯ z ( i − ) is the marginal value of decreasing the i th coordinate from ρ m to z ′ when the rest of coordinatesare ¯ z ( i − − i . Moreover, ζ ( i ) (ˆ z, z ′ ) is equal to α ( i ) (ˆ z ) − α ( i ) ( z ′ ) if ˆ z ≥ z ′ and β ( i ) (ˆ z ) − β ( i ) ( z ′ ) otherwise.Roughly speaking, ζ ( i ) (ˆ z, z ′ ) measures the extent to which setting the i th coordinate to z ′ , rather iazadeh et al.: Online Learning via Offline Greedy than ˆ z , is locally suboptimal. With this interpretation, the aforementioned condition ensures thatthe algorithm’s choice for the i th coordinate approximately compensates for the cost caused by thesuboptimality of this choice. We refer the readers to Niazadeh et al. (2018) for a more detaileddiscussion on the intuition behind this condition.We now show how to cast the above algorithm as an instance of Offline-IG (Algorithm 1). Inthe language of Algorithm 1, the aforementioned condition can be presented using the following
Payoff function: j ∈ [ m ] : (cid:2) Payoff (cid:0) θ ( i ) , z ¯ ( i − , f (cid:1)(cid:3) j = E z ′ ∼ θ ( i ) (cid:20) α ( i ) ( z ′ ) + 12 β ( i ) ( z ′ ) − ζ ( i ) ( ρ j , z ′ ) (cid:21) ≥ . (4)Moreover, we have Θ = ∆( R ) , d param = |R| = m, and z is the vector z ¯ that starts as ( ρ , . . . , ρ ) T then gets updated at each iteration. Theorem 7 (Online learning for discretized non-monotone submodular maximization) . Let n be the number of dimensions and R = { ρ , ρ , . . . , ρ m } be the set of potential values that eachcoordinate i ∈ [ n ] can take. Assume that the maximum function value is normalized to one. Then,for the problem of maximizing a non-monotone submodular function in the online full-informationsetting, there exists a learning algorithm that obtains O (cid:0) nT / (log m ) / (cid:1) -regret, where T isthe number of rounds. Furthermore, in the online bandit setting, there exists an online learningalgorithm that obtains O (cid:0) nm / T / (log m ) / (cid:1) -regret. Here, in both online algorithms, thebenchmark in the regret bounds is max z ∈R n P Tt =1 f t ( z ) . The proof of Theorem 7, which is presented in Section F.2 in the appendix, has two main steps.In the first step, we show that the offline Algorithm 7 is an extended ( , ) - robust approximationalgorithm and in the second step, we show that it is bandit Blackwell reducible. The challengingpart of the proof is to construct an explore sampling device that leads to an unbiased estimator forthe payoff function. We then invoke Theorems 2 and 4 to get the final regret bounds.The following is an immediate corollary of Theorem 7. Corollary 3 (Online learning for non-monotone set submodular maximization) . Let n be the number of items, and assume the maximum function value is normalized to one.Then for the problem of maximizing a nonmonotone (set) submodular function in the onlinefull-information setting, there exists an online learning algorithm that obtains O (cid:16) n √ T (cid:17) -regret,where T is the number of rounds. Furthermore, in the online bandit setting, there exists a learningalgorithm that obtains O (cid:0) nT / (cid:1) -regret, where T is the number of rounds. For any i ∈ [ n ] , one can construct ¯ z ( i ) from z ¯ ( i ) by replacing its last n − coordinates with ρ m . Thus, it suffices todefine Payoff as a function of z ¯ ( i ) . iazadeh et al.: Online Learning via Offline Greedy Algorithm 7:
Greedy Algorithm for Discretized NSM (Niazadeh et al. 2018)
Input:
Discrete submodular function f . Output:
Point z ∈ R n .Set initial lower bound z ¯ (0) ← ( ρ , ρ , . . . , ρ ) T and upper bound ¯ z (0) ← ( ρ m , ρ m , . . . , ρ m ) T . for coordinate i = 1 , , . . . , n do Define the lower marginal function α ( i ) : R → [ − , +1] as α ( i ) ( z ′ ) = f ( z ′ , z ¯ ( i − − i ) − f ( z ¯ ( i − ) .Define the upper marginal function β ( i ) : R → [ − , +1] as β ( i ) ( z ′ ) = f ( z ′ , ¯ z ( i − − ) − f ( ¯ z ( i − ) .Define comparison function ζ ( i ) : R × R → [ − , +1] as ζ ( i ) (ˆ z, z ′ ) = (cid:26) α ( i ) (ˆ z ) − α ( i ) ( z ′ ) if ˆ z ≥ z ′ β ( i ) (ˆ z ) − β ( i ) ( z ′ ) if ˆ z ≤ z ′ . Local Optimization StepChoose θ ( i ) ∈ ∆( R ) so that for all ˆ z ∈ R , E z ′ ∼ θ ( i ) (cid:20) α ( i ) ( z ′ ) + 12 β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ′ ) (cid:21) ≥ (5)(done in Niazadeh et al. (2018) via preprocessing and computing a 2D convex hull).Local Update StepSample z ′ i ∼ θ ( i ) . Set z ¯ ( i ) ← z ¯ ( i − and ¯ z ( i ) ← ¯ z ( i − and then update their i th coordinate: (cid:2) z ¯ ( i ) (cid:3) i ← z ′ i and (cid:2) ¯ z ( i ) (cid:3) i ← z ′ i . return z ← z ¯ ( n ) .So far, we assumed that for the continuous submodular functions, the set of potential value foreach coordinate is finite and belongs to set R = { ρ , ρ , . . . , ρ m } , rather than the interval [0 , ,and we design learning algorithms with sublinear regret bounds where the regrets are computedwith respect to max z ∈R P Tt =1 f t ( z ) . Now, one may wonder if one can design learning algorithmsagainst the benchmark of max z ∈ [0 , n P Tt =1 f t ( z ) that allows the coordinates to be any number in [0 , n . The following corollary answers this question for any L -Lipschitz non-monotone continuoussubmodular functions. Corollary 4 (Online learning for L − Lipschitz continuous submodular maximization) . Let n be the number of dimensions, and assume the maximum function value is normalized to one.Then for the problem of maximizing a coordinate-wise L -Lipschitz non-monotone (continuous)submodular function in the online full-information setting, there exists a learning algorithm thatobtains O (cid:16) nT / log / ( LT ) (cid:17) -regret, where T is the number of rounds. Furthermore, in the online iazadeh et al.: Online Learning via Offline Greedy bandit setting, there exists a learning algorithm that obtains O (cid:16) nL / T / log / ( LT ) (cid:17) -regret,where T is the number of rounds. Here, in both online algorithms, the benchmark in the regretbounds is max z ∈ [0 , n P Tt =1 f t ( z ) . Proofs of the above corollaries are in Section F.3 in the appendix.
7. Conclusion
In many settings, decision-makers need to solve NP-hard combinatorial problems repeatedly overtime while the underlying (unknown) reward function changes. Motivated by this, we study theproblem of designing online adversarial algorithms for combinatorial problems that are amenable toa robust greedy approximation algorithm. Using Blackwell strategies, we present a unified frameworkto transform offline robust greedy approximation algorithms to their online counterparts for bothfull information and bandit feedback structures. We show that by applying our framework to severalapplications including maximizing submodular functions, optimizing reserve prices in second priceauctions, and optimizing product ranking on online platforms, we can obtain improved/new regretbounds.While we have investigated a selective set of applications of our framework, we believe our frame-work is general and can capture several other algorithmic problems in revenue management andmarket design. Examples are online learning in assortment optimization (which is also closely relatedto submodular optimization and greedy algorithms proved to be useful there as well; cf. Désir et al.(2015), Agrawal et al. (2019)) and budgeted allocation/Adwords problem in sponsored search auc-tions (where again greedy algorithms proved to be helpful; cf Mehta et al. (2013)). Thus, we believethat investigating other applications of our framework is indeed an interesting future research direc-tion.
Acknowledgments
We thank Tim Roughgarden, Dimitris Bertsimas, and Amin Karbasi for their insightful comments duringthis work.
References
Jacob Abernethy, Elad E Hazan, and Alexander Rakhlin. Competing in the dark: An efficient algorithmfor bandit linear optimization. In , pages263–273, 2008.Jacob Abernethy, Peter L Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning areequivalent. In
Proceedings of the 24th Annual Conference on Learning Theory , pages 27–46, 2011.Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. Mnl-bandit: A dynamic learningapproach to assortment selection.
Operations Research , 67(5):1453–1485, 2019. iazadeh et al.:
Online Learning via Offline Greedy
Mathematicalprogramming , 128(1-2):149–169, 2011.Saeed Alaei, Jason Hartline, Rad Niazadeh, Emmanouil Pountourakis, and Yang Yuan. Optimal auctionsvs. anonymous pricing.
Games and Economic Behavior , 118:494–510, 2019.Ali Aouad and Danny Segev. Display optimization for vertically differentiated locations under multinomiallogit preferences.
Available at SSRN 2709652 , 2015.Arash Asadpour, Rad Niazadeh, Amin Saberi, and Ali Shameli. Ranking an assortment of products viasequential submodular optimization.
Available at SSRN , 2020.Susan Athey and Glenn Ellison. Position auctions with consumer search.
The Quarterly Journal of Eco-nomics , 126(3):1213–1270, 2011.Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization.
Mathematics of Operations Research , 39(1):31–45, 2014.Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed banditproblem.
SIAM journal on computing , 32(1):48–77, 2002.Andrey Bernstein and Nahum Shimkin. Response-based approachability and its application to generalizedno-regret algorithms. arXiv preprint arXiv:1312.7658 , 2013.Hedyeh Beyhaghi, Negin Golrezaei, Renato Paes Leme, Martin Pal, and Balasubramanian Sivan. Improvedapproximations for free-order prophets and second-price auctions. arXiv preprint arXiv:1807.03435 ,2018.Andrew An Bian, Baharan Mirzasoleiman, Joachim M Buhmann, and Andreas Krause. Guaran-teed non-convex optimization: Submodular maximization over continuous domains. arXiv preprintarXiv:1606.05615 , 2016.David Blackwell. An analog of the minimax theorem for vector payoffs.
Pacific Journal of Mathematics , 6(1):1–8, 1956.Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. Towards minimax policies for online linearoptimization with bandit feedback. In
Conference on Learning Theory , pages 41–1, 2012.Sébastien Bubeck et al. Convex optimization: Algorithms and complexity.
Foundations and Trends® inMachine Learning , 8(3-4):231–357, 2015.Niv Buchbinder and Moran Feldman. Deterministic algorithms for submodular maximization problems.
ACM Transactions on Algorithms (TALG) , 14(3):32, 2018.Niv Buchbinder, Moran Feldman, Joseph Seffi, and Roy Schwartz. A tight linear time (1/2)-approximationfor unconstrained submodular maximization.
SIAM Journal on Computing , 44(5):1384–1402, 2015.Nicolo Cesa-Bianchi and Gabor Lugosi.
Prediction, learning, and games . Cambridge university press, 2006. iazadeh et al.:
Online Learning via Offline Greedy
Journal of Computer and System Sciences ,78(5):1404–1422, 2012.Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Regret minimization for reserve prices in second-price auctions.
IEEE Transactions on Information Theory , 61(1):549–564, 2014.Lin Chen, Christopher Harshaw, Hamed Hassani, and Amin Karbasi. Projection-free online optimizationwith stochastic gradient: From convexity to submodularity. arXiv preprint arXiv:1802.08183 , 2018.Lin Chen, Mingrui Zhang, Hamed Hassani, and Amin Karbasi. Black box submodular maximization: Discreteand continuous settings. arXiv preprint arXiv:1901.09515 , 2019.Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and appli-cations. In
International Conference on Machine Learning , pages 151–159, 2013.Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al. Combinatorialbandits revisited. In
Advances in Neural Information Processing Systems , pages 2116–2124, 2015.Mahsa Derakhshan, Negin Golrezaei, Vahideh Manshadi, and Vahab Mirrokni. Product ranking on onlineplatforms.
Available at SSRN 3130378 , 2018.Mahsa Derakhshan, Negin Golrezaei, and Renato Paes Leme. Lp-based approximation for personalizedreserve prices. In
Proceedings of the 2019 ACM Conference on Economics and Computation , pages589–589, 2019.Antoine Désir, Vineet Goyal, Danny Segev, and Chun Ye. Capacity constrained assortment optimizationunder the markov chain based choice model.
Operations Research, Forthcoming , 2015.Shahar Dobzinski and Michael Schapira. An improved approximation algorithm for combinatorial auctionswith submodular bidders. In
Proceedings of the seventeenth annual ACM-SIAM symposium on Discretealgorithm , pages 1064–1073, 2006.Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E Schapire, Vasilis Syrgkanis, and Jennifer WortmanVaughan. Oracle-efficient online learning and auction design. In , pages 528–539. IEEE, 2017.Eyal Even-Dar, Robert Kleinberg, Shie Mannor, and Yishay Mansour. Online learning for global costfunctions. In
COLT , 2009.Uriel Feige, Vahab S Mirrokni, and Jan Vondrák. Maximizing non-monotone submodular functions.
SIAMJournal on Computing , 40(4):1133–1153, 2011.Kris Ferreira, Sunanda Parthasarathy, and Shreyas Sekar. Learning to rank an assortment of products.
Available at SSRN 3395992 , 2019.Negin Golrezaei, Adel Javanmard, and Vahab Mirrokni. Dynamic incentive-aware learning: Robust pricingin contextual auctions. In
Advances in Neural Information Processing Systems , pages 9756–9766, 2019. iazadeh et al.:
Online Learning via Offline Greedy
Proceedings 10th ACMConference on Electronic Commerce (EC-2009), Stanford, California, USA, July 6–10, 2009 , pages225–234, 2009.Hamed Hassani, Mahdi Soltanolkotabi, and Amin Karbasi. Gradient methods for submodular maximization.In
Advances in Neural Information Processing Systems , pages 5841–5851, 2017.Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Zebang Shen. Stochastic conditional gradient++. arXiv preprint arXiv:1902.06992 , 2019.Elad Hazan and Zohar Karnin. Volumetric spanners: an efficient exploration basis for learning.
The Journalof Machine Learning Research , 17(1):4062–4095, 2016.Elad Hazan and Tomer Koren. The computational power of optimization in online learning. In
Proceedingsof the forty-eighth annual ACM symposium on Theory of Computing , pages 128–141. ACM, 2016.Sham M Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms.
SIAM Journal on Computing , 39(3):1088–1106, 2009.Adam Kalai and Santosh Vempala. Efficient algorithms for online decision problems.
Journal of Computerand System Sciences , 71(3):291–307, 2005.David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network.In
Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and datamining , pages 137–146, 2003.A Gürhan Kök, Marshall L Fisher, and Ramnath Vaidyanathan. Assortment planning: Review of literatureand industry practice. In
Retail supply chain management , pages 99–153. Springer, 2008.Joon Kwon and Vianney Perchet. Online learning and blackwell approachability with partial monitoring:optimal convergence rates. In
Artificial Intelligence and Statistics , pages 604–613, 2017.Ehud Lehrer. Approachability in infinite dimensional spaces.
International Journal of Game Theory , 31(2):253–268, 2003.Shie Mannor and Nahum Shimkin. Online learning with variable stage duration. In
International Conferenceon Computational Learning Theory , pages 408–422. Springer, 2006.Shie Mannor, Vianney Perchet, and Gilles Stoltz. Robust approachability and regret minimization in gameswith partial monitoring. In
Proceedings of the 24th Annual Conference on Learning Theory , pages515–536, 2011.Shie Mannor, Vianney Perchet, and Gilles Stoltz. Set-valued approachability and online learning with partialmonitoring.
The Journal of Machine Learning Research , 15(1):3247–3295, 2014.Aranyak Mehta et al. Online matching and ad allocation.
Foundations and Trends® in Theoretical ComputerScience , 8(4):265–368, 2013.Emanuel Milman. Approachable sets of vector payoffs in stochastic games.
Games and Economic Behavior ,56(1):135–147, 2006. iazadeh et al.:
Online Learning via Offline Greedy arXiv preprint arXiv:1804.09554 , 2018.Elchanan Mossel and Sebastien Roch. Submodularity of influence in social networks: From local to global.
SIAM Journal on Computing , 39(6):2176–2188, 2010.George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maxi-mizing submodular set functions—i.
Mathematical programming , 14(1):265–294, 1978.Rad Niazadeh, Tim Roughgarden, and Joshua Wang. Optimal algorithms for continuous non-monotonesubmodular and dr-submodular maximization. In
Advances in Neural Information Processing Systems ,pages 9594–9604, 2018.Tim Roughgarden and Joshua R Wang. An optimal learning algorithm for online unconstrained submodularmaximization. In
Conference On Learning Theory , pages 1307–1325, 2018.Tim Roughgarden and Joshua R Wang. Minimizing regret with multiple reserves.
ACM Transactions onEconomics and Computation (TEAC) , 7(3):1–18, 2019.Shai Shalev-Shwartz. Online learning: Theory, algorithms, and applications.
PhD Thesis , 2007.Shai Shalev-Shwartz et al. Online learning and online convex optimization.
Foundations and Trends® inMachine Learning , 4(2):107–194, 2012.Tasuku Soma and Yuichi Yoshida. Maximizing monotone submodular functions over the integer lattice.
Mathematical Programming , 172(1-2):539–563, 2018.Xavier Spinat. A necessary and sufficient condition for approachability.
Mathematics of Operations Research ,27(1):31–44, 2002.M Streeter and D Golovin. An online algorithm for maximizing submodular functions (technical reportcmu-cs-07-171), 2007.Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. In
Advances in Neural Information Processing Systems , pages 1577–1584, 2008.Nguyen Kim Thang and Abhinav Srivastav. Online non-monotone dr-submodular maximization. arXivpreprint arXiv:1909.11426 , 2019.Taishi Uchiya, Atsuyoshi Nakamura, and Mineichi Kudo. Algorithms for adversarial bandit problems withmultiple plays. In
International Conference on Algorithmic Learning Theory , pages 375–389. Springer,2010.Raluca Mihaela Ursu. The power of rankings: Quantifying the effect of rankings on online consumer searchand purchase decisions.
Browser Download This Paper , 2016.Nicolas Vieille. Weak approachability.
Mathematics of operations research , 17(4):781–791, 1992.Jan Vondrák. Optimal approximation for the submodular welfare problem in the value oracle model. In
Proceedings of the fortieth annual ACM symposium on Theory of computing , pages 67–74, 2008. iazadeh et al.:
Online Learning via Offline Greedy
44H Martin Weingartner.
Mathematical programming and the analysis of capital budgeting problems . MarkhamPublishing Company, 1967.Andrew Chi-Chin Yao. Probabilistic computations: Toward a unified measure of complexity. In , pages 222–227. IEEE, 1977.Mingrui Zhang, Lin Chen, Hamed Hassani, and Amin Karbasi. Online continuous submodular maximization:From full-information to bandit feedback. In
Advances in Neural Information Processing Systems ,pages 9206–9217, 2019a.Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. One sample stochasticfrank-wolfe. arXiv preprint arXiv:1910.04322 , 2019b.Julian Zimmert, Haipeng Luo, and Chen-Yu Wei. Beating stochastic and adversarial semi-bandits optimallyand simultaneously. arXiv preprint arXiv:1901.08779 , 2019. iazadeh et al.:
Online Learning via Offline Greedy Appendix A: Proofs and Remarks of Section 2.3
A.1. Equivalent criteria for approachability
Interestingly, there are other structural conditions that are equivalent to approachability. For example, theoriginal proof of the Blackwell approachability theorem (Blackwell 1956) uses a condition called “halfspace-satisfiability”. The following proposition summarizes all the known equivalences.
Proposition 1 (Satisfiable/Halfspace-Satisfiable/Response-Satisfiable (Abernethy et al. 2011)) . The following conditions are all equivalent to the approachability condition (Definition 3): A target set S is satisfiable in the Blackwell sequential game ( X , Y , p ) if there exists a player 1’s action x ∈ X such that for every player 2’s action y ∈ Y , the vector payoff falls into the target set, that is p ( x , y ) ∈ S . A target set S is halfspace-satisfiable in the Blackwell sequential game ( X , Y , p ) if for every halfspace H ⊇ S , H is satisfiable. A target set S is response-satisfiable in the Blackwell sequential game ( X , Y , p ) if for every player 2’saction y ∈ Y , there exists a player 1’s action x ∈ X such that the vector payoff falls into the target set,that is p ( x , y ) ∈ S . A.2. Proof of Theorem 1
Proof.
The proof for the only if direction relies on the fact that the ℓ ∞ -distance between the averagepayoff and S is vanishing as T → + ∞ since S is o (1) -approachable. Suppose that S is not response satisfiable,then there exists player 2’s action y ∈ Y such that for every player 1’s action x ∈ X , the payoff p ( x , y ) is notin S . Consider the set U := { p ( x , y ) : x ∈ X } . Because the payoff p is biaffine and X is convex and compact,so is U, hence inf u ∈ U d ∞ ( u , S ) = d ∞ ( p ( x , y ) , S ) for some x ∈ X . As p ( x , y ) / ∈ S, β = d ∞ ( p ( x , y ) , S ) > . Whenplayer 2 always plays y , we know that the ℓ ∞ distance between the average payoff and S should convergeto zero as S is o (1) -approachable. At the same time, it is at least β , a contradiction.To prove the if direction, we first show a reduction from Blackwell approachability to Online LinearOptimization (OLO) by showing that we can upper bound the ℓ ∞ distance between the average payoff andthe target set in a Blackwell approachability problem with the regret of the corresponding OLO instance.Then, we bound the regret of the OLO problem from above in terms of the ℓ ∞ norm of the payoff D ( p ) (because of our desired bound), the number of rounds T, and the dimension of the payoff function d . Weassume that S is a cone throughout the proof, which is not an issue because we can always lift a convex setto a cone in one dimension higher while not perturbing the distances by more than a factor of . Blackwell approachability reduces to OLO.
In an OLO problem, a player is given a compact convexdecision set
K ⊂ R d , and have to decide on a sequence of actions w , w , . . . , w T ∈ K . In round t, after theplayer decides on an action w t , Nature reveals a loss vector l t and the player pays h l t , w t i . The player observesthe loss vector l t in each round (full-information setting) and aims to minimize his cost. We want to constructa learning algorithm L , such that, for any sequence of loss vectors l , l , . . . , l T ∈ R d , the algorithm outputs w , w , . . . , w T ∈ K that attains a small regret, i.e. P Tt =1 h l t , w t i − min w ∈K P Tt =1 h l t , w i ≤ o ( T ) . Abernethy et al.(2011) show that we can efficiently obtain an algorithm for a Blackwell approachability problem from analgorithm for its corresponding OLO problem. Specifically, we have the following lemma. iazadeh et al.:
Online Learning via Offline Greedy Lemma 1. (Abernethy et al. (2011)) Given a Blackwell instance ( X , Y , p ) , and a cone S such that S isresponse-satisfiable, we can construct an OLO problem with K = S ◦ ∩ B (1) and l t = − p ( x t , y t ) for all t, such that, if the OLO learning algorithm returns w t in round t, we can convert it into x t ∈ X where d T T X t =1 p ( x t , y t ) , S ! ≤ T T X t =1 h l t , w t i − min w ∈K T X t =1 h l t , w i ! . Proof of Lemma 1.
This lemma was proved in Abernethy et al. (2011), but we include the proof herefor completion. Notice that, for any x ∈ R d and convex cone S ⊂ R d , the distance from x to S can be writtenas d ∞ ( x , S ) = max w ∈ S ◦ , k w k≤ h w , x i (6)because d ∞ ( x , S ) = k x − π S ( x ) k ≥ k w kk x − π S ( x ) k ≥ h w , x − π S ( x ) i ≥ h w , x i , where π S ( x ) denotes the projection of x onto S , and when w = x − π S ( x ) k x − π S ( x ) k , we have equality, i.e. h w , x i = d ( x , S ) . To construct a mapping from the output of the OLO algorithm w t to x t for the Blackwell game,we utilize the halfspace oracle for the Blackwell problem (see Proposition 1). Specifically, we pick x t suchthat p ( x t , y t ) ∈ H w t for all y ∈ Y , where H w t = { x : h w t , x i ≤ } is a halfspace that contains S ( H w t contains S because its normal, w t , is in S ◦ ). This gives us the following guarantee d T T X t =1 p ( x t , y t ) , S ! (1) = max w ∈K * T T X t =1 p ( x t , y t ) , w + = 1 T max w ∈K − T X t =1 h l t , w i ! (7) (2) ≤ T T X t =1 h− p ( x t , y t ) , w t i − min w ∈K T X t =1 h l t , w i ! (3) = 1 T T X t =1 h l t , w t i − min w ∈K T X t =1 h l t , w i ! . Here, Equality (1) follows from Equation (6), Inequality (2) holds because p ( x t , y t ) ∈ H w t , and Equality (3) holds from our definition of l t . (cid:4) As a corollary, since for any x ∈ R d and S ⊆ R d , the ℓ ∞ distance is always less than equal to the ℓ distance,i.e., d ∞ ( x , S ) ≤ d ( x , S ) , we obtain d ∞ T T X t =1 p ( x t , y t ) , S ! ≤ d T T X t =1 p ( x t , y t ) , S ! ≤ T T X t =1 h l t , w t i − min w ∈K T X t =1 h l t , w i ! . S ◦ is the polar cone of S , i.e. S ◦ := { s ∈ R d : h s , x i ≤ for all x ∈ S } , and B (1) is a Euclidian ball with radius ,i.e. B (1) = { w ∈ R d : k w k ≤ } . iazadeh et al.: Online Learning via Offline Greedy OLO regret upper-bound with Follow-the-Regularized-Leader algorithm.
To obtain the upper-boundon the regret of an OLO problems in terms of the ℓ ∞ norm of its losses, we apply the Follow-the-Regularized-Leader algorithm with a µ -strongly convex regularizer with respect to the ℓ norm. We use a regularizerwith respect to the ℓ norm, the dual of the ℓ ∞ norm, because of the bound structure of the algorithm asstated in Lemma 2. We elaborate further in the following lemmas. Lemma 2. (Shalev-Shwartz et al. (2012)) Consider an OLO problem on a convex and compact decisionspace
K ⊆ R d . Applying Follow-the-Regularized-Leader algorithm with a regularizer R, where R : R d → R is a µ -strongly convex function with respect to some norm k·k for µ > , implies T T X t =1 h l t , w t i − min w ∈K T X t =1 h l t , w i ! ≤ O (cid:0) CB / µ − / T − / (cid:1) , where B > upper bounds the function R, C > upper bounds the dual norm of the loss vector k l k ∗ , and T is the number of rounds. Lemma 3. (Shalev-Shwartz (2007)) For q ∈ (1 , , the function f : R d → R defined as f ( x ) = q − k x k q is strongly convex with respect to the ℓ q norm over R d . Recall that the ℓ q norm is defined as k x k q =( x q + x qw + . . . + x qd ) /q for x ∈ R d . To get a bound from Lemma 2 that depends on the upper-bound of the ℓ ∞ norm of the loss vectors, wewant a regularizer R that is µ -strongly convex w.r.t the ℓ norm for some µ > (to be determined later).However, the function from Lemma 3 does not work for q = 1 . To solve this, we set q to be greater than , q = log( d )log( d ) − in particular, then bound the ℓ q norm from above using the ℓ norm. Specifically, setting R ( x ) = k x k q and q = log( d )log( d ) − we have R ( x ) = ( q − · q − k x k q (1) ≥ ( q −
1) 12( q − k x k q + ( q − ∇ (cid:18) q − k x k q (cid:19) T ( y − x ) + ( q − · · k y − x k q (2) ≥ ( q −
1) 12( q − k x k q + ( q − ∇ (cid:18) q − k x k q (cid:19) T ( y − x ) + ( q − · · k y − x k R ( x ) + R ( x ) T ( y − x ) + ( q − / k y − x k , where µ = q − = log( d )log( d ) − − =
13 log( d ) . So, the function R is
13 log( d ) -strongly convex with respect to the ℓ norm. Furthermore, Inequality (1) follows from Lemma 3 and Inequality (2) holds because k w k / ≤ k w k q for any w ∈ R d . Consequently, by constructing an OLO problem with K = S ◦ ∩ B (1) and l t = − p ( x t , y t ) on each round,applying the Follow-the-Regularized-Leader algorithm with regularizer R ( w ) = k w k q to the OLO problem,and converting w t to x t in each round, we obtain d ∞ T T X t =1 p ( x t , y t ) , S ! ≤ T T X t =1 h l t , w t i − min w ∈K T X t =1 h l t , w i ! ≤ O (cid:0) D ( p ) log( d ) / T − / (cid:1) . A µ -strongly convex function f with respect to the ℓ q norm is a differentiable function that satisfies f ( y ) ≥ f ( x ) + ∇ (cid:0) f ( x ) T ( y − x ) (cid:1) + µ k y − x k q for some µ > . If k·k is a norm in R d , the dual norm k·k ∗ of k·k is defined as k w k ∗ = sup { w T x | k x k ≤ } . iazadeh et al.: Online Learning via Offline Greedy B = 1 and C = D ( p ) . Notice that for any w ∈ K , k w k ≤ B because we set K = S ◦ ∩ B (1) inLemma 1. Furthermore, since we set l t = p ( x t , y t ) , we have k l t k ∞ = k p ( x t , y t ) k ∞ ≤ D ( p ) by definition. (cid:4) Appendix B: Proofs and Remarks of Section 3.1
Example 2 (Non-Robust Greedy Algorithm).
In the shortest path tree problem, we are given anundirected graph G = ( V, E ) along with a root node u and edge weights { w uv } ( u,v ) ∈ E . We want to computea spanning tree of G such that for all vertices v ∈ V , the distance to the root in the tree, dist T ( u, v ) , equalsthe distance to the root in the original graph, dist G ( u, v ) . This problem can be solved by a greedy algorithmwhich runs Dijkstra’s algorithm from u and then for each node v = u chooses a parent p ∈ neighborhood ( v ) with the smallest w vp + dist G ( p, u ) . Suppose that we want to solve the online problem where the G and u are fixed over all rounds but the edge weights are chosen by an adversary.This can be translated into the language of our meta-algorithm as follows. The feasible region is to choosea parent for every non-root vertex ( C = Q v ∈ V \{ u } neighborhood ( v ) ). The adversary’s function space is tochoose (bounded) weights ( F ∼ = (0 , E ), and the cost of a chosen set of edges that we aim to minimize is theaverage distance from a random vertex to u. For each of our | V | subroutines, the parameter space is to choosea distribution for the parent vertex ( Θ = ∆( neighborhood ( v )) ) . The (one-dimensional) payoff vector is theshortest path from v to u through that parent p ( Payoff ( θ , z , { w uv } ) = E p ∼ θ [ w vp + dist G ( p, u )] ), where θ isthe probability of choosing a vertex among v ’s neighbors as parent.Managing to perfectly minimize the one-dimensional payoff vector at each iteration results in a shortestpath tree and therefore the best possible objective value. However, if the local choices deviate from theiroptimal values, then we can create cycles which result in infinite objective value.For example, consider the clique on V = { , , } , where we want a shortest path tree to the root node u = 1 . When the weights are w = 0 . , w = 1 . , w = 0 . , then it would be best for node three to firsttake edge (2 , . If we swap the role of nodes two and three, w = 1 . , w = 0 . , w , then it would bebest for node two to first take edge (2 , . When our subroutines for nodes two and three make simultaneousdecisions without actually seeing the input, they could easily both choose edge (2 , , yielding an invalidshortest path tree and making it impossible to get from either node to the root. This global issue can’t beexpressed as local utilities, so the algorithm is not robust in the sense that is needed to apply our framework. Appendix C: Proofs and Remarks of Section 5
C.1. Proof of Theorem 3
In this section, we complete the proof of Theorem 3, which is restated below for convenience.
Theorem 3.
A closed convex set S is O (cid:16) D ( p ) / D ( ˆ p ) / (log d ) / T − / (cid:17) -bandit-approachable in thebandit Blackwell sequential game ( X , Y , p , ˆ p ) if and only if S is response-satisfiable in the Blackwell game ( X , Y , p ) . In particular, when S is response satisfiable, the online algorithm AlgBB (Algorithm 3) achievesthis approachability bound in polynomial time, given access to a separation oracle for S . Our framework is amenable to the parameter space depending on the iteration i . iazadeh et al.: Online Learning via Offline Greedy Proof.
The only if direction is proved in the sketch. To prove the if direction and second part ofTheorem 3, we propose an algorithm AlgBB that is parameterized by an exploration probability q ∈ (0 , (Algorithm 3). We later choose q to be D ( p ) − / D ( ˆ p ) / (log d ) / T − / to balance terms in our regret upper-bound. In each round t , this algorithm outputs a move x t ∈ Y as well as whether to explore π t ∈ { Yes , No } ,and then receives an unbiased estimate of the resulting payoff ˆ p ( x t , y t ) based on both players’ actions if itpicks to explore. It also maintains a (full-information) Blackwell algorithm AlgB. In each round t = 1 , , ..., T ,our algorithm follows the last suggested action by AlgB to generate a move x t . Note that this move willbe exactly the same as the previous round, if the algorithm chose to not explore in the previous round.Our algorithm then decides to either explore with probability q or not explore with probability − q . If itexplores, then it receives an unbiased estimator, ˆ p of the current vector payoff p ( x t , y t ) , and passes a scaledversion ˆ p /q on to AlgB. If it does not explore, then it rewinds the state of AlgB to the beginning of thecurrent round. Our goal here is to show that under algorithm AlgBB, d ∞ (cid:0) T P Tt =1 p ( x t , y t ) , S (cid:1) plus exploringpenalty term E (cid:2) T D ( p ) · ( ) (cid:3) is O (cid:16) D ( p ) / D ( ˆ p ) / (log d ) / T − / (cid:17) .We start with bounding the first term d ∞ (cid:0) T P Tt =1 p ( x t , y t ) , S (cid:1) as a function of ˆΠ T , which is the time-averaged rescaled estimated payoffs from the rounds that we explore in , , . . . , T : ˆΠ T , T T X t =1 q ˆ p ( x t , y t ) [ explore in round t ] . Specifically, we have d ∞ T T X t =1 p ( x t , y t ) , S ! = d ∞ T T X t =1 E (cid:20) q ˆ p ( x t , y t ) [ explore in round t ] (cid:21) , S ! ≤ E h d ∞ (cid:16) ˆΠ T , S (cid:17)i , where the equality follows because E h q ˆ p ( x t , y t ) [ explore in round t ] i = qq E [ ˆ p ( x t , y t )] = p ( x t , y t ) as ˆ p is an unbiased estimator for p , and the inequality is obtained by applying Jensen’s inequality to theconvex ℓ ∞ distance function. We next show that if we explore with probability q, E h d ∞ (cid:16) ˆΠ T , S (cid:17)i ≤ O (cid:0) D ( ˆ p ) log( d ) / ( qT ) − / (cid:1) . Then, observe that the exploring penalty term E (cid:2) T D ( p ) · ( ) (cid:3) equals D ( p ) q . Our choice of exploring probability q = D ( p ) − / D ( ˆ p ) / log( d ) / T − / makes the two terms equalto O ( D ( p ) / D ( ˆ p ) / (log d ) / T − / ) , and gives us the desired bound.To see why E h d ∞ (cid:16) ˆΠ T , S (cid:17)i ≤ O (cid:0) D ( ˆ p ) log( d ) / ( qT ) − / (cid:1) when we explore with probability q , let M be arandom variable equal to the number of rounds we explore and ( τ , τ , · · · , τ M ) be the rounds that we explore.Note that M ∼ Binomial ( T, q ) . By applying the law of total expectation, we have: E h d ∞ (cid:16) ˆΠ T , S (cid:17)i = T X m =0 E h d ∞ (cid:16) ˆΠ T , S (cid:17) (cid:12)(cid:12)(cid:12) M = m i Pr [ M = m ] . We provide an upper bound on each term in the above summation separately. First, we handle the M = m = 0 case by noting ˆΠ T = , hence the distance from S is bounded by D ( ˆ p ) in this case. Moreover, thisevent occurs with probability (1 − q ) T . See that (1 − q ) T = (1 − q ) q − · q · T ≤ (1 /e ) qT ≤ O (cid:0) ( qT ) − / (cid:1) , iazadeh et al.: Online Learning via Offline Greedy E h d ∞ (cid:16) ˆΠ T , S (cid:17) (cid:12)(cid:12)(cid:12) M = 0 i Pr [ M = 0] ≤ O (cid:16) D ( ˆ p ) log( d ) / ( qT ) − / (cid:17) .Now fix some M = m = 0 . Assuming that S is a cone (we can always lift the convex set S to a cone in onedimension higher as shown in Abernethy et al. (2011)), our full-information Blackwell algorithm AlgB, whoreceives “fake payoffs” { ˆ p ( x τ i , y τ i ) } mi =1 with a diameter of q D ( ˆ p ) , guarantees that: E h d ∞ (cid:16) ˆΠ T , S (cid:17) (cid:12)(cid:12)(cid:12) M = m i = E (cid:20) MT · d ∞ (cid:18) TM ˆΠ T , S (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) M = m (cid:21) = mT E " d ∞ M M X i =1 q ˆ p ( x τ i , y τ i ) , S ! (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M = m ≤ mT · O (cid:18) q D ( ˆ p ) log( d ) / m − / (cid:19) = O (cid:18) qT D ( ˆ p ) log( d ) / m / (cid:19) , (8)where the expectation is taken w.r.t. the randomness in ˆ p and the inequality holds because S is response-satisfiable in the Blackwell game ( X , Y , p ) and ˆ p is an unbiased estimator of p . To be more clear why theabove inequality holds, note that set S is response-satisfiable in the Blackwell game ( X , Y , p ) , and is notnecessarily response-satisfiable if we replace p with ˆ p . However, by (i) following exactly the same steps asin proof of Theorem 1 (Section A.2 in the appendix) to reduce Blackwell approachability to online linearoptimization for rounds { τ i } Mi =1 , (ii) plugging ˆ p as the vector payoff of each round and using l i = − ˆ p ( x τ i , y τ i ) as the loss function in the online linear optimization, and then (iii) using the fact that ˆ p is an unbiasedestimator for p and S is response-satisfiable w.r.t. payoffs p , we can obtain exactly the same approachabilitybound in expectation as if S was response-satisfiable w.r.t. payoffs ˆ p . To see this, consider the chain ofinequalities (7) in the proof of Theorem 1 in Section A.2, tailored to our problem, and take an expectationw.r.t. the randomness in ˆ p . We have: E " d ∞ M M X i =1 ˆ p ( x τ i , y τ i ) , S ! (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) { τ i } Mi =1 ≤ E " max w ∈K * M M X i =1 ˆ p ( x τ i , y τ i ) , w + (cid:12)(cid:12)(cid:12)(cid:12) { τ i } Mi =1 = E " M max w ∈K − M X i =1 h l i , w i ! (cid:12)(cid:12)(cid:12)(cid:12) { τ i } Mi =1 (2) ≤ M M X i =1 h− p ( x τ i , y τ i ) , w i i − E " min w ∈K M X i =1 h l i , w i (cid:12)(cid:12)(cid:12)(cid:12) { τ i } Mi =1 (3) = 1 M E " M X i =1 h l i , w i i − min w ∈K M X i =1 h l i , w i (cid:12)(cid:12)(cid:12)(cid:12) { τ i } Mi =1 . This time, Inequality (2) holds as before, because S is response-satisfiabe w.r.t. payoffs p (and hence half-space satisfiable when using w t as the normal of the half-space), but Equality (3) holds because: −h p ( x τ i , y τ i ) , w i i = − E (cid:2) h ˆ p ( x τ i , y τ i ) , w i i|{ τ i } Mi =1 (cid:3) = − E (cid:2) h l i , w i i|{ τ i } Mi =1 (cid:3) Note that expectation is conditioned on { τ i } Mi =1 , but only we use a universal upper-bound on the last term(regret of online linear optimization) that is a function of M , so we can change the conditioning on only M . iazadeh et al.: Online Learning via Offline Greedy q = D ( p ) − / D ( ˆ p ) / log( d ) / T − / , and then use Jensen’s inequalityapplied to the (concave) square-root function. E (cid:20) d ∞ (cid:16) ˆΠ T , S (cid:17) + 1 T D ( p ) · ( ) (cid:21) = E m ∼ Binomial ( T,q ) h E h d ∞ (cid:16) ˆΠ T , S (cid:17) (cid:12)(cid:12)(cid:12) M = m ii + O ( D ( p ) q ) ≤ E m ∼ Binomial ( T,q ) (cid:20) O (cid:18) qT D ( ˆ p ) log( d ) / m / (cid:19)(cid:21) + O ( D ( p ) q ) ≤ O (cid:18) qT D ( ˆ p ) log( d ) / ( T q ) / (cid:19) + O ( D ( p ) q )= O (cid:0) D ( ˆ p ) log( d ) / ( qT ) − / (cid:1) + O ( D ( p ) q )= O (cid:16) D ( p ) / D ( ˆ p ) / (log d ) / T − / (cid:17) . The last inequality is the desired result. (cid:4)
C.2. Proof of Theorem 4
Proof.
The function ˆ p is an unbiased estimator for p (due to the bandit Blackwell reduciblity),so ( X , Y , p , ˆ p ) is a valid instance of the bandit Blackwell sequential game. Moreover, our target set S is the d payoff -dimensional positive orthant. Therefore, there exists a polynomial-time separation ora-cle for set S . Set S is also response-satisfiable (due to bandit Blackwell reduciblity). Thus, there existsa polynomial time online algorithm AlgBB that guarantees the bandit approachability upper-bound O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoff )) / T / (cid:17) , established in Theorem 3, for each of the bandit Blackwellinstances corresponding to the N different subproblems.Consider a subproblem i ∈ [ N ] . Note that AlgBB ( i ) is not invoked in all rounds [ T ] , but rather a subset T i ⊆ [ T ] depending on when its fellow Blackwell bandit algorithms, i.e., AlgBB ( i ′ ) , i ′ ∈ [ i − , decide toexplore. Note that T i is a random set, and only depends on realizations of binary signals { π ( i ′ ) t } i ′ ∈ [ i − ,t ∈ [ T ] .Fix a particular realization of set T i . By using the upper-bound in Theorem 3 for each of the terms in the LHSof the bound (i.e., the distance of the average payoff vector from set S and expected number of explorations)separately, we have d ∞ |T i | X t ∈T i p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) , S ! ≤ O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoff )) / |T i | − / (cid:17) . Moreover, let M i be the number of rounds that AlgBB ( i ) explores out of |T i | rounds it is invoked. Then, byour choice of q = D ( p ) − / D ( ˆ p ) / (log d ) / T − / , we have E (cid:20) |T i | M i (cid:12)(cid:12)(cid:12)(cid:12) T i (cid:21) = D ( p ) − / D ( ˆ p ) / (log( d payoff )) / | T | − / ≤ D ( p ) − / D ( ˆ p ) / (log( d payoff )) / |T i | − / , As the set S is the positive orthant, we have: ∀ j ∈ [ n ] : "X t ∈T i p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) j ≥ − O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoff )) / |T i | / (cid:17) . iazadeh et al.: Online Learning via Offline Greedy
Payoff (cid:16) θ ( i ) t , z ( i − t , f t (cid:17) = p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) . Thus, ∀ j ∈ [ n ] : "X t ∈T i Payoff ( θ ( i ) t , z ( i − t , f t ) j ≥ − O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoff )) / |T i | / (cid:17) . (9)Let T − be the set of rounds where no AlgBB ( i ) explored and T + be the set of rounds where some AlgBB ( i ) explored. Note also that T − ⊆ T i , simply because if no algorithm explores, AlgBB ( i ) will be invoked. Then,for any j ∈ [ n ] , we have E " X t ∈T − Payoff ( θ ( i ) t , z ( i − t , f t ) (cid:12)(cid:12)(cid:12)(cid:12) T i j = E "X t ∈T i Payoff ( θ ( i ) t , z ( i − t , f t ) (cid:12)(cid:12)(cid:12)(cid:12) T i j − E X t ∈T i \T − Payoff ( θ ( i ) t , z ( i − t , f t ) (cid:12)(cid:12)(cid:12)(cid:12) T i j ≥ − O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoff )) / |T i | / (cid:17) − D ( p ) E [ M i |T i ] , where the expectation is with respect to z ( i − t , t ∈ T i . Here, the inequality follows from Equation (9) andthe fact that |T i \ T − | ≤ M i and for any i ∈ [ N ] and t ∈ [ T ] , Payoff ( θ ( i ) t , z ( i − t , f t ) ≤ D ( p ) . By consideringthe fact that E [ M i |T i ] = D ( p ) − / D ( ˆ p ) / (log d ) / T − / |T i | ≤ D ( p ) − / D ( ˆ p ) / (log( d payoff )) / |T i | / from our choice of the probability of exploring, q , in Theorem 3, and |T i | ≤ T , we have: ∀ j ∈ [ n ] : E " X t ∈T − Payoff ( θ ( i ) t , z ( i − t , f t ) j ≥ − O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoff )) / |T i | / (cid:17) . (10)Because the offline algorithm Offline-IG ( C , F , D , Θ) (Algorithm 1) is an extended ( γ, δ ) -robust approxi-mation, by focusing on rounds in T − and applying Inequality (10), together with linearity of expectation,we have: E " X t ∈T − f t ( z t ) ≥ γ · E " X t ∈T − f t ( z ∗ ) − O (cid:16) D ( p ) / D ( ˆ p ) / N δ (log( d payoff )) / T / (cid:17) , where z ∗ = arg max z ∈C P Tt =1 f t ( z ) is the optimal in-hindsight solution.Finally, note that Bandit-IG ( C , F , D , Θ , AlgBB ) does not explore too often in total among its subprob-lems. More precisely, E (cid:2) |T + | (cid:3) = E " N X i =1 M i ≤ N X i =1 O (cid:16) (log( d payoff )) / E h T / i i(cid:17) ≤ O (cid:16) N (log( d payoff )) / T / (cid:17) . Noting the fact that the functions f t have output value at most , for the remaining rounds T + we have: E " X t ∈T + f t ( z t ) ≥ γ · E " X t ∈T + f t ( z ∗ ) − O (cid:16) N (log( d payoff )) / T / (cid:17) . (11)Combining the two types of bounds in rounds T − and T + yields the desired claim. (cid:4) iazadeh et al.: Online Learning via Offline Greedy C.3. Bandit Blackwell Regret Lowerbound
In this section, we show that in a Bandit Blackwell sequential game, ( X , Y , p , ˆ p ) , the distance from the time-averaged payoff to the target set S plus the time-averaged exploring penalty of any prediction strategy mustbe at least Ω (cid:0) DT − / (cid:1) , where D = min { D ( p ) , D ( p ) } . Put differently, we show that the performance boundproved in Theorem 3 is unimprovable with respect to T (the number of rounds), i.e., no other strategies canhave a better performance for all problems. Theorem 8.
In a bandit Blackwell sequential game, ( X , Y , p , ˆ p ) , there exists an adversary’s strategy suchthat for every player 1’s strategy, the resulting sequence of actions satisfy: d ∞ T T X t =1 p ( x t , y t ) , S ! + E (cid:20) T D ( p ) · ( ) (cid:21) ≥ Ω (cid:0) DT − / (cid:1) . where ( D = min { D ( p ) , D ( p ) } . Proof of Theorem 8.
Let M be a random variable equal to the number of rounds the player explores.We first show in that if the number of rounds that the player explores at is at most M, then there exists aBandit Blackwell instance: an adversary’s action ( y , . . . , y T ) , a convex closed set S , and an affine payoff p together with an unbiased estimator function ˆ p such that: E " d ∞ T T X t =1 p ( x t , y t ) , S ! (cid:12)(cid:12)(cid:12)(cid:12) M ≥ Ω (cid:18) D √ M (cid:19) for any player’s strategies ( x , . . . , x T ) , where the expectation is taken over the randomness in the adversary’sstrategy. We later show this statement in Lemma 4. For now, we assume that the statement is true. Sincethe Bandit Blackwell total regret defined in Definition 11 includes another term for the cost of exploring,the total regret conditioned on M is E " d ∞ T T X t =1 p ( x t , y t ) , S ! (cid:12)(cid:12)(cid:12)(cid:12) M + E (cid:20) T D ( p ) · ( explore ) (cid:12)(cid:12)(cid:12)(cid:12) M (cid:21) ≥ Ω (cid:18) D √ M (cid:19) + Ω (cid:18) DMT (cid:19) (1) ≥ Ω (cid:0) DT − / (cid:1) , where Inequality (2) follows from setting M = T / ; notice that at M = T / , D √ M = DMT and Ω (cid:16) D √ M (cid:17) +Ω (cid:0) DMT (cid:1) is minimized. Again, the expectation here is with respect to the adversary’s strategy. Now, takinganother expectation with respect to M , we have E " d ∞ T T X t =1 p ( x t , y t ) , S ! + E (cid:20) T D ( p ) · ( explore ) (cid:21) = E " E " d ∞ T T X t =1 p ( x t , y t ) , S ! (cid:12)(cid:12)(cid:12)(cid:12) M + E (cid:20) E (cid:20) T D ( p ) · ( explore ) (cid:12)(cid:12)(cid:12)(cid:12) M (cid:21)(cid:21) ≥ Ω (cid:0) DT − / (cid:1) . This completes the proof. (cid:4)
We now prove the lower bound on the distance from the average payoff to S when the number of explorationrounds is M . As is common in proofs of lower bounds, we construct a random sequence of similar adversariesand show that with M rounds of explorations, it is impossible to distinguish the different types of adversarieswithout suffering a regret less than D √ M . iazadeh et al.: Online Learning via Offline Greedy Lemma 4.
In a Bandit Blackwell problem, if the number of exploration rounds is at most M, there existsan adversary’s strategy ( y , . . . , y T ) , a convex closed set S , and an affine payoff p together with an unbiasedestimator function ˆ p such that for any strategies ( x , . . . , x T ) , E " d ∞ T T X t =1 p ( x t , y t ) , S ! (cid:12)(cid:12)(cid:12)(cid:12) M ≥ Ω (cid:18) D √ M (cid:19) , where the expectation is taken with respect to the adversary’s strategies.Proof of Lemma 4 We only prove our lower bound for a deterministic player. Note that any randomizedstrategy can be expressed as a randomization of deterministic strategies, and based on Yao’s minimaxprinciple (Yao (1977)), our lower bound still holds when we average them over several deterministic strategiesaccording to some randomization. We refer to Cesa-Bianchi and Lugosi (2006) for details on deriving anidentical lower bound for a randomized adversary from a deterministic adversary.Recall that M is the random variable of the number of rounds that the player explores. Consider a fixed M . From this point onward (until the end of the proof), every probability and expectation are conditionedon M , we remove the dependence on M on the notations for simplicity. Let X = ∆([ n ]) , Y = { , } n , thepayoff function p ( x , y ) = x T y n − y , and S be the non-positive orthant, i.e. S = { s ∈ R n | [ s ] j ≤ ∀ j ∈ [ n ] } . For deterministic strategies, x must be equal to e z for some coordinate z , which happens when player 1 playsaction z ∈ [ n ] . In that case, [ p ( x , y )] j = [ y ] z − [ y ] j for all j ∈ [ n ] . We now define the adversary’s strategy. For each round t ∈ [ T ] and coordinate j ∈ [ n ] , let [ y t ] j beBernoulli random variables whose joint distribution are defined as follows. We first pick a random variable ζ ∼ Uniform { , , . . . , n } . Then, given that ζ = i, [ y ] j , [ y ] j , . . . , [ y T ] j are conditionally independent Bernoullirandom variables with parameter (1 − µ ) / if j = i, and (1 + µ ) / if j = i, where µ < / (will be specifiedlater). For analysis purposes, we define another move for the adversary, which we call the base move: all [ y ] j , [ y ] j , . . . , [ y T ] j are conditionally independent Bernoulli variables with parameter (1 − µ ) / . Supposethat this happens when ζ = 0 (just for ease of notations).Let I t be the player’s action (in { , , . . . , n } ) on round t , and π t be the exploring indicator in round t : π t = 1 if we explore in round t, and otherwise. Let η t = ( π , . . . , π t ) be the history of exploration decisions up toround t. Since the player is deterministic, I t is determined by ( p ( x , y ) , p ( x , y ) , . . . , p ( x t − , y t − ) , η t − ) . Also, let T j = P Tt =1 [ I t = j ] be the number of times action j is played in the first T rounds. We furtherdefine P j and E j as P ( ·| ζ = j ) E ( ·| ζ = j ) , respectively. More rigorously, if A is a σ -algebra generated by allpossible outcomes of the game, P j is a measure on the σ − algebra A and E j is an expectation taken withrespect to the conditional probability P j , which solely depends on the adversary’s move since we assume thatplayer 1’s strategy is deterministic. iazadeh et al.: Online Learning via Offline Greedy j ∈ [ n ] , when ζ = j , playing action j has the highest average reward than any otheractions. Then, we have E j " d ∞ T T X t =1 p ( x t , y t ) , S ! (1) ≥ max z ∈ [ n ] E j "" T T X t =1 p ( x t , y t ) z = 1 T max z ∈ [ n ] E j " T X t =1 ([ y t ] z − [ y t ] I t ) ≥ T E j " T X t =1 ([ y t ] j − [ y t ] I t ) = 1 T T X t =1 E j [[ y t ] j − [ y t ] I t ] (2) = 1 T T X t =1 µ E j [ ( I t = j )]= 1 T µ X j ′ = j E j [ T j ′ ] = 1 T µ ( T − E j [ T j ])= (cid:16) µ − µT E j [ T j ] (cid:17) . Inequality (1) follows because S is the non-positive orthant. Equality (2) follows because E j [[ y t ] j ′ ] is (1 + µ ) / when j ′ = j and (1 − µ ) / otherwise, so the difference between E j [[ y t ] j ] and E j [[ y t ] I t ] is µ when I t = j and otherwise.As for each j ∈ [ n ] , since the event { ζ = j } happens with probability n , we have sup E " d ∞ T T X t =1 p ( x t , y t ) , S ! ≥ µ − nT X j E j [ T j ] ! , (12)where the expectation is taken with respect to the adversary’s move and the supremum is taken over ζ ∈{ , , . . ., n } since the ζ picked by the adversary in the beginning of the game determines his whole strategy.The proof now reduces to bounding E j [ T j ] from above. We do this by comparing E j [ T j ] with E [ T j ] . If player1 chooses action i at round t and decides to explore, i.e., I t = i and π t = 1 , he then observes the payoff [ y t ] i . Recall that y t is the random variable that represents adversary’s move at round t, where y t ∈ Y = { , } n . Forany sequence of history ( H t , η t ) where H t = ([ y ] I , . . . , [ y t ] I t ) = ( h , . . . , h t ) ∈ { , } t and η t = ( π , . . . , π t ) ∈{ , } t , let χ t,j ( H t , η t ) = P j ([ y ] I = h , . . . , [ y t ] I t = h t , η t ) . Note that P j is a measure on the σ − algebra A as mentioned above, and the randomness comes from theadversary’s moves (the adversary plays a randomized y t at time t, where the j th coordinate of y t is a Bernoullivariable with mean either (1 + µ ) / or (1 − µ ) / depending on his choice of ζ at the beginning of the game).From our assumption that the player is deterministic, for any H T ∈ { , } T and history of exploration η T .Then, E i h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i = E h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i , ∀ ≤ i ≤ n, (13) iazadeh et al.: Online Learning via Offline Greedy ζ decided by the adversary is, the player has the same sequenceof moves given the same history. Therefore, E j [ T j ] − E [ T j ] = X H T ∈{ , } T , η T ∈{ , } T χ T,j ( H T , η T ) E j h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i − X H T ∈{ , } T , η T ∈{ , } T χ T, ( H T , η T ) E h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i (1) = X H T ∈{ , } T , η T ∈{ , } T ( χ T,j ( H T , η T ) − χ T, ( H T , η T )) E j h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i ≤ X ( H T , η T ): χ T,j ( H T , η T ) >χ T, ( H T , η T ) ( χ T,j ( H T , η T ) − χ T, ( H T , η T )) E j h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i (2) ≤ T X ( H T , η T ): χ T,j ( H T , η T ) >χ T, ( H T , η T ) ( χ T,j ( H T , η T ) − χ T, ( H T , η T )) , (14)where Equation (1) follows from Equation (13) and Inequality (2) follows from E j [ T j | [ y ] I = h , . . . , [ y T ] I T = h T , η T ] ≤ T. See that P nj =1 E [ T j ] = T since on each round, player 1’s action is in { , , . . ., n } . We can boundthe total variation using Pinsker’s inequality: X ( H T , η T ): χ T,j ( H T , η T ) >χ T, ( H T , η T ) ( χ T,j ( H T , η T ) − χ T, ( H T , η T )) ≤ r KL ( χ T, ( H T , η T ) || χ T,j ( H T , η T )) . (15)Putting Equations (14) and (15) together and applying Jensen’s inequality to the concave the square rootfunction, we get n n X j =1 E j [ T j ] ≥ n n X j =1 E [ T j ] + T X H T : χ T,j ( H T , η T ) >χ T, ( H T , η T ) ( χ T,j ( H T , η T ) − χ T, ( H T , η T )) ≤ T n + 1 n n X j =1 r KL ( χ T, ( H T , η T ) || χ T,j ( H T , η T )) ! ≤ T n + vuut n n X j =1 KL ( χ T, ( H T , η T ) || χ T,j ( H T , η T )) . (16)Recall that from the definition of χ t,j , we have the following conditional distribution: χ t,j ( h t | H t − , η t − ) = P j (cid:0) [ y t ] I t = h t | [ y ] I = h , . . . , [ y t − ] I t − = h t − , η t − (cid:1) . Applying the chain rule, we haveKL ( χ T, || χ T,j ) (1) = T X t =1 X H t − , η t − χ t − , ( H t − , η t − ) KL ( χ t, ( ·| H t − , η t − ) || χ t,j ( ·| H t − , η t − )) (2) = T X t =1 X H t − , η t − χ t − , ( H t − , η t − ) ( { I t = j and π t = 1 | H t − , η t − } ) KL (cid:18) − µ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) µ (cid:19) (3) = KL (cid:18) − µ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) µ (cid:19) E " T X t =1 ( { I t = j and π t = 1 } ) . Pinsker’s inequality bounds the total variation distance in terms of KL divergence. For two probability distributions P and Q , Pinsker’s inequality states that || P − Q || TV ≤ q KL ( P || Q ) where || P − Q || TV is the total variation distance sup A {| P ( A ) − Q ( A ) |} over measurable events A . Taking A = { x : P ( x ) > Q ( x ) } , we get P x : P ( x ) >Q ( x ) | P ( x ) − Q ( x ) | ≤ q KL ( P || Q ) . See Section A.2 Cesa-Bianchi and Lugosi (2006) for details. iazadeh et al.:
Online Learning via Offline Greedy (1) follows from applying the chain rule to χ T, and χ T,j . Equation (2) holdsbecause χ t, ( ·| H t − , η t − ) = Ber (cid:0) − µ (cid:1) and χ t,j ( ·| H t − , η t − ) = Ber (cid:0) µ (cid:1) when we play the arm j on round t, I t = j, and observe the payoff, π t = 1 . Otherwise, they are identical and wehave KL ( χ t, ( ·| H t − , η t − ) || χ t,j ( ·| H t − , η t − )) = 0 . Lastly, we get Equation (3) by factoring outKL (cid:0) − µ || µ (cid:1) , and collecting all the probability terms ( χ t, ( H t , µ t ) for all t ) to form the expectation of ( { I t = j and π t = 1 } ) with respect to P . Summing over j and applying KL ( p || q ) ≤ ( p − q ) q (1 − q ) : n X j =1 KL ( χ T, || χ T,j ) = KL (cid:18) − µ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) µ (cid:19) n X j =1 E " T X t =1 ( { I t = j and π t = 1 } ) = KL (cid:18) − µ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) µ (cid:19) E " T X t =1 n X j =1 ( { I t = j and π t = 1 } ) = KL (cid:18) − µ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) µ (cid:19) E " T X t =1 ( { π t = 1 } ) ≤ µ − µ M, (17)where the last line follows from the assumption that the number of rounds the player explores is M .Putting Equation (12), (16) and (17) altogether we get: E j " d ∞ T T X t =1 p ( x t , y t ) , S ! ≥ µ − n − s n µ − µ M ≥ µ − n − µ r M n ! , (18)where the last inequality follows from µ ≤ / . Taking µ = λ p nM , we have: E j " d ∞ T T X t =1 p ( x t , y t ) , S ! ≥ λ r nM (cid:18) − λ √ (cid:19) ≥ Ω (cid:18) √ M (cid:19) = Ω (cid:18) D √ M (cid:19) , (19)where the last equality follows from D ≤ D ( p ) = 1 in this case since the adversary’s move is in { , } n . Wefinish the proof by choosing the constant λ to be small enough to ensure that (cid:16) − λ √ (cid:17) is positive. (cid:4) Appendix D: Proofs and Remarks of Section 6.1 – Product Ranking and SequentialSubmodular Maximization
In this appendix, we give the missing proofs of the results from Section 6.1.
D.1. Proof of Theorem 5
Proof . We show that our meta Algorithm 4 works by verifying the following conditions. (i) Algorithm 5 is an extended ( , ) -robust approximation algorithm. We need to show that if thefollowing equation holds for some function h : ∀ j ∈ [ n ] , " T X t =1 Payoff ( ˜ θ ( i ) t , π ( i − t , f t ) j ≥ − h ( T ) , then we must have ∀ π ∗ ∈ Π , T X t =1 E [ f t ( π t )] ≥ T X t =1 f t ( π ∗ ) − nh ( T ) . (20) iazadeh et al.: Online Learning via Offline Greedy j ∈ [ n ] , we have (cid:2) Payoff ( θ ( i ) , π ( i − , f ) (cid:3) j = θ T y ( i ) − [ y ( i ) ] j , where y ( i ) , (cid:2) f ( π ( i − + j e i ) − f ( π ( i − ) (cid:3) j ∈ [ n ] . First, we prove several inequalities that will later be used to prove inequality (20). Sinceeach function f t,i is monotone submodular, we have: λ i i X j =1 (cid:18) f t,i ( { [ π t ] , . . . , [ π t ] j } ) − f t,i ( { [ π t ] , . . . , [ π t ] j − } ) (cid:19) + λ i i X j =1 (cid:18) f t,i ( { [ π t ] , . . . , [ π t ] j − , [ π ∗ ] j } ) − f t,i ( { [ π t ] , . . . , [ π t ] j − } ) (cid:19) (1) ≥ λ i i X j =1 (cid:18) f t,i ( { [ π t ] , . . . , [ π t ] j , [ π ∗ ] , . . . , [ π ∗ ] j } ) − f t,i ( { [ π t ] , . . . , [ π t ] j − , [ π ∗ ] , . . . , [ π ∗ ] j − } ) (cid:19) = λ i f t,i ( { [ π t ] , . . . , [ π t ] i , [ π ∗ ] , . . . , [ π ∗ ] i } ) − λ i f t,i ( ∅ ) (2) ≥ λ i f t,i ( { [ π ∗ ] , . . . , [ π ∗ ] i } ) , where inequality (1) follows from submodularity and inequality (2) follows from monotonicity and non-negativity of f t,i . To be more clear, inequality (1) holds because for each j = 1 , , . . . , i, the sum of the marginalvalues of adding [ π t ] j to π ( j − and adding [ π ∗ ] j to π ( j − is greater than equal to the marginal value ofadding { [ π t ] j , [ π ∗ ] j } to { [ π t ] , . . . , [ π t ] i , [ π ∗ ] , . . . , [ π ∗ ] i } as f t,i is submodular. Recall that f t ( π ) , λ f t, ( { [ π ] } ) + λ f t, ( { [ π ] , [ π ] } ) + . . . + λ n f t,n ( { [ π ] , . . . , [ π ] n } ) ∀ t ∈ [ T ] , so summing the inequalities above for i = 1 , , . . . , n, we get: n X i =1 λ i i X j =1 (cid:18) f t,i ( { [ π t ] , . . . , [ π t ] j } ) − f t,i ( { [ π t ] , . . . , [ π t ] j − } ) (cid:19) + n X i =1 λ i i X j =1 (cid:18) f t,i ( { [ π t ] , . . . , [ π t ] j − , [ π ∗ ] j } ) − f t,i ( { [ π t ] , . . . , [ π t ] j − } ) (cid:19) ≥ n X i =1 λ i f t,i ( { [ π ∗ ] , . . . , [ π ∗ ] i } ) ⇔ n X j =1 n X i = j ( λ i f t,i ( { [ π t ] , . . . , [ π t ] j } ) − λ i f t,i ( { [ π t ] , . . . , [ π t ] j − } ))+ n X j =1 n X i = j ( λ i f t,i ( { [ π t ] , . . . , [ π t ] j − , [ π ∗ ] j } ) − λ i f t,i ( { [ π t ] , . . . , [ π t ] j − } )) ≥ f t ( π ∗ ) ⇔ n X j =1 (cid:16) f t ( π ( j ) t ) − f t ( π ( j − t ) (cid:17) + n X j =1 (cid:16) f t ( π ( j − t + [ π ∗ ] j e j ) − f t ( π ( j − t ) (cid:17) ≥ f t ( π ∗ ) . (21)We get the first equivalence by switching the summations. We now use the inequality (21) to prove the finalclaim, i.e., the desired ineqaulity (20). We have: T X t =1 E [ f t ( π t )] = 12 T X t =1 n X i =1 E h f t ( π ( i ) t ) − f t ( π ( i − t ) i = 12 T X t =1 n X i =1 E h f t ( π ( i ) t ) − f t ( π ( i − t ) i + 12 n X i =1 T X t =1 E h f t ( π ( i ) t ) − f t ( π ( i − t ) i − n X i =1 T X t =1 E h f t ( π ( i ) t + [ π ∗ ] i e i ) − f t ( π ( i − t ) i + 12 n X i =1 T X t =1 E h f t ( π ( i ) t + [ π ∗ ] i e i ) − f t ( π ( i − t ) i iazadeh et al.: Online Learning via Offline Greedy (1) = 12 T X t =1 n X i =1 E h f t ( π ( i ) t ) − f t ( π ( i − t ) i + 12 n X i =1 T X t =1 h Payoff ( ˜ θ ( i ) t , π ( i − t , f t ) i [ π ∗ ] i + 12 n X i =1 T X t =1 E h f t ( π ( i − t + [ π ∗ ] i e i ) − f t ( π ( i − t ) i (2) ≥ T X t =1 n X i =1 E h f t ( π ( i ) t ) − f t ( π ( i − t ) i − nh ( T ) + 12 T X t =1 n X i =1 E h f t ( π ( i ) t + [ π ∗ ] i e i ) − f t ( π ( i − t ) i (3) ≥ T X t =1 f t ( π ∗ ) − nh ( T ) . In the above chain of inequalities, equality (1) follows from the definition of
Payoff , inequality (2) followsfrom our assumption, and inequality (3) follows from inequality (21). Rearranging the terms will finish theproof. (ii) Algorithm 5 is bandit Blackwell reducible.
We verify the following conditions based on Definition 12to show bandit Blackwell reducibility:•
Algorithm 5 is Blackwell reducible.
For each subproblem, consider an instance ( X , Y , p ) of Blackwellwhere X = Θ = ∆([ n ]) and Y = [ − , n . Our Blackwell adversary function is the marginal increase in theobjective function of placing item on position i, AdvB ( π ( i − , f ) = h f ( π ( i − t + j e i ) − f ( π ( i − t ) i j ∈ [ n ] .The biaffine payoff is p ( θ , y ) = θ T y n − y , where n is an n -dimensional all ones vector. The targetset S is the non-negative orthant, and it is response-satisfiable since for every adversary’s action y ∈ Y , the strategy θ = e j ∗ where j ∗ = argmax j ∈ [ n ] [ y ] j results in p ( θ , y ) ≥ . • An unbiased estimator for the Blackwell payoff function p can be constructed. Specifically, we needto construct an exploration sampling device U that receives ( θ , π ( i − ) in subproblem i and returns ( w exp , π exp ) such that (i) for all f ∈ F , θ ∈ Θ , π ( i − ∈ D , i ∈ [ n ] : ˆ p (cid:0) θ , AdvB ( π ( i − , f ) (cid:1) = f ( π exp ) w exp ,where ( w exp , π exp ) ∼ ExpS ( θ , π ( i − ) , and (ii) ˆ p is an unbiased estimator for the actual payoff, i.e, ∀ θ ∈ Θ , y ∈ Y : E [ ˆ p ( θ , y )] = p ( θ , y ) . The explore sampling device U works as follows. Given a point π ( i − ∈ Π and a parameter θ ∈ Θ , it draws j ∼ Uniform { , , . . . , n } and returns ( w exp , π exp ) = (cid:0) n ( θ j n − e j ) , π ( i − + j e i (cid:1) . Now, ˆ p is an unbiased estimator of p because iazadeh et al.: Online Learning via Offline Greedy E [ ˆ p ( θ , y )] = E (cid:2) ˆ p ( θ , AdvB ( π ( i − , f )) (cid:3) = E [ f ( π exp ) w exp ]= E (cid:20) n θ j f ( π ( i − + j e i ) n − f ( π ( i − + j e i ) e j (cid:21) (1) = n n X j =1 θ j f ( π ( i − + j e i ) − (cid:20) f ( π ( i − + 1 e i ) , f ( π ( i − + 2 e i ) , . . . , f ( π ( i − + n e i ) (cid:21) T (2) = n n X j =1 θ j (cid:0) f ( π ( i − + j e i ) − f ( π ( i − ) (cid:1) − (cid:20) f ( π ( i − + 1 e i ) , f ( π ( i − + 2 e i ) , . . . , f ( π ( i − + n e i ) (cid:21) T + f ( π ( i − ) n = n n X j =1 θ j (cid:0) f ( π ( i − + j e i ) − f ( π ( i − ) (cid:1) − (cid:20) f ( π ( i − + 1 e i ) − f ( π ( i − ) , . . . , f ( π ( i − + n e i ) − f ( π ( i − ) (cid:21) T = θ T y n − y = p ( θ , AdvB ( π ( i − , f )) , where y , [ f ( π ( i − (1) , . . . , π ( i − ( i − , j ) − f ( π ( i − )] j ∈ [ n ] .Here, equation (1) holds because we take j ∼ Uniform { , , . . ., n } , and equation (2) holds because P nj =1 θ j =1 . Intuitively, at every round, U randomly picks one of the items j ∈ [ n ] , and evaluate the marginal benefitof putting element j on the i th position of π ( i − .Putting (i) and (ii) altogether, Algorithm 5 is a ( , ) − extended robust approximation algorithm with n subproblems. Its payoff diameter D ( p ) is O (1) and its payoff estimator diameter D ( ˆ p ) is O ( n ) . The dimensionof vector payoffs is also d payoff = n . It is also bandit Blackwell reducible, hence from Theorems 2 and 4: -regret(Algorithm 2 applied on Algorithm 5) ≤ O ( n p T log n ) . -regret(Algorithm 4 applied on Algorithm 5) ≤ O ( n / (log n ) / T / ) . This completes the proof. (cid:4)
D.2. Proof of Corollary 1
Proof.
The proof for the model from Asadpour et al. (2020) is a direct application of Theorem 5 by taking λ i , P u ∼G ( θ u = i ) , the probability that a consumer has patience level i, and f i ( S ) , E u ∼G [ κ u ( S ) | θ u = i ] , the expected probability that a consumer with patience level i clicks on any of the top i products in S , asmentioned in Section 6.1. Thus, the sequential submodular function of interest is the expected probabilitythat a consumer clicks on at least one product when offered an ordering π : f ( π ) = n X i =1 λ i f i ( { π (1) , . . . , π ( i ) } ) = n X i =1 P u ∼G ( θ u = i ) E e ∼G [ κ u ( π ) | θ u = i ] . By invoking Theorem 5, we get the desired O (cid:0) n √ T log n (cid:1) -regret in the full-information setting and O (cid:16) n / (log n ) / T / (cid:17) − regret in the bandit setting.For the special consumer choice model in Ferreira et al. (2019), a consumer is characterized by two param-eters: distribution of clicks for each item q u = ( q u, , . . . , q u,n ) and attention window size k u . A consumer u iazadeh et al.: Online Learning via Offline Greedy k u positions, and an examined item i is clicked with probability q u,i whileunexamined items are never clicked. The events of clicking on two different items i or j in the event windoware assumed to be independent. Notice that this is a special case of the choice model by Asadpour et al.(2020), where θ u = k u and κ u ( { π (1) , . . . , π ( θ u ) } = κ u ( { π (1) , . . . , π ( k u ) } = 1 − k u Y i =1 (1 − q u,i ) . The probability of click function κ u is monotone since when X ⊆ Y ⊆ [ n ] , we have Q i ∈ X (1 − q u,i ) ≥ Q i ∈ Y (1 − q u,i ) (as ≤ q u,i ≤ for all u and i ), which implies κ u ( X ) ≤ κ u ( Y ) . It is also submodular, as forall X ⊂ Y ⊆ [ n ] and any item j / ∈ Y, j ∈ [ n ] , we have − Y i ∈ Y \ X (1 − q u,i ) ≥ ⇔ (1 − (1 − q j )) Y i ∈ X (1 − q u,i ) ! − Y i ∈ Y \ X (1 − q u,i ) ≥ ⇔ Y i ∈ X (1 − q u,i ) − (1 − q u,j ) Y i ∈ X (1 − q u,i ) ≥ Y i ∈ Y (1 − q u,i ) − (1 − q u,j ) Y i ∈ Y (1 − q u,i ) ⇔ Y i ∈ X (1 − q u,i ) − Y i ∈ X ∪{ j } (1 − q u,i ) ≥ Y i ∈ Y (1 − q u,i ) − Y i ∈ Y ∪{ j } (1 − q u,i ) ⇔ κ u ( X ∪ { j } ) − κ u ( X ) ≥ κ u ( Y ∪ { j } ) − κ u ( Y ) . Since this choice model is a special case of the choice model in Corollary 1, we can invoke Corollary 1 to getthe desired O (cid:0) n √ T log n (cid:1) -regret in the full-information setting and O (cid:16) n / (log n ) / T / (cid:17) − regret inthe bandit setting. (cid:4) Appendix E: Proofs and Remarks of Section 6.2 – Maximizing Multiple Reserves
In this appendix, first provide a discussion on major difference between Algorithm 6 and the algorithm inRoughgarden and Wang (2019). We then give the missing proofs of the results from Section 6.2. These resultsare restated for convenience.
E.1. Discussion on Algorithm 6
The main difference between our algorithm and the algorithm in Roughgarden and Wang (2019) is the choiceof revenue-from-reserves function q . Their revenue-from-reserves function q is different (coordinate-wise less)than ours. As it becomes more clear later in the proof, the need to design a new revenue-from-reservesfunction stems from our requirement to construct an explore sampling device for the online bandit learningalgorithm. E.2. Proof of Theorem 6
Proof . We will show that our meta Algorithms 2 and 4 work by verifying the following conditions. iazadeh et al.:
Online Learning via Offline Greedy (i) Algorithm 6 is an extended ( , ) -robust approximation algorithm. By Definition 9, we need toshow that if each coordinate of our vector payoffs is bounded by some function h : ∀ j ∈ [ m ] , " T X t =1 Payoff ( i ) ( ˜ θ ( i ) t , r ( i − t , v t ) j ≥ − h ( T ) , then we must have that our overall solution’s error is bounded by: ∀ r ∗ ∈ C , T X t =1 E [ f ( r t , v t )] ≥ · T X t =1 f ( r ∗ , v t ) − nh ( T ) . Recall from Section 6.2, that we defined the j th coordinate of this vector payoff to be ( j ∈ [ m ] ), h Payoff ( i ) ( θ ( i ) , r ( i − , v ) i j , E z ′ ∼ θ ( i ) (cid:2) q ( i ) ( z ′ ) − q ( i ) ( ρ j ) (cid:3) , and that r ( i ) t is the reserve vector after subproblem i . Let’s now define S i to be the set of rounds where bidder i has the highest bid. We now carry out thestandard offline analysis, but summed over all rounds t ∈ [ T ] . T X t =1 E [ f ( r t , v t )] (1) = 12 T X t =1 [ v t ] ˆ j t + 12 E " n X i =1 X t ∈ S i [ r t ] i h [ r t ] i ∈ [[ v t ] ˆ j t , [ v t ] j ∗ t i (2) = 12 T X t =1 [ v t ] ˆ j t + 12 E " n X i =1 T X t =1 q ( i ) t ([ r t ] i ) = 12 T X t =1 [ v t ] ˆ j t + 12 n X i =1 X t ∈ S i q ( i ) t ([ r ∗ ] i ) + 12 E " n X i =1 T X t =1 q ( i ) t ([ r t ] i ) − n X i =1 X t ∈ S i q ( i ) t ([ r ∗ ] i ) (3) ≥ T X t =1 f ( r ∗ , v t ) + 12 n X i =1 T X t =1 E h q ( i ) t ([ r t ] i ) i − n X i =1 X t ∈ S i q ( i ) t ([ r ∗ ] i ) (4) = 12 T X t =1 f ( r ∗ , v t ) + 12 n X i =1 T X t =1 h Payoff ( i ) ( ˜ θ ( i ) t , z ( i − t , f t ) i b i (5) ≥ T X t =1 f t ( z ∗ ) − nh ( T ) , where j ∗ t and ˆ j t are respectively the highest and second highest bidders in the valuation profile v t . We alsodefined a function q ( i ) t for each round, which is the same as the original q ( i ) except with v replaced by v t .Note that Inequality (1) holds because in each round t , with probability / , the algorithm returns z t = n ,which implies f ( r t , v t ) = [ v t ] ˆ j t and with the same probability, it returns r t = r ( n ) t which implies f ( r t , v t ) is atleast equal to the reserve of the buyer with the highest bid when his reserve is less than his bid. Inequality (2) follows from the definition of q ( i ) t . Inequality (3) holds because under the optimal reserve price r ∗ , f ( r ∗ , v t ) is less than or equal to the second highest bid [ v t ] ˆ j t when the bidder with the highest bid does not win orthey win and their reserve is less than or equal to [ v t ] ˆ j t ; otherwise, f ( r ∗ , v t ) is equal to q ( i ) t (cid:16) [ r t ] j ∗ t (cid:17) ≥ [ v t ] ˆ j t .Equality (4) follows from the definition of Payoff ( i ) . In this inequality b i is the index of element [ r ∗ ] i ; thatis, [ r ∗ ] i = ρ b i . Recall that ˜ θ ( i ) t is the (approximately-locally-optimal) distribution from which we are drawing [ r t ] i . Finally, inequality (5) follows from the assumption. This inequality is the desired result. (ii) Algorithm 6 is bandit Blackwell reducible. Per Definition 12, to show this statement, we will verifythe following conditions: iazadeh et al.:
Online Learning via Offline Greedy
Algorithm 6 is Blackwell reducible.
For every subproblem i ∈ [ n ] , consider an instance (cid:0) X , Y , p ( i ) (cid:1) ofBlackwell where X = Θ = ∆( R ) and Y = [0 , d param , where d param = |R| = m . We can use the Blackwelladversary function (note that we identify adversary functions with valuation vector) AdvB ( i ) ( r , v ) = (cid:2) q ( i ) ( ρ j ) (cid:3) j =1 , ,...,m .The biaffine Blackwell payoff is p ( i ) ( θ , y ) = θ T y n − y where n is an n -dimensional all ones vector.Notice that the target set S, the non-negative orthant, is response-satisfiable because if player 1 plays θ = e j ∗ where j ∗ = arg max j ∈ [ m ] [ y ] j then for every adversary’s action y ∈ Y , p ( i ) ( θ , y ) ≥ .• An unbiased estimator for the Blackwell payoff function p ( i ) can be constructed. We will show that forevery subproblem i ∈ [ n ] , there exists an explore sampling device U ( i ) that returns ( w ( i ) exp , r ( i ) exp ) such that(i) for all f ∈ F , θ ∈ Θ , r ∈ D , i ∈ [ n ] : ˆ p ( i ) (cid:16) θ , AdvB ( i ) ( r , v ) (cid:17) = f ( r ( i ) exp , v ) w ( i ) exp , where (cid:0) w ( i ) exp , r ( i ) exp (cid:1) ∼ ExpS ( i ) ( θ , r ) , and (ii) ˆ p is an unbiased estimator for the actual payoff, i.e, ∀ θ ∈ Θ , y ∈ Y : E [ ˆ p ( i ) ( θ , y )] = p ( i ) ( θ , y ) . More specifically, we will construct a exploring distribution U ( i ) such that if y = AdvB ( r , f ) for some f ∈ F , r ∈ D , then E [ ˆ p ( θ , y )] = E [ f ( r exp ) w exp ] = p ( θ , y ) , where the expectation is taken withrespect to U ( i ) . Notice that in Definition 12, U is not indexed by subproblems, but since the AdvB forthis particular problem is subproblem specific, the distribution U should also depend on the subproblem.Because we would like to construct an unbiased estimator of the actual payoff p ( i ) , which is an affinefunction of y = q ( i ) , we focus on constructing an unbiased estimator for the function q ( i ) . To do so, wemake use of the following representation of q ( i ) : q ( i ) ( r ) = f ( r m , v ) − f ( r ( m − e i ) , v ) . To see what the above equation holds note that when bidder i does not have the highest bid in anauction, both q ( i ) ( r ) and f ( r m , v ) − f ( r ( m − e i ) , v ) , which is the revenue gain of increasing bidder i ’s reserve price from zero to r , are both zero. When bidder i has the highest bid in the auction, q ( i ) ( r ) = r if r ∈ [[ v ] ˆ j , [ v ] j ∗ ] and zero otherwise. Furthermore, the revenue from the reserve price r m ,i.e., f ( r m , v ) , is [ v ] ˆ j if r < [ v ] ˆ j ; r if r ∈ [[ v ] ˆ j , [ v ] j ∗ ] and zero otherwise (the case r > [ v ] j ∗ ). The revenuefrom the reserve price r ( m − e i ) , i.e., f ( r ( m − e i ) , v ) , is [ v ] ˆ j if r < [ v ] ˆ j and zero otherwise. Thus, f ( r m , v ) − f ( r ( m − e i ) , v ) is r if r ∈ [[ v ] ˆ j , [ v ] j ∗ ] and zero otherwise, which is exactly q ( i ) ( r ) . Thisinteresting relationship is depicted in Figure 1.We now define the sampling distribution U ( i ) : D × Θ → ∆([0 , m × C ) . For each j ∈ [ m ] , we pick: ( w ( i ) exp , r ( i ) exp ) = (2 m ( θ j m − e j ) , ρ j m ) , or ( w ( i ) exp , r ( i ) exp ) = ( − m ( θ j m − e j ) , ρ j ( m − e i )) , with probability m each. Recall that D = C = R n , where R = { ρ , ρ , . . . , ρ m } is the set of possiblereserve prices and ρ j is the j -th largest reserve price in set R . We then have E (cid:2) ˆ p ( i ) ( θ , y ) (cid:3) = E h ˆ p ( i ) ( θ , AdvB ( i ) ( r , v )) i = E [ f ( r ( i ) exp , v ) w ( i ) exp ]= θ T f ( ρ m , v ) − f ( ρ ( m − e i ) , v ) ... f ( ρ m m , v ) − f ( ρ m ( m − e i ) , v ) − f ( ρ m , v ) − f ( ρ ( m − e i ) , v ) ... f ( ρ m m , v ) − f ( ρ m ( m − e i ) , v ) iazadeh et al.: Online Learning via Offline Greedy rf ( r m , v )[ v ] ˆ j [ v ] j ∗ [ v ] ˆ j r rf ( r ( m − e i ) , v )[ v ] ˆ j [ v ] j ∗ [ v ] ˆ j r rq ( i ) ( r ) [ v ] ˆ j [ v ] j ∗ r Figure 1 The function q ( i ) (right) and the two functions we combine to get it (left, center). The solid red linedenotes the function value when i is the highest bidder, and the dashed blue line denotes the function value when i isnot the highest bidder. = θ T q ( i ) ( ρ ) ... q ( i ) ( ρ m ) − q ( i ) ( ρ ) ... q ( i ) ( ρ m ) = θ T y n − y = p ( i ) ( θ , y ) . Wrapping up, Algorithm 6 is an extended (cid:0) , (cid:1) -robust approximation algorithm with n subproblems andwith a payoff diameter D ( p ) of O (1) and a payoff estimator diameter D ( ˆ p ) of O ( m ) . It is also banditBlackwell reducible. Therefore, from Theorems 2 and 4: -regret(Algorithm 2 applied on Algorithm 6) ≤ O ( nT / log / m )12 -regret(Algorithm 4 applied on Algorithm 6) ≤ O ( nm / T / log / m ) . This completes the proof. (cid:4)
E.3. Proof of Corollary 2
Proof . Let m ∈ Z + be a parameter we choose later to balance terms. We invoke Theorem 6 with thediscretization R = { , m , m , . . . , } . Given any reserves r ∗ ∈ [0 , n , we can produce rounded reserves ˜ r ∗ defined by rounding every reserve down to the nearest multiple of m : [˜ r ] i = m ⌊ m [ r ] i ⌋ . Importantly, thisnever causes any bidder to fail to clear their reserve price (this is why we must round down and cannotround up). Hence this can only grow the set of bidders that clear their reserve and hence the maximum bidfrom this set can only increase. If a bidder that was already in this set proceeds to win the auction, thenthey are only competing with more bidders and their reserve price drops by at most /m , so their paymentcan only drop by at most /m . If a bidder not previously in this set proceeds to win the auction, then theirreserve price used to be higher than their valuation, but their valuation must be higher than the previouswinner’s valuation. They pay at least their reserve less m , so the revenue of the auction drops by at most /m in this case as well. Hence the (summed) discretization error is T m , and we choose either m = n T / (full-information) or m = n − / T / (bandit) to obtain: O (cid:0) nT / log / m (cid:1) + T m = O (cid:0) nT / log / T (cid:1) O (cid:0) nm / T / log / m (cid:1) + T m = O (cid:0) n / T / log / ( nT ) (cid:1) iazadeh et al.: Online Learning via Offline Greedy (cid:4)
Appendix F: Proofs and Remarks of Section 6.3 – Non-monotone SubmodularMaximization
In this appendix, we first discuss the differences between Algorithm 7 and the bi-greedy algorithm byNiazadeh et al. (2018), and show that despite these differences Algorithm 7 obtains the same approximationfactor as that of the bi-greedy algorithm. We then present the proof of Theorem 7.
F.1. Discussion on Algorithm 7
Algorithm 7 is a modification of the bi-greedy algorithm by Niazadeh et al. (2018). But, as we show inthis section, these modifications do not change the / approximation factor of the bi-greedy algorithm.We modify the bi-greedy algorithm to better satisfy the form of Algorithm 1, ease our construction of thesampling device and unbiased estimators in the bandit case, and provide a unified framework for submodularfunctions with a more general domain. The major differences and their corresponding reasons are as follows:• To cover a more general discrete function domain, we optimize over points in the discrete set R whiletheir algorithm optimizes over [0 , n implemented by casting an ǫ − net.• To help us construct the sampling device and unbiased estimators for the bandit case, in our localoptimization step, we use ζ ( i ) (ˆ z, z ′ ) , which is a linear combination of marginal functions α ( i ) and β ( i ) ,rather than max (cid:8) α ( i ) (ˆ z ) − α ( i ) ( z ′ ) , β ( i ) (ˆ z ) − β ( i ) ( z ′ ) (cid:9) , in quantifying the value decrease of ˆ z . Recall thatin this step, we choose θ ( i ) ∈ ∆( R ) so that for all ˆ z ∈ R , E z ′ ∼ θ ( i ) (cid:2) α ( i ) ( z ′ ) + β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ) (cid:3) ≥ .Using the technique in Niazadeh et al. (2018), as we argue next, we can still find θ ( i ) that satisfies thecondition in the local optimization step. Note that the bi-greedy analysis in Niazadeh et al. (2018) provesthat satisfying this condition implies that Algorithm 7 is a -approximation algorithm for the discretizedsubmodular maximization problem. Satisfying the Local Optimization Step.
Here, we show how to choose θ ( i ) ∈ ∆( R ) that satisfies thecondition in the local optimization step of Algorithm 7. To do so, First, we choose z ℓ ∈ arg max z ∈R f ( z, ¯ z ( i − ) and z u ∈ arg max z ∈R f ( z, z ¯ ( i − ) . Then, we look at these two cases. Case (i): z u ≤ z ℓ . We want to prove that deterministically returning z ℓ ( θ ( i ) puts all its weight on z ℓ )suffices. The key realization is that in this case, z u and z ℓ maximize the functions f ( · , ¯ z ( i − ) and f ( · , z ¯ ( i − )) respectively: f ( z ℓ , ¯ z ( i − ) ≥ f ( z u , ¯ z ( i − ) , f ( z u , z ¯ ( i − ) ≥ f ( z ℓ , z ¯ ( i − ) . (22)We know by submodularity that two points are better than their coordinate-wise max and min: f ( z u , z ¯ ( i − ) + f ( z ℓ , ¯ z ( i − ) ≤ f ( z ℓ , z ¯ ( i − ) + f ( z u , ¯ z ( i − ) . (23)Since adding up the first two inequalities in Equation (22) yields the third inequality in Equation (23), butwith the direction reversed, we know all three must hold with equality. We conclude by noting that since z ℓ maximizes both functions, it also maximizes both α ( i ) and β ( i ) at some nonnegative value and hence satisfiesthe desired condition in the local step optimization. iazadeh et al.: Online Learning via Offline Greedy Case (ii): z ℓ < z u . Suppose that the algorithm is able to find a θ ( i ) such that for any ˆ z ∈ [ z ℓ , z u ] , we have E z ′ ∼ θ ( i ) (cid:20) α ( i ) ( z ′ ) + 12 β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ) (cid:21) = 12 α ( i ) ( z ′ ) + 12 β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ′ ) ≥ . (24)We claim that this equation is still true for ˆ z outside of the interval [ z ℓ , z u ] .Suppose that ˆ z < z ℓ . By the choice of z ℓ , we know that β ( i ) ( z ℓ ) ≥ β ( i ) (ˆ z ) . By submodularity, we know that: f (ˆ z, z ¯ ( i − ) + f ( z ℓ , ¯ z ( i − ) ≤ f ( z ℓ , z ¯ ( i − ) + f (ˆ z, ¯ z ( i − ) α ( i ) (ˆ z ) + β ( i ) ( z ℓ ) ≤ α ( i ) ( z ℓ ) + β ( i ) (ˆ z ) β ( i ) ( z ℓ ) − β ( i ) (ˆ z ) ≤ α ( i ) ( z ℓ ) − α ( i ) (ˆ z ) . Since the LHS is nonnegative by the choice of z ℓ , so is the RHS. We have shown that z ℓ has strictly larger α ( i ) and β ( i ) values (than ˆ z ) and hence inequality in Equation (24) must be valid for ˆ z < z ℓ as well. Analogousreasoning shows the same for the z r < ˆ z case. Notice that the method in Niazadeh et al. (2018) is able tocompute a θ ( i ) that guarantees E z ′ ∼ θ ( i ) (cid:2) α ( i ) ( z ′ ) + β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ) (cid:3) ≥ for any ˆ z ∈ [ z ℓ , z u ] , which meansthis is also true for any ˆ z ∈ R . Recall that the payoff function is (cid:2)
Payoff (cid:0) θ , z ¯ ( i − , f (cid:1)(cid:3) j = E z ′ ∼ θ (cid:20) α ( i ) ( z ′ ) + 12 β ( i ) ( z ′ ) − ζ ( i ) ( ρ j , z ) (cid:21) , so such θ ( i ) also guarantees that Payoff (cid:0) θ ( i ) , z ¯ ( i − , f (cid:1) is in the positive orthant. F.2. Proof of Theorem 7
Proof . We will show that our meta Algorithms 2 and 4 work by verifying the following conditions. (i) Algorithm 6 is an extended ( , ) -robust approximation algorithm. Following the analysis of thebi-greedy algorithm in Buchbinder et al. (2015), we consider three sequences of points: the lower boundsequence z ¯ ( i ) , the upper bound sequence ¯ z ( i ) , and the hybrid-optimal sequence z ∗ ( i ) . The key proof idea isto bound the decrease in the hybrid-optimal sequence value z ∗ ( i ) with the total increase in the lower boundand upper bound sequence values. We define z ¯ ( i ) and ¯ z ( i ) to agree on the first i coordinates, while the restof the coordinates are ρ for z ¯ ( i ) and ρ m for ¯ z ( i ) . The hybrid-optimal sequence starts from z ∗ (0) , z ∗ , then z ∗ ( i ) is equal to z ∗ ( i − but with the i th coordinate replaced with the sampled z ′ i ∼ θ ( i ) .Importantly, if the i th -coordinate of the optimal vector z ∗ , which is z ∗ i , is less than our sampled point z ′ i from the i th subproblem/iteration, then the loss in value of the hybrid-optimal sequence is bounded by adifference of two β ( i ) evaluations. In particular, the submodularity of f implies: f ( z ∗ i , z ∗ ( i − − i ) + f ( z ′ i , ¯ z ( i − − i ) ≤ f ( z ′ i , z ∗ ( i − − i ) + f ( z ∗ i , ¯ z ( i − − i ) f ( z ∗ ( i − ) + β ( i ) ( z ′ i ) ≤ f ( z ∗ ( i ) ) + β ( i ) ( z ∗ i ) f ( z ∗ ( i − ) − f ( z ∗ ( i ) ) ≤ β ( i ) ( z ∗ i ) − β ( i ) ( z ′ i ) . There is also the symmetric case where the i th -coordinate of the optimal vector z ∗ i is greater than oursampled point z ′ i from the the i th subproblem: f ( z ′ i , z ¯ ( i − − i ) + f ( z ∗ i , z ∗ ( i − − i ) ≤ f ( z ∗ i , z ¯ ( i − − i ) + f ( z ′ i , z ∗ ( i − − i ) α ( i ) ( z ′ i ) + f ( z ∗ ( i − ) ≤ α ( i ) ( z ∗ i ) + f ( z ∗ ( i ) ) f ( z ∗ ( i − ) − f ( z ∗ ( i ) ) ≤ α ( i ) ( z ∗ i ) − α ( i ) ( z ′ i ) . iazadeh et al.: Online Learning via Offline Greedy ζ ( i ) ): f ( z ∗ ( i − ) − f ( z ∗ ( i ) ) ≤ ζ ( i ) ( z ∗ i , z ′ i ) . (25)Also, just by the definition of α ( i ) and β ( i ) we know that: f ( z ¯ ( i ) ) − f ( z ¯ ( i − ) = α ( i ) ( z ′ i ) (26) f ( ¯ z ( i ) ) − f ( ¯ z ( i − ) = β ( i ) ( z ′ i ) . (27)We are now ready to consider the θ ( i ) t which guarantees that for all i ∈ [ n ] and ˆ z ∈ R , including z ∗ i : T X t =1 E z ′ i ∼ θ ( i ) t (cid:20) α ( i ) t ( z ′ i ) + 12 β ( i ) t ( z ′ i ) − ζ ( i ) t ( z ∗ i , z ′ i ) (cid:21) ≥ − h ( T ) . Note that α ( i ) t β ( i ) t , and ζ ( i ) t are respectively obtained by replacing f with f t in the definition of α ( i ) β ( i ) ,and ζ ( i ) . We sum those inequalities together and then apply Equations (25), (26), and (27): − nh ( T ) ≤ T X t =1 n X i =1 E z ′ i ∼ θ ( i ) t (cid:20) α ( i ) t ( z ′ i ) + 12 β ( i ) t ( z ′ i ) − ζ ( i ) t ( z ∗ i , z ′ i ) (cid:21) ≤ T X t =1 n X i =1 E (cid:20) (cid:2) f t ( z ¯ ( i ) ) − f t ( z ¯ ( i − ) (cid:3) + 12 (cid:2) f t ( ¯ z ( i ) ) − f t ( ¯ z ( i − ) (cid:3) − h f t ( z ∗ ( i − ) − f t ( z ∗ ( i ) ) i(cid:21) = T X t =1 E (cid:20) (cid:2) f t ( z ¯ ( n ) ) − f t ( z ¯ (0) ) (cid:3) + 12 (cid:2) f t ( ¯ z ( n ) ) − f t ( ¯ z (0) ) (cid:3) − h f t ( z ∗ (0) ) − f t ( z ∗ ( n ) ) i(cid:21) = T X t =1 E f t ( z t ) − f t ( z ¯ (0) ) | {z } ≥ + 12 f t ( z t ) − f t ( ¯ z (0) ) | {z } ≥ − [ f t ( z ∗ ) − f t ( z t )] ≤ T X t =1 E [2 f t ( z t ) − f t ( z ∗ )] . See that the fourth equality is because the algorithm returns z t = z ¯ ( n ) = ¯ z ( n ) at round t. We finish by movingterms between sides and dividing by two: T X t =1 E [2 f t ( z t ) − f t ( z ∗ )] ≥ − nh ( T ) T X t =1 E [ f t ( z t )] ≥ T X t =1 f t ( z ∗ ) − nh ( T ) . Thus, our algorithm is an extended ( , ) -robust approximation. (ii) Algorithm 7 is bandit Blackwell reducible. We first show that Algorithm 7 is Blackwell reducible.Consider an instance ( X , Y , p ) of Blackwell where X , Θ = ∆( R ) and Y , ∆( C × F ) = ∆( R [ n ] × F ) . Oursynthetic Blackwell adversary function is the deterministic distribution that has weight on its input (point,function) pair and anywhere else, i.e. AdvB ( z , f ) = κ where κ ( z , f ) = 1 . The (asymmetric) biaffine Blackwellpayoff p is the expectation of the Payoff function from Equation (4) over its second input: p ( θ , κ ) , E ( z ,f ) ∼ κ [ Payoff ( θ , z , f )] . iazadeh et al.: Online Learning via Offline Greedy S is response-satisfiable since given any player 2 distribution κ over (point, function)pairs, we can convert each pair into the marginal functions α ( i ) and β ( i ) . Averaging these marginal functionstogether according to their likelihood in κ does not impact the submodularity fact we require for our proofs.We can think of p ( θ , κ ) as p ( θ , κ ) , E ( z ,f ) ∼ κ [ Payoff ( θ , z , f )] = Payoff ( θ , z , f ′ ) , for another submodular function f ′ ∈ F because a weighted average of submodular function is submodular.Since for any submodular functions f ∈ F and z ∈ C , we show that we can find θ such that Payoff ( θ , z , f ) ≥ , for any κ, the algorithm can find θ such that p ( θ , κ ) is in S . Therefore, Algorithm 7 is Blackwell reducible.To show that Algorithm 7 is bandit Blackwell reducible, we need to construct an unbiased estimatorfor p and an explore sampling device U. In subproblem i, U receives pairs of the form ( θ , z ¯ ( i − ) andreturns ( w exp , z exp ) such that (i) for all f ∈ F , θ ∈ Θ , z ¯ ( i − ∈ D , ˆ p (cid:0) θ , AdvB ( z ¯ ( i − , f ) (cid:1) = f ( z exp ) w exp where ( w exp , z exp ) ∼ U ( θ , z ¯ ( i − ) , and (ii) ˆ p is an unbiased estimator for the actual payoff, i.e. ∀ θ ∈ Θ , κ ∈ Y , wehave E [ ˆ p ( θ , κ )] = p ( θ , κ ) . Because we would like to construct an unbiased estimator of the actual payoff p , which is an expectation(over κ ) of the payoff function Payoff , which is further an affine combination of the functions α ( i ) , β ( i ) , and ζ ( i ) on R , we construct unbiased estimators from function evaluations for these functions. Observe thatgiven z ¯ ( i − , U can immediately reconstruct the corresponding upper bound point: ¯ z ( i − ← z ¯ ( i − ∨ ( ρ , . . . , ρ | {z } first ( i − coordinates , ρ m , . . . , ρ m | {z } last ( n − i + 1) coordinates ) T = ( z ′ , . . . , z ′ i − , ρ m , . . . , ρ m | {z } last ( n − i + 1) coordinates ) T . We can use z ¯ ( i − and ¯ z ( i − to express the marginal functions α ( i ) and β ( i ) , α ( i ) , α ( i ) ( ρ ) α ( i ) ( ρ ) ... α ( i ) ( ρ m ) = f ( ρ , z ¯ ( i − − i ) − f ( z ¯ ( i − ) f ( ρ , z ¯ ( i − − i ) − f ( z ¯ ( i − ) ... f ( ρ m , z ¯ ( i − − i ) − f ( z ¯ ( i − ) β ( i ) , β ( i ) ( ρ ) β ( i ) ( ρ ) ... β ( i ) ( ρ m ) = f ( ρ , ¯ z ( i − − i ) − f ( ¯ z ( i − ) f ( ρ , ¯ z ( i − − i ) − f ( ¯ z ( i − ) ... f ( ρ m , ¯ z ( i − − i ) − f ( ¯ z ( i − ) . These can be used in turn to express our comparison function ζ ( i ) : ζ ( i ) , ζ ( i ) ( ρ , ρ ) ζ ( i ) ( ρ , ρ ) · · · ζ ( i ) ( ρ , ρ m ) ζ ( i ) ( ρ , ρ ) ζ ( i ) ( ρ , ρ ) · · · ζ ( i ) ( ρ , ρ m ) ... ... . . . ... ζ ( i ) ( ρ m , ρ ) ζ ( i ) ( ρ m , ρ ) · · · ζ ( i ) ( ρ m , ρ m ) = diag (cid:0) α ( i ) (cid:1) L m,m − L m,m diag (cid:0) α ( i ) (cid:1) + diag (cid:0) β ( i ) (cid:1) U m,m − U m,m diag (cid:0) β ( i ) (cid:1) , where L m,m is the lower-triangular matrix defined by [ L m,m ] i,j = [ i > j ] and U m,m is the upper-triangularmatrix defined by [ U m,m ] i,j = [ i < j ] . Our desired payoff function can be expressed using all three of thesefunctions: Payoff ( θ , z ¯ ( i − , f ) = (cid:20) m (cid:0) α ( i ) (cid:1) T + 12 m (cid:0) β ( i ) (cid:1) T + (cid:0) ζ ( i ) (cid:1)(cid:21) θ , iazadeh et al.: Online Learning via Offline Greedy m is the m -dimensional all-ones vector. By using matrix notation, we have managed to clearly expressour desired payoff function as the linear combination of many function evaluations.We now define the explore sampling distribution U : Θ × D → ∆( R m × C ) as follows. With probability,we return the point z exp = z ¯ ( i − and weight vector w exp = ( − diag ( m ) θ = ( − m , where diag ( m ) isthe identity matrix with size m × m . With probability, we return the point z exp = ¯ z ( i − and weightvector w exp = ( − diag ( m ) θ = ( − m . For i = 1 , ..., m , with m probability we return z exp = ( ρ i , z ¯ ( i − − i ) and w exp = (4 m ) (cid:2) m e Ti + diag ( e i ) L m,m − L m,m diag ( e i ) (cid:3) θ . For i = 1 , ..., m , with m probability we return thepoint z exp = ( ρ i , ¯ z ( i − − i ) and weight vector w exp = (4 m ) (cid:2) m e Ti + diag ( e i ) U m,m − U m,m diag ( e i ) (cid:3) θ . Observethat, at subproblem i (essentially by construction): E [ ˆ p ( θ , κ )] = E ( z ¯ ( i − ,f ) ∼ κ E (cid:2) ˆ p (cid:0) θ , AdvB ( z ¯ ( i − , f ) (cid:1)(cid:3) = E ( z ¯ ( i − ,f ) ∼ κ E ( w exp , z exp ) ∼ ExpS ( θ , z ¯ ( i − ) [ f ( z exp ) w exp ] , where E ( w exp , z exp ) ∼ ExpS ( θ , z ¯ ( i − ) [ f ( z exp ) w exp ]= 14 f ( z ¯ ( i − ) [( − diag ( m ) θ ] + 14 f ( ¯ z ( i − ) [( − diag ( m ) θ ]+ m X i =1 m f ( ρ i , z ¯ ( i − − i ) (cid:20) (4 m ) (cid:20) m e Ti + diag ( e i ) L m,m − L m,m diag ( e i ) (cid:21) θ (cid:21) + m X i =1 m f ( ρ i , ¯ z ( i − − i ) (cid:20) (4 m ) (cid:20) m e Ti + diag ( e i ) U m,m − U m,m diag ( e i ) (cid:21) θ (cid:21) = f ( z ¯ ( i − ) (cid:20) − m Tm − diag ( m ) L m,m + L m,m diag ( m ) (cid:21) θ + f ( ¯ z ( i − ) (cid:20) − m Tm − diag ( m ) U m,m + U m,m diag ( m ) (cid:21) θ + m X i =1 f ( ρ i , z ¯ ( i − − i ) (cid:20)(cid:20) m e Ti + diag ( e i ) L m,m − L m,m diag ( e i ) (cid:21) θ (cid:21) + m X i =1 f ( ρ i , ¯ z ( i − − i ) (cid:20)(cid:20) m e Ti + diag ( e i ) U m,m − U m,m diag ( e i ) (cid:21) θ (cid:21) = (cid:20) m ( α ( i ) ) T + 12 m ( β ( i ) ) T + ζ ( i ) (cid:21) θ = Payoff ( θ , z ¯ ( i − , f ) . This explore sampling device also clearly runs in polynomial-time. Finally, we have E [ ˆ p ( θ , κ )] = E ( z ¯ ( i − ,f ) ∼ κ E ( w exp , z exp ) ∼ ExpS ( θ , z ¯ ( i − ) [ f ( z exp ) w exp ]= E ( z ¯ ( i − ,f ) ∼ κ (cid:2) Payoff ( θ , z ¯ ( i − , f ) (cid:3) = p ( θ , κ ) . This completes the proof of bandit Blackwell reducibility.For our bounds, we care about both the ℓ ∞ diameter of the payoff D ( p ) and the ℓ ∞ diameter of the payoffestimator D ( ˆ p ) . The former is bounded by O (1) , since for any θ , the payoff function is a linear combinatrionof O (1) function evaluations with O (1) coefficients. The latter is bounded by O ( m ) since aside from the iazadeh et al.: Online Learning via Offline Greedy O (4 m ) -scaling, the function evaluation yields a result in the range [0 , and the remaining terms have O (1) norms: k m k ∞ = 1 (cid:13)(cid:13)(cid:13)(cid:13) m e Ti θ (cid:13)(cid:13)(cid:13)(cid:13) ∞ = 12 [ θ ] i ≤ k diag ( e i ) L m,m θ k ∞ = X ji [ θ ] j ≤ k U m,m diag ( e i ) k ∞ = [ θ ] i ≤ . We complete the proof by applying Theorem 2 and Theorem 4, noting that our payoff dimension d equalsthe number of potential values that a coordinate can take, m : -regret(Algorithm 2 applied on Algorithm 7) ≤ O ( nT / log / m )12 -regret(Algorithm 4 applied on Algorithm 7) ≤ O ( nm / T / log / m ) . (cid:4) F.3. Proof of Corollaries 3 and 4