[PDF] Online Learning via Offline Greedy Algorithms: Applications in Market Design and Optimization

Abstract

Motivated by online decision-making in time-varying combinatorial environments, we study the problem of transforming offline algorithms to their online counterparts. We focus on offline combinatorial problems that are amenable to a constant factor approximation using a greedy algorithm that is robust to local errors. For such problems, we provide a general framework that efficiently transforms offline robust greedy algorithms to online ones using Blackwell approachability. We show that the resulting online algorithms have O(\sqrt{T}) (approximate) regret under the full information setting. We further introduce a bandit extension of Blackwell approachability that we call Bandit Blackwell approachability. We leverage this notion to transform greedy robust offline algorithms into a O(T^{2/3}) (approximate) regret in the bandit setting. Demonstrating the flexibility of our framework, we apply our offline-to-online transformation to several problems at the intersection of revenue management, market design, and online optimization, including product ranking optimization in online platforms, reserve price optimization in auctions, and submodular maximization. We show that our transformation, when applied to these applications, leads to new regret bounds or improves the current known bounds.

Full PDF

aa r X i v : . [ c s . L G ] F e b Online Learning via Oﬄine Greedy Algorithms:Applications in Market Design and Optimization

Rad Niazadeh

Chicago Booth School of Business, Operations Management, [email protected]

Negin Golrezaei

MIT Sloan School of Management, Operations Management, [email protected]

Joshua Wang

Google Research Mountain View, [email protected]

Fransisca Susan

MIT Sloan School of Management, Operations Management, [email protected]

Ashwinkumar Badanidiyuru

Google Research Mountain View, [email protected]

Motivated by online decision-making in time-varying combinatorial environments, we study the problemof transforming oﬄine algorithms to their online counterparts. We focus on oﬄine combinatorial problemsthat are amenable to a constant factor approximation using a greedy algorithm that is robust to localerrors. For such problems, we provide a general framework that eﬃciently transforms oﬄine robust greedyalgorithms to online ones using Blackwell approachability. We show that the resulting online algorithms have O ( √ T ) (approximate) regret under the full information setting. We further introduce a bandit extension ofBlackwell approachability that we call Bandit Blackwell approachability. We leverage this notion to transformgreedy robust oﬄine algorithms into a O ( T / ) (approximate) regret in the bandit setting. Demonstratingthe ﬂexibility of our framework, we apply our oﬄine-to-online transformation to several problems at theintersection of revenue management, market design, and online optimization, including product rankingoptimization in online platforms, reserve price optimization in auctions, and submodular maximization. Weshow that our transformation, when applied to these applications, leads to new regret bounds or improvesthe current known bounds. Key words : Blackwell approachability, Oﬄine-to-online, No-regret, Submodular maximization, Productranking, Reserve price optimization.

1. Introduction

We study the problem of designing eﬃcient no-regret—also known as vanishing regret —online learn-ing algorithms in complex real-world environments, where the underlying decision-making process iscombinatorial in nature. In such environments, a decision-maker (learner) needs to experiment withexponentially many options whose rewards exhibit non-trivial and non-linear structures. Exploit- iazadeh et al.: Online Learning via Oﬄine Greedy ing such structures to design eﬃcient online learning algorithms is challenging as the underlyingoﬄine problems can indeed be NP-hard. Such oﬄine problems can only admit approximation algo-rithms. Therefore, any eﬃcient online learning algorithm can only hope to obtain vanishing regretwith respect to an in-hindsight approximately optimal benchmark. This motivates our key researchquestions: How can one transform existing approximation algorithms for NP-hard oﬄine problems to van-ishing regret learning algorithms for a wide range of combinatorial environments? Can we eﬃ-ciently exploit the combinatorial reward structure to eliminate the necessity of experimentingwith exponentially many arms?

To answer these questions, we consider an adversarial online learning setting. In every round t ,the learner takes an action by choosing a (feasible) point z t among possibly exponentially manychoices, and receives a reward of f t ( z t ) . The adversarially chosen reward function f t ∈ F , whichis unknown to the learner at the time of action, can be non-linear in action z t . We are interestedin settings where the oﬄine problem is NP-hard, and amenable to a γ -approximation algorithm,where γ ∈ (0 , . In the oﬄine problem, the reward function f ∈ F is fully known, and the goal isto choose a feasible point z that maximizes the obtained reward f ( z ) .We focus on the prevalent class of oﬄine approximation algorithms with a greedy nature. Roughlyspeaking, such approximation algorithms build up a solution stage by stage, choosing the nextstage that oﬀers the most local improvement with respect to a metric. We require the greedyapproximation algorithms to be robust to local errors in every stage; for details, see Deﬁnition 8.Several combinatorial problems studied in operations research and computer science, ranging fromclassic submodular maximization problems to more recently studied optimization problems relatedto market design and revenue management, admit such robust greedy approximation algorithms.For details, see Section 6.The problem of transforming oﬄine problems to online learning algorithm is studied byKalai and Vempala (2005) and Dudík et al. (2017) when the learner can solve the oﬄine problemeﬃciently. However, the approaches in these works fail when the learner only has access to anapproximate solutions to the oﬄine problem. This drawback is alleviated by Kakade et al. (2009)who study the oﬄine to online transformation when (i) the oﬄine problem is NP-hard but amenableto approximation, and (ii) the reward function is linear in the learner’s action. While the proposedapproach in Kakade et al. (2009) provides a general purpose oﬄine-to-online blackbox reduction, itcrucially uses the linearity of the reward function. As a result, it cannot be applied to our settingswith nonlinear reward functions. We highlight that as shown by Hazan and Koren (2016), for a Our framework can also be applied to polynomially solvable problems. For these problems, the approximation factor γ = 1 . iazadeh et al.: Online Learning via Oﬄine Greedy general oﬄine problem, there may not exist an eﬃcient oﬄine-to-online transformation, justifyingour assumption on the type of approximation algorithms we study.We now summarize our main contributions. A framework for oﬄine-to-online transformation.

We design a uniﬁed framework to trans-form robust greedy approximation algorithms to eﬃcient online learning algorithms when the rewardfunctions are not necessarily linear. We consider two online learning settings: full information and bandit . In the full information setting, the learner observes function f t after taking action z t , andin the bandit setting, the learner only observes the obtained reward f t ( z t ) .For both settings, our proposed transformation relies on the celebrated Blackwell approacha-bility theorem due to Blackwell (1956). The Blackwell approachability theorem is concerned witha two-player repeated game with a vector payoﬀ, and presents a strategy under which the time-averaged vector payoﬀ approaches some target set S that satisﬁes certain properties. As it is shownin Abernethy et al. (2011), there is a strong connection between Blackwell approachability anddesigning vanishing regret learning algorithms. In fact, for online linear optimization, they showthat any strategy/algorithm for Blackwell approachability can be transformed to a vanishing regretlearning algorithm and vice versa. Online learning algorithms using Blackwell strategies.

In this work, as one of our maincontributions, we show that the transformation of Blackwell strategies to online vanishing regretalgorithms is also possible for combinatorial non-linear learning settings whose underlying oﬄineproblem is NP-hard and admits a robust greedy γ -approximation algorithm. Speciﬁcally, we showthat if the oﬄine problem is Blackwell reducible (see Deﬁnitions 10 and 12), then we can design avanishing regret learning algorithm by running a Blackwell algorithm for each stage (subproblem)of the oﬄine greedy algorithm. In every round, these Blackwell algorithms are run sequentially tobuild up the learner’s action stage by stage. This allows the Blackwell algorithms to communicatewith each other in a speciﬁc pattern dictated by the oﬄine greedy algorithm. Thanks to suchcommunication between Blackwell algorithms and the robustness of the oﬄine greedy algorithm tolocal errors, the resulting online algorithm has a vanishing γ -regret. In fact, for the full informationsetting, we show that this transformation leads to an algorithm with O ( N √ T ) γ -regret, where N is the number of subproblems in the oﬄine algorithm.The bandit setting turns out to be much trickier as the Blackwell algorithms cannot all obtaintheir desired feedback to update their course of actions over time. To resolve this, we introduce anovel bandit version of the sequential Blackwell game, that we call bandit Blackwell . In this version,the player/algorithm does not obtain any feedback on his payoﬀ unless he agrees to pay a certain Our regret bounds also depend on the diameter of vector payoﬀ of the Blackwell games and their dimension; seeTheorems 2 and 3. iazadeh et al.:

Online Learning via Oﬄine Greedy cost. When the player agrees to pay such a cost, an extra “exploration” will be done, and he obtainsan unbiased estimator of his payoﬀ. Surprisingly, we show that in the bandit Blackwell sequentialgames, approachability is feasible. We further give a tight lower bound on the rate of convergencefor bandit Blackwell sequential games.Leveraging our notions of bandit Blackwell sequential games and approachability, we present anoﬄine-to-online transformation in which N bandit Blackwell algorithms communicate with eachother to build up a solution. To mimic the extra exploration step of bandit Blackwell games, we showhow this communication can be interrupted in a controlled way when one of the bandit algorithmsrequests acquiring feedback. We also show how the required unbiased estimator of the vector payoﬀcan be constructed. These pieces together give us the ﬁnal bandit oﬄine-to-online transformation.We show that this transformation leads to an online algorithm with O ( N T / ) γ -regret. Applications.

Finally, to demonstrate the generality and eﬀectiveness of our framework, we applyour oﬄine-to-online transformation to several problems at the intersection of revenue management,market design, and online optimization that have been proposed and studied in the literature. Inparticular, we consider problems of (i) optimizing product ranking, (ii) optimizing personalizedreserve prices in second price auctions, and (iii) Submodular Maximization (SM) in discrete andcontinuous domains (see Table 1). We show that in most cases, our transformations lead to new orimproved regret bounds. We emphasize that applications presented in this work are only selectivesamples of applications that can ﬁt to our framework. The fact that we can obtain improved boundsfor these well-studied problems highlight the generality of the framework and its potentials to beapplied to other problems in the operations research domain, or even other domains of interest.In the following, we discuss our bounds in detail.

Product ranking optimization.

Online marketplaces have the opportunity of optimizingthe ranking of displayed products in order to improve revenue, shape the demand, and reduceusers’ search cost (see, for example, Athey and Ellison (2011), Kempe et al. (2003), Ursu (2016),Aouad and Segev (2015), and Derakhshan et al. (2018)). Inspired by this, we study the productranking problem in the online adversarial setting with the objective of maximizing user engagement.To express user engagement as a function of the ranking over the products, we use the model pro-posed by Asadpour et al. (2020), which is a generalization of the model presented by Ferreira et al.(2019). Under this model, the oﬄine ranking problem can be written as maximizing sequentialsubmodular functions; see Section 6.1 for the deﬁnition of the sequential submodular functions. Byapplying our framework to this problem, we get O ( n √ T log n ) -regret and O (cid:0) n / (log n ) / T / (cid:1) -regret in full information and bandit settings, respectively. We note that our work is the ﬁrst onethat studies the product ranking problem under the aforementioned model in an online adversarial iazadeh et al.: Online Learning via Oﬄine Greedy setting. The oﬄine PAC learning problem, which resembles aspects of the online learning in thestochastic setting, is studied by Ferreira et al. (2019) for a special case of our model. Optimizing personalized reserve prices.

Second price auctions with reserve prices are preva-lent in many marketplaces including online advertising markets, making them objects of bothwide practical relevance and scientiﬁc interest (see, for example, Hartline and Roughgarden (2009),Cesa-Bianchi et al. (2014), Beyhaghi et al. (2018), Roughgarden and Wang (2019), Golrezaei et al.(2019)). We study the online problem of optimizing personalized reserve prices, where buyers’ val-uations are chosen adversarially in every round. By applying our framework to this problem, weachieve O ( n √ T log T ) − regret in the full-information setting and O ( n / T / (log nT ) / ) − regretin the bandit setting. Our results match the previous bound for the full-information setting byRoughgarden and Wang (2019) who apply a slight variant of the Follow-the-Perturbed-Leader algo-rithm of Kalai and Vempala (2005) every round for each bidder; the bandit setting had not beenstudied prior to our work. We should note that in the special case with symmetric buyers and uni-form reserve prices (also known as anonymous reserve auction, cf. Alaei et al. (2019)), minimizingregret under stochastic bandit setting is studied in Cesa-Bianchi et al. (2014), in which they obtain ˜ O ( n √ T ) regret bound. Here, the oﬄine problem of ﬁnding the uniform optimal reserve can be solvedexactly in polynomial time. Submodular maximization problems.

Many optimization problems that arise in the realworld, including revenue management problems, can be expressed as maximizing a submodularfunction. The notion of submodularity is commonly used to describe the diminishing return prop-erty in discrete and continuous domains. Examples include the welfare maximization problem (e.g.,Dobzinski and Schapira (2006) and Vondrák (2008)), capital budgeting with risk-averse investors(e.g., Weingartner (1967) and Ahmed and Atamtürk (2011)), and the problem of maximizing inﬂu-ence through the network (e.g., Kempe et al. (2003) and Mossel and Roch (2010)).We apply our framework to the adversarial online submodular maximization problem. For theonline problem of maximizing non-monotone set submodular functions without any constraints, wetransform a variation of the bi-greedy oﬄine algorithm by Buchbinder and Feldman (2018) usingour framework and obtain O ( nT / ) − regret in the full-information setting, matching the previousbound by Roughgarden and Wang (2018) who also take advantages of the bi-greedy oﬄine algorithmof Buchbinder and Feldman (2018). Here, n is the number of coordinates. For the bandit setting,our transformation yields O ( nT / ) − regret. To the best of our knowledge, this is the ﬁrst regretbound for the bandit setting of this challenging problem. PAC stands for probably approximately correct. iazadeh et al.:

Online Learning via Oﬄine Greedy For the online problem of maximizing continuous submodular functions without any constraints,we transform a variation of the continuous bi-greedy algorithm by Niazadeh et al. (2018) and obtain O ( n √ T log T ) − regret in the online full-information setting. For the bandit setting, we obtain O ( nT / (log T ) / ) − regret when the continuous submodular functions is weak-DR. Our resultsfor weak-DR submodular functions trivially yield results for strong-DR submodular functions. Wehighlight that the notion of weak-DR submodularity is equivalent to continuous submodularityand is easier to satisfy than strong-DR submodularity, which additionally requires coordinate-wiseconcavity; see the deﬁnition of weak-DR and strong-DR submodular functions in Section 2. Our workis the ﬁrst one that designs online algorithms for weak-DR submodular functions. Furthermore, ourbounds improve the previous bounds for strong-DR submodular functions by Thang and Srivastav(2019), which are O ( T / ) − regret and O ( T / ) − regret for the full-information and banditsettings, respectively. The aforementioned regret bounds are obtained using a variation of the Frank-Wolfe algorithm.For the online problem of maximizing set monotone submodular functions with cardinally con-straints with size k , we transform the oﬄine greedy algorithm by Nemhauser et al. (1978), which isa (1 − /e ) − approximation, to yield O ( k √ T log n ) (1 − /e ) − regret in the online full-informationsetting, matching the bound by Streeter and Golovin (2008) who use a variation of the EXP3algorithm. Furthermore, our framework gives O ( kn (log n ) / T / ) (1 − /e ) − regret in the ban-dit setting, improving the previous bound of O ( k ( n log n ) / T / (log T ) ) (1 − /e ) − regret byStreeter and Golovin (2008, 2007) in the opaque feedback model, which is the limited feedbackmodel that is analog to our bandit feedback model under exploration. See Section 5 for more details. While the closely related work has already been discussed, in this section, broader related work willbe discussed.

Combinatorial learning.

Our work is related to the literature on online combinatorial learning.While in our work, we study the design of eﬃcient online learning algorithms for combinatorialproblems whose loss function is not necessarily linear in the chosen action, the work on combinatoriallearning focuses on linear loss functions; see, for example, Abernethy et al. (2008), Uchiya et al.(2010), Cesa-Bianchi and Lugosi (2012), Audibert et al. (2014), Chen et al. (2013), Combes et al.(2015), and Zimmert et al. (2019). Here, the learner’s loss is a inner product of a d -dimensionalaction z t and loss vector a t . This line of work examines both the full-information and bandit settings.In the full-information setting, the learner observes the loss vector a t , while in the bandit setting,only the loss a Tt z t is observable. The standard exponentially weighted average forecaster obtains a We omit the dependence on the Lipschitz constant here. iazadeh et al.:

Online Learning via Oﬄine Greedy Table 1 Our results for selective applications of our framework, compared to previously known results.

Online Full-Information Setting Online Bandit SettingApplication Approx Our γ -Regret The Best Our γ -Regret The BestFactor ( γ ) Bound Prior Bound Bound Prior BoundProduct Ranking Problem / - - O ( n √ T log n ) O (cid:16) n / (log n ) / T / (cid:17) Reserve Price Optimization / O ( n √ T log T ) ∗ - O ( n √ T log T ) O (cid:16) n / T / (log nT ) / (cid:17) Monotone Set SM − /e O (cid:0) k √ T log n (cid:1) † O (cid:16) k ( n log n ) / T / (log T ) (cid:17) † with Cardinality Constraints O ( k √ T log n ) O (cid:16) kn / (log n ) / T / (cid:17) Non-Monotone Set / O ( n √ T ) ‡ -SM Functions O ( n √ T ) O (cid:16) nT / (cid:17) Non-monotone Continuous / γ = 1 / , O ( T / ) § γ = 1 / , O ( T / ) § SM (Strong-DR) Functions O ( n √ T log T ) O ( nT / (log T ) / ) Non-monotone Continuous / - -SM (Weak-DR) Functions O ( n √ T log T ) O ( nT / (log T ) / ) ∗ Roughgarden and Wang (2019) † Streeter and Golovin (2008); ‡ Roughgarden and Wang(2018); § Thang and Srivastav (2019); tight O (cid:16) m q T log dm (cid:17) regret in the full-information setting, where m is the maximum ℓ -norm ofaction vectors (Audibert et al. (2014)). The state-of-the-art regret bound for the bandit setting is O (cid:16)q dm T log dm (cid:17) , as reported in several papers (Bubeck et al. (2012), Cesa-Bianchi and Lugosi(2012), Hazan and Karnin (2016)). Our framework achieves matching regret with respect to T inthe full-information setting without requiring the loss function to be linear. We get a worse regret(proportional to T / ) for the bandit setting to account for the non-linearity in loss functions. Online adversarial submodular optimization.

In the previous section, we review some of thework that is closely related to our results on maximizing submodular functions. Here, we reviewother work that studies the problem of maximizing submodular functions in an online adversarialsetting. Chen et al. (2018, 2019) use Frank-Wolfe method to design low-regret learning algorithmsfor maximizing monotone continuous strong-DR submodular functions with matroid constraints.Chen et al. (2018) (respectively Chen et al. (2019)) assume that the algorithm can access to T / exact (respectively T / stochastic) gradient evaluations in every round and design an algorithmwhose (1 − /e ) -regret is O ( √ T ) . The results of Chen et al. (2018, 2019) were later improved byZhang et al. (2019a) who design another Frank-Wolfe inspired learning algorithm that accesses toone stochastic gradient in each round and obtain O ( T / ) (1 − /e ) -regret. Zhang et al. (2019a)further present a learning algorithm in the bandit setting for the problem of maximizing monotone The dependency on the number of elements n is not well speciﬁed in this work. iazadeh et al.: Online Learning via Oﬄine Greedy continuous strong-DR submodular functions subject to matroid constraints. Their algorithm obtain O ( T / ) ( − /e )-regret. In contrast, we do not impose the monotonicity condition when consid-ering continuous submodular functions and hence instead of having the approximation factor of − /e , we have an approximation factor of / . Note that / is a tight approximation ratio (unlessRP=NP). Our O (cid:0) n √ T log T (cid:1) -regret in the full-information setting and O (cid:0) nT / (log T ) / (cid:1) -regret in the bandit setting for non-monotone weak-DR submodular maximization imply the samebounds for non-monotone strong-DR submodular functions. Online stochastic submodular optimization.

Designing learning algorithms for maximizing stochas-tic monotone continuous strong-DR submodular functions has been studied in Hassani et al. (2017),Mokhtari et al. (2018), Hassani et al. (2019), and Zhang et al. (2019b). The best result for this set-ting is by Zhang et al. (2019b) who obtain O ( √ T ) (1 − /e ) -regret using a stochastic variant of theFrank-Wolfe method. Their algorithm also implies the same regret bound for monotone set sub-modular maximization, which matches our regret bound for maximizing monotone set submodularfunction in the adversarial setting. Blackwell approachability.

Several aspects of Blackwell sequential game, including the design ofeﬃcient algorithms for Blackwell game with various information feedback structures, and the alter-native conditions for approachability, have been studied in the literature. In terms of feedbackstructures, the original Blackwell game develops eﬃcient projection algorithm for games that returnthe adversary’s moves on each round. Mannor et al. (2011) develop simple and eﬃcient algorithmsfor a variant of Blackwell game where on each round, the player only obtains a random signal whosedistribution depends on the action of the player and the adversary (as opposed to the action of theadversary). This variant is called Blackwell approachability with partial monitoring, and is furtherstudied in Mannor et al. (2014) and Kwon and Perchet (2017). In terms of equivalent conditionsfor approachability, aside from the original halfspace-satisﬁability condition for approachability inBlackwell (1956), alternative conditions for approachability, including the response-satisﬁability cri-teria that we use in this paper, can be found in Lehrer (2003), Vieille (1992), Spinat (2002), Milman(2006), and Even-Dar et al. (2009).Blackwell approachability has also been proven to be a quintessential tool in construct-ing online learning algorithms in various applications, as shown in Even-Dar et al. (2009),Mannor and Shimkin (2006), and Bernstein and Shimkin (2013). However, most applications do notinvolve NP-hard combinatorial problems, and use the best-ﬁxed action in hindsight (no approxima-tion factor) as the benchmark for regret. Furthermore, they only create one Blackwell instance oneach round. In contrast, we create multiple Blackwell instances on each round because the problems All these results (Chen et al. (2018), Chen et al. (2019), Zhang et al. (2019a)) can be extended for monotone setsubmodular functions using rounding method and multi-linear extension as a bridge. iazadeh et al.:

Online Learning via Oﬄine Greedy we consider have combinatorial nature and can only be solved eﬃciently in multiple stages. Fur-thermore, since we are solving NP-hard combinatorial problems with an intractable oﬄine problem,we use a γ -approximation benchmark in our regret. Organization

In Section 2, we present the oﬄine optimization problem, adversarial online learning framework,Blackwell sequential games, and deﬁnition of set and continuous submodular functions. Section3 presents the greedy approximation algorithm for the oﬄine problem. In Sections 4 and 5, wepresent our oﬄine-to-online transformation in the full information and bandit settings, respectively.Section 6 provides our regret bounds for the product ranking problem, optimizing reserve prices,and maximizing submodular functions. We conclude in Section 7.

2. Preliminaries and Notations

In this section, we formulate our adversarial online learning framework for approximation algorithms.We then give an overview of Blackwell approachability (Blackwell 1956), an important technical toolthat we use in this paper. We also provide a brief recap of a few deﬁnitions and results concerningmaximization of submodular functions, a canonical application for demonstrating our techniques.

Let F be a space of functions deﬁned over a (discrete or continuous) domain D . Assume that F is closed under addition, i.e., for any two functions f , f ∈ F , we have f + f ∈ F . In the oﬄineoptimization problem , the problem of interest is ﬁnding a point z ∗ ∈ D such that z ∗ ∈ arg max z ∈C f ( z ) , (1)where f : D → [0 , , which belongs to F , is the objective function, and C ⊆ D is the feasible region. We further denote the optimal objective value of problem (1) by

OPT ; that is,

OPT = max z ∈C f ( z ) .We focus on maximization problems in this paper, but our techniques and results can easily beextended to minimization problems as well.We consider oﬄine problems that are NP-hard to solve exactly, and at the same time are amenableto a γ -approximation algorithm for some constant γ ∈ (0 , . Definition 1 ( γ -approximation offline algorithm). An oﬄine algorithm A for prob-lem (1) is a polynomial time γ -approximation algorithm if for every f ∈ F returns a feasible (possiblyrandomized) point ˆ z ∈ C in polynomial time in the size of the algorithm’s input, so that E [ f ( ˆ z )] ≥ γ · OPT . Here, the expectation is with respect to the randomness in algorithm A . The constant γ ∈ (0 , isreferred to as the approximation factor of algorithm A . For maximization problems, which are the focus of this paper, we only need our functions to be upper bounded bya constant. However, for simplicity, we assume that our functions are upper bounded by one. iazadeh et al.:

Online Learning via Oﬄine Greedy Framework.

In the adversarial online learning version of problem (1), there is a learner, denotedby ALG, who plays T rounds of a sequential game against an adversary, denoted by ADV. In eachround t ∈ [ T ] , ADV picks a function f t ∈ F and simultaneously ALG takes an action by picking afeasible point z t ∈ C . Then, ALG obtains a reward equal to f t ( z t ) and receives a feedback concerningthis round. We highlight that unlike the oﬄine optimization Problem (1), the unknown function f t is not observable to ALG when it chooses action z t , and he only knows that f t belongs to F .Furthermore, ALG picks his action at time t only given the feedback of previous rounds , , . . . , t − ,and in that sense, ALG is an online learner. ALG’s goal is to pick points { z t } Tt =1 given the feedbackof each round to maximize the accumulated reward P Tt =1 f t ( z t ) against a worst-case adversary ADV.In this paper, for the sake of brevity and simplicity, we limit our focus to worst-case obliviousadversaries, i.e., adversaries that pick the sequence f , f , . . . , f T upfront. Feedback structures.

We consider two feedback structures: (i) full information feedback , whereALG observes the entire function f t after choosing z t , and (ii) bandit feedback , where ALG onlyobserves the quantity f t ( z t ) after choosing z t . Let φ t be the feedback that ALG receives after picking z t . Then, ALG’s next action z t +1 is a function of the history H t , where H t , { ( z , φ ) , . . . , ( z t , φ t ) } .More formally, any learning algorithm ALG can be described as mappings { π ( t ) ALG } Tt =1 , where each π ( t ) ALG maps the history H t − to action z t for any t ∈ [ T ] . The mapping π ( t ) ALG can be either determin-istic or randomized.

Benchmarks and regret.

We would like to design polynomial-time online learning algorithms foroﬄine problems that are NP-hard to solve exactly. Thus, we use the adapted notion of approximateregret to quantify the performance of an online algorithm. This notion is the regret with respect to γ fraction of the objective value at the best in-hindsight point. The notion of γ -regret, which is formallydeﬁned below, is common in the literature, see, for example, Kakade et al. 2009, Dudík et al. 2017,and Roughgarden and Wang 2018. At a high level, our goal is to take an eﬃcient γ -approximationoﬄine algorithm, and transform it to an online algorithm ALG with a sublinear γ -regret. Definition 2 ( γ -regret). Let σ = { ( z t , f t ) } Tt =1 be a sequence of strategies realized by onlinelearner ALG and adversary ADV. Then, for any such σ and γ ∈ (0 , , γ -regret ( σ ) is deﬁned as γ -regret ( σ ) , γ · max z ∈C T X t =1 f t ( z ) − T X t =1 f t ( z t ) . With a slight abuse of the notation, we denote the worst-case expected approximate regret of ALGagainst any (oblivious) adversary ADV as follows: γ -regret ( ALG ) , max { f t } Tt =1 n E [ γ -regret ( σ )] : σ = { ( z t , f t ) } Tt =1 , z t = ALG’s strategy at time t ∈ [ T ] o , where the expectation is with respect to any randomness in ALG. iazadeh et al.: Online Learning via Oﬄine Greedy To transform oﬄine approximation algorithms to eﬃcient online learning algorithms, we take advan-tage of

Blackwell sequential games . A Blackwell sequential game is a repeated two-player gamecharacterized by a tuple ( X , Y , p ) . In this repeated game, X and Y are both compact convex setsrepresenting the players’ action spaces, and p : X × Y → R d is a biaﬃne vector payoﬀ function. Moreover, parameter d ∈ N is known as the dimension of the Blackwell sequential game. The vec-tor payoﬀ function p is assumed to be known by both players. The game is played in T rounds.Each round involves player 1 choosing an action x t ∈ X and player 2 choosing an action y t ∈ Y simultaneously. Both actions may depend on the observed history (( x , y ) , · · · , ( x t − , y t − )) . Thispair of actions produces the vector payoﬀ p ( x t , y t ) . The objective of player 1 is to ensure that thetime-averaged payoﬀ approaches a closed and convex target set S ⊆ R d , and the objective of player2 is to prevent this from happening. Definition 3 (Blackwell approachabilty).

In the Blackwell sequential game ( X , Y , p ) , atarget set S is g ( T ) -approachable if there exists a player 1 strategy such that for every player 2’sstrategy, the resulting sequence of actions satisﬁes d ∞ T T X t =1 p ( x t , y t ) , S ! ≤ g ( T ) , where for any vector w ∈ R d and set S ⊆ R d , d ∞ ( w , S ) , inf v ∈ S k w − v k ∞ is the ℓ ∞ -distance ofvector w from set S .In this paper, we focus on the ℓ ∞ norm rather than the usual ℓ norm since it is more suitablefor our applications. Our bounds on the approachability term g ( T ) will depend on the scale of theproblem, and more formally on the diameter D ( p ) of the payoﬀ function p , deﬁned asD ( p ) , max x ∈X , y ∈Y k p ( x , y ) k ∞ . (2)Ideally, player 1 aims to develop a strategy so that the term g ( T ) in Deﬁnition 3 converges to as T converges to + ∞ , and hence would be able to approach the target set S asymptotically. However,not every closed and convex target set S is approachable. To help with characterizing which setsare approachable, we additionally deﬁne the concept of response-satisﬁablity . Definition 4 (Response-Satisfiable).

A closed and convex target set S is response-satisﬁable in the Blackwell sequential game ( X , Y , p ) if for every player 2’s action y ∈ Y , there existsa player 1’s action x ∈ X such that the vector payoﬀ falls into the target set, that is p ( x , y ) ∈ S . Function p ( · , · ) is biaﬃne if for any x ∈ X , p ( x , · ) is aﬃne and for any y ∈ Y , p ( · , y ) is aﬃne. iazadeh et al.: Online Learning via Oﬄine Greedy Blackwell’s landmark result (Blackwell 1956) is an equivalence of (asymptotic) approachabilityand response-satisﬁablity. We extend this result in the following theorem.

Theorem 1.

A closed and convex target set S is O ( D ( p ) log( d ) / T − / ) -approachable in theBlackwell sequential game ( X , Y , p ) if and only if the set S is response-satisﬁable, where D ( p ) ,deﬁned in Equation (2) , is the ℓ ∞ diameter of the payoﬀ function p , and d is the dimension of thegame. We present a detailed proof of Theorem 1 in Section A.2 in the appendix, which is an adapta-tion of the original result of Blackwell (1956). The main diﬀerence between the Blackwell’s originalresult and Theorem 1 is how the distance between the average payoﬀ and set S is computed. WhileBlackwell uses norm 2, we apply norm inﬁnity. To account for this diﬀerence, we use the equiv-alence between Blackwell approachability and online linear optimization (Abernethy et al. 2011).This equivalence allows us to apply regret bounds for the latter problem that uses an arbitrarynorm to ﬁnd new bounds for the approachability problem. The regret bounds (on online linearoptimization) can then be obtained via using Follow-the-Regularized-Leader (Shalev-Shwartz et al.2012) or Online Mirror Descent (Bubeck et al. 2015) algorithms.We ﬁnish this subsection by important remarks regarding our treatment of the Blackwellapproachability. Remark 1.

As our goal is designing polynomial-time online learning algorithms, we further usealgorithmic results in Even-Dar et al. 2009, and Abernethy et al. (2011) due to the equivalencebetween Blackwell approachability and full information adversarial online linear optimization. Theseresults provide a polynomial-time approachable online algorithm satisfying the bound in Theorem 1,given access to a separation oracle for the closed and convex set S . From this point on, when set S is response-satisﬁable, we assume access to such an online algorithm that uses a separation oraclefor the convex set S in a blackbox fashion. Remark 2.

Another upshot of the above line of research on the equivalence between Blackwellapproachability and full information online linear optimization is that an algorithm for player toapproach set S might only have access to the realized vector payoﬀs ( p ( x , y ) , . . . , p ( x t − , y t − )) in round t , rather than the entire history (( x , y ) , . . . , ( x t − , y t − )) , and this is indeed without loss There are other equivalent structural criteria for approachability similar to response-satisﬁability; see Section A.1in the appendix for a list of these conditions. Given the separation oracle for convex set S , the running-time should be polynomial in d , T , and the number ofbits required to encode X . We are also considering a computational model where either the realized vector payoﬀ isgiven as feedback at the end of each round, or the vector payoﬀ function p can be evaluated eﬃciently at any givenpair of actions ( x , y ) . iazadeh et al.: Online Learning via Oﬄine Greedy of generality for obtaining the optimal bound of Theorem 1 (Abernethy et al. 2011). We relax thisassumption in our “bandit Blackwell sequential game”, where we assume player 1 can only sometimeshave access to an unbiased estimator of the realized vector payoﬀ; see Section 5.1 for the deﬁnitionand more details. A particular class of NP-hard optimization problems we study concern maximizing set or continuous submodular functions . Definition 5 (Set submodularity).

A set function f : 2 [ n ] → [0 , is submodular if for all S, T ⊆ [ n ] , f ( S ∪ T ) + f ( S ∩ T ) ≤ f ( S ) + f ( T ) . Similarly, the concept of submodularity can be extended from subset lattice (above deﬁnition) toany discrete or continuous lattice. In particular, by considering the positive orthant cone lattice, weobtain the following deﬁnition for the continuous variant of set submodularity.

Definition 6 (Continuous submodularity).

A continuous multivariate function f :[0 , n → [0 , is submodular if for all x , y ∈ [0 , n ,f ( x ∨ y ) + f ( x ∧ y ) ≤ f ( x ) + f ( y ) , where ∨ and ∧ are coordinate-wise max and min operations. As an equivalent deﬁnition (Bian et al.2016), f is continuous submodular if for all i ∈ [ n ] , z ∈ [0 , , x − i (cid:22) y − i ∈ [0 , n − , and δ ≥ ,f ( z + δ, x − i ) − f ( z, x − i ) ≥ f ( z + δ, y − i ) − f ( z, y − i ) . The above class of continuous functions is also referred to as the weak-Diminishing Return (weak-DR) submodular in the literature (cf. Bian et al. 2016, Niazadeh et al. 2018, Soma and Yoshida2018). We further consider a special subclass of these functions satisfying concavity along eachcoordinate, referred to as the strong-Diminishing Return (strong-DR).

Definition 7 (Strong-DR continuous Submodularity).

A continuous multivariate func-tion f : [0 , n → [0 , is strong-DR submodular if for all i ∈ [ n ] , x (cid:22) y ∈ [0 , n , and δ ≥ ,f ( x i + δ, x − i ) − f ( x ) ≥ f ( y i + δ, y − i ) − f ( y ) , where x − i (resp. y − i ) is an ( n − -dimensional vector with all coordinate values of x (resp. y )except i , and x (cid:22) y if and only if ∀ j ∈ [ n ] : x j ≤ y j .The problem of maximizing monotone set submodular functions under a cardinality constraintadmits a classic greedy algorithm by Nemhauser et al. 1978 that achieves a tight approximationfactor of γ = 1 − /e . For unconstrained non-monotone set submodular maximization problem, the iazadeh et al.: Online Learning via Oﬄine Greedy double greedy algorithm by Buchbinder et al. 2015 achieves a tight approximation factor of γ = 1 / .For maximizing non-monotone weak-DR continuous submodular functions within unit hyper cube [0 , n (or more generally any box constraint of the form × ni =1 [ a i , b i ] ), the continuous bi-greedyalgorithm by Niazadeh et al. 2018 achieves a tight approximation factor of γ = 1 / . For the specialcase of strong-DR, they also propose a variation of the continuous bi-greedy that is provably a faster / -approximation algorithm. See Section 6.3 for more details.

3. Approximation Algorithms for the Oﬄine Problem: Iterative Greedy

As stated earlier, we are interested in transforming a γ -approximation algorithm for the oﬄineproblem (1) to an online learning algorithm, so that the worst-case γ -regret is sublinear in thenumber of rounds T . We consider a general class of algorithms for obtaining such an approximationguarantee, named Iterative Greedy (IG) algorithms. In an algorithm in this class, roughly speaking,a sequence of locally optimal decisions with respect to a speciﬁc metric (which we elaborate onmore later) leads to picking the ﬁnal point. This point then provably provides an approximationguarantee with respect to the global optimal solution of problem (1).Formally, consider the following abstract skeleton. Suppose that we have N subproblems indexedby i ∈ [ N ] . The algorithm starts from an initial feasible point z (0) ∈ C . It then goes over the subprob-lems in the increasing order of their indices. The goal of each subproblem i is to return a new feasiblepoint z ( i ) ∈ C given the output of the previous subproblem, i.e., z ( i − . The algorithm ﬁnishes byreturning the point z ( N ) . Now, each subproblem i performs two steps:1. Local optimization: We associate a space of update parameters Θ ⊆ R d param to each subproblem.Given the previous point z ( i − and the objective function f , the goal of this step is to ﬁnd a locally optimal update parameter θ ( i ) ∈ Θ that satisﬁes: Payoff ( θ ( i ) , z ( i − , f ) ≥ , where Payoff : Θ × D × F → R d payoﬀ denotes the parameter vector payoﬀ function .2. Local update: Given the update parameter θ ( i ) and z ( i − , this step returns the next point z ( i ) = Local-update ( θ ( i ) , z ( i − ) ∈ C . Notably, we allow

Local-update : Θ × D → ∆( C ) to incorporate randomness, and therefore z ( i ) can be potentially a randomized point.The above procedure is summarized in Algorithm 1. Remark 3.

To simplify the notation, we only consider symmetric subproblems in this section,i.e., all of the subproblems have the same update parameter spaces, local optimization steps, etc. Insome of our applications in Section 6, we need slightly diﬀerent subproblems for diﬀerent i = 1 , . . . , N .Our method directly extends to that case by having index-dependent subproblems. iazadeh et al.: Online Learning via Oﬄine Greedy Algorithm 1:

Offline-IG ( C , F , D , Θ) Meta Input:

Feasible region C , function space F , deﬁned over domain D , parameter space Θ ⊆ R d param , and parameter vector payoﬀ function Payoff : Θ × D × F → R d payoﬀ . Input: function f ∈ F . Output: feasible point z ∈ C .Initialize z (0) ∈ C ; for subproblem i = 1 to N do Choose update parameter θ ( i ) so that Payoff ( θ ( i ) , z ( i − , f ) ≥ ;Set z ( i ) ← Local-update ( θ ( i ) , z ( i − ) ;Return the ﬁnal point z ← z ( N ) . Example 1.

As a simple running example, consider the problem of maximizing a monotonesubmodular set function f : 2 [ n ] → [0 , subject to the cardinality constraint k , and the classic (1 − e )-approximation greedy algorithm for this problem (Nemhauser et al. 1978). This algorithmstarts from the empty set and picks elements greedily based on their marginal value to the currentset. This problem is an example of problem (1) where D = { , } n , C = { z ∈ { , } n : z · n ≤ k } and F is the space of all monotone submodular set functions. Here, n is the all-ones vector withsize n . The greedy algorithm is an instance of Algorithm 1 with Θ = ∆([ n ]) , which is the set of allpossible probability distributions over n elements, and N = k subproblems, one for each iterationof the greedy algorithm. To describe each subproblem, for θ ∈ Θ , z ∈ D , and f ∈ F , ∀ j ∈ [ n ] : [ Payoff ( θ , z , f )] j = θ T y − [ y ] j , where [ · ] j denotes the j th coordinate value of a vector and y , [ f ( z ∪ { j } ) − f ( z )] j ∈ [ n ] is the marginalobjective value of adding element j to z . Moreover, Local-update ( θ , z ) samples an element i ∗ ∼ θ ,where θ ∈ ∆([ n ]) is a probability distribution over n elements, and returns z ∪ { i ∗ } . Note that Payoff ( θ , z , f ) ≥ guarantees θ to only have positive mass on elements with maximum marginalvalue with respect to the point z . We focus on IG algorithms that (i) provide a worst-case multiplicative approximation guarantee forproblem (1), and (ii) have a local optimization step that is robust to small errors, i.e., if we replacethe locally optimal decisions with almost locally optimal ones, the ﬁnal point still remains to beapproximately optimal (with the same approximation factor), but up to a small additive error. Thefollowing deﬁnition formalizes this robustness notion.

Definition 8 ( ( γ, δ ) -robust approximation). An instance of Algorithm 1 is a ( γ, δ ) -robustapproximation algorithm for γ ∈ (0 , and δ > , if it satisﬁes the following properties: iazadeh et al.: Online Learning via Oﬄine Greedy

1. Algorithm 1 is a γ -approximation oﬄine algorithm as in Deﬁnition 1,2. Supposed that we replace θ ( i ) with ˜ θ ( i ) for every subproblem i = 1 , . . . , N . Then, if ∀ j ∈ [ d payoﬀ ] : h Payoff ( ˜ θ ( i ) , z ( i − , f ) i j + ǫ ≥ , then we should have: ∀ ˆ z ∈ C : E [ f ( z )] ≥ γ · f ( ˆ z ) − δN ǫ , where ǫ > and [ · ] j denotes the j th coordinate value of a vector.For our purpose, we actually need a stronger version of this robustness property. This propertyessentially concerns multiple runs of the oﬄine algorithm on a group of functions in F , i.e., { f t } t ∈ [ T ] ,producing a sequence of feasible points z t ∈ C for t ∈ [ T ] , and then guarantees a robust approximationfor the summation function, i.e., P t ∈ [ T ] f t ( z ) , against errors that are small on-average over theseruns by the sequence { z t } Tt =1 . This property is satisﬁed in all of the applications that motivate ourwork, in particular in various set and continuous submodular maximization problems we study inSection 6, and in both reserve price optimization and product ranking problems. Definition 9 (Extended ( γ, δ ) -robust approximation). An instance of Algorithm 1 is anextended ( γ, δ ) -robust approximation algorithm for γ ∈ (0 , and δ > , if for any sequence offunctions f , f , . . . f T ∈ F the following property is satisﬁed:• Suppose that z t is the output of Algorithm 1 on function f t for t ∈ [ T ] when θ ( i ) t (i.e., the choiceof parameter for subproblem i of run t ) is replaced with ˜ θ ( i ) t for t ∈ [ T ] and i ∈ [ N ] . Then, if ∀ j ∈ [ d payoﬀ ] : " T X t =1 Payoff ( ˜ θ ( i ) t , z ( i − t , f t ) j + h ( T ) ≥ , we should have ∀ ˆ z ∈ C : T X t =1 E [ f t ( z t )] ≥ γ · T X t =1 f t ( ˆ z ) − δN h ( T ) . Here, z ( i ) t is the output of subproblem i ∈ [ N ] for run t ∈ [ T ] , h ( · ) : N → R + , and [ · ] j denotesthe j th coordinate value of a vector.When there is only one run of the function (i.e., T = 1 ), the extended ( γ, δ ) -robust approxi-mation guarantee boils down to the weaker ( γ, δ ) -robust approximation guarantee in Deﬁnition 8.We ﬁnish this section by revisiting our running example and demonstrating the (extended) robustapproximation property in this example. Example 1 (continued).

By digging deeper in the original analysis of the greedy algo-rithm (Nemhauser et al. 1978), we show that the greedy algorithm satisﬁes the extended ( γ, δ ) -robust approximation property for γ = 1 − e and δ = 1 . iazadeh et al.: Online Learning via Oﬄine Greedy Suppose that z ∗ = { a , . . . , a k } is the optimal solution of the oﬄine problem; that is, z ∗ =arg max z ∈{ , } n : z · n ≤ k P Tt =1 f t ( z ) . Further, let z ( i ) t be the solution returned by the i th subproblemof the greedy algorithm when the objective function is f t . Then, for every i ∈ [ k ] , T X t =1 f t ( z ∗ ) − T X t =1 f t ( z ( i − t ) (1) ≤ T X t =1 f t ( z ∗ ∪ z ( i − t ) − T X t =1 f t ( z ( i − t )= T X t =1 k X j =1 (cid:16) f t ( z ( i − t ∪ { a , . . . , a j } ) − f t ( z ( i − t ∪ { a , . . . , a j − } ) (cid:17) (2) ≤ k X j =1 T X t =1 (cid:16) f t ( z ( i − t ∪ { a j } ) − f t ( z ( i − t ) (cid:17) (3) = k X j =1 T X t =1 (cid:18) h ˜ θ ( i ) t , y ( i − t i − h Payoff ( ˜ θ ( i ) t , z ( i − t , f t ) i a j (cid:19) = k X j =1 T X t =1 n X j =1 [ ˜ θ ( i ) t ] j (cid:16) f t ( z ( i − t ∪ { j } ) − f t ( z ( i − t ) (cid:17) − h Payoff ( ˜ θ ( i ) t , z ( i − t , f t ) i a j ! = k · T X t =1 E h f t ( z ( i ) t ) − f t ( z ( i − t ) i − k X j =1 T X t =1 h Payoff ( ˜ θ ( i ) t , z ( i − t , f t ) i a j (4) ≤ k · T X t =1 E h f t ( z ( i ) t ) − f t ( z ( i − t ) i + kh ( T ) , where y ( i ) t , h f t ( z ( i ) t ∪ { j } ) − f t ( z ( i ) t ) i j ∈ [ n ] . In the above chain of inequalities, inequality (1) holdsbecause function f t is monotone, inequality (2) holds due to submodularity of functions { f t } Tt =1 ,equality (3) holds because of the deﬁnition of the payoﬀ vector in Example 1, and inequality (4)holds because of the condition in Deﬁnition 9. By rearranging the terms and taking expectations,we have: T X t =1 E h f t ( z ∗ ) − f t ( z ( i ) t ) i ≤ (1 − k ) T X t =1 E h f t ( z ∗ ) − f t ( z ( i − t ) i + h ( T ) . By recursing the above inequality for i = 1 , . . . , k , and rearranging the terms, we ﬁnally have: T X t =1 E [ f t ( z t )] ≥ (1 − (1 − k ) k ) T X t =1 f t ( z ∗ ) − h ( T ) k X i =1 (1 − k ) i − ≥ (1 − e ) T X t =1 f t ( z ∗ ) − kh ( T ) . (cid:4) Not all greedy algorithms have robust guarantees. Example 2 of Section B in the appendix showswhy, e.g., Dijkstra’s algorithm for the shortest path problem, is not robust to local errors. We use binary indicator vectors and sets interchangably in this paper. iazadeh et al.:

Online Learning via Oﬄine Greedy

4. Online Algorithm under Full Information Feedback Structure

In this section, we show how to transform an oﬄine IG algorithm (Algorithm 1) to an online learningalgorithm with a small approximate regret whenever it (i) is an extended robust approximationalgorithm (Deﬁnition 9), and (ii) satisﬁes an extra condition that we call

Blackwell reduciblity . Weﬁrst introduce this condition. Then, with the help of the Blackwell approachability (Theorem 1), wepropose a meta full information online learning algorithm as our oﬄine-to-online transformation.

The crux of our technique to transform an oﬄine IG algorithm to an online learning algorithm isthe possibility of reducing the local optimization step of Algorithm 1 to an approachable instanceof the Blackwell sequential game as in Section 2.3.

Definition 10 (Blackwell reduciblity).

An instance

Offline-IG ( C , F , D , Θ) of Algo-rithm 1 is Blackwell reducible if there exists an instance ( X , Y , p ) of the Blackwell sequential game(with a biaﬃne vector payoﬀ function p ) and a mapping AdvB : D × F → Y called synthetic Blackwelladversary function , such that:1. The player 1’s action space X is equal to the parameter space Θ in Algorithm 1; i.e., X = Θ ,and for any θ ∈ Θ , z ∈ D , f ∈ F , we have Payoff ( θ , z , f ) = p ( θ , AdvB ( z , f )) .2. The set S , { u ∈ R d payoﬀ : [ u ] j ≥ , j ∈ [ d payoﬀ ] } is response-satisﬁable (Deﬁnition 4). Example 1 (continued).

The greedy algorithm of Nemhauser et al. (1978) is Blackwellreducible. Consider an instance ( X , Y , p ) of Blackwell where X = Θ = ∆([ n ]) and Y = [0 , n . Thesynthetic Blackwell adversary function is AdvB ( z , f ) = [ f ( z ∪ { j } ) − f ( z )] j ∈ [ n ] , and the biaﬃneBlackwell vector payoﬀ function is p ( θ , y ) = θ T y n − y . Recall that n is all-ones n -dimensionalvector. Furthermore, set S is response-satisﬁable because for every player 2’s action y ∈ Y , playing θ = e j ∗ with j ∗ = argmax j ∈ [ n ] y j implies that p ( θ , y ) ≥ . If the oﬄine algorithm (Algorithm 1) is Blackwell reducible, then one can think of the followingapproach to transform it into an online learning algorithm: associate an instance of the Blackwellsequential game to each subproblem i following the Blackwell reducibility, and then running N parallel online approachable algorithms for these Blackwell instances to ﬁnd a sequence of assign-ments of the update parameter of each subproblem i over time. We further need to show how tosynchronize these parallel runs through a proper communication between them, so as to constructa sequence of feasible solutions z , . . . , z T guaranteeing a small approximate regret. Note that Y = [0 , n because f : 2 [ n ] → [0 , is monotone non-decreasing. iazadeh et al.: Online Learning via Oﬄine Greedy Recall that our goal in the oﬄine problem is to solvethe optimization problem max z ∈C f ( z ) , where f ∈ F . The oﬄine problem admits a polynomial timeIG γ -approximation algorithm, Offline-IG ( C , F , D , Θ) , presented in Algorithm 1. This algorithmsolves N subproblems sequentially, building the solution step by step. In step/subproblem i of thisalgorithm, we ﬁrst update parameters θ ( i ) ∈ Θ ⊆ R d param using the previous point z ( i − , and thenreturn the next point z ( i ) to feed to the next subproblem. The algorithm ﬁnishes by returning theﬁnal point z ( N ) .As stated earlier, we assume that the oﬄine problem is Blackwell reducible; that is, we can deﬁnethe Blackwell instance ( X , Y , p ) and synthetic Blackwell adversary function AdvB : D × F → Y thatsatisfy the conditions in Deﬁnition 10. Although this deﬁnition might seem technical, verifying itfor many oﬄine algorithms is indeed straightforward; see Section 6.For the online version of the above oﬄine algorithm, the meta input is feasible region C , functionspace F , which is deﬁned over domain D , and parameter space Θ ⊆ R d param . We further considerhaving access to an online Blackwell algorithm AlgB, player 1’s strategy in the above Blackwellsequential game ( X , Y , p ) , where such algorithm (i) ensures that the distance between the averagevector payoﬀ T P Tt =1 p ( x t , y t ) and set S goes to zero with rate g ( T ) = O (cid:18) D ( p ) q log( d ) T (cid:19) againstany adversarial player 2’s strategy (see Theorem 1), and (ii) can be implemented in polynomialtime having access to a separation oracle for the convex set S . As stated earlier, the existence ofsuch an algorithm follows from the work of Even-Dar et al. 2009 and Abernethy et al. (2011); seeRemark 1. We consider N parallel copies of this algorithm, one for each subproblem i ∈ [ N ] . It isalso important to note that in our application, set S is the positive orthant, for which a polynomialtime separation oracle exists. Our algorithm that takes advantage of N parallel copies of the onlineBlackwell algorithm is summarized in Algorithm 2.Let AlgB ( i ) be the copy of the above online Blackwell algorithm associated to subproblem i ∈ [ N ] .This copy handles the local optimization step of subproblem i in the Offline-IG ( C , F , D , Θ) inevery round t ∈ [ T ] without knowing function f t . Consider the decision-making process of thisonline algorithm in round t . The inputs prior to this round are all the update parameters of thesubproblem i in the ﬁrst t − rounds, i.e., θ ( i )1 , . . . , θ ( i ) t − , and the realized vector payoﬀs of the ﬁrst t − rounds against player 2 in the Blackwell sequential game associated to subproblem i , i.e., p ( θ ( i )1 , y ( i )1 ) , . . . , p ( θ ( i ) t − , y ( i ) t − ) . We consider a particular player 2 for this Blackwell sequential game.More explicitly, the synthetic adversary function AdvB, which is part of our reduction, plays therole of player 2 in any round t , i.e., y ( i ) t = AdvB ( z ( i − t , f t ) . Given the input prior to time t , AlgB ( i ) returns the new update parameter θ ( i ) t .After the online Blackwell algorithm AlgB ( i ) returns the update parameter θ ( i ) t , we returnthe point z ( i ) t by calling the Local-update function in the oﬄine algorithm, i.e., we set z ( i ) t to iazadeh et al.: Online Learning via Oﬄine Greedy Local-update ( θ ( i ) t , z ( i − t ) . Observe that the point returned by the subproblem i , i.e., z ( i ) t , dependson the point returned by the previous subproblem z ( i − t . This highlights that while each onlineBlackwell algorithm is responsible for one subproblem, they communicate with each other to buildthe ﬁnal solution, where this communication is structured by the oﬄine algorithm through the Local-update function. After obtaining the point z ( i ) t , we move to subproblem i + 1 .Finally note that simulating the actions of our particular player 2 to determine the realized vectorpayoﬀs of each round, and computing/sending this feedback at the end of each round to AlgB ( i ) (asplayer 1) in a computationally eﬃcient manner, require the following:• Knowing the point z ( i − t picked by subproblem i − at time t : This is possible as we go overour subproblems in the order i = 1 , . . . , N in each round t .• Knowing the function f t : This is possible because here we study the full information feedbackstructure, where under this structure we have access to f t after we choose point z t = z ( N ) t .• Being able to compute the realized vector payoﬀ p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) eﬃciently given θ ( i ) t , f t , and z ( i − t . This is possible as this quantity is equal to Payoff ( θ ( i ) t , z ( i − t , f t ) , which can beevaluated in polynomial time as Offline-IG ( C , F , D , Θ) is a polynomial time algorithm. The following theorem, which bounds the regret of our algorithm, isthe main result of this section.

Theorem 2 (Full information oﬄine-to-online transformation) . Suppose that aninstance of the algorithm

Offline-IG ( C , F , D , Θ) for the oﬄine problem (1) satisﬁes the followingproperties:• It is an extended ( γ, δ ) -robust approximation for γ ∈ (0 , and δ ∈ R + , as in Deﬁnition 9.• It is Blackwell reducible, that is, we can deﬁne the Blackwell sequential game ( X , Y , p ) andsynthetic Blackwell adversary function AdvB : D × F → Y that satisfy the conditions in Deﬁni-tion 10.Consider the full-information adversarial online learning version of the problem (1) , and let AlgB bea polynomial time Blackwell algorithm for ( X , Y , p ) as in Remark 1. Then, for this online problem, Online-IG ( C , F , D , Θ , AlgB ) runs in polynomial time and satisﬁes the following γ -regret bound: γ -regret ( Online-IG ( C , F , D , Θ , AlgB )) ≤ O (cid:18) D ( p ) N δ q log( d payoﬀ ) T (cid:19) , where N is the number of subproblems, d payoﬀ is the dimension of vector payoﬀs, and D ( p ) , deﬁnedin Equation (2) , is the ℓ ∞ -diameter of the vector payoﬀ space.Proof of Theorem 2. Consider a subproblem i ∈ [ N ] . Let S be the d payoﬀ -dimensional positiveorthant; see the Blackwell reducibility deﬁnition and its associated approachable set S in Deﬁni-tion 10. Because S is response-satsiﬁable and projection onto S can be done in polynomial-time, iazadeh et al.: Online Learning via Oﬄine Greedy Algorithm 2:

Full-information Online Learning Meta-algorithm (

Online-IG ) Meta Input:

Feasible region C , function space F , deﬁned over domain D , and parameterspace Θ ⊆ R d param . Oﬄine algorithm and reduction gadgets:

An instance

Offline-IG ( C , F , D , Θ) ofAlgorithm 1, the Blackwell instance ( X , Y , p ) and synthetic Blackwell adversary functionAdvB : D × F → Y as this oﬄine algorithm is Blackwell reducible (Deﬁnition 10) .

Input:

Number of rounds T ; access to a Blackwell online algorithm AlgB . Output:

Points z , z , . . . , z T ∈ C .Initialize N parallel instances { AlgB ( i ) } Ni =1 of the online algorithm AlgB ; for round t = 1 to T do Initialize z (0) t ∈ C ; for subproblem i = 1 to N do Choose update parameter θ ( i ) t by querying online algorithm AlgB ( i ) given the updateparameters and vector payoﬀs prior to round t in the Blackwell sequential game ofsubproblem i , that is, θ ( i ) t ← AlgB ( i ) (cid:16) θ ( i )1 , . . . , θ ( i ) t − , p ( θ ( i )1 , y ( i )1 ) , . . . , p ( θ ( i ) t − , y ( i ) t − ) (cid:17) ;Set z ( i ) t ← Local-update ( θ ( i ) t , z ( i − t ) ∈ C ; end Play the ﬁnal point z t ← z ( N ) t ;< Full information feedback: adversary reveals function f t ∈ F > ; for i = 1 to N do Give feedback p ( θ ( i ) t , y ( i ) t ) ← Payoff ( θ ( i ) t , z ( i − t , f t ) to the Blackwell AlgorithmAlgB ( i ) (as the vector payoﬀ of round t against player 2) ; // Note that y ( i ) t = AdvB ( z ( i − t , f t ) for player 2 implicitly, although we do not need to evaluate AdvB tocompute this action explicitly. endend there exists a polynomial-time online algorithm AlgB (with N parallel copies { AlgB ( i ) } Ni =1 ) thatguarantees Blackwell approachability for the Blackwell instance corresponding to subproblem i with g ( T ) = O (cid:18) D ( p ) q log( d payoﬀ ) T (cid:19) , based on Theorem 1. Therefore, we have: d ∞ T T X t =1 p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) , S ! = d ∞ T T X t =1 p (cid:16) θ ( i ) t , y ( i ) t (cid:17) , S ! ≤ g ( T ) . Because the target set S is the positive orthant, we have d ∞ T T X t =1 p (cid:16) θ ( i ) t , AdvB ( z i − t , f t ) (cid:17) , S ! ≤ g ( T ) ⇐⇒ ∀ j : " T X t =1 p (cid:16) θ ( i ) t , AdvB ( z i − t , f t ) (cid:17) j ≥ − T g ( T ) iazadeh et al.: Online Learning via Oﬄine Greedy Because of Blackwell reduciblity,

Payoff (cid:16) θ ( i ) t , z ( i − t , f t (cid:17) = p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) . Therefore, ∀ j ∈ [ d payoﬀ ] : " T X t =1 Payoff (cid:16) θ ( i ) t , z ( i − t , f t (cid:17) j ≥ − T g ( T ) . (3)Finally, because Algorithm 1 is an extended ( γ, δ ) -robust approximation (see Deﬁnition 9), fromEquation (3) we have: T X t =1 E [ f t ( z t )] ≥ γ · T X t =1 f t ( z ∗ ) − δN T g ( T ) = γ · T X t =1 f t ( z ∗ ) − O (cid:18) δN D ( p ) q log( d payoﬀ ) T (cid:19) , which ﬁnishes the proof. Here, z ∗ is the optimal in-hindsight feasible solution, i.e., z ∗ = argmax z ∈C P Tt =1 f t ( z ) . (cid:4) We ﬁnish this section by reviewing our running example (Example 1) and mentioning the regretbound we get as a direct corollary of Theorem 2.

Example 1 (continued).

The greedy algorithm in Nemhauser et al. (1978) is an extended (1 − e , -robust approximation algorithm and Blackwell reducible. It has N = k subproblems, the ℓ ∞ diameter of the payoﬀ space is D = 1 , and the dimension of vector payoﬀs is d = n . Therefore,by invoking Algorithm 2 given any Blackwell algorithm satisfying the approachability bound inTheorem 1, we obtain the following bound: (cid:18) − e (cid:19) -Regret ( Algorithm 2 ) ≤ O ( k p log( n ) T ) , which exactly matches the bound known in Streeter and Golovin (2008) for the same problem.

5. Online Algorithm under Bandit Information Feedback Structure

So far, we presented a framework to transform an oﬄine iterative greedy algorithm to its onlinecounterpart under the full information feedback structure. While the full information setting pro-vides the theoretical foundations for the rest of our results, from an application point of view, it isless motivated. In almost all applications of our framework in revenue management and online deci-sion making (e.g., product ranking problem and reserve price optimization), assuming the learnerhas full information feedback is rather a strong assumption.In this section, we seek to relax this assumption, and try to understand if our framework canbe extended to the more challenging bandit feedback structure setting. Under the bandit feedbackstructure, at the end of each round t , the learner faces an additional challenge: he only has accessto f t ( z t ) , rather than the entire function f t like in the full information setting. Such a feedbackstructure prevents the online Blackwell algorithms AlgB ( i ) to receive the feedback they require. iazadeh et al.: Online Learning via Oﬄine Greedy To overcome this challenge, we ﬁrst consider a stylized bandit variation of the sequential Blackwellgame. We characterize a new notion of approachability that we call bandit Blackwell approachability and provide an algorithm achieving the information-theoretic tight approachability bound for thisproblem. This algorithm uses an algorithm for the full information version of the Blackwell sequentialgame in a blackbox fashion.We then introduce the extra ingredient that is needed for our bandit transformation, whichis the possibility of creating an unbiased estimator for the vector payoﬀ of the Blackwell gamesassociated with diﬀerent subproblems. Putting all these pieces together, we propose a bandit onlinelearning algorithm with the help of our bandit Blackwell approachability. We highlight that thisapproach essentially uses the unbiased estimators to obtain bandit-style feedback for the onlinelearning problems of each subproblem, leading to an eﬃcient overall bandit learning algorithm witha sublinear γ -regret. In the bandit online learning version of problem (1), an online algorithm can only see the value ofthe function at the particular point that is picked in that round. Therefore, in our transformation,multiple online Blackwell algorithms compete over a single piece of information in order to estimatethe vector payoﬀs, where estimating the vector payoﬀ of a Blackwell algorithm can be typically doneby taking a costly “exploration” move, tailored to that algorithm.With the goal of properly modeling this paradigm at a lower level, we propose the notion of a bandit Blackwell sequential game , characterized by the extended tuple ( X , Y , p , ˆ p ) . In this variant,player 1 makes an additional decision in each round: whether to explore or not. Only if player 1chooses to explore in round t , do they receive the unbiased estimator ˆ p ( x t , y t ) whose expectation isthe vector payoﬀ for that round p ( x t , y t ) . However, player 1 is punished by an additive cost D ( p ) .If player 1 refrains from exploration, they neither receive any feedback nor any punishment. Player1’s new goal is to minimize the distance from the time-averaged payoﬀ to the target set S plus theirtime-averaged exploration penalty. Definition 11 (Bandit Blackwell approachability).

A closed convex target set S is g ( T ) -bandit-approachable in the bandit Blackwell sequential game ( X , Y , p , ˆ p ) if there exists a ban-dit player ’s strategy such that for every player 2’s strategy, the resulting sequence of actions satisfy d ∞ T T X t =1 p ( x t , y t ) , S ! + E (cid:20) T D ( p ) · ( ) (cid:21) ≤ g ( T ) , where ( T is information-theoreticallytight. See Section C.3 for details. iazadeh et al.: Online Learning via Oﬄine Greedy Theorem 3.

A closed convex set S is O (cid:16) D ( p ) / D ( ˆ p ) / (log d ) / T − / (cid:17) -bandit-approachablein the bandit Blackwell sequential game ( X , Y , p , ˆ p ) if and only if S is response-satisﬁable in theBlackwell game ( X , Y , p ) . In particular, when S is response satisﬁable, the online algorithm AlgBB(Algorithm 3) achieves this approachability bound in polynomial time, given access to a separationoracle for S .Proof sketch of Theorem 3. To see the only if direction of the ﬁrst part of the theorem, banditBlackwell approachability implies Blackwell approachability. Speciﬁcally, if d ∞ T T X t =1 p ( x t , y t ) , S ! + E (cid:20) T C · ( ) (cid:21) ≤ O (cid:16) D ( p ) / D ( ˆ p ) / (log d ) / T − / (cid:17) , then we must have d ∞ (cid:16) T P Tt =1 p ( x t , y t ) , S (cid:17) ≤ O (cid:16) D ( p ) / D ( ˆ p ) / (log d ) / T − / (cid:17) , and hence this ℓ ∞ -distance is vanishing as T → + ∞ . This, in turn, implies that the target set S is response satisﬁ-able (see Theorem 1). Note that while Theorem 1 is stated for a speciﬁc g ( T ) , the only if directionof this theorem holds for any vanishing approachability bound (Blackwell 1956).To see the if direction and the second part of the theorem, we consider a simple algorithm thatuses a (full information) Blackwell algorithm AlgB as a blackbox. We pick an algorithm AlgB thatsatisﬁes the approachability bound of Theorem 1, and can obtain this bound in polynomial timegiven a separation oracle for S ; see Remark 1. At the beginning of each round, our bandit algorithmplays the last suggested action by AlgB. It then explores randomly with probability q by ﬂippingan independent coin. Based on the outcome of the coin, it either updates the state of AlgB usingthe unbiased payoﬀ feedback it gets (exploration) and queries AlgB for suggesting a new action tofollow, or decides not to explore with probability − q and refrains the state of AlgB. These stepsare summarized in Algorithm 3.As for the running time, the above algorithm will run in polynomial time given a separationoracle for S based on Remark 1. As for the approachability bound, at a high level, if we imaginethat unbiased payoﬀs are the actual payoﬀs in the Blackwell game, then the expected distance oftime-averaged unbiased vector payoﬀ from S is roughly equal to the same quantity for only roundsthat we explore. There are qT such rounds in expectation. Therefore, the expected distance is upperbounded by O ( D ( ˆ p ) (log( d )) / q − / T − / ) due to the approachability of AlgB for this imaginaryBlackwell sequential game (Theorem 1). Also, the algorithm gets penalized on average by O ( D ( p ) q ) due to exploring. Taking expectation to replace unbiased estimators with the actual payoﬀs andbalancing the two terms in regret by setting q = D ( p ) − / D ( ˆ p ) / (log d ) / T − / gives the ﬁnalbound. See Section C.1 in the appendix for a detailed proof with a more involved argument. (cid:4) iazadeh et al.: Online Learning via Oﬄine Greedy Algorithm 3:

Bandit Blackwell Online Algorithm (AlgBB)

Meta Input:

Parameter q ∈ [0 , , bandit Blackwell sequential game ( X , Y , p , ˆ p ) . Input:

Number of rounds T , blackbox access to full information online algorithm AlgB forthe Blackwell sequential game ( X , Y , p ) , achieving approachability bound of Theorem 1 . Output:

Actions { x t } t ∈ [ T ] and binary signals { π t } t ∈ [ T ] , where x t ∈ X , and π t ∈ { Yes , No } for any t ∈ [ T ] .Initialize x new by sending the initial query to AlgB ; for round t = 1 to T do Play the action x t ← x new ; Set π t to be YES with probability q , and No withprobability − q ; if π t = Yes then

Obtain ˆ p ( x t , y t ) and send ˆ p ( x t , y t ) /q as feedback to AlgB ; // AlgB gets a new feedbackin each exploration round, i.e., round t where π t = Yes . Update x new by querying AlgB given the actions and realized unbiased estimatorvector payoﬀs in exploration rounds prior to round t + 1 , i.e., x new ← AlgB ( { ( x τ , ˆ p ( x τ , y τ ) : τ ≤ t, π τ = Yes } ) ; Remark 4.

Our notion of bandit Blackwell approachability and the algorithm that achieves thetight bound (Algorithm 3) bear some resemblance to the ǫ -greedy algorithm in the classic banditsetting, where in every round of this algorithm, we decide whether or not to explore, and when weexplore in a round we assume we suﬀer from the maximum possible regret in this round. Remark 5.

The vanilla version of AlgBB needs to tune exploration probability q based on thehorizon T to obtain the bound in Theorem 3. However, by using the standard doubling trick inonline learning (e.g., see Bubeck et al. (2015)) in a blackbox fashion, one can boost Algorithm 3 towork for unknown but bounded T : the new algorithm starts with a guess for horizon (e.g., T = 1 )and sets q according to this guess. Each time it reaches the guessed horizon, it doubles its guess, andrestarts by tuning a new value for q and initializing again. The doubling trick is a well-known ideain the online learning literature that can be traced back to the classic work of Auer et al. (2002).We refer the reader to aforementioned work, and omit the details here for brevity. Similar to our full information oﬄine-to-online transformation, which gave us algorithm

Online-IG in Section 4, we transform an oﬄine IG algorithm to a bandit online learning algorithm by associ-ating an instance of the bandit Blackwell sequential game to each subproblem i ∈ [ N ] of the oﬄinealgorithm. That is, we crucially rely on a reduction from the local optimization step of each sub-problem in Algorithm 1 to an approachable instance of the bandit Blackwell sequential game as inDeﬁnition 11. Such a reduction is possible if the oﬄine algorithm is Bandit Blackwell reducible ; seethe following deﬁnition. iazadeh et al.:

Online Learning via Oﬄine Greedy Definition 12 (Bandit Blackwell Reducibility).

An instance

Offline-IG ( C , F , D , Θ) of Algorithm 1 is bandit Blackwell reducible if there is an instance ( X , Y , p , ˆ p ) of bandit Blackwellsequential game (Section 5.1) and an exploration sampling device ExpS : Θ × D → ∆ (cid:0) R d payoﬀ × C (cid:1) ,such that:1. Offline-IG ( C , F , D , Θ) is Blackwell reducible as in Deﬁnition 10, using the Blackwell sequen-tial game ( X , Y , p ) (with biaﬃne p ) and the synthetic Blackwell adversary function AdvB.2. If y = AdvB ( z , f ) for some f ∈ F , z ∈ D , then ˆ p ( θ , y ) = f ( z exp ) w exp for all θ ∈ Θ , where ( w exp , z exp ) ∼ ExpS ( θ , z ) . Otherwise, ˆ p ( θ , y ) = p ( θ , y ) .3. The above ˆ p is an unbiased estimator for the actual vector payoﬀ, i.e., for all θ ∈ Θ , y ∈ Y : E [ ˆ p ( θ , y )] = p ( θ , y ) .4. The exploration sampling device ExpS ( θ , z ) returns its samples ( w exp , z exp ) in polynomial time.To better understand the bandit Blackwell reducibility, we revisit our running example. Example 1 (continued).

The greedy algorithm of Nemhauser et al. (1978) is also banditBlackwell reducible. As stated in Section 4, this algorithm is Blackwell reducible. Recall that in thisexample, the biaﬃne Blackwell payoﬀ is p ( θ , y ) = θ T y n − y , where n is all ones n -dimensionalvector. We will construct an exploration sampling device ExpS that returns ( w exp , z exp ) such that if ∀ θ ∈ Θ , we have y = AdvB ( z , f ) for some f ∈ F , z ∈ D , we set ˆ p ( θ , y ) = f ( z exp ) w exp and we musthave E [ ˆ p ( θ , y )] = p ( θ , y ) . The exploration sampling device ExpS works as follows. Given a point z ∈ C (which represents a set of elements) and parameter θ ∈ Θ , it draws j ∼ Uniform { , . . . , n } andreturns (i) w exp = n ( θ j n − e j ) , (ii) z exp = z ∪ { j } . Now, ˆ p is an unbiased estimator of p , because: E [ ˆ p ( θ , AdvB ( z , f ))] = E [ f ( z exp ) w exp ]= E [ n ( θ j f ( z ∪ { j } ) n − f ( z ∪ { j } ) e j )]= n X j ∈ [ n ] θ j f ( z ∪ { j } ) − [ f ( z ∪ { } ) , . . . , f ( z ∪ { n } )] T = n X j ∈ [ n ] θ j ( f ( z ∪ { j } ) − f ( z )) − [ f ( z ∪ { } ) , . . . , f ( z ∪ { n } )] T + f ( z ) n = θ T y n − y = p ( θ , AdvB ( z , f )) , where y , [ f ( z ∪ { j } ) − f ( z )] j =1 , ,...,n = AdvB ( z , f ) . Here, the fourth equation holds because P j ∈ [ n ] θ j = 1 . Observe that the exploration sampling device ExpS has an intuitive interpretation,at every round, it randomly picks one of the elements j ∈ [ n ] , and evaluates the marginal beneﬁt ofadding element j to z . iazadeh et al.: Online Learning via Oﬄine Greedy When the oﬄine algorithm (Algorithm 1) is bandit Black-well reducible (Deﬁnition 12), we can employ a similar oﬄine-to-online transformation mentionedin Section 4. However, instead of associating an instance of the Blackwell game to each subproblem,we associate an instance of the bandit Blackwell game. To obtain unbiased estimators for the vectorpayoﬀs of these bandit Blackwell instances, we rely on the exploration sampling devices that arepromised by Deﬁnition 12. This sampling device allows us to strike a balance between explorationand exploitation in all of the online bandit Blackwell games. We formalize this transformation ofthe oﬄine algorithm to an online bandit algorithm called

Bandit-IG in Algorithm 4.Suppose that the oﬄine algorithm

Offline-IG ( C , F , D , Θ) is given. For the particular banditBlackwell sequential game ( X , Y , p , ˆ p ) coming from Deﬁnition 12, we use AlgBB (Algorithm 3) todetermine the strategy of player 1. Such an online bandit Blackwell algorithm as player 1 ensuresthat the distance between the average vector payoﬀ T P Tt =1 p ( x t , y t ) and set S plus the explorationpenalty goes to zero with rate g ( T ) = O ( D ( p ) / D ( ˆ p ) / (log d ) / T − / ) ; see Theorem 3.We dedicate a copy of the above algorithm AlgBB ( i ) to each subproblem i ∈ [ N ] . We queryalgorithms AlgBB ( i ) in the increasing order of their index i . Consider the online bandit Blackwellalgorithm AlgBB ( i ) , and assume that in round t , we query this algorithm. The algorithm returnstwo outputs: the update parameter θ ( i ) t and a binary signal π ( i ) t ∈ { Yes , No } . If π ( i ) t = Yes , thealgorithm explores: it samples ( w ( i ) t, exp , z ( i ) t, exp ) from the exploration sampling device ExpS ( θ ( i ) t , z ( i − t ) .Note that the exploration sampling device uses the update parameter θ ( i ) t and the point returned bythe previous subproblem z ( i − t . This indeed allows the subproblems to communicate with each otherduring exploration. The algorithm then plays z t = z ( i ) t, exp and provides the payoﬀ vector feedback ˆ p ( i ) t = f t ( z t ) w ( i ) t, exp to AlgBB ( i ) . This feedback is only used by the online bandit Blackwell algorithmAlgBB ( i ) , not the rest of N − bandit Blackwell algorithms. We highlight that if AlgBB ( i ) decides toexplore in round t , the rest of bandit Blackwell algorithms will not be queried. Finally, if π ( i ) t = No ,the algorithm exploits: it returns point z ( i ) t = Local-update ( θ ( i ) t , z ( i − t ) . Again observe that duringexploitation, subproblem i also communicates with subproblem i − through using z ( i − t . Theorem 4 bounds the regret of the

Bandit-IG algorithm. The proofis deferred to Section C.2 in the appendix.

Theorem 4 (Bandit information oﬄine-to-online transformation) . Suppose that aninstance of

Offline-IG ( C , F , D , Θ) for the oﬄine problem (1) satisﬁes the following properties:• It is an extended ( γ, δ ) -robust approximation for γ ∈ (0 , and δ > , as in Deﬁnition 9.• It is bandit Blackwell reducible; that is, we can deﬁne the bandit Blackwell sequential game ( X , Y , p , ˆ p ) and exploration sampling device ExpS : Θ × D → ∆ (cid:0) R d payoﬀ × C (cid:1) that satisfy thecontions in Deﬁnition 12. iazadeh et al.: Online Learning via Oﬄine Greedy Algorithm 4:

Bandit Online Learning Meta-algorithm (

Bandit-IG ) Meta Input:

Feasible region C , function space F , deﬁned over domain D , parameter space Θ ⊆ R d param . Oﬄine algorithm and reduction gadgets:

An instance

Offline-IG ( C , F , D , Θ) ofAlgorithm 1; this algorithm is bandit Blackwell reducible as in Deﬁnition 12, using thebandit Blackwell instance ( X , Y , p , ˆ p ) and exploration sampling deviceExpS : Θ × D× → ∆ (cid:0) R d payoﬀ × C (cid:1) . Input:

Number of rounds T ; access to a bandit Blackwell online algorithm AlgBB . Output:

Points z , z , . . . , z T ∈ C .Initialize N parallel instances { AlgBB ( i ) } Ni =1 of the online algorithm AlgBB ; for round t = 1 to T do Initialize z (0) t ∈ C ; for subproblem i = 1 to N do Choose the update parameter θ ( i ) t ∈ Θ and exploration signal π ( i ) t ∈ { Yes , No } byquerying online algorithm AlgBB ( i ) given the update parameters and vector payoﬀs ˆ p of exploration rounds prior to round t in the bandit Blackwell sequential game ofsubproblem i , that is (cid:16) θ ( i ) t , π ( i ) t (cid:17) ← AlgBB ( i ) (cid:16) θ ( i )1 , . . . , θ ( i ) t − , { ˆ p ( θ ( i ) τ , y ( i ) τ ) } τ ≤ t − π ( i ) τ = Yes (cid:17) ; if π ( i ) t = Yes , then Sample ( w ( i ) t, exp , z ( i ) t, exp ) from the exploration sampling device ExpS ( θ ( i ) t , z ( i − t ) ;Play the exploration point z t ← z ( i ) t, exp ;< Bandit information feedback: observe f t ( z t ) > ;Give payoﬀ vector feedback ˆ p ( i ) t = f t ( z t ) · w ( i ) t, exp to AlgBB ( i ) ; Skip immediately tothe beginning of the next round t + 1 ;Set z ( i ) t ← Local-update ( θ ( i ) t , z ( i − t ) ;Play the ﬁnal point z t ← z ( N ) t and receive bandit feedback f t ( z t ) , and ignore it. Consider the bandit-information adversarial online learning version of problem (1) , and let AlgBBbe a polynomial time bandit Blackwell algorithm for ( X , Y , p , ˆ p ) as in Theorem 3. Then, for thisonline problem, Bandit-IG ( C , F , D , Θ , AlgBB ) runs in polynomial time and satisﬁes the following γ -regret bound: γ -regret ( Bandit-IG ( C , F , D , Θ , AlgBB )) ≤ O (cid:16) D ( p ) / D ( ˆ p ) / N δ (log( d payoﬀ )) / T / (cid:17) , where N is the number of subproblems and d payoﬀ is the dimension of vector payoﬀs. iazadeh et al.: Online Learning via Oﬄine Greedy We ﬁnish this section by wrapping up our running example (Example 1) and mentioning thebandit regret bound we get as a direct corollary of Theorem 4.

Example 1 (finished).

The greedy algorithm in Nemhauser et al. (1978) satisﬁes (1 − e , -robust approximation and is bandit Blackwell reducible. It has N = k subproblems and ℓ ∞ diameterof ˆ p is D ( ˆ p ) = O ( n ) . Therefore, by invoking Algorithm 4 given any bandit Blackwell algorithmsatisfying the approachability bound in Theorem 3, we obtain the following bandit regret bound: (cid:18) − e (cid:19) -regret ( Algorithm 4 ) ≤ O ( kn / (log n ) / T / ) , which in turn, by noting that k can be as large as n , gives us an immediate improvement over regretbound of O (cid:0) k ( n log n ) / T / (log T ) (cid:1) in Streeter and Golovin (2007, 2008).

6. Applications to Revenue Management and Combinatorial Optimization

We have already showed how to ﬁt monotone submodular maximization into our framework throughExample 1. In this section, we apply our framework to three other selected problems: prod-uct ranking through sequential submodular maximization, personalized reserve price optimizationin second-price auction, and non-monotone submodular maximization. Our framework results inimproved/new regret bounds in all mentioned applications for both full-information and bandit set-tings. We should emphasize that our framework is quite general and would potentially capture othergreedy-solvable/approximate problems in revenue management and combinatorial optimization.

Problem deﬁnition.

In the

Product Ranking Problem , a platform aims to characterize aranking of n items, where a ranking is a permutation π over the items. Here, items on positions withlower indices have more visibility. The goal of the platform is to maximize its user engagement (alsoknown as market share), which is the probability that a consumer does not leave the platform withouttaking a desired action. This action can be a click, purchase, or even installing an application.For the sake of presentation, assume that the desired action is clicking on an item. We considerthe model proposed by Asadpour et al. (2020), which is inspired by an earlier model proposed inFerreira et al. (2019). In this model, a consumer u is characterized by a patience level θ u togetherwith a monotone non-decreasing submodular set function κ u : 2 [ n ] −→ [0 , . A consumer of type ( θ u , κ u ) , when oﬀered a ranked list of products π = ([ π ] , [ π ] , . . . , [ π ] n ) , inspects the ﬁrst θ u productsand clicks with probability κ u ( { [ π ] , . . . , [ π ] θ u } ) . The platform knows the distribution G from which u is selected. The goal is to pick a permutation π maximizing the probability of click E u ∼G [ κ u ( { [ π ] , . . . , [ π ] θ u } )] . For a wide range of choice models in the literature, the probability of a purchase from an oﬀeredset S can be described using a monotone submodular function κ u . This includes multinomial logit,nested logit, and paired combinatorial logit models. See Kök et al. (2008) for details on these models. iazadeh et al.: Online Learning via Oﬄine Greedy Product ranking problem as sequential submodular maximization.

A slight reformulation of theabove model casts the product ranking problem as a special case of a class of optimization problemsover permutations called sequential submodular maximization (Asadpour et al. 2020). We deﬁne thesequential submodular maximization problem as follows. Given a sequence of monotone submodularset functions { f ( · ) , . . . , f n ( · ) } , and a sequence of non-negative weights λ = ( λ , . . . , λ n ) , we aim toﬁnd a ranking π that maximizes n X i =1 λ i f i ( { [ π ] , . . . , [ π ] i } ) , where [ π ] i denotes the item on the i th position of ranking π. In the aforementioned choice model, forall i ∈ [ n ] , we have f i ( S ) , E u ∼G [ κ u ( S ) | θ u = i ] , representing the probability of clicks functions, and λ i , P u ∼G ( θ u = i ) , representing the probability that a consumer has patience level i . The probabilitythat a consumer clicks on at least one product when oﬀered a ranked set of products π is then f ( π ) , λ f ( { [ π ] } ) + λ f ( { [ π ] , [ π ] } ) + . . . + λ n f n ( { [ π ] , . . . , [ π ] n } ) , where f i ’s are monotone submodular functions and λ i ’s are non-negative. To simplify the analysis,notice that while f is a function of a set of ranked/ordered items, f i is a function of a set that hasat most i items for each i ∈ [ n ] . Online problem.

In the oﬄine setting, the platform knows G , which translates to knowing theprobability of clicks { f ( · ) , . . . , f n ( · ) } and the probability distribution of the patience level λ =( λ , . . . , λ n ) . We study the online user-engagement-maximization ranking problem where on everyround t, a distribution over patience level λ t and the expected probability of click function f t , whichis made of { f t, ( · ) , . . . , f t,n } , are chosen adversarially. The platform, whose goal is to maximize itsuser-engagement, chooses a ranking π t without observing λ t and f t . After choosing the ranking, theplatform observes the function f t in the full-information setting. In the bandit setting, the platform only observes whether or not the consumer clicks on at least one item, but not which item wasclicked on. To the best of our knowledge, the online adversarial version of this problem has not beenstudied before, neither under full information nor bandit setting.Asadpour et al. (2020) showed the oﬄine problem of sequential submodular maximization isNP-hard. They also proposed an optimal (1 − /e ) -approximation algorithm, and a simple -approximation greedy algorithm for this oﬄine problem. Notably, Ferreira et al. (2019) studied aspecial case of the above model for a particular choice model, where consumers click on itemsindependently with given probabilities. They proposed the same simple − approximation greedyalgorithm, and a "learning-then-earning" algorithm for the oﬄine PAC-learning version of theirproblem where a learner has access to samples from user choices. iazadeh et al.:

Online Learning via Oﬄine Greedy Oﬄine algorithm.

In this paper, rather than the optimal algorithm, we also focus on this greedyalgorithm that achieves − approximation, and transform it into an online adversarial learningalgorithm. Our oﬄine algorithm is presented in Algorithm 5. The input to this algorithm is asequential submodular function f : Π → [0 , , where Π is the set of ranking permutations of n items,i.e., Π = { , , . . . , n } n . In this case, having [ π ] i = 0 represents putting no item at position i andmultiple positions can display the same item for simplicity. In this problem, both the domain and thefeasible region are D = C = Π . Let S i denote the set of subsets of [ n ] that consist of at most i items,i.e., S i = { S ⊆ [ n ] | | S | ≤ i } . We have f ( π ) = P nj =1 λ j f j ( { [ π ] , . . . , [ π ] j } ) where f i is a monotonesubmodular function that takes an element of S i as input and returns a probability in [0 , , i.e., f i : S i → [0 , . Algorithm 5, taken from Asadpour et al. (2020) and Ferreira et al. (2019), is a greedyalgorithm with n subproblems, where each subproblem corresponds to a position on the ranking.The algorithm starts ﬁlling up the positions from the top and for each position i, it chooses theitem that has the highest marginal probability of click. The update π ( i ) ← π ( i − + z ( i ) e i representsthe action of adding item z ( i ) to position i. Algorithm 5:

Greedy for Sequential Submodular Maximization (Asadpour et al. 2020)

Input:

A sequential submodular function f , which can be represented using a sequence ofmonotone submodular functions { f i ( · ) } i ∈ [ n ] and a sequence of non-negative weights λ . Output:

Ranking π ∈ Π .Set initial ranking π (0) ← n . for position i = 1 , , . . . , n do Local Optimization StepChoose z ( i ) ∈ arg max z ∈ [ n ] P nj = i λ j f j (cid:0) { [ π ( i − ] , . . . , [ π ( i − ] i − , z } (cid:1) − P nj = i λ j f j ( { [ π ( i − ] , . . . , [ π ( i − ] i − } ) .Local Update StepSet π ( i ) ← π ( i ) + z ( i ) e i . return π ← π ( n ) . We cast Algorithm 5 as an instance of

Offline-IG (Algorithm 1). The parameter space is

Θ = ∆([ n ]) and d param = n . Moreover, in subproblem i the algorithm picks the distribution θ ( i ) overitems so that the resulting vector payoﬀ lands in set S . In this language, set S is the n -dimensionalpositive orthant and the vector payoﬀ function is: ∀ j ∈ [ n ] : (cid:2) Payoff ( θ ( i ) , π ( i − , f ) (cid:3) j = θ T y ( i ) − [ y ( i ) ] j , iazadeh et al.: Online Learning via Oﬄine Greedy where y ( i ) , " n X a = i λ a f a ( { [ π ( i − ] , . . . , [ π ( i − ] i − , j } ) − n X a = i λ a f a ( { [ π ( i − ] , . . . , [ π ( i − ] i − } ) j ∈ [ n ] = (cid:2) f ( π ( i − + j e i ) − f ( π ( i − ) (cid:3) j ∈ [ n ] , is the marginal objective value of putting item j on the i th position. Note that any θ ( i ) for which the vector payoﬀ is in S is indeed a distribution over items z ( i ) such that z ( i ) ∈ arg max z ∈ [ n ] P nj = i λ j f j (cid:0) { [ π ( i − ] , . . . , [ π ( i − ] i − , z } (cid:1) − P nj = i λ j f j ( { [ π ( i − ] , . . . , [ π ( i − ] i − } ) . Theorem 5 (Online learning for sequential submodular maximization) . Let n be thenumber of items. For the problem of maximizing sequential submodular functions in the online fullinformation setting, there exists a learning algorithm that obtains O (cid:0) n √ T log n (cid:1) -regret, where T is the number of rounds. Furthermore, in the online bandit setting, there exists a learning algorithmthat obtains O (cid:16) n / (log n ) / T / (cid:17) -regret, where T is the number of rounds. For this problem,the benchmark in the regret bounds is max π ∈ Π P Tt =1 f t ( π ) . Theorem 5 and the following corollary are proved in Section D using our oﬄine-to-online trans-formations presented in Section 4 and Section 5.

Corollary 1 (Online learning for product ranking) . Let n be the number of items. Forthe problem of product ranking optimization to maximize user engagement using the model fromAsadpour et al. (2020), there exists a learning algorithm that obtains O (cid:0) n √ T log n (cid:1) − regret in thefull-information setting and O (cid:16) n / (log n ) / T / (cid:17) − regret in the bandit setting, where T is thenumber of consumers. As an implication, the same problem for the consumer choice model fromFerreira et al. (2019) also has a learning algorithm that obtains O (cid:0) n √ T log n (cid:1) − regret in the fullinformation setting and O (cid:16) n / (log n ) / T / (cid:17) − regret in the bandit setting. For both problems,the benchmark in the regret bounds is max π ∈ Π P Tt =1 f t ( π ) . Problem deﬁnition.

In the

Maximizing Multiple Reserves (MMR) prob-lem (Roughgarden and Wang 2019, Derakhshan et al. 2019), a seller wants to sell an item to one of n bidders. Each bidder i has a private value v i for the item. The seller runs a second-price auctionwith personalized reserves r ; the winner is the bidder with the highest bid/valuation among thebidders whose bids exceed/clear their reserve prices. The winner pays the minimum bid that theycould have won with, which is the maximum between their reserve price and the second-highestbid that cleared its reserve price. The seller wishes to maximize their revenue. Since second price auctions are truthful, we will use bids and valuations interchangeably. iazadeh et al.:

Online Learning via Oﬄine Greedy Online problem.

We are interested in the seller’s problem in the online full information and banditsettings. In both settings, each round t ∈ [ T ] involves the seller choosing a set of reserves r t and theadversary choosing a valuation proﬁle v t . In the online full information setting, the seller observesthe valuation proﬁle and gets credit for the resulting revenue. In the online bandit setting, the sellerobserves just the resulting revenue and does not observe the bidders’ valuations or even the identityof the winner. The seller’s goal is to minimize the diﬀerence between his average revenue and thebest average revenue in hindsight for a ﬁxed set of reserves r ∗ . To the best of our knowledge, thebandit setting has not been studied in the literature. However, the full information setting of thisproblem is studied by Roughgarden and Wang (2019), where they present a learning algorithm with O (cid:0) n √ T log T (cid:1) -regret, which we improve upon. Oﬄine non-batch vs batch problem.

We start with formulating the oﬄine non-batch problem.Let R = { ρ , ρ , . . . , ρ m } be the set of feasible reserve prices, where |R| = m , and they are sorted: ρ < ρ < · · · < ρ m . For the oﬄine (non-batch) problem, let f : R n × [0 , n → [0 , be theseller’s revenue function: f ( r , v ) = max { [ v ] ˆ j , [ r ] j ∗ } for some v . Here, j ∗ and ˆ j are the highestand second-highest bidders among those who cleared their bids, with ties broken arbitrarily: j ∗ ∈ arg max j ∈ [ n ]:[ v ] j ≥ [ r ] j { [ v ] j } and ˆ j ∈ arg max j ∈ [ n ]:[ v ] j ≥ [ r ] j ,j = j ∗ { [ v ] j } . If no bidder clears their reserve,then we say [ r ] j ∗ and [ v ] ˆ j are both zero. Similarly, if only one bidder clears their reserve, we say [ v ] ˆ j iszero. Moreover, F is the space of all such revenue functions: F = { f | ∃ v ∈ [0 , n such that f ( r , v ) =max { [ v ] ˆ j , [ r ] j ∗ } . In the oﬄine (non-batch) problem, the goal is to solve max r ∈R n f ( r , v ) for an inputvaluation proﬁle v ∈ [0 , n . In the optimization problem, the domain D we consider and the feasibleregion C are both R n .The aforementioned oﬄine problem can be solved eﬃciently. Note that in the oﬄine problem, theseller who has access to the valuations of the bidders in one auction needs to optimize personalizedreserve prices. It is then obvious that in the oﬄine setting, the best action is to set reserve pricesof all the bidders to zero, except the bidder with the highest bid; for this bidder, his reserve priceis set to his valuation. Then, one may wonder why for the online version of this oﬄine (non-batch)problem, which is not even NP-hard, we characterize -regret, rather than -regret. The reasonis that Roughgarden and Wang (2019) show that the full information online setting is at least ashard as the oﬄine batch problem, which is APX-hard. In the oﬄine batch problem, the seller hasaccess to the valuation proﬁles in m auctions and would like to determine a single vector of reserveprices r that maximizes revenue across all the m auctions. Considering the hardness of the oﬄinebatch problem, to solve the oﬄine (non-batch) problem, we use a slight variation of the algorithm ofRoughgarden and Wang (2019). This variation is stated in Algorithm 6, which obtains a fractionof the optimal revenue similar to the original algorithm. See section E.1 for a discussion on themajor diﬀerences between this variation and the original one. iazadeh et al.: Online Learning via Oﬄine Greedy Algorithm 6:

Greedy Algorithm for Discretized MMR (Roughgarden and Wang 2019)

Input:

Valuation proﬁle v . Output:

Reserve prices r ∈ R n .Set initial reserves r (0) ← n . for bidder i = 1 , , . . . , n do Deﬁne revenue-from-reserves function q ( i ) : R → [0 , as q ( i ) ( r ) equals r if i has thehighest valuation (ties broken arbitrarily) and r ∈ [[ v ] i ′ , [ v ] i ] where i ′ has thesecond-highest valuation, and otherwise.Local Optimization StepChoose z ( i ) ∈ arg max r ∈R q ( i ) ( r ) .// In this case θ ( i ) ∈ ∆( R ) is the distribution that always returns z ( i ) . Local Update StepSet r ( i ) ← r ( i − + z ( i ) e i . return r ∼ Uniform { n , r ( n ) } .Oﬄine algorithm. We now brieﬂy discuss Algorithm 6 and show how to cast it as an instance of

Offline-IG (Algorithm 1). This greedy algorithm has n subproblems, where in each subproblem,reserve price of a bidder i is set using our revenue-from-reserves function q . At the end, the algorithmrandomly returns either the all-zeros reserve vector n or the crafted reserve vector denoted by r ( n ) , where the former yields revenue equal to the second-highest valuation and the latter yieldsrevenue of at least q ( j ∗ ) ( z ( j ∗ ) ) ; see the deﬁnition of q ( · ) in the algorithm. By deﬁnition of the revenuefunction, the optimal reserves obtain their revenue via one of these two cases, i.e., f ( r ∗ , v ) ≤ max { [ v ] ˆ j , q ( j ∗ ) ([ r ] j ∗ ) } ≤ [ v ] ˆ j + q ( j ∗ ) ([ r ] j ∗ ) ≤ f ( n , v ) + f ( r ( n ) , v ) = 2 E [ f ( r , v )] , where the expectation is taken with respect to the randomness in the algorithm. This impliesthat our algorithm is indeed a -approximation. Stated in the language of our Algorithm 1, ourlocal updates manage to guarantee that Payoff ( θ ( i ) , r ( i − , v ) is in the positive orthant, where the(asymmetric) vector payoﬀ function Payoff returns an m -dimensional point whose j th coordinatevalue is the expected diﬀerence between the expected value of picking a reserve according to θ ( i ) and that of picking ρ j : (cid:2) Payoff ( θ ( i ) , r ( i − , v ) (cid:3) j , E z ′ ∼ θ ( i ) (cid:2) q ( i ) ( z ′ ) − q ( i ) ( ρ j ) (cid:3) . The following theorem shows that using our framework, the greedy Algorithm 6 can be trans-formed to polynomial-time online learning algorithms under both full information and bandit feed-back structures. Note that here,

Payoff ( θ ( i ) , r ( i − , v ) is not a function of r ( i − . iazadeh et al.: Online Learning via Oﬄine Greedy Theorem 6 (Online learning for maximizing multiple reserves) . Let R = { ρ , . . . , ρ m } be the set of possible reserve prices and n be the number of bidders. Assume that the maximumvaluation is normalized to one. Then, for the problem of maximizing personalized reserve prices inthe online full information setting, there exists a learning algorithm that obtains O (cid:16) nT / log / m (cid:17) -regret, where T is the number of auctions. Furthermore, in the online bandit setting, there exists alearning algorithm that obtains O (cid:16) nm / T / log / m (cid:17) -regret, where T is the number of auctions.Here, the benchmark in the regret bounds is max r ∈R n P Tt =1 f ( r , v t ) . The proof of Theorem 6 is presented in Section E.2 in the appendix. At a high level, to prove thistheorem, we ﬁrst show that Algorithm 6 is an extended ( , ) -robust approximation algorithm. Wethen conﬁrm that this algorithm is bandit Blackwell reducible. To do so, we construct an unbiasedestimator of the revenue-from-reserves function q ( i ) . This unbiased estimator allows us to build anexplore sampling device per Deﬁnition 12. Having veriﬁed these two main properties, we then invokeTheorems 2 and 4 to get the ﬁnal regret bounds.The following corollary considers a stronger benchmark than the one we considered earlier. Thisbenchmark allows the reserve prices to be any number in [0 , n , i.e., the regret is computed against max r ∈ [0 , n P Tt =1 f ( r , v t ) , rather than max r ∈R n P Tt =1 f ( r , v t ) . The corollary then conﬁrms theexistence of learning algorithms with sublinear regret bounds against this stronger benchmark. SeeSection E.3 in the appendix for proof. Corollary 2.

Let n be the number of bidders. Assume that the maximum valuation is normalizedto one. Then, for the problem of maximizing personalized reserve prices in the online full informationsetting, there exists a learning algorithm that obtains O (cid:16) nT / log / T (cid:17) -regret, where T is thenumber of auctions. Furthermore, in the online bandit setting, there exists a learning algorithm thatobtains O (cid:16) n / T / log / ( nT ) (cid:17) -regret, where T is the number of auctions. Here, the benchmarkin the regret bounds is max r ∈ [0 , n P Tt =1 f ( r , v t ) . Problem deﬁnition.

Consider the

Non-monotone submodular maximization (NSM) prob-lem, for both set and continuous functions, as deﬁned in Section 2.4. For set functions, our goal isto maximize a non-monotone submodular set function without any constraints, and for continuousfunctions, our goal is to maximize a non-monotone continuous submodular function, either weak-DR(Deﬁnition 6) or strong-DR (Deﬁnition 7), over the unit hypercube [0 , n .For set functions, the oﬄine algorithm of Buchbinder et al. 2015 gives a / -approximation fac-tor, which is known to be the best possible approximation factor with polynomial query calls to thefunction (Feige et al. 2011). For the continuous case, under both weak-DR and strong-DR submod-ularity, the oﬄine algorithm of Niazadeh et al. 2018 gives a / -approximation factor for Lipschitz iazadeh et al.: Online Learning via Oﬄine Greedy continuous functions, which again achieves the best possible approximation factor with polynomialquery calls to the function.To have a uniﬁed oﬄine problem and algorithm capturing both of the above variations, we ﬁrstconsider a slight reformulation where a continuous (weak-DR) submodular function is restrictedto a discrete domain R n instead of [0 , n . Here, R = { ρ , ρ , . . . , ρ m } is the ﬁnite set of possiblecoordinate values, where |R| = m and ρ < ρ < · · · < ρ m are real numbers. Note that R = { , } when we focus on set functions. For Lipschitz continuous functions, one should think of R n as an ǫ -net that discretizes the function with O ( ǫ ) additive error due to Lipschitzness.Given this uniﬁed setting, we essentially consider discrete functions f : R n → [0 , that satisfya discrete version of (weak-DR) submodularity. This property is exactly the same as continuoussubmodularity in Deﬁnition 6, with a slight modiﬁcation that we only consider points x ∈ R n . Givensuch a function, our goal in the oﬄine problem is to solve the optimization problem max z ∈R n f ( z ) .We refer to this problem as discretized submodular maximization . Note that this problem is aninstance of problem (1), where both D and the feasible region C are R n , and our function class isthe class of submodular functions f described above.Inspired by the algorithms in Buchbinder et al. (2015) and Niazadeh et al. (2018), we then presenta uniﬁed oﬄine algorithm (which essentially is an adaptation of the algorithm in Niazadeh et al.(2018) restricted to the discrete domain R n ) with the same / -approximation factor for the pro-posed uniﬁed oﬄine problem. This is presented in Algorithm 7. We then transform this oﬄinealgorithm to online full-information and bandit learning algorithms using our framework. Oﬄine algorithm.

Algorithm 7 is a modiﬁed version of the continuous randomized bi-greedy algo-rithm by Niazadeh et al. (2018). The diﬀerence between Algorithm 7 and the continuous randomizedbi-greedy algorithm is discussed in Section F.1 in the appendix. Throughout this section, we use thenotation ( z ′ , z − i ) to denote the point constructed by taking z and replacing its i th coordinate valuewith z ′ , and f ( z ′ , z − i ) to denote the function evaluated at the corresponding point. The algorithmkeeps track of two points: lower bound z ¯ ( i ) and upper bound ¯ z ( i ) , where initially, z ¯ (0) = ( ρ , . . . , ρ ) ,and ¯ z (0) = ( ρ m , . . . , ρ m ) . The lower and upper bounds get updated as the algorithm goes through n subproblems. In subproblem i , the algorithm decides about the i th coordinate: it sets the i th coordi-nate to z ′ i , where z ′ i is drawn from distribution θ ( i ) ∈ ∆( R ) . Here, this distribution is chosen in a wayto satisfy the following condition E z ′ ∼ θ ( i ) (cid:2) α ( i ) ( z ′ ) + β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ′ ) (cid:3) ≥ for all ˆ z ∈ R . Notethat α ( i ) ( z ′ ) = f ( z ′ , z ¯ ( i − − i ) − f ( z ¯ ( i − ) is the marginal value of increasing the value of i th -coordinatefrom ρ to z ′ when the rest of coordinates are z ¯ ( i − − i , and similarly β ( i ) ( z ′ ) = f ( z ′ , ¯ z ( i − − i ) − f ( ¯ z ( i − ) is the marginal value of decreasing the i th coordinate from ρ m to z ′ when the rest of coordinatesare ¯ z ( i − − i . Moreover, ζ ( i ) (ˆ z, z ′ ) is equal to α ( i ) (ˆ z ) − α ( i ) ( z ′ ) if ˆ z ≥ z ′ and β ( i ) (ˆ z ) − β ( i ) ( z ′ ) otherwise.Roughly speaking, ζ ( i ) (ˆ z, z ′ ) measures the extent to which setting the i th coordinate to z ′ , rather iazadeh et al.: Online Learning via Oﬄine Greedy than ˆ z , is locally suboptimal. With this interpretation, the aforementioned condition ensures thatthe algorithm’s choice for the i th coordinate approximately compensates for the cost caused by thesuboptimality of this choice. We refer the readers to Niazadeh et al. (2018) for a more detaileddiscussion on the intuition behind this condition.We now show how to cast the above algorithm as an instance of Offline-IG (Algorithm 1). Inthe language of Algorithm 1, the aforementioned condition can be presented using the following

Payoff function: j ∈ [ m ] : (cid:2) Payoff (cid:0) θ ( i ) , z ¯ ( i − , f (cid:1)(cid:3) j = E z ′ ∼ θ ( i ) (cid:20) α ( i ) ( z ′ ) + 12 β ( i ) ( z ′ ) − ζ ( i ) ( ρ j , z ′ ) (cid:21) ≥ . (4)Moreover, we have Θ = ∆( R ) , d param = |R| = m, and z is the vector z ¯ that starts as ( ρ , . . . , ρ ) T then gets updated at each iteration. Theorem 7 (Online learning for discretized non-monotone submodular maximization) . Let n be the number of dimensions and R = { ρ , ρ , . . . , ρ m } be the set of potential values that eachcoordinate i ∈ [ n ] can take. Assume that the maximum function value is normalized to one. Then,for the problem of maximizing a non-monotone submodular function in the online full-informationsetting, there exists a learning algorithm that obtains O (cid:0) nT / (log m ) / (cid:1) -regret, where T isthe number of rounds. Furthermore, in the online bandit setting, there exists an online learningalgorithm that obtains O (cid:0) nm / T / (log m ) / (cid:1) -regret. Here, in both online algorithms, thebenchmark in the regret bounds is max z ∈R n P Tt =1 f t ( z ) . The proof of Theorem 7, which is presented in Section F.2 in the appendix, has two main steps.In the ﬁrst step, we show that the oﬄine Algorithm 7 is an extended ( , ) - robust approximationalgorithm and in the second step, we show that it is bandit Blackwell reducible. The challengingpart of the proof is to construct an explore sampling device that leads to an unbiased estimator forthe payoﬀ function. We then invoke Theorems 2 and 4 to get the ﬁnal regret bounds.The following is an immediate corollary of Theorem 7. Corollary 3 (Online learning for non-monotone set submodular maximization) . Let n be the number of items, and assume the maximum function value is normalized to one.Then for the problem of maximizing a nonmonotone (set) submodular function in the onlinefull-information setting, there exists an online learning algorithm that obtains O (cid:16) n √ T (cid:17) -regret,where T is the number of rounds. Furthermore, in the online bandit setting, there exists a learningalgorithm that obtains O (cid:0) nT / (cid:1) -regret, where T is the number of rounds. For any i ∈ [ n ] , one can construct ¯ z ( i ) from z ¯ ( i ) by replacing its last n − coordinates with ρ m . Thus, it suﬃces todeﬁne Payoff as a function of z ¯ ( i ) . iazadeh et al.: Online Learning via Oﬄine Greedy Algorithm 7:

Greedy Algorithm for Discretized NSM (Niazadeh et al. 2018)

Input:

Discrete submodular function f . Output:

Point z ∈ R n .Set initial lower bound z ¯ (0) ← ( ρ , ρ , . . . , ρ ) T and upper bound ¯ z (0) ← ( ρ m , ρ m , . . . , ρ m ) T . for coordinate i = 1 , , . . . , n do Deﬁne the lower marginal function α ( i ) : R → [ − , +1] as α ( i ) ( z ′ ) = f ( z ′ , z ¯ ( i − − i ) − f ( z ¯ ( i − ) .Deﬁne the upper marginal function β ( i ) : R → [ − , +1] as β ( i ) ( z ′ ) = f ( z ′ , ¯ z ( i − − ) − f ( ¯ z ( i − ) .Deﬁne comparison function ζ ( i ) : R × R → [ − , +1] as ζ ( i ) (ˆ z, z ′ ) = (cid:26) α ( i ) (ˆ z ) − α ( i ) ( z ′ ) if ˆ z ≥ z ′ β ( i ) (ˆ z ) − β ( i ) ( z ′ ) if ˆ z ≤ z ′ . Local Optimization StepChoose θ ( i ) ∈ ∆( R ) so that for all ˆ z ∈ R , E z ′ ∼ θ ( i ) (cid:20) α ( i ) ( z ′ ) + 12 β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ′ ) (cid:21) ≥ (5)(done in Niazadeh et al. (2018) via preprocessing and computing a 2D convex hull).Local Update StepSample z ′ i ∼ θ ( i ) . Set z ¯ ( i ) ← z ¯ ( i − and ¯ z ( i ) ← ¯ z ( i − and then update their i th coordinate: (cid:2) z ¯ ( i ) (cid:3) i ← z ′ i and (cid:2) ¯ z ( i ) (cid:3) i ← z ′ i . return z ← z ¯ ( n ) .So far, we assumed that for the continuous submodular functions, the set of potential value foreach coordinate is ﬁnite and belongs to set R = { ρ , ρ , . . . , ρ m } , rather than the interval [0 , ,and we design learning algorithms with sublinear regret bounds where the regrets are computedwith respect to max z ∈R P Tt =1 f t ( z ) . Now, one may wonder if one can design learning algorithmsagainst the benchmark of max z ∈ [0 , n P Tt =1 f t ( z ) that allows the coordinates to be any number in [0 , n . The following corollary answers this question for any L -Lipschitz non-monotone continuoussubmodular functions. Corollary 4 (Online learning for L − Lipschitz continuous submodular maximization) . Let n be the number of dimensions, and assume the maximum function value is normalized to one.Then for the problem of maximizing a coordinate-wise L -Lipschitz non-monotone (continuous)submodular function in the online full-information setting, there exists a learning algorithm thatobtains O (cid:16) nT / log / ( LT ) (cid:17) -regret, where T is the number of rounds. Furthermore, in the online iazadeh et al.: Online Learning via Oﬄine Greedy bandit setting, there exists a learning algorithm that obtains O (cid:16) nL / T / log / ( LT ) (cid:17) -regret,where T is the number of rounds. Here, in both online algorithms, the benchmark in the regretbounds is max z ∈ [0 , n P Tt =1 f t ( z ) . Proofs of the above corollaries are in Section F.3 in the appendix.

7. Conclusion

In many settings, decision-makers need to solve NP-hard combinatorial problems repeatedly overtime while the underlying (unknown) reward function changes. Motivated by this, we study theproblem of designing online adversarial algorithms for combinatorial problems that are amenable toa robust greedy approximation algorithm. Using Blackwell strategies, we present a uniﬁed frameworkto transform oﬄine robust greedy approximation algorithms to their online counterparts for bothfull information and bandit feedback structures. We show that by applying our framework to severalapplications including maximizing submodular functions, optimizing reserve prices in second priceauctions, and optimizing product ranking on online platforms, we can obtain improved/new regretbounds.While we have investigated a selective set of applications of our framework, we believe our frame-work is general and can capture several other algorithmic problems in revenue management andmarket design. Examples are online learning in assortment optimization (which is also closely relatedto submodular optimization and greedy algorithms proved to be useful there as well; cf. Désir et al.(2015), Agrawal et al. (2019)) and budgeted allocation/Adwords problem in sponsored search auc-tions (where again greedy algorithms proved to be helpful; cf Mehta et al. (2013)). Thus, we believethat investigating other applications of our framework is indeed an interesting future research direc-tion.

Acknowledgments

We thank Tim Roughgarden, Dimitris Bertsimas, and Amin Karbasi for their insightful comments duringthis work.

References

Jacob Abernethy, Elad E Hazan, and Alexander Rakhlin. Competing in the dark: An eﬃcient algorithmfor bandit linear optimization. In , pages263–273, 2008.Jacob Abernethy, Peter L Bartlett, and Elad Hazan. Blackwell approachability and no-regret learning areequivalent. In

Proceedings of the 24th Annual Conference on Learning Theory , pages 27–46, 2011.Shipra Agrawal, Vashist Avadhanula, Vineet Goyal, and Assaf Zeevi. Mnl-bandit: A dynamic learningapproach to assortment selection.

Operations Research , 67(5):1453–1485, 2019. iazadeh et al.:

Online Learning via Oﬄine Greedy

Mathematicalprogramming , 128(1-2):149–169, 2011.Saeed Alaei, Jason Hartline, Rad Niazadeh, Emmanouil Pountourakis, and Yang Yuan. Optimal auctionsvs. anonymous pricing.

Games and Economic Behavior , 118:494–510, 2019.Ali Aouad and Danny Segev. Display optimization for vertically diﬀerentiated locations under multinomiallogit preferences.

Available at SSRN 2709652 , 2015.Arash Asadpour, Rad Niazadeh, Amin Saberi, and Ali Shameli. Ranking an assortment of products viasequential submodular optimization.

Available at SSRN , 2020.Susan Athey and Glenn Ellison. Position auctions with consumer search.

The Quarterly Journal of Eco-nomics , 126(3):1213–1270, 2011.Jean-Yves Audibert, Sébastien Bubeck, and Gábor Lugosi. Regret in online combinatorial optimization.

Mathematics of Operations Research , 39(1):31–45, 2014.Peter Auer, Nicolo Cesa-Bianchi, Yoav Freund, and Robert E Schapire. The nonstochastic multiarmed banditproblem.

SIAM journal on computing , 32(1):48–77, 2002.Andrey Bernstein and Nahum Shimkin. Response-based approachability and its application to generalizedno-regret algorithms. arXiv preprint arXiv:1312.7658 , 2013.Hedyeh Beyhaghi, Negin Golrezaei, Renato Paes Leme, Martin Pal, and Balasubramanian Sivan. Improvedapproximations for free-order prophets and second-price auctions. arXiv preprint arXiv:1807.03435 ,2018.Andrew An Bian, Baharan Mirzasoleiman, Joachim M Buhmann, and Andreas Krause. Guaran-teed non-convex optimization: Submodular maximization over continuous domains. arXiv preprintarXiv:1606.05615 , 2016.David Blackwell. An analog of the minimax theorem for vector payoﬀs.

Paciﬁc Journal of Mathematics , 6(1):1–8, 1956.Sébastien Bubeck, Nicolo Cesa-Bianchi, and Sham M Kakade. Towards minimax policies for online linearoptimization with bandit feedback. In

Conference on Learning Theory , pages 41–1, 2012.Sébastien Bubeck et al. Convex optimization: Algorithms and complexity.

Foundations and Trends® inMachine Learning , 8(3-4):231–357, 2015.Niv Buchbinder and Moran Feldman. Deterministic algorithms for submodular maximization problems.

ACM Transactions on Algorithms (TALG) , 14(3):32, 2018.Niv Buchbinder, Moran Feldman, Joseph Seﬃ, and Roy Schwartz. A tight linear time (1/2)-approximationfor unconstrained submodular maximization.

SIAM Journal on Computing , 44(5):1384–1402, 2015.Nicolo Cesa-Bianchi and Gabor Lugosi.

Prediction, learning, and games . Cambridge university press, 2006. iazadeh et al.:

Online Learning via Oﬄine Greedy

Journal of Computer and System Sciences ,78(5):1404–1422, 2012.Nicolo Cesa-Bianchi, Claudio Gentile, and Yishay Mansour. Regret minimization for reserve prices in second-price auctions.

IEEE Transactions on Information Theory , 61(1):549–564, 2014.Lin Chen, Christopher Harshaw, Hamed Hassani, and Amin Karbasi. Projection-free online optimizationwith stochastic gradient: From convexity to submodularity. arXiv preprint arXiv:1802.08183 , 2018.Lin Chen, Mingrui Zhang, Hamed Hassani, and Amin Karbasi. Black box submodular maximization: Discreteand continuous settings. arXiv preprint arXiv:1901.09515 , 2019.Wei Chen, Yajun Wang, and Yang Yuan. Combinatorial multi-armed bandit: General framework and appli-cations. In

International Conference on Machine Learning , pages 151–159, 2013.Richard Combes, Mohammad Sadegh Talebi Mazraeh Shahi, Alexandre Proutiere, et al. Combinatorialbandits revisited. In

Advances in Neural Information Processing Systems , pages 2116–2124, 2015.Mahsa Derakhshan, Negin Golrezaei, Vahideh Manshadi, and Vahab Mirrokni. Product ranking on onlineplatforms.

Available at SSRN 3130378 , 2018.Mahsa Derakhshan, Negin Golrezaei, and Renato Paes Leme. Lp-based approximation for personalizedreserve prices. In

Proceedings of the 2019 ACM Conference on Economics and Computation , pages589–589, 2019.Antoine Désir, Vineet Goyal, Danny Segev, and Chun Ye. Capacity constrained assortment optimizationunder the markov chain based choice model.

Operations Research, Forthcoming , 2015.Shahar Dobzinski and Michael Schapira. An improved approximation algorithm for combinatorial auctionswith submodular bidders. In

Proceedings of the seventeenth annual ACM-SIAM symposium on Discretealgorithm , pages 1064–1073, 2006.Miroslav Dudík, Nika Haghtalab, Haipeng Luo, Robert E Schapire, Vasilis Syrgkanis, and Jennifer WortmanVaughan. Oracle-eﬃcient online learning and auction design. In , pages 528–539. IEEE, 2017.Eyal Even-Dar, Robert Kleinberg, Shie Mannor, and Yishay Mansour. Online learning for global costfunctions. In

COLT , 2009.Uriel Feige, Vahab S Mirrokni, and Jan Vondrák. Maximizing non-monotone submodular functions.

SIAMJournal on Computing , 40(4):1133–1153, 2011.Kris Ferreira, Sunanda Parthasarathy, and Shreyas Sekar. Learning to rank an assortment of products.

Available at SSRN 3395992 , 2019.Negin Golrezaei, Adel Javanmard, and Vahab Mirrokni. Dynamic incentive-aware learning: Robust pricingin contextual auctions. In

Advances in Neural Information Processing Systems , pages 9756–9766, 2019. iazadeh et al.:

Online Learning via Oﬄine Greedy

Proceedings 10th ACMConference on Electronic Commerce (EC-2009), Stanford, California, USA, July 6–10, 2009 , pages225–234, 2009.Hamed Hassani, Mahdi Soltanolkotabi, and Amin Karbasi. Gradient methods for submodular maximization.In

Advances in Neural Information Processing Systems , pages 5841–5851, 2017.Hamed Hassani, Amin Karbasi, Aryan Mokhtari, and Zebang Shen. Stochastic conditional gradient++. arXiv preprint arXiv:1902.06992 , 2019.Elad Hazan and Zohar Karnin. Volumetric spanners: an eﬃcient exploration basis for learning.

The Journalof Machine Learning Research , 17(1):4062–4095, 2016.Elad Hazan and Tomer Koren. The computational power of optimization in online learning. In

Proceedingsof the forty-eighth annual ACM symposium on Theory of Computing , pages 128–141. ACM, 2016.Sham M Kakade, Adam Tauman Kalai, and Katrina Ligett. Playing games with approximation algorithms.

SIAM Journal on Computing , 39(3):1088–1106, 2009.Adam Kalai and Santosh Vempala. Eﬃcient algorithms for online decision problems.

Journal of Computerand System Sciences , 71(3):291–307, 2005.David Kempe, Jon Kleinberg, and Éva Tardos. Maximizing the spread of inﬂuence through a social network.In

Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and datamining , pages 137–146, 2003.A Gürhan Kök, Marshall L Fisher, and Ramnath Vaidyanathan. Assortment planning: Review of literatureand industry practice. In

Retail supply chain management , pages 99–153. Springer, 2008.Joon Kwon and Vianney Perchet. Online learning and blackwell approachability with partial monitoring:optimal convergence rates. In

Artiﬁcial Intelligence and Statistics , pages 604–613, 2017.Ehud Lehrer. Approachability in inﬁnite dimensional spaces.

International Journal of Game Theory , 31(2):253–268, 2003.Shie Mannor and Nahum Shimkin. Online learning with variable stage duration. In

International Conferenceon Computational Learning Theory , pages 408–422. Springer, 2006.Shie Mannor, Vianney Perchet, and Gilles Stoltz. Robust approachability and regret minimization in gameswith partial monitoring. In

Proceedings of the 24th Annual Conference on Learning Theory , pages515–536, 2011.Shie Mannor, Vianney Perchet, and Gilles Stoltz. Set-valued approachability and online learning with partialmonitoring.

The Journal of Machine Learning Research , 15(1):3247–3295, 2014.Aranyak Mehta et al. Online matching and ad allocation.

Foundations and Trends® in Theoretical ComputerScience , 8(4):265–368, 2013.Emanuel Milman. Approachable sets of vector payoﬀs in stochastic games.

Games and Economic Behavior ,56(1):135–147, 2006. iazadeh et al.:

Online Learning via Oﬄine Greedy arXiv preprint arXiv:1804.09554 , 2018.Elchanan Mossel and Sebastien Roch. Submodularity of inﬂuence in social networks: From local to global.

SIAM Journal on Computing , 39(6):2176–2188, 2010.George L Nemhauser, Laurence A Wolsey, and Marshall L Fisher. An analysis of approximations for maxi-mizing submodular set functions—i.

Mathematical programming , 14(1):265–294, 1978.Rad Niazadeh, Tim Roughgarden, and Joshua Wang. Optimal algorithms for continuous non-monotonesubmodular and dr-submodular maximization. In

Advances in Neural Information Processing Systems ,pages 9594–9604, 2018.Tim Roughgarden and Joshua R Wang. An optimal learning algorithm for online unconstrained submodularmaximization. In

Conference On Learning Theory , pages 1307–1325, 2018.Tim Roughgarden and Joshua R Wang. Minimizing regret with multiple reserves.

ACM Transactions onEconomics and Computation (TEAC) , 7(3):1–18, 2019.Shai Shalev-Shwartz. Online learning: Theory, algorithms, and applications.

PhD Thesis , 2007.Shai Shalev-Shwartz et al. Online learning and online convex optimization.

Foundations and Trends® inMachine Learning , 4(2):107–194, 2012.Tasuku Soma and Yuichi Yoshida. Maximizing monotone submodular functions over the integer lattice.

Mathematical Programming , 172(1-2):539–563, 2018.Xavier Spinat. A necessary and suﬃcient condition for approachability.

Mathematics of Operations Research ,27(1):31–44, 2002.M Streeter and D Golovin. An online algorithm for maximizing submodular functions (technical reportcmu-cs-07-171), 2007.Matthew Streeter and Daniel Golovin. An online algorithm for maximizing submodular functions. In

Advances in Neural Information Processing Systems , pages 1577–1584, 2008.Nguyen Kim Thang and Abhinav Srivastav. Online non-monotone dr-submodular maximization. arXivpreprint arXiv:1909.11426 , 2019.Taishi Uchiya, Atsuyoshi Nakamura, and Mineichi Kudo. Algorithms for adversarial bandit problems withmultiple plays. In

International Conference on Algorithmic Learning Theory , pages 375–389. Springer,2010.Raluca Mihaela Ursu. The power of rankings: Quantifying the eﬀect of rankings on online consumer searchand purchase decisions.

Browser Download This Paper , 2016.Nicolas Vieille. Weak approachability.

Mathematics of operations research , 17(4):781–791, 1992.Jan Vondrák. Optimal approximation for the submodular welfare problem in the value oracle model. In

Proceedings of the fortieth annual ACM symposium on Theory of computing , pages 67–74, 2008. iazadeh et al.:

Online Learning via Oﬄine Greedy

44H Martin Weingartner.

Mathematical programming and the analysis of capital budgeting problems . MarkhamPublishing Company, 1967.Andrew Chi-Chin Yao. Probabilistic computations: Toward a uniﬁed measure of complexity. In , pages 222–227. IEEE, 1977.Mingrui Zhang, Lin Chen, Hamed Hassani, and Amin Karbasi. Online continuous submodular maximization:From full-information to bandit feedback. In

Advances in Neural Information Processing Systems ,pages 9206–9217, 2019a.Mingrui Zhang, Zebang Shen, Aryan Mokhtari, Hamed Hassani, and Amin Karbasi. One sample stochasticfrank-wolfe. arXiv preprint arXiv:1910.04322 , 2019b.Julian Zimmert, Haipeng Luo, and Chen-Yu Wei. Beating stochastic and adversarial semi-bandits optimallyand simultaneously. arXiv preprint arXiv:1901.08779 , 2019. iazadeh et al.:

Online Learning via Oﬄine Greedy Appendix A: Proofs and Remarks of Section 2.3

A.1. Equivalent criteria for approachability

Interestingly, there are other structural conditions that are equivalent to approachability. For example, theoriginal proof of the Blackwell approachability theorem (Blackwell 1956) uses a condition called “halfspace-satisﬁability”. The following proposition summarizes all the known equivalences.

Proposition 1 (Satisﬁable/Halfspace-Satisﬁable/Response-Satisﬁable (Abernethy et al. 2011)) . The following conditions are all equivalent to the approachability condition (Deﬁnition 3): A target set S is satisﬁable in the Blackwell sequential game ( X , Y , p ) if there exists a player 1’s action x ∈ X such that for every player 2’s action y ∈ Y , the vector payoﬀ falls into the target set, that is p ( x , y ) ∈ S . A target set S is halfspace-satisﬁable in the Blackwell sequential game ( X , Y , p ) if for every halfspace H ⊇ S , H is satisﬁable. A target set S is response-satisﬁable in the Blackwell sequential game ( X , Y , p ) if for every player 2’saction y ∈ Y , there exists a player 1’s action x ∈ X such that the vector payoﬀ falls into the target set,that is p ( x , y ) ∈ S . A.2. Proof of Theorem 1

Proof.

The proof for the only if direction relies on the fact that the ℓ ∞ -distance between the averagepayoﬀ and S is vanishing as T → + ∞ since S is o (1) -approachable. Suppose that S is not response satisﬁable,then there exists player 2’s action y ∈ Y such that for every player 1’s action x ∈ X , the payoﬀ p ( x , y ) is notin S . Consider the set U := { p ( x , y ) : x ∈ X } . Because the payoﬀ p is biaﬃne and X is convex and compact,so is U, hence inf u ∈ U d ∞ ( u , S ) = d ∞ ( p ( x , y ) , S ) for some x ∈ X . As p ( x , y ) / ∈ S, β = d ∞ ( p ( x , y ) , S ) > . Whenplayer 2 always plays y , we know that the ℓ ∞ distance between the average payoﬀ and S should convergeto zero as S is o (1) -approachable. At the same time, it is at least β , a contradiction.To prove the if direction, we ﬁrst show a reduction from Blackwell approachability to Online LinearOptimization (OLO) by showing that we can upper bound the ℓ ∞ distance between the average payoﬀ andthe target set in a Blackwell approachability problem with the regret of the corresponding OLO instance.Then, we bound the regret of the OLO problem from above in terms of the ℓ ∞ norm of the payoﬀ D ( p ) (because of our desired bound), the number of rounds T, and the dimension of the payoﬀ function d . Weassume that S is a cone throughout the proof, which is not an issue because we can always lift a convex setto a cone in one dimension higher while not perturbing the distances by more than a factor of . Blackwell approachability reduces to OLO.

In an OLO problem, a player is given a compact convexdecision set

K ⊂ R d , and have to decide on a sequence of actions w , w , . . . , w T ∈ K . In round t, after theplayer decides on an action w t , Nature reveals a loss vector l t and the player pays h l t , w t i . The player observesthe loss vector l t in each round (full-information setting) and aims to minimize his cost. We want to constructa learning algorithm L , such that, for any sequence of loss vectors l , l , . . . , l T ∈ R d , the algorithm outputs w , w , . . . , w T ∈ K that attains a small regret, i.e. P Tt =1 h l t , w t i − min w ∈K P Tt =1 h l t , w i ≤ o ( T ) . Abernethy et al.(2011) show that we can eﬃciently obtain an algorithm for a Blackwell approachability problem from analgorithm for its corresponding OLO problem. Speciﬁcally, we have the following lemma. iazadeh et al.:

Online Learning via Oﬄine Greedy Lemma 1. (Abernethy et al. (2011)) Given a Blackwell instance ( X , Y , p ) , and a cone S such that S isresponse-satisﬁable, we can construct an OLO problem with K = S ◦ ∩ B (1) and l t = − p ( x t , y t ) for all t, such that, if the OLO learning algorithm returns w t in round t, we can convert it into x t ∈ X where d T T X t =1 p ( x t , y t ) , S ! ≤ T T X t =1 h l t , w t i − min w ∈K T X t =1 h l t , w i ! . Proof of Lemma 1.

This lemma was proved in Abernethy et al. (2011), but we include the proof herefor completion. Notice that, for any x ∈ R d and convex cone S ⊂ R d , the distance from x to S can be writtenas d ∞ ( x , S ) = max w ∈ S ◦ , k w k≤ h w , x i (6)because d ∞ ( x , S ) = k x − π S ( x ) k ≥ k w kk x − π S ( x ) k ≥ h w , x − π S ( x ) i ≥ h w , x i , where π S ( x ) denotes the projection of x onto S , and when w = x − π S ( x ) k x − π S ( x ) k , we have equality, i.e. h w , x i = d ( x , S ) . To construct a mapping from the output of the OLO algorithm w t to x t for the Blackwell game,we utilize the halfspace oracle for the Blackwell problem (see Proposition 1). Speciﬁcally, we pick x t suchthat p ( x t , y t ) ∈ H w t for all y ∈ Y , where H w t = { x : h w t , x i ≤ } is a halfspace that contains S ( H w t contains S because its normal, w t , is in S ◦ ). This gives us the following guarantee d T T X t =1 p ( x t , y t ) , S ! (1) = max w ∈K * T T X t =1 p ( x t , y t ) , w + = 1 T max w ∈K − T X t =1 h l t , w i ! (7) (2) ≤ T T X t =1 h− p ( x t , y t ) , w t i − min w ∈K T X t =1 h l t , w i ! (3) = 1 T T X t =1 h l t , w t i − min w ∈K T X t =1 h l t , w i ! . Here, Equality (1) follows from Equation (6), Inequality (2) holds because p ( x t , y t ) ∈ H w t , and Equality (3) holds from our deﬁnition of l t . (cid:4) As a corollary, since for any x ∈ R d and S ⊆ R d , the ℓ ∞ distance is always less than equal to the ℓ distance,i.e., d ∞ ( x , S ) ≤ d ( x , S ) , we obtain d ∞ T T X t =1 p ( x t , y t ) , S ! ≤ d T T X t =1 p ( x t , y t ) , S ! ≤ T T X t =1 h l t , w t i − min w ∈K T X t =1 h l t , w i ! . S ◦ is the polar cone of S , i.e. S ◦ := { s ∈ R d : h s , x i ≤ for all x ∈ S } , and B (1) is a Euclidian ball with radius ,i.e. B (1) = { w ∈ R d : k w k ≤ } . iazadeh et al.: Online Learning via Oﬄine Greedy OLO regret upper-bound with Follow-the-Regularized-Leader algorithm.

To obtain the upper-boundon the regret of an OLO problems in terms of the ℓ ∞ norm of its losses, we apply the Follow-the-Regularized-Leader algorithm with a µ -strongly convex regularizer with respect to the ℓ norm. We use a regularizerwith respect to the ℓ norm, the dual of the ℓ ∞ norm, because of the bound structure of the algorithm asstated in Lemma 2. We elaborate further in the following lemmas. Lemma 2. (Shalev-Shwartz et al. (2012)) Consider an OLO problem on a convex and compact decisionspace

K ⊆ R d . Applying Follow-the-Regularized-Leader algorithm with a regularizer R, where R : R d → R is a µ -strongly convex function with respect to some norm k·k for µ > , implies T T X t =1 h l t , w t i − min w ∈K T X t =1 h l t , w i ! ≤ O (cid:0) CB / µ − / T − / (cid:1) , where B > upper bounds the function R, C > upper bounds the dual norm of the loss vector k l k ∗ , and T is the number of rounds. Lemma 3. (Shalev-Shwartz (2007)) For q ∈ (1 , , the function f : R d → R deﬁned as f ( x ) = q − k x k q is strongly convex with respect to the ℓ q norm over R d . Recall that the ℓ q norm is deﬁned as k x k q =( x q + x qw + . . . + x qd ) /q for x ∈ R d . To get a bound from Lemma 2 that depends on the upper-bound of the ℓ ∞ norm of the loss vectors, wewant a regularizer R that is µ -strongly convex w.r.t the ℓ norm for some µ > (to be determined later).However, the function from Lemma 3 does not work for q = 1 . To solve this, we set q to be greater than , q = log( d )log( d ) − in particular, then bound the ℓ q norm from above using the ℓ norm. Speciﬁcally, setting R ( x ) = k x k q and q = log( d )log( d ) − we have R ( x ) = ( q − · q − k x k q (1) ≥ ( q −

1) 12( q − k x k q + ( q − ∇ (cid:18) q − k x k q (cid:19) T ( y − x ) + ( q − · · k y − x k q (2) ≥ ( q −

1) 12( q − k x k q + ( q − ∇ (cid:18) q − k x k q (cid:19) T ( y − x ) + ( q − · · k y − x k R ( x ) + R ( x ) T ( y − x ) + ( q − / k y − x k , where µ = q − = log( d )log( d ) − − =

13 log( d ) . So, the function R is

13 log( d ) -strongly convex with respect to the ℓ norm. Furthermore, Inequality (1) follows from Lemma 3 and Inequality (2) holds because k w k / ≤ k w k q for any w ∈ R d . Consequently, by constructing an OLO problem with K = S ◦ ∩ B (1) and l t = − p ( x t , y t ) on each round,applying the Follow-the-Regularized-Leader algorithm with regularizer R ( w ) = k w k q to the OLO problem,and converting w t to x t in each round, we obtain d ∞ T T X t =1 p ( x t , y t ) , S ! ≤ T T X t =1 h l t , w t i − min w ∈K T X t =1 h l t , w i ! ≤ O (cid:0) D ( p ) log( d ) / T − / (cid:1) . A µ -strongly convex function f with respect to the ℓ q norm is a diﬀerentiable function that satisﬁes f ( y ) ≥ f ( x ) + ∇ (cid:0) f ( x ) T ( y − x ) (cid:1) + µ k y − x k q for some µ > . If k·k is a norm in R d , the dual norm k·k ∗ of k·k is deﬁned as k w k ∗ = sup { w T x | k x k ≤ } . iazadeh et al.: Online Learning via Oﬄine Greedy B = 1 and C = D ( p ) . Notice that for any w ∈ K , k w k ≤ B because we set K = S ◦ ∩ B (1) inLemma 1. Furthermore, since we set l t = p ( x t , y t ) , we have k l t k ∞ = k p ( x t , y t ) k ∞ ≤ D ( p ) by deﬁnition. (cid:4) Appendix B: Proofs and Remarks of Section 3.1

Example 2 (Non-Robust Greedy Algorithm).

In the shortest path tree problem, we are given anundirected graph G = ( V, E ) along with a root node u and edge weights { w uv } ( u,v ) ∈ E . We want to computea spanning tree of G such that for all vertices v ∈ V , the distance to the root in the tree, dist T ( u, v ) , equalsthe distance to the root in the original graph, dist G ( u, v ) . This problem can be solved by a greedy algorithmwhich runs Dijkstra’s algorithm from u and then for each node v = u chooses a parent p ∈ neighborhood ( v ) with the smallest w vp + dist G ( p, u ) . Suppose that we want to solve the online problem where the G and u are ﬁxed over all rounds but the edge weights are chosen by an adversary.This can be translated into the language of our meta-algorithm as follows. The feasible region is to choosea parent for every non-root vertex ( C = Q v ∈ V \{ u } neighborhood ( v ) ). The adversary’s function space is tochoose (bounded) weights ( F ∼ = (0 , E ), and the cost of a chosen set of edges that we aim to minimize is theaverage distance from a random vertex to u. For each of our | V | subroutines, the parameter space is to choosea distribution for the parent vertex ( Θ = ∆( neighborhood ( v )) ) . The (one-dimensional) payoﬀ vector is theshortest path from v to u through that parent p ( Payoff ( θ , z , { w uv } ) = E p ∼ θ [ w vp + dist G ( p, u )] ), where θ isthe probability of choosing a vertex among v ’s neighbors as parent.Managing to perfectly minimize the one-dimensional payoﬀ vector at each iteration results in a shortestpath tree and therefore the best possible objective value. However, if the local choices deviate from theiroptimal values, then we can create cycles which result in inﬁnite objective value.For example, consider the clique on V = { , , } , where we want a shortest path tree to the root node u = 1 . When the weights are w = 0 . , w = 1 . , w = 0 . , then it would be best for node three to ﬁrsttake edge (2 , . If we swap the role of nodes two and three, w = 1 . , w = 0 . , w , then it would bebest for node two to ﬁrst take edge (2 , . When our subroutines for nodes two and three make simultaneousdecisions without actually seeing the input, they could easily both choose edge (2 , , yielding an invalidshortest path tree and making it impossible to get from either node to the root. This global issue can’t beexpressed as local utilities, so the algorithm is not robust in the sense that is needed to apply our framework. Appendix C: Proofs and Remarks of Section 5

C.1. Proof of Theorem 3

In this section, we complete the proof of Theorem 3, which is restated below for convenience.

Theorem 3.

The only if direction is proved in the sketch. To prove the if direction and second part ofTheorem 3, we propose an algorithm AlgBB that is parameterized by an exploration probability q ∈ (0 , (Algorithm 3). We later choose q to be D ( p ) − / D ( ˆ p ) / (log d ) / T − / to balance terms in our regret upper-bound. In each round t , this algorithm outputs a move x t ∈ Y as well as whether to explore π t ∈ { Yes , No } ,and then receives an unbiased estimate of the resulting payoﬀ ˆ p ( x t , y t ) based on both players’ actions if itpicks to explore. It also maintains a (full-information) Blackwell algorithm AlgB. In each round t = 1 , , ..., T ,our algorithm follows the last suggested action by AlgB to generate a move x t . Note that this move willbe exactly the same as the previous round, if the algorithm chose to not explore in the previous round.Our algorithm then decides to either explore with probability q or not explore with probability − q . If itexplores, then it receives an unbiased estimator, ˆ p of the current vector payoﬀ p ( x t , y t ) , and passes a scaledversion ˆ p /q on to AlgB. If it does not explore, then it rewinds the state of AlgB to the beginning of thecurrent round. Our goal here is to show that under algorithm AlgBB, d ∞ (cid:0) T P Tt =1 p ( x t , y t ) , S (cid:1) plus exploringpenalty term E (cid:2) T D ( p ) · ( ) (cid:3) is O (cid:16) D ( p ) / D ( ˆ p ) / (log d ) / T − / (cid:17) .We start with bounding the ﬁrst term d ∞ (cid:0) T P Tt =1 p ( x t , y t ) , S (cid:1) as a function of ˆΠ T , which is the time-averaged rescaled estimated payoﬀs from the rounds that we explore in , , . . . , T : ˆΠ T , T T X t =1 q ˆ p ( x t , y t ) [ explore in round t ] . Speciﬁcally, we have d ∞ T T X t =1 p ( x t , y t ) , S ! = d ∞ T T X t =1 E (cid:20) q ˆ p ( x t , y t ) [ explore in round t ] (cid:21) , S ! ≤ E h d ∞ (cid:16) ˆΠ T , S (cid:17)i , where the equality follows because E h q ˆ p ( x t , y t ) [ explore in round t ] i = qq E [ ˆ p ( x t , y t )] = p ( x t , y t ) as ˆ p is an unbiased estimator for p , and the inequality is obtained by applying Jensen’s inequality to theconvex ℓ ∞ distance function. We next show that if we explore with probability q, E h d ∞ (cid:16) ˆΠ T , S (cid:17)i ≤ O (cid:0) D ( ˆ p ) log( d ) / ( qT ) − / (cid:1) . Then, observe that the exploring penalty term E (cid:2) T D ( p ) · ( ) (cid:3) equals D ( p ) q . Our choice of exploring probability q = D ( p ) − / D ( ˆ p ) / log( d ) / T − / makes the two terms equalto O ( D ( p ) / D ( ˆ p ) / (log d ) / T − / ) , and gives us the desired bound.To see why E h d ∞ (cid:16) ˆΠ T , S (cid:17)i ≤ O (cid:0) D ( ˆ p ) log( d ) / ( qT ) − / (cid:1) when we explore with probability q , let M be arandom variable equal to the number of rounds we explore and ( τ , τ , · · · , τ M ) be the rounds that we explore.Note that M ∼ Binomial ( T, q ) . By applying the law of total expectation, we have: E h d ∞ (cid:16) ˆΠ T , S (cid:17)i = T X m =0 E h d ∞ (cid:16) ˆΠ T , S (cid:17) (cid:12)(cid:12)(cid:12) M = m i Pr [ M = m ] . We provide an upper bound on each term in the above summation separately. First, we handle the M = m = 0 case by noting ˆΠ T = , hence the distance from S is bounded by D ( ˆ p ) in this case. Moreover, thisevent occurs with probability (1 − q ) T . See that (1 − q ) T = (1 − q ) q − · q · T ≤ (1 /e ) qT ≤ O (cid:0) ( qT ) − / (cid:1) , iazadeh et al.: Online Learning via Oﬄine Greedy E h d ∞ (cid:16) ˆΠ T , S (cid:17) (cid:12)(cid:12)(cid:12) M = 0 i Pr [ M = 0] ≤ O (cid:16) D ( ˆ p ) log( d ) / ( qT ) − / (cid:17) .Now ﬁx some M = m = 0 . Assuming that S is a cone (we can always lift the convex set S to a cone in onedimension higher as shown in Abernethy et al. (2011)), our full-information Blackwell algorithm AlgB, whoreceives “fake payoﬀs” { ˆ p ( x τ i , y τ i ) } mi =1 with a diameter of q D ( ˆ p ) , guarantees that: E h d ∞ (cid:16) ˆΠ T , S (cid:17) (cid:12)(cid:12)(cid:12) M = m i = E (cid:20) MT · d ∞ (cid:18) TM ˆΠ T , S (cid:19) (cid:12)(cid:12)(cid:12)(cid:12) M = m (cid:21) = mT E " d ∞ M M X i =1 q ˆ p ( x τ i , y τ i ) , S ! (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) M = m ≤ mT · O (cid:18) q D ( ˆ p ) log( d ) / m − / (cid:19) = O (cid:18) qT D ( ˆ p ) log( d ) / m / (cid:19) , (8)where the expectation is taken w.r.t. the randomness in ˆ p and the inequality holds because S is response-satisﬁable in the Blackwell game ( X , Y , p ) and ˆ p is an unbiased estimator of p . To be more clear why theabove inequality holds, note that set S is response-satisﬁable in the Blackwell game ( X , Y , p ) , and is notnecessarily response-satisﬁable if we replace p with ˆ p . However, by (i) following exactly the same steps asin proof of Theorem 1 (Section A.2 in the appendix) to reduce Blackwell approachability to online linearoptimization for rounds { τ i } Mi =1 , (ii) plugging ˆ p as the vector payoﬀ of each round and using l i = − ˆ p ( x τ i , y τ i ) as the loss function in the online linear optimization, and then (iii) using the fact that ˆ p is an unbiasedestimator for p and S is response-satisﬁable w.r.t. payoﬀs p , we can obtain exactly the same approachabilitybound in expectation as if S was response-satisﬁable w.r.t. payoﬀs ˆ p . To see this, consider the chain ofinequalities (7) in the proof of Theorem 1 in Section A.2, tailored to our problem, and take an expectationw.r.t. the randomness in ˆ p . We have: E " d ∞ M M X i =1 ˆ p ( x τ i , y τ i ) , S ! (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) { τ i } Mi =1 ≤ E " max w ∈K * M M X i =1 ˆ p ( x τ i , y τ i ) , w + (cid:12)(cid:12)(cid:12)(cid:12) { τ i } Mi =1 = E " M max w ∈K − M X i =1 h l i , w i ! (cid:12)(cid:12)(cid:12)(cid:12) { τ i } Mi =1 (2) ≤ M M X i =1 h− p ( x τ i , y τ i ) , w i i − E " min w ∈K M X i =1 h l i , w i (cid:12)(cid:12)(cid:12)(cid:12) { τ i } Mi =1 (3) = 1 M E " M X i =1 h l i , w i i − min w ∈K M X i =1 h l i , w i (cid:12)(cid:12)(cid:12)(cid:12) { τ i } Mi =1 . This time, Inequality (2) holds as before, because S is response-satisﬁabe w.r.t. payoﬀs p (and hence half-space satisﬁable when using w t as the normal of the half-space), but Equality (3) holds because: −h p ( x τ i , y τ i ) , w i i = − E (cid:2) h ˆ p ( x τ i , y τ i ) , w i i|{ τ i } Mi =1 (cid:3) = − E (cid:2) h l i , w i i|{ τ i } Mi =1 (cid:3) Note that expectation is conditioned on { τ i } Mi =1 , but only we use a universal upper-bound on the last term(regret of online linear optimization) that is a function of M , so we can change the conditioning on only M . iazadeh et al.: Online Learning via Oﬄine Greedy q = D ( p ) − / D ( ˆ p ) / log( d ) / T − / , and then use Jensen’s inequalityapplied to the (concave) square-root function. E (cid:20) d ∞ (cid:16) ˆΠ T , S (cid:17) + 1 T D ( p ) · ( ) (cid:21) = E m ∼ Binomial ( T,q ) h E h d ∞ (cid:16) ˆΠ T , S (cid:17) (cid:12)(cid:12)(cid:12) M = m ii + O ( D ( p ) q ) ≤ E m ∼ Binomial ( T,q ) (cid:20) O (cid:18) qT D ( ˆ p ) log( d ) / m / (cid:19)(cid:21) + O ( D ( p ) q ) ≤ O (cid:18) qT D ( ˆ p ) log( d ) / ( T q ) / (cid:19) + O ( D ( p ) q )= O (cid:0) D ( ˆ p ) log( d ) / ( qT ) − / (cid:1) + O ( D ( p ) q )= O (cid:16) D ( p ) / D ( ˆ p ) / (log d ) / T − / (cid:17) . The last inequality is the desired result. (cid:4)

C.2. Proof of Theorem 4

Proof.

The function ˆ p is an unbiased estimator for p (due to the bandit Blackwell reduciblity),so ( X , Y , p , ˆ p ) is a valid instance of the bandit Blackwell sequential game. Moreover, our target set S is the d payoﬀ -dimensional positive orthant. Therefore, there exists a polynomial-time separation ora-cle for set S . Set S is also response-satisﬁable (due to bandit Blackwell reduciblity). Thus, there existsa polynomial time online algorithm AlgBB that guarantees the bandit approachability upper-bound O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoﬀ )) / T / (cid:17) , established in Theorem 3, for each of the bandit Blackwellinstances corresponding to the N diﬀerent subproblems.Consider a subproblem i ∈ [ N ] . Note that AlgBB ( i ) is not invoked in all rounds [ T ] , but rather a subset T i ⊆ [ T ] depending on when its fellow Blackwell bandit algorithms, i.e., AlgBB ( i ′ ) , i ′ ∈ [ i − , decide toexplore. Note that T i is a random set, and only depends on realizations of binary signals { π ( i ′ ) t } i ′ ∈ [ i − ,t ∈ [ T ] .Fix a particular realization of set T i . By using the upper-bound in Theorem 3 for each of the terms in the LHSof the bound (i.e., the distance of the average payoﬀ vector from set S and expected number of explorations)separately, we have d ∞ |T i | X t ∈T i p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) , S ! ≤ O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoﬀ )) / |T i | − / (cid:17) . Moreover, let M i be the number of rounds that AlgBB ( i ) explores out of |T i | rounds it is invoked. Then, byour choice of q = D ( p ) − / D ( ˆ p ) / (log d ) / T − / , we have E (cid:20) |T i | M i (cid:12)(cid:12)(cid:12)(cid:12) T i (cid:21) = D ( p ) − / D ( ˆ p ) / (log( d payoﬀ )) / | T | − / ≤ D ( p ) − / D ( ˆ p ) / (log( d payoﬀ )) / |T i | − / , As the set S is the positive orthant, we have: ∀ j ∈ [ n ] : "X t ∈T i p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) j ≥ − O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoﬀ )) / |T i | / (cid:17) . iazadeh et al.: Online Learning via Oﬄine Greedy

Payoff (cid:16) θ ( i ) t , z ( i − t , f t (cid:17) = p (cid:16) θ ( i ) t , AdvB ( z ( i − t , f t ) (cid:17) . Thus, ∀ j ∈ [ n ] : "X t ∈T i Payoff ( θ ( i ) t , z ( i − t , f t ) j ≥ − O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoﬀ )) / |T i | / (cid:17) . (9)Let T − be the set of rounds where no AlgBB ( i ) explored and T + be the set of rounds where some AlgBB ( i ) explored. Note also that T − ⊆ T i , simply because if no algorithm explores, AlgBB ( i ) will be invoked. Then,for any j ∈ [ n ] , we have E " X t ∈T − Payoff ( θ ( i ) t , z ( i − t , f t ) (cid:12)(cid:12)(cid:12)(cid:12) T i j = E "X t ∈T i Payoff ( θ ( i ) t , z ( i − t , f t ) (cid:12)(cid:12)(cid:12)(cid:12) T i j − E  X t ∈T i \T − Payoff ( θ ( i ) t , z ( i − t , f t ) (cid:12)(cid:12)(cid:12)(cid:12) T i  j ≥ − O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoﬀ )) / |T i | / (cid:17) − D ( p ) E [ M i |T i ] , where the expectation is with respect to z ( i − t , t ∈ T i . Here, the inequality follows from Equation (9) andthe fact that |T i \ T − | ≤ M i and for any i ∈ [ N ] and t ∈ [ T ] , Payoff ( θ ( i ) t , z ( i − t , f t ) ≤ D ( p ) . By consideringthe fact that E [ M i |T i ] = D ( p ) − / D ( ˆ p ) / (log d ) / T − / |T i | ≤ D ( p ) − / D ( ˆ p ) / (log( d payoﬀ )) / |T i | / from our choice of the probability of exploring, q , in Theorem 3, and |T i | ≤ T , we have: ∀ j ∈ [ n ] : E " X t ∈T − Payoff ( θ ( i ) t , z ( i − t , f t ) j ≥ − O (cid:16) D ( p ) / D ( ˆ p ) / (log( d payoﬀ )) / |T i | / (cid:17) . (10)Because the oﬄine algorithm Offline-IG ( C , F , D , Θ) (Algorithm 1) is an extended ( γ, δ ) -robust approxi-mation, by focusing on rounds in T − and applying Inequality (10), together with linearity of expectation,we have: E " X t ∈T − f t ( z t ) ≥ γ · E " X t ∈T − f t ( z ∗ ) − O (cid:16) D ( p ) / D ( ˆ p ) / N δ (log( d payoﬀ )) / T / (cid:17) , where z ∗ = arg max z ∈C P Tt =1 f t ( z ) is the optimal in-hindsight solution.Finally, note that Bandit-IG ( C , F , D , Θ , AlgBB ) does not explore too often in total among its subprob-lems. More precisely, E (cid:2) |T + | (cid:3) = E " N X i =1 M i ≤ N X i =1 O (cid:16) (log( d payoﬀ )) / E h T / i i(cid:17) ≤ O (cid:16) N (log( d payoﬀ )) / T / (cid:17) . Noting the fact that the functions f t have output value at most , for the remaining rounds T + we have: E " X t ∈T + f t ( z t ) ≥ γ · E " X t ∈T + f t ( z ∗ ) − O (cid:16) N (log( d payoﬀ )) / T / (cid:17) . (11)Combining the two types of bounds in rounds T − and T + yields the desired claim. (cid:4) iazadeh et al.: Online Learning via Oﬄine Greedy C.3. Bandit Blackwell Regret Lowerbound

In this section, we show that in a Bandit Blackwell sequential game, ( X , Y , p , ˆ p ) , the distance from the time-averaged payoﬀ to the target set S plus the time-averaged exploring penalty of any prediction strategy mustbe at least Ω (cid:0) DT − / (cid:1) , where D = min { D ( p ) , D ( p ) } . Put diﬀerently, we show that the performance boundproved in Theorem 3 is unimprovable with respect to T (the number of rounds), i.e., no other strategies canhave a better performance for all problems. Theorem 8.

In a bandit Blackwell sequential game, ( X , Y , p , ˆ p ) , there exists an adversary’s strategy suchthat for every player 1’s strategy, the resulting sequence of actions satisfy: d ∞ T T X t =1 p ( x t , y t ) , S ! + E (cid:20) T D ( p ) · ( ) (cid:21) ≥ Ω (cid:0) DT − / (cid:1) . where ( D = min { D ( p ) , D ( p ) } . Proof of Theorem 8.

Let M be a random variable equal to the number of rounds the player explores.We ﬁrst show in that if the number of rounds that the player explores at is at most M, then there exists aBandit Blackwell instance: an adversary’s action ( y , . . . , y T ) , a convex closed set S , and an aﬃne payoﬀ p together with an unbiased estimator function ˆ p such that: E " d ∞ T T X t =1 p ( x t , y t ) , S ! (cid:12)(cid:12)(cid:12)(cid:12) M ≥ Ω (cid:18) D √ M (cid:19) for any player’s strategies ( x , . . . , x T ) , where the expectation is taken over the randomness in the adversary’sstrategy. We later show this statement in Lemma 4. For now, we assume that the statement is true. Sincethe Bandit Blackwell total regret deﬁned in Deﬁnition 11 includes another term for the cost of exploring,the total regret conditioned on M is E " d ∞ T T X t =1 p ( x t , y t ) , S ! (cid:12)(cid:12)(cid:12)(cid:12) M + E (cid:20) T D ( p ) · ( explore ) (cid:12)(cid:12)(cid:12)(cid:12) M (cid:21) ≥ Ω (cid:18) D √ M (cid:19) + Ω (cid:18) DMT (cid:19) (1) ≥ Ω (cid:0) DT − / (cid:1) , where Inequality (2) follows from setting M = T / ; notice that at M = T / , D √ M = DMT and Ω (cid:16) D √ M (cid:17) +Ω (cid:0) DMT (cid:1) is minimized. Again, the expectation here is with respect to the adversary’s strategy. Now, takinganother expectation with respect to M , we have E " d ∞ T T X t =1 p ( x t , y t ) , S ! + E (cid:20) T D ( p ) · ( explore ) (cid:21) = E " E " d ∞ T T X t =1 p ( x t , y t ) , S ! (cid:12)(cid:12)(cid:12)(cid:12) M + E (cid:20) E (cid:20) T D ( p ) · ( explore ) (cid:12)(cid:12)(cid:12)(cid:12) M (cid:21)(cid:21) ≥ Ω (cid:0) DT − / (cid:1) . This completes the proof. (cid:4)

We now prove the lower bound on the distance from the average payoﬀ to S when the number of explorationrounds is M . As is common in proofs of lower bounds, we construct a random sequence of similar adversariesand show that with M rounds of explorations, it is impossible to distinguish the diﬀerent types of adversarieswithout suﬀering a regret less than D √ M . iazadeh et al.: Online Learning via Oﬄine Greedy Lemma 4.

In a Bandit Blackwell problem, if the number of exploration rounds is at most M, there existsan adversary’s strategy ( y , . . . , y T ) , a convex closed set S , and an aﬃne payoﬀ p together with an unbiasedestimator function ˆ p such that for any strategies ( x , . . . , x T ) , E " d ∞ T T X t =1 p ( x t , y t ) , S ! (cid:12)(cid:12)(cid:12)(cid:12) M ≥ Ω (cid:18) D √ M (cid:19) , where the expectation is taken with respect to the adversary’s strategies.Proof of Lemma 4 We only prove our lower bound for a deterministic player. Note that any randomizedstrategy can be expressed as a randomization of deterministic strategies, and based on Yao’s minimaxprinciple (Yao (1977)), our lower bound still holds when we average them over several deterministic strategiesaccording to some randomization. We refer to Cesa-Bianchi and Lugosi (2006) for details on deriving anidentical lower bound for a randomized adversary from a deterministic adversary.Recall that M is the random variable of the number of rounds that the player explores. Consider a ﬁxed M . From this point onward (until the end of the proof), every probability and expectation are conditionedon M , we remove the dependence on M on the notations for simplicity. Let X = ∆([ n ]) , Y = { , } n , thepayoﬀ function p ( x , y ) = x T y n − y , and S be the non-positive orthant, i.e. S = { s ∈ R n | [ s ] j ≤ ∀ j ∈ [ n ] } . For deterministic strategies, x must be equal to e z for some coordinate z , which happens when player 1 playsaction z ∈ [ n ] . In that case, [ p ( x , y )] j = [ y ] z − [ y ] j for all j ∈ [ n ] . We now deﬁne the adversary’s strategy. For each round t ∈ [ T ] and coordinate j ∈ [ n ] , let [ y t ] j beBernoulli random variables whose joint distribution are deﬁned as follows. We ﬁrst pick a random variable ζ ∼ Uniform { , , . . . , n } . Then, given that ζ = i, [ y ] j , [ y ] j , . . . , [ y T ] j are conditionally independent Bernoullirandom variables with parameter (1 − µ ) / if j = i, and (1 + µ ) / if j = i, where µ < / (will be speciﬁedlater). For analysis purposes, we deﬁne another move for the adversary, which we call the base move: all [ y ] j , [ y ] j , . . . , [ y T ] j are conditionally independent Bernoulli variables with parameter (1 − µ ) / . Supposethat this happens when ζ = 0 (just for ease of notations).Let I t be the player’s action (in { , , . . . , n } ) on round t , and π t be the exploring indicator in round t : π t = 1 if we explore in round t, and otherwise. Let η t = ( π , . . . , π t ) be the history of exploration decisions up toround t. Since the player is deterministic, I t is determined by ( p ( x , y ) , p ( x , y ) , . . . , p ( x t − , y t − ) , η t − ) . Also, let T j = P Tt =1 [ I t = j ] be the number of times action j is played in the ﬁrst T rounds. We furtherdeﬁne P j and E j as P ( ·| ζ = j ) E ( ·| ζ = j ) , respectively. More rigorously, if A is a σ -algebra generated by allpossible outcomes of the game, P j is a measure on the σ − algebra A and E j is an expectation taken withrespect to the conditional probability P j , which solely depends on the adversary’s move since we assume thatplayer 1’s strategy is deterministic. iazadeh et al.: Online Learning via Oﬄine Greedy j ∈ [ n ] , when ζ = j , playing action j has the highest average reward than any otheractions. Then, we have E j " d ∞ T T X t =1 p ( x t , y t ) , S ! (1) ≥ max z ∈ [ n ] E j "" T T X t =1 p ( x t , y t ) z = 1 T max z ∈ [ n ] E j " T X t =1 ([ y t ] z − [ y t ] I t ) ≥ T E j " T X t =1 ([ y t ] j − [ y t ] I t ) = 1 T T X t =1 E j [[ y t ] j − [ y t ] I t ] (2) = 1 T T X t =1 µ E j [ ( I t = j )]= 1 T µ X j ′ = j E j [ T j ′ ] = 1 T µ ( T − E j [ T j ])= (cid:16) µ − µT E j [ T j ] (cid:17) . Inequality (1) follows because S is the non-positive orthant. Equality (2) follows because E j [[ y t ] j ′ ] is (1 + µ ) / when j ′ = j and (1 − µ ) / otherwise, so the diﬀerence between E j [[ y t ] j ] and E j [[ y t ] I t ] is µ when I t = j and otherwise.As for each j ∈ [ n ] , since the event { ζ = j } happens with probability n , we have sup E " d ∞ T T X t =1 p ( x t , y t ) , S ! ≥ µ − nT X j E j [ T j ] ! , (12)where the expectation is taken with respect to the adversary’s move and the supremum is taken over ζ ∈{ , , . . ., n } since the ζ picked by the adversary in the beginning of the game determines his whole strategy.The proof now reduces to bounding E j [ T j ] from above. We do this by comparing E j [ T j ] with E [ T j ] . If player1 chooses action i at round t and decides to explore, i.e., I t = i and π t = 1 , he then observes the payoﬀ [ y t ] i . Recall that y t is the random variable that represents adversary’s move at round t, where y t ∈ Y = { , } n . Forany sequence of history ( H t , η t ) where H t = ([ y ] I , . . . , [ y t ] I t ) = ( h , . . . , h t ) ∈ { , } t and η t = ( π , . . . , π t ) ∈{ , } t , let χ t,j ( H t , η t ) = P j ([ y ] I = h , . . . , [ y t ] I t = h t , η t ) . Note that P j is a measure on the σ − algebra A as mentioned above, and the randomness comes from theadversary’s moves (the adversary plays a randomized y t at time t, where the j th coordinate of y t is a Bernoullivariable with mean either (1 + µ ) / or (1 − µ ) / depending on his choice of ζ at the beginning of the game).From our assumption that the player is deterministic, for any H T ∈ { , } T and history of exploration η T .Then, E i h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i = E h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i , ∀ ≤ i ≤ n, (13) iazadeh et al.: Online Learning via Oﬄine Greedy ζ decided by the adversary is, the player has the same sequenceof moves given the same history. Therefore, E j [ T j ] − E [ T j ] = X H T ∈{ , } T , η T ∈{ , } T χ T,j ( H T , η T ) E j h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i − X H T ∈{ , } T , η T ∈{ , } T χ T, ( H T , η T ) E h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i (1) = X H T ∈{ , } T , η T ∈{ , } T ( χ T,j ( H T , η T ) − χ T, ( H T , η T )) E j h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i ≤ X ( H T , η T ): χ T,j ( H T , η T ) >χ T, ( H T , η T ) ( χ T,j ( H T , η T ) − χ T, ( H T , η T )) E j h T j (cid:12)(cid:12)(cid:12) [ y ] I = h , . . . , [ y T ] I T = h T , η T i (2) ≤ T X ( H T , η T ): χ T,j ( H T , η T ) >χ T, ( H T , η T ) ( χ T,j ( H T , η T ) − χ T, ( H T , η T )) , (14)where Equation (1) follows from Equation (13) and Inequality (2) follows from E j [ T j | [ y ] I = h , . . . , [ y T ] I T = h T , η T ] ≤ T. See that P nj =1 E [ T j ] = T since on each round, player 1’s action is in { , , . . ., n } . We can boundthe total variation using Pinsker’s inequality: X ( H T , η T ): χ T,j ( H T , η T ) >χ T, ( H T , η T ) ( χ T,j ( H T , η T ) − χ T, ( H T , η T )) ≤ r KL ( χ T, ( H T , η T ) || χ T,j ( H T , η T )) . (15)Putting Equations (14) and (15) together and applying Jensen’s inequality to the concave the square rootfunction, we get n n X j =1 E j [ T j ] ≥ n n X j =1  E [ T j ] + T X H T : χ T,j ( H T , η T ) >χ T, ( H T , η T ) ( χ T,j ( H T , η T ) − χ T, ( H T , η T ))  ≤ T n + 1 n n X j =1 r KL ( χ T, ( H T , η T ) || χ T,j ( H T , η T )) ! ≤ T  n + vuut n n X j =1 KL ( χ T, ( H T , η T ) || χ T,j ( H T , η T ))  . (16)Recall that from the deﬁnition of χ t,j , we have the following conditional distribution: χ t,j ( h t | H t − , η t − ) = P j (cid:0) [ y t ] I t = h t | [ y ] I = h , . . . , [ y t − ] I t − = h t − , η t − (cid:1) . Applying the chain rule, we haveKL ( χ T, || χ T,j ) (1) = T X t =1 X H t − , η t − χ t − , ( H t − , η t − ) KL ( χ t, ( ·| H t − , η t − ) || χ t,j ( ·| H t − , η t − )) (2) = T X t =1 X H t − , η t − χ t − , ( H t − , η t − ) ( { I t = j and π t = 1 | H t − , η t − } ) KL (cid:18) − µ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) µ (cid:19) (3) = KL (cid:18) − µ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) µ (cid:19) E " T X t =1 ( { I t = j and π t = 1 } ) . Pinsker’s inequality bounds the total variation distance in terms of KL divergence. For two probability distributions P and Q , Pinsker’s inequality states that || P − Q || TV ≤ q KL ( P || Q ) where || P − Q || TV is the total variation distance sup A {| P ( A ) − Q ( A ) |} over measurable events A . Taking A = { x : P ( x ) > Q ( x ) } , we get P x : P ( x ) >Q ( x ) | P ( x ) − Q ( x ) | ≤ q KL ( P || Q ) . See Section A.2 Cesa-Bianchi and Lugosi (2006) for details. iazadeh et al.:

Online Learning via Oﬄine Greedy (1) follows from applying the chain rule to χ T, and χ T,j . Equation (2) holdsbecause χ t, ( ·| H t − , η t − ) = Ber (cid:0) − µ (cid:1) and χ t,j ( ·| H t − , η t − ) = Ber (cid:0) µ (cid:1) when we play the arm j on round t, I t = j, and observe the payoﬀ, π t = 1 . Otherwise, they are identical and wehave KL ( χ t, ( ·| H t − , η t − ) || χ t,j ( ·| H t − , η t − )) = 0 . Lastly, we get Equation (3) by factoring outKL (cid:0) − µ || µ (cid:1) , and collecting all the probability terms ( χ t, ( H t , µ t ) for all t ) to form the expectation of ( { I t = j and π t = 1 } ) with respect to P . Summing over j and applying KL ( p || q ) ≤ ( p − q ) q (1 − q ) : n X j =1 KL ( χ T, || χ T,j ) = KL (cid:18) − µ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) µ (cid:19) n X j =1 E " T X t =1 ( { I t = j and π t = 1 } ) = KL (cid:18) − µ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) µ (cid:19) E " T X t =1 n X j =1 ( { I t = j and π t = 1 } ) = KL (cid:18) − µ (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) µ (cid:19) E " T X t =1 ( { π t = 1 } ) ≤ µ − µ M, (17)where the last line follows from the assumption that the number of rounds the player explores is M .Putting Equation (12), (16) and (17) altogether we get: E j " d ∞ T T X t =1 p ( x t , y t ) , S ! ≥ µ  − n − s n µ − µ M  ≥ µ − n − µ r M n ! , (18)where the last inequality follows from µ ≤ / . Taking µ = λ p nM , we have: E j " d ∞ T T X t =1 p ( x t , y t ) , S ! ≥ λ r nM (cid:18) − λ √ (cid:19) ≥ Ω (cid:18) √ M (cid:19) = Ω (cid:18) D √ M (cid:19) , (19)where the last equality follows from D ≤ D ( p ) = 1 in this case since the adversary’s move is in { , } n . Weﬁnish the proof by choosing the constant λ to be small enough to ensure that (cid:16) − λ √ (cid:17) is positive. (cid:4) Appendix D: Proofs and Remarks of Section 6.1 – Product Ranking and SequentialSubmodular Maximization

In this appendix, we give the missing proofs of the results from Section 6.1.

D.1. Proof of Theorem 5

Proof . We show that our meta Algorithm 4 works by verifying the following conditions. (i) Algorithm 5 is an extended ( , ) -robust approximation algorithm. We need to show that if thefollowing equation holds for some function h : ∀ j ∈ [ n ] , " T X t =1 Payoff ( ˜ θ ( i ) t , π ( i − t , f t ) j ≥ − h ( T ) , then we must have ∀ π ∗ ∈ Π , T X t =1 E [ f t ( π t )] ≥ T X t =1 f t ( π ∗ ) − nh ( T ) . (20) iazadeh et al.: Online Learning via Oﬄine Greedy j ∈ [ n ] , we have (cid:2) Payoff ( θ ( i ) , π ( i − , f ) (cid:3) j = θ T y ( i ) − [ y ( i ) ] j , where y ( i ) , (cid:2) f ( π ( i − + j e i ) − f ( π ( i − ) (cid:3) j ∈ [ n ] . First, we prove several inequalities that will later be used to prove inequality (20). Sinceeach function f t,i is monotone submodular, we have: λ i i X j =1 (cid:18) f t,i ( { [ π t ] , . . . , [ π t ] j } ) − f t,i ( { [ π t ] , . . . , [ π t ] j − } ) (cid:19) + λ i i X j =1 (cid:18) f t,i ( { [ π t ] , . . . , [ π t ] j − , [ π ∗ ] j } ) − f t,i ( { [ π t ] , . . . , [ π t ] j − } ) (cid:19) (1) ≥ λ i i X j =1 (cid:18) f t,i ( { [ π t ] , . . . , [ π t ] j , [ π ∗ ] , . . . , [ π ∗ ] j } ) − f t,i ( { [ π t ] , . . . , [ π t ] j − , [ π ∗ ] , . . . , [ π ∗ ] j − } ) (cid:19) = λ i f t,i ( { [ π t ] , . . . , [ π t ] i , [ π ∗ ] , . . . , [ π ∗ ] i } ) − λ i f t,i ( ∅ ) (2) ≥ λ i f t,i ( { [ π ∗ ] , . . . , [ π ∗ ] i } ) , where inequality (1) follows from submodularity and inequality (2) follows from monotonicity and non-negativity of f t,i . To be more clear, inequality (1) holds because for each j = 1 , , . . . , i, the sum of the marginalvalues of adding [ π t ] j to π ( j − and adding [ π ∗ ] j to π ( j − is greater than equal to the marginal value ofadding { [ π t ] j , [ π ∗ ] j } to { [ π t ] , . . . , [ π t ] i , [ π ∗ ] , . . . , [ π ∗ ] i } as f t,i is submodular. Recall that f t ( π ) , λ f t, ( { [ π ] } ) + λ f t, ( { [ π ] , [ π ] } ) + . . . + λ n f t,n ( { [ π ] , . . . , [ π ] n } ) ∀ t ∈ [ T ] , so summing the inequalities above for i = 1 , , . . . , n, we get: n X i =1 λ i i X j =1 (cid:18) f t,i ( { [ π t ] , . . . , [ π t ] j } ) − f t,i ( { [ π t ] , . . . , [ π t ] j − } ) (cid:19) + n X i =1 λ i i X j =1 (cid:18) f t,i ( { [ π t ] , . . . , [ π t ] j − , [ π ∗ ] j } ) − f t,i ( { [ π t ] , . . . , [ π t ] j − } ) (cid:19) ≥ n X i =1 λ i f t,i ( { [ π ∗ ] , . . . , [ π ∗ ] i } ) ⇔ n X j =1 n X i = j ( λ i f t,i ( { [ π t ] , . . . , [ π t ] j } ) − λ i f t,i ( { [ π t ] , . . . , [ π t ] j − } ))+ n X j =1 n X i = j ( λ i f t,i ( { [ π t ] , . . . , [ π t ] j − , [ π ∗ ] j } ) − λ i f t,i ( { [ π t ] , . . . , [ π t ] j − } )) ≥ f t ( π ∗ ) ⇔ n X j =1 (cid:16) f t ( π ( j ) t ) − f t ( π ( j − t ) (cid:17) + n X j =1 (cid:16) f t ( π ( j − t + [ π ∗ ] j e j ) − f t ( π ( j − t ) (cid:17) ≥ f t ( π ∗ ) . (21)We get the ﬁrst equivalence by switching the summations. We now use the inequality (21) to prove the ﬁnalclaim, i.e., the desired ineqaulity (20). We have: T X t =1 E [ f t ( π t )] = 12 T X t =1 n X i =1 E h f t ( π ( i ) t ) − f t ( π ( i − t ) i = 12 T X t =1 n X i =1 E h f t ( π ( i ) t ) − f t ( π ( i − t ) i + 12 n X i =1 T X t =1 E h f t ( π ( i ) t ) − f t ( π ( i − t ) i − n X i =1 T X t =1 E h f t ( π ( i ) t + [ π ∗ ] i e i ) − f t ( π ( i − t ) i + 12 n X i =1 T X t =1 E h f t ( π ( i ) t + [ π ∗ ] i e i ) − f t ( π ( i − t ) i iazadeh et al.: Online Learning via Oﬄine Greedy (1) = 12 T X t =1 n X i =1 E h f t ( π ( i ) t ) − f t ( π ( i − t ) i + 12 n X i =1 T X t =1 h Payoff ( ˜ θ ( i ) t , π ( i − t , f t ) i [ π ∗ ] i + 12 n X i =1 T X t =1 E h f t ( π ( i − t + [ π ∗ ] i e i ) − f t ( π ( i − t ) i (2) ≥ T X t =1 n X i =1 E h f t ( π ( i ) t ) − f t ( π ( i − t ) i − nh ( T ) + 12 T X t =1 n X i =1 E h f t ( π ( i ) t + [ π ∗ ] i e i ) − f t ( π ( i − t ) i (3) ≥ T X t =1 f t ( π ∗ ) − nh ( T ) . In the above chain of inequalities, equality (1) follows from the deﬁnition of

Payoff , inequality (2) followsfrom our assumption, and inequality (3) follows from inequality (21). Rearranging the terms will ﬁnish theproof. (ii) Algorithm 5 is bandit Blackwell reducible.

We verify the following conditions based on Deﬁnition 12to show bandit Blackwell reducibility:•

Algorithm 5 is Blackwell reducible.

For each subproblem, consider an instance ( X , Y , p ) of Blackwellwhere X = Θ = ∆([ n ]) and Y = [ − , n . Our Blackwell adversary function is the marginal increase in theobjective function of placing item on position i, AdvB ( π ( i − , f ) = h f ( π ( i − t + j e i ) − f ( π ( i − t ) i j ∈ [ n ] .The biaﬃne payoﬀ is p ( θ , y ) = θ T y n − y , where n is an n -dimensional all ones vector. The targetset S is the non-negative orthant, and it is response-satisﬁable since for every adversary’s action y ∈ Y , the strategy θ = e j ∗ where j ∗ = argmax j ∈ [ n ] [ y ] j results in p ( θ , y ) ≥ . • An unbiased estimator for the Blackwell payoﬀ function p can be constructed. Speciﬁcally, we needto construct an exploration sampling device U that receives ( θ , π ( i − ) in subproblem i and returns ( w exp , π exp ) such that (i) for all f ∈ F , θ ∈ Θ , π ( i − ∈ D , i ∈ [ n ] : ˆ p (cid:0) θ , AdvB ( π ( i − , f ) (cid:1) = f ( π exp ) w exp ,where ( w exp , π exp ) ∼ ExpS ( θ , π ( i − ) , and (ii) ˆ p is an unbiased estimator for the actual payoﬀ, i.e, ∀ θ ∈ Θ , y ∈ Y : E [ ˆ p ( θ , y )] = p ( θ , y ) . The explore sampling device U works as follows. Given a point π ( i − ∈ Π and a parameter θ ∈ Θ , it draws j ∼ Uniform { , , . . . , n } and returns ( w exp , π exp ) = (cid:0) n ( θ j n − e j ) , π ( i − + j e i (cid:1) . Now, ˆ p is an unbiased estimator of p because iazadeh et al.: Online Learning via Oﬄine Greedy E [ ˆ p ( θ , y )] = E (cid:2) ˆ p ( θ , AdvB ( π ( i − , f )) (cid:3) = E [ f ( π exp ) w exp ]= E (cid:20) n θ j f ( π ( i − + j e i ) n − f ( π ( i − + j e i ) e j (cid:21) (1) = n n X j =1 θ j f ( π ( i − + j e i ) − (cid:20) f ( π ( i − + 1 e i ) , f ( π ( i − + 2 e i ) , . . . , f ( π ( i − + n e i ) (cid:21) T (2) = n n X j =1 θ j (cid:0) f ( π ( i − + j e i ) − f ( π ( i − ) (cid:1) − (cid:20) f ( π ( i − + 1 e i ) , f ( π ( i − + 2 e i ) , . . . , f ( π ( i − + n e i ) (cid:21) T + f ( π ( i − ) n = n n X j =1 θ j (cid:0) f ( π ( i − + j e i ) − f ( π ( i − ) (cid:1) − (cid:20) f ( π ( i − + 1 e i ) − f ( π ( i − ) , . . . , f ( π ( i − + n e i ) − f ( π ( i − ) (cid:21) T = θ T y n − y = p ( θ , AdvB ( π ( i − , f )) , where y , [ f ( π ( i − (1) , . . . , π ( i − ( i − , j ) − f ( π ( i − )] j ∈ [ n ] .Here, equation (1) holds because we take j ∼ Uniform { , , . . ., n } , and equation (2) holds because P nj =1 θ j =1 . Intuitively, at every round, U randomly picks one of the items j ∈ [ n ] , and evaluate the marginal beneﬁtof putting element j on the i th position of π ( i − .Putting (i) and (ii) altogether, Algorithm 5 is a ( , ) − extended robust approximation algorithm with n subproblems. Its payoﬀ diameter D ( p ) is O (1) and its payoﬀ estimator diameter D ( ˆ p ) is O ( n ) . The dimensionof vector payoﬀs is also d payoﬀ = n . It is also bandit Blackwell reducible, hence from Theorems 2 and 4: -regret(Algorithm 2 applied on Algorithm 5) ≤ O ( n p T log n ) . -regret(Algorithm 4 applied on Algorithm 5) ≤ O ( n / (log n ) / T / ) . This completes the proof. (cid:4)

D.2. Proof of Corollary 1

Proof.

The proof for the model from Asadpour et al. (2020) is a direct application of Theorem 5 by taking λ i , P u ∼G ( θ u = i ) , the probability that a consumer has patience level i, and f i ( S ) , E u ∼G [ κ u ( S ) | θ u = i ] , the expected probability that a consumer with patience level i clicks on any of the top i products in S , asmentioned in Section 6.1. Thus, the sequential submodular function of interest is the expected probabilitythat a consumer clicks on at least one product when oﬀered an ordering π : f ( π ) = n X i =1 λ i f i ( { π (1) , . . . , π ( i ) } ) = n X i =1 P u ∼G ( θ u = i ) E e ∼G [ κ u ( π ) | θ u = i ] . By invoking Theorem 5, we get the desired O (cid:0) n √ T log n (cid:1) -regret in the full-information setting and O (cid:16) n / (log n ) / T / (cid:17) − regret in the bandit setting.For the special consumer choice model in Ferreira et al. (2019), a consumer is characterized by two param-eters: distribution of clicks for each item q u = ( q u, , . . . , q u,n ) and attention window size k u . A consumer u iazadeh et al.: Online Learning via Oﬄine Greedy k u positions, and an examined item i is clicked with probability q u,i whileunexamined items are never clicked. The events of clicking on two diﬀerent items i or j in the event windoware assumed to be independent. Notice that this is a special case of the choice model by Asadpour et al.(2020), where θ u = k u and κ u ( { π (1) , . . . , π ( θ u ) } = κ u ( { π (1) , . . . , π ( k u ) } = 1 − k u Y i =1 (1 − q u,i ) . The probability of click function κ u is monotone since when X ⊆ Y ⊆ [ n ] , we have Q i ∈ X (1 − q u,i ) ≥ Q i ∈ Y (1 − q u,i ) (as ≤ q u,i ≤ for all u and i ), which implies κ u ( X ) ≤ κ u ( Y ) . It is also submodular, as forall X ⊂ Y ⊆ [ n ] and any item j / ∈ Y, j ∈ [ n ] , we have − Y i ∈ Y \ X (1 − q u,i ) ≥ ⇔ (1 − (1 − q j )) Y i ∈ X (1 − q u,i ) !  − Y i ∈ Y \ X (1 − q u,i )  ≥ ⇔ Y i ∈ X (1 − q u,i ) − (1 − q u,j ) Y i ∈ X (1 − q u,i ) ≥ Y i ∈ Y (1 − q u,i ) − (1 − q u,j ) Y i ∈ Y (1 − q u,i ) ⇔ Y i ∈ X (1 − q u,i ) − Y i ∈ X ∪{ j } (1 − q u,i ) ≥ Y i ∈ Y (1 − q u,i ) − Y i ∈ Y ∪{ j } (1 − q u,i ) ⇔ κ u ( X ∪ { j } ) − κ u ( X ) ≥ κ u ( Y ∪ { j } ) − κ u ( Y ) . Since this choice model is a special case of the choice model in Corollary 1, we can invoke Corollary 1 to getthe desired O (cid:0) n √ T log n (cid:1) -regret in the full-information setting and O (cid:16) n / (log n ) / T / (cid:17) − regret inthe bandit setting. (cid:4) Appendix E: Proofs and Remarks of Section 6.2 – Maximizing Multiple Reserves

In this appendix, ﬁrst provide a discussion on major diﬀerence between Algorithm 6 and the algorithm inRoughgarden and Wang (2019). We then give the missing proofs of the results from Section 6.2. These resultsare restated for convenience.

E.1. Discussion on Algorithm 6

The main diﬀerence between our algorithm and the algorithm in Roughgarden and Wang (2019) is the choiceof revenue-from-reserves function q . Their revenue-from-reserves function q is diﬀerent (coordinate-wise less)than ours. As it becomes more clear later in the proof, the need to design a new revenue-from-reservesfunction stems from our requirement to construct an explore sampling device for the online bandit learningalgorithm. E.2. Proof of Theorem 6

Proof . We will show that our meta Algorithms 2 and 4 work by verifying the following conditions. iazadeh et al.:

Online Learning via Oﬄine Greedy (i) Algorithm 6 is an extended ( , ) -robust approximation algorithm. By Deﬁnition 9, we need toshow that if each coordinate of our vector payoﬀs is bounded by some function h : ∀ j ∈ [ m ] , " T X t =1 Payoff ( i ) ( ˜ θ ( i ) t , r ( i − t , v t ) j ≥ − h ( T ) , then we must have that our overall solution’s error is bounded by: ∀ r ∗ ∈ C , T X t =1 E [ f ( r t , v t )] ≥ · T X t =1 f ( r ∗ , v t ) − nh ( T ) . Recall from Section 6.2, that we deﬁned the j th coordinate of this vector payoﬀ to be ( j ∈ [ m ] ), h Payoff ( i ) ( θ ( i ) , r ( i − , v ) i j , E z ′ ∼ θ ( i ) (cid:2) q ( i ) ( z ′ ) − q ( i ) ( ρ j ) (cid:3) , and that r ( i ) t is the reserve vector after subproblem i . Let’s now deﬁne S i to be the set of rounds where bidder i has the highest bid. We now carry out thestandard oﬄine analysis, but summed over all rounds t ∈ [ T ] . T X t =1 E [ f ( r t , v t )] (1) = 12 T X t =1 [ v t ] ˆ j t + 12 E " n X i =1 X t ∈ S i [ r t ] i h [ r t ] i ∈ [[ v t ] ˆ j t , [ v t ] j ∗ t i (2) = 12 T X t =1 [ v t ] ˆ j t + 12 E " n X i =1 T X t =1 q ( i ) t ([ r t ] i ) = 12 T X t =1 [ v t ] ˆ j t + 12 n X i =1 X t ∈ S i q ( i ) t ([ r ∗ ] i ) + 12 E " n X i =1 T X t =1 q ( i ) t ([ r t ] i ) − n X i =1 X t ∈ S i q ( i ) t ([ r ∗ ] i ) (3) ≥ T X t =1 f ( r ∗ , v t ) + 12 n X i =1 T X t =1 E h q ( i ) t ([ r t ] i ) i − n X i =1 X t ∈ S i q ( i ) t ([ r ∗ ] i ) (4) = 12 T X t =1 f ( r ∗ , v t ) + 12 n X i =1 T X t =1 h Payoff ( i ) ( ˜ θ ( i ) t , z ( i − t , f t ) i b i (5) ≥ T X t =1 f t ( z ∗ ) − nh ( T ) , where j ∗ t and ˆ j t are respectively the highest and second highest bidders in the valuation proﬁle v t . We alsodeﬁned a function q ( i ) t for each round, which is the same as the original q ( i ) except with v replaced by v t .Note that Inequality (1) holds because in each round t , with probability / , the algorithm returns z t = n ,which implies f ( r t , v t ) = [ v t ] ˆ j t and with the same probability, it returns r t = r ( n ) t which implies f ( r t , v t ) is atleast equal to the reserve of the buyer with the highest bid when his reserve is less than his bid. Inequality (2) follows from the deﬁnition of q ( i ) t . Inequality (3) holds because under the optimal reserve price r ∗ , f ( r ∗ , v t ) is less than or equal to the second highest bid [ v t ] ˆ j t when the bidder with the highest bid does not win orthey win and their reserve is less than or equal to [ v t ] ˆ j t ; otherwise, f ( r ∗ , v t ) is equal to q ( i ) t (cid:16) [ r t ] j ∗ t (cid:17) ≥ [ v t ] ˆ j t .Equality (4) follows from the deﬁnition of Payoff ( i ) . In this inequality b i is the index of element [ r ∗ ] i ; thatis, [ r ∗ ] i = ρ b i . Recall that ˜ θ ( i ) t is the (approximately-locally-optimal) distribution from which we are drawing [ r t ] i . Finally, inequality (5) follows from the assumption. This inequality is the desired result. (ii) Algorithm 6 is bandit Blackwell reducible. Per Deﬁnition 12, to show this statement, we will verifythe following conditions: iazadeh et al.:

Online Learning via Oﬄine Greedy

Algorithm 6 is Blackwell reducible.

For every subproblem i ∈ [ n ] , consider an instance (cid:0) X , Y , p ( i ) (cid:1) ofBlackwell where X = Θ = ∆( R ) and Y = [0 , d param , where d param = |R| = m . We can use the Blackwelladversary function (note that we identify adversary functions with valuation vector) AdvB ( i ) ( r , v ) = (cid:2) q ( i ) ( ρ j ) (cid:3) j =1 , ,...,m .The biaﬃne Blackwell payoﬀ is p ( i ) ( θ , y ) = θ T y n − y where n is an n -dimensional all ones vector.Notice that the target set S, the non-negative orthant, is response-satisﬁable because if player 1 plays θ = e j ∗ where j ∗ = arg max j ∈ [ m ] [ y ] j then for every adversary’s action y ∈ Y , p ( i ) ( θ , y ) ≥ .• An unbiased estimator for the Blackwell payoﬀ function p ( i ) can be constructed. We will show that forevery subproblem i ∈ [ n ] , there exists an explore sampling device U ( i ) that returns ( w ( i ) exp , r ( i ) exp ) such that(i) for all f ∈ F , θ ∈ Θ , r ∈ D , i ∈ [ n ] : ˆ p ( i ) (cid:16) θ , AdvB ( i ) ( r , v ) (cid:17) = f ( r ( i ) exp , v ) w ( i ) exp , where (cid:0) w ( i ) exp , r ( i ) exp (cid:1) ∼ ExpS ( i ) ( θ , r ) , and (ii) ˆ p is an unbiased estimator for the actual payoﬀ, i.e, ∀ θ ∈ Θ , y ∈ Y : E [ ˆ p ( i ) ( θ , y )] = p ( i ) ( θ , y ) . More speciﬁcally, we will construct a exploring distribution U ( i ) such that if y = AdvB ( r , f ) for some f ∈ F , r ∈ D , then E [ ˆ p ( θ , y )] = E [ f ( r exp ) w exp ] = p ( θ , y ) , where the expectation is taken withrespect to U ( i ) . Notice that in Deﬁnition 12, U is not indexed by subproblems, but since the AdvB forthis particular problem is subproblem speciﬁc, the distribution U should also depend on the subproblem.Because we would like to construct an unbiased estimator of the actual payoﬀ p ( i ) , which is an aﬃnefunction of y = q ( i ) , we focus on constructing an unbiased estimator for the function q ( i ) . To do so, wemake use of the following representation of q ( i ) : q ( i ) ( r ) = f ( r m , v ) − f ( r ( m − e i ) , v ) . To see what the above equation holds note that when bidder i does not have the highest bid in anauction, both q ( i ) ( r ) and f ( r m , v ) − f ( r ( m − e i ) , v ) , which is the revenue gain of increasing bidder i ’s reserve price from zero to r , are both zero. When bidder i has the highest bid in the auction, q ( i ) ( r ) = r if r ∈ [[ v ] ˆ j , [ v ] j ∗ ] and zero otherwise. Furthermore, the revenue from the reserve price r m ,i.e., f ( r m , v ) , is [ v ] ˆ j if r < [ v ] ˆ j ; r if r ∈ [[ v ] ˆ j , [ v ] j ∗ ] and zero otherwise (the case r > [ v ] j ∗ ). The revenuefrom the reserve price r ( m − e i ) , i.e., f ( r ( m − e i ) , v ) , is [ v ] ˆ j if r < [ v ] ˆ j and zero otherwise. Thus, f ( r m , v ) − f ( r ( m − e i ) , v ) is r if r ∈ [[ v ] ˆ j , [ v ] j ∗ ] and zero otherwise, which is exactly q ( i ) ( r ) . Thisinteresting relationship is depicted in Figure 1.We now deﬁne the sampling distribution U ( i ) : D × Θ → ∆([0 , m × C ) . For each j ∈ [ m ] , we pick: ( w ( i ) exp , r ( i ) exp ) = (2 m ( θ j m − e j ) , ρ j m ) , or ( w ( i ) exp , r ( i ) exp ) = ( − m ( θ j m − e j ) , ρ j ( m − e i )) , with probability m each. Recall that D = C = R n , where R = { ρ , ρ , . . . , ρ m } is the set of possiblereserve prices and ρ j is the j -th largest reserve price in set R . We then have E (cid:2) ˆ p ( i ) ( θ , y ) (cid:3) = E h ˆ p ( i ) ( θ , AdvB ( i ) ( r , v )) i = E [ f ( r ( i ) exp , v ) w ( i ) exp ]= θ T  f ( ρ m , v ) − f ( ρ ( m − e i ) , v ) ... f ( ρ m m , v ) − f ( ρ m ( m − e i ) , v )  −  f ( ρ m , v ) − f ( ρ ( m − e i ) , v ) ... f ( ρ m m , v ) − f ( ρ m ( m − e i ) , v )  iazadeh et al.: Online Learning via Oﬄine Greedy rf ( r m , v )[ v ] ˆ j [ v ] j ∗ [ v ] ˆ j r rf ( r ( m − e i ) , v )[ v ] ˆ j [ v ] j ∗ [ v ] ˆ j r rq ( i ) ( r ) [ v ] ˆ j [ v ] j ∗ r Figure 1 The function q ( i ) (right) and the two functions we combine to get it (left, center). The solid red linedenotes the function value when i is the highest bidder, and the dashed blue line denotes the function value when i isnot the highest bidder. = θ T  q ( i ) ( ρ ) ... q ( i ) ( ρ m )  −  q ( i ) ( ρ ) ... q ( i ) ( ρ m )  = θ T y n − y = p ( i ) ( θ , y ) . Wrapping up, Algorithm 6 is an extended (cid:0) , (cid:1) -robust approximation algorithm with n subproblems andwith a payoﬀ diameter D ( p ) of O (1) and a payoﬀ estimator diameter D ( ˆ p ) of O ( m ) . It is also banditBlackwell reducible. Therefore, from Theorems 2 and 4: -regret(Algorithm 2 applied on Algorithm 6) ≤ O ( nT / log / m )12 -regret(Algorithm 4 applied on Algorithm 6) ≤ O ( nm / T / log / m ) . This completes the proof. (cid:4)

E.3. Proof of Corollary 2

Proof . Let m ∈ Z + be a parameter we choose later to balance terms. We invoke Theorem 6 with thediscretization R = { , m , m , . . . , } . Given any reserves r ∗ ∈ [0 , n , we can produce rounded reserves ˜ r ∗ deﬁned by rounding every reserve down to the nearest multiple of m : [˜ r ] i = m ⌊ m [ r ] i ⌋ . Importantly, thisnever causes any bidder to fail to clear their reserve price (this is why we must round down and cannotround up). Hence this can only grow the set of bidders that clear their reserve and hence the maximum bidfrom this set can only increase. If a bidder that was already in this set proceeds to win the auction, thenthey are only competing with more bidders and their reserve price drops by at most /m , so their paymentcan only drop by at most /m . If a bidder not previously in this set proceeds to win the auction, then theirreserve price used to be higher than their valuation, but their valuation must be higher than the previouswinner’s valuation. They pay at least their reserve less m , so the revenue of the auction drops by at most /m in this case as well. Hence the (summed) discretization error is T m , and we choose either m = n T / (full-information) or m = n − / T / (bandit) to obtain: O (cid:0) nT / log / m (cid:1) + T m = O (cid:0) nT / log / T (cid:1) O (cid:0) nm / T / log / m (cid:1) + T m = O (cid:0) n / T / log / ( nT ) (cid:1) iazadeh et al.: Online Learning via Oﬄine Greedy (cid:4)

Appendix F: Proofs and Remarks of Section 6.3 – Non-monotone SubmodularMaximization

In this appendix, we ﬁrst discuss the diﬀerences between Algorithm 7 and the bi-greedy algorithm byNiazadeh et al. (2018), and show that despite these diﬀerences Algorithm 7 obtains the same approximationfactor as that of the bi-greedy algorithm. We then present the proof of Theorem 7.

F.1. Discussion on Algorithm 7

Algorithm 7 is a modiﬁcation of the bi-greedy algorithm by Niazadeh et al. (2018). But, as we show inthis section, these modiﬁcations do not change the / approximation factor of the bi-greedy algorithm.We modify the bi-greedy algorithm to better satisfy the form of Algorithm 1, ease our construction of thesampling device and unbiased estimators in the bandit case, and provide a uniﬁed framework for submodularfunctions with a more general domain. The major diﬀerences and their corresponding reasons are as follows:• To cover a more general discrete function domain, we optimize over points in the discrete set R whiletheir algorithm optimizes over [0 , n implemented by casting an ǫ − net.• To help us construct the sampling device and unbiased estimators for the bandit case, in our localoptimization step, we use ζ ( i ) (ˆ z, z ′ ) , which is a linear combination of marginal functions α ( i ) and β ( i ) ,rather than max (cid:8) α ( i ) (ˆ z ) − α ( i ) ( z ′ ) , β ( i ) (ˆ z ) − β ( i ) ( z ′ ) (cid:9) , in quantifying the value decrease of ˆ z . Recall thatin this step, we choose θ ( i ) ∈ ∆( R ) so that for all ˆ z ∈ R , E z ′ ∼ θ ( i ) (cid:2) α ( i ) ( z ′ ) + β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ) (cid:3) ≥ .Using the technique in Niazadeh et al. (2018), as we argue next, we can still ﬁnd θ ( i ) that satisﬁes thecondition in the local optimization step. Note that the bi-greedy analysis in Niazadeh et al. (2018) provesthat satisfying this condition implies that Algorithm 7 is a -approximation algorithm for the discretizedsubmodular maximization problem. Satisfying the Local Optimization Step.

Here, we show how to choose θ ( i ) ∈ ∆( R ) that satisﬁes thecondition in the local optimization step of Algorithm 7. To do so, First, we choose z ℓ ∈ arg max z ∈R f ( z, ¯ z ( i − ) and z u ∈ arg max z ∈R f ( z, z ¯ ( i − ) . Then, we look at these two cases. Case (i): z u ≤ z ℓ . We want to prove that deterministically returning z ℓ ( θ ( i ) puts all its weight on z ℓ )suﬃces. The key realization is that in this case, z u and z ℓ maximize the functions f ( · , ¯ z ( i − ) and f ( · , z ¯ ( i − )) respectively: f ( z ℓ , ¯ z ( i − ) ≥ f ( z u , ¯ z ( i − ) , f ( z u , z ¯ ( i − ) ≥ f ( z ℓ , z ¯ ( i − ) . (22)We know by submodularity that two points are better than their coordinate-wise max and min: f ( z u , z ¯ ( i − ) + f ( z ℓ , ¯ z ( i − ) ≤ f ( z ℓ , z ¯ ( i − ) + f ( z u , ¯ z ( i − ) . (23)Since adding up the ﬁrst two inequalities in Equation (22) yields the third inequality in Equation (23), butwith the direction reversed, we know all three must hold with equality. We conclude by noting that since z ℓ maximizes both functions, it also maximizes both α ( i ) and β ( i ) at some nonnegative value and hence satisﬁesthe desired condition in the local step optimization. iazadeh et al.: Online Learning via Oﬄine Greedy Case (ii): z ℓ < z u . Suppose that the algorithm is able to ﬁnd a θ ( i ) such that for any ˆ z ∈ [ z ℓ , z u ] , we have E z ′ ∼ θ ( i ) (cid:20) α ( i ) ( z ′ ) + 12 β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ) (cid:21) = 12 α ( i ) ( z ′ ) + 12 β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ′ ) ≥ . (24)We claim that this equation is still true for ˆ z outside of the interval [ z ℓ , z u ] .Suppose that ˆ z < z ℓ . By the choice of z ℓ , we know that β ( i ) ( z ℓ ) ≥ β ( i ) (ˆ z ) . By submodularity, we know that: f (ˆ z, z ¯ ( i − ) + f ( z ℓ , ¯ z ( i − ) ≤ f ( z ℓ , z ¯ ( i − ) + f (ˆ z, ¯ z ( i − ) α ( i ) (ˆ z ) + β ( i ) ( z ℓ ) ≤ α ( i ) ( z ℓ ) + β ( i ) (ˆ z ) β ( i ) ( z ℓ ) − β ( i ) (ˆ z ) ≤ α ( i ) ( z ℓ ) − α ( i ) (ˆ z ) . Since the LHS is nonnegative by the choice of z ℓ , so is the RHS. We have shown that z ℓ has strictly larger α ( i ) and β ( i ) values (than ˆ z ) and hence inequality in Equation (24) must be valid for ˆ z < z ℓ as well. Analogousreasoning shows the same for the z r < ˆ z case. Notice that the method in Niazadeh et al. (2018) is able tocompute a θ ( i ) that guarantees E z ′ ∼ θ ( i ) (cid:2) α ( i ) ( z ′ ) + β ( i ) ( z ′ ) − ζ ( i ) (ˆ z, z ) (cid:3) ≥ for any ˆ z ∈ [ z ℓ , z u ] , which meansthis is also true for any ˆ z ∈ R . Recall that the payoﬀ function is (cid:2)

Payoff (cid:0) θ , z ¯ ( i − , f (cid:1)(cid:3) j = E z ′ ∼ θ (cid:20) α ( i ) ( z ′ ) + 12 β ( i ) ( z ′ ) − ζ ( i ) ( ρ j , z ) (cid:21) , so such θ ( i ) also guarantees that Payoff (cid:0) θ ( i ) , z ¯ ( i − , f (cid:1) is in the positive orthant. F.2. Proof of Theorem 7

Proof . We will show that our meta Algorithms 2 and 4 work by verifying the following conditions. (i) Algorithm 6 is an extended ( , ) -robust approximation algorithm. Following the analysis of thebi-greedy algorithm in Buchbinder et al. (2015), we consider three sequences of points: the lower boundsequence z ¯ ( i ) , the upper bound sequence ¯ z ( i ) , and the hybrid-optimal sequence z ∗ ( i ) . The key proof idea isto bound the decrease in the hybrid-optimal sequence value z ∗ ( i ) with the total increase in the lower boundand upper bound sequence values. We deﬁne z ¯ ( i ) and ¯ z ( i ) to agree on the ﬁrst i coordinates, while the restof the coordinates are ρ for z ¯ ( i ) and ρ m for ¯ z ( i ) . The hybrid-optimal sequence starts from z ∗ (0) , z ∗ , then z ∗ ( i ) is equal to z ∗ ( i − but with the i th coordinate replaced with the sampled z ′ i ∼ θ ( i ) .Importantly, if the i th -coordinate of the optimal vector z ∗ , which is z ∗ i , is less than our sampled point z ′ i from the i th subproblem/iteration, then the loss in value of the hybrid-optimal sequence is bounded by adiﬀerence of two β ( i ) evaluations. In particular, the submodularity of f implies: f ( z ∗ i , z ∗ ( i − − i ) + f ( z ′ i , ¯ z ( i − − i ) ≤ f ( z ′ i , z ∗ ( i − − i ) + f ( z ∗ i , ¯ z ( i − − i ) f ( z ∗ ( i − ) + β ( i ) ( z ′ i ) ≤ f ( z ∗ ( i ) ) + β ( i ) ( z ∗ i ) f ( z ∗ ( i − ) − f ( z ∗ ( i ) ) ≤ β ( i ) ( z ∗ i ) − β ( i ) ( z ′ i ) . There is also the symmetric case where the i th -coordinate of the optimal vector z ∗ i is greater than oursampled point z ′ i from the the i th subproblem: f ( z ′ i , z ¯ ( i − − i ) + f ( z ∗ i , z ∗ ( i − − i ) ≤ f ( z ∗ i , z ¯ ( i − − i ) + f ( z ′ i , z ∗ ( i − − i ) α ( i ) ( z ′ i ) + f ( z ∗ ( i − ) ≤ α ( i ) ( z ∗ i ) + f ( z ∗ ( i ) ) f ( z ∗ ( i − ) − f ( z ∗ ( i ) ) ≤ α ( i ) ( z ∗ i ) − α ( i ) ( z ′ i ) . iazadeh et al.: Online Learning via Oﬄine Greedy ζ ( i ) ): f ( z ∗ ( i − ) − f ( z ∗ ( i ) ) ≤ ζ ( i ) ( z ∗ i , z ′ i ) . (25)Also, just by the deﬁnition of α ( i ) and β ( i ) we know that: f ( z ¯ ( i ) ) − f ( z ¯ ( i − ) = α ( i ) ( z ′ i ) (26) f ( ¯ z ( i ) ) − f ( ¯ z ( i − ) = β ( i ) ( z ′ i ) . (27)We are now ready to consider the θ ( i ) t which guarantees that for all i ∈ [ n ] and ˆ z ∈ R , including z ∗ i : T X t =1 E z ′ i ∼ θ ( i ) t (cid:20) α ( i ) t ( z ′ i ) + 12 β ( i ) t ( z ′ i ) − ζ ( i ) t ( z ∗ i , z ′ i ) (cid:21) ≥ − h ( T ) . Note that α ( i ) t β ( i ) t , and ζ ( i ) t are respectively obtained by replacing f with f t in the deﬁnition of α ( i ) β ( i ) ,and ζ ( i ) . We sum those inequalities together and then apply Equations (25), (26), and (27): − nh ( T ) ≤ T X t =1 n X i =1 E z ′ i ∼ θ ( i ) t (cid:20) α ( i ) t ( z ′ i ) + 12 β ( i ) t ( z ′ i ) − ζ ( i ) t ( z ∗ i , z ′ i ) (cid:21) ≤ T X t =1 n X i =1 E (cid:20) (cid:2) f t ( z ¯ ( i ) ) − f t ( z ¯ ( i − ) (cid:3) + 12 (cid:2) f t ( ¯ z ( i ) ) − f t ( ¯ z ( i − ) (cid:3) − h f t ( z ∗ ( i − ) − f t ( z ∗ ( i ) ) i(cid:21) = T X t =1 E (cid:20) (cid:2) f t ( z ¯ ( n ) ) − f t ( z ¯ (0) ) (cid:3) + 12 (cid:2) f t ( ¯ z ( n ) ) − f t ( ¯ z (0) ) (cid:3) − h f t ( z ∗ (0) ) − f t ( z ∗ ( n ) ) i(cid:21) = T X t =1 E   f t ( z t ) − f t ( z ¯ (0) ) | {z } ≥  + 12  f t ( z t ) − f t ( ¯ z (0) ) | {z } ≥  − [ f t ( z ∗ ) − f t ( z t )]  ≤ T X t =1 E [2 f t ( z t ) − f t ( z ∗ )] . See that the fourth equality is because the algorithm returns z t = z ¯ ( n ) = ¯ z ( n ) at round t. We ﬁnish by movingterms between sides and dividing by two: T X t =1 E [2 f t ( z t ) − f t ( z ∗ )] ≥ − nh ( T ) T X t =1 E [ f t ( z t )] ≥ T X t =1 f t ( z ∗ ) − nh ( T ) . Thus, our algorithm is an extended ( , ) -robust approximation. (ii) Algorithm 7 is bandit Blackwell reducible. We ﬁrst show that Algorithm 7 is Blackwell reducible.Consider an instance ( X , Y , p ) of Blackwell where X , Θ = ∆( R ) and Y , ∆( C × F ) = ∆( R [ n ] × F ) . Oursynthetic Blackwell adversary function is the deterministic distribution that has weight on its input (point,function) pair and anywhere else, i.e. AdvB ( z , f ) = κ where κ ( z , f ) = 1 . The (asymmetric) biaﬃne Blackwellpayoﬀ p is the expectation of the Payoff function from Equation (4) over its second input: p ( θ , κ ) , E ( z ,f ) ∼ κ [ Payoff ( θ , z , f )] . iazadeh et al.: Online Learning via Oﬄine Greedy S is response-satisﬁable since given any player 2 distribution κ over (point, function)pairs, we can convert each pair into the marginal functions α ( i ) and β ( i ) . Averaging these marginal functionstogether according to their likelihood in κ does not impact the submodularity fact we require for our proofs.We can think of p ( θ , κ ) as p ( θ , κ ) , E ( z ,f ) ∼ κ [ Payoff ( θ , z , f )] = Payoff ( θ , z , f ′ ) , for another submodular function f ′ ∈ F because a weighted average of submodular function is submodular.Since for any submodular functions f ∈ F and z ∈ C , we show that we can ﬁnd θ such that Payoff ( θ , z , f ) ≥ , for any κ, the algorithm can ﬁnd θ such that p ( θ , κ ) is in S . Therefore, Algorithm 7 is Blackwell reducible.To show that Algorithm 7 is bandit Blackwell reducible, we need to construct an unbiased estimatorfor p and an explore sampling device U. In subproblem i, U receives pairs of the form ( θ , z ¯ ( i − ) andreturns ( w exp , z exp ) such that (i) for all f ∈ F , θ ∈ Θ , z ¯ ( i − ∈ D , ˆ p (cid:0) θ , AdvB ( z ¯ ( i − , f ) (cid:1) = f ( z exp ) w exp where ( w exp , z exp ) ∼ U ( θ , z ¯ ( i − ) , and (ii) ˆ p is an unbiased estimator for the actual payoﬀ, i.e. ∀ θ ∈ Θ , κ ∈ Y , wehave E [ ˆ p ( θ , κ )] = p ( θ , κ ) . Because we would like to construct an unbiased estimator of the actual payoﬀ p , which is an expectation(over κ ) of the payoﬀ function Payoff , which is further an aﬃne combination of the functions α ( i ) , β ( i ) , and ζ ( i ) on R , we construct unbiased estimators from function evaluations for these functions. Observe thatgiven z ¯ ( i − , U can immediately reconstruct the corresponding upper bound point: ¯ z ( i − ← z ¯ ( i − ∨ ( ρ , . . . , ρ | {z } ﬁrst ( i − coordinates , ρ m , . . . , ρ m | {z } last ( n − i + 1) coordinates ) T = ( z ′ , . . . , z ′ i − , ρ m , . . . , ρ m | {z } last ( n − i + 1) coordinates ) T . We can use z ¯ ( i − and ¯ z ( i − to express the marginal functions α ( i ) and β ( i ) , α ( i ) ,  α ( i ) ( ρ ) α ( i ) ( ρ ) ... α ( i ) ( ρ m )  =  f ( ρ , z ¯ ( i − − i ) − f ( z ¯ ( i − ) f ( ρ , z ¯ ( i − − i ) − f ( z ¯ ( i − ) ... f ( ρ m , z ¯ ( i − − i ) − f ( z ¯ ( i − )  β ( i ) ,  β ( i ) ( ρ ) β ( i ) ( ρ ) ... β ( i ) ( ρ m )  =  f ( ρ , ¯ z ( i − − i ) − f ( ¯ z ( i − ) f ( ρ , ¯ z ( i − − i ) − f ( ¯ z ( i − ) ... f ( ρ m , ¯ z ( i − − i ) − f ( ¯ z ( i − )  . These can be used in turn to express our comparison function ζ ( i ) : ζ ( i ) ,  ζ ( i ) ( ρ , ρ ) ζ ( i ) ( ρ , ρ ) · · · ζ ( i ) ( ρ , ρ m ) ζ ( i ) ( ρ , ρ ) ζ ( i ) ( ρ , ρ ) · · · ζ ( i ) ( ρ , ρ m ) ... ... . . . ... ζ ( i ) ( ρ m , ρ ) ζ ( i ) ( ρ m , ρ ) · · · ζ ( i ) ( ρ m , ρ m )  = diag (cid:0) α ( i ) (cid:1) L m,m − L m,m diag (cid:0) α ( i ) (cid:1) + diag (cid:0) β ( i ) (cid:1) U m,m − U m,m diag (cid:0) β ( i ) (cid:1) , where L m,m is the lower-triangular matrix deﬁned by [ L m,m ] i,j = [ i > j ] and U m,m is the upper-triangularmatrix deﬁned by [ U m,m ] i,j = [ i < j ] . Our desired payoﬀ function can be expressed using all three of thesefunctions: Payoff ( θ , z ¯ ( i − , f ) = (cid:20) m (cid:0) α ( i ) (cid:1) T + 12 m (cid:0) β ( i ) (cid:1) T + (cid:0) ζ ( i ) (cid:1)(cid:21) θ , iazadeh et al.: Online Learning via Oﬄine Greedy m is the m -dimensional all-ones vector. By using matrix notation, we have managed to clearly expressour desired payoﬀ function as the linear combination of many function evaluations.We now deﬁne the explore sampling distribution U : Θ × D → ∆( R m × C ) as follows. With probability,we return the point z exp = z ¯ ( i − and weight vector w exp = ( − diag ( m ) θ = ( − m , where diag ( m ) isthe identity matrix with size m × m . With probability, we return the point z exp = ¯ z ( i − and weightvector w exp = ( − diag ( m ) θ = ( − m . For i = 1 , ..., m , with m probability we return z exp = ( ρ i , z ¯ ( i − − i ) and w exp = (4 m ) (cid:2) m e Ti + diag ( e i ) L m,m − L m,m diag ( e i ) (cid:3) θ . For i = 1 , ..., m , with m probability we return thepoint z exp = ( ρ i , ¯ z ( i − − i ) and weight vector w exp = (4 m ) (cid:2) m e Ti + diag ( e i ) U m,m − U m,m diag ( e i ) (cid:3) θ . Observethat, at subproblem i (essentially by construction): E [ ˆ p ( θ , κ )] = E ( z ¯ ( i − ,f ) ∼ κ E (cid:2) ˆ p (cid:0) θ , AdvB ( z ¯ ( i − , f ) (cid:1)(cid:3) = E ( z ¯ ( i − ,f ) ∼ κ E ( w exp , z exp ) ∼ ExpS ( θ , z ¯ ( i − ) [ f ( z exp ) w exp ] , where E ( w exp , z exp ) ∼ ExpS ( θ , z ¯ ( i − ) [ f ( z exp ) w exp ]= 14 f ( z ¯ ( i − ) [( − diag ( m ) θ ] + 14 f ( ¯ z ( i − ) [( − diag ( m ) θ ]+ m X i =1 m f ( ρ i , z ¯ ( i − − i ) (cid:20) (4 m ) (cid:20) m e Ti + diag ( e i ) L m,m − L m,m diag ( e i ) (cid:21) θ (cid:21) + m X i =1 m f ( ρ i , ¯ z ( i − − i ) (cid:20) (4 m ) (cid:20) m e Ti + diag ( e i ) U m,m − U m,m diag ( e i ) (cid:21) θ (cid:21) = f ( z ¯ ( i − ) (cid:20) − m Tm − diag ( m ) L m,m + L m,m diag ( m ) (cid:21) θ + f ( ¯ z ( i − ) (cid:20) − m Tm − diag ( m ) U m,m + U m,m diag ( m ) (cid:21) θ + m X i =1 f ( ρ i , z ¯ ( i − − i ) (cid:20)(cid:20) m e Ti + diag ( e i ) L m,m − L m,m diag ( e i ) (cid:21) θ (cid:21) + m X i =1 f ( ρ i , ¯ z ( i − − i ) (cid:20)(cid:20) m e Ti + diag ( e i ) U m,m − U m,m diag ( e i ) (cid:21) θ (cid:21) = (cid:20) m ( α ( i ) ) T + 12 m ( β ( i ) ) T + ζ ( i ) (cid:21) θ = Payoff ( θ , z ¯ ( i − , f ) . This explore sampling device also clearly runs in polynomial-time. Finally, we have E [ ˆ p ( θ , κ )] = E ( z ¯ ( i − ,f ) ∼ κ E ( w exp , z exp ) ∼ ExpS ( θ , z ¯ ( i − ) [ f ( z exp ) w exp ]= E ( z ¯ ( i − ,f ) ∼ κ (cid:2) Payoff ( θ , z ¯ ( i − , f ) (cid:3) = p ( θ , κ ) . This completes the proof of bandit Blackwell reducibility.For our bounds, we care about both the ℓ ∞ diameter of the payoﬀ D ( p ) and the ℓ ∞ diameter of the payoﬀestimator D ( ˆ p ) . The former is bounded by O (1) , since for any θ , the payoﬀ function is a linear combinatrionof O (1) function evaluations with O (1) coeﬃcients. The latter is bounded by O ( m ) since aside from the iazadeh et al.: Online Learning via Oﬄine Greedy O (4 m ) -scaling, the function evaluation yields a result in the range [0 , and the remaining terms have O (1) norms: k m k ∞ = 1 (cid:13)(cid:13)(cid:13)(cid:13) m e Ti θ (cid:13)(cid:13)(cid:13)(cid:13) ∞ = 12 [ θ ] i ≤ k diag ( e i ) L m,m θ k ∞ = X ji [ θ ] j ≤ k U m,m diag ( e i ) k ∞ = [ θ ] i ≤ . We complete the proof by applying Theorem 2 and Theorem 4, noting that our payoﬀ dimension d equalsthe number of potential values that a coordinate can take, m : -regret(Algorithm 2 applied on Algorithm 7) ≤ O ( nT / log / m )12 -regret(Algorithm 4 applied on Algorithm 7) ≤ O ( nm / T / log / m ) . (cid:4) F.3. Proof of Corollaries 3 and 4

Related Researches

CaPC Learning: Confidential and Private Collaborative Learning

by Christopher A. Choquette-Choo

Label Smoothed Embedding Hypothesis for Out-of-Distribution Detection

by Dara Bahri

SLAPS: Self-Supervision Improves Structure Learning for Graph Neural Networks

by Bahare Fatemi

On Theory-training Neural Networks to Infer the Solution of Highly Coupled Differential Equations

by M. Torabi Rad

Spherical Message Passing for 3D Graph Networks

by Yi Liu

Target Training Does Adversarial Training Without Adversarial Samples

by Blerta Lindqvist

Domain Invariant Representation Learning with Domain Density Transformations

by A. Tuan Nguyen

Sparsification via Compressed Sensing for Automatic Speech Recognition

by Kai Zhen

Using Deep LSD to build operators in GANs latent space with meaning in real space

by J. Quetzalcoatl Toledo-Marin

Consensus Based Multi-Layer Perceptrons for Edge Computing

by Haimonti Dutta

RL for Latent MDPs: Regret Guarantees and a Lower Bound

by Jeongyeol Kwon

Scheduling the NASA Deep Space Network with Deep Reinforcement Learning

by Edwin Goh

Backdoor Scanning for Deep Neural Networks through K-Arm Optimization

by Guangyu Shen

On Explainability of Graph Neural Networks via Subgraph Explorations

by Hao Yuan

More Is More -- Narrowing the Generalization Gap by Adding Classification Heads

by Roee Cates

Benchmarking Deep Graph Generative Models for Optimizing New Drug Molecules for COVID-19

by Logan Ward

Estimation and Applications of Quantiles in Deep Binary Classification

by Anuj Tambwekar

Emotion Transfer Using Vector-Valued Infinite Task Learning

by Alex Lambert

Measuring Progress in Deep Reinforcement Learning Sample Efficiency

by Florian E. Dorner

Bounded Memory Active Learning through Enriched Queries

by Max Hopkins

Adversarially Trained Models with Test-Time Covariate Shift Adaptation

by Jay Nandy

Transfer learning based few-shot classification using optimal transport mapping from preprocessed latent space of backbone neural network

by Tomáš Chobola

Regularization Strategies for Quantile Regression

by Taman Narayan

Large-Scale Training System for 100-Million Classification at Alibaba

by Liuyihan Song

Classifier Calibration: with implications to threat scores in cybersecurity

by Waleed A. Yousef

«

1

2

3

4

»

Submitted on 18 Feb 2021 Updated

arXiv.org Original Source

NASA ADS

Google Scholar

Semantic Scholar