Learning to Persuade on the Fly: Robustness Against Ignorance
LLearning to Persuade on the Fly: Robustness Against Ignorance
You Zu , Krishnamurthy Iyer , and Haifeng Xu Industrial and Systems Engineering, University of Minnesota, Minneapolis, MN { zu000002,kriyer } @umn.edu Department of Computer Science, University of Virginia, Charlottesville, VA [email protected]
February 23, 2021
Abstract
We study a repeated persuasion setting between a sender and a receiver, where at each time t ,the sender observes a payoff-relevant state drawn independently and identically from an unknown prior distribution, and shares state information with the receiver, who then myopically choosesan action. As in the standard setting, the sender seeks to persuade the receiver into choosingactions that are aligned with the sender’s preference by selectively sharing information aboutthe state. However, in contrast to the standard models, the sender does not know the prior, andhas to persuade while gradually learning the prior on the fly.We study the sender’s learning problem of making persuasive action recommendations toachieve low regret against the optimal persuasion mechanism with the knowledge of the priordistribution. Our main positive result is an algorithm that, with high probability, is persuasiveacross all rounds and achieves O ( √ T log T ) regret, where T is the horizon length. The corephilosophy behind the design of our algorithm is to leverage robustness against the sender’signorance of the prior. Intuitively, at each time our algorithm maintains a set of candidatepriors, and chooses a persuasion scheme that is simultaneously persuasive for all of them. Todemonstrate the effectiveness of our algorithm, we further prove that no algorithm can achieveregret better than Ω( √ T ), even if the persuasiveness requirements were significantly relaxed.Therefore, our algorithm achieves optimal regret for the sender’s learning problem up to termslogarithmic in T . Examples of online platforms recommending content or products to their users abound in onlineeconomy. For instance, a marketplace like Etsy recommends vintage items made by independentsellers, a styling service like Stitch Fix recommends clothing designs made by custom brands, andan online platform like YouTube recommends content generated by independent channels to itsusers. Often, the platform making such recommendations must balance the dual objectives of beingpersuasive (a.k.a., being obedient [Bergemann and Morris, 2016]), i.e., make recommendations thatwill be adopted by the users, as well as furthering the platform’s goals, such as increased sales, fewerreturns or more engaged users.To make a concrete illustration, consider a platform that recommends content created byindependent creators (“channels”) to its users. New channels regularly join the platform, and the distribution of the quality and/or relevance of a channel’s content is initially unknown to the platform.1 a r X i v : . [ c s . G T ] F e b he users of the platform seek to consume fresh and high-quality content, while the platform itselfmay have other goals, such as maximizing content consumption, that is not fully aligned with users’interests. For each new content from a channel, the platform observes its quality/relevance (perhapsafter an initial exploration or through in-house reviewers) and decides whether or not to recommendthe content to its users. If the channel’s content quality distribution is known, the platform canreliably make such recommendations that optimize its own goals while maintaining user satisfaction.However, in reality, the platform typically does not know the quality of a new channel, and thusmust learn to make such persuasive recommendations over time.In this paper, we study the problem faced by such a platform learning to make recommendationsover time. Formally, we study a repeated persuasion setting between a sender and a receiver, whereat each time t , the sender shares information about a payoff-relevant state with the receiver. Thestate at each time t is drawn independently and identically from an unknown distribution, andsubsequent to receiving information about it, the receiver (myopically) chooses an action from afinite set. The sender seeks to persuade the receiver into choosing actions that are aligned with herpreference by selectively sharing information about the state.In contrast to the standard persuasion setting, we focus on the case where neither the sendernor the receiver knows the distribution of the payoff relevant state. Instead, the sender learns thisdistribution over time by observing the state realizations. We adopt the assumption common in theliterature on Bayesian persuasion that at each time period, prior to observing the realized statein that period, the sender commits to a signaling mechanism that maps each state to a possiblyrandom action recommendation . Subsequent to the state observation, the sender recommends anaction as per the chosen signaling mechanism.One natural requirement is for the sender to make recommendations that the receiver will findoptimal to follow, i.e., recommendations that are persuasive. In the case where the receiver knowsthe prior, this requirement is easily justified from the observation that a rational Bayesian receiverwill update her beliefs after receiving the recommendation and choose an action that maximizes herexpected utility w.r.t. to her posterior beliefs. However, even if the receiver does not know the prior,practical considerations such as building and maintaining a reputation can make persuasivenessimportant to the sender. This is especially so in our example settings, where the sender is a long-livedplatform and the receiver corresponds to a stream of users [Rayo and Segal, 2010], who may be ableto verify the quality of recommendations ex post .A sender who simply recommends the receiver’s best action at the realized state will certainly bepersuasive, but may end up with a significant loss in utility when compared to her utility if she hadknown the prior. Thus, the sender seeks to make persuasive action recommendations that achievelow regret against the optimal signaling mechanism with the knowledge of the prior distribution.The primary contribution of this work is an efficient algorithm that, with high probability, makespersuasive action recommendations and at the same time achieves vanishing average regret. Thealgorithm we propose proceeds by maintaining at each time a set of candidate priors, based onthe observed state realizations in the past. To compensate for the ignorance of the true prior, thealgorithm resorts to robustness: it chooses a signaling mechanism that is simultaneously persuasivefor each of the candidate priors and maximizes the sender’s utility. Due to this aspect of thealgorithm, we name it the Robustness against Ignorance ( Rai ) algorithm.Our main positive result, Theorem 2, establishes that for any persuasion setting satisfying certainregularity conditions, the
Rai algorithm achieves O ( √ T log T ) regret with high probability, where T is the horizon length. To show this result, we define a quantity Gap that measures the sender’s cost of robust persuasion . Formally,
Gap ( µ, B ) captures the loss in the sender’s expected utility(under belief µ ) from using a signaling mechanism that is persuasive for all beliefs in the set B , asopposed to using one that is persuasive only for the belief µ . The crux of the proof argument lies at2howing, in Proposition 1, that the sender’s cost of robust persuasion Gap ( µ, B ) is at most linear inthe radius of the set B . This is achieved via an explicit construction of a signaling mechanism thatis persuasive for all beliefs in B and achieves sender’s utility close to optimum.We strengthen our contribution by proving in Theorem 3 a matching lower-bound (up to log T terms) for the regret of any algorithm that makes persuasive recommendations. In particular, weconstruct a persuasion instance for which no persuasive algorithm can achieve regret better thanΩ( √ T ). Furthermore, in Theorem 4, we show this lower bound holds even if the persuasivenessrequirements on the algorithm were significantly relaxed. As a byproduct, our regret analysis alsoleads to useful insight about robust persuasion w.r.t. to persuasiveness. Specifically, to prove ourlower bound result, we carefully craft the persuasion instance and use its geometry to prove a linearcost for robust persuasion; this instance thus serves as a lower bound example for robust persuasion,which may be of independent interest.Our results contribute to the work on online learning that seeks to evaluate the value of knowingthe underlying distributional parameters in settings with repeated interactions [Kleinberg andLeighton, 2003]. In particular, our results fully characterize the sender’s value of knowing the prior for repeated persuasion. Our attempt of relaxing the known prior assumptions is also aligned withthe prior-independent mechanism design literature Dhangwatnotai et al. [2015], Chawla et al. [2013]. Our paper contributes to the burgeoning literature on Bayesian persuasion and information designin economics, operations research and computer science. We refer curious readers to [Kamenica andGentzkow, 2011] and [Bergemann and Morris, 2019] for a general overview of the recent developmentsas well as a survey from the algorithmic perspective by Dughmi [2017].
Online learning & mechanism design.
Our work subscribes to the recent line of work thatstudies the interplay of learning and mechanism design in incomplete-information settings, in theabsence of common knowledge on the prior. We briefly discuss the ones closely related to our work.Castiglioni et al. [2020] focus on persuasion setting with a commonly known prior distributionof the state but unknown receiver types chosen adversarially from a finite set. They show thateffective learning, in this case, is computationally intractable but does admit O ( √ T ) regret learningalgorithm, after relaxing the computability constraint. Our model complements theirs by focusingon known receiver types but unknown state distributions in a stochastic setup. Moreover, we achievea similar (and tight) regret bound through a computationally efficient algorithm. Also relevant tous the recent line of work on Bayesian exploration Kremer et al. [2014], Mansour et al. [2015, 2016],which is also motivated by online recommendation systems. Opposite to us, these models assume acommonly known prior but the realized state is unobservable and thus needs to be learned duringthe repeated interactions.Dispensing with the common prior itself, Camara et al. [2020] study an adversarial online learningmodel where both a mechanism designer and the agent learn about the states over time. The agentis assumed to minimize her counterfactual (internal) regret in response to the mechanism designer’spolicies, which are assumed to be non-responsive to the agent’s actions. The authors characterize theregret of the mechanism designer relative to the best-in-hindsight fixed mechanism. Similar to ourwork, the regret bounds require the characterization of a “cost of robustness” of the underlying designproblem. While related, our model is stochastic rather than adversarial, and thus a prior exists. Ourmodel is similar in spirit to the prior-independent mechanism design literature Dhangwatnotai et al.[2015], Chawla et al. [2013], however, our setup is different. Moreover, our algorithm is measured bythe regret whereas approximation ratios are often adopted for prior-independent mechanism design.Recent works by Hahn et al. Hahn et al. [2019, 2020] study information design in online optimiza-3ion problems such as the secretary problem Hahn et al. [2019] and the prophet inequalities Hahnet al. [2020]. The propose constant-approximation persuasive schemes. These online optimizationproblems often take the adversarial approach, which is different from our stochastic setup andlearning-focused tasks. Therefore, our results are not comparable. Robust persuasion:
The algorithm we propose relies crucially on robust persuasion due tothe ignorance of the prior, and as a part of establishing the regret bounds for the algorithm, wequantify the sender’s cost of robustness. Kosterina [2018] studies a persuasion setting in the absenceof the common prior assumption. In particular, the sender has a known prior, whereas only theset in which the receiver’s prior lies is known to the sender. Furthermore, the sender evaluates theexpected utility under each signaling mechanism with respect to the worst-case prior of the receiver.Similarly, Hu and Weng [2020] study the problem of sender persuading a privately informed receiver,where the sender seeks to maximize her expected payoff under the worst-case information of thereceiver. Finally, Dworczak and Pavan [2020] study a related setting and propose a lexicographicsolution concept where the sender first identifies the signaling mechanisms that maximize herworst-case payoff, and then among them chooses the one that maximizes the expected utility underher conjectured prior. In contrast to these work, our model focuses on a setting with common,but unknown, prior, and where the receiver has no private information. Instead, our notion ofrobustness is with respect to this unknown (common) prior.Our work also relates to several other lines of research. Since the persuasion problem can beposed as a linear program, our setting relates to online convex optimization [Mahdavi et al., 2011,Yu et al., 2017, Yu and Neely, 2020, Yuan and Lamperski, 2018]. As in our work, these authorsconsider an online convex optimization problem with unknown objective function and/or constraints,and study algorithms that minimize regret while at the same time ensuring low constraint violations.However, most of the work here seeks to bound the magnitude of the constraint violation, whereasour persuasiveness guarantees correspond to constraint satisfaction with high probability. Finally,by characterizing the persuasion problem as a Stackelberg game between the sender’s choice of asignaling mechanism and the receiver’s subsequent choice of an action, our work is related to thebroader work on the characterization of regret in repeated Stackelberg settings [Balcan et al., 2015,Dong et al., 2018, Chen et al., 2019].
Consider a persuasion setting with a single sender persuading a receiver sequentially over a timehorizon of length T . At each time t ∈ [ T ] = { , · · · , T − } , a state ω t ∈ Ω is drawn independentlyand identically from a prior distribution µ ∗ ∈ ∆(Ω). We focus on the setting where Ω is a knownfinite set, however the distribution µ ∗ is unknown to both the sender and the receiver. To capturethe sender’s initial knowledge (before time t = 0) about the prior µ ∗ , we assume that the senderknows that µ ∗ lies in the set B ⊆ ∆(Ω).At each time t ∈ [ T ], the realized state ω t is observed solely by the sender, who then shares withthe receiver an action recommendation a t ∈ A (chosen according to a signaling algorithm , which wedefine below), where A is a finite set of available actions for the receiver. The receiver then choosesan action ˆ a t (not necessarily equal to a t ), whereupon she receives a utility u ( ω t , ˆ a t ), whereas thesender receives utility v ( ω t , ˆ a t ). We assume that the receiver chooses her actions myopically, in order For any finite set X , let ∆( X ) denote the set of all probability distributions over X . Invoking the standard revelation principle, it can be shown that this is without loss of generality.
4o model settings where a single long-run sender is persuading a stream of receivers. Furthermore,while our baseline model assumes that the receiver’s utility is homogeneous across time, our modeland the results easily apply to the case with heterogeneous receiver utility, assuming the senderobserves the receiver’s type prior to persuasion.Without loss of generality, we assume that with | Ω | ≥ | A | ≥ v ( ω, a ) ∈ [0 ,
1] for all ω ∈ Ωand a ∈ A . We refer to the tuple I = (Ω , A, u, v, B ) with u : Ω × A → R and v : Ω × A → [0 ,
1] asan instance of our problem.
Informally, given a persuasion instance I , the sender’s goal is to send action recommendationssuch that her long-run total expected utility is maximized. To formalize this goal, we begin byfocusing on the setting where the sender and the receiver commonly know that the prior distribution µ ∗ = µ ∈ ∆(Ω). In this setting, the sender’s problem decouples across time periods, and standardresults Kamenica and Gentzkow [2011], Dughmi and Xu [2019] imply that the sender’s problemcan be formulated as a linear program. To elaborate, for ω ∈ Ω and a ∈ A , let σ ( ω, a ) denote theprobability with which the sender recommends action a when the realized state is ω . We refer to σ =( σ ( ω, a ) : ω ∈ Ω , a ∈ A ) as a signaling mechanism , and let S = { σ : σ ( ω, · ) ∈ ∆( A ) for each ω ∈ Ω } denote the set of all signaling mechanisms. A signaling mechanism σ ∈ S is persuasive , if conditionedon receiving an action recommendation a ∈ A , it is indeed optimal for the receiver to choose action a . We denote the set of persuasive mechanisms by Pers ( µ ): Pers ( µ ) (cid:44) (cid:40) σ ∈ S : (cid:88) ω ∈ Ω µ ( ω ) σ ( ω, a ) (cid:0) u ( ω, a ) − u ( ω, a (cid:48) ) (cid:1) ≥ , for all a, a (cid:48) ∈ A (cid:41) . (1)Here, using Bayes’ rule and assuming (cid:80) ω ∈ Ω µ ( ω ) σ ( ω, a ) >
0, the receiver’s posterior belief thatthe realized state is ω , upon receiving the recommendation a ∈ A , is given by µ ( ω ) σ ( ω,a ) (cid:80) ω (cid:48)∈ Ω µ ( ω (cid:48) ) σ ( ω (cid:48) ,a ) ,and hence (cid:80) ω ∈ Ω (cid:16) µ ( ω ) σ ( ω,a ) (cid:80) ω (cid:48)∈ Ω µ ( ω (cid:48) ) σ ( ω (cid:48) ,a ) (cid:17) u ( ω, a (cid:48) ) denotes her expected utility of choosing action a (cid:48) ∈ A conditioned on receiving the recommendation a ∈ A . Thus, the inequality in the preceding definition captures the requirement that the receiver’s expected utility for choosing action a (cid:48) is at most theexpected utility for choosing action a ∈ A . We note that Pers ( µ ) is a non-empty convex polytope.Given a persuasive signaling mechanism σ ∈ Pers ( µ ), the receiver is incentivized to choose therecommended action and thus the sender’s expected utility will be V ( µ, σ ) (cid:44) (cid:88) ω ∈ Ω (cid:88) a ∈ A µ ( ω ) σ ( ω, a ) v ( ω, a ) ∈ [0 , . (2)Consequently, the sender’s problem of selecting an optimal persuasive signaling mechanism thatmaximizes her expected utility can be formulated as the following linear program: OPT ( µ ) (cid:44) max σ V ( µ, σ ) , subject to σ ∈ Pers ( µ ) . We now return to the setting where neither the sender nor the receiver knows the prior µ ∗ .In general, the sender chooses at each time t an action recommendation a t based on the completehistory, namely the past state realizations, the past action recommendations by the sender as well If (cid:80) ω ∈ Ω µ ( ω ) σ ( ω, a ) = 0, the inequality is trivially satisfied.
5s the past actions chosen by the receiver. However, since the receiver does not know the prior,neither the past actions recommended by the sender nor the past actions chosen by the receivercarry any information about the prior beyond that contained in the state realizations. Thus, thehistory relevant to the sender consists solely of the state realizations until time t . To formalize thisdescription, we first define the history h t at the beginning of time t to be the state realizations priorto time t : h t = ∪ τ Rai and show that it is persuasive with 1 − o (1) probability and meanwhile achieves an6 LGORITHM 1: The Robustness Against Ignorance ( Rai ) algorithm Input: Instance I , Time horizon T Parameters: γ ∈ B , { (cid:15) t > t ∈ [ T ] } Output: a t ∈ A for each t ∈ [ T ] for t = 0 to T − do Choose any σ [ h t ] ∈ arg max σ { V ( γ t , σ ) : σ ∈ Pers ( B t ) } ;Recommend a t = a ∈ A with probability σ ( ω t , a ; h t );Update γ t +1 ( ω ) ← t +1 (cid:80) tτ =0 I { ω τ = ω } for each ω ∈ Ω;Set B t +1 ← B ( γ t +1 , (cid:15) t +1 ); end average regret O ( (cid:113) log TT ). Rai ) Algorithm Before describing the algorithm, we need some notation. First, for any set B ⊆ ∆(Ω), let Pers ( B )denote the set of signaling mechanisms that are simultaneously persuasive under all priors µ ∈ B : Pers ( B ) = ∩ µ ∈B Pers ( µ ). We remark that for any non-empty set B ⊆ ∆(Ω), the set Pers ( B ) isconvex since it is an intersection of convex sets Pers ( µ ), and is non-empty since it contains thefull-information signaling mechanism. Second, let B ( µ, (cid:15) ) (cid:44) { µ (cid:48) ∈ ∆(Ω) : (cid:107) µ (cid:48) − µ (cid:107) ≤ (cid:15) } denotethe (closed) (cid:96) -ball of radius (cid:15) > µ ∈ ∆(Ω).The Rai algorithm is formally described in Algorithm 1. At a high level, at each time t ≥ 0, thealgorithm maintains a set B t of candidates for the (unknown) prior µ ∗ and a particular estimate γ t .It then selects a robustly persuasive signaling scheme that maximizes the sender utility w.r.t. to thecurrently estimated prior γ t . Concretely, among signaling mechanisms that are persuasive for allbeliefs µ ∈ B t , Rai selects the one that maximizes the sender’s expected utility under γ t . Finally, itmakes an action recommendation a t using this signaling mechanism, given the state realization ω t .From the intuitive description, it follows that to overcome the sender’s ignorance of the prior,the algorithm seeks to be persuasive robustly against a conservative set of priors to maintainpersuasiveness of the algorithm. In this goal, the parameters { (cid:15) t : t ∈ [ T ] } determine howconservative the algorithm is in its persuasion: larger values of (cid:15) t imply that the algorithm ismore likely to be persuasive. (In particular, for (cid:15) t > 2, the algorithm chooses the full-informationmechanism Full , and hence is persuasive with certainty.) Unsurprisingly, larger values of (cid:15) t lead tolarger regret, and hence the sender, in the choice of (cid:15) t , must optimally trade-off the certainty ofpersuasiveness of the algorithm against its regret.Our first result characterizes the optimal choice of the parameters, and shows that the algorithm Rai is efficient, and persuasive with high probability. Theorem 1. For each t ∈ [ T ] , let (cid:15) t = min { (cid:113) | Ω | t (cid:0) √ Φ log T (cid:1) , } with Φ > . Then, the Rai algorithm runs efficiently in polynomial time, and is β -persuasive with β = sup µ ∗ ∈B P µ ∗ (cid:0) ∩ t ∈ [ T ] B t (cid:54)(cid:51) µ ∗ (cid:1) ≤ T − √ Ω56 . In particular, for Φ > , we have β ≤ T − . . To see the efficiency of the Rai algorithm, note that at each time t the algorithm has to solve theoptimization problem max σ { V ( γ t , σ ) : σ ∈ Pers ( B t ) } . Since B t = B ( γ t , (cid:15) t ) is an (cid:96) -ball of radius (cid:15) t ,7t is a convex polyhedron with at most | Ω | · ( | Ω | − 1) vertices. By the linearity of the obedienceconstraints and the convexity of B t , it follows that Pers ( B t ) is obtained by imposing the obedienceconstraints at priors corresponding to each of these vertices. Since there are O ( | Ω | + | A | ) obedienceconstraints for each prior, we obtain that the optimization problem is a polynomially-sized linearprogram, and hence can be solved efficiently.The proof of the persuasiveness of Rai follows by showing that the empirical distribution γ t concentrates around the unknown prior µ ∗ with high probability. Since, after any history h t , thesignaling mechanism σ [ h t ] chosen by the algorithm is persuasive for all priors in an (cid:96) -ball around γ t , we deduce that it is persuasive under µ ∗ as well. To show the concentration result, we use aconcentration inequality for independent random vectors in a Banach space [Foucart and Rauhut,2013]; the full proof is provided in Appendix A.2.We observe that to get strong persuasiveness guarantees, the choice of (cid:15) t in the precedingtheorem requires the knowledge of the time horizon T . However, applying the standard doublingtricks [Besson and Kaufmann, 2018], one can convert our algorithm to an anytime version that hasthe same regret upper bound guarantee, at the cost of a weakened persuasiveness guarantee, wherethe persuasiveness β is weakened to a constant arbitrarily close to 0. Given the persuasiveness of the algorithm, we now devote the rest of this section to proving theregret upper bound, under minor regularity conditions on the problem instance.To state the regularity conditions under which we obtain our regret bound, we need a definition.For each action a ∈ A , let P a denote the set of beliefs for which action a is optimal for the receiver: P a (cid:44) (cid:8) µ ∈ ∆(Ω) : E µ [ u ( ω, a )] ≥ E µ (cid:2) u ( ω, a (cid:48) ) (cid:3) , for all a (cid:48) ∈ A (cid:9) . It is without loss of generality to assume that for each a ∈ A , the set P a is non-empty. Ourprimary regularity condition further requires that each such set has a non-empty (relative) interior. Regularity Conditions. The instance I satisfies the following conditions:1. There exists d > such that for each a ∈ A , the set P a contains an (cid:96) -ball of size d . Let D > denote the largest value of d for which the preceding is true, and let η a ∈ P a be suchthat B ( η a , D ) ⊆ P a .2. There exists a p > such that for all µ ∈ B we have min ω µ ( ω ) ≥ p > . These regularity conditions are essential to ensure the possibility of successful learning; it isnot hard to see that if some P a has zero measure, then the sender cannot hope to persuasivelyrecommend action a without complete certainty of the prior. The first condition ensures that suchdegeneracies do not arise. The second condition is technical and is made primarily to ensure thepotency of the first condition: without it, the sets P a may satisfy the first condition in ∆(Ω), whilefailing to satisfy it relative to the subset ∆( { ω : µ ∗ ( ω ) > } ). Taken together, these regularityconditions serve to avoid pathologies, and henceforth we restrict our attention only to those instancessatisfying these regularity conditions.With these regularity conditions in place, our main positive result establishes a regret upperbound for the Rai algorithm. We observe that while p appears in our regret bound, it is notrequired by the Rai algorithm for its operation. These vertices are all of the form γ t + (cid:15) t ( e ω − e ω (cid:48) ), where e ω is the belief that puts all its weight on ω . This is because the receiver can never be persuaded to play an action a ∈ A for which P a is empty, and hencesuch an action can be dropped from A . heorem 2. For t ∈ [ T ] , let (cid:15) t = min { (cid:113) | Ω | t (1 + √ Φ log T ) , } with Φ > . Then, for all µ ∗ ∈ B ,with probability at least − T − √ Ω56 − T − | Ω | , the Rai algorithm satisfies Reg I ( Rai , µ ∗ , T ) ≤ (cid:18) p D + 1 (cid:19) (cid:16) (cid:112) | Ω | T (1 + 2 (cid:112) Φ log T ) (cid:17) . In particular, the regret is of order O (cid:16) √ Ω p D √ T log T (cid:17) with high probability. To obtain the preceding regret bound, we start by bounding a quantity that measures the lossin the sender’s expected utility for being robustly persuasive for a subset of priors close to eachother. Formally, for B ⊆ B and µ ∈ B , define Gap ( µ, B ) (cid:44) sup σ ∈ Pers ( µ ) V ( µ, σ ) − sup σ ∈ Pers ( B ) V ( µ, σ ) . (5)Note that Gap ( µ, B ) captures the difference in the sender’s expected utility between using theoptimal persuasive signaling mechanism for prior µ ∈ B and using the optimal signaling mechanismthat is persuasive for all priors µ (cid:48) ∈ B .The following proposition provides a bound on Gap ( µ, B ( µ, (cid:15) )), and is the central result thatunderpins our proof of Theorem 2. Proposition 1. For any µ ∈ B and for all (cid:15) ≥ , we have Gap ( µ, B ( µ, (cid:15) )) ≤ (cid:16) p D (cid:17) (cid:15) . The proof of the proposition is obtained through an explicit construction of a signaling mechanism (cid:98) σ that is persuasive for all priors in the set B ( µ, (cid:15) ). To do this, we first use the geometry of theinstance to split the prior µ into a convex combination of beliefs that either fully reveal thestate, or are well-situated in the interior of the sets P a . (It is here that we make use of the tworegularity assumptions.) We then construct a signaling mechanism (cid:98) σ that induces, under prior µ ,the aforementioned beliefs as posteriors, and show that the induced posteriors under a prior µ (cid:48) closeto µ are themselves close to the posteriors under µ (hence within the sets P a ) or are fully revealing.This proves the persuasiveness of (cid:98) σ for all priors µ (cid:48) close to µ . The bound on Gap ( µ, B ( µ, (cid:15) )) thenfollows by showing that the sender’s payoff under (cid:98) σ for prior µ is close to the payoff under theoptimal signaling mechanism that is persuasive under µ .Before presenting the proof of Proposition 1, we briefly sketch the proof of Theorem 2. Usingthe definition (4), in Lemma 2 we show the following bound on the regret: Reg I ( Rai , µ ∗ , T ) ≤ (cid:88) t ∈ [ T ] Gap ( µ ∗ , B ( µ ∗ , (cid:107) µ ∗ − γ t (cid:107) )) + (cid:88) t ∈ [ T ] Gap ( γ t , B ( γ t , (cid:15) t ))+ (cid:88) t ∈ [ T ] (cid:107) µ ∗ − γ t (cid:107) + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) . On the event { µ ∗ ∈ ∩ t ∈ [ T ] B t } , we have (cid:107) µ ∗ − γ t (cid:107) ≤ (cid:15) t . Together with Proposition 1, we obtain thatthe first three terms are of order (cid:80) t ∈ [ T ] (cid:15) t = O ( √ T log T ). The final term is also of the same orderdue to a simple application of the Azuma-Hoeffding inequality. The complete proof is provided inAppendix A.1.We end this section with the proof of Proposition 1. Proof of Proposition 1. Observe that for (cid:15) > p D , we have (cid:15)p D > 1, and hence the specified boundis trivial. Hence, hereafter, we assume (cid:15) ≤ p D . 9o begin, let σ ∈ arg max σ (cid:48) ∈ Pers ( µ ) V ( µ, σ (cid:48) ) denote the optimal signaling mechanism under theprior µ . Let A + = { a ∈ A : (cid:80) ω ∈ Ω σ ( ω, a ) > } denote the set of all actions that are recommendedwith positive probability under σ . For each a ∈ A + , let µ a denote the receiver’s posterior belief(under signaling mechanism σ ) upon receiving the action recommendation a . Note that since σ is persuasive under µ , we must have µ a ∈ P a . By the splitting lemma [Aumann et al., 1995], itthen follows that µ can be written as a convex combination (cid:80) a ∈ A + w a µ a of { µ a : a ∈ A + } , where w a ∈ [0 , 1] is given by w a = (cid:80) ω ∈ Ω µ ( ω ) σ ( ω, a ).We next explicitly construct a signaling mechanism (cid:98) σ . To simplify the proof argument, thesignaling mechanism (cid:98) σ we construct is not a straightforward mechanism, in the sense that it revealsmore than just action recommendations for signals in S . Using revelation principle, one can constructan equivalent straightforward mechanism ¯ σ by coalescing [Anunrojwong et al., 2020] signals withthe same best response for the signal. We omit the details of this reduction. We start with somedefinitions that are needed to construct the signaling mechanism (cid:98) σ .Let η a ∈ P a be such that B ( η a , D ) ⊆ P a . For δ = (cid:15)p D ∈ [0 , ξ a = (1 − δ ) µ a + δη a ∈ P a for each a ∈ A + and let ξ = (cid:80) a ∈ A + w a ξ a . Furthermore, since µ a ∈ P a and B ( η a , D ) ⊆ P a , theconvexity of the set P a implies that B ( ξ a , δD ) ⊆ P a .Since µ ∈ B ⊆ relint (∆(Ω)), we have − ρ ( µ − ρξ ) ∈ ∆(Ω) for all small enough ρ > 0. Let¯ ρ (cid:44) sup (cid:110) ρ ∈ [0 , 1] : − ρ ( µ − ρξ ) ∈ ∆(Ω) (cid:111) be the largest such value in [0 , χ as χ (cid:44) (cid:40) − ¯ ρ ( µ − ¯ ρξ ) , if ¯ ρ < µ, if ¯ ρ = 1.Then, we obtain µ = ¯ ρξ + (1 − ¯ ρ ) χ . Furthermore, if ¯ ρ < 1, we have¯ ρ = (cid:107) χ − µ (cid:107) (cid:107) χ − µ (cid:107) + (cid:107) µ − ξ (cid:107) ≥ p p + δ , where the inequality follows from (cid:107) µ − ξ (cid:107) ≤ (cid:80) a ∈ A + w a (cid:107) µ a − ξ a (cid:107) = δ (cid:80) a ∈ A + w a (cid:107) η a − ξ a (cid:107) ≤ δ and from the fact that χ lies in the boundary of ∆(Ω), which implies (cid:107) χ − µ (cid:107) ≥ ω µ ( ω ) ≥ p .With the preceding definitions in place, we are now ready to construct the mechanism (cid:98) σ . Let a ω be a best response for the receiver at state ω ∈ Ω, and let S = { ( ω, a ω ) ∈ Ω × A : χ ( ω ) > } .Consider the signaling mechanism (cid:98) σ , with the set of signals A + ∪ S , defined as follows: for each ω ∈ Ω, let (cid:98) σ ( ω, s ) (cid:44) ¯ ρ w a ξ a ( ω ) µ ( ω ) , for s ∈ A + ;(1 − ¯ ρ ) χ ( ω ) µ ( ω ) , for s = ( ω, a ω ) ∈ S ;0 , otherwise. (6)We now show that the signaling mechanism (cid:98) σ is persuasive for all priors in B ( µ, (cid:15) ), in the sensethat for all signals s ∈ A + it is optimal for the receiver to play s , and for all signals s = ( ω, a ω ) ∈ S , itis optimal for the receiver to play a ω . To see this, for any γ ∈ B ( µ, (cid:15) ), let γ ( ·| s ) denote the receiver’sposterior under signaling mechanism (cid:98) σ upon receiving the signal s ∈ A + ∪ S . For s = ( ω, a ω ) ∈ S ,we have γ ( ·| s ) = e ω , where e ω is the belief that puts all its weight on ω ∈ Ω. Thus, upon receivingthe signal s = ( ω, a ω ) it is optimal for the receiver with prior γ to take action a ω . Thus, it onlyremains to show that signals s = a ∈ A + are persuasive.10or a ∈ A + , we have for ω ∈ Ω, µ ( ω | a ) = µ ( ω ) (cid:98) σ ( ω, a ) (cid:80) ω (cid:48) ∈ Ω µ ( ω (cid:48) ) (cid:98) σ ( ω (cid:48) , a ) = ξ a ( ω ) γ ( ω | a ) = γ ( ω ) (cid:98) σ ( ω, a ) (cid:80) ω (cid:48) ∈ Ω γ ( ω (cid:48) ) (cid:98) σ ( ω (cid:48) , a ) = γ ( ω ) µ ( ω ) · ξ a ( ω ) (cid:80) ω (cid:48) ∈ Ω γ ( ω (cid:48) ) ξ a ( ω (cid:48) ) µ ( ω (cid:48) ) . Then, using triangle inequality and some algebra, we obtain (cid:107) γ ( ·| a ) − µ ( ·| a ) (cid:107) = (cid:88) ω ∈ Ω | γ ( ω | a ) − ξ a ( ω ) |≤ (cid:88) ω ∈ Ω (cid:12)(cid:12)(cid:12)(cid:12) γ ( ω | a ) − γ ( ω ) µ ( ω ) · ξ a ( ω ) (cid:12)(cid:12)(cid:12)(cid:12) + (cid:88) ω ∈ Ω (cid:12)(cid:12)(cid:12)(cid:12) γ ( ω ) µ ( ω ) · ξ a ( ω ) − ξ a ( ω ) (cid:12)(cid:12)(cid:12)(cid:12) ≤ · sup ω ∈ Ω ξ a ( ω ) µ ( ω ) · (cid:107) γ − µ (cid:107) ≤ (cid:15)p , where in the final inequality, we have used min ω µ ( ω ) ≥ p to get sup ω ∈ Ω ξ a ( ω ) µ ( ω ) ≤ p . Since µ ( ·| a ) = ξ a , this implies that γ ( ·| a ) ∈ B (cid:16) ξ a , (cid:15)p (cid:17) = B ( ξ a , δD ) ⊆ P a . Thus, the signal a ∈ A + ispersuasive for the prior γ ∈ B ( µ, (cid:15) ). Taken together, we obtain that the signaling mechanism (cid:98) σ ispersuasive for all γ ∈ B ( µ, (cid:15) ).The persuasiveness of (cid:98) σ for all γ ∈ B ( µ, (cid:15) ) implies thatsup σ (cid:48) ∈ Pers ( B ( µ,(cid:15) )) V ( µ, σ (cid:48) ) ≥ V ( µ, (cid:98) σ )= (cid:88) ω ∈ Ω (cid:88) a ∈ A + µ ( ω ) (cid:98) σ ( ω, a ) v ( ω, a ) + (cid:88) ω ∈ Ω (cid:88) s ∈ S µ ( ω ) (cid:98) σ ( ω, s ) v ( ω, a ω ) ≥ (cid:88) ω ∈ Ω (cid:88) a ∈ A + ¯ ρw a ξ a ( ω ) v ( ω, a )= ¯ ρ (cid:88) ω ∈ Ω (cid:88) a ∈ A + w a ((1 − δ ) µ a ( ω ) + δη a ( ω )) v ( ω, a ) ≥ ¯ ρ (1 − δ ) (cid:88) ω ∈ Ω (cid:88) a ∈ A + w a µ a ( ω ) v ( ω, a )= ¯ ρ (1 − δ ) OPT ( µ ) . Thus, we obtain Gap ( µ, B ( µ, (cid:15) )) = OPT ( µ ) − sup σ (cid:48) ∈ Pers ( B ( µ,(cid:15) )) V ( µ, σ (cid:48) ) ≤ (1 − ¯ ρ (1 − δ )) OPT ( µ ) ≤ (cid:18) p D (cid:19) (cid:15), where the final inequality follows from ¯ ρ ≥ p p + δ , δ = (cid:15)p D and OPT ( µ ) ≤ a) Receiver’s preferences (b) Prior µ ∗ Figure 1: The persuasion instance I . In this section, we show that the our regret upper bound in Theorem 2 are essentially tight withrespect to the parameter D, T (up to a lower order √ log T factor). We also show that the inversepolynomial dependence on p , the smallest probability of states, is necessary though the exact orderof the dependence on p is left as an interesting open question. Theorem 3. There exists an instance I , a prior µ ∗ ∈ B , and T > such that for any T ≥ T and any β T -persuasive algorithm a the following holds with probability at least − β T : Reg I ( a , T, µ ∗ ) = T · OPT ( µ ∗ ) − (cid:88) t ∈ [ T ] v ( ω t , a t ) ≥ √ T Dp . The persuasion instance I in the preceding theorem is carefully crafted to result in a substantialloss to the sender for being robustly persuasive. We begin by providing a geometric overview, andthe underlying intuition, behind the crafting of this persuasion instance.In the persuasion instance I , there are three states Ω = { ω , ω , ω } and five actions A = { a , a , a , a , a } for the receiver. At a high level, the receiver’s preference can be illustrated asin Fig. 1a, which depicts the receiver’s optimal action for any belief in the simplex. The regions P i in the figure correspond to the set of beliefs that induce action a i ∈ A as the receiver’s bestresponse. The instance is crafted in a way such that the sets P and P that induce actions a and a respectively are symmetric and extremely narrow with the width controlled by an (cid:96) -ball of radius D contained, as depicted in Fig. 1b. (For completeness, the receiver’s utility is listedexplicitly in Table 1.) The sender seeks to persuade the receiver into choosing one of actions a and a (regardless of the state); all other actions are strictly worse for the sender. (Formally, weset v ( ω, a ) = 1 if a ∈ { a , a } and 0 otherwise, for all ω .) The sender’s initial knowledge regardingthe prior is captured by the set B = { µ ∈ ∆(Ω) : min ω µ ≥ p } , while the prior of interest is µ ∗ = ( p , − p , − p ), corresponding to the midpoint of the tips of the sets P i , as shown in Fig. 1b.We focus on the setting where the instance parameters D and p satisfy Dp < / I , it is costly to require the signaling mechanism to be robustly persuasive for a set ofpriors around µ ∗ . It also implies that the bound on Gap ( · ) obtained in Proposition 1 is almost tight, Since | Ω | = 3, the (cid:96) -ball here is an hexagon. I , with u ( ω, a ) normalized to 0 for all ω ∈ Ω. a a a a ω D D − D (1 − p − D ) − D (1 − p − D ) ω (1 − D )(1 − D ) − p ( D + 1)(2 D − 1) + p − p − D )(1 − D ) − − p − D )( D + 1) ω ( D + 1)(2 D − 1) + p (1 − D )(1 − D ) − p − − p − D )( D + 1) 2(1 − p − D )(1 − D ) up to a factor of 1 /p . We remark that this result also serves as a worst-case (lower bound) examplefor robust persuasion, which may be of independent interest. Proposition 2 (The Cost of Robustness) . For the instance I , we have OPT ( µ ∗ ) = 1 . Furthermore,for all (cid:15) ∈ (0 , D ) , we have Gap ( µ ∗ , Pers ( µ ∗ , ¯ µ , ¯ µ )) ≥ (cid:15) Dp , where ¯ µ = µ ∗ + (cid:15) ( e − e ) , ¯ µ = µ ∗ + (cid:15) ( e − e ) , where the belief e i puts all its weight on ω i . We defer the rigorous algebraic proof of the proposition to Appendix C.1 and present a briefsketch using a geometric argument here. Since the prior µ can be written as a convex combination µ = ( µ + µ ) / 2, where µ and µ are the tips of region P and P respectively (see Fig. 1b), bythe splitting lemma [Aumann et al., 1995], it follows that the optimal signaling mechanism sendssignals induces posterior beliefs µ and µ leading to receiver’s choice of a and a respectively.Since the sender can always persuade the receiver to choose one of her preferred actions, we obtain OPT ( µ ∗ ) = 1.On the other hand, for a signaling mechanism to be robustly persuasive for all priors (cid:15) -close tothe prior µ ∗ for sufficiently small (cid:15) , the posteriors for the sender’s preferred actions a , a inducedby the signaling mechanism have to be shifted up significantly in the narrow region. Such a largediscrepancy ultimately forces the sender to suffer a substantial loss in the expected utility.Armed with this proposition, we are now ready to present the proof of Theorem 3. Proof of Theorem 3. For a prior µ ∈ B , define the event E T ( µ ) as E T ( µ ) = { h T : σ a [ h t ] ∈ Pers ( µ ) , for each t ∈ [ T ] } . In words, under the event E T ( µ ), the signaling mechanisms σ a [ h t ] chosen by the algorithm a afterany history h t ∈ E T ( µ ) is persuasive for the prior µ . Since the algorithm a is β T -persuasive, weobtain P µ ( E T ( µ )) ≥ − β T , for all µ ∈ B .Fix an (cid:15) ∈ (0 , − p ) to be chosen later, and consider the priors ¯ µ = µ ∗ = ( p , − p , − p ) and¯ µ = µ ∗ + (cid:15) ( e − e ) and ¯ µ = µ ∗ + (cid:15) ( e − e ), where e j is the belief that puts all its weight onstate ω j for j ∈ { , } . Observe that for each i ∈ { , , } and for all (cid:15) ∈ (0 , − p ), we have ¯ µ i ∈ B and hence P ¯ µ i ( E T (¯ µ i )) ≥ − β T .Now, on the event E T (¯ µ ) ∩ E T (¯ µ ) ∩ E T (¯ µ ), the signaling mechanisms σ a [ h t ] chosen by thealgorithm after any history h t is persuasive for all the priors ¯ µ i , i = 0 , , 2. This implies that on thisevent, we have T · OPT (¯ µ ) − (cid:88) t ∈ [ T ] V (¯ µ , σ a [ h t ]) ≥ T · Gap (¯ µ , { ¯ µ , ¯ µ , ¯ µ } ) ≥ (cid:15)T Dp , Gap ( · ), and the second inequality followsfrom Proposition 2.Now, we have2 | P ¯ µ ( E T (¯ µ )) − P ¯ µ ( E T (¯ µ )) | ≤ (cid:88) t ∈ [ T ] KL (¯ µ || ¯ µ )= 1 − p (cid:18) (1 − p ) (1 − p ) − (cid:15) (cid:19) T = 1 − p (cid:18) (cid:15) (1 − p ) − (cid:15) (cid:19) T ≤ − p (cid:18) (cid:15) (1 − p ) − (cid:15) (cid:19) T, where the first inequality is the Pinsker’s inequality, and the first equality is from the definitionof the Kullback-Leibler divergence, and the final inequality follows from log(1 + x ) ≤ x for x ≥ (cid:15) < − p , we obtain2 | P ¯ µ ( E T (¯ µ )) − P ¯ µ ( E T (¯ µ )) | ≤ (cid:15) T − p ) ≤ (cid:15) T, where we have used p ≤ | Ω | = in the final inequality. Thus, we obtain that P ¯ µ ( E T (¯ µ )) ≥ P ¯ µ ( E T (¯ µ )) − | P ¯ µ ( E T (¯ µ )) − P ¯ µ ( E T (¯ µ )) |≥ − β T − (cid:15) (cid:114) T . By the same argument, we obtain P ¯ µ ( E T (¯ µ )) ≥ − β T − (cid:15) (cid:113) T .By the linearity of the obedience constraints, we obtain that if σ ∈ Pers (¯ µ ) ∩ Pers (¯ µ ), then σ ∈ Pers (¯ µ ). Thus, we have E T (¯ µ ) ∩ E T (¯ µ ) ⊆ E T (¯ µ ), and hence P ¯ µ ( E T ( µ ) ∩ E T (¯ µ ) ∩ E T (¯ µ )) = P ¯ µ ( E T (¯ µ ) ∩ E T (¯ µ )) ≥ P ¯ µ ( E T (¯ µ )) + P ¯ µ ( E T (¯ µ )) − ≥ − β T − (cid:15) √ T . Finally, by the Azuma-Hoeffding inequality, we obtain P ¯ µ (cid:88) t ∈ [ T ] V (¯ µ , σ a [ h t ]) − (cid:88) t ∈ [ T ] v ( ω t , a t ) < −√ T < e − / . Taken together, we obtain that with probability at least 2 − β T − (cid:15) √ T − e − / , we have Reg I ( a , T, ¯ µ ) = T · OPT (¯ µ ) − (cid:88) t ∈ [ T ] v ( ω t , a t ) ≥ (cid:15)T Dp − √ T . For T ≥ T = − p ) , choosing (cid:15) = √ T ≤ − p , we obtain, with probability at least − β T , Reg I ( a , T, ¯ µ ) = T · OPT (¯ µ ) − (cid:88) t ∈ [ T ] v ( ω t , a t ) ≥ √ T (cid:18) Dp − (cid:19) ≥ √ T Dp , for Dp < / 32. 14e end this section by observing that the lower bound in Theorem 3 extends also to signalingalgorithms that satisfy a much weaker persuasiveness requirement. Specifically, we say an algorithm a is weakly β -persuasive if during T rounds of persuasion, the expected number of rounds thatthe algorithm is not persuasive is at most βT . Clearly, any β -persuasive algorithm is also weakly β -persuasive. To see how this definition is weaker, imagine an algorithm that is never persuasiveat time 1 but always persuasive at all other times. Then it is a 1-persuasive algorithm (meaningit is never persuasive across all rounds) but is weakly 1 /T -persuasive. Despite the much weakerrestriction, the following theorem shows that weakly β -persuasive algorithms still cannot guaranteea significantly better regret (i.e., by more than logarithmic terms) than the Rai algorithm with astronger persuasiveness guarantee. Theorem 4. There exists an instance I , a prior µ ∗ ∈ B , and T > such that for any T ≥ T and for any algorithm a that is weakly β T -persuasive, the following holds with probability at least − β T : (cid:88) t ∈ [ T ] ( OPT ( µ ∗ ) − v ( ω t , a t )) I { σ a [ h t ] ∈ Pers ( µ ∗ ) } ≥ √ T Dp , We remark that in the preceding theorem, the regret is measured only over those time periodswhere the signaling mechanism chosen by the algorithm is persuasive under µ ∗ . An immediateimplication of the result is that for any algorithm, the maximum of the expected number ofunpersuasive recommendations and the regret for the persuasive recommendations is Ω( √ T ). Theproof, which uses the same instance I but considers different events, is provided in Appendix C.2. We studied a repeated Bayesian persuasion problem where the prior distribution of payoff-relevantstates is unknown to the sender. The sender learns this distribution from observing state realizationswhile making recommendations to the receiver. We propose the Rai algorithm which persuadesrobustly and achieves O ( √ T log T ) regret against the optimal signaling mechanism under theknowledge of the prior. To match this upper-bound, we construct a persuasion instance for whichno persuasive algorithm achieves regret better than Ω( √ T ). Taken together, our work preciselycharacterizes the value of knowing the prior distribution in repeated persuasion.While in our analysis we have assumed that the receiver’s utility is fixed across time periods,our model and the analysis can be easily extended to accommodate heterogeneous receivers, aslong as the sender observes the receiver’s type prior to making the recommendation, and the costof robustness Gap can be uniformly bounded across different receiver types. More interesting isthe setting where the sender must persuade a receiver with an unknown type. In such a setting,assuming the sender cannot elicit the receiver’s type prior to making the recommendation, thesender makes a menu of action recommendations (one for each receiver type). It can be shownthe complete information problem in this setting corresponds to public persuasion of a group ofreceivers with no externality, which is known to be a computationally hard linear program withexponentially many constraints [Dughmi and Xu, 2017]. Consequently, our algorithm ceases to becomputationally efficient. Nevertheless, our results imply that the algorithm continues to maintainthe O ( √ T log T ) regret bound. 15 eferences Jerry Anunrojwong, Krishnamurthy Iyer, and David Lingenbrink. Persuading risk-conscious agents:A geometric approach. Available at SSRN 3386273 , 2020.Robert J Aumann, Michael Maschler, and Richard E Stearns. Repeated games with incompleteinformation . MIT press, 1995.Maria-Florina Balcan, Avrim Blum, Nika Haghtalab, and Ariel D Procaccia. Commitment withoutregrets: Online learning in stackelberg security games. In Proceedings of the sixteenth ACMconference on economics and computation , pages 61–78, 2015.Dirk Bergemann and Stephen Morris. Bayes correlated equilibrium and the comparison of informationstructures in games. Theoretical Economics , 11(2):487–522, 2016.Dirk Bergemann and Stephen Morris. Information design: A unified perspective. Journal ofEconomic Literature , 57(1):44–95, 2019.Lilian Besson and Emilie Kaufmann. What doubling tricks can and can’t do for multi-armed bandits. arXiv preprint arXiv:1803.06971 , 2018.St´ephane Boucheron, G´abor Lugosi, and Pascal Massart. Concentration inequalities: A nonasymp-totic theory of independence . Oxford university press, 2013.Modibo Camara, Jason Hartline, and Aleck Johnsen. Mechanisms for a no-regret agent: Beyondthe common prior. arXiv preprint arXiv:2009.05518 , 2020.Matteo Castiglioni, Andrea Celli, Alberto Marchesi, and Nicola Gatti. Online bayesian persuasion. Advances in Neural Information Processing Systems , 33, 2020.Shuchi Chawla, Jason D Hartline, David Malec, and Balasubramanian Sivan. Prior-independentmechanisms for scheduling. In Proceedings of the forty-fifth annual ACM symposium on Theoryof computing , pages 51–60, 2013.Yiling Chen, Yang Liu, and Chara Podimata. Learning strategy-aware linear classifiers. arXivpreprint arXiv:1911.04004 , 2019.Peerapong Dhangwatnotai, Tim Roughgarden, and Qiqi Yan. Revenue maximization with a singlesample. Games and Economic Behavior , 91:318–333, 2015.Jinshuo Dong, Aaron Roth, Zachary Schutzman, Bo Waggoner, and Zhiwei Steven Wu. Strategicclassification from revealed preferences. In Proceedings of the 2018 ACM Conference on Economicsand Computation , pages 55–70, 2018.Shaddin Dughmi. Algorithmic information structure design: a survey. ACM SIGecom Exchanges ,15(2):2–24, 2017.Shaddin Dughmi and Haifeng Xu. Algorithmic persuasion with no externalities. In Proceedings ofthe 2017 ACM Conference on Economics and Computation , pages 351–368, 2017.Shaddin Dughmi and Haifeng Xu. Algorithmic bayesian persuasion. SIAM Journal on Computing ,(0):STOC16–68, 2019. 16iotr Dworczak and Alessandro Pavan. Preparing for the worst but hoping for the best: Robust(bayesian) persuasion. 2020.Simon Foucart and Holger Rauhut. A Mathematical Introduction to Compressive Sensing . 2013.doi: 10.1007/978-0-8176-4948-7.Niklas Hahn, Martin Hoefer, and Rann Smorodinsky. The secretary recommendation problem. arXiv preprint arXiv:1907.04252 , 2019.Niklas Hahn, Martin Hoefer, and Rann Smorodinsky. Prophet inequalities for bayesian persuasion.In Proc. 29th Int. Joint Conf. Artif. Intell.(IJCAI) , pages 175–181, 2020.Ju Hu and Xi Weng. Robust persuasion of a privately informed receiver. Economic Theory , pages1–45, 2020.Emir Kamenica and Matthew Gentzkow. Bayesian persuasion. American Economic Review , 101(6):2590–2615, 2011.Robert Kleinberg and Tom Leighton. The value of knowing a demand curve: Bounds on regret foronline posted-price auctions. In Proceedings of the 44th Annual IEEE Symposium on Foundationsof Computer Science , FOCS ’03, page 594, USA, 2003. IEEE Computer Society. ISBN 0769520405.Svetlana Kosterina. Persuasion with unknown beliefs. Work. Pap., Princeton Univ., Princeton, NJ ,2018.Ilan Kremer, Yishay Mansour, and Motty Perry. Implementing the “wisdom of the crowd”. Journalof Political Economy , 122(5):988–1012, 2014.Mehrdad Mahdavi, Rong Jin, and Tianbao Yang. Trading regret for efficiency: Online convexoptimization with long term constraints. CoRR , abs/1111.6082, 2011. URL http://arxiv.org/abs/1111.6082 .Yishay Mansour, Aleksandrs Slivkins, and Vasilis Syrgkanis. Bayesian incentive-compatible banditexploration. In Proceedings of the Sixteenth ACM Conference on Economics and Computation ,pages 565–582, 2015.Yishay Mansour, Aleksandrs Slivkins, Vasilis Syrgkanis, and Zhiwei Steven Wu. Bayesian exploration:Incentivizing exploration in bayesian games. In Proceedings of the 2016 ACM Conference onEconomics and Computation , pages 661–661, 2016.Luis Rayo and Ilya Segal. Optimal information disclosure. Journal of political Economy , 118(5):949–987, 2010.Hao Yu and Michael J. Neely. A low complexity algorithm with o(ˆaˆˇst) regret and o(1) constraintviolations for online convex optimization with long term constraints. Journal of Machine LearningResearch , 21(1):1–24, 2020. URL http://jmlr.org/papers/v21/16-494.html .Hao Yu, Michael J. Neely, and Xiaohan Wei. Online convex optimization with stochastic constraints,2017.Jianjun Yuan and Andrew G. Lamperski. Online convex optimization for cumulative constraints. CoRR , abs/1802.06472, 2018. URL http://arxiv.org/abs/1802.06472 .17 Proofs from Section 3 This section provides the proof of our main theorems in Section 3. Throughout, we use the samenotation as in the main text. A.1 Proof of Theorem 2 In this section, we provide the proof of Theorem 2. In the process, we also state and prove severalhelper lemmas used in the proof. Proof of Theorem 2. From Lemma 2, on the event { µ ∗ ∈ ∩ t ∈ [ T ] B t } , we have Reg I ( Rai , µ ∗ , T ) ≤ (cid:18) p D + 1 (cid:19) (cid:88) t ∈ [ T ] (cid:15) t + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) ≤ (cid:18) p D + 1 (cid:19) (cid:32) T − (cid:88) t =1 (cid:114) | Ω | t (1 + (cid:112) Φ log T ) (cid:33) + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) ≤ (cid:18) p D + 1 (cid:19) (cid:16) (cid:112) | Ω | T (1 + (cid:112) Φ log T ) (cid:17) + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) , where in the final inequality, we have used the fact that (cid:80) T − t =1 / √ t ≤ √ T .From Theorem 1, we have P µ (cid:0) ∩ t ∈ [ T ] B t (cid:54)(cid:51) µ (cid:1) ≤ T − √ Ω56 . Furthermore, from Lemma 7, we havefor α > P µ ∗ (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) ≥ (cid:112) αT log T < T α . After choosing α = 4Φ | Ω | and taking the union bound, we obtain with probability at least1 − T − √ Ω56 − T − | Ω | , we have Reg I ( Rai , µ ∗ , T ) ≤ (cid:18) p D + 1 (cid:19) (cid:16) (cid:112) | Ω | T (1 + 2 (cid:112) Φ log T ) (cid:17) . Lemma 1. The Rai algorithm satisfies Reg I ( Rai , µ ∗ , T ) = (cid:88) t ∈ [ T ] ( OPT ( µ ∗ ) − OPT ( γ t )) + (cid:88) t ∈ [ T ] Gap ( γ t , B t )+ (cid:88) t ∈ [ T ] ( V ( γ t , σ [ h t ]) − V ( µ ∗ , σ [ h t ]))+ (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) . roof. We have Reg I ( Rai , µ ∗ , T ) = OPT ( µ ∗ ) · T − (cid:88) t ∈ [ T ] v ( ω t , a t )= OPT ( µ ∗ ) · T − (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) . Now, note that for t ∈ T , E µ ∗ [ v ( ω t , a t ) | h t ] = V ( µ ∗ , σ [ h t ])= V ( γ t , σ [ h t ]) + ( V ( µ ∗ , σ [ h t ]) − V ( γ t , σ [ h t ]))= sup σ ∈ Pers ( B t ) V ( γ t , σ ) + ( V ( µ ∗ , σ [ h t ]) − V ( γ t , σ [ h t ])) . Thus, we have Reg I ( Rai , µ ∗ , T ) = OPT ( µ ∗ ) · T − (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) = OPT ( µ ∗ ) · T − (cid:88) t ∈ [ T ] V ( µ ∗ , σ [ h t ]) + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) = (cid:88) t ∈ [ T ] ( OPT ( µ ∗ ) − V ( µ ∗ , σ [ h t ])) + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) . (7)Finally, note that OPT ( µ ∗ ) − V ( µ ∗ , σ [ h t ]) = OPT ( µ ∗ ) − V ( γ t , σ [ h t ]) + V ( γ t , σ [ h t ]) − V ( µ ∗ , σ [ h t ])= ( OPT ( µ ∗ ) − OPT ( γ t )) + ( OPT ( γ t ) − V ( γ t , σ [ h t ]))+ ( V ( γ t , σ [ h t ]) − V ( µ ∗ , σ [ h t ]))= ( OPT ( µ ∗ ) − OPT ( γ t )) + Gap ( γ t , B t )+ ( V ( γ t , σ [ h t ]) − V ( µ ∗ , σ [ h t ])) , where in the final equality, we have used the fact that OPT ( γ t ) − V ( γ t , σ [ h t ]) = Gap ( γ t , B t ).Substituting the preceding expression into (7) yields the lemma statement. Lemma 2. On the event { µ ∗ ∈ ∩ t ∈ [ T ] B t } , we have Reg I ( Rai , µ ∗ , T ) ≤ (cid:18) p D + 1 (cid:19) (cid:88) t ∈ [ T ] (cid:15) t + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) . Proof. In Lemma 1, we obtain the following expression for the regret: Reg I ( Rai , µ ∗ , T ) = (cid:88) t ∈ [ T ] ( OPT ( µ ∗ ) − OPT ( γ t )) + (cid:88) t ∈ [ T ] Gap ( γ t , B t )+ (cid:88) t ∈ [ T ] ( V ( γ t , σ [ h t ]) − V ( µ ∗ , σ [ h t ])) + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) . OPT ( µ ∗ ) − OPT ( γ t ) ≤ Gap ( µ ∗ , B ( µ ∗ , (cid:107) µ ∗ − γ t (cid:107) ) + 12 · (cid:107) µ ∗ − γ t (cid:107) ≤ (cid:18) p D (cid:19) · (cid:107) µ ∗ − γ t (cid:107) + 12 · (cid:107) µ ∗ − γ t (cid:107) , where the second inequality follows from Proposition 1. Furthermore, from Lemma 5, we have V ( γ t , σ [ h t ]) − V ( µ ∗ , σ [ h t ]) ≤ (cid:107) µ ∗ − γ t (cid:107) Taken together, we have Reg I ( Rai , µ ∗ , T ) ≤ (cid:88) t ∈ [ T ] Gap ( γ t , B t ) + (cid:18) p D + 1 (cid:19) (cid:88) t ∈ [ T ] (cid:107) µ ∗ − γ t (cid:107) + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) . Finally, in Lemma 3, we show that on the event { µ ∗ ∈ B t } , we have Gap ( γ t , B t ) ≤ (cid:16) p D (cid:17) (cid:15) t . Thus,on the event { µ ∗ ∈ ∩ t ∈ [ T ] B t } , we obtain Reg I ( Rai , µ ∗ , T ) ≤ (cid:18) p D + 1 (cid:19) (cid:88) t ∈ [ T ] (cid:15) t + (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) . Lemma 3. For t ∈ [ T ] , on the event { µ ∗ ∈ B t } , we have Gap ( γ t , B t ) ≤ (cid:18) p D (cid:19) (cid:15) t . Proof. On the event { µ ∗ ∈ B t } , we have γ t ( ω ) ≥ µ ( ω ) − (cid:107) γ t − µ ∗ (cid:107) ≥ p − (cid:15) t . Thus, for (cid:15) t < p , we have min ω γ t ( ω ) ≥ p . Using the same argument as in Proposition 1, we thenobtain Gap ( γ t , B t ) = Gap ( γ t , B ( γ t , (cid:15) t )) ≤ (cid:18) D min ω γ t ( ω ) (cid:19) (cid:15) t ≤ (cid:18) p D (cid:19) (cid:15) t . For (cid:15) t > p / 2, the bound holds trivially since 16 (cid:15) t /p D > Lemma 4. For any µ , µ ∈ ∆(Ω) , we have OPT ( µ ) − OPT ( µ ) ≤ Gap ( µ , B ( µ , (cid:107) µ − µ (cid:107) )) + 12 · (cid:107) µ − µ (cid:107) . Proof. Fix µ , µ ∈ ∆(Ω). For i ∈ { , } , let σ i ∈ arg max σ (cid:48) ∈ Pers ( µ i ) V ( µ i , σ (cid:48) ). By definition, wehave OPT ( µ i ) = V ( µ i , σ i ). 20ext, among all signaling mechanisms that are persuasive for all µ ∈ B ( µ , (cid:107) µ − µ (cid:107) ), let σ maximize V ( µ , σ ). Since σ is persuasive for µ , we have OPT ( µ ) = V ( µ , σ ) ≥ V ( µ , σ ). Thus,we have OPT ( µ ) − OPT ( µ ) = V ( µ , σ ) − V ( µ , σ ) ≤ V ( µ , σ ) − V ( µ , σ )= V ( µ , σ ) − V ( µ , σ ) + V ( µ , σ ) − V ( µ , σ ) ≤ Gap ( µ , B ( µ , (cid:107) µ − µ (cid:107) )) + 12 · (cid:107) µ − µ (cid:107) . Here, the inequality follows from the definition of Gap ( · ), and from Lemma 5. Lemma 5. For any µ , µ ∈ ∆(Ω) and any signaling mechanism σ , we have | V ( µ , σ ) − V ( µ , σ ) | ≤ · (cid:107) µ − µ (cid:107) . Proof. Fix µ , µ ∈ ∆(Ω). For any signaling mechanism σ that is persuasive under µ , we have forany x ∈ R , | V ( µ , σ ) − V ( µ , σ ) | = (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) ω ∈ Ω ( µ ( ω ) − µ ( ω )) (cid:32)(cid:88) a ∈ A σ ( ω, a ) v ( ω, a ) − x (cid:33)(cid:12)(cid:12)(cid:12)(cid:12)(cid:12) ≤ (cid:107) µ − µ (cid:107) · sup ω ∈ Ω (cid:12)(cid:12)(cid:12)(cid:12)(cid:12)(cid:88) a ∈ A σ ( ω, a ) v ( ω, a ) − x (cid:12)(cid:12)(cid:12)(cid:12)(cid:12) , where we have used the H¨older’s inequality in the last line. Optimizing over x and using the factthat the sender’s valuations lie in [0 , 1] yields the result. A.2 Proof of Theorem 1 Proof of Theorem 1. If µ ∗ ∈ B t for each t ∈ [ T ], then since σ [ h t ] is persuasive under all priors in B t , we deduce that σ [ h t ] is persuasive under prior µ ∗ for all t ∈ [ T ]. Thus, we obtain that the Rai -algorithm is β -persuasive for β = sup µ ∗ ∈B P µ ∗ (cid:0) ∩ t ∈ [ T ] B t (cid:54)(cid:51) µ ∗ (cid:1) . Now, for any µ ∈ B , using the union bound we get P µ (cid:0) ∩ t ∈ [ T ] B t (cid:54)(cid:51) µ (cid:1) = P µ (cid:0) ∪ t ∈ [ T ] B ct (cid:51) µ (cid:1) ≤ (cid:88) t ∈ [ T ] P µ ( B ct (cid:51) µ )= (cid:88) t ∈ [ T ] P µ ( (cid:107) γ t − µ (cid:107) > (cid:15) t )= (cid:88) t ∈ [ T ] P µ (cid:32) (cid:107) γ t − µ (cid:107) > (cid:114) | Ω | t (cid:16) (cid:112) Φ log T (cid:17)(cid:33) . t < Φ log T , we have (cid:114) | Ω | t (cid:16) (cid:112) Φ log T (cid:17) > (cid:112) | Ω | (cid:18) √ Φ log T (cid:19) ≥ . Hence, P µ (cid:18) (cid:107) γ t − µ (cid:107) > (cid:113) | Ω | t (cid:0) √ Φ log T (cid:1)(cid:19) = 0. On the other hand, for t ≥ Φ log T , we have √ Φ log T ≤ √ t , and hence from Lemma 6, we obtain (cid:88) t ≥ Φ log T P µ (cid:32) (cid:107) γ t − µ (cid:107) > (cid:114) | Ω | t (cid:16) (cid:112) Φ log T (cid:17)(cid:33) ≤ (cid:88) t ≥ Φ log T exp (cid:32) − 3Φ log T √ Ω56 (cid:33) ≤ T − √ Ω56 (cid:18) T − Φ log T (cid:19) ≤ T − √ Ω56 . Setting Φ > 20 implies the final term is at most T − . . B Concentration Inequalities In this section, we provide some concentration inequalities used in the proofs of our main results.We use the same notation as in the main text. The following lemma provides a bound on the (cid:96) -norm of the deviation of the empirical distribution from its mean. Lemma 6. For each t ∈ [ T ] , and for any µ ∈ ∆(Ω) , we have for all < Φ t ≤ √ t , P µ (cid:32) (cid:107) γ t − µ (cid:107) ≥ (cid:114) | Ω | t (1 + Φ t ) (cid:33) ≤ exp (cid:32) − t √ Ω56 (cid:33) I (cid:40)(cid:114) | Ω | t (1 + Φ t ) ≤ (cid:41) . Proof. Let X t ∈ { , } | Ω | denote the random variable with X t ( ω ) = I { ω t = ω } , and define Y t = X t − E µ [ X t ]. Let Z t = (cid:107) (cid:80) τ ∈ [ t ] Y τ (cid:107) . Since (cid:107) Y t (cid:107) ≤ (cid:107) X t − E µ [ X t ] (cid:107) ≤ t ∈ [ T ], by Foucartand Rauhut [2013, Corollary 8.46], we obtain for each t ∈ [ T ], P µ ( Z t ≥ E µ [ Z t ] + s ) ≤ exp (cid:18) − s t + 6 E µ [ Z t ] + s ) (cid:19) . Next, letting Z t,ω = | (cid:80) τ ∈ [ t ] Y τ ( ω ) | for ω ∈ Ω, we obtain E µ [ Z t ] = (cid:88) ω ∈ Ω E µ [ Z t,ω ]= (cid:88) ω ∈ Ω E µ [ (cid:113) Z t,ω ] ≤ (cid:88) ω ∈ Ω (cid:113) E µ [ Z t,ω ]= (cid:88) ω ∈ Ω (cid:115) (cid:88) τ ∈ [ t ] Var µ [ Y τ ( ω )]= √ t · (cid:88) ω ∈ Ω (cid:112) µ ( ω )(1 − µ ( ω )) ≤ (cid:112) | Ω | t, E µ [ Y t ( ω )] = 0, we have E [ Z t,ω ] = (cid:80) τ ∈ [ t ] Var µ [ Y τ ( ω )]. The final step follows from astraightforward optimization. Thus, we obtain P µ (cid:16) Z t ≥ (cid:112) | Ω | t + s (cid:17) ≤ exp − s (cid:16) t + 6 (cid:112) | Ω | t + s (cid:17) . Choosing s = Φ t (cid:112) | Ω | t for 0 < Φ t ≤ √ t , and noting that Z t = t (cid:107) γ t − µ (cid:107) , we obtain P µ (cid:32) (cid:107) γ t − µ (cid:107) ≥ (cid:114) | Ω | t (1 + Φ t ) (cid:33) ≤ exp − t | Ω | t (cid:16) t + 6 (cid:112) | Ω | t + Φ t (cid:112) | Ω | t (cid:17) ≤ exp (cid:32) − t (cid:112) | Ω | (cid:0) 12 + Φ t / √ t (cid:1) (cid:33) ≤ exp (cid:32) − t √ Ω56 (cid:33) . The lemma statement then follows after noticing that for all t ∈ [ T ], we have (cid:107) γ t − µ (cid:107) ≤ (cid:107) γ t (cid:107) + (cid:107) µ (cid:107) ≤ Lemma 7. For all µ ∗ ∈ B and α > , we have P µ ∗ (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) ≥ (cid:112) αT log T < T α . Proof. For t ∈ [ T ], let X t (cid:44) E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ). Observe that E µ ∗ [ X t | h t ] = 0 and | X t | ≤ { X t : t ∈ [ T ] } is a bounded martingale difference sequence. Hence, fromAzuma-Hoeffding Boucheron et al. [2013], we obtain for z ≥ P µ ∗ (cid:88) t ∈ [ T ] E µ ∗ [ v ( ω t , a t ) | h t ] − v ( ω t , a t ) ≥ z < exp (cid:18) − z T (cid:19) . Choosing z = √ αT log T for α ≥ C Proofs from Section 4 This section provides the proof of our main results in Section 4. C.1 Proof of Proposition 2 In this section, we provide the proof of Proposition 2.23 roof of Proposition 2. It is straightforward to verify that the following signaling mechanism σ ∗ ∈ Pers ( µ ∗ ) optimizes the sender’s expected utility among all mechanisms in Pers ( µ ∗ ): σ ∗ ( ω , a ) = σ ∗ ( ω , a ) = 12 ,σ ∗ ( ω , a ) = σ ∗ ( ω , a ) = 12 + D − p ) ,σ ∗ ( ω , a ) = σ ∗ ( ω , a ) = 12 − D − p ) ,σ ∗ ( ω, a ) = 0 , otherwise . Since the action recommendations are always in { a , a } , we obtain OPT ( µ ∗ ) = 1.By the linearity of obedience constraints and µ ∗ = (¯ µ + ¯ µ ) / 2, it follows that Pers ( { µ ∗ , ¯ µ , ¯ µ } )can be obtained by imposing the obedience constraints at priors ¯ µ and ¯ µ . The optimizationproblem max σ { V ( µ ∗ , σ ) : σ ∈ Pers ( { ¯ µ , ¯ µ } ) } can be solved to obtain the following optimal signalingmechanism: ˆ σ ( ω , a ) = ˆ σ ( ω , a ) = 12ˆ σ ( ω , a ) = ˆ σ ( ω , a ) = XZ ˆ σ ( ω , a ) = ˆ σ ( ω , a ) = YZ ˆ σ ( ω , a ) = ˆ σ ( ω , a ) = 1 − ˆ σ ( ω , a ) − ˆ σ ( ω , a )ˆ σ ( ω, a ) = 0 , otherwise , where X = 2 p (1 − p − (cid:15) )(1 − p + D ) D + p (1 − p + (cid:15) )(1 − p − D − D ) Y = p (1 − p − (cid:15) )(1 − p − D + 2 D ) + 2 p (1 − p + (cid:15) )(1 − p − D )(1 − D ) D Z = (1 − p + (cid:15) ) (1 − p − D )(1 − D )(1 − p − D − D ) − (1 − p − (cid:15) ) (1 − p + D )(1 − p − D + 2 D )The difference in the sender’s expected utility between using the optimal persuasive signalingmechanism for prior µ ∗ ∈ B and using the optimal signaling mechanism that is persuasive for allpriors in { µ ∗ , ¯ µ , ¯ µ } is given by Gap ( µ ∗ , Pers ( µ ∗ , ¯ µ , ¯ µ )) = V ( µ ∗ , σ ∗ ) − V ( µ ∗ , ˆ σ ) ≥ (cid:15) / Dp (1 + (cid:15)/ − Dp − D ) Dp + (cid:15) ≥ (cid:15) Dp . C.2 Proof of Theorem 4 Proof of Theorem 4. The proof uses the same instance I as in the proof of Theorem 3, but usesa different construction for the random events. Specifically, for a prior µ ∈ B , let Y t ( µ ) indicate24hether the signaling mechanism σ a [ h t ] chosen by the algorithm a after history h t is persuasive forprior µ : Y t ( µ ) (cid:44) I { σ a [ h t ] ∈ Pers ( µ ) } . Since a is weakly β T -persuasive, we have E µ [ (cid:80) t ∈ [ T ] Y t ( µ )] ≥ T (1 − β T ) for all µ ∈ B . Define theevent F T ( µ ) (cid:44) h T : (cid:88) t ∈ [ T ] Y t ( µ ) ≥ T . Thus, on the event F T ( µ ) the signaling mechanism chosen by the algorithm a is persuasive for atleast 3 T / E µ [ (cid:80) t ∈ [ T ] Y T ( µ )] ≤ T P µ ( F T ( µ )) + T (1 − P µ ( F T ( µ ))), we have P µ ( F T ( µ )) ≥ − β T for all µ ∈ B .For (cid:15) ∈ (0 , − p ) to be chosen later, let ¯ µ , ¯ µ , ¯ µ be defined as in the proof of Theorem 3. Foreach i ∈ { , , } and for all (cid:15) ∈ (0 , − p ), we have ¯ µ i ∈ B and hence P µ ( F T (¯ µ i )) ≥ − β T .On the event F T (¯ µ ) ∩ F T (¯ µ ), we have (cid:88) t ∈ [ T ] Y t (¯ µ ) Y t (¯ µ ) ≥ (cid:88) t ∈ [ T ] Y t (¯ µ ) + (cid:88) t ∈ [ T ] Y t (¯ µ ) − T ≥ T , where the first inequality follows from ab ≥ a + b − a, b ∈ [0 , T / σ [ h t ] is persuasive for both priors ¯ µ and ¯ µ , and hence alsofor prior ¯ µ because of the linearity of the obedience constraints. On each of these time periods,we have OPT (¯ µ ) − V (¯ µ , σ a [ h t ]) ≥ Gap (¯ µ , { ¯ µ , ¯ µ , ¯ µ } ) ≥ (cid:15) Dp . Thus, we obtain, on the event F T (¯ µ ) ∩ F T (¯ µ ), (cid:88) t ∈ [ T ] ( OPT (¯ µ ) − V (¯ µ , σ a [ h t ])) Y t (¯ µ ) Y t (¯ µ ) ≥ (cid:15)T Dp . Now, by the same argument as in the proof of Theorem 3, using the Pinsker’s inequality, weobtain P ¯ µ ( F T (¯ µ ) ∩ F T (¯ µ )) ≥ − β T − (cid:15) √ T . Moreover, since Y t (¯ µ ) ∈ { , } is h t -measurable,by the Azuma-Hoeffding inequality Boucheron et al. [2013], we have P ¯ µ (cid:88) t ∈ [ T ] ( V (¯ µ i , σ a [ h t ]) − v ( ω t , a t )) Y t (¯ µ ) < −√ T < e − / . Thus, with probability at least 2 − β T − (cid:15) √ T − e − / , we have (cid:88) t ∈ [ T ] ( OPT (¯ µ ) − v ( ω t , a t ))) Y t (¯ µ ) ≥ (cid:88) t ∈ [ T ] ( OPT (¯ µ ) − V (¯ µ , σ a [ h t ])) Y t (¯ µ ) − √ T ≥ (cid:88) t ∈ [ T ] ( OPT (¯ µ ) − V (¯ µ , σ a [ h t ])) Y t (¯ µ ) Y t (¯ µ ) − √ T ≥ (cid:15)T Dp − √ T . For T ≥ T = − p ) , choosing (cid:15) = √ T ≤ − p , we obtain with probability at least − β T , (cid:88) t ∈ [ T ] ( OPT (¯ µ ) − v ( ω t , a t ))) I { σ a [ h t ] ∈ Pers (¯ µ ) } ≥ √ T (cid:18) Dp − (cid:19) ≥ √ T Dp , for Dp < //