Optimal Transport of Information
Semyon Malamud†, Anna Cieslak‡, and Andreas Schrimpf§ ∗

This version: March 2, 2021
Abstract
We study the general problem of Bayesian persuasion (optimal information design) with continuous actions and continuous state space in arbitrary dimensions. First, we show that with a finite signal space, the optimal information design is always given by a partition. Second, we take the limit of an infinite signal space and characterize the solution in terms of a Monge-Kantorovich optimal transport problem with an endogenous information transport cost. We use our novel approach to: 1. Derive necessary and sufficient conditions for optimality based on Bregman divergences for non-convex functions. 2. Compute exact bounds for the Hausdorff dimension of the support of an optimal policy. 3. Derive a non-linear, second-order partial differential equation whose solutions correspond to regular optimal policies. We illustrate the power of our approach by providing explicit solutions to several non-linear, multidimensional Bayesian persuasion problems.
Keywords: Bayesian Persuasion, Information Design, Signalling, Optimal Transport
JEL: D82, D83, E52, E58, E61

∗ We thank Darrell Duffie, Piotr Dworczak, Jean-Charles Rochet, and Stephen Morris (AEA discussant) as well as seminar participants at Caltech, UBC, SUFE, SFI and conference participants at the 2020 AEA meeting in San Diego for helpful comments. Parts of this paper were written when Malamud visited the Bank for International Settlements (BIS) as a research fellow. The views in this article are those of the authors and do not necessarily represent those of the BIS.
† Swiss Finance Institute, EPF Lausanne, and CEPR; E-mail: [email protected]
‡ Duke University, Fuqua School of Business, CEPR and NBER; E-mail: [email protected]
§ Bank for International Settlements (BIS) and CEPR; E-mail: [email protected]

Introduction
We study the general problem of Bayesian persuasion (optimal information design) introduced in the seminal work of Kamenica and Gentzkow (2011). We show that solutions to the optimal information design problem exhibit a remarkable mathematical structure when the private information of the sender (the state) is a random vector in R^L with a prior distribution absolutely continuous with respect to the Lebesgue measure.

We start by solving a restricted problem in which the sender is constrained to a finite set of signals. We show that it is always optimal for the sender to first partition the state space into a finite number of "clusters" and then communicate to the public to which cluster the state belongs. When public actions are functions of the expected state, we show that these clusters are given by convex polygons, formalizing the intuition that similar states are "bunched" together.

Our explicit characterization of optimal partitions allows us to take the continuous limit and show that these partitions converge to a solution to the unconstrained problem. We establish a surprising connection between optimal information design and the Monge-Kantorovich theory of optimal transport, whereby the sender effectively finds an optimal way of "transporting information" to the receiver, with an endogenous information transport cost. In the case when public actions are a function of expectations about (multiple, arbitrary)

Footnotes: See also Aumann and Maschler (1995), Calzolari and Pavan (2006), Ostrovsky and Schwarz (2010), and Rayo and Segal (2010) for important prior contributions to the literature on communication with commitment. The term "information design" was introduced in Taneva (2015) and Bergemann and Morris (2016). See Bergemann and Morris (2019) and Kamenica (2019) for excellent reviews. Kleinberg and Mullainathan (2019) argue that clustering (partitioning the state space into discrete cells) is the most natural way to simplify information processing in complex environments.
Our results provide a theoretical foundation for such clustering. Note that, formally, in a Bayesian persuasion framework, economic agents (signal receivers) would need to use (potentially complex) calculations underlying the Bayes rule to compute the conditional probabilities. However, the sender could just communicate these conditional probabilities directly to the public, and it is enough to have some specialists verify that these probabilities are consistent with the Bayes rule. This way, the sender fully takes over the "simplification" task. One important real-world problem arises when market participants, including the sender himself, do not know the "true" probability distribution, in which case methods from robust optimization need to be used. See Dworczak and Pavan (2020).
The Model
There are four time periods, t = 0−, 0, 1, 2. The information designer (the sender) believes that the state ω (the private information of the sender) is a random vector taking values in Ω ⊂ R^L, an open subset of R^L equipped with the Borel sigma-algebra B, distributed with a density µ(ω) that is strictly positive on Ω.

Following Kamenica and Gentzkow (2011), we assume that the sender is able to commit to an information design at time t = 0−, before the state ω is realized. Full commitment means that the sender is completely transparent about the exact structure of the map from its private information to policy announcements. In particular, we abstract from issues related to imperfect commitment and reputation building by the sender. The sender learns the realization of the state ω at time t = 0, while the public only learns it at time t = 1. The vector ω represents the full set of private information of the sender. The sender's objective is then to decide how much, and what kind of, information about ω to reveal to the public through policy announcements.

We first define the information design to which the sender commits, as well as the basic structure of the sender's announcements.

Definition 1 (Finite Information Design)
An information design is a probability space K (hereinafter, the signal space) and a probability measure P on K × Ω. An information design is K-finite if the signal space K has exactly K elements: |K| = K. An information design is finite if it is K-finite for some K ∈ N. In this case, without loss of generality, we assume that K = {1, · · · , K}.

Once the public observes a sender announcement k ∈ {1, · · · , K}, it updates its beliefs about the probability distribution of ω using the Bayes rule. To do this, the public just needs to know π_k(ω), the conditional probability of the observed announcement k given the state ω:

π_k(ω) ≡ P(k | ω).

As such, a K-finite information design can be equivalently characterized by a set of measurable functions π_k(ω), k ∈ {1, · · · , K}, satisfying the conditions π_k(ω) ∈ [0, 1] and Σ_k π_k(ω) = 1 with probability one.

Intuitively, an information design is a map from the space Ω of possible states to a "dictionary" of K messages, whereby the sender commits to a precise rule of selecting an announcement from the dictionary for every realization of ω. In principle, it is possible that this rule involves randomization, whereby, for a given ω, the sender randomly picks an announcement from a non-singleton subset of messages in the dictionary. An information design does not involve randomization if and only if it is a partition of the state space Ω.

Definition 2 (Randomization)
We say that an information design involves randomization if P(π_k(ω) ∉ {0, 1}) > 0 for some k. We say that an information design is a partition if P(π_k(ω) ∈ {0, 1}) = 1 for all k = 1, · · · , K. In this case,

∪_{k=1}^K {ω : π_k(ω) = 1}   (1)

is a Lebesgue-almost sure partition of Ω in the sense that Ω \ ∪_{k=1}^K {ω : π_k(ω) = 1} has Lebesgue measure zero, and the subsets of the partition (1) are Lebesgue-almost surely disjoint.

We use π̄ = (π_k(ω))_{k=1}^K ∈ [0, 1]^K to denote the random K-dimensional vector representing an information design, and

{(π_k(ω))_{k=1}^K : π_k(ω) ≥ 0, Σ_k π_k(ω) = 1}

to denote the set of all possible information designs, equipped with the metric

‖π̄ − π̄′‖ = max_k ∫_Ω |π_k(ω) − π′_k(ω)| dω.

As we show below, a key implication of this setting is that, with a continuous state space and under appropriate regularity conditions, randomization is never optimal, and hence optimal information design is always given by a partition. While this result might seem intuitive, its proof is highly non-trivial and is based on novel techniques that, to the best of our knowledge, have never been used in the literature before. It is this result that is key to all of our subsequent analysis of the unconstrained problem.

Footnotes: Note that, in practice, many of these messages are related to future actions of the sender. We do not assume that a message always implies a full commitment of the sender to implement the promised action. The only assumption we make is that the public uses the Bayes rule to update its probabilistic beliefs about the likelihood of the promised action.
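As a concrete illustration, a K-finite partition design and the associated Bayesian updating can be simulated directly. The following minimal sketch uses assumptions of our own choosing (a uniform prior on [0, 1] and K = 3 equal clusters); it checks that, after a deterministic announcement, the posterior concentrates on the announced cluster:

```python
import numpy as np

# Minimal sketch of a K-finite partition design (assumptions ours: uniform
# prior on [0, 1], K = 3 equal clusters). The design is deterministic,
# pi_k(omega) in {0, 1}: the announcement only reveals which cluster omega is in.
rng = np.random.default_rng(0)
K = 3
omegas = rng.uniform(0.0, 1.0, size=100_000)

# Deterministic signal: k = floor(omega * K), clipped to {0, ..., K-1}.
ks = np.minimum((omegas * K).astype(int), K - 1)

# Bayesian updating after announcement k: the posterior is the prior restricted
# to cluster k, so E[omega | k] should be close to the cluster midpoint.
posterior_means = [omegas[ks == k].mean() for k in range(K)]
print(posterior_means)   # approximately [1/6, 1/2, 5/6]
```

Because the design is a partition, each posterior is just the prior restricted to one cell, which is what makes partitions the natural benchmark in what follows.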
We assume that the economy is populated by N classes of agents, indexed by n = 1, · · · , N. Each class n may consist of a continuum of agents or a single, large agent. In the former case, we assume that all agents within each class are identical and take identical actions. Furthermore, we allow for the possibility that agents in each class have private information or simply differ in their prior beliefs. Namely, each class n has a class-specific prior with a Radon-Nikodym density µ_n(ω), n = 1, · · · , N, with respect to the sender's prior µ(ω). At time t = 0, upon observing a signal k, each agent of class n selects an action a_n from the action space R ⊂ R^m to maximize the expected utility function E_n[U_n(a_n, a, ω) | k]. We use a = {a_n}_{n=1}^N ∈ R^M to denote the vector of actions of all agents' classes, where we have defined M = Nm.
Thus, we allow each agent's utility to depend on the actions of other agents in the economy. These actions could represent, for instance, consumption, investment, production, or price setting choices by market participants, consumers, or firms. Upon observing the sender's signal k, agents of class n update their prior µ_n(ω)µ(ω) using the Bayes rule to the class-specific posterior distribution

π_k(ω) µ_n(ω) µ(ω) / ∫ π_k(ω̃) µ_n(ω̃) µ(ω̃) dω̃.

As a result, their expected utility conditional on observing k is given by

E_n[U_n(a_n, a, ω) | k] = ∫ π_k(ω) µ_n(ω) µ(ω) U_n(a_n, a, ω) dω / ∫ π_k(ω) µ_n(ω) µ(ω) dω.

Clearly, the optimal action a_n that maximizes this expected utility coincides with that of an agent with the prior µ(ω) and a state-dependent utility function Ũ_n(a_n, a, ω) ≡ µ_n(ω) U_n(a_n, a, ω). Hence, instead of assuming heterogeneous priors, we can replace U_n with Ũ_n and assume, without loss of generality, that all agents have a common prior µ(ω), coinciding with that of the sender. From now on, we will always use E[·] to denote the expectation under this measure, and this is the only type of expectation we will use.

Expected utility depends on agents' own actions as well as on the vector a of actions of other agents and the state ω. Thus, an equilibrium a(k) = {a(n, k)}_{n=1}^N is a solution to the fixed point system

a_n(k) = arg max_{a_n} E[U_n(a_n, a(k), ω) | k].

We use C²(Ω) to denote the set of functions that are twice continuously differentiable in Ω. We will also use D_a and D_{aa} to denote, respectively, the gradient and the Hessian with respect to the variable a. Let

G̃_n(a_n, a, ω) ≡ D_{a_n} U_n(a_n, a, ω),

and let G_n(a, ω) ≡ G̃_n(a(n), a, ω) be G̃_n evaluated at the equilibrium action.

Assumption 1
We assume that there exists an integrable majorant Y_n(ω) ≥ 0 such that Y_n(ω) ≥ U_n(a_n, a, ω) for all (a, a_n) ∈ R^{N+1}, ω ∈ Ω. The action space R is a convex, open subset of R^m. The function U_n(a_n, a, ω) ∈ C²(R × R^N × Ω) is strictly concave in a_n, where we have defined R^N = R × · · · × R (N times), and is such that lim_{a_n → ∂R} U_n = −∞. Furthermore, the map G = (G_n)_{n=1}^N : R^M × Ω → R^M is strictly monotone in a for each ω, and ‖G(a, ω)‖ + ‖D_a G(a, ω)‖ has an integrable majorant Ỹ_X(ω) for any compact subset X of R^{N+1}. Here, M = Nm.
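Under strict monotonicity, the conditional first-order condition E[G(a(k), ω) | k] = 0 has a unique solution that can be found by a one-dimensional search in the scalar case. A minimal numerical sketch, assuming the illustrative scalar specification G(a, ω) = ω − a − a³ (our choice, not the paper's; it satisfies (a₁ − a₂)(G(a₁, ω) − G(a₂, ω)) < 0 for a₁ ≠ a₂):

```python
import numpy as np

# Sketch of solving E[G(a(k), omega) | k] = 0 under strict monotonicity
# (Assumption 1), for the hypothetical scalar case G(a, omega) = omega - a - a**3.
def solve_equilibrium(omega_samples, tol=1e-12):
    """Solve E[omega | k] = a + a**3 for a by bisection (a + a**3 is increasing)."""
    target = omega_samples.mean()
    lo, hi = -10.0, 10.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mid + mid**3 < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

rng = np.random.default_rng(1)
omegas = rng.normal(2.0, 0.5, size=50_000)    # states pooled into some signal k
a_star = solve_equilibrium(omegas)
print(a_star)    # close to 1, since 1 + 1**3 = 2 = E[omega | k]
```

Strict monotonicity is what guarantees that the bisection bracket contains exactly one root, mirroring the uniqueness claim of Lemma 3 below.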
Assumption 1 implies that the following is true:
Lemma 3
There always exists a unique equilibrium a = a*(π̄). It is the unique solution to the fixed point system

E[G(a(k), ω) | k] = 0.

Footnotes: This is a form of Inada condition ensuring that the optimum is always in the interior of R. An integrable majorant is needed to apply the Fatou lemma and conclude that lim_{a_n → ∂R} E[U_n(a_n, a, ω) | k] = −∞ always. A map G is strictly monotone if (a₁ − a₂)ᵀ(G(a₁) − G(a₂)) < 0 for all a₁ ≠ a₂.

Optimal Information Design

Without loss of generality, we may assume that at the optimum we always have a(k) ≠ a(k̃) for any k ≠ k̃. That is, different signals always induce different actions. We assume that the sender chooses the information design to maximize the expected public welfare function W(a, ω) over all possible action profiles satisfying the participation (optimality) constraints of the public:

π̄* = arg max_π̄ E[W(a*(π̄), ω)] = arg max_π̄ {E[W(a, ω)] : a maximizes agents' utilities} = max_{π̄, a} {E[W(a(k), ω)] : E[G(a(k), ω) | k] = 0 ∀ k}.   (2)

By direct calculation, we can rewrite the expected social welfare function as

E[W(a*(π̄), ω)] = Σ_{k=1}^K ∫ W(a*(k, π̄), ω) π_k(ω) µ(ω) dω.   (3)

Example 4 (Moment Persuasion)
The most important example throughout this paper will be a setup where

G_n(a, ω) = g_n(ω) − a_n

for some functions g_n(ω), n = 1, · · · , M. Dworczak and Kolotilin (2019) refer to this setup as "moment persuasion." It is known that any continuous function W(a, ω) can be uniformly approximated by a separable function,

W(a, ω) ≈ Σ_{k=1}^κ f_k(a) ϕ_k(ω),

for some smooth functions ϕ_k, f_k (e.g., polynomials). As a result, defining

G_{n+l} = ϕ_l(ω) − a_{n+l}, l = 1, · · · , κ, and W̃(a) = Σ_{k=1}^κ f_k(a) a_{n+k},

we get E[W(a, ω)] ≈ E[W̃(a)]. More generally, if we approximate

G(a, ω) ≈ Σ_i ψ_i(a) φ_i(ω),

we get that E[G(a, ω) | k] = 0 is equivalent to the optimal action, a, being a function of conditional expectations of φ_i. Thus, any optimal information design problem considered in this paper can be approximated by a moment persuasion problem.
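In moment persuasion, the participation constraint E[G(a(k), ω) | k] = 0 pins down a(k) = E[g(ω) | k], so each induced action is a vector of conditional moments. A minimal numerical sketch, under assumptions of our own (g(ω) = (ω, ω²), a uniform prior, and a hypothetical two-signal partition at zero):

```python
import numpy as np

# Moment persuasion sketch (assumed setup): G_n(a, omega) = g_n(omega) - a_n,
# so E[G(a(k), omega) | k] = 0 pins down a(k) = E[g(omega) | k].
rng = np.random.default_rng(2)
omegas = rng.uniform(-1.0, 1.0, size=100_000)     # prior: uniform on [-1, 1]

g = lambda w: np.column_stack([w, w**2])          # two moments, M = 2

ks = (omegas >= 0).astype(int)                    # hypothetical 2-signal partition
moments = g(omegas)
actions = {k: moments[ks == k].mean(axis=0) for k in (0, 1)}
print(actions)   # a(0) ~ (-1/2, 1/3), a(1) ~ (1/2, 1/3)
```

Note that the second moment E[ω² | k] is the same in both clusters here: which moments a partition can separate depends on the geometry of g, a theme that recurs in Proposition 10 below.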
We also need a technical condition motivated by Example 4.
Assumption 2
There exist functions ψ, g : Ω → R₊ and a convex, increasing function f such that the unique solution to E[G(a(k), ω) | k] = 0 satisfies ‖a(k)‖ ≤ E[g(ω) | k] and

|W(a, ω)| + ‖D_a W(a, ω)‖ ≤ ψ(ω) + g(ω) f(‖a‖), with E[ψ(ω) + g(ω) f(g(ω))] < ∞.

Let Prob(k) = E[π_k(ω)] be the ex-ante probability of signal k. Then, since (x f(x))′′ = 2f′(x) + x f′′(x) ≥ 0, the function x f(x) is convex, and Jensen's inequality implies

E[W(a, ω)] = Σ_k Prob(k) E[W(a(k), ω) | k]
  ≤ Σ_k Prob(k) E[ψ(ω) + g(ω) f(‖a(k)‖) | k]
  ≤ E[ψ(ω)] + Σ_k Prob(k) E[g(ω) | k] f(E[g(ω) | k])
  ≤ E[ψ(ω)] + Σ_k Prob(k) E[g(ω) f(g(ω)) | k]
  = E[ψ(ω) + g(ω) f(g(ω))],   (4)

and, hence, social welfare is bounded. A similar argument implies that it depends smoothly on the information design. To state the main result of this section (the optimality of partitions), we need also the following definition.

Definition 5
We say that functions {f_1(ω), · · · , f_{L₁}(ω)}, ω ∈ Ω, are linearly independent modulo {g_1(ω), · · · , g_{L₂}(ω)} if there exist no real vectors h ∈ R^{L₁}, k ∈ R^{L₂} with ‖h‖ ≠ 0, such that

Σ_i h_i f_i(ω) = Σ_j k_j g_j(ω) for all ω ∈ Ω.

In particular, if L₁ = 1, then f_1(ω) is linearly independent modulo {g_1(ω), · · · , g_{L₂}(ω)} if f_1(ω) cannot be expressed as a linear combination of {g_1(ω), · · · , g_{L₂}(ω)}. We also need the following technical condition.
Definition 6
We say that
W, G are in a generic position if, for any fixed a, ã ∈ R^N with a ≠ ã, the function W(a, ω) − W(ã, ω) is linearly independent modulo {{G_n(a, ω)}_{n=1}^N, {G_n(ã, ω)}_{n=1}^N}. W, G are in generic position for generic functions W and G. We will also need a key property of real analytic functions that we use in our analysis (see, e.g., Hugonnier, Malamud and Trubowitz (2012)).

Proposition 7
If a real analytic function f(ω) is zero on a set of positive Lebesgue measure, then f is identically zero. Hence, if real analytic functions {f_1(ω), · · · , f_{L₁}(ω)} are linearly dependent modulo {g_1(ω), · · · , g_{L₂}(ω)} on some subset I ⊂ Ω of positive Lebesgue measure, then this linear dependence also holds on the whole Ω except, possibly, a set of Lebesgue measure zero.

Using Proposition 7, it is possible to prove the main result of this section.

Footnotes: The set of W, G that are not in generic position is nowhere dense in the space of continuous functions. A function is real analytic if it can be represented by a convergent power series in the neighborhood of any point in its domain.

Theorem 8 (Optimal finite information design) There always exists an optimal K-finite information design π̄* which is a partition. Furthermore, if W, G are real analytic in ω for each a and are in generic position, then any K-finite optimal information design is a partition.

While Theorem 8 may seem intuitive, its proof is non-trivial and is based on novel techniques (see the appendix). First, we prove the second part. The existence of an optimum π̄ follows by standard compactness arguments. Suppose now, on the contrary, that π̄ is not a partition. Then, for some k, we have π_k(ω) ∈ (0, 1) on some positive measure subset I ⊂ Ω. At the global maximum, under arbitrarily small perturbations, social welfare should decrease. We show that this can only be true if the conditions of Definition 6 are violated for ω ∈ I. However, since I has positive Lebesgue measure, Proposition 7 implies that they have to be violated on the whole of Ω. Finally, the first part follows by a simple approximation argument, because any function can be approximated by an analytic function satisfying the generic conditions of Definition 6.

The goal of this section is to provide a general characterization of the "optimal clusters" in Theorem 8. We use D_a G(a, ω) ∈ R^{M×M} to denote the Jacobian of the map G and, similarly, D_a W(a, ω) ∈ R^{1×M} the gradient of the welfare function W(a, ω) with respect to a. For any vectors x_k ∈ R^M, k = 1, · · · , K, and actions {a(k)}_{k=1}^K, let us define the partition

Ω*_k({x_ℓ}_{ℓ=1}^K, {a_ℓ}_{ℓ=1}^K) = {ω ∈ Ω : W(a(k), ω) − x_kᵀ G(a(k), ω) = max_{1≤l≤K} (W(a(l), ω) − x_lᵀ G(a(l), ω))}.   (5)

Equation (5) is basically the first-order condition for the optimization problem (2), whereby the x_k are the Lagrange multipliers of agents' participation constraints.

Theorem 9
Any optimal partition in Theorem 8 satisfies the following conditions:

• Local optimality holds: Ω_k = Ω*_k({x_ℓ}_{ℓ=1}^K, {a_ℓ}_{ℓ=1}^K) with x_kᵀ = D̄_a W(k) (D̄_a G(k))⁻¹, where we have defined, for each k = 1, · · · , K,

D̄_a W(k) = ∫_{Ω_k} D_a W(a(k), ω) µ(ω) dω,  D̄_a G(k) = ∫_{Ω_k} D_a G(a(k), ω) µ(ω) dω.

• The actions {a(k)}_{k=1}^K satisfy the fixed point system

∫_{Ω_k} G(a(k), ω) µ(ω) dω = 0, k = 1, · · · , K.   (6)

• The boundaries of Ω_k have Lebesgue measure zero and are a subset of the variety

∪_{k≠l} {ω ∈ R^m : W(a(k), ω) − x_kᵀ G(a(k), ω) = W(a(l), ω) − x_lᵀ G(a(l), ω)}.   (7)

A key insight of Theorem 9 comes from the characterization of the different clusters of an optimal partition. The sender has to solve the problem of maximizing social welfare (3) by inducing the desired actions, (a_n), of economic agents for every realization of ω. Ideally, the sender would like to induce a* = arg max_a W(a, ω). However, the ability of the sender to elicit the desired actions is limited by the participation constraints of the public, that is, the map from the posterior beliefs induced by communication to the actions of the public. Indeed, while the sender can induce any Bayes-rational beliefs (i.e., any posteriors consistent

Footnotes: This variety is real analytic when so are W and G. A real analytic variety in R^m is a subset of R^m defined by a set of identities f_i(ω) = 0, i = 1, · · · , I, where all functions f_i are real analytic. If at least one of the functions f_i(ω) is non-zero, then a real analytic variety is always a union of smooth manifolds and hence has Lebesgue measure zero. When W, G are real analytic and are in generic position, the variety {ω ∈ R^m : W(a(k), ω) − x_kᵀ G(a(k), ω) = W(a(l), ω) − x_lᵀ G(a(l), ω)} has Lebesgue measure zero for each k ≠ l.
with the prior), the participation constraints enter through the Lagrange multipliers x_k, so that the sender is maximizing the Lagrangian max_a (W(a, ω) − x_kᵀ G(a, ω)). Formula (5) shows that, inside cluster number k, the optimal action profile maximizes the respective Lagrangian. The boundaries of the clusters are then determined by the indifference conditions (7), ensuring that at the boundary between regions k and l the sender is indifferent between the respective action profiles a_k and a_l.

Several papers study the one-dimensional case (i.e., when L = 1, so that ω ∈ R) and derive conditions under which the optimal signal structure is a monotone partition into intervals. Such a monotonicity result is intuitive, as one would expect the optimal information design to only pool nearby states. The most general results currently available are due to Hopenhayn and Saeedi (2019) and Dworczak and Martini (2019), but they cover the case when the sender's utility (the social welfare function in our setting) only depends on E[ω] ∈ R. Under this assumption, Dworczak and Martini (2019) derive necessary and sufficient conditions guaranteeing that the optimal signal structure is a monotone partition of Ω into a union of disjoint intervals. Arieli et al. (2020) (see also Kleiner et al. (2020)) provide a full solution to the information design problem when a(k) = E[ω | k] and, in particular, show that the partition result does not hold in general when the signal space is continuous. Theorem 9 proves that a K-finite optimal information design is in fact always a partition when the state space is continuous and the signal space is discrete. However, no general results about the monotonicity of this partition can be established without imposing more structure on the problem.

Footnotes: See also Mensch (2018). This is equivalent to G(a, ω) = a − ω in our setting. In this case, formula (6) implies that the optimal action is given by a(k) = E[ω | k]. Of course, as Dworczak and Martini (2019) and Arieli et al. (2020) explain, even in the one-dimensional case the monotonicity cannot be ensured without additional technical conditions. No such conditions are known in the multi-dimensional case. Dworczak and Martini (2019) present an example with four possible actions (K = 4) and a two-dimensional state space (L = 2) for which they are able to show that the optimal information design is a partition into four convex polygons. See also Dworczak and Kolotilin (2019).

An optimal finite information design induces the piecewise-constant policy function

a(ω) = Σ_k a(k) 1_{Ω_k}(ω).

Since welfare can be written as E[W(a(ω), ω)], this function encodes all the properties of the optimal information design. It turns out that, in the case of moment persuasion (Example 4), the optimality conditions of Theorem 9 can be used to derive important, universal monotonicity properties of the optimal policy function a(ω).

Proposition 10
Suppose that we are in a moment persuasion setup: G = g(ω) − a with g(ω) : Ω → R^M and W(a, ω) = W(a). Suppose also that M ≤ L. Let X ⊂ Ω be an open set such that g is injective on X and g(X) is convex. Then the set g(Ω_k ∩ X) is convex for each k. In particular, Ω_k ∩ X is connected. Furthermore, the map D_a W(a(g⁻¹(x))) is monotone increasing on g(Ω_k ∩ X).

Injectivity of the map g and convexity of its image are crucial for the connectedness of the regions Ω_k. Without injectivity, even bounding the number of connected components of Ω_k is non-trivial. These effects become particularly strong in the limit when K → ∞, where lack of injectivity in the map g may lead to a breakdown of even minimal regularity properties of the optimal map.

The Unconstrained Problem and the Cost of Information Transport

The full, unconstrained optimal information design problem can be formulated as follows (see, e.g., Kamenica and Gentzkow (2011), Dworczak and Kolotilin (2019)):
Definition 11
Let ∆(Ω) be the set of probability measures on Ω and define ∆(∆(Ω)) similarly. Let also a(µ) be the unique solution to

∫ G(a, ω) dµ(ω) = 0,

and let

W̄(µ) = ∫ W(a(µ), ω) dµ(ω).

The optimal Bayesian persuasion (optimal information design) problem is to maximize

∫ W̄(µ) dτ(µ)

over all τ ∈ ∆(∆(Ω)) satisfying

∫ µ dτ(µ) = µ.

Such a τ is called an information design. We say that τ does not involve randomization if there exists a map a : Ω → R^M such that τ coincides with the set of distributions of ω conditional on a. In this case, a(ω) will be referred to as an optimal policy.

We start with the following simple lemma.

Lemma 12 When K → ∞, the maximal social welfare attained with the finite K converges to the maximal welfare attained in the full, unconstrained problem of Definition 11.

Proof of Lemma 12. The proof follows by standard compactness arguments. Indeed, pick a τ ∈ ∆(∆(Ω)). Because ∆(∆(Ω)) is compact in the weak* topology, we can approximate τ by a measure of finite support {µ_1, · · · , µ_K}. Q.E.D.

The next result is the key step in deriving properties of an optimal information design. We will use Supp(a) to denote the support of any map a:

Supp(a) = {x ∈ R^M : µ({ω : ‖a(ω) − x‖ < ε}) > 0 ∀ ε > 0}.

Definition 13
Let a*(ω) be the unique solution to G(a*(ω), ω) = 0. For any map x : R^M → R^M, we define

c(a, ω; x) ≡ W(a*(ω), ω) − W(a, ω) + x(a)ᵀ G(a, ω).

For any set Ξ ⊂ R^M, we define

φ_Ξ(ω; x) ≡ inf_{a∈Ξ} c(a, ω; x).

Everywhere in the sequel, we refer to c as the cost of information transport.

To gain some intuition behind the cost c, we note that W(a*(ω), ω) is the welfare attained from revealing that the true state is ω. Thus, W(a, ω) − W(a*(ω), ω) is the welfare gain from inducing a different (preferred) action a, and x(a)ᵀ G(a, ω) is the corresponding shadow cost of agents' participation constraints. The total cost of information transport is the sum of the true and the shadow costs of "transporting" information from a*(ω) to a.

Since E[W(a*(ω), ω)] is independent of the information design, maximizing expected welfare E[W(a(ω), ω)] is equivalent to minimizing E[W(a*(ω), ω) − W(a(ω), ω)]. From now on, we will be considering this equivalent formulation of the problem. Note that, for any policy a satisfying E[G(a(ω), ω) | a(ω) = a] = 0 and any bounded x, we always have

E[W(a*(ω), ω) − W(a(ω), ω)] = E[c(a(ω), ω; x)].   (8)

Thus, the problem of maximizing E[W(a(ω), ω)] over all admissible policies a(ω) is equivalent to the problem of minimizing the expected cost of information transport, E[c(a(ω), ω; x)]. By passing to the limit in Theorem 9, it is possible to prove the following result.
Theorem 14
Suppose that inf_{a,ω,z : ‖z‖=1} zᵀ D_a G(a, ω) z > 0 and sup_{a,ω} ‖D_a G(a, ω)‖ < ∞. Then, there exists a Borel-measurable optimal Bayesian persuasion solving the problem of Definition 11 that does not involve randomization. The corresponding optimal map a(ω) satisfies E[G(a(ω), ω) | a(ω) = a] = 0 for all a ∈ R^M. Furthermore, if we define Ξ = Supp(a) and

x(a)ᵀ = E[D_a W(a, ω) | a] (E[D_a G(a, ω) | a])⁻¹,   (9)

then

• we have

c(a(ω), ω; x) ≤ 0;   (10)

• we have

a(ω) = arg min_{b∈Ξ} c(b, ω; x),   (11)

and the function c(a(ω), ω; x) is Lipschitz continuous in ω;

• we have

E[W(a*(ω), ω) − W(a(ω), ω)] = E[φ_Ξ(ω; x)].

Furthermore, any optimal information design satisfies (10) and the following weaker form of (11):

a(ω) = arg min_{b∈Ξ} E[c(b, ω; x) | a(ω)].   (12)

Theorem 14 characterizes some important properties that are necessary for an optimal information design. We conjecture that, in fact, any optimal policy has to satisfy (11) (which clearly implies (12)). We will refer to an optimal policy satisfying (11) as a strong policy. Theorem 14 implies that a strong policy always exists. It turns out that strong policies possess certain remarkable properties that can be established using results from optimal transport theory. We first recall the classical optimal transport problem of Monge and Kantorovich (see, e.g., McCann and Guillen (2011)).
Definition 15
Consider two probability measures, µ(ω)dω on Ω and ν on Ξ. The optimal map problem (the Monge problem) is to find a map X : Ω → Ξ that minimizes

∫ c(X(ω), ω) µ(ω) dω

under the constraint that the random variable χ = X(ω) is distributed according to ν. The Kantorovich problem is to find a probability measure γ on Ξ × Ω that minimizes

∫ c(χ, ω) dγ(χ, ω)

over all γ whose marginals coincide with µ(ω)dω and ν, respectively.

It is known that, under very general conditions, the Monge problem and its Kantorovich relaxation have identical values, and an optimal map exists. It turns out that, remarkably, a strong optimal policy always solves the Monge problem.

Theorem 16 Any strong optimal policy a(ω) solves the Monge problem with ν being the distribution of the random vector a(ω).

Footnote: This result is a direct analog of Corollary 2.5 in Kramkov and Xu (2019), which established it in a special case.

The result of Theorem 16 puts us in a perfect position to apply all the powerful machinery of optimal transport theory and derive properties of a strong optimal policy.
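For intuition, a discrete Monge problem with uniformly weighted points can be solved by brute force over permutations, since with equal weights an optimal Kantorovich coupling can be taken to be a permutation. A small stdlib-only sketch; the quadratic cost here is an illustrative stand-in for the endogenous information transport cost:

```python
import itertools

# Discrete Monge sketch with uniform marginals: with n equally weighted source
# and target points, an optimal coupling is a permutation (a Monge map), so
# brute force over all n! permutations finds it.
def monge_bruteforce(src, dst, cost):
    n = len(src)
    best, best_cost = None, float("inf")
    for perm in itertools.permutations(range(n)):
        total = sum(cost(src[i], dst[perm[i]]) for i in range(n))
        if total < best_cost:
            best, best_cost = perm, total
    return best, best_cost

src = [0.0, 1.0, 2.0, 3.0]
dst = [2.5, 0.5, 3.5, 1.5]
perm, val = monge_bruteforce(src, dst, lambda a, b: (a - b)**2)
# For a strictly convex cost on the line, the optimal map is the monotone
# (sorted-to-sorted) matching.
print(perm, val)   # (1, 3, 0, 2) 1.0
```

The monotone structure of the minimizer is a one-dimensional preview of the c-cyclical monotonicity discussed next.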
Definition 17
We say that a map a : Ω → R^M is c-cyclically monotone if and only if all k ∈ N and ω_1, · · · , ω_k ∈ Ω satisfy

Σ_{i=1}^k c(a(ω_i), ω_i; x) ≤ Σ_{i=1}^k c(a(ω_i), ω_{σ(i)}; x)

for any permutation σ of k letters. We start with the following result, which is a direct consequence of a theorem of Smith and Knott (1992).
Corollary 18
Any strong optimal policy is c-cyclically monotone.

Cyclical monotonicity plays an important role in the theory of optimal demand. See, for example, Rochet (1987). In our setting, this result has a similar flavour: in order to induce an optimal action, the sender optimally aligns actions a with the state ω to minimize the cost of information transport, c.

We complete this section with a direct application of a theorem of Gangbo (1995) and Levin (1999). Recall that a function u(ω) is called c-convex if and only if u = (u^c̃)^c, where

u^c̃(a) = sup_{ω∈Ω} (−c(a, ω) − u(ω)),  v^c(ω) = sup_{a∈R^M} (−c(a, ω) − v(a)).

Footnotes: Kramkov and Xu (2019) consider the special case of W(a) = a₁a₂, g(ω) = (ω₁, ω₂)ᵀ and L = M = 2; the proof is completely analogous to that in Kramkov and Xu (2019). This result is, in fact, a direct consequence of (11), because each ω is already coupled with the "best" a(ω).

c-convex functions have many nice properties (see, e.g., McCann and Guillen (2011)). In particular, a c-convex function u is twice differentiable Lebesgue-almost everywhere and satisfies |Du(ω)| ≤ sup_a |D_ω c(a, ω)| and D_{ωω} u(ω) ≥ inf_a (−D_{ωω} c(a, ω)). The following is true (see, e.g., McCann and Guillen (2011)).
Corollary 19
Suppose that c is jointly continuous in (a, ω) and that the map a → D_ω c(a, ω) is injective for Lebesgue-almost every ω ∈ Ω. Let a = Y(ω, p) be the unique solution to D_ω c(a, ω) = −p for any p. Then, there exists a locally Lipschitz c-convex function u : Ω → R such that a(ω) = Y(ω, Du(ω)).

Corollaries 18 and 19 provide strong necessary conditions for an optimal policy. Unfortunately, in general we do not know if the conditions of Theorem 14 are also sufficient for optimality. As we show in the next section, such sufficiency can be established in the setting of moment persuasion which, as we explain above (see Example 4), can in fact be used to approximate any optimal information design problem.
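c-cyclical monotonicity (Definition 17) can be checked numerically on finite subsets of states. A small sketch, using assumptions of our own: the illustrative quadratic cost c(a, ω) = (a − ω)² and the hypothetical monotone policy a(ω) = ω³ (for this cost in one dimension, cyclical monotonicity amounts to comonotonicity of a(ω) and ω, by the rearrangement inequality):

```python
import itertools

# Brute-force check of c-cyclical monotonicity on a finite set of states:
# for every subset of size k and every permutation sigma, the identity
# coupling must be no more expensive than the permuted one.
def is_c_cyclically_monotone(a, cost, points, max_k=4):
    for k in range(2, max_k + 1):
        for subset in itertools.combinations(points, k):
            base = sum(cost(a(w), w) for w in subset)
            for sigma in itertools.permutations(range(k)):
                permuted = sum(cost(a(subset[i]), subset[sigma[i]])
                               for i in range(k))
                if base > permuted + 1e-12:
                    return False
    return True

points = [-1.0, -0.3, 0.2, 0.7, 1.4]
quad = lambda a, w: (a - w)**2
print(is_c_cyclically_monotone(lambda w: w**3, quad, points))   # True
print(is_c_cyclically_monotone(lambda w: -w, quad, points))     # False
```

The decreasing map fails the check, which matches Corollary 18: a strong optimal policy must align actions with states rather than reverse them.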
In the case of moment persuasion, G(a, ω) = a − g(ω), a*(ω) = g(ω), W(a, ω) = W(a), and x(a)^⊤ = D_a W(a). The key simplification comes from the fact that x(a) is independent of the choice of information design. We will slightly abuse the notation and introduce a modified definition of the function c. Namely, we define

c(a, b) = W(b) − W(a) + D_a W(a)(a − b).   (13)

As one can see from (13), the cost of information transport, c, coincides with the classic Bregman divergence that plays an important role in convex analysis (see, e.g., Rockafellar (1970)). A key innovation in this paper is the introduction of the Bregman divergence with a non-convex W. In this case, none of the classic results about the Bregman divergence hold true, and new techniques need to be developed.

Our key objective here is to understand the structure of the support set Ξ of an optimal policy. Here, the nature of the map g : Ω → R^M will play an important role, as can already be guessed from Proposition 10.

Definition 20
Let conv(X) be the closed convex hull of a set X ⊂ R^M. A set Ξ ⊂ R^M is X-maximal if inf_{a ∈ Ξ} c(a, b) ≤ 0 for all b ∈ X. A set Ξ is W-monotone if c(a_1, a_2) ≥ 0 for all a_1, a_2 ∈ Ξ. A set Ξ is W-convex if W(t a_1 + (1 − t) a_2) ≤ t W(a_1) + (1 − t) W(a_2) for all a_1, a_2 ∈ Ξ, t ∈ [0, 1]. We also define φ_Ξ(x) = inf_{a ∈ Ξ} c(a, x).

Note that we are again slightly abusing the notation, so that the function φ_Ξ from the previous section corresponds to φ_Ξ(g(ω)) in this section.

Lemma 21
Every W-convex set is W-monotone.

Indeed, the function q(t) = t W(a_1) + (1 − t) W(a_2) − W(t a_1 + (1 − t) a_2) satisfies q(t) ≥ q(0) = 0 and hence 0 ≤ q′(0) = c(a_2, a_1). For the readers' convenience, we now state an analog of Theorem 14 for the case of moment persuasion. Recall also that (by (8)) we are solving the equivalent problem of minimizing E[c(a(ω), g(ω))].

Corollary 22
Suppose that W = W(a) and G = a − g(ω) are such that |W(a)| + ‖D_a W(a)‖ ≤ f(a) for some convex function f satisfying E[f(g(ω))] < ∞. Then, there exists a strong optimal policy a(ω) with Ξ = Supp(a) such that:

• a(ω) = E[g(ω) | a(ω)], a(ω) = arg min_{b ∈ Ξ} c(b, g(ω)), and the function c(a(ω), g(ω)) is Lipschitz continuous in ω;

• if g is injective on a set X ⊂ Ω and g(X) is convex, then D_a W(a(g^{−1}(x))) is monotone increasing on X, while c(a(g^{−1}(x)), x) is convex on X and D_a W(a(g^{−1}(x))) is a subgradient of c(a(g^{−1}(x)), x).

Furthermore, any optimal policy satisfies c(a(ω), g(ω)) ≤ 0.

The main result of this section is the following theorem, which provides explicit and verifiable necessary and sufficient conditions for optimality of a given policy.
Theorem 23
For any optimal policy a(ω), the set Ξ = Supp(a) is g(Ω)-maximal and W-convex. Furthermore, any policy a(ω) satisfying a(ω) = E[g(ω) | a(ω)] and a(ω) = arg min_{b ∈ Ξ} c(b, g(ω)), and such that Ξ = Supp(a) is conv(g(Ω))-maximal, is optimal.

Intuitively, it is optimal to reveal information along the "domains of convexity" of W; hence the W-convexity of the support Ξ of an optimal policy. W-convexity of Ξ also implies W-monotonicity. It means that transporting information along Ξ is costly, as information on Ξ is already in its optimal "location". Similarly, maximality of Ξ means that any point outside of Ξ can be transported to some location on Ξ at a negative cost, improving the overall welfare. The arguments in the proof of Theorem 23 can also be used to shed some light on the uniqueness of optimal policies.

Proposition 24
Let a(ω) be a strong optimal policy with support Ξ. Let also Q_Ξ be the set {b ∈ R^M : φ_Ξ(b) = 0}. Then, if Ξ is X-maximal for some set X and Ξ ⊆ X, we have Ξ ⊆ Q_Ξ. Furthermore, if ã is another optimal policy with support Ξ̃, then Ξ̃ ⊆ Q_Ξ. If Ξ = Q_Ξ and arg min_{a ∈ Ξ} c(a, b) is a singleton for all b ∈ conv(g(Ω)), then the optimal policy is unique.

We conjecture that the conditions in Proposition 24 hold generically, and hence the optimal policy is unique for generic W. This is indeed the case in the explicitly solvable examples discussed in Section 5. Surprisingly, the W-convexity of Ξ (and, hence, its W-monotonicity) then follows automatically.
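The properties above are easy to check in closed form for simple non-convex welfare functions. A minimal one-dimensional sketch (the double-well W below is our own toy example, not from the paper) illustrates the point made after (13): classic non-negativity of the Bregman divergence fails for non-convex W, and yet a two-point support set can still be W-monotone and W-convex.

```python
# The information-transport cost (13), c(a, b) = W(b) - W(a) + W'(a)(a - b),
# for a non-convex (double-well) welfare function.
def W(a):  return -(a * a - 1.0) ** 2
def dW(a): return -4.0 * a * (a * a - 1.0)
def c(a, b): return W(b) - W(a) + dW(a) * (a - b)

# Classic non-negativity of the Bregman divergence fails once W is non-convex:
assert c(0.0, 2.0) < 0        # here c(0, 2) = W(2) - W(0) = -9 + 1 = -8

# Yet the candidate support Xi = {-1, +1} is W-monotone: c >= 0 on Xi x Xi ...
Xi = [-1.0, 1.0]
assert all(c(a1, a2) >= 0 for a1 in Xi for a2 in Xi)

# ... and W-convex: W lies below its chord between the two points of Xi.
assert all(
    W(t - (1 - t)) <= t * W(1.0) + (1 - t) * W(-1.0) + 1e-12
    for t in (i / 100 for i in range(101))
)
```

Here the welfare has two "peaks" at a = ±1, and the two-point set {−1, +1} passes both necessary conditions of Theorem 23; points between the peaks can be transported to the support at negative cost, exactly as maximality requires.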
We now discuss other, more subtle properties of optimal policies. We start with an application of Corollary 19.
Corollary 25
Suppose that M ≤ L, that D_ω g has rank M for Lebesgue-almost every ω, and that a → D_a W(a) is injective. Then, the map a → D_ω c(a, g(ω)) is injective. Let Y(ω, p) be the unique solution to D_ω c(a, g(ω)) = −p. Then there exists a locally Lipschitz c-convex function u(ω) such that a(ω) = Y(ω, Du(ω)). In particular, a(ω) is differentiable Lebesgue-almost everywhere.

Our next objective is to get an idea about the "amount" of information revealed by an optimal policy. When W(a) is convex (so that D_{aa}W(a) is positive semi-definite for any a), then full revelation is an optimal policy, and it is the only optimal policy if W is strictly convex. Thus, in this case, Ξ = g(Ω) may have dimension up to M. By contrast, if W is strictly concave (so that D_{aa}W(a) is negative semi-definite), then revealing no information is optimal and Ξ = {E[g(ω)]} has a dimension of zero. But what happens when W is neither concave nor convex? In this case, it is natural to expect that Ξ will be "smaller" than R^M, and its "smallness" depends on a "degree of concavity" of W. One natural candidate for such a degree is the number of negative eigenvalues of D_{aa}W. And indeed, as Tamura (2018) shows, when (1) W(a) = a^⊤Ha + b^⊤a for some matrix H ∈ R^{L×L} and some vector b ∈ R^L; (2) g(ω) = ω; and (3) µ(ω) is a multivariate Gaussian distribution, there exists an optimal policy a(ω) with Ξ being a subspace of R^M whose dimension equals the number of nonnegative eigenvalues of H. A key simplifying property of the linear optimal policy in Tamura (2018) is its regularity: a linear subspace is a smooth manifold and, hence, has a natural notion of dimension. However, even in the simple setup of Tamura (2018), nothing is known about the behaviour of other policies. Are there non-linear policies? If yes, how much information do they reveal?
Here, we establish a surprising result linking the number of positive eigenvalues of the Hessian D_{aa}W with the Hausdorff dimension of the support Ξ of any (strong or weak) optimal policy. Recall that the d-dimensional Hausdorff measure H^d of a subset S ⊂ R^M is defined as

H^d(S) = lim_{r→0} inf { Σ_i r_i^d : there is a cover of S by balls with radii 0 < r_i < r },

and the Hausdorff dimension dim_H(S) is defined as

dim_H(S) = inf { d ≥ 0 : H^d(S) = 0 }.

It is known that the Hausdorff dimension coincides with the "natural" definition of dimension for sufficiently regular sets: e.g., dim_H(S) = d for a smooth, d-dimensional manifold S. However, in general, we do not have any strong regularity results for the behaviour of the support Ξ of an optimal policy a(ω). Hence, proving that Ξ is a smooth manifold seems in general out of reach, and Ξ may potentially be highly irregular. For irregular sets, the Hausdorff dimension may behave in a very complex fashion and may even take fractional values, as for fractals. The following is true.

Theorem 26
Let X be a Borel set, X ⊂ R^M. Suppose that either D_{aa}W is a constant matrix, or that D_{aa}W(a) is continuous in a and is non-degenerate for all a ∈ X except for a countable set of points. Let ν(a) be the number of nonnegative eigenvalues of D_{aa}W(a). Then, for any optimal policy a(ω), we have

dim_H(Supp(a) ∩ X) ≤ sup_{a ∈ X} ν(a).

Alternatively, one can impose bounds on the Hausdorff dimension of the set {a : det(D_{aa}W(a)) = 0}.

Examples
In this section, we investigate several concrete examples illustrating applications of Theorem 23. All our examples will be based on the following technical result, which is a direct consequence of Theorem 23.
Proposition 27
Let F be a bijective, bi-Lipschitz map, F : X → Ω, for some open set X ⊂ R^L. Let also M_1 ≤ M and x = (x_1, x_2), with x_1 ∈ X_1, the projection of X onto R^{M_1}, and x_2 ∈ X_2, the projection of X onto R^{L−M_1}. Define

f(x_1) = f(x_1; F) ≡ ( ∫_{X_2} |det(D_x F(x_1, x_2))| µ(F(x_1, x_2)) g(F(x_1, x_2)) dx_2 ) / ( ∫_{X_2} |det(D_x F(x_1, x_2))| µ(F(x_1, x_2)) dx_2 ).

Suppose that f is an injective map, f : X_1 → R^M, and define

φ(b) = min_{y ∈ X_1} { W(b) − W(f(y)) + D_a W(f(y))^⊤ (f(y) − b) }.   (14)

Suppose also that the min in (14) for b = g(F(x_1, x_2)) is attained at y = x_1 and that φ(b) ≤ 0 for all b ∈ conv(g(Ω)). Then, a(ω) = f((F^{−1}(ω))_1) is a strong optimal policy. If x_1 = arg min in (14) with b = g(F(x_1, x_2)) for all x_1, x_2, and φ(b) < 0 for all b ∈ conv(g(Ω)) \ Ξ with Ξ = f(X_1), then the optimal policy is unique.

If f is Lipschitz-continuous and the minimum in (14) is attained at an interior point, we get a system of second-order partial integro-differential equations for the map F:

D_{x_1} f(x_1)^⊤ D_{aa}W(f(x_1)) (f(x_1) − g(F(x_1, x_2))) = 0.

We start with the simplest setting: a quadratic problem with W(a) = a^⊤Ha.
In this case, Theorem 23 implies that, for any optimal policy, Ξ has to be W-monotone, meaning that (a_1 − a_2)^⊤ H (a_1 − a_2) ≥ 0 for all a_1, a_2 ∈ Ξ. The question we ask is: under what conditions is a(ω) = Aω, with some matrix A of rank M_1 ≤ M, optimal with g(ω) = ω? (Recall that a map F is bi-Lipschitz if both F and F^{−1} are Lipschitz continuous.) Clearly, it is necessary that µ have linear conditional expectations, E[ω | Aω] = Aω. But then, since E[Aω | Aω] = Aω, we must have A² = A, so that A is necessarily a projection. Maximal monotonicity implies that Q = A^⊤HA is positive semi-definite, and

φ(b) = min_ω ( b^⊤Hb − ω^⊤A^⊤HAω + 2ω^⊤A^⊤H(Aω − b) ) = b^⊤( H − HAQ^{−}A^⊤H ) b,

with the minimizer ω = Q^{−}A^⊤Hb. Thus, A satisfies the fixed point equation A = Q^{−}A^⊤H and hence A^⊤ = HAQ^{−}. Furthermore, maximality of Ξ implies that H − HAQ^{−}A^⊤H = (Id − A^⊤)H is negative semi-definite. As a result, (Id − A^⊤)H(Id − A) is also negative semi-definite, implying that A and Id − A "perfectly split" the positive and negative eigenvalues of H. Here, it is instructive to make two observations: first, optimality requires that a(ω) "lives" on positive eigenvalues of H; second, maximality (the fact that φ(b) ≤ 0 for all b) requires that A absorbs all positive eigenvalues, justifying the term "maximal". By direct calculation, we obtain the following extension of the result of Tamura (2018) (uniqueness follows from Proposition 24).

Corollary 28
Suppose that W(a) = a^⊤Ha and that ω has an elliptical distribution with a density µ(ω) = µ*(ω^⊤Σ^{−1}ω) for some µ*. Let V = Σ^{1/2} H Σ^{1/2}. Define Q_+ = [q_1, ..., q_r] as the matrix consisting of the eigenvectors q_1, ..., q_r associated with all the positive eigenvalues of V. Then,

a(ω) = Σ^{1/2} Q_+ (Q_+^⊤ Q_+)^{−1} Q_+^⊤ Σ^{−1/2} ω

is a strong optimal policy. Furthermore, if det(H) ≠ 0, then the optimal policy is unique. In particular, there are no non-linear optimal policies. (Recall that Q^{−} denotes the Moore-Penrose inverse; linear conditional expectations hold, e.g., for all elliptical distributions, but also for many other distributions, see Wei et al. (1999).)

Consider now a non-linear version of this problem. Suppose that W(a) = ϕ(‖a‖²) for some smooth function ϕ. Let F correspond to multi-dimensional spherical coordinates: ω = F(r, θ) = r x(θ), where θ are the angular coordinates on the unit sphere and r = ‖ω‖. Then, with x_1 = θ, x_2 = r, we have

( ∫_{X_2} |det(D_x F(x_1, x_2))| µ(F(x_1, x_2)) g(F(x_1, x_2)) dx_2 ) / ( ∫_{X_2} |det(D_x F(x_1, x_2))| µ(F(x_1, x_2)) dx_2 ) = ( ∫_0^∞ r^{L−1} µ(r x(θ)) g(r x(θ)) dr ) / ( ∫_0^∞ r^{L−1} µ(r x(θ)) dr ).

Suppose that g and µ are such that

( ∫_0^∞ r^{L−1} µ(r x(θ)) g(r x(θ)) dr ) / ( ∫_0^∞ r^{L−1} µ(r x(θ)) dr ) = α x(θ).

For example, this is the case when g(ω) = ω ψ(‖ω‖) for some function ψ ≥ 0 and µ(ω) = µ*(‖ω‖) is spherically symmetric (a special case of an elliptical distribution). Thus, we must have a(ω) = αω/‖ω‖. In this case, Ξ is the sphere ‖a‖ = α, and we have

φ(b) = min_{a ∈ Ξ} ( ϕ(‖b‖²) − ϕ(α²) + 2ϕ′(α²) a^⊤(a − b) ) = ϕ(‖b‖²) − ϕ(α²) + 2ϕ′(α²) α(α − ‖b‖),

and the minimizer is a = αb/‖b‖. Thus, we get the fixed point equation αω/‖ω‖ = a(ω) = αg(ω)/‖g(ω)‖ = αω/‖ω‖. Maximality is achieved when φ(b) is non-positive for all ‖b‖ ≤ max_{x ≥ 0}(xψ(x)). Thus, we arrive at the following result.
Corollary 29
Suppose that g(ω) = ω ψ(‖ω‖) for some function ψ ≥ 0, µ(ω) = µ*(‖ω‖), and W(a) = ϕ(‖a‖²), and let

α ≡ ( ∫_0^∞ r^L µ*(r) ψ(r) dr ) / ( ∫_0^∞ r^{L−1} µ*(r) dr ).

If

max_{‖b‖ ≤ max_{x ≥ 0}(xψ(x))} ( ϕ(‖b‖²) − ϕ(α²) + 2ϕ′(α²) α(α − ‖b‖) ) ≤ 0,   (15)

then a(ω) = αω/‖ω‖ is a strong optimal policy, and the optimal policy is unique if the maximum in (15) is attained only when ‖b‖ = α.

Consider now a more complex example, where R^L = R^{L_1} ⊕ R^{L_2} and W(a) = ϕ(‖a_1‖², a_2), where ϕ(y_1, y_2) is monotone increasing in y_1 and satisfies ϕ(y_1, 0) ≥ ϕ(y_1, y_2) for all y_2. Let Ξ ⊂ R^{L_1} × {0} be the sphere ‖a_1‖ = α_1, whose Hausdorff dimension is L_1 − 1. Then,

φ(b) = min_{a ∈ Ξ} ( ϕ(‖b_1‖², b_2) − ϕ(α_1², 0) + 2ϕ_{y_1}(α_1², 0) a_1^⊤(a_1 − b_1) ) = ϕ(‖b_1‖², b_2) − ϕ(α_1², 0) + 2ϕ_{y_1}(α_1², 0) α_1(α_1 − ‖b_1‖).

Furthermore, b ∈ conv(g(Ω)) implies ‖b_1‖ ≤ max_{ω ∈ Ω} ‖ω_1‖ ψ(‖ω_1‖, ω_2). Thus, we arrive at the following result.
Corollary 30
Suppose that g(ω) = ω ψ(‖ω_1‖, ω_2) for some function ψ ≥ 0 and that µ(ω) = µ*(‖ω_1‖, ω_2) is such that ψ(‖ω_1‖, ω_2) µ*(‖ω_1‖, ω_2) is even in each coordinate of ω_2. Let also W(a) = ϕ(‖a_1‖², a_2) with ϕ_{y_1}(α_1², 0) > 0 and ϕ(y_1, y_2) ≤ ϕ(y_1, 0) for all y_1, y_2. Define

α_1 ≡ ( ∫_0^∞ r^{L_1} ∫_{R^{L_2}} µ*(r, ω_2) ψ(r, ω_2) dω_2 dr ) / ( ∫_0^∞ r^{L_1−1} ∫_{R^{L_2}} µ*(r, ω_2) dω_2 dr ).

If

max_{‖b_1‖ ≤ max_{ω ∈ Ω} ‖ω_1‖ψ(‖ω_1‖, ω_2)} ( ϕ(‖b_1‖², 0) − ϕ(α_1², 0) + 2ϕ_{y_1}(α_1², 0) α_1(α_1 − ‖b_1‖) ) ≤ 0,   (16)

then a(ω) = (α_1 ω_1/‖ω_1‖, 0) is a strong optimal policy. The optimal policy is unique if ϕ(y_1, y_2) < ϕ(y_1, 0) for all y_2 ≠ 0 and the maximum in (16) is attained only when ‖b_1‖ = α_1.

It is straightforward to extend this analysis to the more general setup of Corollary 28, with W(a) = ϕ(a^⊤Ha) for an increasing ϕ and µ(ω) = µ*(ω^⊤Σ^{−1}ω), in which case Ξ will be an ellipsoid whose Hausdorff dimension equals the number of positive eigenvalues of H minus one (see Theorem 26). It is also interesting to link these results to those of Dworczak and Martini (2019). Condition (16) means that the graph of the function ϕ(x) lies below its tangent at x = α. When this condition is violated, one can consider the affine closure of ϕ(x), as in Dworczak and Martini (2019). In this case, the tangent will touch the graph of ϕ(x) at several points r_i, and the optimal policy will be to project ω onto one of the spheres ‖ω‖ = r_i. It is then possible to extend the beautiful results of Dworczak and Martini (2019) to this nonlinear setting.

We complete this section with an example where ω takes values on a real analytic manifold in R^M. Namely, suppose that the sender observes the realization ω = (ω_1, ..., ω_M) of the probabilities of some states of the world, with Σ_{i=1}^M ω_i = 1. Note that in this case ω lives on the unit simplex, which is a real analytic manifold in R^M, but all our results directly apply in this setting as long as the prior is absolutely continuous with respect to the Lebesgue measure restricted to the unit simplex. We assume that µ(ω) is given by the Dirichlet distribution on the unit simplex,

µ(ω; α) = (1/B(α)) Π_{i=1}^M ω_i^{α_i − 1},   B(α) = ( Π_{i=1}^M Γ(α_i) ) / Γ( Σ_{i=1}^M α_i ).
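The Dirichlet prior has a convenient aggregation property: grouping the coordinates, the vector of group totals is itself Dirichlet (Beta, for two groups) with the summed parameters, independently of the normalized within-group proportions. A quick Monte Carlo check (the parameters below are hypothetical, chosen only for illustration):

```python
import random
random.seed(1)

# Dirichlet(alpha) sample via normalized independent Gammas.
def dirichlet(alpha):
    g = [random.gammavariate(a, 1.0) for a in alpha]
    s = sum(g)
    return [x / s for x in g]

alpha1, alpha2 = [2.0, 3.0], [1.0, 4.0]   # hypothetical group parameters
alpha = alpha1 + alpha2

n = 40_000
mean_total1 = sum(sum(dirichlet(alpha)[:2]) for _ in range(n)) / n

# Aggregation: the group-1 total is Beta(sum(alpha1), sum(alpha2)),
# so its mean is sum(alpha1) / sum(alpha) = 5/10 = 0.5.
assert abs(mean_total1 - 0.5) < 0.01
```

This split of the state into group totals and within-group proportions is what makes the group structure of the welfare function below tractable.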
We will also be assuming a specific functional form of the social welfare function, one that depends only on the total probabilities of certain groups of states as well as on the relative entropy of the corresponding distributions. It is easy to micro-found such a welfare function in a setting with limited attention; see, e.g., Gabaix (2019).

Corollary 31

Suppose that µ is the Dirichlet distribution on the unit simplex Δ_M with parameters α = (ᾱ_1, ᾱ_2) and ω = (ω̄_1, ω̄_2), with ᾱ_1, ω̄_1 ∈ R^{M_1}_+ and ᾱ_2, ω̄_2 ∈ R^{M_2}_+. Let

g(ω) = ( ψ_1(1^⊤ω̄_1) ω̄_1, ψ_2(1^⊤ω̄_2) ω̄_2 )

for some functions ψ_i ≥ 0, i = 1, 2, and

W(a) = Σ_{i=1}^{2} ( q_i E_i(ā_i) + ϕ_i(1^⊤ā_i) ),

where a = (ā_1, ā_2), with E_i(a) = Σ_j a(j) log(a(j)/y_i(j)) being the relative entropy, y_i ∈ R^{M_i}_+, i = 1, 2, arbitrary vectors, q_i > 0, and ϕ_i arbitrary smooth functions. Define

γ_i = E^µ[ ψ_i(1^⊤ω̄_i) 1^⊤ω̄_i ], i = 1, 2, and a(ω) = ( γ_1 ω̄_1/(1^⊤ω̄_1), γ_2 ω̄_2/(1^⊤ω̄_2) ).

Suppose that

max_{0 ≤ b̄_i ≤ max_{x ∈ [0,1]} ψ_i(x)x} ( ϕ_i(b̄_i) − ϕ_i(γ_i) + ϕ′_i(γ_i)(γ_i − b̄_i) − q_i( b̄_i log(γ_i/b̄_i) − γ_i + b̄_i ) ) ≤ 0, i = 1, 2.   (17)

Then, a(ω) is an optimal policy. If for each i = 1, 2 the maximum in (17) is attained only when b̄_i = γ_i, then the optimal policy is unique. (Uniqueness follows from Proposition 24.)

As in our discussion following Corollary 30, it is possible to show that, when (17) is violated, the optimal policy will be to project ω̄_i onto one of multiple ℓ_1-spheres defined by Σ_j ā_i(j) = γ, for several values of γ corresponding to points where the affine closure touches the graph of the function in (17).

The Small Uncertainty Limit

Theorem 26 implies that D_{aa}W, the Hessian of W, is a key determinant of the structure of optimal policies. Is there an analog of D_{aa}W for the more general setting of Theorem 14? What determines the natural convex and concave components of the problem?
We do not have a complete answer to these questions. However, as we show in this section, something can be said in the case when the uncertainty is small.

The structure of the optimal partition (Theorem 9) can be complex and non-linear. One may ask whether it is possible to "linearize" these partitions, just as one can linearize equilibria in complex, non-linear economic models, assuming the deviations from the steady state are small. As we show below, this is indeed possible. Everywhere in this section, we make the following assumption.
Assumption 3
There exists a small parameter ε such that the functions defining the equilibrium conditions, G, and the welfare function, W, are given by G(a, εω) and W(a, εω).

The parameter ε has two interpretations. First, it could mean small deviations from a steady state (as is common in the literature on log-linear approximations). Second, ε could be interpreted as capturing the sensitivity of economic quantities to changes in ω. In the context of policy communication, one could think about the sender trying to stabilize the economy around the steady state by steering public expectations towards its desired equilibrium. In the limit when ε = 0, the equilibrium does not depend on shocks to ω. We use a_0 = a*(0) to denote this "steady state" equilibrium. By definition, it is given by the unique solution to the system G(a_0, 0) = 0, and the corresponding social welfare is W(a_0, 0). (In general, the boundaries of the sets Ω_k might be represented by complicated hyper-surfaces, and some of the Ω_k might even feature multiple disconnected components.)

Assumption 4 (The information relevance matrix)
We assume that the matrix D(0), with

D(ω) ≡ D_{ωω}( W(a*(ω), ω) ) − W_{ωω}(a*(ω), ω),

is non-degenerate. We refer to D as the information relevance matrix.

We are now ready to state the main result of this section, showing how the optimal linearized partition can be characterized explicitly in terms of the information relevance matrix D.

Theorem 32 (Linearized partition)
Under the hypothesis of Theorem 8 and Assumptions 3 and 4, let {Ω_k(ε)}_{k=1}^K be the corresponding optimal partition. Then, for any sequence ε_l → 0, l > 0, there exists a sub-sequence ε_{l_j}, j > 0, such that the optimal partition {Ω_k(ε_{l_j})}_{k=1}^K converges to an almost sure partition {Ω̃*_k}_{k=1}^K satisfying

Ω̃*_k = { ω ∈ Ω : (M(k) − M(l))^⊤ D(0) ω > (1/2)( M(k)^⊤ D(0) M(k) − M(l)^⊤ D(0) M(l) ) ∀ l ≠ k },

where we have defined M(k) ≡ E[ω | Ω̃*_k]. In particular, for this limiting partition, each set Ω̃*_k is convex. If the matrix D from Assumption 4 is negative semi-definite, then all sets Ω̃*_k are empty except for one; that is, it is optimal to reveal no information.

Theorem 32 implies that the general problem of optimal information design converges to a quadratic moment persuasion problem when ε is small. The matrix D(0) of Assumption 4 incorporates information both about the Hessian of W and about other partial derivatives of G. The moment persuasion setting corresponds to the case when D_{ωa}G = 0. One interesting effect we observe is that, in general, non-zero partial derivatives D_{ωa}G may have a major impact on the structure of the D matrix. In particular, when ε is sufficiently small and K is sufficiently large, we are in a position to apply Theorem 26 and Corollary 28, linking the number of positive eigenvalues of D to the dimension of the support of optimal policies.

References

Arieli, Itai, Yakov Babichenko, Rann Smorodinsky, and Takuro Yamashita, "Optimal Persuasion via Bi-Pooling," Working Paper 2020.
Aumann, Robert J and Michael Maschler, Repeated Games with Incomplete Information, MIT Press, 1995.
Beiglböck, Mathias, Nicolas Juillet et al., "On a problem of optimal transport under marginal martingale constraints," Annals of Probability, 2016, (1), 42–106.

Bergemann, Dirk and Stephen Morris, "Information design, Bayesian persuasion, and Bayes correlated equilibrium," American Economic Review, 2016, (5), 586–91.

Bergemann, Dirk and Stephen Morris, "Information design: A unified perspective," Journal of Economic Literature, 2019, (1), 44–95.

Calzolari, Giacomo and Alessandro Pavan, "On the optimality of privacy in sequential contracting," Journal of Economic Theory, 2006, (1).
Das, Sanmay, Emir Kamenica, and Renee Mirka, "Reducing congestion through information design," in 2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton), IEEE, 2017, pp. 1279–1284.
Dworczak, Piotr and Alessandro Pavan, "Preparing for the Worst But Hoping for the Best: Robust (Bayesian) Persuasion," Working Paper 2020.

Dworczak, Piotr and Anton Kolotilin, "The Persuasion Duality," Working Paper 2019.

Dworczak, Piotr and Giorgio Martini, "The Simple Economics of Optimal Persuasion," Journal of Political Economy, 2019.
Gabaix, Xavier, "Behavioral inattention," in Handbook of Behavioral Economics: Applications and Foundations 1, Vol. 2, Elsevier, 2019, pp. 261–343.
Gangbo, W., "Habilitation thesis," Université de Metz, available at http://people.math.gatech.edu/gangbo/publications/habilitation.pdf, 1995.

Gentzkow, Matthew and Emir Kamenica, "A Rothschild-Stiglitz approach to Bayesian persuasion," American Economic Review, 2016, (5), 597–601.
Ghoussoub, Nassif, Young-Heon Kim, Tongseok Lim et al., "Structure of optimal martingale transport plans in general dimensions," Annals of Probability, 2019, (1), 109–164.

Hopenhayn, Hugo and Maryam Saeedi, "Optimal Ratings and Market Outcomes," Technical Report, UCLA 2019.

Hugonnier, Julien, Semyon Malamud, and Eugene Trubowitz, "Endogenous Completeness of Diffusion Driven Equilibrium Markets,"
Econometrica, 2012, 1249–1270.

Kamenica, Emir, "Bayesian persuasion and information design," Annual Review of Economics, 2019, 249–272.

Kamenica, Emir and Matthew Gentzkow, "Bayesian Persuasion," American Economic Review, 2011, 2590–2615.
Kleinberg, Jon and Sendhil Mullainathan, "Simplicity Creates Inequity: Implications for Fairness, Stereotypes and Interpretability," Working Paper 2019.

Kleiner, Andreas, Benny Moldovanu, and Philipp Strack, "Extreme points and majorization: Economic applications," Available at SSRN, 2020.

Kolotilin, Anton, "Optimal information disclosure: a linear programming approach," Theoretical Economics, 2018, (2), 607–636.

Kolotilin, Anton and Alexander Wolitzky, "Assortative Information Disclosure," Working Paper 2020.

Kramkov, Dmitry and Yan Xu, "An optimal transport problem with backward martingale constraints motivated by insider trading," arXiv preprint arXiv:1906.03309, 2019.
Levin, Vladimir , “Abstract cyclical monotonicity and Monge solutions for the generalMonge–Kantorovich problem,”
Set-Valued Analysis , 1999, (1), 7–32. Mattila, Pertti , Geometry of sets and measures in Euclidean spaces: fractals and rectifia-bility number 44, Cambridge university press, 1999.37 cCann, Robert J and Nestor Guillen , “Five lectures on optimal transportation:geometry, regularity and applications,”
Analysis and geometry of metric measure spaces:lecture notes of the s´eminaire de Math´ematiques Sup´erieure (SMS) Montr´eal , 2011,pp. 145–180.
Mensch, Jeffrey, "Monotone Persuasion," Manuscript, 2018.

Ostrovsky, Michael and Michael Schwarz, "Information disclosure and unraveling in matching markets," American Economic Journal: Microeconomics, 2010, (2).

Rayo, Luis and Ilya Segal, "Optimal Information Disclosure," Journal of Political Economy, 2010, 949–987.
Rochet, Jean-Charles, "A necessary and sufficient condition for rationalizability in a quasi-linear context," Journal of Mathematical Economics, 1987, (2), 191–200.

Rochet, Jean-Charles and Jean-Luc Vila, "Insider trading without normality," The Review of Economic Studies, 1994, (1), 131–152.

Rockafellar, R. Tyrrell, Convex Analysis, Vol. 36, Princeton University Press, 1970.

Smith, Cyril and Martin Knott, "On Hoeffding-Fréchet bounds and cyclic monotone relations," Journal of Multivariate Analysis, 1992, (2), 328–334.

Tamura, Wataru, "Bayesian persuasion with quadratic preferences," Available at SSRN 1987877, 2018.
Taneva, Ina, "Information Design," Edinburgh School of Economics Discussion Paper Series, University of Edinburgh, 2015.
Wei, K.C. John, Cheng F. Lee, and Alice C. Lee, "Linear conditional expectation, return distributions, and capital asset pricing theories," Journal of Financial Research, 1999, 22(4), 471–487.

Internet Appendix

A Finite Partitions: Proofs
Proof of Theorem 8. Existence of an optimal information design follows trivially from compactness. Indeed, the constraints π_k(ω) ∈ [0, 1] and Σ_k π_k = 1 are trivially preserved in the limit. Continuity of social welfare in π_k follows directly from the assumed integrability and regularity; hence the existence of an optimal design.

The equilibrium conditions can be rewritten as E^{µ_s}[G(a(s), ω) | s] = 0. Here,

µ_s(ω) = π(s | ω) µ(ω) / ∫ π(s | ω) µ(ω) dω,

and hence

E^{µ_k}[G(a(s), ω)] = ( ∫ π(k | ω) µ(ω) G(a(k), ω) dω ) / ( ∫ π(k | ω) µ(ω) dω ).

By assumption, the equilibrium a depends continuously on {π_k}. Since the map

({π_k}, {a_k}) → { ∫ π_k(ω) µ(ω) G(a(k, ε), ω) dω }

is real analytic and has a non-degenerate Jacobian with respect to a, the assumed continuity of a and the implicit function theorem imply that a is in fact real analytic in {π_k}. To compute the Frechet differentials of a(s), we take a small perturbation η(ω) of π_k(ω). By the regularity assumption and the Implicit Function Theorem, a(k, ε) = a(k) + ε a^{(1)}(k) + 0.5 ε² a^{(2)}(k) + o(ε²) for some a^{(1)}(k), a^{(2)}(k). Let us rewrite

0 = ∫ (π_k(ω) + εη(ω)) µ(ω) G(a(k, ε), ω) dω
= ∫ (π_k(ω) + εη(ω)) µ(ω) G(a(k) + ε a^{(1)}(k) + 0.5 ε² a^{(2)}(k), ω) dω
≈ ∫ π_k(ω) µ(ω) ( G(a(k), ω) + G_a (ε a^{(1)}(k) + 0.5 ε² a^{(2)}(k)) + 0.5 G_{aa}(ε a^{(1)}(k), ε a^{(1)}(k)) ) dω + ε ∫ η(ω) µ(ω) ( G(a(k)) + G_a ε a^{(1)}(k) ) dω
= ε ( ∫ π_k(ω) µ(ω) G_a a^{(1)}(k) dω + ∫ η(ω) µ(ω) G(a(k)) dω )
+ 0.5 ε² ( ∫ π_k(ω) µ(ω) [ G_a a^{(2)}(k) + G_{aa}(a(k), ω)(a^{(1)}(k), a^{(1)}(k)) ] dω + 2 ∫ η(ω) µ(ω) G_a(a(k), ω) a^{(1)}(k) dω ).
As a result, we get

a^{(1)}(k) = − Ḡ_a(k)^{−1} ∫ η(ω) µ(ω) G(a(k), ω) dω,   Ḡ_a(k) = ∫ π_k(ω) µ(ω) G_a dω,

a^{(2)}(k) = − Ḡ_a(k)^{−1} ( ∫ π_k(ω) µ(ω) G_{aa}(a(k), ω)(a^{(1)}(k), a^{(1)}(k)) dω + 2 ∫ η(ω) µ(ω) G_a(a(k), ω) a^{(1)}(k) dω ).

Consider the social welfare function

W̄(π) = E[W(a(s), ω)] = Σ_k ∫_Ω W(a(k), ω) π_k(ω) µ(ω) dω.

Suppose that the optimal information structure is not a partition. Then, there exists a subset I ⊂ Ω of positive µ-measure and an index k_1 such that π_{k_1}(ω) ∈ (0, 1) for µ-almost all ω ∈ I. Since Σ_i π_i(ω) = 1 and π_i(ω) ∈ [0, 1], there must be an index k_2 ≠ k_1 and a subset I_1 ⊂ I such that π_{k_2}(ω) ∈ (0, 1) for µ-almost all ω ∈ I_1. Consider a small perturbation {π̃_i(ε)} of the information design, keeping π_i, i ≠ k_1, k_2, fixed and changing π_{k_1}(ω) → π_{k_1}(ω) + εη(ω), π_{k_2}(ω) → π_{k_2}(ω) − εη(ω), where η(ω) is an arbitrary bounded function with η(ω) = 0 for all ω ∉ I_1. Define η_{k_1}(ω) = η(ω), η_{k_2}(ω) = −η(ω), and η_i(ω) = 0 for all i ≠ k_1, k_2. A second-order Taylor expansion in ε gives

Σ_i ∫_Ω W(a(i, ε), ω)(π_i(ω) + εη_i(ω)) µ(ω) dω
≈ Σ_i ∫_Ω ( W(a(i), ω) + W_a(a(i), ω)(ε a^{(1)}(i) + 0.5 ε² a^{(2)}(i)) + 0.5 W_{aa}(a(i), ω) ε² (a^{(1)}(i), a^{(1)}(i)) ) (π_i(ω) + εη_i(ω)) µ(ω) dω
= W̄(π) + ε Σ_i ∫_Ω ( W(a(i), ω) η_i(ω) + W_a(a(i), ω) a^{(1)}(i) π_i(ω) ) µ(ω) dω
+ 0.5 ε² Σ_i ∫_Ω ( W_{aa}(a(i), ω)(a^{(1)}(i), a^{(1)}(i)) π_i(ω) + W_a(a(i), ω) a^{(2)}(i) π_i(ω) + 2 W_a(a(i), ω) a^{(1)}(i) η_i(ω) ) µ(ω) dω.   (18)

Since, by assumption, {π_i} is an optimal information design, it has to be that the first-order term in (18) is zero, while the second-order term is always non-positive. We can rewrite the first-order term as

Σ_i ∫_Ω ( W(a(i), ω) η_i(ω) + W_a(a(i), ω) a^{(1)}(i) π_i(ω) ) µ(ω) dω
= Σ_i ∫_Ω ( W(a(i), ω) − ( ∫ W_a(a(i), ω) π_i(ω) µ(ω) dω ) Ḡ_a(i)^{−1} G(a(i), ω) ) η_i(ω) µ(ω) dω,

and hence it is zero for all considered perturbations if and only if

W(a(k_1), ω) − ( ∫ W_a(a(k_1), ω) π_{k_1}(ω) µ(ω) dω ) Ḡ_a(k_1)^{−1} G(a(k_1), ω)
= W(a(k_2), ω) − ( ∫ W_a(a(k_2), ω) π_{k_2}(ω) µ(ω) dω ) Ḡ_a(k_2)^{−1} G(a(k_2), ω)   (19)

Lebesgue-almost surely for ω ∈ I_1. By Proposition 7, (19) also holds for all ω ∈ Ω. Hence, by Assumption 6, a(k_1) = a(k_2), which contradicts our assumption that all the a(k) are different. Q.E.D.

Proof of Theorem 9. Suppose a partition Ω = ∪_k Ω_k is optimal.
By regularity, equilibrium actions satisfy the first-order conditions

∫_{Ω_k} G(a(k), ω) µ(ω) dω = 0.

Consider a small perturbation whereby we move a small mass on a set I ⊂ Ω_k to Ω_l. Then, the marginal change in a(k) can be determined from

0 = ∫_{Ω_k} G(a(k), ω) µ(ω) dω − ∫_{Ω_k \ I} G(a(k, I), ω) µ(ω) dω ≈ − ∫_{Ω_k} D_a G(a(k), ω) Δa(k) µ(ω) dω + ∫_I G(a(k), ω) µ(ω) dω,

implying that the first-order change in a is given by

Δa(k) ≈ ( D̄_a G(k) )^{−1} ∫_I G(a(k), ω) µ(ω) dω.

Thus, the change in welfare is

ΔW = ∫_{Ω_k} W(a(k), ω) µ(ω) dω − ∫_{Ω_k \ I} W(a(k, I), ω) µ(ω) dω + ∫_{Ω_l} W(a(l), ω) µ(ω) dω − ∫_{Ω_l ∪ I} W(a(l, I), ω) µ(ω) dω
≈ − ∫_{Ω_k} D_a W(a(k), ω) Δa(k) µ(ω) dω + ∫_I W(a(k), ω) µ(ω) dω − ∫_{Ω_l} D_a W(a(l), ω) Δa(l) µ(ω) dω − ∫_I W(a(l), ω) µ(ω) dω
= − D̄_a W(k) ( D̄_a G(k) )^{−1} ∫_I G(a(k), ω) µ(ω) dω + ∫_I W(a(k), ω) µ(ω) dω + D̄_a W(l) ( D̄_a G(l) )^{−1} ∫_I G(a(l), ω) µ(ω) dω − ∫_I W(a(l), ω) µ(ω) dω.

This expression has to be non-negative for any I of positive Lebesgue measure. Thus,

− D̄_a W(k) ( D̄_a G(k) )^{−1} G(a(k), ω) + W(a(k), ω) + D̄_a W(l) ( D̄_a G(l) )^{−1} G(a(l), ω) − W(a(l), ω) ≥ 0 for ω ∈ Ω_k.

Q.E.D.
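To see the perturbation argument at work in the simplest possible case, here is a numerical sanity check (our own one-dimensional toy example, not from the paper): moment persuasion with ω uniform on [0, 1], g(ω) = ω, convex welfare W(a) = a², and a two-cell partition [0, t) ∪ [t, 1].

```python
# Sender's value of the two-cell partition [0, t) and [t, 1] when omega ~ U[0,1]
# and W(a) = a^2: each cell contributes P(cell) * W(E[omega | cell]).
def value(t):
    return t * (t / 2) ** 2 + (1 - t) * ((1 + t) / 2) ** 2

# Grid search: with a convex W, the optimal two-cell split is at t = 1/2
# (value(t) simplifies to (1 + t - t^2) / 4).
ts = [i / 1000 for i in range(1, 1000)]
t_star = max(ts, key=value)
assert abs(t_star - 0.5) < 1e-6

# Moving mass across the optimal boundary (the perturbation in the proof
# above) cannot increase welfare:
assert all(value(0.5) >= value(0.5 + d) for d in (-0.01, 0.01))
```

The second assertion is exactly the non-negativity requirement on ΔW from the proof, specialized to boundary perturbations of a one-dimensional threshold partition.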
Proof of Proposition 10. First, we note that $y = (y_1, y_2) \in \hat g(\Omega_k \cap X)$, where $y_1 \in \mathbb{R}^L$, if and only if
\[
W(a(k)) - x_k^\top (a(k) - y) = \max_{1 \le l \le K} \big( W(a(l)) - x_l^\top (a(l) - y) \big)
\]
and $y \in \hat g(X)$. (Note that $D_a W$ is a horizontal (row) vector.) Both sets are convex and hence so is their intersection. To show monotonicity of $D_a W(a(\hat g^{-1}(y)))$, pick $y, z$ such that $y,\, y + z \in \hat g(\Omega_k \cap X)$. By convexity, $y + tz \in \hat g(\Omega_k \cap X)$ for all $t \in [0, 1]$. Our goal is to show that
\[
\big( D_a W(a(\hat g^{-1}(y+z))) - D_a W(a(\hat g^{-1}(y))) \big)\, z \;\ge\; 0.
\]
Since $a$ is constant inside each $\Omega_k$, it suffices to show this inequality when $y$ and $y + z$ are infinitesimally close to the boundary between two regions, $\Omega_{k_1}$ and $\Omega_{k_2}$. Let $y$ belong to that boundary and $y + \varepsilon z \in \hat g(\Omega_{k_2})$. Then,
\[
W(a(k_2)) - D_a W(a(k_2))\,\big(a(k_2) - (y + \varepsilon z)\big) \;\ge\; W(a(k_1)) - D_a W(a(k_1))\,\big(a(k_1) - (y + \varepsilon z)\big)
\]
and
\[
W(a(k_2)) - D_a W(a(k_2))\,(a(k_2) - y) = W(a(k_1)) - D_a W(a(k_1))\,(a(k_1) - y).
\]
Subtracting, we get the required monotonicity. Q.E.D.

B Finite Partitions: The Small Uncertainty Limit
When the policy-maker sends signal $k$, the public learns that $\omega \in \Omega_k$. As a result, the public's posterior estimate of the conditional mean of $\omega$ is given by
\[
M(\Omega_k) \equiv E[\omega \mid \omega \in \Omega_k] = \frac{\int_{\Omega_k} \omega\,\mu(\omega)\,d\omega}{P_k} \in \mathbb{R}^L, \qquad P_k = P(k) = \int_{\Omega_k} \mu(\omega)\,d\omega.
\]
Define
\[
\mathcal{G} \equiv (D_a G(a_0, 0))^{-1}\, D_\omega G(a_0, 0) \;\in\; \mathbb{R}^{M \times L}.
\]
The following lemma follows by direct calculation.
Lemma 33
For any sequence $\varepsilon_\nu \to 0$, $\nu \in \mathbb{Z}_+$, there exists a sub-sequence $\varepsilon_{\nu_j}$, $j \ge 1$, such that the optimal partitions $\{\Omega_k(\varepsilon_{\nu_j})\}_{k=1}^K$ converge to a limiting partition $\{\Omega_k(0)\}_{k=1}^K$ as $j \to \infty$. In this limit,
\[
a_k(\varepsilon_{\nu_j}) = a_0 - \varepsilon_{\nu_j}\,\mathcal{G}\, M(\Omega_k(0)) + o(\varepsilon_{\nu_j}).
\]

Lemma 33 provides an intuitive explanation for the role of the matrix $\mathcal{G}$. Namely, in the linear approximation, the public action is given by a linear transformation of $E[\omega \mid k]$, the first moment of $\omega$ given the policy announcement: $a_k \approx a_0 - \varepsilon\,\mathcal{G}\, E[\omega \mid k]$. Thus, the matrix $-\mathcal{G}$ captures how strongly public actions respond to changes in beliefs.

Proof of Lemma 33. Trivially, the set of partitions is compact, and hence we can find a subsequence $\{\Omega_k(\varepsilon_j)\}$ converging to some partition $\{\Omega_k(0)\}$ in the sense that their indicator functions converge in $L^1$. We have
\[
0 = \int_{\Omega_k(\varepsilon)} G(a(k,\varepsilon), \varepsilon\omega)\,\mu(\omega)\,d\omega
= P_k(\varepsilon)\,\Big( G(a(k,\varepsilon), 0) + \varepsilon\, D_\omega G(a(k,\varepsilon), 0)\, M(\Omega_k(\varepsilon)) + O(\varepsilon^2) \Big). \tag{20}
\]
Let us show that $a(k,\varepsilon) - a(k,0) = O(\varepsilon)$. Suppose the contrary. Then there exists a sequence $\varepsilon_m \to 0$ such that $\|a(k,\varepsilon_m) - a(k,0)\|\,\varepsilon_m^{-1} \to \infty$. We have
\[
G(a(k,\varepsilon), 0) - G(a(k,0), 0) = \int_0^1 D_a G\big(a(k,0) + t\,(a(k,\varepsilon) - a(k,0)),\, 0\big)\,\big(a(k,\varepsilon) - a(k,0)\big)\,dt,
\]
so that
\[
\big\| G(a(k,\varepsilon), 0) - G(a(k,0), 0) \big\| \;\ge\; c\,\|a(k,\varepsilon) - a(k,0)\| \tag{21}
\]
for some $c > 0$, because $D_a G(0) = D_a G(a(k,0), 0)$ is non-degenerate. Dividing (20) by $\varepsilon$, we get a contradiction. Define
\[
a^{(1)}(k) \equiv -D_a G(0)^{-1}\, D_\omega G(a(k), 0)\, M(\Omega_k(0)) = -\mathcal{G}\, M(\Omega_k(0)).
\]
Let us now show that $a(k,\varepsilon) - a(k,0) = \varepsilon\, a^{(1)}(k) + o(\varepsilon)$. Suppose the contrary. Then,
\[
\big\| \varepsilon^{-1}\big(a(k,\varepsilon) - a(k,0)\big) - a^{(1)}(k) \big\| > c_1
\]
for some $c_1 > 0$ along a sequence $\varepsilon \to 0$. By (21),
\[
0 = \int_{\Omega_k(\varepsilon)} G(a(k,\varepsilon), \varepsilon\omega)\,\mu(\omega)\,d\omega
= P_k(\varepsilon)\,\Big( \varepsilon\, D_a G(0)\, \varepsilon^{-1}\big(a(k,\varepsilon) - a(k,0)\big) + \varepsilon\, D_\omega G(a(k), 0)\, M(\Omega_k(\varepsilon)) + O(\varepsilon^2) \Big),
\]
and we get a contradiction taking the limit as $\varepsilon \to 0$. Q.E.D.

Proof of Theorem 32. We have
\[
\Omega_k(\varepsilon) = \Big\{ \omega \in \Omega :\; -\bar D_a W(k,\varepsilon)\,(\bar D_a G(k,\varepsilon))^{-1}\, G(a(k,\varepsilon), \varepsilon\omega) + W(a(k,\varepsilon), \varepsilon\omega)
> -\bar D_a W(l,\varepsilon)\,(\bar D_a G(l,\varepsilon))^{-1}\, G(a(l,\varepsilon), \varepsilon\omega) + W(a(l,\varepsilon), \varepsilon\omega) \;\; \forall\, l \ne k \Big\}. \tag{22}
\]
The proof of the theorem is based on the following technical lemma.

Lemma 34
We have
\[
-\bar D_a W(k,\varepsilon)\,(\bar D_a G(k,\varepsilon))^{-1}\, G(a(k,\varepsilon), \varepsilon\omega) + W(a(k,\varepsilon), \varepsilon\omega)
= W(0) - \varepsilon^2\, M(k)^\top \mathcal{D}\,\mathcal{G}\,\omega + 0.5\,\varepsilon^2\, M(k)^\top \mathcal{D}\,\mathcal{G}\, M(k) + \varepsilon\, W_\omega(0)\,\omega + 0.5\,\varepsilon^2\, \omega^\top W_{\omega\omega}(0)\,\omega
- D_a W(0)\, D_a G(0)^{-1}\big( \varepsilon\, D_\omega G(0)\,\omega + 0.5\,\varepsilon^2\, \omega^\top G_{\omega\omega}(0)\,\omega \big) + o(\varepsilon^2),
\]
where the matrix $\mathcal{D} \in \mathbb{R}^{L \times M}$ is defined in the proof below.

Proof. We have
\[
\bar D_a W(k,\varepsilon) = \int_{\Omega_k(\varepsilon)} D_a W(a(k,\varepsilon), \varepsilon\omega)\,\mu(\omega)\,d\omega
= \int_{\Omega_k(\varepsilon)} \big( D_a W(0) + \varepsilon\,\omega^\top D_\omega W(0)^\top + \varepsilon\, a^{(1)}(k)^\top D_{aa} W(0) + o(\varepsilon) \big)\,\mu(\omega)\,d\omega
\]
\[
= \big( D_a W(0) + \varepsilon\,(a^{(1)}(k))^\top D_{aa} W(0) + \varepsilon\, M(\Omega_k(0))^\top (D_\omega W(0))^\top \big)\, P_k(\varepsilon) + o(\varepsilon) \;\in\; \mathbb{R}^{1 \times M}.
\]
At the same time, an analogous calculation implies that
\[
\bar D_a G(k,\varepsilon) = \big( D_a G(0) + \varepsilon\,(a^{(1)}(k))^\top D_{aa} G(0) + \varepsilon\, D_\omega G(0)\, M(\Omega_k(0)) \big)\, P_k(\varepsilon) + o(\varepsilon).
\]
Here, $D_a G(0) = (\partial G_i / \partial a_j)$ and
\[
\big( D_\omega G(0)\, M(\Omega_k(0)) \big)_{i,j} = \sum_k \frac{\partial^2 G_i}{\partial a_j\, \partial \omega_k}\, M_k, \qquad
\big( (a^{(1)}(k))^\top D_{aa} G(0) \big)_{i,j} = \sum_l (a^{(1)}(k))_l\, \frac{\partial^2 G_l}{\partial a_i\, \partial a_j} \;\in\; \mathbb{R}^{M \times M}.
\]
Thus,
\[
P_k(\varepsilon)\,(\bar D_a G(k,\varepsilon))^{-1} = D_a G(0)^{-1} - D_a G(0)^{-1}\,\varepsilon\,\big( (a^{(1)}(k))^\top D_{aa} G(0) + D_\omega G(0)\, M(\Omega_k(0)) \big)\, D_a G(0)^{-1} + o(\varepsilon),
\]
and therefore
\[
\bar D_a W(k,\varepsilon)\,(\bar D_a G(k,\varepsilon))^{-1}
= D_a W(0)\, D_a G(0)^{-1} + \varepsilon\,\big( M^\top D_\omega W(0)^\top D_a G(0)^{-1} + (a^{(1)}(k))^\top D_{aa} W(0)\, D_a G(0)^{-1} \big)
- \varepsilon\, D_a W(0)\, D_a G(0)^{-1} \big( (a^{(1)}(k))^\top D_{aa} G(0) + D_\omega G(0)\, M(\Omega_k(0)) \big)\, D_a G(0)^{-1} + o(\varepsilon)
\]
\[
= D_a W(0)\, D_a G(0)^{-1} + \varepsilon\,\big( M^\top D_\omega W(0)^\top D_a G(0)^{-1} - M^\top \mathcal{G}^\top D_{aa} W(0)\, D_a G(0)^{-1} \big)
- \varepsilon\, D_a W(0)\, D_a G(0)^{-1} \big( -M^\top \mathcal{G}^\top D_{aa} G(0) + D_\omega G(0)\, M \big)\, D_a G(0)^{-1} + o(\varepsilon)
= D_a W(0)\, D_a G(0)^{-1} + \varepsilon\,\Gamma + o(\varepsilon),
\]
where
\[
\Gamma = M^\top D_\omega W(0)^\top D_a G(0)^{-1} - M^\top \mathcal{G}^\top D_{aa} W(0)\, D_a G(0)^{-1}
- D_a W(0)\, D_a G(0)^{-1} \big( -M^\top \mathcal{G}^\top D_{aa} G(0) + D_\omega G(0)\, M \big)\, D_a G(0)^{-1}.
\]
Define $\tilde a^{(1)}(k,\varepsilon) \equiv \varepsilon^{-1}(a(k,\varepsilon) - a(k,0)) = a^{(1)}(k) + o(1)$. Let also
\[
G^{(2)}(k) \equiv 0.5\,\big( a^{(1)}(k)^\top D_{aa} G(0)\, a^{(1)}(k) + 2\,\omega^\top D_\omega G(0)\, a^{(1)}(k) + \omega^\top G_{\omega\omega}(0)\,\omega \big),
\]
so that
\[
G(a(k,\varepsilon), \varepsilon\omega) - \big( \varepsilon\, D_a G(0)\,\tilde a^{(1)}(k,\varepsilon) + \varepsilon\, D_\omega G(0)\,\omega \big) = \varepsilon^2\, G^{(2)}(k) + o(\varepsilon^2),
\]
where we have used that $G(0) = 0$. While we cannot prove that $\varepsilon\,(\tilde a^{(1)}(k,\varepsilon) - a^{(1)}(k)) = o(\varepsilon^2)$, we show that this term cancels out. We have
\[
-\bar D_a W(k,\varepsilon)\,(\bar D_a G(k,\varepsilon))^{-1}\, G(a(k,\varepsilon), \varepsilon\omega) + W(a(k,\varepsilon), \varepsilon\omega)
\]
\[
\approx -\bar D_a W(k,\varepsilon)\,(\bar D_a G(k,\varepsilon))^{-1}\, \big( \varepsilon\, D_a G(0)\,\tilde a^{(1)}(k,\varepsilon) + \varepsilon\, D_\omega G(0)\,\omega + \varepsilon^2\, G^{(2)}(k) + o(\varepsilon^2) \big)
+ W(0) + \varepsilon\, D_a W(0)\,\tilde a^{(1)}(k,\varepsilon) + \varepsilon\, W_\omega(0)\,\omega
+ 0.5\,\varepsilon^2\, \big( (a^{(1)}(k))^\top D_{aa} W(0)\, a^{(1)}(k) + \omega^\top W_{\omega\omega}(0)\,\omega + 2\,(a^{(1)}(k))^\top D_\omega W(0)\,\omega \big) + o(\varepsilon^2)
\]
\[
= -\big( D_a W(0)\, D_a G(0)^{-1} + \varepsilon\,\Gamma + o(\varepsilon) \big)\, \big( \varepsilon\, D_a G(0)\,\tilde a^{(1)}(k,\varepsilon) + \varepsilon\, D_\omega G(0)\,\omega + \varepsilon^2\, G^{(2)}(k) + o(\varepsilon^2) \big)
+ W(0) + \varepsilon\, D_a W(0)\,\tilde a^{(1)}(k,\varepsilon) + \varepsilon\, W_\omega(0)\,\omega
+ 0.5\,\varepsilon^2\, \big( (a^{(1)}(k))^\top D_{aa} W(0)\, a^{(1)}(k) + \omega^\top W_{\omega\omega}(0)\,\omega + 2\,(a^{(1)}(k))^\top D_\omega W(0)\,\omega \big) + o(\varepsilon^2)
\]
\[
= W(0) + \varepsilon\,\Big( -D_a W(0)\, D_a G(0)^{-1} \big( D_a G(0)\,\tilde a^{(1)}(k,\varepsilon) + D_\omega G(0)\,\omega \big) + D_a W(0)\,\tilde a^{(1)}(k,\varepsilon) + W_\omega(0)\,\omega \Big)
\]
\[
+\; \varepsilon^2\,\Big( -D_a W(0)\, D_a G(0)^{-1}\, G^{(2)}(k) - \Gamma\,\big( D_a G(0)\, a^{(1)}(k) + D_\omega G(0)\,\omega \big)
+ 0.5\,\big( (a^{(1)}(k))^\top D_{aa} W(0)\, a^{(1)}(k) + \omega^\top W_{\omega\omega}(0)\,\omega + 2\, a^{(1)}(k)^\top D_\omega W(0)\,\omega \big) \Big) + o(\varepsilon^2).
\]
Thus, the terms with $\tilde a^{(1)}(k,\varepsilon)$ have cancelled out. We have
\[
\Gamma\,\big( D_a G(0)\, a^{(1)}(k) + D_\omega G(0)\,\omega \big)
= \Big( M^\top D_\omega W(0)^\top D_a G(0)^{-1} - M^\top \mathcal{G}^\top D_{aa} W(0)\, D_a G(0)^{-1}
- D_a W(0)\, D_a G(0)^{-1} \big( -M^\top \mathcal{G}^\top D_{aa} G(0) + D_\omega G(0)\, M \big)\, D_a G(0)^{-1} \Big)\, D_\omega G(0)\,(\omega - M)
\]
\[
= \Big( M^\top D_\omega W(0)^\top - M^\top \mathcal{G}^\top D_{aa} W(0)
- D_a W(0)\, D_a G(0)^{-1} \big( -M^\top \mathcal{G}^\top D_{aa} G(0) + D_\omega G(0)\, M \big) \Big)\, \mathcal{G}\,(\omega - M)
= M^\top \mathcal{D}\,\mathcal{G}\,(\omega - M),
\]
where
\[
\mathcal{D} = D_\omega W(0)^\top - D_a W(0)\, D_a G(0)^{-1}\, D_\omega G(0) - \big( \mathcal{G}^\top D_{aa} W(0) - \mathcal{G}^\top D_a W(0)\, D_a G(0)^{-1}\, D_{aa} G(0) \big) \;\in\; \mathbb{R}^{L \times M},
\]
and where the three-dimensional tensor multiplication is understood as follows:
\[
M^\top D_a W(0)\, D_a G(0)^{-1}\, D_\omega G(0) = \sum_k M_k\, D_a W(0)\, D_a G(0)^{-1}\, D_a G_{\omega_k}(0), \qquad
M^\top \mathcal{G}^\top D_a W(0)\, D_a G(0)^{-1}\, D_{aa} G(0) = \sum_k (\mathcal{G} M)_k\, D_a W(0)\, D_a G(0)^{-1}\, D_a G_{a_k}(0).
\]
Thus, the expression becomes
\[
W(0) + \varepsilon\,\big( -D_a W(0)\,\mathcal{G}\,\omega + W_\omega(0)\,\omega \big)
+ \varepsilon^2\,\Big( -D_a W(0)\, D_a G(0)^{-1}\, G^{(2)}(k) - M^\top \mathcal{D}\,\mathcal{G}\,(\omega - M)
+ 0.5\,\big( M^\top \mathcal{G}^\top D_{aa} W(0)\,\mathcal{G}\, M + \omega^\top W_{\omega\omega}(0)\,\omega - 2\,(\mathcal{G} M)^\top D_\omega W(0)\,\omega \big) \Big) + o(\varepsilon^2).
\]
Now,
\[
G^{(2)}(k) = 0.5\,\big( M^\top \mathcal{G}^\top D_{aa} G(0)\,\mathcal{G}\, M - 2\,(\mathcal{G} M)^\top D_\omega G(0)\,\omega + \omega^\top G_{\omega\omega}(0)\,\omega \big).
\]
Thus, the desired expression is given by
\[
W(0) + \varepsilon\,\big( -D_a W(0)\,\mathcal{G}\,\omega + W_\omega(0)\,\omega \big) + \varepsilon^2\,\big( 0.5\, M^\top A\, M + M^\top B\,\omega + \omega^\top C\,\omega \big),
\]
where we have defined
\[
A \equiv -D_a W(0)\, D_a G(0)^{-1}\,\mathcal{G}^\top D_{aa} G(0)\,\mathcal{G} + 2\,\mathcal{D}\,\mathcal{G} + \mathcal{G}^\top D_{aa} W(0)\,\mathcal{G}
\]
\[
= -D_a W(0)\, D_a G(0)^{-1}\,\mathcal{G}^\top D_{aa} G(0)\,\mathcal{G}
+ 2\,\Big( D_\omega W(0)^\top - D_a W(0)\, D_a G(0)^{-1}\, D_\omega G(0) - \big( \mathcal{G}^\top D_{aa} W(0) - D_a W(0)\, D_a G(0)^{-1}\,\mathcal{G}^\top D_{aa} G(0) \big) \Big)\,\mathcal{G} + \mathcal{G}^\top D_{aa} W(0)\,\mathcal{G}
\]
\[
= \mathcal{G}^\top D_a W(0)\, D_a G(0)^{-1}\, D_{aa} G(0)\,\mathcal{G} - \mathcal{G}^\top D_{aa} W(0)\,\mathcal{G} + 2\,\big( D_\omega W(0)^\top \mathcal{G} - \mathcal{G}^\top D_a W(0)\, D_a G(0)^{-1}\, D_\omega G(0) \big) \;\in\; \mathbb{R}^{L \times L},
\]
\[
B \equiv \mathcal{G}^\top D_a W(0)\, D_a G(0)^{-1}\, D_\omega G(0) - \mathcal{D}\,\mathcal{G} - \mathcal{G}^\top D_\omega W(0)
\]
\[
= \mathcal{G}^\top D_a W(0)\, D_a G(0)^{-1}\, D_a G_\omega^\top(0)
- \Big( \mathcal{G}^\top D_\omega W(0) - \mathcal{G}^\top D_a W(0)\, D_a G(0)^{-1}\, D_\omega G(0) - \big( \mathcal{G}^\top D_{aa} W(0) - D_a W(0)\, D_a G(0)^{-1}\,\mathcal{G}^\top D_{aa} G(0) \big)\,\mathcal{G} \Big) - D_\omega W(0)^\top \mathcal{G}.
\]
Here, the first term is given by
\[
\big( D_a W(0)\, D_a G(0)^{-1}\, D_a G_\omega^\top(0) \big)_{i,j} = \sum_k \big( D_a W(0)\, D_a G(0)^{-1} \big)_k\, \frac{\partial^2 G_k}{\partial a_i\, \partial \omega_j},
\]
and the claim follows by a direct (but tedious) calculation. Q.E.D.

The desired convergence is then a direct consequence of Lemma 34. Indeed, by compactness, we can pick a converging sub-sequence, and Lemma 34 implies that, in the limit, a point $\omega$ satisfies inequalities (22) if and only if $\omega \in \tilde\Omega^*_k$. Q.E.D.

C The Unconstrained Problem
Proof of Theorem 14. For each finite $K$, the optimal solution $(a_K(\omega), x_K(a(\omega)))$ stays uniformly bounded, and hence there exists a subsequence converging in $L^2(\Omega; \mu)$ and in probability to a limit $(a(\omega), x(a(\omega)))$. By continuity and Lemma 12, $E[W(a_K(\omega), \omega)]$ converges to the maximum in the problem of Definition 11 and, hence, by the same continuity argument, $a(\omega)$ is an optimal policy without randomization. Since (11) holds true for $a_K$ for each finite $K$, convergence in probability implies that (9) also holds in the limit. Indeed,
\[
c(a_K(q), \omega; x_K) \;\ge\; c(a_K(\omega), \omega; x_K)
\]
holds for almost all $\omega$ and all $q$ with probability one, and hence it also holds in the limit with probability one (due to convergence in probability). Clearly, for each finite $K$ the function $c(a_K(\omega), \omega; x_K)$ is smooth in each region $\Omega_k$ and is continuous at the boundaries. Since $a, x$ stay bounded, $W, G$ are smooth, and $\Omega$ is compact, the functions are uniformly Lipschitz continuous, and the Arzelà–Ascoli theorem implies that so is the limit (passing to a subsequence if necessary). Finally, to prove that
\[
E[G(a(\omega), \omega) \mid a(\omega)] = 0,
\]
it suffices to prove that
\[
E[G(a(\omega), \omega)\, f(a(\omega))] = 0
\]
for a countable dense set of test functions, which follows by passing to a subsequence.

To verify all the integrability required to apply Lebesgue dominated convergence, Assumption 2 implies that we just need to check that $E[D_a G(a,\omega) \mid a]^{-1}$ is uniformly bounded. Since, by assumption, $\|D_a G(a,\omega)\|$ is uniformly bounded, we just need to check that the eigenvalues of $E[D_a G(a,\omega) \mid a(\omega)]$ are uniformly bounded away from zero. Indeed, let
\[
\varepsilon_0 = \inf_{a,\, \omega,\, z,\, \|z\| = 1} z^\top D_a G(a,\omega)\, z > 0.
\]
If $\lambda$ is an eigenvalue of $E[D_a G(a,\omega) \mid a]$ with a normed eigenvector $z$, then
\[
\lambda = z^\top E[D_a G(a,\omega) \mid a]\, z = E[z^\top D_a G(a,\omega)\, z \mid a] \;\ge\; \varepsilon_0.
\]
To prove that (9) holds, we note that it suffices to show that
\[
E\big[ \big( x(a(\omega))^\top E[D_a G(a,\omega) \mid a(\omega)] - E[D_a W(a,\omega) \mid a(\omega)] \big)\, f(a(\omega)) \big] = 0
\]
for a countable, dense set of smooth test functions $f$. The latter is equivalent to
\[
E\big[ \big( x(a(\omega))^\top D_a G(a(\omega),\omega) - D_a W(a(\omega),\omega) \big)\, f(a(\omega)) \big] = 0,
\]
and the claim follows by continuity by passing to a subsequence. The last identity follows from (8). Finally, the fact that $a$ is Borel-measurable follows from the known fact that for any Lebesgue-measurable $a(\omega)$ there exists a Borel-measurable modification of $a$ coinciding with $a$ for Lebesgue-almost every $\omega$. Q.E.D.
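The eigenvalue bound at the end of this argument passes a pointwise lower bound on the quadratic form $z^\top D_a G\, z$ through to any average of the matrices. The sketch below illustrates this for the symmetric case; the matrices are random stand-ins, not the paper's $D_a G$.

```python
import numpy as np

# Illustration of the bound in the proof of Theorem 14: if z'Mz >= eps0*||z||^2
# for every matrix M in a family, then any average of them (a stand-in for the
# conditional expectation E[D_aG | a]) has smallest eigenvalue >= eps0.
# The matrices below are random symmetric PSD matrices shifted by eps0 * I.
rng = np.random.default_rng(0)
eps0 = 0.3
mats = []
for _ in range(50):
    B = rng.normal(size=(4, 4))
    mats.append(B @ B.T + eps0 * np.eye(4))   # z'Mz >= eps0*||z||^2 by construction

avg = np.mean(mats, axis=0)                    # "conditional expectation" stand-in
lam_min = np.linalg.eigvalsh(avg).min()        # smallest eigenvalue of the average
print(lam_min >= eps0 - 1e-9)                  # True
```

The check works because the smallest eigenvalue of a symmetric matrix is a concave function, so averaging cannot push it below the common lower bound.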
Proof of Corollary 22. The integrability condition (by the same argument as in (4)) implies that all the convergence arguments are justified. The convexity claim is then a direct consequence of Proposition 10. Q.E.D.
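The convexity in Corollary 22 reflects the structure in Proposition 10: each cell is an argmax region of finitely many affine functions of $y$, hence an intersection of half-spaces. The sketch below checks this on a toy planar example; the number of cells and the affine coefficients are random illustrative choices, not objects from the paper.

```python
import numpy as np

# Convex-cell structure behind Proposition 10 / Corollary 22: each cell
# {y : c_k + s_k·y >= c_l + s_l·y for all l} is an argmax region of affine
# functions, i.e. an intersection of half-spaces, and therefore convex.
rng = np.random.default_rng(4)
K = 6
slopes = rng.normal(size=(K, 2))    # illustrative affine coefficients
consts = rng.normal(size=K)

def cell(y):
    """Index of the affine maximizer at the point y."""
    return int(np.argmax(consts + slopes @ y))

pts = rng.uniform(-1.0, 1.0, size=(4000, 2))
labels = np.array([cell(y) for y in pts])

# Convexity check: the midpoint of two points of a cell stays in that cell.
ok = True
for k in range(K):
    ys = pts[labels == k]
    if len(ys) < 2:
        continue
    for _ in range(200):
        i, j = rng.integers(len(ys), size=2)
        vals = consts + slopes @ (0.5 * (ys[i] + ys[j]))
        if vals[k] < vals.max() - 1e-9:
            ok = False
print(ok)    # True: every argmax cell is convex
```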
Proposition 35
Let $\gamma$ be the joint distribution of $(a, \omega)$ for an optimal information design. Then,
\[
\int \big( x(a)^\top G(a,\omega) - W(a,\omega) \big)\, d\eta + \int W(\tilde a^*, \omega)\, d\eta(\mathbb{R}^M, \omega) \;\le\; 0. \tag{23}
\]
In the case of moment persuasion,
\[
\int \big( D_a W(a)\,(a - g(\omega)) - W(a) \big)\, d\eta + W\Big( \int g(\omega)\, d\eta \Big) \;\le\; 0
\]
for every $\eta \in \mathcal{P}(\mathbb{R}^M \times \Omega)$ such that $\mathrm{Supp}(\eta) \subset \mathrm{Supp}(\gamma)$.

Proof of Proposition 35. We closely follow the arguments and notation in Kramkov and Xu (2019). Let $\gamma$ be the joint distribution of the random variables $\omega$ and $a(\omega)$. We first establish (23) for a Borel probability measure $\eta \in \mathcal{P}(\mathbb{R}^M \times \Omega)$ that has a bounded density with respect to $\gamma$:
\[
V(x, y) = \frac{d\eta}{d\gamma} \in L^\infty(\mathbb{R}^M \times \Omega).
\]
We choose a non-atom $q \in \mathbb{R}^M$ of $\mu_1(da) = \gamma(da, \mathbb{R}^L)$ and define the probability measure
\[
\zeta(da, d\omega) = \delta_q(da)\,\eta(\mathbb{R}^M, d\omega),
\]
where $\delta_q$ is the Dirac measure concentrated at $q$. For sufficiently small $\varepsilon > 0$,
\[
\tilde\gamma = \gamma + \varepsilon\,(\zeta - \eta)
\]
is well-defined and has the same $\omega$-marginal $\mu(\omega)$ as $\gamma$. Let $\tilde a$ be the optimal action satisfying $\tilde\gamma(G(\tilde a, \omega) \mid \tilde a) = 0$. The optimality of $\gamma$ implies that
\[
\int W(\tilde a, \omega)\, d\tilde\gamma \;\le\; \int W(a, \omega)\, d\gamma. \tag{24}
\]
By direct calculation,
\[
0 = \tilde\gamma(G(\tilde a, \omega) \mid a)
= 1_{\{a \ne q\}}\, \frac{\int G(\tilde a, \omega)\, d\big( (\gamma \mid a) - \varepsilon\,(\eta \mid a) \big)}{\int d(\gamma - \varepsilon\eta)} + 1_{\{a = q\}} \int G(\tilde a, \omega)\, d\eta(\mathbb{R}^M, \omega)
\]
\[
= 1_{\{a \ne q\}}\, \frac{\int G(\tilde a, \omega)\, d(\gamma \mid a) - \varepsilon \int G(\tilde a, \omega)\, d(\eta \mid a)}{1 - \varepsilon\, U(a)} + 1_{\{a = q\}} \int G(\tilde a, \omega)\, d\eta(\mathbb{R}^M, \omega),
\]
where $U(a) = \gamma(V(a,\omega) \mid a)$. Now, we know that
\[
\int G(a, \omega)\, d(\gamma \mid a) = 0,
\]
and the assumed regularity of $G$ together with the implicit function theorem imply that
\[
\tilde a(a) = a + \varepsilon\, Q(a) + O(\varepsilon^2)
\]
if $a \ne q$, and $\tilde a = \tilde a^*$ for $a = q$, where $\tilde a^*$ is the unique solution to
\[
\int G(\tilde a^*, \omega)\, d\eta(\mathbb{R}^M, \omega) = 0.
\]
Here,
\[
0 = O(\varepsilon^2) + \int G(a + \varepsilon Q(a), \omega)\, d(\gamma \mid a) - \varepsilon \int G(a, \omega)\, V(a,\omega)\, d(\gamma \mid a)
= O(\varepsilon^2) + \varepsilon \int D_a G(a, \omega)\, d(\gamma \mid a)\, Q(a) - \varepsilon \int G(a, \omega)\, V(a,\omega)\, d(\gamma \mid a),
\]
so that
\[
Q(a) = \Big( \int D_a G(a, \omega)\, d(\gamma \mid a) \Big)^{-1} \int G(a, \omega)\, V(a,\omega)\, d(\gamma \mid a).
\]
Then,
\[
\int W(\tilde a(a), \omega)\, d\tilde\gamma = \int W(\tilde a(a), \omega)\,(1 - \varepsilon V(a,\omega))\, d\gamma + \varepsilon \int W(\tilde a^*, \omega)\, d\eta(\mathbb{R}^M, \omega)
\]
\[
= O(\varepsilon^2) + \int W(a, \omega)\, d\gamma + \varepsilon\,\Big( \int \big( D_a W(a,\omega)\, Q(a) - W(a,\omega)\, V(a,\omega) \big)\, d\gamma + \int W(\tilde a^*, \omega)\, d\eta(\mathbb{R}^M, \omega) \Big).
\]
In view of (24), the first-order term is non-positive:
\[
\int \big( D_a W(a,\omega)\, Q(a) - W(a,\omega)\, V(a,\omega) \big)\, d\gamma + \int W(\tilde a^*, \omega)\, d\eta(\mathbb{R}^M, \omega) \;\le\; 0.
\]
Substituting, we get
\[
\int \Big( x(a)^\top \int G(a, \omega)\, V(a,\omega)\, d(\gamma \mid a) - W(a,\omega)\, V(a,\omega) \Big)\, d\gamma + \int W(\tilde a^*, \omega)\, d\eta(\mathbb{R}^M, \omega) \;\le\; 0,
\]
which is equivalent to
\[
\int \big( x(a)^\top G(a,\omega) - W(a,\omega) \big)\, d\eta + \int W(\tilde a^*, \omega)\, d\eta(\mathbb{R}^M, \omega) \;\le\; 0.
\]
In the case of moment persuasion, $Q(a) = a\, U(a) - R(a)$, where we have defined
\[
U(a) = \gamma(V(a,\omega) \mid a), \qquad R(a) = \gamma(g(\omega)\, V(a,\omega) \mid a), \qquad \tilde a^* = \int g(\omega)\, d\eta.
\]
Thus, we get
\[
0 \;\ge\; \int \big( D_a W(a)\, Q(a) - W(a)\, V(a,\omega) \big)\, d\gamma + W(\tilde a^*)
= \int \big( D_a W(a)\,(a\, U(a) - R(a)) - W(a)\, V(a,\omega) \big)\, d\gamma + W(\tilde a^*)
= \int \big( D_a W(a)\,(a - g(\omega)) - W(a) \big)\, d\eta + W\Big( \int g(\omega)\, d\eta \Big).
\]
The general case (without the assumption of bounded density for $\eta$) follows by the same arguments as in (24). Q.E.D.

Lemma 36
Let $a^*(\omega)$ be the unique solution to $G(a^*(\omega), \omega) = 0$. Then, at any optimal $(a(\omega), x(a(\omega)))$ with $x(a(\omega)) = \bar D_a W(a(\omega))\, \bar D_a G(a(\omega))^{-1}$, we have
\[
x(a(\omega))^\top G(a(\omega), \omega) - W(a(\omega), \omega) + W(a^*(\omega), \omega) \;\le\; 0.
\]
Furthermore,
\[
\int \big( W(a_2, \omega) - W(a_1, \omega) - x(a_2)^\top G(a_2, \omega) \big)\, d(\gamma \mid a_1) \;\le\; 0
\]
for all $a_1, a_2$ in the support of $\gamma$.

Proof. The first claim follows by selecting $\eta = \delta_{(a(\omega), \omega)}$. The second one follows by selecting
\[
\eta = t\,\delta_{a_1}\,(\gamma \mid a_1) + (1 - t)\,\delta_{a_2}\,(\gamma \mid a_2),
\]
where $\Omega(a)$ is the respective fiber, the level set of $a$. In this case, we get
\[
t \int \big( x(a_1)^\top G(a_1, \omega) - W(a_1, \omega) \big)\, d\gamma(\omega \mid a_1) + (1 - t) \int \big( x(a_2)^\top G(a_2, \omega) - W(a_2, \omega) \big)\, d\gamma(\omega \mid a_2)
+ \int W(\tilde a^*, \omega)\,\big( t\, d\gamma \mid a_1 + (1 - t)\, d\gamma \mid a_2 \big) \;\le\; 0, \tag{25}
\]
where $\tilde a^*(t)$ is uniquely determined by
\[
t \int G(\tilde a^*(t), \omega)\, d(\gamma \mid a_1) + (1 - t) \int G(\tilde a^*(t), \omega)\, d(\gamma \mid a_2) = 0.
\]
Clearly, (25) is equivalent to
\[
t \int \big( W(\tilde a^*(t), \omega) - W(a_1, \omega) \big)\, d(\gamma \mid a_1) + (1 - t) \int \big( W(\tilde a^*(t), \omega) - W(a_2, \omega) \big)\, d(\gamma \mid a_2) \;\le\; 0.
\]
Assuming that $t$ is small, we get
\[
\tilde a^*(t) = a_2 + t\,\hat a + o(t), \qquad \hat a = -\bar D_a G(a_2)^{-1} \int G(a_2, \omega)\, d(\gamma \mid a_1),
\]
and hence
\[
0 \;\ge\; t \int \big( W(\tilde a^*(t), \omega) - W(a_1, \omega) \big)\, d(\gamma \mid a_1) + (1 - t) \int \big( W(\tilde a^*(t), \omega) - W(a_2, \omega) \big)\, d(\gamma \mid a_2)
\]
\[
= t \int \big( W(a_2, \omega) - W(a_1, \omega) \big)\, d(\gamma \mid a_1) + t\, \bar D_a W(a_2)\,\hat a + o(t)
= t \int \big( W(a_2, \omega) - W(a_1, \omega) \big)\, d(\gamma \mid a_1) - t\, x(a_2)^\top \int G(a_2, \omega)\, d(\gamma \mid a_1) + o(t).
\]
Q.E.D.

Corollary 37

In a moment persuasion setup, let $a(\omega)$ be an optimal information design. Then, for any two points $\omega_1, \omega_2$, we have
\[
W\big( t\, g(\omega_1) + (1 - t)\, g(\omega_2) \big) + t\,\big( D_a W(a(\omega_1))\,(a(\omega_1) - g(\omega_1)) - W(a(\omega_1)) \big)
+ (1 - t)\,\big( D_a W(a(\omega_2))\,(a(\omega_2) - g(\omega_2)) - W(a(\omega_2)) \big) \;\le\; 0.
\]
In the case of $\omega_1 = \omega_2$, we just get
\[
D_a W(a(\omega))\,(a(\omega) - g(\omega)) \;\le\; W(a(\omega)) - W(g(\omega)). \tag{26}
\]
Furthermore,
\[
W(t\, a_1 + (1 - t)\, a_2) \;\le\; t\, W(a_1) + (1 - t)\, W(a_2), \qquad a_1, a_2 \in \mathrm{Supp}(a). \tag{27}
\]
In particular, $\mathrm{Supp}(a)$ is a $W$-convex set.

Proof of Corollary 37. The first claim follows from the choice $\eta = t\,\delta_{(a(\omega_1), \omega_1)} + (1 - t)\,\delta_{(a(\omega_2), \omega_2)}$ in Proposition 35. The second one follows from Lemma 36. Monotonicity of the set follows by evaluating the inequality at $t \to 0$. Maximal monotonicity follows from (26). Q.E.D.
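Inequality (26) involves the Bregman-type cost $c(a, b) = W(b) - W(a) + D_a W(a)(a - b)$ that recurs throughout this appendix; for a convex $W$ this cost is nonnegative. A quick numerical check, with an illustrative quadratic $W$ (not the paper's welfare function), for which $c(a, b) = \|a - b\|^2$:

```python
import numpy as np

# Nonnegativity of the Bregman-type cost  c(a, b) = W(b) - W(a) + DW(a)·(a - b)
# for a convex W.  Here W(a) = ||a||^2, an illustrative choice for which
# c(a, b) reduces to ||a - b||^2 >= 0.
def W(a):
    return float(np.sum(a ** 2))

def DW(a):
    return 2.0 * a

def c(a, b):
    return W(b) - W(a) + float(DW(a) @ (a - b))

rng = np.random.default_rng(1)
vals = [c(rng.normal(size=3), rng.normal(size=3)) for _ in range(1000)]
print(min(vals) >= -1e-9)    # True: Bregman divergence of a convex W
```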
Proof of Theorem 16. Let $(a(\omega), x(a))$ be a strong optimal policy and let
\[
\varphi^c(a) = \inf_\omega \big( c(a, \omega; x) - \varphi_\Xi(\omega; x) \big).
\]
Pick an $a \in \Xi$. Since $a \in \Xi$, there exists an $\tilde\omega$ such that $a = a(\tilde\omega)$, and hence
\[
\varphi^c(a) = \inf_\omega \big( c(a, \omega; x) - \varphi_\Xi(\omega; x) \big) \;\le\; c(a, \tilde\omega; x) - \varphi_\Xi(\tilde\omega; x) = 0.
\]
At the same time,
\[
c(a, \omega; x) - \varphi_\Xi(\omega; x) = c(a, \omega; x) - \inf_{b \in \Xi} c(b, \omega; x) \;\ge\; 0.
\]
Thus, $\varphi^c(a) = 0$ for all $a \in \Xi$, and hence
\[
\int \varphi^c(a(\omega))\, \mu(\omega)\, d\omega = 0.
\]
Now, by the definition of $\varphi^c$, we always have
\[
\varphi^c(a) + \varphi_\Xi(\omega; x) \;\le\; c(a, \omega),
\]
with equality along a strong optimal policy. Let $\gamma$ be the measure on $\Xi \times \Omega$ describing the joint distribution of $\chi = a(\omega)$ and $\omega$. Then,
\[
\int c(a, \omega)\, d\gamma(a, \omega) = \int c(a(\omega), \omega)\, \mu(\omega)\, d\omega = \int \varphi_\Xi(\omega; x)\, \mu(\omega)\, d\omega.
\]
Pick any measure $\pi$ from the Kantorovich problem. Then,
\[
\int c(a, \omega)\, d\gamma(a, \omega) = \int \varphi_\Xi(\omega; x)\, d\gamma(a, \omega)
= \int \varphi_\Xi(\omega; x)\, \mu(\omega)\, d\omega + \int \varphi^c(a(\omega))\, \mu(\omega)\, d\omega
= \int \big( \varphi_\Xi(\omega; x) + \varphi^c(a(\omega)) \big)\, \mu(\omega)\, d\omega
= \int \big( \varphi_\Xi(\omega; x) + \varphi^c(a) \big)\, d\pi(a, \omega)
\;\le\; \int c(a, \omega)\, d\pi(a, \omega).
\]
Thus, $\gamma$ minimizes the cost in the Kantorovich problem. Q.E.D.

Proof of Theorem 23. The first claim follows from Corollary 37 in the Appendix. The proof of sufficiency closely follows ideas from Kramkov and Xu (2019). Let $a(\omega)$ be a policy satisfying the conditions of Theorem 23. Note that, in terms of the function $c$, our objective is to show (see (8)) that
\[
\min_{\text{all feasible policies } b(\omega)} E[c(b(\omega), g(\omega))] = E[c(a(\omega), g(\omega))].
\]
Next, we note that the assumed maximality implies that
\[
c(a(\omega), g(\omega)) = \varphi_\Xi(g(\omega)) \quad \text{for all } \omega.
\]
Now, for any feasible policy $b(\omega)$, we have $E[g(\omega) \mid b(\omega)] = b(\omega) \in \mathrm{conv}(g(\Omega))$, and therefore, for any fixed $a \in \mathbb{R}^M$, we have
\[
E[c(a, g(\omega)) - c(b(\omega), g(\omega)) \mid b(\omega)]
= E\big[ W(g(\omega)) - W(a) + D_a W(a)\,(a - g(\omega)) - \big( W(g(\omega)) - W(b(\omega)) + D_a W(b(\omega))\,(b(\omega) - g(\omega)) \big) \,\big|\, b(\omega) \big]
\]
\[
= W(b(\omega)) - W(a) + D_a W(a)\,(a - b(\omega)) = c(a, b(\omega)).
\]
Taking the infimum over a dense, countable set of $a$, we get
\[
\inf_{a \in \Xi} E[c(a, g(\omega)) - c(b(\omega), g(\omega)) \mid b(\omega)] = \inf_{a \in \Xi} c(a, b(\omega)) = \varphi_\Xi(b(\omega)),
\]
and
\[
E[c(a(\omega), g(\omega)) - c(b(\omega), g(\omega)) \mid b(\omega)] = E\big[ \inf_{a \in \Xi} c(a, g(\omega)) - c(b(\omega), g(\omega)) \,\big|\, b(\omega) \big]
\;\le\; \inf_{a \in \Xi} E[c(a, g(\omega)) - c(b(\omega), g(\omega)) \mid b(\omega)] = \varphi_\Xi(b(\omega)) \;\le\; 0, \tag{28}
\]
and therefore, integrating over $b$, we get
\[
E[c(a(\omega), g(\omega))] \;\le\; E[c(b(\omega), g(\omega))].
\]
The proof is complete. Q.E.D.

Proof of Proposition 24. Since $\Xi$ is $W$-monotone, we have $c(a, b) \ge 0$ for all $a, b \in \Xi$, and hence $\varphi_\Xi(b) \ge 0$ for $b \in \Xi$. Thus, if $\Xi \subset X$, we get $\varphi_\Xi(b) = 0$ on $\Xi$. Let now $\tilde a$ be a strong optimal policy. First, we note that (28) implies that $\varphi_\Xi(b(\omega)) = 0$ almost surely for any optimal policy $b(\omega)$. We get that $\tilde\Xi \subset \Xi$, and hence $\varphi_\Xi(b) \le \varphi_{\tilde\Xi}(b)$ for all $b$. Thus,
\[
\int c(b(\omega), g(\omega))\, \mu(\omega)\, d\omega \;\ge\; \int \varphi_{\tilde\Xi}(g(\omega))\, \mu(\omega)\, d\omega \;\ge\; \int \varphi_\Xi(g(\omega))\, \mu(\omega)\, d\omega.
\]
Since both policies are optimal, we must have $\varphi_\Xi = \varphi_{\tilde\Xi}$, and the singleton assumption implies that $a(\omega) = \tilde a(\omega)$. Q.E.D.
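The conditional-expectation step in the proof of Theorem 23, $E[c(a, g(\omega)) - c(b(\omega), g(\omega)) \mid b(\omega)] = c(a, b(\omega))$, holds exactly for any Bregman-type cost once $b(\omega) = E[g(\omega) \mid b(\omega)]$, because the $W(g(\omega))$ terms cancel and the linear term averages out. A quick numerical check, with an illustrative quadratic $W$ (not the paper's welfare function):

```python
import numpy as np

# Check of the identity used in the proof of Theorem 23:
#   E[ c(a, g) - c(b, g) | b ] = c(a, b)   whenever   b = E[g | b],
# with the Bregman-type cost c(x, y) = W(y) - W(x) + DW(x)·(x - y).
# W(x) = ||x||^2 is an illustrative convex choice.
rng = np.random.default_rng(3)
g = rng.normal(size=(100000, 3))        # draws of g(omega) within one "cell"
b = g.mean(axis=0)                      # b = E[g | b] on that cell
a = np.array([0.7, -0.2, 1.1])          # an arbitrary fixed action

def W(x):
    return np.sum(np.asarray(x) ** 2, axis=-1)

def DW(x):
    return 2.0 * np.asarray(x)

def c(x, y):
    return W(y) - W(x) + (DW(x) * (x - y)).sum(axis=-1)

lhs = np.mean(c(a, g) - c(b, g))        # empirical conditional expectation
rhs = c(a, b)
print(abs(lhs - rhs) < 1e-8)            # True: the identity is exact, not asymptotic
```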
Proof of Theorem 26. Our proof is based on an application of Frostman's lemma (see, e.g., Mattila (1999)).
Lemma 38 (Frostman’s lemma)
Define the $s$-capacity of a Borel set $A$ as follows:
\[
C_s(A) = \sup \Big\{ \Big( \int_{A \times A} \frac{d\mu(x)\, d\mu(y)}{\|x - y\|^s} \Big)^{-1} : \mu \text{ is a Borel measure and } \mu(A) = 1 \Big\}.
\]
(Here, we take $\inf \emptyset = \infty$ and $1/\infty = 0$.) Then,
\[
\dim_H(A) = \sup\{ s \ge 0 : C_s(A) > 0 \}.
\]

We only consider the case of a non-degenerate $D_{aa} W$; the case of a degenerate, constant $D_{aa} W$ is proved analogously. Now, by Corollary 37 (formula (27)), we have that the function
\[
q(t) = W(a_1 t + a_2 (1 - t)), \quad t \in (0, 1),
\]
is either identically constant or attains a global minimum at a point $t^* \in (0, 1)$. At this point, we have
\[
(a_1 - a_2)^\top D_{aa} W(t^* a_1 + (1 - t^*) a_2)\,(a_1 - a_2) \;\ge\; 0.
\]
Suppose that $X$ is a sufficiently small ball in $\mathbb{R}^M$. Pick any point $a^* \in X$ for which $D_{aa} W$ is non-degenerate and change the coordinates so that $D_{aa} W(a^*) = \mathrm{diag}(\lambda_1, \cdots, \lambda_M)$ is diagonal. Furthermore, rescaling the coordinates, we may assume that all $\lambda_i$ have absolute values equal to $1$, so that $\lambda_i = 1$ for $i \le \nu(a^*)$ and $\lambda_i = -1$ for $i > \nu(a^*)$. Furthermore, making the ball $X$ sufficiently small, we may assume that $\|D_{aa} W(a) - D_{aa} W(a^*)\| \le \varepsilon$ for all $a \in X$. Let $a_i = (x_i, y_i)$ be the orthogonal decomposition into the two components corresponding to positive and negative eigenvalues. Then,
\[
0 \;\le\; (a_1 - a_2)^\top D_{aa} W(t^* a_1 + (1 - t^*) a_2)\,(a_1 - a_2)
\;\le\; \varepsilon\, \|a_1 - a_2\|^2 + \|x_1 - x_2\|^2 - \|y_1 - y_2\|^2
= (1 + \varepsilon)\, \|x_1 - x_2\|^2 - (1 - \varepsilon)\, \|y_1 - y_2\|^2,
\]
and therefore
\[
\|a_1 - a_2\|^2 = \|x_1 - x_2\|^2 + \|y_1 - y_2\|^2 \;\le\; c_1\, \|x_1 - x_2\|^2
\]
with $c_1 = 1 + (1 + \varepsilon)/(1 - \varepsilon)$. Hence,
\[
\int_{(\Xi \cap X) \times (\Xi \cap X)} \frac{d\mu(a_1)\, d\mu(a_2)}{\|a_1 - a_2\|^s}
\;\ge\; c_1^{-s/2} \int_{(\Xi \cap X) \times (\Xi \cap X)} \frac{d\mu(a_1)\, d\mu(a_2)}{\|x_1 - x_2\|^s}
= c_1^{-s/2} \int_{(\tilde\Xi \cap \tilde X) \times (\tilde\Xi \cap \tilde X)} \frac{d\tilde\mu(x_1)\, d\tilde\mu(x_2)}{\|x_1 - x_2\|^s},
\]
where $\tilde\mu$ is the $x$-marginal of the measure $\mu$ and $\tilde\Xi$ is the projection of $\Xi$ onto $\mathbb{R}^{\nu(a^*)}$. Thus,
\[
C_s(\Xi \cap X) \;\le\; c_1^{s/2}\, C_s(\tilde\Xi \cap \tilde X),
\]
and therefore
\[
\dim_H(\Xi \cap X) \;\le\; \dim_H(\tilde\Xi \cap \tilde X).
\]
Since $\tilde\Xi \subset \mathbb{R}^{\nu(a^*)}$, the claim follows. Q.E.D.

Proof of Corollary 31. We will use an important property of the Dirichlet distribution: defining $\bar\omega_1 = (\omega_1, \cdots, \omega_j)$ and $\bar\omega_2 = (\omega_{j+1}, \cdots, \omega_M)$, we have that the random vectors $\mathbf{1}^\top \bar\omega_1$, $\bar\omega_1 / (\mathbf{1}^\top \bar\omega_1)$, and $\bar\omega_2 / (1 - \mathbf{1}^\top \bar\omega_1)$ are jointly independent. In this case, by direct calculation,
\[
a(\omega) = E\Big[ \begin{pmatrix} \psi_1 \bar\omega_1 \\ \psi_2 \bar\omega_2 \end{pmatrix} \,\Big|\, a(\omega) \Big].
\]
Then,
\[
\varphi_\Xi(b) = \min_\omega \Big( W(b) - W(a(\omega)) + D_a W(a(\omega))\,(a(\omega) - b) \Big)
= \min_\omega \Big( W(b) - \sum_i \big( q_i\, E(a_i(\omega)) + \phi_i(\mathbf{1}^\top a_i(\omega)) \big)
+ \sum_i \big( q_i\,(\log(a_i / y_i) + 1) + \phi_i'(\mathbf{1}^\top a_i) \big)^\top \big( a_i(\omega) - b_i \big) \Big).
\]
Since $\mathbf{1}^\top a_i(\omega) = \gamma_i$, the first-order conditions take the form
\[
q_1\, a_1^{-1}\,(a_1 - b_1) = \lambda_1, \qquad q_2\, a_2^{-1}\,(a_2 - b_2) = \lambda_2,
\]
where $\lambda_i$ are Lagrange multipliers for the constraints $\mathbf{1}^\top a_i = \gamma_i$. Thus, denoting $\bar b_i = \mathbf{1}^\top b_i$, we get $a_i = \gamma_i\, b_i / \bar b_i$, and using the identity
\[
E(\gamma\, b_i / \bar b) = \frac{\gamma}{\bar b}\,\big( E(b_i) + \log(\gamma / \bar b)\, \bar b \big),
\]
we get
\[
\varphi_\Xi(b) = \sum_{i=1}^{2} \Big( \phi_i(\bar b_i) - \phi_i(\gamma_i) + \phi_i'(\gamma_i)\,(\gamma_i - \bar b_i) - q_i\,\big( \bar b_i \log(\gamma_i / \bar b_i) - \gamma_i + \bar b_i \big) \Big).
\]
Q.E.D.

References

Arieli, Itai, Yakov Babichenko, Rann Smorodinsky, and Takuro Yamashita, "Optimal Persuasion via Bi-Pooling," Working Paper, 2020.
Aumann, Robert J. and Michael Maschler, Repeated Games with Incomplete Information, MIT Press, 1995.

Beiglböck, Mathias, Nicolas Juillet et al., "On a problem of optimal transport under marginal martingale constraints," Annals of Probability, 2016, (1), 42–106.

Bergemann, Dirk and Stephen Morris, "Information design, Bayesian persuasion, and Bayes correlated equilibrium," American Economic Review, 2016, (5), 586–591.

Bergemann, Dirk and Stephen Morris, "Information design: A unified perspective," Journal of Economic Literature, 2019, (1), 44–95.

Calzolari, Giacomo and Alessandro Pavan, "On the optimality of privacy in sequential contracting," Journal of Economic Theory, 2006, (1).

Das, Sanmay, Emir Kamenica, and Renee Mirka, "Reducing congestion through information design," in "2017 55th Annual Allerton Conference on Communication, Control, and Computing (Allerton)," IEEE, 2017, pp. 1279–1284.

Dworczak, Piotr and Alessandro Pavan, "Preparing for the Worst But Hoping for the Best: Robust (Bayesian) Persuasion," Working Paper, 2020.

Dworczak, Piotr and Anton Kolotilin, "The Persuasion Duality," Working Paper, 2019.

Dworczak, Piotr and Giorgio Martini, "The Simple Economics of Optimal Persuasion," Journal of Political Economy, 2019.

Gabaix, Xavier, "Behavioral inattention," in "Handbook of Behavioral Economics: Applications and Foundations 1," Vol. 2, Elsevier, 2019, pp. 261–343.

Gangbo, W., "Habilitation thesis," Université de Metz, available at http://people.math.gatech.edu/gangbo/publications/habilitation.pdf, 1995.

Gentzkow, Matthew and Emir Kamenica, "A Rothschild-Stiglitz approach to Bayesian persuasion," American Economic Review, 2016, (5), 597–601.

Ghoussoub, Nassif, Young-Heon Kim, Tongseok Lim et al., "Structure of optimal martingale transport plans in general dimensions," Annals of Probability, 2019, (1), 109–164.

Hopenhayn, Hugo and Maryam Saeedi, "Optimal Ratings and Market Outcomes," Technical Report, UCLA, 2019.

Hugonnier, Julien, Semyon Malamud, and Eugene Trubowitz, "Endogenous Completeness of Diffusion Driven Equilibrium Markets," Econometrica, 2012, 1249–1270.

Kamenica, Emir, "Bayesian persuasion and information design," Annual Review of Economics, 2019, 249–272.

Kamenica, Emir and Matthew Gentzkow, "Bayesian Persuasion," American Economic Review, 2011, 2590–2615.

Kleinberg, John and Sendhil Mullainathan, "Simplicity Creates Inequity: Implications for Fairness, Stereotypes and Interpretability," Working Paper, 2019.

Kleiner, Andreas, Benny Moldovanu, and Philipp Strack, "Extreme points and majorization: Economic applications," Available at SSRN, 2020.

Kolotilin, Anton, "Optimal information disclosure: a linear programming approach," Theoretical Economics, 2018, (2), 607–636.

Kolotilin, Anton and Alexander Wolitzky, "Assortative Information Disclosure," 2020.

Kramkov, Dmitry and Yan Xu, "An optimal transport problem with backward martingale constraints motivated by insider trading," arXiv preprint arXiv:1906.03309, 2019.

Levin, Vladimir, "Abstract cyclical monotonicity and Monge solutions for the general Monge–Kantorovich problem," Set-Valued Analysis, 1999, (1), 7–32.

Mattila, Pertti, Geometry of Sets and Measures in Euclidean Spaces: Fractals and Rectifiability, Number 44, Cambridge University Press, 1999.

McCann, Robert J. and Nestor Guillen, "Five lectures on optimal transportation: geometry, regularity and applications," Analysis and Geometry of Metric Measure Spaces: Lecture Notes of the Séminaire de Mathématiques Supérieure (SMS) Montréal, 2011, pp. 145–180.

Mensch, Jeffrey, "Monotone Persuasion," Manuscript, 2018.

Ostrovsky, Michael and Michael Schwarz, "Information disclosure and unraveling in matching markets," American Economic Journal: Microeconomics, 2010, (2).

Rayo, Luis and Ilya Segal, "Optimal Information Disclosure," Journal of Political Economy, 2010, 949–987.

Rochet, Jean-Charles, "A necessary and sufficient condition for rationalizability in a quasi-linear context," Journal of Mathematical Economics, 1987, (2), 191–200.

Rochet, Jean-Charles and Jean-Luc Vila, "Insider trading without normality," The Review of Economic Studies, 1994, (1), 131–152.

Rockafellar, R. Tyrrell, Convex Analysis, Vol. 36, Princeton University Press, 1970.

Smith, Cyril and Martin Knott, "On Hoeffding-Fréchet bounds and cyclic monotone relations," Journal of Multivariate Analysis, 1992, (2), 328–334.

Tamura, Wataru, "Bayesian persuasion with quadratic preferences," Available at SSRN 1987877, 2018.

Taneva, Ina, "Information Design," Edinburgh School of Economics Discussion Paper Series, University of Edinburgh, 2015.

Wei, K.C. John, Cheng F. Lee, and Alice C. Lee, "Linear conditional expectation, return distributions, and capital asset pricing theories," Journal of Financial Research, 1999, 22.