Identification and Estimation of a Rational Inattention Discrete Choice Model with Bayesian Persuasion

Moyu Liao*
Pennsylvania State University

September 18, 2020
Abstract
This paper studies the semi-parametric identification and estimation of a rational inattention model with Bayesian persuasion. The identification requires the observation of a cross-section of market-level outcomes. The empirical content of the model can be characterized by three moment conditions. A two-step estimation procedure is proposed to avoid computational complexity in the structural model. In the empirical application, I study the persuasion effect of Fox News in the 2000 presidential election. Welfare analysis shows that persuasion will not influence voters with a high school education but will generate higher dispersion in the welfare of voters with a partial college education and decrease the dispersion in the welfare of voters with a bachelor's degree.

*First draft: 03/22/2019. This version: 09/16/2020. I would like to thank Marc Henry, Sun Jae Jun, Peter Newberry, Karl Schurter, Jia Xiang, Zhiyuan Chen and conference attendees at the 2019 Econometric Society Asian Meeting for useful comments.

1 Introduction
In many applications of discrete choice models, econometricians usually assume the decision maker has the following random utility from choosing item j among a choice set J = {1, ..., J}: U_j = u_j + ε_j, where u_j is the mean utility observed by the econometricians and ε_j is the utility shock known to the decision maker but not the econometrician. Decision makers in the model choose the item with the highest utility. When the unobserved shock follows the Type I extreme value distribution, we can solve the probability of choosing j analytically. Aggregating the choice outcomes of the decision makers in the market, we can get the market share of an item. This approach to studying market structure was initiated by McFadden (1973), was then adopted by Berry, Levinsohn, and Pakes (1995) (henceforth BLP) to study automobile markets, and became widely applied to other industries.

This model, however, is not easily adaptable to accommodate persuasion in a structural way. Take advertising as a form of persuasion. In the classical analysis of the effect of advertising, three approaches are adopted. The first is to model advertising as a feature of the item that enters mean utility u_j = u_j(A), where the level of advertising A affects the choice utility. The argument is that advertising is 'persuasive' and the individual will buy more of the advertised goods because their utility is distorted (Dorfman and Steiner, 1954). This reduced-form approach does not offer much explanation of how advertisement influences decision making and market structure. The second approach is to model advertisement as the trigger of a consideration set change (Goeree, 2008). The consideration set is the a priori set of items that the decision maker chooses from. Advertisement thus serves as the trigger that puts a previously non-considered item into the consideration set.
This approach views advertisement as an information-revealing device that reveals the true ε_j to the decision maker, which was previously treated as −∞ by the decision maker. If a good j is already in the consideration set for all customers, the consideration set model predicts that advertising has no effect on the market share. If a good is well known, the consideration set model cannot explain why sellers advertise. The third approach is to view advertising as a signaling device to separate the high-quality product from the low-quality product (Nelson, 1974; Bagwell and Ramey, 1988). The degree of advertisement serves as the signal that induces the separating equilibrium where only high-quality firms advertise. In particular, they assume the unobserved quality is common for all decision makers. However, this approach requires the decision maker in the model to have imperfect knowledge of ε_j, which contradicts the assumption that ε_j is known by the decision maker.

Compared to the classical approaches to modeling persuasion, this paper develops an empirical model of persuasion using the Bayesian persuasion theory in Kamenica and Gentzkow (2011). The Bayesian persuasion approach to modeling advertising differs from the previously mentioned informative view in two ways: 1. Decision makers in the model can face different realizations of the product quality; 2. The advertiser, who acts as the Bayesian persuader, does not always want to reveal the quality honestly. However, similar to the informative view, the Bayesian persuasion model assumes that the decision maker in the model only has a prior belief on {ε_j}_{j=1}^J and the exact realization of {ε_j}_{j=1}^J is unknown. The decision maker's prior distribution of {ε_j}_{j=1}^J comes from the reputation of the goods. The prior belief is likely to be common across decision makers.
However, the standard Bayesian persuasion model assumes the decision makers only have access to the signal sent by the persuader to update their belief, and no other sources of information are available. In the real world, decision makers will also actively search for information on the goods' quality by themselves. For example, if a person wants to buy a car, he or she will take a test drive before making a decision. An extensive search for information can reduce the randomness of {ε_j}_{j=1}^J but at the same time is costly. Matejka and McKay (2015) consider a model where the decision maker searches for information on {ε_j}_{j=1}^J to maximize the expected utility after deducting the search cost. Their rational inattention discrete choice model can incorporate Bayesian persuasion by assuming persuaders send signals before the decision makers acquire their own information.

The analysis of structural persuasion and information search models has largely been conducted under the assumption that the decision makers' prior belief on {ε_j}_{j=1}^J, denoted by G, is known to the economist. In empirical research, the prior belief G is unknown and should be estimated from data. A recent empirical study by Xiang (2020) assumes the decision makers' prior distribution G is normally distributed and analyzes the decision makers' welfare change when a policy change induces the persuader to change the persuasion strategy. However, the empirical content of a parametric assumption on G is unclear.

This paper follows Matejka and McKay (2015) to consider a rational inattention discrete choice model with Bayesian persuasion. I discuss the non-parametric identification of the prior distribution G and the parametric identification of the persuader's persuasion strategy when an econometrician observes the choice ratio at the market level across many independent markets.
The independent markets are divided into two groups: the first group is not influenced by the persuader and the second group is influenced by the persuader. The prior distribution G is identified from the choice ratios in the first group of markets. Given the identification of G, a parametric persuasion strategy is identified from the second group of markets. I characterize a set of moment conditions implied by the model, and standard estimation methods such as GMM can be applied easily.

For econometricians who already observe the market shares with and without the influence of persuasion, identifying the persuasion strategy is the first step to understanding the behavior of the persuader. If we assume the persuader uses a persuasion strategy to maximize some utility function, the identified persuasion strategy can help us understand the persuader's objective function. The analysis in this paper leaves the persuader's objective function unspecified and analyzes the behavior from the buyers' side. A complete two-sided analysis would incorporate the persuader's utility as a function of the persuasion strategy and analyze the problem as a sequential game played between the persuader and the buyers.

For policymakers, the knowledge of the prior belief G makes it possible to evaluate the effect of regulating the persuasion strategy. In the advertisement market, for example, policymakers can ban one seller from directly revealing information about its competitors' products. Moreover, policymakers can also evaluate the effect of providing less costly information to the decision makers. In other words, policymakers can compete with existing persuaders in the markets to increase the decision makers' welfare.

In the empirical application, I look at the 2000 presidential election in the United States. I treat the presidential candidates as voters' choices and view voting statistical areas as separate markets.
In 1996, Fox News was launched, and by 2000 it had entered approximately 30% of the towns in the United States. DellaVigna and Kaplan (2007) show that Fox News motivated voters to vote for Republicans compared to voters in towns without Fox News. I take the data and analyze how Fox News persuaded voters in different towns.

The estimated results from the markets without Fox News show that the prior belief on the quality of the presidential candidates varies substantially with voters' education level. Both voters with bachelor's degrees and those with only high school degrees prefer the Democratic Party over the Republican Party. The estimated results also show that Fox News provided very little information to voters, but managed to manipulate the voting outcome by a significant margin. I also compare the welfare of voters with different education levels. Voters' welfare is defined as the probability of choosing their first best choice, and their first best choice is the presidential candidate that would generate the highest utility to voters if they knew the realization of {ε_j}_{j=1}^J. The result shows that persuasion will not influence the welfare of voters with a high school education but will generate higher dispersion in the welfare of voters with a partial college education and decrease the dispersion in the welfare of voters with a bachelor's degree.

Another way to study the effect of persuasion is to model the presence of a persuader as a treatment status (Jun and Lee, 2018). In their model, the presence of a Bayesian persuader is taken as a treatment assignment, and sharp bounds on the persuasion effect are given under various data generating processes. The treatment effect model does not specify the decision makers' utility, and thus an analysis of the decision makers' welfare before and after persuasion is not possible.
The treatment effect model also makes it hard to consider policy counterfactuals such as regulations on the persuasion strategy or the policymaker providing extra information in the market.

The rest of the paper is organized as follows. Section 2 introduces the rational inattention discrete choice model with persuasion. Section 3 discusses the data generating process and the identification strategy. Section 4 discusses the estimation strategy. Section 5 studies the 2000 presidential election and the effect of Fox News. Section 6 concludes.

2 The Model

I consider the standard random utility specification: a decision maker (DM) derives utility level U_j from good j in the choice set J = {1, ..., J}:

U_j = u_j + ε_j.

Here u_j is the mean utility of choosing good j and ε_j is the individual-specific random draw of the utility shock. Throughout this section, I assume that the decision maker knows only (u_1, ..., u_J) but not (ε_1, ..., ε_J). The decision maker has a prior belief G on the distribution of the utility shock: (ε_1, ..., ε_J) ≡ ε ∼ G. If there is no further information about the true utility shock ε, the decision maker will choose the item with the highest expected utility:

j ∈ a(G) ≡ arg max_{j∈J} E_G[u_j + ε_j].   (2.1)

If arg max_{j∈J} E_G[u_j + ε_j] is not a singleton, we let a(G) be an arbitrary selection of maximizers. The maximized utility derived from the belief G is given by

V(G) ≡ max_{j∈J} E_G[u_j + ε_j].   (2.2)

I will first introduce the rational inattention discrete choice model and then discuss how persuasion can be incorporated.

2.1 The Rational Inattention Discrete Choice Model

The rational inattention discrete choice model in Matejka and McKay (2015) assumes that the decision maker can choose an information strategy to get a signal s^DM. The signal s^DM updates the decision maker's belief on the true utility shock ε. The decision maker then chooses the item with the highest posterior mean utility according to (2.1).
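As a concrete illustration of the no-information benchmark, the decision rule (2.1) and value (2.2) can be computed directly for a discrete prior. The sketch below uses made-up numbers purely for illustration; it is not code from the paper.

```python
import numpy as np

# Hypothetical example: J = 3 items and a discrete prior G over the shock
# vector (eps_1, eps_2, eps_3) with two equally likely states.
u = np.array([1.0, 0.5, 0.0])              # mean utilities u_j
eps_states = np.array([[0.0, 2.0, 0.0],    # eps realization in state 1
                       [0.0, -2.0, 0.0]])  # eps realization in state 2
prior = np.array([0.5, 0.5])               # prior G over the two states

exp_u = prior @ (u + eps_states)           # E_G[u_j + eps_j] for each j
a_G = int(np.argmax(exp_u))                # decision rule a(G) in (2.1)
V_G = float(exp_u.max())                   # value V(G) in (2.2)
print(a_G, V_G)                            # item 0 is chosen, V(G) = 1.0
```

Note that the shock on item 2 averages out under the prior, so without further information the DM chooses on mean utilities alone.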
Following the notation in Matejka and McKay (2015), denote v_j ≡ u_j + ε_j. Formally, the decision maker's information strategy is a joint distribution of the true utility vector v ∈ R^J and the signal s^DM ∈ R^J, denoted by F(s^DM, v). The marginal distribution of the information strategy has to be consistent with the prior belief G. Once the decision maker is committed to the information strategy, the random shocks to utility are realized, and then the decision maker gets a realized signal s^DM from F(s^DM | v). The decision maker updates his belief to F(ε | s^DM) and chooses the item in a(F(ε | s^DM)) according to (2.1).

Since the real utility shocks are not observed by the decision maker, the decision maker solves the following optimization problem to maximize his expected utility:

max_{F ∈ Δ(R^J)} ∫_v ∫_{s^DM} V(F(· | s^DM)) F(ds^DM | v) G(dv) − c(F)   (2.3)

s.t. ∫_{s^DM} F(ds^DM, v) = G(v),   (2.4)

where V(F(· | s^DM)) is determined by (2.2). The constraint (2.4) requires that the DM's prior distribution G be consistent with the real state of the world. The cost of information c(F) is the mutual information between the shocks ε and the signal s^DM:

c(F) = λ{H(G) − E_s[H(F(· | s^DM))]},   (2.5)

where the parameter λ is the unit cost of information, and E_s denotes the expectation over the signal marginal of F(s^DM, v). The entropy function H of a discrete distribution G is defined as H(G) = −Σ_k P_k log(P_k), where P_k is the probability of state k. When G is continuously distributed, the differential entropy is defined as H(G) = −∫ g(s) log(g(s)) ds. The use of entropy reduction as a measure of information cost is standard in the rational inattention literature. See De Oliveira et al. (2017) for a discussion of entropy cost.
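The mutual-information cost (2.5) can be computed directly when the state and signal are discrete. The minimal sketch below, with made-up numbers, checks the two polar cases: a fully revealing information strategy pays the whole prior entropy H(G), while a signal independent of the state is free.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_k p_k log p_k of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def info_cost(joint, lam=1.0):
    """Mutual-information cost (2.5): lam * (H(G) - E_s[H(F(.|s))]).

    joint[s, v] is a discrete joint distribution of signal s and state v;
    its column sums give the prior G, its rows give the posteriors F(.|s)."""
    G = joint.sum(axis=0)                 # marginal over states (the prior)
    p_s = joint.sum(axis=1)               # marginal over signals
    cond_H = sum(p_s[i] * entropy(joint[i] / p_s[i])
                 for i in range(len(p_s)) if p_s[i] > 0)
    return lam * (entropy(G) - cond_H)

# Fully revealing strategy: the signal pins down the state, cost = H(G) = log 2.
full = np.array([[0.5, 0.0],
                 [0.0, 0.5]])
# Uninformative strategy: signal independent of the state, cost = 0.
none = np.array([[0.25, 0.25],
                 [0.25, 0.25]])
print(info_cost(full), info_cost(none))
```

The intermediate strategies the DM actually uses in the model lie between these two extremes, trading off expected utility against this cost.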
Moreover, the entropy is related to the complexity of a random variable and can be given a data compression interpretation. The mutual information in (2.5) can be interpreted as the number of binary questions asked in acquiring the signal s^DM. Appendix A gives an example of the data compression interpretation.

Let S^DM_j ≡ {s^DM ∈ R^J : a(F(· | s^DM)) = j} be the set of signals that lead the DM to choose j. Also denote

P_j(v) ≡ ∫_{S^DM_j} F(ds^DM | v)   (2.6)

as the conditional choice probability of choosing item j when the realized utility vector is v.¹ Also define the unconditional choice probability of choosing j as

P_j = ∫_v P_j(v) dG(v).   (2.7)

This is the ex-ante probability of choosing j before the utility vector is realized.

A set of optimality conditions for the problem (2.3)-(2.5) from Matejka and McKay (2015) is summarized in the following lemma.

¹ Note that the DM does not know the realization of v. The conditional choice probability should be understood as the choice probability when the actual utility vector is v.

Lemma 2.1. If λ > 0 and F is an optimal information strategy that solves (2.3)-(2.5), then the conditional choice probability in (2.6) satisfies

P_j(v) = P_j e^{v_j/λ} / Σ_{k∈J} P_k e^{v_k/λ} a.s.,   (2.8)

E_G[P_j(v)] = P_j.   (2.9)

The unconditional choice probability in (2.7) solves the following convex optimization problem:

max_{{P_j}_{j=1}^J} ∫_v λ log(Σ_{j=1}^J P_j e^{v_j/λ}) G(dv) s.t. ∀j: P_j ≥ 0, Σ_{k=1}^J P_k = 1.   (2.10)

Conversely, if {P_j}_{j=1}^J is the solution to (2.10), and P_j(v) defined in (2.8) satisfies (2.9), then we can construct an information strategy F such that:

• The signal s^DM is supported on J points: {s_1, ..., s_J};
• The conditional distribution of s^DM satisfies Pr_F(s^DM = s_j | v) = P_j(v).

This information strategy F solves the optimization problem (2.3)-(2.5).

Proof.
See Theorem 1 and Lemma 2 in Matejka and McKay (2015).

Lemma 2.1 shows that solving the optimization problem (2.3)-(2.5) is equivalent to solving the optimization problem (2.10). We do not observe the DM's optimal information strategy. Instead, we observe their choice outcomes. When we aggregate the choice outcomes to the market level, they become the conditional and unconditional choice probabilities.

We should note that the conditional choice probability (2.8) takes a logit-like form. However, the rational inattention discrete choice model does not imply the usual IIA restrictions on the choice probabilities. Matejka and McKay (2015) discuss two equivalent conditions on the conditional choice probability (2.8).

2.2 A Sequential Persuasion Game
Consider a persuader that tries to influence the choice probabilities by choosing a persuasion strategy and sending a realized signal. The persuader is also called the information designer (ID) in the Bayesian persuasion literature.
Definition 1.
A persuasion strategy is a joint distribution F̃(s^ID, v) of the signal s^ID ∈ R^J sent by the ID and the utility vector such that

∫ F̃(s^ID, v) ds^ID = G(v).

I consider a sequential persuasion game between the decision makers and the information designer played in the following order:

1. The information designer chooses a persuasion strategy and then sends the realized signal s^ID to the decision maker;
2. The decision maker updates his belief to the intermediate distribution:

G̃_{s^ID} ≡ G̃(v | S^ID = s^ID) = G(v) × F̃(s^ID | v) / ∫_v F̃(s^ID, v) dv;   (2.11)

3. The decision maker solves the optimization problem (2.3)-(2.5) with the intermediate belief G̃_{s^ID};
4. The decision maker gets a realized signal s^DM from his optimal information strategy F. He then makes the choice based on the updated belief F(v | s^DM).

For the persuasion strategy to work, it is assumed that the DM who receives the signal knows the joint distribution F̃(s^ID, v).

Assumption 2.1.
The persuasion strategy F̃(s^ID, v) is common knowledge.

Assumption 2.1 on F̃ is satisfied when there is an underlying equilibrium determining how the information designer chooses the persuasion strategy. For example, the information designer can have an objective function M : Δ(R^J) → R so that F̃ = arg max_{F∈C} M(F), where C ⊂ Δ(R^J) is some constrained set of persuasion strategies. When the objective function M and the constraint set C are known by the DM, the decision maker can solve the information designer's optimization problem to get F̃. This paper does not tackle the information designer's objective function. The objective function for the information designer is not easy to formulate. In the marketing context, the trade-off is between the higher marketing cost of persuasion and higher sales. In the context of political persuasion, the goal of persuasion is not to maximize the voting share but to increase the voting share until it exceeds 50%. Also, the media outlet that conducts persuasion may care about other aspects of persuasion, since its persuasion strategy can influence its audience ratings.

The setting of the persuasion game is different from the setting in Bloedel and Segal (2018). In their setting, the decision maker chooses an information strategy to understand the signal sent by the sender. In other words, the decision maker in their model pays an attention cost to understand the signal from the sender and cannot acquire a signal about the true utility by himself. In my formulation, there is no cost to understand the signal s^ID from the sender, and there is a cost incurred in acquiring information about the true utility vector.

Remark 2.1.
The persuasion strategy F̃ and the information strategy F lie in the same space. The effect of persuasion is limited because decision makers can acquire their own information. While the ID can distort the prior distribution of the utility vector v through F̃, the decision maker's information strategy F(s^DM, v) can still provide information to the decision maker.

3 Data Generating Process and Identification

In this section, I discuss a data generating process that allows us to non-parametrically identify the prior belief G and parametrically identify the persuasion strategy F̃. To allow for heterogeneity in decision makers' preferences, I assume the utility of an individual i in market m in demographic group k takes the following additively separable form:

U_ikjm = u_1(x_mj, β) + u_2(x_mj, ν_ik, α) + ε_jm,   (3.1)

where x_mj is the vector of characteristics of product j in market m; k is the index for people of demographic group k with demographic characteristics ν_ik; and m is the market index. The utility functions u_1, u_2 are of known parametric form, and α, β are two vectors to be estimated, but the distribution G is left non-parametric. The utility (3.1) assumes that the decision makers' demographic and product characteristics only influence their mean utility but not the utility shocks. Here I assume that all DMs in the same market m face the same realization ε_m = (ε_1m, ..., ε_Jm), since the random shock vector ε_m in equation (3.1) does not depend on the individual index i. In particular, if individuals i and i′ are in the same market and have the same demographic characteristics, they have the same realized utility vector. This specification is reasonable when the shock is market-specific. For example, when we study the voting decision, the market realization of ε_m can be the real payoff of candidate j's policy for town m's local industry.
In the automobile industry, this market-level state of the world may come from the local road conditions, climate, or geographic topology.

Notation
Throughout this section, I use a tilde (˜) to denote probability quantities related to markets with the presence of a persuader. I also drop the superscript on s^ID and use s to denote signals sent by the information designer whenever there is no confusion. I use m to denote the index for markets, j to denote the index of products, and k to denote the index of demographic groups.

In many data sets, we do not observe individual choices. Instead, we observe the market share, which aggregates the individual choices. Across different markets, I assume that the prior distribution on ε_jm is the same G.

Assumption 3.1. (Data) (i) We observe a binary variable χ_m such that χ_m = 1 if and only if the persuader is present in market m; (ii) The demographic heterogeneity ν_k is discrete and supported on K points. For each market m, the distribution of demographic heterogeneity D_m = (d_m1, ..., d_mK) is observed, where d_mk is the proportion of DMs in group k in market m; (iii) We observe the market characteristics X_m in each market m and the market share vector ms_m = (ms_m1, ..., ms_mJ), where ms_mj is the market share of product j.

Assumption 3.2. (DGP without Persuasion) For markets with χ_m = 0, the data generating process satisfies:

1. Common prior: ε_m ∼ G;
2. Independent random utility shock: ε_m ⊥ X_m;
3. Independent demographic distribution: D_m ⊥ (ε_m, X_m);
4. The choice set J and the information cost λ are the same across different markets.

Assumption 3.2 imposes that the mean of ε_m is independent of the product characteristics and is normalized to zero. If there are any unobserved characteristics correlated with X_mj, the unobserved effects are captured by the observed X_mj.

For markets with a persuader, I assume that the persuader is the same across these markets and uses the same persuasion strategy. Moreover, I assume that the persuasion strategy is a joint distribution of ε_m and s^ID. This specification differs from Definition 1, and the persuader uses the same persuasion strategy even if the product characteristics X_mj vary across markets.

Assumption 3.3. (DGP with Persuasion) For markets with χ_m = 1, the data generating process satisfies:

1. (ε_m, X_mj, D_m) and (λ, J) satisfy the conditions in Assumption 3.2;
2. There is a uniform persuader across markets with χ_m = 1, and the persuasion strategy F̃_k(s^ID, ε) can depend on the demographic group k but not on the market;
3. The persuasion signal s^ID_km ∼ F̃_k(s^ID | ε_m), and the signals s^ID_km are independent of each other across demographic groups and markets;
4. Signal independence: (s^ID_k, ε_m) ⊥ D_m.

Normalization

Since the permutation of the item index does not matter, I call the last item J the outside option. Note that in the discrete choice model, only the relative difference of utilities matters for the DM. Therefore, we can normalize the utility of the outside option J to zero: U_kJ = 0. Also, when u_1, u_2 are homogeneous of degree one with respect to α, β, the vector (α, β, λ, G) is not identified. Indeed, we can consider a model with (cα, cβ, cλ, cG), where cG is the distribution of cε_m. The model (cα, cβ, cλ, cG) will generate the same choice probabilities (2.6) and (2.7).
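This scale non-identification can be verified numerically from the conditional choice probability formula (2.8): scaling the realized utilities and the information cost λ by the same constant c leaves the choice probabilities unchanged. A minimal sketch with arbitrary illustrative numbers:

```python
import numpy as np

def cond_choice_prob(P0, v, lam):
    """Conditional choice probabilities (2.8): P_j e^{v_j/lam} / sum_k P_k e^{v_k/lam}."""
    w = P0 * np.exp(v / lam)
    return w / w.sum()

P0 = np.array([0.2, 0.3, 0.5])     # unconditional choice probabilities
v = np.array([1.0, -0.5, 0.0])     # realized utilities u_j + eps_j
c = 3.7                            # arbitrary positive scale

p = cond_choice_prob(P0, v, lam=1.0)
p_scaled = cond_choice_prob(P0, c * v, lam=c)  # model (c*alpha, c*beta, c*lam, c*G)
print(np.allclose(p, p_scaled))    # the scale c cancels in v/lam
```

Since c cancels inside the exponent, no data on choice probabilities can pin it down, which motivates the normalization λ = 1 below.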
Since a linear specification of utility is frequently used in the applied literature, I assume u_1, u_2 are homogeneous of degree one with respect to α, β.

Assumption 3.4. (Normalization) The utility functions u_1, u_2 are homogeneous of degree 1 with respect to (α, β), and λ = 1.

The parameters of interest include the mean utility parameters (α, β), the prior belief G, and the persuasion strategy F̃. For markets without persuasion, we are also interested in P⁰_kj(X), demographic group k's unconditional choice probability of choosing j when the product characteristics are X. If we want to evaluate the overall effect of persuasion across different markets, we want to compare the post-persuasion market share with P⁰_kj(X). I first define the identified set of (α, β, G, P⁰_kj(X)) from the rational inattention discrete choice model.

Definition 2.
Let F_{χ=0} denote the conditional distribution of (D_m, X_m, ms_m) given χ_m = 0. The identified set of (α, β, G, P⁰_kj(X)) under the rational inattention discrete choice model, denoted by Γ_I, is the collection of (α, β, {P⁰_kj(X)}_{j,k}, G) that satisfies the following constraints:

1. Given (α, β, G), {P⁰_kj(X)}_{j,k} solves the individual's optimization problem (2.10) with

v_mj = u_1(x_mj, β) + u_2(x_mj, ν_ik, α) + ε_mj;   (3.2)

2. The unconditional mean of the conditional choice probability is the unconditional choice probability:

E_G[ P⁰_kj(X) e^{v_mj/λ} / Σ_{l∈J} P⁰_kl(X) e^{v_ml/λ} ] = P⁰_kj(X);   (3.3)
3. Consider the mapping

P_mj(α, β, ε, D_m, X_m, {P⁰_kj(X)}_{j,k}) = Σ_k d_mk · P⁰_kj(X) e^{v_mj/λ} / Σ_{l∈J} P⁰_kl(X) e^{v_ml/λ},   (3.4)

where v_mj is defined in (3.2). Then (D_m, X_mj, P_mj(α, β, ε, D_m, X_m, {P⁰_kj(X)}_{j,k})) has the same distribution as F_{χ=0}.

The first two conditions in Definition 2 correspond to the optimization condition (2.10) and the condition (2.9) in Lemma 2.1. Equation (3.4) calculates the market share of product j as the weighted average of the different demographic groups' choice probabilities. The third condition in Definition 2 requires that the model-predicted market share be consistent with the observed data distribution.

I then define the identified set of the persuasion strategy F̃.

Definition 3.
Let F_{χ=1} denote the conditional distribution of (D_m, X_m, ms_m) given χ_m = 1. Given the value of (α, β, G) and a persuasion strategy F̃, consider the map

P̃_kj,s(ε; X_m) = P̃⁰_kj,s(X_m) e^{v_mj} / Σ_{l=1}^J P̃⁰_kl,s(X_m) e^{v_ml},   (3.5)

where P̃⁰_kj,s(X_m) solves the individual optimization problem (2.10) when the belief is G̃(ε) = F̃(ε | s^ID = s). The identified set of the persuasion strategy is the set of F̃(s^ID, ε) such that

(D_m, X_m, Σ_k d_mk P̃_kj,s(ε; X_m))

has the same distribution as F_{χ=1}.

The identified set in Definition 3 is conditioned on the vector (α, β, G). This is because in markets with the persuader, there are two types of unobserved heterogeneity: the utility shock ε_m and the realization of the signal s^ID. In contrast, in markets without the persuader, only the utility shock ε_m exists. Therefore, knowing the prior distribution G reduces the randomness and makes the problem of identifying the persuasion strategy tractable.

The identified set of (α, β, G) in Definition 2 is defined through the rational inattention discrete choice model only, and it ignores the empirical content of the subsequent persuasion stage. There are two reasons to define the identified set in this way. First, if we only have data on markets without any persuader, i.e. χ_m = 0 for all markets, the identified set defined in Definition 2 can still be used. Second, the unobserved persuasion signal s^ID in the persuasion stage makes it hard to characterize the empirical content of the whole persuasion game. I will stick with these two definitions and characterize the corresponding moment conditions.

Recall that given (α, β) and the unconditional choice probabilities {P⁰_kj(X)}_{j,k}, the model-predicted market share is given by (3.4). Following BLP, I denote δ_mj = u_1(x_mj, β) + ε_mj and let δ_m = (δ_m1, ..., δ_mJ).
Then the predicted market share in (3.4) can be written as

P_mj(α, β, ε, D_m, X_m, {P⁰_kj(X)}_{j,k}) = Σ_k d_mk · P⁰_kj(X_m) e^{δ_mj + u_2(x_mj, ν_k, α)} / [ P⁰_kJ(X_m) + Σ_{l=1}^{J−1} P⁰_kl(X_m) e^{δ_ml + u_2(x_ml, ν_k, α)} ] ≡ ms*_j(X_m, δ_m, α, D_m, {P⁰_kj(X)}_{j,k}),

where the utility of the outside option is normalized to zero, so δ_mJ = u_2(x_mJ, ν_k, α) = 0. Consider a mapping T : R^{J−1} → R^{J−1} such that

[T[X_m, ms_m, α, D_m, {P⁰_kj(X)}_{j,k}](δ_m)]_j = δ_mj + log(ms_mj) − log(ms*_j(X_m, δ_m, α, D_m, {P⁰_kj(X)}_{j,k})),   (3.6)

where [T]_j is the j-th entry of the output vector. The input ms_m is the observed market share. When T(δ_m) = δ_m, the observed market share ms_m equals the model-predicted market share. This map is a contraction mapping whenever the outside option has a nonzero unconditional choice probability. As a result, there exists a unique market-level δ_m that matches the observed market share with the model-predicted market share.

Lemma 3.1.
Suppose that in a market we have ms_mJ > 0. Then the mapping defined by (3.6) is a contraction mapping. Let δ*_m denote the fixed point of the contraction mapping (3.6). As a result, the unobserved heterogeneity δ*_m is a function of the observables x_m, D_m, ms_m and the parameters α and {P⁰_kj(X)}_{j,k}.

Now I state the first identification result for the prior distribution G.

Proposition 1.
For each (α, β, {P⁰_kj(X)}_{j,k}) in the identified set Γ_I defined in Definition 2, there exists a unique G* such that (α, β, G*, {P⁰_kj(X)}_{j,k}) ∈ Γ_I. In particular, for any measurable set B, define the set

MS(B; x_m, D_m, ms_m, α, P⁰_kj(X), β) ≡ {ms_m : δ*_m(x_m, D_m, ms_m, α, P⁰_kj(X)) − [u_1(x_mj, β)]_{j=1}^J ∈ B},

where δ*_m is defined in Lemma 3.1 and [u_1(x_mj, β)]_{j=1}^J = (u_1(x_m1, β), ..., u_1(x_mJ, β))′. The G* satisfies

Pr_{G*}(ε_m ∈ B) = Pr_{F_{χ=0}}(ms_m ∈ MS(B; x_m, D_m, ms_m, α, P⁰_kj(X), β)).   (3.7)

Proof.
I prove this statement by contradiction. Suppose there exists a G′ ≠ G* such that (α, β, G′, {P⁰_kj(X)}_{j,k}) is also in the identified set. Then there exists a positively measured set B′ such that

Pr_{G*}(ε_m ∈ B′) ≠ Pr_{G′}(ε_m ∈ B′).

I claim the distribution of ms_m implied by G′, denoted by F′_{χ=0}, is different from F_{χ=0}. By equation (3.4),

Pr_{F′_{χ=0}}(ms_m ∈ MS(B′; x_m, D_m, ms_m, α, P⁰_kj(X), β)) = Pr_{G′}(ε_m ∈ B′) ≠ Pr_{G*}(ε_m ∈ B′) = Pr_{F_{χ=0}}(ms_m ∈ MS(B′; x_m, D_m, ms_m, α, P⁰_kj(X), β)).

Therefore, G′ cannot generate the same data distribution F_{χ=0}, so G′ is not in the identified set by Definition 2.

Proposition 1 states that once we know (α, β, {P⁰_kj(X)}_{j,k}), the distribution G is point identified. This is similar to the identification strategy in first-price auction models (Guerre et al., 2000). The quantity δ*_m − [u_1(x_mj, β)]_{j=1}^J is the pseudo value of ε_m, similar to the pseudo value constructed from bids in the auction model.

Proposition 2. Suppose Assumptions 3.2 and 3.4 hold, and suppose the unconditional choice probabilities {P⁰_kj(X)}_{j,k} are uniformly bounded away from zero and one. Each (α, β, {P⁰_kj(X)}_{j,k}, G) in the identified set Γ_I defined in Definition 2 satisfies:

1. Constraint on the unconditional choice probability:

E[ ms_m − P⁰(X_m) D_m | D_m, X_m ] = 0,   (3.8)

where P⁰(X_m) is the J × K matrix whose (j, k) entry is P⁰_kj(X_m) and D_m = (d_m1, ..., d_mK)′;
2. Instrument constraint:
$$E\left[\delta^*_j(ms_m, X_m, D_m, \alpha, \{\mathcal{P}^{0,k}_j(X)\}_{j,k}) - u(X_{mj}, \beta) \,\middle|\, X_m, D_m\right] = 0, \quad \forall j = 1, \dots, J-1; \quad (3.9)$$
3. Optimality constraint on $\{\mathcal{P}^{0,k}_j(X)\}_{j,k}$, for all $j = 1, 2, \dots, J-1$ and $k = 1, \dots, K$:
$$E\left[\frac{e^{\delta_{mj} + u(x_{mj}, \nu_k, \alpha)}}{\sum_{l \in \mathcal{J}} \mathcal{P}^{0,k}_l(X)\, e^{\delta_{ml} + u(x_{ml}, \nu_k, \alpha)}} - 1 \,\middle|\, X_m\right] = 0. \quad (3.10)$$
Moreover, $G$ satisfies equation (3.7).

The first moment equality (3.8) is equivalent to condition (3.3) in Definition 2, since $ms_m$ is the conditional choice probability while $\mathcal{P}^{0,k}_j$ is the unconditional choice probability. The second moment equality (3.9) is the consequence of conditions 2 and 3 in Assumption 3.2. The third moment equality (3.10) is the first-order condition of (2.10). Remark 3.1.
The identification results differ from those in BLP in several ways. First, we need the number of markets to be large to identify the unconditional choice probabilities for the different demographic groups from (3.8). From the identified unconditional choice probabilities, we can proceed to identify the coefficients $\alpha$ and $\beta$ on the product characteristics and demographic heterogeneity. Second, in BLP we assume there is a vector of unobserved product heterogeneity $\xi = (\xi_1, \dots, \xi_J)$ that can be recovered by matching market shares to the model prediction; in the rational inattention discrete choice model, we instead recover a vector of market-specific utility shocks $\epsilon_m$. Third, the prior distribution of $\epsilon_m$ is the structural object that we are interested in, whereas the distribution of $\xi$ in BLP is not of fundamental interest. Remark 3.2.
If the price of item $j$, denoted by $q_j$, enters the product characteristics $X_j$, then the price is likely to be correlated with the unobserved market-level utility shock: for example, when sellers know the realization of $\epsilon_m$, they may set prices accordingly. In this case, the assumption $E[\epsilon_m | X_m] = 0$ fails, and we may want to find an instrument for $q_j$. The choice of instruments for the price is discussed in BLP.

Definition 3 of the identified set of the persuasion strategy is conditioned on the value of $(\beta, \alpha, G)$. If $(\beta, \alpha, G)$ is point identified from Proposition 2, we can assume that $(\beta, \alpha, G)$ is known by the econometrician and plug the identified $(\beta, \alpha, G)$ into Definition 3. If $(\beta, \alpha, G)$ is not point identified, we can carry out the analysis by treating each point in the identified set $\Gamma_I$ as the true value separately.

For a point $(\alpha, \beta, G)$ in the identified set $\Gamma_I$, equation (3.5) defines the conditional choice probability of demographic group $k$ choosing item $j$ when they receive a persuasion signal $s$ from the ID. The $\tilde{\mathcal{P}}^{0,k}_{j,s}$ is the unconditional choice probability solved from (2.3)-(2.5) when the intermediate belief is $\tilde{F}(\epsilon | s)$. The choice probability $\tilde{\mathcal{P}}^{0,k}_{j,s}$ is conditioned on the signal $s^{ID}$, but unconditional on the utility shock.

The observed market share $\widetilde{ms}_m$ is a linear combination of the different demographic groups' conditional choice probabilities:
$$\widetilde{ms}_{mj} = (\tilde{P}^1_{j,s}(\epsilon, X_m), \dots, \tilde{P}^K_{j,s}(\epsilon, X_m))(d_{m1}, \dots, d_{mK})'. \quad (3.11)$$
Conditional on $(d_{m1}, \dots, d_{mK})$, we can take expectations on both sides of (3.11) to get:
$$E[\widetilde{ms}_j - (\tilde{P}^1_{j,s}(\epsilon, X_m), \dots, \tilde{P}^K_{j,s}(\epsilon, X_m))(d_{m1}, \dots, d_{mK})' \,|\, D_m, X_m] = 0, \quad \forall j = 1, \dots, J. \quad (3.12)$$
Since we do not observe the realization of the persuasion signal and the realization of the utility shock in each market, we integrate them out.
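The aggregation in (3.11) can be illustrated with a small numerical sketch; all values below are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

# Sketch of equation (3.11): the observed market share is the
# demographic-weighted combination of each group's conditional choice
# probability. The numbers are illustrative, not estimates.
J, K = 3, 2                      # products (incl. outside option), demographic groups
P_tilde = np.array([[0.5, 0.2],  # P_tilde[j, k]: group k's prob. of choosing j
                    [0.3, 0.5],
                    [0.2, 0.3]])
d_m = np.array([0.6, 0.4])       # demographic shares in market m, sum to 1
ms_m = P_tilde @ d_m             # ms_mj = sum_k P_tilde[j, k] * d_mk
print(ms_m)                      # a valid share vector: non-negative, sums to 1
```

Because each column of the choice-probability matrix sums to one and the demographic shares sum to one, the implied market shares automatically sum to one.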
Let
$$h^k_j(X_m; \tilde{F}_k) := \int_{(s,\epsilon)} \tilde{P}^k_{j,s}(\epsilon, X_m)\, d\tilde{F}(s, \epsilon) = \int_{(s,\epsilon)} \tilde{P}^k_{j,s}(\epsilon, X_m)\, d\tilde{F}(\epsilon | s)\, d\tilde{F}_k(s) = \int_s \tilde{\mathcal{P}}^{0,k}_{j,s}(X_m)\, d\tilde{F}_k(s) \quad (3.13)$$
be the unconditional choice probability for demographic group $k$ under persuasion strategy $\tilde{F}(s, \epsilon; \theta)$. The third equality holds because $G(\epsilon | s; \theta) = \tilde{F}(\epsilon | s; \theta)$ by Bayes' rule. Proposition 3.
Under Assumptions 3.2-3.3, for each $(\alpha, \beta, G)$, the true persuasion strategy parameter $\theta$ must satisfy the moment condition
$$E\left[\widetilde{ms}_j - \sum_{k=1}^K h^k_j(X_m; \tilde{F})\, d_{mk} \,\middle|\, D_m, X_m\right] = 0, \quad \forall j = 1, \dots, J-1. \quad (3.14)$$
Proof.
By Assumption 3.2, the independence of the demographic distribution $D_m$ from $(\epsilon_m, X_m)$ implies $E[\tilde{P}^k_{j,s}(\epsilon_m, X_m) | D_m, X_m] = \tilde{\mathcal{P}}^{0,k}_{j,s}(X_m)$. Then by (3.12), we have
$$E[\widetilde{ms}_j - (\tilde{\mathcal{P}}^{0,1}_{j,s}(X_m), \dots, \tilde{\mathcal{P}}^{0,K}_{j,s}(X_m))(d_{m1}, \dots, d_{mK})' \,|\, D_m, X_m] = 0. \quad (3.15)$$
Since the signal $s^{ID} \perp (D_m, X_m)$ by Assumption 3.3, we have $E[\tilde{\mathcal{P}}^{0,k}_{j,s}(X_m) | D_m, X_m] = h^k_j(X_m; \tilde{F})$. The result follows.

The effective number of conditional moment equalities is $J - 1$, since I have the constraint that $\sum_j \widetilde{ms}_j = 1$. We should be careful with the persuasion strategy in Bayesian persuasion: the value of a signal has no meaning beyond the context of the communication game. For example, if $\tilde{F}$ is the distribution of $(\epsilon, s^{ID})$ used by the persuader, let $\tilde{F}^\Delta$ be the distribution of $(\epsilon, s^{ID} + \Delta)$, where $\Delta$ is an arbitrary vector in the same space as $s^{ID}$. As a persuasion strategy, $\tilde{F}^\Delta$ is no different from $\tilde{F}$, since the value of the signal does not matter.

In practice, we can consider the case where the persuasion strategy is indexed by a finite-dimensional parameter $\theta$: $\tilde{F}_k(s^{ID}, \epsilon; \theta)$, with finite support for $s^{ID}$. The persuasion strategy can depend on the demographic group $k$. There are several justifications for the use of a parametric persuasion strategy. First, when there are only two choices, the optimal persuasion strategy is a cut-off rule; see Kamenica and Gentzkow (2016). In this case, the parameter $\theta$ is the cutoff points, and signals take only two values. Second, in many empirical contexts, it is costly to design complex persuasion strategies. For example, an online advertisement can only send a simple signal within a few seconds. If the cost of a signal increases with the number of parameters and support points of the signal distribution, it is natural to restrict the persuasion strategy to a parametric form.
Third, a parametric persuasion strategywith discrete signal support facilitates a clear interpretation of the meaning of the signals.In Kamenica and Gentzkow (2011), signals are interpreted as action recommendations. Discussion of Moment Condition (3.14)
One issue with the moment condition (3.14) is that it does not guarantee identification of the persuasion parameter $\theta$. For example, consider the case with only one demographic group ($K = 1$) and no product-characteristic heterogeneity across markets ($X_m = X$ for all $m$). In this case, moment condition (3.14) implies $h_j(\tilde{F}) = E[\widetilde{ms}_j]$. If $\tilde{F}$ is indexed by a parameter $\theta$ and $h_j(\tilde{F}(\theta))$ is not monotone in $\theta$, then $\theta$ is not necessarily point identified.

Several restrictions help to tighten the identified set of $\tilde{F}$. The first is to impose that the persuasion strategy is the same for certain demographic groups, i.e. $\tilde{F}_k(s | \epsilon) = \tilde{F}_{k'}(s | \epsilon)$ for some $k \neq k'$. Demographic variation then tightens the bounds on the persuasion strategy, because different demographic groups' choice probabilities can have different sensitivities to the same persuasion strategy. The second is to impose that the parameter $\theta$ has dimension smaller than $J$; the variation of choice probabilities across products can then tighten the bounds on the parameter that indexes the persuasion strategy. Third, the variation of product characteristics across markets can also tighten the bounds on $\tilde{F}$: if in a market $m$ the $j$-th product characteristics $x_{mj}$ generate large utility for decision makers, the persuasion strategy is unlikely to change the market share by much. Point Identification Assumption
It is worthwhile to discuss the assumptions under which the parameters $(\alpha, \beta, \mathcal{P}^{0,k}_j, G)$ and $\theta$ are point identified. The moment conditions constructed in (3.8)-(3.10) are similar to the moment conditions appearing in BLP, except that I have the extra parameters $\mathcal{P}^{0,k}_j(X)$ to identify. The moment condition for $\mathcal{P}^{0,k}_j(X)$ is similar to the moment condition for linear regression, so if $E[F_k F_k' | X]$ is invertible $X$-a.s., then $\mathcal{P}^{0,k}_j(X)$ is identified. Global sufficient primitive conditions for identification from moment conditions (3.9)-(3.10) are not easy to interpret, because the fixed point $\delta^*$ in Lemma 3.1 is highly non-linear in its arguments. In a similar situation, BLP assume the moment conditions are sufficient to identify the utility parameters.

Assumption 3.5 (Identification Assumption). 1. $E[F_k F_k' | X]$ is invertible, $X$-a.s. 2. At the true parameter $\{\mathcal{P}^{0,k}_j(X)\}_{j,k}$, there is a unique $(\alpha, \beta)$ such that moment conditions (3.9) and (3.10) hold.

The second requirement in Assumption 3.5 is not as restrictive as it seems. In particular, if there is only one demographic group, the fixed point in Lemma 3.1 is given by
$$\delta^*_j = \log\frac{ms_{mj}}{ms_{mJ}} - \log\frac{\mathcal{P}_j(X_m)}{\mathcal{P}_J(X_m)}, \quad (3.16)$$
and moment condition (3.9) becomes
$$E\left[\log\frac{ms_{mj}}{ms_{mJ}} - \log\frac{\mathcal{P}_j(X_m)}{\mathcal{P}_J(X_m)} - u(X_{mj}, \beta) \,\middle|\, X_m\right] = 0.$$
If $\{\mathcal{P}^{0,k}_j(X)\}_{j,k}$ is identified from moment condition (3.8) and $u$ is a linear function, then $\beta$ is point identified.

Now suppose the persuasion strategy is parametric and indexed by $\theta$. The assumptions that guarantee $\theta$ is identified up to $(\alpha, \beta, G)$ are easier to write down. The discussion of (3.14) shows that $h^{k,0}_j(x_m) \equiv h^k_j(x_m; \theta_0)$ is identified if $E[D_m D_m' | X]$ is invertible $X$-a.s., where $\theta_0$ is the true value of $\theta$.
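The closed-form fixed point (3.16) is easy to check numerically. The sketch below uses one demographic group and illustrative values (not estimates from the paper): under the rational-inattention logit, the conditional choice probability is proportional to $\mathcal{P}^0_j e^{v_j}$, and (3.16) recovers the realized utility differences from shares alone.

```python
import numpy as np

# Sketch of the closed-form fixed point (3.16), single demographic group.
# P0 and v are illustrative values, not quantities from the paper.
P0 = np.array([0.5, 0.3, 0.2])     # unconditional choice probabilities
v = np.array([1.0, 0.2, -0.5])     # realized utilities u + eps in one market
ms = P0 * np.exp(v)
ms /= ms.sum()                      # conditional (observed) market shares
delta = np.log(ms / ms[-1]) - np.log(P0 / P0[-1])   # equation (3.16)
print(delta)                        # equals v - v[-1]
```

The $\log \mathcal{P}^0_j - \log \mathcal{P}^0_J$ terms cancel, so the pseudo values equal $v_j - v_J$ exactly: only utility differences relative to the base alternative are recovered, which is why a normalization on $u$ is needed.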
The identified set of the persuasion strategy is then the set of $\theta^*$ such that $h^k_j(x; \theta^*) = h^{k,0}_j(x)$ for all $j, k$ and $x \in \mathrm{supp}(X)$. Assumption 3.6.
The matrix $E[D_m D_m' | X]$ is invertible $X$-a.s.

Under Assumption 3.5, $(\alpha, \beta, G, \{\mathcal{P}^{0,k}_j(X)\}_{j,k})$ is point identified from moment conditions (3.8), (3.9) and (3.10). (We say $\theta$ is identified up to $(\alpha, \beta, G)$ if the data generating process allows us to point identify $\theta$ for each given parameter $(\alpha, \beta, G)$.) When the product characteristics $X$ are continuously distributed, $\mathcal{P}^{0,k}_j(X)$ in moment condition (3.8) needs to be estimated non-parametrically. However, in some empirical settings the product characteristics are discrete, and standard estimators for moment equalities, such as the GMM estimator, can be implemented directly. In this section, I discuss the estimation of $(\alpha, \beta, G, \{\mathcal{P}^{0,k}_j(X)\}_{j,k})$ when the characteristics $X$ are discrete. Assumption 4.1.
The product characteristics $X_m$ are discretely distributed and supported on $L$ points $\{x(1), \dots, x(L)\}$, with $\inf_{l=1,\dots,L} \Pr(X_m = x(l)) > 1/C$ for some constant $C > 0$.

Under Assumption 4.1, the analysis of moment conditions (3.8), (3.9) and (3.10) can be carried out conditional on each value of $X_m$ separately. Since the demographic characteristics $\nu_k$ are also discrete, the most general utility function of (3.1) under discrete $\nu_k$ and $X$ can be rewritten as
$$u_{ijkm} = \alpha^k_j(l) \quad \text{if } (x_{mj})_{j=1}^J = x(l),$$
where $\alpha^k_j(l)$ is the mean utility of product $j$ for a demographic group $k$ individual in a market with characteristics $x(l)$. Any parametric assumption on the utilities $u_1$ and $u_2$ in (3.1) can be imposed as constraints on the values of $\alpha^k_j(l)$.

Even if $\nu_k$ is distributed on $K$ discrete points, the random vector $D_m$ is continuously distributed. Moment conditions (3.8) and (3.9) are still conditioned on $D_m$, and we need to transform them into unconditional moment conditions. Moment condition (3.8) is linear in the elements of $D_m$, so the optimal instruments are $d_{m1}, \dots, d_{mK}$, and we can write $\mathcal{P}^{0,k}_j(x(l))$ as $\mathcal{P}^{0,k}_j(l)$. For moment condition (3.9), we can use $D_m$ and its second-order power terms $\{(d_{mj})^t : j = 1, \dots, J,\ t = 1, 2\}$ as instruments to form unconditional moment conditions.

Let $\alpha$ denote the vector of $\{\alpha^k_j(l)\}_{j,k,l}$ and $\mathcal{P}$ denote the vector of $\{\mathcal{P}^{0,k}_j(l)\}_{j,k,l}$. (The utility parameter $\beta$ cannot be separated from $\alpha^k_j(l)$, so I normalize $u \equiv 0$ for all $j$.) Let $\gamma(ms_m, D_m, X_m, \alpha, \mathcal{P})$ denote the unconditional moment conditions. The standard GMM estimator of $(\alpha, \mathcal{P})$ is given by
$$(\hat{\alpha}, \hat{\mathcal{P}}) = \arg\min \left[\frac{1}{M}\sum_{m=1}^M \gamma(ms_m, D_m, X_m, \alpha, \mathcal{P})\right]' \hat{W} \left[\frac{1}{M}\sum_{m=1}^M \gamma(ms_m, D_m, X_m, \alpha, \mathcal{P})\right], \quad (4.1)$$
where $M$ is the number of markets without the persuader and $\hat{W}$ is any positive semi-definite weighting matrix.
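The mechanics of (4.1) can be sketched in a few lines: average a stacked moment function over observations, then minimize the weighted quadratic form. For transparency, the moment vector below identifies the mean of a simulated $N(2,1)$ sample; it is a stand-in for the model's moment vector $\gamma$, not the paper's actual moments.

```python
import numpy as np

# Sketch of the GMM objective in (4.1) with a toy moment vector.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=5000)   # simulated data, true mean 2

def quad_form(theta, W=np.eye(2)):
    # stacked moments: E[x - theta] = 0 and E[(x - theta)^2 - 1] = 0
    gbar = np.array([(x - theta).mean(), ((x - theta) ** 2 - 1.0).mean()])
    return gbar @ W @ gbar

grid = np.linspace(1.5, 2.5, 2001)               # crude grid-search minimizer
theta_hat = grid[np.argmin([quad_form(t) for t in grid])]
print(theta_hat)                                 # close to the true value 2
```

In practice a derivative-based optimizer replaces the grid search; the grid keeps the sketch deterministic and dependency-free.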
Standard asymptotic normality results on the GMM estimator can beapplied if the moment condition satisfies some regularity conditions. Assumption 4.2.
Suppose the following conditions hold: (i) the true parameter value $(\alpha_0, \mathcal{P}_0)$ lies in the interior of the parameter space; (ii) $\gamma(ms_m, D_m, X_m, \cdot, \cdot)$ is continuously differentiable on the interior of the parameter space for all $(ms_m, D_m, X_m)$; (iii) $\gamma(ms_m, D_m, X_m, \alpha_0, \mathcal{P}_0)$ has finite second moment; (iv) $E[\nabla_{(\alpha, \mathcal{P})}\, \gamma(ms_m, D_m, X_m, \alpha_0, \mathcal{P}_0)]$ has rank $\dim((\alpha, \mathcal{P}))$; (v) there exists an integrable function $b$ such that $\left|\nabla_{(\alpha, \mathcal{P})}\, \gamma(ms_m, D_m, X_m, \alpha, \mathcal{P})\right| < b(ms_m, D_m, X_m)$.

Conditions (i), (iii) and (iv) are assumptions on the true value of the parameter of interest $(\alpha_0, \mathcal{P}_0)$, which are not verifiable without observing the data distribution. Conditions (ii) and (v) are assumptions on the derivatives of the moment conditions. It is difficult to verify (ii) and (v) because $\delta^*_m$ as a function of $\alpha$ and $\mathcal{P}$ is defined through the contraction mapping (3.6). General primitive conditions on the rational inattention model that guarantee $\delta^*_m$ is continuously differentiable in $(\alpha, \mathcal{P})$ are hard to find. However, when there is no demographic heterogeneity, the $\delta^*_m$ in Lemma 3.1 has the closed-form solution (3.16). In this case, moment conditions (3.9) and (3.10) can be rewritten as
$$E\left[\left(\log\frac{ms_{mj}}{ms_{mJ}} - \log\frac{\mathcal{P}_j(x(l))}{\mathcal{P}_J(x(l))} - \alpha_j(l)\right) \mathbb{1}(X_m = x(l))\right] = 0,$$
$$E\left[\left(\frac{\dfrac{ms_{mj}}{ms_{mJ}} \Big/ \dfrac{\mathcal{P}_j(x(l))}{\mathcal{P}_J(x(l))}}{\sum_{l' \in \mathcal{J}} \mathcal{P}^0_{l'}(x(l)) \left[\dfrac{ms_{ml'}}{ms_{mJ}} \Big/ \dfrac{\mathcal{P}_{l'}(x(l))}{\mathcal{P}_J(x(l))}\right]} - 1\right) \mathbb{1}(X_m = x(l))\right] = 0.$$
If there exists a constant
$C > 0$ such that $\mathcal{P}_j(x(l)) > 1/C$ holds for all $j, l$, then conditions (ii) and (v) hold. Lemma 4.1.
Suppose Assumption 4.2 holds, and denote $B = E[\nabla_{\alpha, \mathcal{P}}\, \gamma(ms_m, D_m, X_m, \alpha_0, \mathcal{P}_0)]$. Then
$$\sqrt{M}\left[(\hat{\alpha}, \hat{\mathcal{P}}) - (\alpha_0, \mathcal{P}_0)\right] \to_d N(0, \Sigma),$$
where $\Sigma = (B'WB)^{-1} B'W \Lambda W B\, (B'WB)^{-1}$ and $\Lambda = E[\gamma(ms_m, D_m, X_m, \alpha_0, \mathcal{P}_0)\, \gamma(ms_m, D_m, X_m, \alpha_0, \mathcal{P}_0)']$. (For example, see Theorem 3.4 of Newey and McFadden (1994).)

Recall that the moment condition for the persuasion strategy in (3.14) is derived for each identified value of $(\alpha, \beta, G)$. I now give an estimator of the persuasion strategy in which the estimated $(\hat{\alpha}, \hat{\mathcal{P}})$ from Lemma 4.1 are plugged directly into (3.14). This is a two-step estimation procedure and will not be efficient; I discuss the complexity of jointly estimating moment conditions (3.8)-(3.10) and (3.14) after the plug-in estimator of the persuasion strategy is introduced.

Given the estimated $(\hat{\alpha}, \hat{\mathcal{P}})$, we can construct a sample of estimated realized utilities
$$\hat{v}_{mj,k}(x(l)) = \sum_{l=1}^L \left[\delta_j(ms_m, X_m, D_m, \hat{\alpha}, \hat{\mathcal{P}}) + \hat{\alpha}^k_j(X_m)\right] \mathbb{1}(X_m = x(l)) \quad (4.2)$$
corresponding to (3.2), and a sample of utility shocks
$$\hat{\epsilon}_{mj} = \delta_j(ms_m, X_m, D_m, \hat{\alpha}, \hat{\mathcal{P}}). \quad (4.3)$$
Fixing the demographic group $k$ and the characteristics $x(l)$, the distribution of $\hat{v}_{mj,k}(x(l))$ conditional on $k$ and $x(l)$ is an estimated distribution of the realized utility.

To form moment condition (3.14), we first need the unconditional choice probabilities $\tilde{\mathcal{P}}^{0,k}_{j,s}(X_m)$ in (3.13) for each demographic group $k$ and each value of the product characteristics. To get an estimator of $\tilde{\mathcal{P}}^{0,k}_{j,s}(X_m)$, denoted by $\hat{\mathcal{P}}^{0,k}_{j,s}(X_m)$, we need to solve the optimization problem (2.10) with an estimated prior belief.
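The sandwich covariance in Lemma 4.1 is mechanical to compute once $B$, $W$ and $\Lambda$ are estimated; a minimal sketch with placeholder matrices (not estimates of the model's Jacobian, weighting matrix, or moment variance):

```python
import numpy as np

# Sketch of Sigma = (B'WB)^{-1} B'W Lambda W B (B'WB)^{-1} from Lemma 4.1.
B = np.array([[1.0, 0.2],   # placeholder Jacobian E[grad gamma]: moments x params
              [0.0, 1.0],
              [0.5, 0.3]])
W = np.eye(3)               # weighting matrix
Lam = 0.5 * np.eye(3)       # Var(gamma) at the true parameter
bread = np.linalg.inv(B.T @ W @ B)
Sigma = bread @ B.T @ W @ Lam @ W @ B @ bread
print(Sigma)                # asymptotic covariance of sqrt(M)(estimator - truth)
```

With $W = I$ and $\Lambda = \tfrac{1}{2}I$, the sandwich collapses to $\tfrac{1}{2}(B'B)^{-1}$, which is a quick sanity check on the formula.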
I look at the empirical counterpart of optimization problem (2.10) under persuasion strategy $\tilde{F}(s^{ID}, \epsilon; \theta)$, conditional on markets with $X_m = x(l)$:
$$\max_{\{\tilde{\mathcal{P}}^{0,k}_{j,s}\}_{j=1}^J}\ \sum_{m=1}^{M(x(l))} \log\left(\sum_{j=1}^J \tilde{\mathcal{P}}^{0,k}_{j,s}(x(l))\, e^{\hat{v}_{mj}}\right) \tilde{F}_k(s^{ID} = s \,|\, \hat{\epsilon}_{mj}; \theta) \quad \text{s.t.} \quad \forall j:\ \tilde{\mathcal{P}}^{0,k}_{j,s}(x(l)) \geq 0, \quad \sum_{j=1}^J \tilde{\mathcal{P}}^{0,k}_{j,s}(x(l)) = 1, \quad (4.4)$$
where $M(x(l))$ is the number of markets such that $X_m = x(l)$. I implicitly impose that the marginal distribution of $\tilde{F}_k(\epsilon_{mj})$ is the empirical distribution of $\hat{\epsilon}_m$, so that by Bayes' rule $\tilde{F}_k(s^{ID} | \hat{\epsilon}_{mj}; \theta) \big/ \sum_{m'=1}^{M(x(l))} \tilde{F}_k(s^{ID} | \hat{\epsilon}_{m'j}; \theta)$ is the posterior belief when the DM receives signal $s$. Let $\hat{\mathcal{P}}^{0,k}_{j,s}(x(l))$ be the solution to (4.4), and denote the vector $(\hat{\mathcal{P}}^{0,k}_{j,s}(x(l)))_{j,k,s,l}$ by $\hat{\mathcal{P}}_s$.

After solving for $\hat{\mathcal{P}}^{0,k}_{j,s}(x(l))$, we can write the empirical version of moment condition (3.14). Let $N$ be the number of markets with persuasion. For all $l = 1, \dots, L$ and $j = 1, \dots, J-1$, denote
$$g_{l,j,k}(\theta, \widetilde{ms}_m, D_m, X_m, \hat{\mathcal{P}}_s) = \left[\widetilde{ms}_{mj} - \sum_{k'=1}^K h^{k'}_j(\theta, \hat{\mathcal{P}}_s, x(l))\, d_{mk'}\right] d_{mk}\, \mathbb{1}(X_m = x(l)), \quad (4.5)$$
$$h^k_j(\theta, \hat{\mathcal{P}}_s, x(l)) = \sum_s \left[\hat{\mathcal{P}}^{0,k}_{j,s}(x(l), \theta) \sum_{m=1}^{N(x(l))} \frac{\tilde{F}(s | \epsilon_m; \theta)}{N(x(l))}\right], \quad (4.6)$$
where $\widetilde{ms}_m$ is the vector of share observations in market $m$, and $N(x(l))$ is the number of markets with persuasion such that $X_m = x(l)$. Then we can estimate $\theta$ by the usual GMM estimator:
$$\hat{\theta} = \arg\min \left(\frac{1}{N}\sum_{m=1}^N g_m(\theta)\right)' W \left(\frac{1}{N}\sum_{m=1}^N g_m(\theta)\right), \quad (4.7)$$
where $g_m(\theta)$ is the vector of moment functions $(g_{l,j})_{l,j}$ in (4.5).

In what follows, I derive the consistency of $\hat{\theta}$ when the persuasion strategy has a smooth parametric form $\tilde{F}(s^{ID}, \epsilon; \theta)$ and the signal $s^{ID}$ is discrete. Assumption 4.3.
The persuasion strategy satisfies, for some $C > 0$ and all values of $s$:

1. $\tilde{F}(s | \epsilon; \theta)$ is differentiable with respect to $\epsilon$, and the gradient is uniformly bounded in $\theta$:
$$\sup_{\theta \in \Theta,\, s} \left|\frac{\partial \tilde{F}(s | \epsilon; \theta)}{\partial \epsilon_j}\right| < C;$$
2. The $\delta^*_m(ms_m, D_m, X_m; \alpha_0, (\mathcal{P}^{0,k}_j(X_m))_{j,k})$ defined in Lemma 3.1 satisfies $\left|\partial \delta^*_{mj} / \partial \kappa\right| < C$ for all $\kappa \in \{\alpha^k_j(l), \mathcal{P}^{0,k}_j(x(l)) : j, k, l\}$ and all values of $(ms_m, D_m, X_m)$;

3. The partial derivatives with respect to the elements of $\theta$ satisfy
$$\sup_{\epsilon, s, i} \left|\frac{\partial \tilde{F}(s | \epsilon; \theta)}{\partial \theta_i}\right| < C.$$

The moment conditions with instruments $Z(D_m)$ are assumed to point identify the parameter $\theta$; the point identification conditions are discussed in Section 3. Assumption 4.4.
Let $g(\theta) = \left(g_{l,j}(\theta, \widetilde{ms}_m, D_m, X_m, \hat{\mathcal{P}}_s)\right)_{l=1,\dots,L;\ j=1,\dots,J-1}$ and define $L(\theta) = g(\theta)' W g(\theta)$. The following identification condition holds: for all $\zeta > 0$,
$$\inf_{d(\theta, \theta_0) > \zeta} L(\theta) - L(\theta_0) > 0.$$
Proposition 4.
Under Assumptions 4.2-4.4 and technical Assumption C.1, $\hat{\theta}$ is a consistent estimator of $\theta_0$. Remark 4.1.
The asymptotic distribution of $\hat{\theta}$ is not derived in this paper; there are two difficulties. First, the unconditional choice probability vector under persuasion $\hat{\mathcal{P}}_s$ is estimated using the sample of markets without persuasion, and its sampling error comes from two sources: (i) $\hat{\mathcal{P}}_s$ is estimated from the empirical version (4.4) of the optimization problem (2.10); (ii) the utility shocks in (4.4) are constructed from the estimator $\hat{\alpha}$. Second, $\mathcal{P}^{0,s}_{j,k}(x(l))$ can be local to the boundary of the parameter space under the true persuasion strategy, i.e. $\mathcal{P}^{0,s}_{j,k}(x(l)) \approx 1/\sqrt{n}$ for some $(j, k, l)$. In this case, the sampling distribution of $\hat{\mathcal{P}}^{0,s}_{j,k}(x(l))$ is difficult to characterize, and so is the influence of this sampling error on $\hat{\theta}$. Joint Estimation and Two Step Estimation
In this section, I briefly discuss how to estimate the persuasion strategy parameter $\theta$ and the preference parameters $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}, G)$ jointly using moment conditions (3.8)-(3.10) and (3.14). The objective function of the joint GMM estimation is simply the stack of $\gamma_m$ in (4.1) and $g_m$ in (4.7). For each $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}, \theta)$ in the parameter space, we need to find the $\delta^*_m$ for each market and construct the pseudo sample of $\{\epsilon_m\}_{m=1}^M$. Given the pseudo sample of $\{\epsilon_m\}_{m=1}^M$, we then solve the optimization problem (4.4) to get $h^k_j$ in (4.6). Given $\epsilon_m$ and $h^k_j$, we can evaluate the joint GMM objective function at this $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}, \theta)$.

The joint GMM estimation procedure introduces two extra computational burdens compared with the two-step procedure. First, the fixed point $\delta^*_m$ needs to be calculated at each parameter evaluation $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}_{j,k,l}, \theta)$ in the joint estimation, whereas in the two-step estimation we find the fixed point only for each $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}_{j,k,l})$. If the dimension of $\theta$ is large, the extra parameter $\theta$ can add a significant computational burden to the joint estimation. Second, the optimization problem (4.4) needs to be solved at each $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}_{j,k,l}, \theta)$ in the joint estimation; in contrast, in the two-step procedure we plug the estimator $(\hat{\alpha}, \{\hat{\mathcal{P}}_{j,k}(x(l))\}_{j,k,l})$ into (3.14), and the optimization problem (4.4) only needs to be solved for each $\theta$. Plugging in the estimator $(\hat{\alpha}, \{\hat{\mathcal{P}}_{j,k}(x(l))\}_{j,k,l})$ reduces the dimension of the parameter space for the second-step GMM estimation.

Joint estimation of $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}_{j,k,l}, \theta)$ also makes inference on $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}_{j,k,l})$ difficult. The discussion under Proposition 4 reveals the difficulty of deriving the asymptotic distribution of $\hat{\theta}$.
The difficulty comes from the unknown limit distribution of $\hat{\mathcal{P}}^{0,s}_{j,k}(x(l))$ when $\mathcal{P}^{0,s}_{j,k}(x(l))$ is local to zero. The same issue arises for $(\hat{\alpha}, \{\hat{\mathcal{P}}_{j,k}(x(l))\}_{j,k,l})$ if we estimate all moment conditions jointly.

In this section, I apply the rational inattention discrete choice model with persuasion to the effect of Fox News on the 2000 presidential election (DellaVigna and Kaplan, 2007). Fox News started distributing its channel in 1996, and its twenty-four-hour cable program had penetrated about 20% of the towns in the United States by November 2000. The Fox News channel is perceived to provide political views that are to the right of mainstream news channels such as ABC and CNN. In the empirical application, I treat the entry of Fox News into the local cable markets as the presence of the persuader. The DMs' prior distribution $G$ is understood as the prior belief about the presidential candidates under mainstream news channels. The goal is to estimate the preference parameters of each demographic group and the persuasion strategy used by Fox News in these markets. The estimated persuasion strategy can reveal the degree of bias in Fox News programming.

Data

The election outcome data are taken from DellaVigna and Kaplan (2007), and the demographic data are a mixture of the original demographic data in DellaVigna and Kaplan (2007) and the 2000 U.S. census data. Each observation consists of a vector of presidential election vote results, a vector of demographic statistics corresponding to a town, and an indicator for the presence of Fox News. The presidential election vote results include the total votes cast, the number of votes for the Democratic Party, and the number of votes for the Republican Party. The demographic statistics include the number of people above 18 years old, the gender ratio, the ethnic group decomposition (African American, Hispanic, Asian, etc.), and the decomposition by education level.
The education level statistics are for eligible voters (18+ years old), but the ethnic group statistics cover both adults and children.

The original demographic data in DellaVigna and Kaplan (2007) are flawed: in about 15% of the towns, the number of votes cast exceeds the number of residents above 18 years old. The issue arises when a town name corresponds to multiple administrative levels; for example, some names are used for both a township and a city in different counties, and such towns tend to be matched incorrectly. I re-match the voting data with the 2000 U.S. census data to address this issue, but the problem is not solved completely: about 5% of the towns still record more votes than adults. As mentioned in DellaVigna and Kaplan (2007), this may be due to flaws in the process of collecting the election data.

I follow the data selection procedure in DellaVigna and Kaplan (2007) and discard towns: 1. without the CNN news channel; 2. where the number of precincts in 2000 differs from that in 1996 by more than 20%; 3. where the total number of votes in 2000 differs from that in 1996 by more than 100%; 4. with multiple cable systems; 5. where the number of people with high school education and above exceeds the number of adults; 6. where the number of votes exceeds the number of adults.

Throughout the application, I assume the choice set includes $J = 3$ options: {Rep, Dem, Out}. (The education level variable in their data set is not correct for some towns: for example, the proportions of residents with no more than high school education and with more than high school education can sum to more than 1.)
Abstaining from voting corresponds to the Out option.
First, I separate markets into two groups: with persuasion and without persuasion. If Fox News is available in the town, I assume the town is under the influence of the persuader. This assumes that the presence of Fox News influences the whole town. Since I only use observations of towns with one unique cable company, if the cable company carries Fox News, everyone in the town has access to the channel; while some residents may not watch the channel, the contents of the news program can spread through workplaces and places of entertainment. This also assumes that towns without Fox News cannot be influenced by persuasion. This assumption suits the historical context of 2000, when fixed broadband subscriptions in the United States accounted for around 2.5% of the population, so streaming of Fox News was not accessible to most voters in towns without Fox News.

The key assumption on markets without persuasion is Assumption 3.2. The i.i.d. assumption on $\epsilon_m$ requires that there be no spatial correlation conditional on the observed characteristics of the town. The variation in $\epsilon_m$ may come from differences in the geographic locations of towns and in the composition of their industries; for example, a cleaner-fuel policy may generate different perceptions in coal mining towns and in forest zones. The independence assumption $\epsilon_m \perp D_m$ requires that the composition of demographics does not influence the prior belief.

For markets with persuasion, Assumption 3.3 requires that Fox News use the same persuasion strategy for all towns, regardless of demographic composition. This assumption is justified because Fox News is a national program, so the persuasion strategy should be perceived similarly in all towns. Last, the assumption that the persuader draws persuasion

Note that this is not a restriction on the entry decision. In fact, Fox News can endogenously choose the towns in which it provides its channel, but this is out of the scope of this paper.
The model aims to estimate the persuasion strategy used by Fox News but does not model Fox News' utility to justify the persuasion and the entry. As long as the persuasion strategy is the same for all towns, the identification argument goes through whether entry was chosen optimally or exogenously.

signals $s^{ID} \sim_{i.i.d.} \tilde{F}_k$ requires that the signals be independent across towns. This assumption is hard to justify, since Fox News is a national program. However, Fox News reports on different aspects of the candidates (e.g. foreign policy, economic policy), and each town may focus on only one aspect of a candidate, which may result in an i.i.d. persuasion signal across towns.

I assume there are no product characteristics across towns. The utility is $u_{mkj} = \alpha_{j,k} + \epsilon_{mj}$, where the parameters $\alpha_{j,k}$ are the mean utility of candidate $j$, which differs across demographic groups $k$. The utility of the outside option is normalized to zero. I partition the decision makers in each town based on their education level at the time of the election: {High School and Lower, College Partial, College Complete}. The segmentation by education level can reflect differences in income levels and in the political spectrum. The estimates and their 95% confidence intervals are reported in Table 1.

Table 1: Estimated Mean Preference Parameters
Choice j     High School                  College Partial            College Complete
Rep          -0.1318 [-0.1540, -0.1050]   0.1369 [0.0816, 0.1848]    0.0306 [0.0079, 0.0538]
Dem          -0.0859 [-0.0983, -0.0707]   0.1260 [0.0693, 0.1725]    0.0702 [0.0529, 0.0857]

The estimation results show several interesting patterns. First, the group with a partial college degree has a slightly lower preference for the Democratic Party than for the Republican Party. The partial college group includes eligible voters who earn degrees from community or technical colleges. (A finer partition of the demographics would be desirable, but the U.S. census data do not provide the joint distribution of education with the other demographic characteristics.) So we see that both the most educated group and the least educated group prefer the Democratic Party, but the middle group seems to be indifferent between these two parties. Second, the College Partial group has a higher willingness to vote. However, this does not imply that the College Partial group votes more for the Democratic Party than those who complete a college education. Table 2 reports the estimated unconditional choice probabilities for each demographic group.

Table 2: Unconditional Choice Probability: With and Without Fox News

             High School          College Partial      College Complete
             No Fox    With Fox   No Fox    With Fox   No Fox    With Fox
Rep          0.1998    0.1610     0.5082    0.5488     0.3031    0.3415
Dem          0.1891    0.2086     0.2925    0.2498     0.3974    0.3634

The results in Table 2 cannot be generated by a random utility model with Logit shocks: such a model would predict that the College Partial group votes more for the Democratic Party than the College Complete group does, because $\alpha_{\text{Dem, College Partial}} > \alpha_{\text{Dem, College Complete}}$. (Note that the confidence interval of $\alpha_{\text{Dem},k}$ does not intersect with that of $\alpha_{\text{Rep},k}$ for $k \in \{$High School, College Complete$\}$.) The estimated density of the prior distribution $G$ is given in Figure ??.

The signal takes two values $\{+, -\}$: a '+' signal means 'candidate 1 is better than candidate 2' when it compares $\epsilon_1$ with $\epsilon_2$, and a '-' signal means the contrary. The persuasion strategy for the high school education group is given by
$$\Pr_{\tilde{F}_{HS}}(S^{ID} = - \,|\, \epsilon) = \begin{cases} 1 & \text{if } \epsilon_{rep} < \epsilon_{dem}, \\ \theta_{hs}^{\,\epsilon_{rep} - \epsilon_{dem}} & \text{if } \epsilon_{rep} \geq \epsilon_{dem}, \end{cases}$$
and the persuasion strategy for the college partial and college complete groups is given by
$$\Pr_{\tilde{F}_{College}}(S^{ID} = - \,|\, \epsilon) = \begin{cases} 0 & \text{if } \epsilon_{rep} > \epsilon_{dem}, \\ 1 - \theta_{c}^{\,-(\epsilon_{rep} - \epsilon_{dem})} & \text{if } \epsilon_{rep} \leq \epsilon_{dem}. \end{cases}$$
(A two-signal persuasion strategy is also justified by Gitmez and Molavi (2018), where the politician in their model has full control of the news media and voters are heterogeneous in their beliefs.)
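The two parametric families above are easy to evaluate; a minimal sketch with an illustrative $\theta$ (the paper's estimates are in Table 3):

```python
# Sketch of the two-signal persuasion families above; theta is illustrative.
def pr_minus_hs(eps_rep, eps_dem, theta):
    # always report '-' when Rep is worse; otherwise a false '-' whose
    # probability decays geometrically in the utility gap
    d = eps_rep - eps_dem
    return 1.0 if d < 0 else theta ** d

def pr_minus_college(eps_rep, eps_dem, theta):
    # never report '-' when Rep is better; otherwise '-' with prob 1 - theta^{-d}
    d = eps_rep - eps_dem
    return 0.0 if d > 0 else 1.0 - theta ** (-d)

theta = 0.95
print(pr_minus_hs(-1.0, 0.0, theta))      # Rep worse: '-' for sure
print(pr_minus_hs(2.0, 0.0, theta))       # false '-' with decayed probability
print(pr_minus_college(2.0, 0.0, theta))  # Rep better: never '-'
print(pr_minus_college(-2.0, 0.0, theta)) # '-' with probability 1 - theta^2
```

With $\theta$ close to 1, both probabilities move slowly in the gap $\epsilon_{rep} - \epsilon_{dem}$, which is consistent with the nearly uninformative signal discussed below Table 3.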
I use the same parametric family for the demographic groups with education higher than high school but treat the least educated group separately. This is because Table 2 shows that only the least educated group has a lower unconditional choice probability for the Republican Party and a higher unconditional choice probability for the Democratic Party after Fox News entered their towns.

The '-' signal in the persuasion strategy for the high school group is sent either when the Republican Party is indeed worse than the Democratic Party, or, with small probability, when the Republican Party is better.

The persuasion strategy for eligible voters with at least a partial college education has a cleaner interpretation. The positive signal $S^{ID} = +$ can be read as 'the Republican candidate is better than the Democratic candidate'. A positive signal is always sent when the Republican is indeed better, i.e. $\epsilon_{rep} > \epsilon_{dem}$, and a fake positive signal can also be sent when $\epsilon_{rep} < \epsilon_{dem}$, but its probability decays as the difference grows in absolute value.

The estimated persuasion strategy parameters are reported in Table 3, and the probability of the "+" signal for the two persuasion strategies is plotted in Figure 2. We should note that the persuasion strategy parameter $\theta$ is very close to 1, and the entropy of the marginal distribution of the signal is close to zero. The close-to-zero entropy indicates that the signal sent by Fox News does not carry much information. However, the relative scale of the entropy is still significantly large compared with the utility parameters $\alpha_{jk}$ for all three groups.

Table 3: Persuasion Strategy

                       High School    College Partial and Complete
Estimator $\hat{\theta}$

Note: The entropy numbers are calculated based on the marginal distribution of the signal.
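The entropy referred to in the note under Table 3 is the Shannon entropy of the signal's marginal distribution. A sketch with an illustrative, nearly degenerate marginal (not the paper's estimated marginal), matching the "close-to-zero entropy" discussion:

```python
import numpy as np

# Shannon entropy (in nats) of a discrete distribution; illustrative inputs.
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())

print(entropy([0.99, 0.01]))           # nearly uninformative signal: ~0.056
print(entropy([0.5, 0.5]))             # maximal binary entropy: log 2 ~ 0.693
```

A marginal close to a point mass yields entropy near zero, so the signal realization is almost deterministic and carries little information on average.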
The overall fit of the persuasion model can be seen from the difference between the unconditional choice probability in the data and the unconditional choice probability predicted by the persuasion strategy. Table 4 shows that the model predicts the unconditional choice probability quite well, except for the high school group's unconditional probability of choosing the Republican party.

Figure 2: Estimated probability of sending the "+" signal and the histogram of ε_rep − ε_dem

Table 4: Unconditional Choice Probability in Towns with Fox News: Model vs Data
        High School       College Partial    College Complete
        Model    Data     Model    Data      Model    Data
Rep     0.1853   0.1610   0.5427   0.5488    0.3335   0.3415
Dem     0.2090   0.2086   0.2614   0.2498    0.3708   0.3634

Costly information acquisition can lead the decision maker to choose the second-best option with some probability. If information is free (i.e. λ = 0, or the decision maker can perfectly observe (ε_rep, ε_dem)), the decision maker should be able to choose the option that maximizes his utility; this is defined as the first-best outcome. The persuasion signal has two influences on decision makers: it provides extra information that reduces the entropy of beliefs, but it also intentionally leads some decision makers to make wrong decisions. In this section, I analyze welfare by asking what percentage of voters cast votes consistent with their first-best choice before and after Fox News enters their town.

Formally, the first-best choice j^{m,fb}_k in a town m is defined as j^{m,fb}_k = argmax_{j∈J} α_{j,k} + ε^m_j, and P^k_{j=j^{fb}}(α + ε^m) is the proportion of voters that make the correct choice in the rational inattention model without persuasion in town m, while Σ_s F̃(s|ε) P^k_{j=j^{fb},s}(α + ε^m) is the proportion of voters that make the correct choice under Fox News persuasion.
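The welfare objects above can be sketched numerically. Assuming the conditional choice probabilities take the multinomial-logit form P_j(ε) ∝ P_{0,j} e^{(α_j + ε_j)/λ} implied by the rational inattention model, with function names and illustrative values of my own:

```python
import math

def ri_choice_probs(p0, v, lam=1.0):
    """Conditional choice probabilities of the rational-inattention form:
    P_j(eps) proportional to p0_j * exp(v_j / lam), where p0 is the
    unconditional choice probability vector and v_j = alpha_j + eps_j."""
    w = [p * math.exp(x / lam) for p, x in zip(p0, v)]
    s = sum(w)
    return [x / s for x in w]

def first_best_share(p0, alpha, eps, lam=1.0):
    """Share of voters in a town with shock eps whose vote matches their
    first-best choice j_fb = argmax_j alpha_j + eps_j."""
    v = [a + e for a, e in zip(alpha, eps)]
    j_fb = max(range(len(v)), key=lambda j: v[j])
    return ri_choice_probs(p0, v, lam)[j_fb]

# Two parties, uniform prior beliefs: costly information gives ~0.73,
# near-free information (small lam) pushes the share toward 1.
print(first_best_share([0.5, 0.5], [0.0, 0.0], [1.0, 0.0]))             # ~0.73
print(first_best_share([0.5, 0.5], [0.0, 0.0], [1.0, 0.0], lam=0.05))
```

Averaging `first_best_share` over prior draws of ε, with and without the persuasion signal mixed in, reproduces the town-level distributions plotted in figure 3.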
Since we have the estimated prior distribution G(ε_rep, ε_dem), we can obtain the distribution of P^k_{j=j^{fb}}(α + ε^m) and Σ_s F̃(s|ε) P^k_{j=j^{fb},s}(α + ε^m). The estimated distribution (across towns) is shown in figure 3. The patterns are quite different for the three groups. For voters with a high school education, persuasion does not really help them make better decisions overall. For voters with a partial college education, persuasion generates higher dispersion in the distribution of voters that vote for their first-best choice. It should be noted that even though the persuasion strategy is the same for voters with partial and complete college education, the persuasion strategy tightens the distribution of the first-best choice for voters who complete a college education.

Figure 3: Distribution of the percentage of voters that achieve their first-best choice

Conclusion

In this paper, I study the identification of the rational inattention discrete choice model with Bayesian persuasion. I derive the conditional moment conditions that identify the mean utility of each product and the prior distribution. I also show the identification of a parametric persuasion strategy when the persuader plays a sequential game with decision makers in the model. In the empirical application, I study the effect of Fox News in persuading voters to vote for the Republican Party, and I analyze the welfare change for voters before and after the influence of Fox News.

For future research, a natural next step is to unify the supply-side model with the identified persuasion strategy. If the supply side, which is Fox News in this context, is rational when it chooses the persuasion strategy, the optimal strategy should impose constraints on its utility parameters. Such parameters are crucial when we conduct a counterfactual analysis on the supply side.
For instance, in the IO context, the preference for persuasion strategy would allow us to model non-price competition.
References
Bagwell, Kyle and Garey Ramey (1988), "Advertising and limit pricing." The RAND Journal of Economics, 19, 59–71.
Berry, Steven, James Levinsohn, and Ariel Pakes (1995), "Automobile prices in market equilibrium." Econometrica, 63, 841–890.
Berry, Steven T (1994), "Estimating discrete-choice models of product differentiation." The RAND Journal of Economics, 25, 242–262.
Bloedel, Alexander W and Ilya R Segal (2018), "Persuasion with rational inattention." Working Paper.
Cover, Thomas M and Joy A Thomas (2006), Elements of Information Theory, 2nd edition. Wiley.
De Oliveira, Henrique, Tommaso Denti, Maximilian Mihm, and Kemal Ozbek (2017), "Rationally inattentive preferences and hidden information costs." Theoretical Economics, 12, 621–654.
DellaVigna, Stefano and Ethan Kaplan (2007), "The Fox News effect: Media bias and voting." The Quarterly Journal of Economics, 122, 1187–1234.
Dorfman, Robert and Peter O Steiner (1954), "Optimal advertising and optimal quality." The American Economic Review, 44, 826–836.
Gitmez, Arda and Pooya Molavi (2018), "Media capture: A Bayesian persuasion approach." Working Paper.
Goeree, Michelle Sovinsky (2008), "Limited information and advertising in the US personal computer industry." Econometrica, 76, 1017–1074.
Guerre, Emmanuel, Isabelle Perrigne, and Quang Vuong (2000), "Optimal nonparametric estimation of first-price auctions." Econometrica, 68, 525–574.
Jun, Sung Jae and Sokbae Lee (2018), "Identifying the effect of persuasion." Working Paper.
Kamenica, Emir and Matthew Gentzkow (2011), "Bayesian persuasion." American Economic Review, 101, 2590–2615.
Kamenica, Emir and Matthew Gentzkow (2016), "A Rothschild-Stiglitz approach to Bayesian persuasion." American Economic Review: Papers & Proceedings, 106, 597–601.
Matejka, Filip and Alisdair McKay (2015), "Rational inattention to discrete choices." American Economic Review, 105, 272–298.
McFadden, Daniel (1973), "Conditional logit analysis of qualitative choice behavior." In Frontiers in Econometrics, edited by P. Zarembka. Academic Press.
Nelson, Philip (1974), "Advertising as information." Journal of Political Economy, 82, 729–754.
Newey, Whitney K and Daniel McFadden (1994), "Large sample estimation and hypothesis testing." In Handbook of Econometrics, Vol. IV, edited by R. F. Engle and D. L. McFadden, 2111–2245.
Van der Vaart, Aad W (2000), Asymptotic Statistics. Cambridge University Press.
Wellner, Jon and Aad W. van der Vaart (2013), Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media.
Xiang, Jia (2020), "Physicians as persuaders: Evidence from hospitals in China." Working Paper.
A Appendix 1: Data Compression Interpretation of Entropy Cost
The entropy of a discrete random variable is closely related to the expected number of binary questions needed to determine its realization. Consider the following example:
• X is supported on 4 points: X₁ = (H, H), X₂ = (H, L), X₃ = (L, H) and X₄ = (L, L).
• The probability of each realization is P₁ = P₄ = 1/3 and P₂ = P₃ = 1/6.
• Consider two ways of asking questions:
1. Q1: The state is: (A) First component is H; (B) First component is L. Q2: (A) Second component is H; (B) Second component is L.
2. Q1: The state is: (A) Both high; (B) Both low; (C) Neither. Q2: The state is: (A) (H, L); (B) (L, H).
Using the first approach, we always need two binary questions to pin down the realization. Using the second approach, we always ask one 3-adic question, and with probability 1/3 we need another binary question. If we treat a 3-adic question as equivalent to log₂ 3 binary questions, the expected number of binary questions we need to ask is

log₂ 3 + 1/3 = −(2/3) log₂(1/3) − (1/3) log₂(1/6),

which is the entropy number. In many examples the entropy number cannot be attained with an integer number of binary questions, but it is nonetheless a good approximation of the complexity of the random variable.

Now consider the entropy cost function defined in (2.5). The entropy H(G) is interpreted as the number of binary questions needed to pin down a draw from the prior distribution. (To see the conversion between question alphabets: if N binary questions cover all possible states of the world, the cardinality of the states is approximately 2^N; if instead we need N_M M-adic questions, then M^{N_M} ≈ 2^N, so N = N_M log₂ M. A more rigorous conversion argument can be established using large-scale data compression theory; see Cover and Thomas (2006), Chapter 5.) Given a signal s that the DM acquires from the world, the number of binary questions remaining is H(F(·|s)), so the expected number of questions remaining is E_s[H(F(·|s))]. Therefore, the entropy difference H(G) − E_s[H(F(·|s))] is interpreted as the expected number of binary questions answered by the signal s, and the unit cost of information λ is interpreted as the market price of asking a binary question.

The interpretation still works when the signal s is discrete but the state v is continuous. Consider an example where X ∼ U[−1, 1], and Y = 1 when X ≥ 0 while Y = 0 when X < 0. Since X is negative with probability 0.5, Y answers exactly one binary question: whether X is negative or not. Direct calculation shows that H(X) = 1 and H(X|Y) = 0, so the mutual information is I(X; Y) = H(X) − H(X|Y) = 1.

When the pair (s, v) is continuously distributed, the data compression argument needs to be modified slightly. The approach is to take a quantization of the random variables: the quantization of v slices the support of v into cubes of side length ∆.
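Both arguments can be checked numerically. A short sketch, assuming the four-point example assigns probability 1/3 to each matched state (H,H), (L,L) and 1/6 to each mixed state (the assignment that makes the three answers of the first 3-adic question equally likely), and taking V ~ U[−1,1] for the quantization limit:

```python
import math

# Entropy equals the expected number of binary questions under scheme 2.
p = {("H", "H"): 1/3, ("L", "L"): 1/3, ("H", "L"): 1/6, ("L", "H"): 1/6}
entropy = -sum(q * math.log2(q) for q in p.values())

# Scheme 2: one 3-adic question (= log2(3) binary questions), plus one extra
# binary question with probability 1/3 (when the first answer is "neither").
expected_questions = math.log2(3) + 1/3
print(entropy, expected_questions)   # both ~1.918

# Quantization: slicing U[-1,1] into bins of width delta, the discrete entropy
# satisfies H(V_delta) + log2(delta) -> differential entropy = 1 bit.
for delta in (0.1, 0.01):
    n_bins = round(2 / delta)        # support of U[-1,1] has length 2
    p_bin = 1.0 / n_bins             # each bin has probability delta/2
    h_disc = -n_bins * p_bin * math.log2(p_bin)
    print(h_disc + math.log2(delta))  # -> 1.0 (up to rounding)
```

The first two printed numbers coincide exactly, and the quantization correction recovers the 1-bit differential entropy for every bin width.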
As the quantization length ∆ → 0, the entropy of the discretized random vector, denoted V(∆), converges to the differential entropy of v in the following sense:

H(V(∆)) + log₂ ∆^J → H(G(v)) as ∆ → 0,

where J is the dimension of v. We can perform the same quantization for the signal variable s. When we calculate the entropy difference H(G) − E_s[H(F(·|s))], which is the mutual information, the effect of quantization cancels; see Cover and Thomas (2006), Chapter 8 for a discussion of quantization. We can then apply the data compression interpretation to the quantized version of (v, s).

B Appendix 2: Proofs of Section 3
B.1 The Contraction Mapping Lemma 3.1
Proof.
The proof is a minor adaptation of Berry et al. (1995). To show that the operator T is a contraction mapping, it suffices to show that the conditions of Theorem 1 in BLP hold. Let T_j : R^{J−1} → R denote the j-th component of the mapping T : R^{J−1} → R^{J−1} defined in (3.6). I use the following notation for the proof:

P^k_j(δ, X, P_{0,k}, ν_k, α) = P_{0,kj}(X) e^{δ_j + u(X_m, ν_k, α)} / Σ_{l∈J} P_{0,kl}(X) e^{δ_l + u(X_m, ν_k, α)}.   (B.1)

First note that

∂T_j/∂δ_j = 1 − (1/ms*_j) Σ_k P^k_j(δ, X, P_{0,k}, ν_k, α)(1 − P^k_j(δ, X, P_{0,k}, ν_k, α)) d_k ≥ 1 − (1/ms*_j) Σ_k P^k_j(δ, X, P_{0,k}, ν_k, α) d_k ≥ 0,

∂T_j/∂δ_l = (1/ms*_j) Σ_k P^k_j(δ, X, P_{0,k}, ν_k, α) P^k_l(δ, X, P_{0,k}, ν_k, α) d_k ≥ 0 for l ≠ j,

and for any j = 1, ..., J − 1, Σ_l ∂T_j/∂δ_l < 1.
The extra condition that the outside option is chosen with positive unconditional choice probability is not required in the proof of Berry et al. (1995), because when the shock is supported on an unbounded space, the outside option always has a positive choice probability. The last step is also slightly different from Berry (1994), because here the unconditional choice probability P_{0,kj} appears in the denominator.

B.2 Proof of Proposition 2
Proof.
Since all three moment conditions are conditional on X_m, and since by assumption 3.2 the product characteristics X_m are independent of the random utility shocks ε^m and of the demographic distribution vector D_m, I prove the proposition conditional on the value of X_m and drop X_m from the moment condition expressions whenever there is no confusion.

Constraint on P_{0,kj}. For each market m, we observe only the market share vector ms^m = (ms^m_1, ..., ms^m_J)′ and the demographic distribution D_m = (d_{m1}, ..., d_{mK}), where d_{mk} is the share of people in demographic group k in market m. Then in market m, the observation ms^m satisfies:

ms_{mj} = Σ_{k=1}^K P^k_j(ε^m) d_{mk}, ∀ j = 1, ..., J.
If we take expectations with respect to the G distribution and the demographic distribution on both sides of the above equation, we have

E_G[ ms_{mj} − (P^1_j(ε^m), ..., P^K_j(ε^m))(d_{m1}, ..., d_{mK})′ | D_m ] = 0.

By assumption 3.2, (d_{m1}, ..., d_{mK}) ⊥ (ε^m, X_m), so E_G[P^k_j(ε^m) d_{mk} | (d_{m1}, ..., d_{mK})] = d_{mk} E_G[P^k_j(ε^m)] = d_{mk} P_{0,kj}. Using the linearity of expectation, we can rewrite the above equation as:

E[ ms_{mj} − (P_{0,1j}, ..., P_{0,Kj})(d_{m1}, ..., d_{mK})′ | D_m ] = 0.

This is the moment condition (3.8).
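The moment condition can be verified on simulated data: build model-consistent markets and check that the sample analogue of (3.8) vanishes. Everything below (the probabilities P0, the sample sizes, the shock process standing in for ε^m) is illustrative:

```python
import random

random.seed(0)

J, K, M = 2, 3, 20000
# Hypothetical unconditional choice probabilities P0[k][j].
P0 = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]

g_bar = [0.0] * J
for _ in range(M):
    d = [random.random() for _ in range(K)]
    s = sum(d)
    d = [x / s for x in d]                 # demographic shares in market m
    ms = [0.0] * J
    for k in range(K):
        shock = random.gauss(0.0, 0.05)    # mean-zero deviation around P0
        pk = [P0[k][0] + shock, P0[k][1] - shock]   # conditional P^k_j(eps_m)
        for j in range(J):
            ms[j] += pk[j] * d[k]          # ms_mj = sum_k P^k_j(eps_m) d_mk
    # moment g_mj = ms_mj - sum_k P0[k][j] d_mk, averaged over markets
    for j in range(J):
        g_bar[j] += (ms[j] - sum(P0[k][j] * d[k] for k in range(K))) / M

print(g_bar)   # each component close to zero
```

Because the conditional probabilities fluctuate around P0 with mean zero, the averaged moment converges to zero as M grows, which is exactly what (3.8) asserts.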
Independent ε constraint
Lemma 3.1 establishes δ^m as a function of (α, β, P_{0,kj}), so we can write ε as the difference between δ and u. The moment condition (3.9) then follows directly from the assumption ε^m ⊥ D_m in assumption (3.2).

Optimality constraint
Lastly, I derive the condition implied by the fact that P_{0,kj} solves the optimization problem (2.10). Since P_{0,kj} is uniformly bounded away from zero and one, the first order condition of (2.10) is

∫_ε e^{δ_{mj} + u(x_{mj}, ν_k, α)} / ( Σ_{l=1}^J P_{0,kl} e^{δ_{ml} + u(x_{ml}, ν_k, α)} ) dG(ε) = 1.

Note that (2.10) is a convex optimization problem, so the first order condition is sufficient to characterize the solution. The first order condition can therefore be transformed into:

E[ e^{δ_j + u(x_{mj}, ν_k, α)} / ( Σ_{l∈J} P_{0,kl} e^{δ_l + u(x_{ml}, ν_k, α)} ) − 1 ] = 0,

which is the moment condition (3.10).

C Proofs of Proposition 4
Some Notation
Fix a θ and a persuasion strategy F̃(s_ID, ε; θ). Recall that I use P̂_{0,kj,s}(x(l); θ) to denote the estimated unconditional choice probability under persuasion signal s solved from (4.4), and P̂_s(θ) to denote the vector over all indices j, k, l, s. I use P̃_{0,kj,s}(x(l); θ) to denote the true unconditional choice probability under persuasion solved from (2.10), and P̃_s(θ) to denote the corresponding vector over all j, k, l, s. I use P_0 to denote the true unconditional choice probabilities without persuasion that correspond to the moment condition (3.8), and P̂_0 to denote its estimator. I use Ĝ to denote the empirical distribution of ε̂ and G to denote the true distribution of ε. I use B_r(·) to denote a neighborhood of radius r around (·).

C.1 Some Lemmas
Assumption C.1.
Fixing the indices k, l, s, let

M({P_j}_{j=1}^J, θ) = ∫_ε Σ_{j=1}^J P_j e^{α_{kj}(x(l)) + ε_j} F̃_k(s|ε; θ) dG(ε).

The following condition holds: ∀ θ ∈ Θ and ∀ κ > 0, there exists some ζ > 0 such that

inf_{d((P_j)_{j=1}^J, (P̃_{0,kj,s}(x(l);θ))_{j=1}^J) > κ} [ M({P̃_{0,kj,s}(x(l); θ)}_{j=1}^J, θ) − M({P_j}_{j=1}^J, θ) ] > ζ.
Fixing the indices k, l, s, let

M_n({P_j}_{j=1}^J, θ) = (1/M(x(l))) Σ_{m=1}^M Σ_{j=1}^J P_j e^{α_{kj,0}(x(l)) + ε_{mj}} F̃_k(s|ε_m; θ) 1(X_m = x(l)),

where M(x(l)) = Σ_{m=1}^M 1(X_m = x(l)). Suppose the assumptions in Proposition 4 hold; then

inf_{θ∈Θ} [ M_n({P̂_{0,kj,s}(x(l); θ)}_{j=1}^J, θ) − sup_{(P_j)_{j=1}^J ∈ ∆_{J−1}} M_n({P_j}_{j=1}^J, θ) ] = −o_p(1),

where ∆_{J−1} is the (J−1)-dimensional probability simplex.
M_n differs from the objective function of (4.4) because α_{kj,0}(x(l)) is the true value of α, while (4.4) uses α̂. This lemma shows that P̂_s(θ) is also an o_p(1)-maximizer of M_n.

Proof. Define

M̂_n({P_j}_{j=1}^J, θ) = (1/M(x(l))) Σ_m Σ_j P_j e^{α̂_{kj,0}(x(l)) + ε̂_{mj}} F̃_k(s|ε̂_m; θ) 1(X_m = x(l)),

which is the objective function in (4.4), and {P̂_{0,kj,s}(x(l))}_{j=1}^J is the maximizer of this objective function on the simplex ∆_{J−1}. Let {P*_j(θ)}_{j=1}^J be the maximizer of M_n({P_j}_{j=1}^J, θ); then we have

M̂_n({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ) ≥ M̂_n({P*_j(θ)}_{j=1}^J, θ)
= (1/M(x(l))) Σ_m Σ_j P*_j(θ) e^{α̂_{kj,0}(x(l)) + ε̂_{mj}} F̃_k(s|ε̂_m; θ) 1(X_m = x(l))
= (1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(1) 1(X_m = x(l)),   (C.1)

where the function f_{mj} is defined by

f_{mj}(t) = e^{α_{kj,0}(x(l)) + t(α̂_{kj,0}(x(l)) − α_{kj,0}(x(l))) + ε_{mj} + t(ε̂_{mj} − ε_{mj})} F̃_k(s|ε_m + t(ε̂_m − ε_m); θ).

By the mean value theorem, we can find t_{mj} ∈ [0, 1] such that f_{mj}(1) = f_{mj}(0) + (f_{mj})′(t_{mj}). The derivative with respect to t is

(f_{mj})′(t) = f_{mj}(t)[ α̂_{kj,0}(x(l)) − α_{kj,0}(x(l)) + ε̂_{mj} − ε_{mj} ] + e^{α_{kj,0}(x(l)) + t(α̂_{kj,0}(x(l)) − α_{kj,0}(x(l))) + ε_{mj} + t(ε̂_{mj} − ε_{mj})} Σ_{i=1}^J (∂F̃_k/∂ε_i)(ε̂_{mi} − ε_{mi}).   (C.2)

Now I bound the term ε̂_{mj} − ε_{mj}:

|ε̂_{mj} − ε_{mj}| = | δ*_j(ms^m, D_m, X_m, α̂, P̂_0) − δ*_j(ms^m, D_m, X_m, α_0, P_0) |
= | Σ_{j,k} (∂δ*_{mj}/∂α_{kj}(l))(α̂_{kj}(l) − α_{kj}(l)) + Σ_{j,k} (∂δ*_{mj}/∂P_{0,kj}(l))(P̂_{0,kj}(l) − P_{0,kj}(l)) |
≤ J K C max_{j,k}{ max{ |α̂_{kj}(l) − α_{kj}(l)|, |P̂_{0,kj}(l) − P_{0,kj}(l)| } },   (C.3)

where the inequality holds by Assumption 4.3. Moreover, by Assumption 4.3, |∂F̃_k/∂ε_i| < C also holds. Now denote the term max_{j,k}{max{|α̂_{kj}(l) − α_{kj}(l)|, |P̂_{0,kj}(l) − P_{0,kj}(l)|}} by o*_{α,P}; combining (C.2) and (C.3), we have

|(f_{mj})′(t)| ≤ J K C² e^{α_{kj,0}(x(l)) + t(α̂_{kj,0}(x(l)) − α_{kj,0}(x(l))) + ε_{mj} + t(ε̂_{mj} − ε_{mj})} × o*_{α,P}.

Plugging f_{mj}(1) back into (C.1), we get

(1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(1) 1(X_m = x(l))
= (1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(0) 1(X_m = x(l)) + (1/M(x(l))) Σ_m Σ_j P*_j(θ)(f_{mj})′(t_{mj}) 1(X_m = x(l))
≥ (1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(0) 1(X_m = x(l)) − J K C² |o*_{α,P}| (1/M(x(l))) Σ_m Σ_j P*_j(θ) e^{α_{kj,0}(x(l)) + t(α̂_{kj,0}(x(l)) − α_{kj,0}(x(l))) + ε_{mj} + t(ε̂_{mj} − ε_{mj})} 1(X_m = x(l)).

By Lemma 4.1, |o*_{α,P}| = o_p(1), and

(1/M(x(l))) Σ_m Σ_j P*_j(θ) e^{α_{kj,0}(x(l)) + t(α̂_{kj,0}(x(l)) − α_{kj,0}(x(l))) + ε_{mj} + t(ε̂_{mj} − ε_{mj})} 1(X_m = x(l)) →_p E[ Σ_j P*_j(θ) e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l)) ] ≤ E[ Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l)) ],

where the last inequality holds because P*_j(θ) ≤ 1. The key observation is that E[Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l))] is independent of the parameter θ. Therefore

(1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(1) 1(X_m = x(l)) ≥ (1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(0) 1(X_m = x(l)) − o_p(1) = sup_{(P_j)_{j=1}^J ∈ ∆_{J−1}} M_n({P_j}_{j=1}^J, θ) − o_p(1),

where the last equality holds by the definition of M_n({P_j}_{j=1}^J, θ), since {P*_j(θ)}_{j=1}^J is the maximizer of M_n({P_j}_{j=1}^J, θ). In particular, the o_p(1) term J K C² |o*_{α,P}| E[Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l))] is independent of θ, so the result of the lemma follows.

Lemma C.2. sup_{θ∈Θ, (P_j)_{j=1}^J ∈ ∆_{J−1}} | M_n({P_j}_{j=1}^J, θ) − M({P_j}_{j=1}^J, θ) | = o_p(1).

Proof.
Let ((P_j)_{j=1}^J, θ) and ((P̄_j)_{j=1}^J, θ̄) be two points in ∆_{J−1} × Θ. Then

| Σ_j P_j e^{α_{kj,0}(x(l)) + ε_{mj}} F̃_k(s|ε_m; θ) 1(X_m = x(l)) − Σ_j P̄_j e^{α_{kj,0}(x(l)) + ε_{mj}} F̃_k(s|ε_m; θ̄) 1(X_m = x(l)) |
≤_(1) [ Σ_{i=1}^{dim(θ)} |θ̄_i − θ_i| |∂F̃_k/∂θ_i| + sup_j |P_j − P̄_j| ] Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l))
≤_(2) C dim(θ) ||((P̄_j)_{j=1}^J, θ̄) − ((P_j)_{j=1}^J, θ)||_∞ Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l))
≤ C C̄ dim(θ) ||((P̄_j)_{j=1}^J, θ̄) − ((P_j)_{j=1}^J, θ)|| Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l)),

where ||·||_∞ is the sup norm on a vector and C̄ is a constant such that ||·||_∞ ≤ C̄||·||. Inequality (1) follows from the mean value theorem and inequality (2) follows from Assumption 4.3. Then by Theorem 2.7.11 in Wellner and van der Vaart (2013), we have the uniform convergence.
If Assumption C.1 holds, then sup_{θ∈Θ} | P̂_s(θ) − P̃_s(θ) | = o_p(1).

Proof.
Lemma C.1 and Lemma C.2 imply that sup_{θ∈Θ} | M_n({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ) − M({P̃_{0,kj,s}(x(l))}_{j=1}^J, θ) | = o_p(1). So we have

sup_θ [ M({P̃_{0,kj,s}(x(l))}_{j=1}^J, θ) − M({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ) ] ≤ sup_θ [ M({P̃_{0,kj,s}(x(l))}_{j=1}^J, θ) − M_n({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ) ] + o_p(1) = o_p(1),

where the last equality holds by Lemma C.2. By Assumption C.1, for any κ > 0 with associated ζ > 0, the event d((P̂_{0,kj,s}(x(l); θ))_{j=1}^J, (P̃_{0,kj,s}(x(l); θ))_{j=1}^J) > κ is contained in the event sup_θ [M({P̃_{0,kj,s}(x(l))}_{j=1}^J, θ) − M({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ)] > ζ, therefore

Pr( d((P̂_{0,kj,s}(x(l); θ))_{j=1}^J, (P̃_{0,kj,s}(x(l); θ))_{j=1}^J) > κ ) ≤ Pr( sup_θ [M({P̃_{0,kj,s}(x(l))}_{j=1}^J, θ) − M({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ)] > ζ ) → 0.

The result follows by taking the union over the finite indices k = 1, ..., K and l = 1, ..., L. (Such a norm constant can always be found because all norms on a finite-dimensional vector space are equivalent.)

Lemma C.4. Let F̂_k(s|θ) ≡ (1/M) Σ_{m=1}^M F̃_k(s|ε̂_m; θ) and let F̃_k(s|θ) ≡ ∫ F̃_k(s|ε; θ) dG(ε). The following holds under Assumption 4.3:

sup_{θ∈Θ} | F̂_k(s|θ) − F̃_k(s|θ) | = o_p(1).

Proof.
We look at the following expansion:

| F̂_k(s|θ) − F̃_k(s|θ) | = (1/M) | Σ_{m=1}^M [F̃(s|ε̂_m; θ) − F̃(s|ε_m; θ)] + Σ_{m=1}^M [F̃(s|ε_m; θ) − E_ε(F̃(s|ε_m; θ))] |
≤ | (1/M) Σ_m Σ_j (∂F̃/∂ε_j)(ε̂_{mj} − ε_{mj}) | + | (1/M) Σ_m [F̃(s|ε_m; θ) − E_ε(F̃(s|ε_m; θ))] |
≤ (C/M) | Σ_m Σ_j (ε̂_{mj} − ε_{mj}) | + | (1/M) Σ_m [F̃(s|ε_m; θ) − E_ε(F̃(s|ε_m; θ))] |,   (C.4)

where the last inequality holds by Assumption 4.3. Now we use the expansion of ε̂_{mj} − ε_{mj} in (C.3) to get

(C/M) | Σ_m Σ_j (ε̂_{mj} − ε_{mj}) | ≤ J² K C² | max_{j,k}{ max{ |α̂_{kj}(l) − α_{kj}(l)|, |P̂_{0,kj}(l) − P_{0,kj}(l)| } } | = o_p(1).

Note that F̃(s|ε_m; θ) is a Donsker class indexed by θ by Assumption 4.3, which implies

sup_{θ∈Θ} | (1/M) Σ_m [F̃(s|ε_m; θ) − E_ε(F̃(s|ε_m; θ))] | = o_p(1).

Combining the two terms in (C.4), we get sup_θ | F̂_k(s|θ) − F̃_k(s|θ) | = o_p(1).

Lemma C.5.
sup_{θ∈Θ, j=1,...,J, k=1,...,K, l=1,...,L} | P̂_{0,kj,s}(x(l); θ) F̂_k(s|θ) − P̃_{0,kj,s}(x(l); θ) F̃_k(s|θ) | = o_p(1).

Proof. This follows directly from Lemmas C.3 and C.4.
Lemma C.6.
Consider

g*_{l,j,k}(θ, m̃s_m, D_m, X_m, P̃_s) = [ m̃s_{mj} − Σ_{k=1}^K h*_{kj}(θ, P̃_s, x(l)) d_{mk} ] d_{mk} 1(X_m = x(l)),   (C.5)

h*_{kj}(θ, P̃_s, x(l)) = Σ_s [ P̃_{0,kj,s}(x(l), θ) F̃_k(s|θ) ].   (C.6)

Equations (C.5) and (C.6) differ from (4.5) and (4.6) because (C.5) and (C.6) use the true unconditional choice probability instead of its estimator. Define

L_n(θ) = ( (1/N) Σ_{m=1}^N g*(θ) )′ W ( (1/N) Σ_{m=1}^N g*(θ) ),

where g*(θ) collects g*_{l,j,k} for all l, j, k indices. Then θ̂ is an o_p(1)-minimizer of L_n(θ), i.e. L_n(θ̂) ≤ min_θ L_n(θ) + o_p(1).

Proof. Note that θ̂ = argmin_θ L̂_n(θ), where L̂_n(θ) is the objective function of (4.7). I first denote

∆_{l,j,k}(θ) = g*_{l,j,k}(θ, m̃s_m, D_m, X_m, P̃_s) − g_{l,j,k}(θ, m̃s_m, D_m, X_m, P̂_s),

where g_{l,j,k} is defined in (4.5). Using the expressions for g*_{l,j,k} and g_{l,j,k}, we have

sup_{θ∈Θ} |∆_{l,j,k}(θ)| ≤ sup_{θ∈Θ} | Σ_{k=1}^K Σ_s ( P̂_{0,kj,s}(x(l); θ) F̂_k(s|θ) − P̃_{0,kj,s}(x(l); θ) F̃_k(s|θ) ) 1(X_m = x(l)) | = o_p(1),   (C.7)

where the convergence follows from Lemma C.5. The difference L_n(θ) − L̂_n(θ) = ∆(θ)′ W ∆(θ), where ∆(θ) = (∆_{l,j,k}(θ))_{l,j,k}. Then by (C.7),

sup_θ | L_n(θ) − L̂_n(θ) | ≤ ||∆(θ)||² max eig(W) = o_p(1).

Now I look at L_n(θ̂).
Suppose we can find θ* such that L_n(θ*) ≤ inf_{θ∈Θ} L_n(θ) + o_p(1). Then

L_n(θ̂) = L̂_n(θ̂) + [L_n(θ̂) − L̂_n(θ̂)]
≤_(1) L̂_n(θ*) + [L_n(θ̂) − L̂_n(θ̂)]
= L_n(θ*) + [L_n(θ̂) − L̂_n(θ̂)] − [L_n(θ*) − L̂_n(θ*)]
=_(2) L_n(θ*) + o_p(1)
≤ inf_{θ∈Θ} L_n(θ) + o_p(1),

where inequality (1) holds by the definition of θ̂, and equality (2) holds because both bracketed terms are o_p(1): we have shown sup_θ |L_n(θ) − L̂_n(θ)| ≤ ||∆(θ)||² max eig(W) = o_p(1).

Lemma C.7.
Let L(θ) = E[g*(θ)]′ W E[g*(θ)]. Then sup_{θ∈Θ} | L_n(θ) − L(θ) | = o_p(1).
Define the difference

∆*_{l,j,k}(θ) = (1/N) Σ_m ( m̃s_{mj} 1(X_m = x(l)) d_{mk} − E[m̃s_{mj} 1(X_m = x(l)) d_{mk}] ) + Σ_{k′=1}^K ( (1/N) Σ_m [ d_{mk′} 1(X_m = x(l)) d_{mk} − E[d_{mk′} 1(X_m = x(l)) d_{mk}] ] ) P̃_{0,k′j,s}(θ) F̃_{k′}(s|θ).

Observe that P̃_{0,k′j,s}(θ) F̃_{k′}(s|θ) ∈ [0, 1] because it is the product of two probabilities. Moreover, d_{mk} ∈ [0, 1]. Therefore we can bound ∆*(θ), the vector of ∆*_{l,j,k} over all l, j, k indices, by

||∆*(θ)|| ≤ J K L | (1/N) Σ_m ( m̃s_{mj} − E[m̃s_{mj}] ) | + J K² L max_{k,k′} | (1/N) Σ_m ( d_{mk} d_{mk′} − E[d_{mk} d_{mk′}] ) |.   (C.8)

The right-hand side of (C.8) does not depend on θ. By applying the weak law of large numbers to the sample means of m̃s_{mj} and d_{mk} d_{mk′}, we have sup_{θ∈Θ} ||∆*(θ)|| = o_p(1). Then notice that L(θ) − L_n(θ) = ∆*(θ)′ W ∆*(θ), so we have

sup_{θ∈Θ} | L(θ) − L_n(θ) | ≤ max eig(W) ||∆*(θ)||² = o_p(1).
Proof.
The consistency of θ̂ follows from the identification assumption inf_{d(θ,θ₀)>ζ} L(θ) > L(θ₀), the fact that θ̂ is an o_p(1)-minimizer of L_n by Lemma C.6, and the uniform convergence sup_{θ∈Θ} |L_n(θ) − L(θ)| = o_p(1) by Lemma C.7. The conditions of Theorem 5.7 in Van der Vaart (2000) are therefore satisfied, so θ̂ →_p θ₀.

D Discussion of Computation
The estimators in the main text are constructed in two steps. While joint estimation of (α, β, P₀, θ) is possible, the computational burden is heavy. Markets with persuasion also provide identification power for the first stage parameters (α, β, P₀), but exploiting it requires running the contraction mapping each time I search over the higher dimensional parameter space that includes θ. Moreover, the estimation of θ requires solving the empirical optimization problem (4.4) for given first stage parameters. In the two-step estimation, I simply plug in the first stage estimator and solve (4.4) for different values of θ, while in joint estimation the optimization problem has to be repeated for each guessed value of (α, β, P₀).

The computational burden also comes from the contraction mapping, because I need to iterate over M markets. I therefore use the following trick to convert the M contraction mappings into one single contraction mapping.

Proposition 5.
Let T_m(δ) : R^d → R^d be a contraction mapping for each m = 1, ..., M. Then T ≡ (T_1, ..., T_M) : R^{dM} → R^{dM} is a contraction mapping acting on (δ_1, ..., δ_M).

Proof. Let C_m < 1 be the contraction constant such that |T_m(δ) − T_m(δ′)| ≤ C_m |δ − δ′|. Then C(M) = max_m C_m < 1 is a contraction constant for T.

The computational burden of this combined T can still be high because: 1. even though iterating on a stacked vector is faster than looping over M markets, the iteration is still slow when M is large; 2. the uniform contraction constant C(M) can be close to one, so the number of iterations needed to achieve a given tolerance may be large. The following algorithm helps reduce the running time:
• Set up a tolerance level tol and a threshold integer K_thr. Run the iteration on T, and count the number of markets m such that |T_m(δ_m) − δ_m| < tol; denote this number by K_con.
• When K_con > K_thr, collect the indices of the remaining markets and construct the new contraction T′ = {T_m}_remain. Iterate until convergence.
• Multiple thresholds for deciding the remaining markets can be set up to further boost the speed.
This algorithm exploits the fact that the contraction mapping is simply a stacking of individual market mappings. The intuition is that if C_m ∈ {C_small, C_large} and the number of markets that fall into the C_large group is relatively small, the algorithm keeps iterating only on the slow C_large markets and avoids excessive iteration on markets with C_small.
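The stacking-with-dropout algorithm above can be sketched as follows. This is a simplified scalar version with names of my own; the paper's markets would use vector-valued δ_m, and here a single per-market tolerance plays the role of the thresholds:

```python
def stacked_iterate(maps, deltas, tol=1e-10, max_iter=10_000):
    """Iterate M contraction mappings jointly, freezing each market's delta
    once its own update falls below tol (the dropout trick in the text)."""
    active = set(range(len(maps)))
    for _ in range(max_iter):
        if not active:
            break
        for m in list(active):
            new = maps[m](deltas[m])
            if abs(new - deltas[m]) < tol:
                active.discard(m)      # market m has converged; stop updating it
            deltas[m] = new
        # slow markets (large C_m) keep iterating after fast ones drop out
    return deltas

# Usage: two scalar contractions with very different moduli C_m
maps = [lambda d: 0.5 * d + 1.0,      # fixed point 2.0, C_m = 0.5
        lambda d: 0.99 * d + 0.01]    # fixed point 1.0, C_m = 0.99
out = stacked_iterate(maps, [0.0, 0.0])
print(out)
```

The fast market converges in a few dozen iterations and is dropped, so the remaining ~1800 iterations needed by the slow market are not wasted on already-converged components, which is exactly the saving the proposition's algorithm targets.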