Identification and Estimation of a Rational Inattention Discrete Choice Model with Bayesian Persuasion

Moyu Liao*
Pennsylvania State University

September 18, 2020
Abstract
This paper studies the semi-parametric identification and estimation of a rational inattention model with Bayesian persuasion. The identification requires the observation of a cross-section of market-level outcomes. The empirical content of the model can be characterized by three moment conditions. A two-step estimation procedure is proposed to avoid computational complexity in the structural model. In the empirical application, I study the persuasion effect of Fox News in the 2000 presidential election. Welfare analysis shows that persuasion will not influence voters with a high school education but will generate higher dispersion in the welfare of voters with a partial college education and decrease the dispersion in the welfare of voters with a bachelor's degree.

*First draft: 03/22/2019. This version: 09/16/2020. I would like to thank Marc Henry, Sun Jae Jun, Peter Newberry, Karl Schurter, Jia Xiang, Zhiyuan Chen and conference attendees at the 2019 Econometric Society Asian Meeting for useful comments.

1 Introduction
In many applications of discrete choice models, econometricians usually assume the decision maker has the following random utility from choosing item j among a choice set J = {1, ..., J}: U_j = u_j + ε_j, where u_j is the mean utility observed by the econometricians and ε_j is the utility shock known to the decision maker but not the econometrician. Decision makers in the model choose the item with the highest utility. When the unobserved shock follows the Type I extreme value distribution, we can solve the probability of choosing j analytically. Aggregating the choice outcomes of the decision makers in the market, we can get the market share of an item. This approach to studying market structure was initiated by McFadden (1973), was then adopted by Berry, Levinsohn, and Pakes (1995) (henceforth BLP) to study automobile markets, and became widely applied to other industries.

This model, however, is not easily adaptable to accommodate persuasion in a structural way. Take advertising as a form of persuasion. In the classical analysis of the effect of advertising, three approaches are adopted. The first is to model advertising as a feature of the item that enters mean utility u_j = u_j(A), where the level of advertising A affects the choice utility. The argument is that advertising is 'persuasive' and the individual will buy more of the advertised goods because their utility is distorted (Dorfman and Steiner, 1954). This reduced-form approach does not offer much explanation of how advertisement influences decision making and market structure. The second approach is to model advertisement as the trigger of a consideration set change (Goeree, 2008). The consideration set is the a priori set of items that the decision maker chooses from. Advertisement thus serves as the trigger that puts a previously non-considered item into the consideration set.
This approach views advertisement as an information-revealing device that reveals the true ε_j to the decision maker, which was previously treated as −∞ by the decision maker. If a good j is already in the consideration set for all customers, the consideration set model predicts that advertising has no effect on the market share. If a good is well known, the consideration set model cannot explain why sellers advertise. The third approach is to view advertising as a signaling device to separate the high-quality product from the low-quality product (Nelson, 1974; Bagwell and Ramey, 1988). The degree of advertisement serves as the signal that induces the separating equilibrium where only high-quality firms advertise. In particular, they assume the unobserved quality is common for all decision makers. However, this approach requires the decision maker in the model to have imperfect knowledge of ε_j, which contradicts the assumption that ε_j is known by the decision maker.

Compared to the classical approaches to modeling persuasion, this paper develops an empirical model of persuasion using the Bayesian persuasion theory in Kamenica and Gentzkow (2011). The Bayesian persuasion approach to modeling advertising differs from the previously mentioned informative view in two ways: 1. Decision makers in the model can face different realizations of the product quality; 2. The advertiser, who acts as the Bayesian persuader, does not always want to reveal the quality honestly. However, similar to the informative view, the Bayesian persuasion model assumes that the decision maker in the model only has a prior belief on {ε_j}_{j=1}^J and the exact realization of {ε_j}_{j=1}^J is unknown. The decision maker's prior distribution of {ε_j}_{j=1}^J comes from the reputation of the goods. The prior belief is likely to be common across decision makers.
However, the standard Bayesian persuasion model assumes the decision makers only have access to the signal sent by the persuader to update their belief, and no other sources of information are available. In the real world, decision makers will also actively search for information on the goods' quality by themselves. For example, if a person wants to buy a car, he or she will take a test drive before making a decision. An extensive search for information can reduce the randomness of {ε_j}_{j=1}^J but at the same time is costly. Matejka and McKay (2015) consider a model where the decision maker searches for information on {ε_j}_{j=1}^J to maximize the expected utility after deducting the search cost. Their rational inattention discrete choice model can incorporate Bayesian persuasion by assuming persuaders send signals before the decision makers acquire their own information.

The analysis of structural persuasion and information search models has largely been conducted under the assumption that the decision makers' prior belief on {ε_j}_{j=1}^J, denoted by G, is known to the economist. In empirical research, the prior belief G is unknown and should be estimated from data. A recent empirical study by Xiang (2020) assumes the decision makers' prior distribution G is normally distributed and analyzes the decision makers' welfare change when a policy change induces the persuader to change the persuasion strategy. However, the empirical content of a parametric assumption on G is unclear.

This paper follows Matejka and McKay (2015) to consider a rational inattention discrete choice model with Bayesian persuasion. I discuss the non-parametric identification of the prior distribution G and the parametric identification of the persuader's persuasion strategy when an econometrician observes the choice ratio at the market level across many independent markets.
The independent markets are divided into two groups: the first group is not influenced by the persuader and the second group is influenced by the persuader. The prior distribution G is identified from the choice ratios in the first group of markets. Given the identification of G, a parametric persuasion strategy is identified from the second group of markets. I characterize a set of moment conditions implied by the model, and standard estimation methods such as GMM can be applied easily.

For econometricians who already observe the market shares with and without the influence of persuasion, identifying the persuasion strategy is the first step to understanding the behavior of the persuader. If we assume the persuader uses a persuasion strategy to maximize some utility function, the identified persuasion strategy can help us understand the persuader's objective function. The analysis in this paper leaves the persuader's objective function unspecified and analyzes the behavior from the buyers' side. A complete two-sided analysis would incorporate the persuader's utility as a function of the persuasion strategy and analyze the problem as a sequential game played between the persuader and the buyers.

For policymakers, the knowledge of the prior belief G makes it possible to evaluate the effect of regulating the persuasion strategy. In the advertisement market, for example, policymakers can ban one seller from directly revealing information about its competitors' products. Moreover, policymakers can also evaluate the effect of providing less costly information to the decision makers. In other words, policymakers can compete with existing persuaders in the markets to increase the decision makers' welfare.

In the empirical application, I look at the 2000 presidential election in the United States. I treat the presidential candidates as voters' choices and view voting statistical areas as separate markets.
In 1996, Fox News was launched, and by 2000 it had entered approximately 30% of the towns in the United States. DellaVigna and Kaplan (2007) show that Fox News motivated voters to vote for Republicans compared to voters in towns without Fox News. I take the data and analyze how Fox News persuaded voters in different towns.

The estimated results from the markets without Fox News show that the prior belief on the quality of the presidential candidates varies substantially with voters' education level. Both voters with bachelor's degrees and those with only high school degrees prefer the Democratic Party over the Republican Party. The estimated results also show that Fox News provided very little information to voters, but managed to manipulate the voting outcome by a significant margin. I also compare the welfare of voters with different education levels. Voters' welfare is defined as the probability of choosing their first best choice, and their first best choice is the presidential candidate that would generate the highest utility to voters if they knew the realization of {ε_j}_{j=1}^J. The result shows that persuasion will not influence the welfare of voters with a high school education but will generate higher dispersion in the welfare of voters with a partial college education and decrease the dispersion in the welfare of voters with a bachelor's degree.

Another way to study the effect of persuasion is to model the presence of a persuader as a treatment status (Jun and Lee, 2018). In their model, the presence of a Bayesian persuader is taken as a treatment assignment, and sharp bounds on the persuasion effect are given under various data generating processes. The treatment effect model does not specify the decision makers' utility, and thus an analysis of the decision makers' welfare before and after persuasion is not possible.
The treatment effect model also makes it hard to consider policy counterfactuals such as regulations on the persuasion strategy or the policymaker providing extra information in the market.

The rest of the paper is organized as follows. Section 2 introduces the rational inattention discrete choice model with persuasion. Section 3 discusses the data generating process and the identification strategy. Section 4 discusses the estimation strategy. Section 5 studies the 2000 presidential election and the effect of Fox News. Section 6 concludes.

2 The Model

I consider the standard random utility specification: a decision maker (DM) derives utility level U_j from good j in the choice set J = {1, ..., J}:

U_j = u_j + ε_j.

Here u_j is the mean utility of choosing good j and ε_j is the individual-specific random draw of the utility shock. Throughout this section, I assume that the decision maker knows only (u_1, ..., u_J) but not (ε_1, ..., ε_J). The decision maker has a prior belief G on the distribution of the utility shock: (ε_1, ..., ε_J) ≡ ε ∼ G. If there is no further information about the true utility shock ε, the decision maker will choose the item with the highest expected utility:

j ∈ a(G) ≡ arg max_{j∈J} E_G[u_j + ε_j].   (2.1)

If arg max_{j∈J} E_G[u_j + ε_j] is not a singleton, we let a(G) be an arbitrary selection of maximizers. The maximized utility derived from the belief G is given by

V(G) ≡ max_{j∈J} E_G[u_j + ε_j].   (2.2)

I will first introduce the rational inattention discrete choice model and then discuss how persuasion can be incorporated.

2.1 The Rational Inattention Discrete Choice Model

The rational inattention discrete choice model in Matejka and McKay (2015) assumes that the decision maker can choose an information strategy to get a signal s^DM. The signal s^DM updates the decision maker's belief on the true utility shock ε. The decision maker then chooses the item with the highest posterior mean utility according to (2.1).
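As a concrete illustration of the no-information benchmark, the decision rule (2.1) and value (2.2) can be computed directly for a discrete prior. The sketch below uses made-up numbers purely for illustration; it is not code from the paper.

```python
import numpy as np

# Hypothetical example: J = 3 items and a discrete prior G over the shock
# vector (eps_1, eps_2, eps_3) with two equally likely states.
u = np.array([1.0, 0.5, 0.0])              # mean utilities u_j
eps_states = np.array([[0.0, 2.0, 0.0],    # eps realization in state 1
                       [0.0, -2.0, 0.0]])  # eps realization in state 2
prior = np.array([0.5, 0.5])               # prior G over the two states

exp_u = prior @ (u + eps_states)           # E_G[u_j + eps_j] for each j
a_G = int(np.argmax(exp_u))                # decision rule a(G) in (2.1)
V_G = float(exp_u.max())                   # value V(G) in (2.2)
print(a_G, V_G)                            # item 0 is chosen, V(G) = 1.0
```

Note that the shock on item 2 averages out under the prior, so without further information the DM chooses on mean utilities alone.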
Following the notation in Matejka and McKay (2015), denote v_j ≡ u_j + ε_j. Formally, the decision maker's information strategy is a joint distribution of the true utility vector v ∈ R^J and the signal s^DM ∈ R^J, denoted by F(s^DM, v). The marginal distribution of the information strategy has to be consistent with the prior belief G. Once the decision maker is committed to the information strategy, the random shocks to utility are realized, and then the decision maker gets a realized signal s^DM from F(s^DM | v). The decision maker updates his belief to F(ε | s^DM) and chooses the item in a(F(ε | s^DM)) according to (2.1).

Since the real utility shocks are not observed by the decision maker, the decision maker solves the following optimization problem to maximize his expected utility:

max_{F ∈ Δ(R^J)} ∫_v ∫_{s^DM} V(F(· | s^DM)) F(ds^DM | v) G(dv) − c(F)   (2.3)

s.t. ∫_{s^DM} F(ds^DM, v) = G(v),   (2.4)

where V(F(· | s^DM)) is determined by (2.2). The constraint (2.4) requires that the DM's prior distribution G be consistent with the real state of the world. The cost of information c(F) is the mutual information between the shocks ε and the signal s^DM:

c(F) = λ{H(G) − E_s[H(F(· | s^DM))]},   (2.5)

where the parameter λ is the unit cost of information, and E_s denotes the expectation over the signal marginal of F(s^DM, v). The entropy function H of a discrete distribution G is defined as H(G) = −Σ_k P_k log(P_k), where P_k is the probability of state k. When G is continuously distributed, the differential entropy is defined as H(G) = −∫ g(s) log(g(s)) ds. The use of entropy reduction as a measure of information cost is standard in the rational inattention literature. See De Oliveira et al. (2017) for a discussion of entropy cost.
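The mutual-information cost (2.5) can be computed directly when the state and signal are discrete. The minimal sketch below, with made-up numbers, checks the two polar cases: a fully revealing information strategy pays the whole prior entropy H(G), while a signal independent of the state is free.

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(p) = -sum_k p_k log p_k of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log(p[nz]))

def info_cost(joint, lam=1.0):
    """Mutual-information cost (2.5): lam * (H(G) - E_s[H(F(.|s))]).

    joint[s, v] is a discrete joint distribution of signal s and state v;
    its column sums give the prior G, its rows give the posteriors F(.|s)."""
    G = joint.sum(axis=0)                 # marginal over states (the prior)
    p_s = joint.sum(axis=1)               # marginal over signals
    cond_H = sum(p_s[i] * entropy(joint[i] / p_s[i])
                 for i in range(len(p_s)) if p_s[i] > 0)
    return lam * (entropy(G) - cond_H)

# Fully revealing strategy: the signal pins down the state, cost = H(G) = log 2.
full = np.array([[0.5, 0.0],
                 [0.0, 0.5]])
# Uninformative strategy: signal independent of the state, cost = 0.
none = np.array([[0.25, 0.25],
                 [0.25, 0.25]])
print(info_cost(full), info_cost(none))
```

The intermediate strategies the DM actually uses in the model lie between these two extremes, trading off expected utility against this cost.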
Moreover, the entropy is related to the complexity of a random variable and can be given a data compression interpretation. The mutual information in (2.5) can be interpreted as the number of binary questions asked in acquiring the signal s^DM. Appendix A gives an example of the data compression interpretation.

Let S^DM_j ≡ {s^DM ∈ R^J : a(F(· | s^DM)) = j} be the set of signals that lead the DM to choose j. Also denote

P_j(v) ≡ ∫_{S^DM_j} F(ds^DM | v)   (2.6)

as the conditional choice probability of choosing item j when the realized utility vector is v.¹ Also define the unconditional choice probability of choosing j as

P_j = ∫_v P_j(v) dG(v).   (2.7)

This is the ex-ante probability of choosing j before the utility vector is realized.

A set of optimality conditions for the problem (2.3)-(2.5) from Matejka and McKay (2015) is summarized in the following lemma.

¹ Note that the DM does not know the realization of v. The conditional choice probability should be understood as the choice probability when the actual utility vector is v.

Lemma 2.1. If λ > 0 and F is an optimal information strategy that solves (2.3)-(2.5), then the conditional choice probability in (2.6) satisfies

P_j(v) = P_j e^{v_j/λ} / Σ_{k∈J} P_k e^{v_k/λ} a.s.,   (2.8)

E_G[P_j(v)] = P_j.   (2.9)

The unconditional choice probability in (2.7) solves the following convex optimization problem:

max_{{P_j}_{j=1}^J} ∫_v λ log(Σ_{j=1}^J P_j e^{v_j/λ}) G(dv) s.t. ∀j: P_j ≥ 0, Σ_{k=1}^J P_k = 1.   (2.10)

Conversely, if {P_j}_{j=1}^J is the solution to (2.10), and P_j(v) defined in (2.8) satisfies (2.9), then we can construct an information strategy F such that:

• The signal s^DM is supported on J points: {s_1, ..., s_J};
• The conditional distribution of s^DM satisfies Pr_F(s^DM = s_j | v) = P_j(v).

This information strategy F solves the optimization problem (2.3)-(2.5).

Proof.
See Theorem 1 and Lemma 2 in Matejka and McKay (2015).

Lemma 2.1 shows that solving the optimization problem (2.3)-(2.5) is equivalent to solving the optimization problem (2.10). We do not observe the DM's optimal information strategy. Instead, we observe their choice outcomes. When we aggregate the choice outcomes to the market level, they become the conditional and unconditional choice probabilities.

We should note that the conditional choice probability (2.8) takes a logit-like form. However, the rational inattention discrete choice model does not imply the usual IIA restrictions on the choice probabilities. Matejka and McKay (2015) discuss two equivalent conditions on the conditional choice probability (2.8).

2.2 A Sequential Persuasion Game
Consider a persuader that tries to influence the choice probabilities by choosing a persuasion strategy and sending a realized signal. The persuader is also called the information designer (ID) in the Bayesian persuasion literature.
Definition 1.
A persuasion strategy is a joint distribution F̃(s^ID, v) of the signal s^ID ∈ R^J sent by the ID and the utility vector such that

∫ F̃(s^ID, v) ds^ID = G(v).

I consider a sequential persuasion game between the decision makers and the information designer played in the following order:

1. The information designer chooses a persuasion strategy and then sends the realized signal s^ID to the decision maker;
2. The decision maker updates his belief to the intermediate distribution:

G̃_{s^ID} ≡ G̃(v | S^ID = s^ID) = G(v) × F̃(s^ID | v) / ∫_v F̃(s^ID, v) dv;   (2.11)

3. The decision maker solves the optimization problem (2.3)-(2.5) with the intermediate belief G̃_{s^ID};
4. The decision maker gets a realized signal s^DM from his optimal information strategy F. He then makes the choice based on the updated belief F(v | s^DM).

For the persuasion strategy to work, it is assumed that the DM who receives the signal knows the joint distribution F̃(s^ID, v).

Assumption 2.1.
The persuasion strategy F̃(s^ID, v) is common knowledge.

Assumption 2.1 on F̃ is satisfied when there is an underlying equilibrium determining how the information designer chooses the persuasion strategy. For example, the information designer can have an objective function M : Δ(R^J) → R so that F̃ = arg max_{F∈C} M(F), where C ⊂ Δ(R^J) is some constrained set of persuasion strategies. When the objective function M and the constraint set C are known by the DM, the decision maker can solve the information designer's optimization problem to get F̃. This paper does not tackle the information designer's objective function. The objective function for the information designer is not easy to formulate. In the marketing context, the trade-off is between the higher marketing cost of persuasion and higher sales. In the context of political persuasion, the goal of persuasion is not to maximize the voting share but to increase the voting share until it exceeds 50%. Also, the media outlet that conducts persuasion may care about other aspects of persuasion, since its persuasion strategy can influence its audience ratings.

The setting of the persuasion game is different from the setting in Bloedel and Segal (2018). In their setting, the decision maker chooses an information strategy to understand the signal sent by the sender. In other words, the decision maker in their model pays an attention cost to understand the signal from the sender and cannot acquire a signal about the true utility by himself. In my formulation, there is no cost to understand the signal s^ID from the sender, and there is a cost incurred in acquiring information about the true utility vector.

Remark 2.1.
The persuasion strategy F̃ and the information strategy F lie in the same space. The effect of persuasion is limited because decision makers can acquire their own information. While the ID can distort the prior distribution of the utility vector v through F̃, the decision maker's information strategy F(s^DM, v) can still provide information to the decision maker.

3 Data Generating Process and Identification

In this section, I discuss a data generating process that allows us to non-parametrically identify the prior belief G and parametrically identify the persuasion strategy F̃. To allow for heterogeneity in decision makers' preferences, I assume the utility of an individual i in market m in demographic group k takes the following additively separable form:

U_ikjm = u_1(x_mj, β) + u_2(x_mj, ν_ik, α) + ε_jm,   (3.1)

where x_mj is the vector of characteristics of product j in market m; k is the index for people of demographic group k with demographic characteristics ν_ik; and m is the market index. The utility functions u_1, u_2 are of known parametric form, and α, β are two vectors to be estimated, but the distribution G is left non-parametric. The utility (3.1) assumes that the decision makers' demographic and product characteristics only influence their mean utility but not the utility shocks. Here I assume that all DMs in the same market m face the same realization ε_m = (ε_1m, ..., ε_Jm), since the random shock vector ε_m in equation (3.1) does not depend on the individual index i. In particular, if individuals i and i′ are in the same market and have the same demographic characteristics, they have the same realized utility vector. This specification is reasonable when the shock is market-specific. For example, when we study the voting decision, the market realization of ε_m can be the real payoff of candidate j's policy for town m's local industry.
In the automobile industry, this market-level state of the world may come from the local road conditions, climate, or geographic topology.

Notation
Throughout this section, I use a tilde (˜) to denote probability quantities related to markets with the presence of a persuader. I also drop the superscript on s^ID and use s to denote signals sent by the information designer whenever there is no confusion. I use m to denote the index for markets, j to denote the index of products, and k to denote the index of demographic groups.

In many data sets, we do not observe individual choices. Instead, we observe the market share, which aggregates the individual choices. Across different markets, I assume that the prior distribution on ε_jm is the same G.

Assumption 3.1. (Data) (i) We observe a binary variable χ_m such that χ_m = 1 if and only if the persuader is present in market m; (ii) The demographic heterogeneity ν_k is discrete and supported on K points. For each market m, the distribution of demographic heterogeneity D_m = (d_m1, ..., d_mK) is observed, where d_mk is the proportion of DMs in group k in market m; (iii) We observe the market characteristics X_m in each market m and the market share vector ms_m = (ms_m1, ..., ms_mJ), where ms_mj is the market share of product j.

Assumption 3.2. (DGP without Persuasion) For markets with χ_m = 0, the data generating process satisfies:

1. Common prior: ε_m ∼ G;
2. Independent random utility shock: ε_m ⊥ X_m;
3. Independent demographic distribution: D_m ⊥ (ε_m, X_m);
4. The choice set J and the information cost λ are the same across different markets.

Assumption 3.2 imposes that the mean of ε_m is independent of the product characteristics and is normalized to zero. If there are any unobserved characteristics correlated with X_mj, the unobserved effects are captured by the observed X_mj.

For markets with a persuader, I assume that the persuader is the same across these markets and uses the same persuasion strategy. Moreover, I assume that the persuasion strategy is a joint distribution of ε_m and s^ID. This specification differs from Definition 1, and the persuader uses the same persuasion strategy even if the product characteristics X_mj vary across markets.

Assumption 3.3. (DGP with Persuasion) For markets with χ_m = 1, the data generating process satisfies:

1. (ε_m, X_mj, D_m) and (λ, J) satisfy the conditions in Assumption 3.2;
2. There is a uniform persuader across markets with χ_m = 1, and the persuasion strategy F̃_k(s^ID, ε) can depend on the demographic group k but not on the market;
3. The persuasion signal s^ID_km ∼ F̃_k(s^ID | ε_m), and the signals s^ID_km are independent of each other across demographic groups and markets;
4. Signal independence: (s^ID_k, ε_m) ⊥ D_m.

Normalization

Since the permutation of the item index does not matter, I call the last item J the outside option. Note that in the discrete choice model, only the relative difference of utilities matters for the DM. Therefore, we can normalize the utility of the outside option J to zero: U_kJ = 0. Also, when u_1, u_2 are homogeneous of degree one with respect to α, β, the vector (α, β, λ, G) is not identified. Indeed, we can consider a model with (cα, cβ, cλ, cG), where cG is the distribution of cε_m. The model (cα, cβ, cλ, cG) will generate the same choice probabilities (2.6) and (2.7).
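This scale non-identification can be verified numerically from the conditional choice probability formula (2.8): scaling the realized utilities and the information cost λ by the same constant c leaves the choice probabilities unchanged. A minimal sketch with arbitrary illustrative numbers:

```python
import numpy as np

def cond_choice_prob(P0, v, lam):
    """Conditional choice probabilities (2.8): P_j e^{v_j/lam} / sum_k P_k e^{v_k/lam}."""
    w = P0 * np.exp(v / lam)
    return w / w.sum()

P0 = np.array([0.2, 0.3, 0.5])     # unconditional choice probabilities
v = np.array([1.0, -0.5, 0.0])     # realized utilities u_j + eps_j
c = 3.7                            # arbitrary positive scale

p = cond_choice_prob(P0, v, lam=1.0)
p_scaled = cond_choice_prob(P0, c * v, lam=c)  # model (c*alpha, c*beta, c*lam, c*G)
print(np.allclose(p, p_scaled))    # the scale c cancels in v/lam
```

Since c cancels inside the exponent, no data on choice probabilities can pin it down, which motivates the normalization λ = 1 below.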
Since a linear specification of utility is frequently used in the applied literature, I assume u_1, u_2 are homogeneous of degree one with respect to α, β.

Assumption 3.4. (Normalization) The utility functions u_1, u_2 are homogeneous of degree 1 with respect to (α, β), and λ = 1.

The parameters of interest include the mean utility parameters (α, β), the prior belief G, and the persuasion strategy F̃. For markets without persuasion, we are also interested in P⁰_kj(X), demographic group k's unconditional choice probability of choosing j when the product characteristics are X. If we want to evaluate the overall effect of persuasion across different markets, we want to compare the post-persuasion market share with P⁰_kj(X). I first define the identified set of (α, β, G, P⁰_kj(X)) from the rational inattention discrete choice model.

Definition 2.
Let F_{χ=0} denote the conditional distribution of (D_m, X_m, ms_m) given χ_m = 0. The identified set of (α, β, G, P⁰_kj(X)) under the rational inattention discrete choice model, denoted by Γ_I, is the collection of (α, β, {P⁰_kj(X)}_{j,k}, G) that satisfies the following constraints:

1. Given (α, β, G), {P⁰_kj(X)}_{j,k} solves the individual's optimization problem (2.10) with

v_mj = u_1(x_mj, β) + u_2(x_mj, ν_ik, α) + ε_mj;   (3.2)

2. The unconditional mean of the conditional choice probability is the unconditional choice probability:

E_G[ P⁰_kj(X) e^{v_mj/λ} / Σ_{l∈J} P⁰_kl(X) e^{v_ml/λ} ] = P⁰_kj(X);   (3.3)
3. Consider the mapping

P_mj(α, β, ε, D_m, X_m, {P⁰_kj(X)}_{j,k}) = Σ_k d_mk · P⁰_kj(X) e^{v_mj/λ} / Σ_{l∈J} P⁰_kl(X) e^{v_ml/λ},   (3.4)

where v_mj is defined in (3.2). Then (D_m, X_mj, P_mj(α, β, ε, D_m, X_m, {P⁰_kj(X)}_{j,k})) has the same distribution as F_{χ=0}.

The first two conditions in Definition 2 correspond to the optimization condition (2.10) and the condition (2.9) in Lemma 2.1. Equation (3.4) calculates the market share of product j as the weighted average of the different demographic groups' choice probabilities. The third condition in Definition 2 requires that the model-predicted market share be consistent with the observed data distribution.

I then define the identified set of the persuasion strategy F̃.

Definition 3.
Let F_{χ=1} denote the conditional distribution of (D_m, X_m, ms_m) given χ_m = 1. Given the value of (α, β, G) and a persuasion strategy F̃, consider the map

P̃_kj,s(ε; X_m) = P̃⁰_kj,s(X_m) e^{v_mj} / Σ_{l=1}^J P̃⁰_kl,s(X_m) e^{v_ml},   (3.5)

where P̃⁰_kj,s(X_m) solves the individual optimization problem (2.10) when the belief is G̃(ε) = F̃(ε | s^ID = s). The identified set of the persuasion strategy is the set of F̃(s^ID, ε) such that

(D_m, X_m, Σ_k d_mk P̃_kj,s(ε; X_m))

has the same distribution as F_{χ=1}.

The identified set in Definition 3 is conditioned on the vector (α, β, G). This is because in markets with the persuader, there are two types of unobserved heterogeneity: the utility shock ε_m and the realization of the signal s^ID. In contrast, in markets without the persuader, only the utility shock ε_m exists. Therefore, knowing the prior distribution G reduces the randomness and makes the problem of identifying the persuasion strategy tractable.

The identified set of (α, β, G) in Definition 2 is defined through the rational inattention discrete choice model only, and it ignores the empirical content of the subsequent persuasion stage. There are two reasons to define the identified set in this way. First, if we only have data on markets without any persuader, i.e. χ_m = 0 for all markets, the identified set defined in Definition 2 can still be used. Second, the unobserved persuasion signal s^ID in the persuasion stage makes it hard to characterize the empirical content of the whole persuasion game. I will stick with these two definitions and characterize the corresponding moment conditions.

Recall that given (α, β) and the unconditional choice probabilities {P⁰_kj(X)}_{j,k}, the model-predicted market share is given by (3.4). Following BLP, I denote δ_mj = u_1(x_mj, β) + ε_mj and let δ_m = (δ_m1, ..., δ_mJ).
Then the predicted market share in (3.4) can be written as

P_mj(α, β, ε, D_m, X_m, {P⁰_kj(X)}_{j,k}) = Σ_k d_mk · P⁰_kj(X_m) e^{δ_mj + u_2(x_mj, ν_k, α)} / [ P⁰_kJ(X_m) + Σ_{l=1}^{J−1} P⁰_kl(X_m) e^{δ_ml + u_2(x_ml, ν_k, α)} ] ≡ ms*_j(X_m, δ_m, α, D_m, {P⁰_kj(X)}_{j,k}),

where the utility of the outside option is normalized to zero, so δ_mJ = u_2(x_mJ, ν_k, α) = 0. Consider a mapping T : R^{J−1} → R^{J−1} such that

[T[X_m, ms_m, α, D_m, {P⁰_kj(X)}_{j,k}](δ_m)]_j = δ_mj + log(ms_mj) − log(ms*_j(X_m, δ_m, α, D_m, {P⁰_kj(X)}_{j,k})),   (3.6)

where [T]_j is the j-th entry of the output vector. The input ms_m is the observed market share. When T(δ_m) = δ_m, the observed market share ms_m equals the model-predicted market share. This map is a contraction mapping whenever the outside option has a nonzero unconditional choice probability. As a result, there exists a unique market-level δ_m that matches the observed market share with the model-predicted market share.

Lemma 3.1.
Suppose that in a market we have ms_mJ > 0. Then the mapping defined by (3.6) is a contraction mapping. Let δ*_m denote the fixed point of the contraction mapping (3.6). As a result, the unobserved heterogeneity δ*_m is a function of the observables x_m, D_m, ms_m and the parameters α and {P⁰_kj(X)}_{j,k}.

Now I state the first identification result for the prior distribution G.

Proposition 1.
For each (α, β, {P⁰_kj(X)}_{j,k}) in the identified set Γ_I defined in Definition 2, there exists a unique G* such that (α, β, G*, {P⁰_kj(X)}_{j,k}) ∈ Γ_I. In particular, for any measurable set B, define the set

MS(B; x_m, D_m, ms_m, α, P⁰_kj(X), β) ≡ {ms_m : δ*_m(x_m, D_m, ms_m, α, P⁰_kj(X)) − [u_1(x_mj, β)]_{j=1}^J ∈ B},

where δ*_m is defined in Lemma 3.1 and [u_1(x_mj, β)]_{j=1}^J = (u_1(x_m1, β), ..., u_1(x_mJ, β))′. The G* satisfies

Pr_{G*}(ε_m ∈ B) = Pr_{F_{χ=0}}(ms_m ∈ MS(B; x_m, D_m, ms_m, α, P⁰_kj(X), β)).   (3.7)

Proof.
I prove this statement by contradiction. Suppose there exists a G′ ≠ G* such that (α, β, G′, {P⁰_kj(X)}_{j,k}) is also in the identified set. Then there exists a positively measured set B′ such that

Pr_{G*}(ε_m ∈ B′) ≠ Pr_{G′}(ε_m ∈ B′).

I claim the distribution of ms_m implied by G′, denoted by F′_{χ=0}, is different from F_{χ=0}. By equation (3.4),

Pr_{F′_{χ=0}}(ms_m ∈ MS(B′; x_m, D_m, ms_m, α, P⁰_kj(X), β)) = Pr_{G′}(ε_m ∈ B′) ≠ Pr_{G*}(ε_m ∈ B′) = Pr_{F_{χ=0}}(ms_m ∈ MS(B′; x_m, D_m, ms_m, α, P⁰_kj(X), β)).

Therefore, G′ cannot generate the same data distribution F_{χ=0}, so G′ is not in the identified set by Definition 2.

Proposition 1 states that once we know (α, β, {P⁰_kj(X)}_{j,k}), the distribution G is point identified. This is similar to the identification strategy in first-price auction models (Guerre et al., 2000). The quantity δ*_m − [u_1(x_mj, β)]_{j=1}^J is the pseudo value of ε_m, similar to the pseudo value constructed from bids in the auction model.

Proposition 2. Suppose Assumptions 3.2 and 3.4 hold, and suppose the unconditional choice probabilities {P⁰_kj(X)}_{j,k} are uniformly bounded away from zero and one. Each (α, β, {P⁰_kj(X)}_{j,k}, G) in the identified set Γ_I defined in Definition 2 satisfies:

1. Constraint on the unconditional choice probability:

E[ ms_m − P⁰(X_m) D_m | D_m, X_m ] = 0,   (3.8)

where P⁰(X_m) is the J × K matrix whose (j, k) entry is P⁰_kj(X_m) and D_m = (d_m1, ..., d_mK)′;
2. Instrument constraint:
$$E\left[\delta^*_j(ms_m, X_m, D_m, \alpha, \{\mathcal{P}^{0,k}_j(X)\}_{j,k}) - u(X_{mj}, \beta) \,\middle|\, X_m, D_m\right] = 0, \quad \forall j = 1, \dots, J-1; \quad (3.9)$$
3. Optimality constraint on $\{\mathcal{P}^{0,k}_j(X)\}_{j,k}$, for all $j = 1, 2, \dots, J-1$ and $k = 1, \dots, K$:
$$E\left[\frac{e^{\delta_{mj} + u(x_{mj}, \nu_k, \alpha)}}{\sum_{l \in \mathcal{J}} \mathcal{P}^{0,k}_l(X)\, e^{\delta_{ml} + u(x_{ml}, \nu_k, \alpha)}} - 1 \,\middle|\, X_m\right] = 0. \quad (3.10)$$
Moreover, $G$ satisfies equation (3.7).

The first moment equality (3.8) is equivalent to condition (3.3) in Definition 2, since $ms_m$ is the conditional choice probability while $\mathcal{P}^{0,k}_j$ is the unconditional choice probability. The second moment equality (3.9) is the consequence of conditions 2 and 3 in Assumption 3.2. The third moment equality (3.10) is the first-order condition of (2.10). Remark 3.1.
The identification results differ from those in BLP in several ways. First, we need the number of markets to be large to identify the unconditional choice probabilities for the different demographic groups from (3.8). From the identified unconditional choice probabilities, we can proceed to identify the coefficients $\alpha$ and $\beta$ on the product characteristics and demographic heterogeneity. Second, in BLP we assume there is a vector of unobserved product heterogeneity $\xi = (\xi_1, \dots, \xi_J)$ that can be recovered by matching market shares to the model prediction; in the rational inattention discrete choice model, we instead recover a vector of market-specific utility shocks $\epsilon_m$. Third, the prior distribution of $\epsilon_m$ is the structural object that we are interested in, whereas the distribution of $\xi$ in BLP is not of fundamental interest. Remark 3.2.
If the price of item $j$, denoted by $q_j$, enters the product characteristics $X_j$, then the price is likely to be correlated with the unobserved market-level utility shock: for example, when sellers know the realization of $\epsilon_m$, they may set prices accordingly. In this case, the assumption $E[\epsilon_m | X_m] = 0$ fails, and we may want to find an instrument for $q_j$. The choice of instruments for the price is discussed in BLP.

Definition 3 of the identified set of the persuasion strategy is conditioned on the value of $(\beta, \alpha, G)$. If $(\beta, \alpha, G)$ is point identified from Proposition 2, we can assume that $(\beta, \alpha, G)$ is known by the econometrician and plug the identified $(\beta, \alpha, G)$ into Definition 3. If $(\beta, \alpha, G)$ is not point identified, we can carry out the analysis by treating each point in the identified set $\Gamma_I$ as the true value separately.

For a point $(\alpha, \beta, G)$ in the identified set $\Gamma_I$, equation (3.5) defines the conditional choice probability of demographic group $k$ choosing item $j$ when they receive a persuasion signal $s$ from the ID. The $\tilde{\mathcal{P}}^{0,k}_{j,s}$ is the unconditional choice probability solved from (2.3)-(2.5) when the intermediate belief is $\tilde{F}(\epsilon | s)$. The choice probability $\tilde{\mathcal{P}}^{0,k}_{j,s}$ is conditioned on the signal $s^{ID}$, but unconditional on the utility shock.

The observed market share $\widetilde{ms}_m$ is a linear combination of the different demographic groups' conditional choice probabilities:
$$\widetilde{ms}_{mj} = (\tilde{P}^1_{j,s}(\epsilon, X_m), \dots, \tilde{P}^K_{j,s}(\epsilon, X_m))(d_{m1}, \dots, d_{mK})'. \quad (3.11)$$
Conditional on $(d_{m1}, \dots, d_{mK})$, we can take expectations on both sides of (3.11) to get:
$$E[\widetilde{ms}_j - (\tilde{P}^1_{j,s}(\epsilon, X_m), \dots, \tilde{P}^K_{j,s}(\epsilon, X_m))(d_{m1}, \dots, d_{mK})' \,|\, D_m, X_m] = 0, \quad \forall j = 1, \dots, J. \quad (3.12)$$
Since we do not observe the realization of the persuasion signal and the realization of the utility shock in each market, we integrate them out.
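The aggregation in (3.11) can be illustrated with a small numerical sketch; all values below are hypothetical, chosen only to show the mechanics.

```python
import numpy as np

# Sketch of equation (3.11): the observed market share is the
# demographic-weighted combination of each group's conditional choice
# probability. The numbers are illustrative, not estimates.
J, K = 3, 2                      # products (incl. outside option), demographic groups
P_tilde = np.array([[0.5, 0.2],  # P_tilde[j, k]: group k's prob. of choosing j
                    [0.3, 0.5],
                    [0.2, 0.3]])
d_m = np.array([0.6, 0.4])       # demographic shares in market m, sum to 1
ms_m = P_tilde @ d_m             # ms_mj = sum_k P_tilde[j, k] * d_mk
print(ms_m)                      # a valid share vector: non-negative, sums to 1
```

Because each column of the choice-probability matrix sums to one and the demographic shares sum to one, the implied market shares automatically sum to one.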
Let
$$h^k_j(X_m; \tilde{F}_k) := \int_{(s,\epsilon)} \tilde{P}^k_{j,s}(\epsilon, X_m)\, d\tilde{F}(s, \epsilon) = \int_{(s,\epsilon)} \tilde{P}^k_{j,s}(\epsilon, X_m)\, d\tilde{F}(\epsilon | s)\, d\tilde{F}_k(s) = \int_s \tilde{\mathcal{P}}^{0,k}_{j,s}(X_m)\, d\tilde{F}_k(s) \quad (3.13)$$
be the unconditional choice probability for demographic group $k$ under persuasion strategy $\tilde{F}(s, \epsilon; \theta)$. The third equality holds because $G(\epsilon | s; \theta) = \tilde{F}(\epsilon | s; \theta)$ by Bayes' rule. Proposition 3.
Under Assumptions 3.2-3.3, for each $(\alpha, \beta, G)$, the true persuasion strategy parameter $\theta$ must satisfy the moment condition
$$E\left[\widetilde{ms}_j - \sum_{k=1}^K h^k_j(X_m; \tilde{F})\, d_{mk} \,\middle|\, D_m, X_m\right] = 0, \quad \forall j = 1, \dots, J-1. \quad (3.14)$$
Proof.
By Assumption 3.2, the independence of the demographic distribution $D_m$ from $(\epsilon_m, X_m)$ implies $E[\tilde{P}^k_{j,s}(\epsilon_m, X_m) | D_m, X_m] = \tilde{\mathcal{P}}^{0,k}_{j,s}(X_m)$. Then by (3.12), we have
$$E[\widetilde{ms}_j - (\tilde{\mathcal{P}}^{0,1}_{j,s}(X_m), \dots, \tilde{\mathcal{P}}^{0,K}_{j,s}(X_m))(d_{m1}, \dots, d_{mK})' \,|\, D_m, X_m] = 0. \quad (3.15)$$
Since the signal $s^{ID} \perp (D_m, X_m)$ by Assumption 3.3, we have $E[\tilde{\mathcal{P}}^{0,k}_{j,s}(X_m) | D_m, X_m] = h^k_j(X_m; \tilde{F})$. The result follows.

The effective number of conditional moment equalities is $J - 1$, since I have the constraint that $\sum_j \widetilde{ms}_j = 1$. We should be careful with the persuasion strategy in Bayesian persuasion: the value of a signal has no meaning beyond the context of the communication game. For example, if $\tilde{F}$ is the distribution of $(\epsilon, s^{ID})$ used by the persuader, let $\tilde{F}^\Delta$ be the distribution of $(\epsilon, s^{ID} + \Delta)$, where $\Delta$ is an arbitrary vector in the same space as $s^{ID}$. As a persuasion strategy, $\tilde{F}^\Delta$ is no different from $\tilde{F}$, since the value of the signal does not matter.

In practice, we can consider the case where the persuasion strategy is indexed by a finite-dimensional parameter $\theta$: $\tilde{F}_k(s^{ID}, \epsilon; \theta)$, with finite support for $s^{ID}$. The persuasion strategy can depend on the demographic group $k$. There are several justifications for the use of a parametric persuasion strategy. First, when there are only two choices, the optimal persuasion strategy is a cut-off rule; see Kamenica and Gentzkow (2016). In this case, the parameter $\theta$ is the cutoff points, and signals take only two values. Second, in many empirical contexts, it is costly to design complex persuasion strategies. For example, an online advertisement can only send a simple signal within a few seconds. If the cost of a signal increases with the number of parameters and support points of the signal distribution, it is natural to restrict the persuasion strategy to a parametric form.
Third, a parametric persuasion strategywith discrete signal support facilitates a clear interpretation of the meaning of the signals.In Kamenica and Gentzkow (2011), signals are interpreted as action recommendations. Discussion of Moment Condition (3.14)
One issue with the moment condition (3.14) is that it does not guarantee identification of the persuasion parameter $\theta$. For example, consider the case with only one demographic group ($K = 1$) and no product-characteristic heterogeneity across markets ($X_m = X$ for all $m$). In this case, moment condition (3.14) implies $h_j(\tilde{F}) = E[\widetilde{ms}_j]$. If $\tilde{F}$ is indexed by a parameter $\theta$ and $h_j(\tilde{F}(\theta))$ is not monotone in $\theta$, then $\theta$ is not necessarily point identified.

Several restrictions help to tighten the identified set of $\tilde{F}$. The first is to impose that the persuasion strategy is the same for certain demographic groups, i.e. $\tilde{F}_k(s | \epsilon) = \tilde{F}_{k'}(s | \epsilon)$ for some $k \neq k'$. Demographic variation then tightens the bounds on the persuasion strategy, because different demographic groups' choice probabilities can have different sensitivities to the same persuasion strategy. The second is to impose that the parameter $\theta$ has dimension smaller than $J$; the variation of choice probabilities across products can then tighten the bounds on the parameter that indexes the persuasion strategy. Third, the variation of product characteristics across markets can also tighten the bounds on $\tilde{F}$: if in a market $m$ the $j$-th product characteristics $x_{mj}$ generate large utility for decision makers, the persuasion strategy is unlikely to change the market share by much. Point Identification Assumption
It is worthwhile to discuss the assumptions under which the parameters $(\alpha, \beta, \mathcal{P}^{0,k}_j, G)$ and $\theta$ are point identified. The moment conditions constructed in (3.8)-(3.10) are similar to the moment conditions appearing in BLP, except that I have the extra parameters $\mathcal{P}^{0,k}_j(X)$ to identify. The moment condition for $\mathcal{P}^{0,k}_j(X)$ is similar to the moment condition for linear regression, so if $E[F_k F_k' | X]$ is invertible $X$-a.s., then $\mathcal{P}^{0,k}_j(X)$ is identified. Global sufficient primitive conditions for identification from moment conditions (3.9)-(3.10) are not easy to interpret, because the fixed point $\delta^*$ in Lemma 3.1 is highly non-linear in its arguments. In a similar situation, BLP assume the moment conditions are sufficient to identify the utility parameters.

Assumption 3.5 (Identification Assumption). 1. $E[F_k F_k' | X]$ is invertible, $X$-a.s. 2. At the true parameter $\{\mathcal{P}^{0,k}_j(X)\}_{j,k}$, there is a unique $(\alpha, \beta)$ such that moment conditions (3.9) and (3.10) hold.

The second requirement in Assumption 3.5 is not as restrictive as it seems. In particular, if there is only one demographic group, the fixed point in Lemma 3.1 is given by
$$\delta^*_j = \log\frac{ms_{mj}}{ms_{mJ}} - \log\frac{\mathcal{P}_j(X_m)}{\mathcal{P}_J(X_m)}, \quad (3.16)$$
and moment condition (3.9) becomes
$$E\left[\log\frac{ms_{mj}}{ms_{mJ}} - \log\frac{\mathcal{P}_j(X_m)}{\mathcal{P}_J(X_m)} - u(X_{mj}, \beta) \,\middle|\, X_m\right] = 0.$$
If $\{\mathcal{P}^{0,k}_j(X)\}_{j,k}$ is identified from moment condition (3.8) and $u$ is a linear function, then $\beta$ is point identified.

Now suppose the persuasion strategy is parametric and indexed by $\theta$. The assumptions that guarantee $\theta$ is identified up to $(\alpha, \beta, G)$ are easier to write down. The discussion of (3.14) shows that $h^{k,0}_j(x_m) \equiv h^k_j(x_m; \theta_0)$ is identified if $E[D_m D_m' | X]$ is invertible $X$-a.s., where $\theta_0$ is the true value of $\theta$.
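The closed-form fixed point (3.16) is easy to check numerically. The sketch below uses one demographic group and illustrative values (not estimates from the paper): under the rational-inattention logit, the conditional choice probability is proportional to $\mathcal{P}^0_j e^{v_j}$, and (3.16) recovers the realized utility differences from shares alone.

```python
import numpy as np

# Sketch of the closed-form fixed point (3.16), single demographic group.
# P0 and v are illustrative values, not quantities from the paper.
P0 = np.array([0.5, 0.3, 0.2])     # unconditional choice probabilities
v = np.array([1.0, 0.2, -0.5])     # realized utilities u + eps in one market
ms = P0 * np.exp(v)
ms /= ms.sum()                      # conditional (observed) market shares
delta = np.log(ms / ms[-1]) - np.log(P0 / P0[-1])   # equation (3.16)
print(delta)                        # equals v - v[-1]
```

The $\log \mathcal{P}^0_j - \log \mathcal{P}^0_J$ terms cancel, so the pseudo values equal $v_j - v_J$ exactly: only utility differences relative to the base alternative are recovered, which is why a normalization on $u$ is needed.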
The identified set of the persuasion strategy is then the set of $\theta^*$ such that $h^k_j(x; \theta^*) = h^{k,0}_j(x)$ for all $j, k$ and $x \in \mathrm{supp}(X)$. Assumption 3.6.
The matrix $E[D_m D_m' | X]$ is invertible $X$-a.s.

Under Assumption 3.5, $(\alpha, \beta, G, \{\mathcal{P}^{0,k}_j(X)\}_{j,k})$ is point identified from moment conditions (3.8), (3.9) and (3.10). (We say $\theta$ is identified up to $(\alpha, \beta, G)$ if the data generating process allows us to point identify $\theta$ for each given parameter $(\alpha, \beta, G)$.) When the product characteristics $X$ are continuously distributed, $\mathcal{P}^{0,k}_j(X)$ in moment condition (3.8) needs to be estimated non-parametrically. However, in some empirical settings the product characteristics are discrete, and standard estimators for moment equalities, such as the GMM estimator, can be implemented directly. In this section, I discuss the estimation of $(\alpha, \beta, G, \{\mathcal{P}^{0,k}_j(X)\}_{j,k})$ when the characteristics $X$ are discrete. Assumption 4.1.
The product characteristics $X_m$ are discretely distributed and supported on $L$ points $\{x(1), \dots, x(L)\}$, with $\inf_{l=1,\dots,L} \Pr(X_m = x(l)) > 1/C$ for some constant $C > 0$.

Under Assumption 4.1, the analysis of moment conditions (3.8), (3.9) and (3.10) can be carried out conditional on each value of $X_m$ separately. Since the demographic characteristics $\nu_k$ are also discrete, the most general utility function of (3.1) under discrete $\nu_k$ and $X$ can be rewritten as
$$u_{ijkm} = \alpha^k_j(l) \quad \text{if } (x_{mj})_{j=1}^J = x(l),$$
where $\alpha^k_j(l)$ is the mean utility of product $j$ for a demographic group $k$ individual in a market with characteristics $x(l)$. Any parametric assumption on the utilities $u_1$ and $u_2$ in (3.1) can be imposed as constraints on the values of $\alpha^k_j(l)$.

Even if $\nu_k$ is distributed on $K$ discrete points, the random vector $D_m$ is continuously distributed. Moment conditions (3.8) and (3.9) are still conditioned on $D_m$, and we need to transform them into unconditional moment conditions. Moment condition (3.8) is linear in the elements of $D_m$, so the optimal instruments are $d_{m1}, \dots, d_{mK}$, and we can write $\mathcal{P}^{0,k}_j(x(l))$ as $\mathcal{P}^{0,k}_j(l)$. For moment condition (3.9), we can use $D_m$ and its second-order power terms $\{(d_{mj})^t : j = 1, \dots, J,\ t = 1, 2\}$ as instruments to form unconditional moment conditions.

Let $\alpha$ denote the vector of $\{\alpha^k_j(l)\}_{j,k,l}$ and $\mathcal{P}$ denote the vector of $\{\mathcal{P}^{0,k}_j(l)\}_{j,k,l}$. (The utility parameter $\beta$ cannot be separated from $\alpha^k_j(l)$, so I normalize $u \equiv 0$ for all $j$.) Let $\gamma(ms_m, D_m, X_m, \alpha, \mathcal{P})$ denote the unconditional moment conditions. The standard GMM estimator of $(\alpha, \mathcal{P})$ is given by
$$(\hat{\alpha}, \hat{\mathcal{P}}) = \arg\min \left[\frac{1}{M}\sum_{m=1}^M \gamma(ms_m, D_m, X_m, \alpha, \mathcal{P})\right]' \hat{W} \left[\frac{1}{M}\sum_{m=1}^M \gamma(ms_m, D_m, X_m, \alpha, \mathcal{P})\right], \quad (4.1)$$
where $M$ is the number of markets without the persuader and $\hat{W}$ is any positive semi-definite weighting matrix.
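The mechanics of (4.1) can be sketched in a few lines: average a stacked moment function over observations, then minimize the weighted quadratic form. For transparency, the moment vector below identifies the mean of a simulated $N(2,1)$ sample; it is a stand-in for the model's moment vector $\gamma$, not the paper's actual moments.

```python
import numpy as np

# Sketch of the GMM objective in (4.1) with a toy moment vector.
rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=5000)   # simulated data, true mean 2

def quad_form(theta, W=np.eye(2)):
    # stacked moments: E[x - theta] = 0 and E[(x - theta)^2 - 1] = 0
    gbar = np.array([(x - theta).mean(), ((x - theta) ** 2 - 1.0).mean()])
    return gbar @ W @ gbar

grid = np.linspace(1.5, 2.5, 2001)               # crude grid-search minimizer
theta_hat = grid[np.argmin([quad_form(t) for t in grid])]
print(theta_hat)                                 # close to the true value 2
```

In practice a derivative-based optimizer replaces the grid search; the grid keeps the sketch deterministic and dependency-free.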
Standard asymptotic normality results on the GMM estimator can beapplied if the moment condition satisfies some regularity conditions. Assumption 4.2.
Suppose the following conditions hold: (i) the true parameter value $(\alpha_0, \mathcal{P}_0)$ lies in the interior of the parameter space; (ii) $\gamma(ms_m, D_m, X_m, \cdot, \cdot)$ is continuously differentiable on the interior of the parameter space for all $(ms_m, D_m, X_m)$; (iii) $\gamma(ms_m, D_m, X_m, \alpha_0, \mathcal{P}_0)$ has finite second moment; (iv) $E[\nabla_{(\alpha, \mathcal{P})}\, \gamma(ms_m, D_m, X_m, \alpha_0, \mathcal{P}_0)]$ has rank $\dim((\alpha, \mathcal{P}))$; (v) there exists an integrable function $b$ such that $\left|\nabla_{(\alpha, \mathcal{P})}\, \gamma(ms_m, D_m, X_m, \alpha, \mathcal{P})\right| < b(ms_m, D_m, X_m)$.

Conditions (i), (iii) and (iv) are assumptions on the true value of the parameter of interest $(\alpha_0, \mathcal{P}_0)$, which are not verifiable without observing the data distribution. Conditions (ii) and (v) are assumptions on the derivatives of the moment conditions. It is difficult to verify (ii) and (v) because $\delta^*_m$ as a function of $\alpha$ and $\mathcal{P}$ is defined through the contraction mapping (3.6). General primitive conditions on the rational inattention model that guarantee $\delta^*_m$ is continuously differentiable in $(\alpha, \mathcal{P})$ are hard to find. However, when there is no demographic heterogeneity, the $\delta^*_m$ in Lemma 3.1 has the closed-form solution (3.16). In this case, moment conditions (3.9) and (3.10) can be rewritten as
$$E\left[\left(\log\frac{ms_{mj}}{ms_{mJ}} - \log\frac{\mathcal{P}_j(x(l))}{\mathcal{P}_J(x(l))} - \alpha_j(l)\right) \mathbb{1}(X_m = x(l))\right] = 0,$$
$$E\left[\left(\frac{\dfrac{ms_{mj}}{ms_{mJ}} \Big/ \dfrac{\mathcal{P}_j(x(l))}{\mathcal{P}_J(x(l))}}{\sum_{l' \in \mathcal{J}} \mathcal{P}^0_{l'}(x(l)) \left[\dfrac{ms_{ml'}}{ms_{mJ}} \Big/ \dfrac{\mathcal{P}_{l'}(x(l))}{\mathcal{P}_J(x(l))}\right]} - 1\right) \mathbb{1}(X_m = x(l))\right] = 0.$$
If there exists a constant
$C > 0$ such that $\mathcal{P}_j(x(l)) > 1/C$ holds for all $j, l$, then conditions (ii) and (v) hold. Lemma 4.1.
Suppose Assumption 4.2 holds, and denote $B = E[\nabla_{\alpha, \mathcal{P}}\, \gamma(ms_m, D_m, X_m, \alpha_0, \mathcal{P}_0)]$. Then
$$\sqrt{M}\left[(\hat{\alpha}, \hat{\mathcal{P}}) - (\alpha_0, \mathcal{P}_0)\right] \to_d N(0, \Sigma),$$
where $\Sigma = (B'WB)^{-1} B'W \Lambda W B\, (B'WB)^{-1}$ and $\Lambda = E[\gamma(ms_m, D_m, X_m, \alpha_0, \mathcal{P}_0)\, \gamma(ms_m, D_m, X_m, \alpha_0, \mathcal{P}_0)']$. (For example, see Theorem 3.4 of Newey and McFadden (1994).)

Recall that the moment condition for the persuasion strategy in (3.14) is derived for each identified value of $(\alpha, \beta, G)$. I now give an estimator of the persuasion strategy in which the estimated $(\hat{\alpha}, \hat{\mathcal{P}})$ from Lemma 4.1 are plugged directly into (3.14). This is a two-step estimation procedure and will not be efficient; I discuss the complexity of jointly estimating moment conditions (3.8)-(3.10) and (3.14) after the plug-in estimator of the persuasion strategy is introduced.

Given the estimated $(\hat{\alpha}, \hat{\mathcal{P}})$, we can construct a sample of estimated realized utilities
$$\hat{v}_{mj,k}(x(l)) = \sum_{l=1}^L \left[\delta_j(ms_m, X_m, D_m, \hat{\alpha}, \hat{\mathcal{P}}) + \hat{\alpha}^k_j(X_m)\right] \mathbb{1}(X_m = x(l)) \quad (4.2)$$
corresponding to (3.2), and a sample of utility shocks
$$\hat{\epsilon}_{mj} = \delta_j(ms_m, X_m, D_m, \hat{\alpha}, \hat{\mathcal{P}}). \quad (4.3)$$
Fixing the demographic group $k$ and the characteristics $x(l)$, the distribution of $\hat{v}_{mj,k}(x(l))$ conditional on $k$ and $x(l)$ is an estimated distribution of the realized utility.

To form moment condition (3.14), we first need the unconditional choice probabilities $\tilde{\mathcal{P}}^{0,k}_{j,s}(X_m)$ in (3.13) for each demographic group $k$ and each value of the product characteristics. To get an estimator of $\tilde{\mathcal{P}}^{0,k}_{j,s}(X_m)$, denoted by $\hat{\mathcal{P}}^{0,k}_{j,s}(X_m)$, we need to solve the optimization problem (2.10) with an estimated prior belief.
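The sandwich covariance in Lemma 4.1 is mechanical to compute once $B$, $W$ and $\Lambda$ are estimated; a minimal sketch with placeholder matrices (not estimates of the model's Jacobian, weighting matrix, or moment variance):

```python
import numpy as np

# Sketch of Sigma = (B'WB)^{-1} B'W Lambda W B (B'WB)^{-1} from Lemma 4.1.
B = np.array([[1.0, 0.2],   # placeholder Jacobian E[grad gamma]: moments x params
              [0.0, 1.0],
              [0.5, 0.3]])
W = np.eye(3)               # weighting matrix
Lam = 0.5 * np.eye(3)       # Var(gamma) at the true parameter
bread = np.linalg.inv(B.T @ W @ B)
Sigma = bread @ B.T @ W @ Lam @ W @ B @ bread
print(Sigma)                # asymptotic covariance of sqrt(M)(estimator - truth)
```

With $W = I$ and $\Lambda = \tfrac{1}{2}I$, the sandwich collapses to $\tfrac{1}{2}(B'B)^{-1}$, which is a quick sanity check on the formula.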
I look at the empirical counterpart of optimization problem (2.10) under persuasion strategy $\tilde{F}(s^{ID}, \epsilon; \theta)$, conditional on markets with $X_m = x(l)$:
$$\max_{\{\tilde{\mathcal{P}}^{0,k}_{j,s}\}_{j=1}^J}\ \sum_{m=1}^{M(x(l))} \log\left(\sum_{j=1}^J \tilde{\mathcal{P}}^{0,k}_{j,s}(x(l))\, e^{\hat{v}_{mj}}\right) \tilde{F}_k(s^{ID} = s \,|\, \hat{\epsilon}_{mj}; \theta) \quad \text{s.t.} \quad \forall j:\ \tilde{\mathcal{P}}^{0,k}_{j,s}(x(l)) \geq 0, \quad \sum_{j=1}^J \tilde{\mathcal{P}}^{0,k}_{j,s}(x(l)) = 1, \quad (4.4)$$
where $M(x(l))$ is the number of markets such that $X_m = x(l)$. I implicitly impose that the marginal distribution of $\tilde{F}_k(\epsilon_{mj})$ is the empirical distribution of $\hat{\epsilon}_m$, so that by Bayes' rule $\tilde{F}_k(s^{ID} | \hat{\epsilon}_{mj}; \theta) \big/ \sum_{m'=1}^{M(x(l))} \tilde{F}_k(s^{ID} | \hat{\epsilon}_{m'j}; \theta)$ is the posterior belief when the DM receives signal $s$. Let $\hat{\mathcal{P}}^{0,k}_{j,s}(x(l))$ be the solution to (4.4), and denote the vector $(\hat{\mathcal{P}}^{0,k}_{j,s}(x(l)))_{j,k,s,l}$ by $\hat{\mathcal{P}}_s$.

After solving for $\hat{\mathcal{P}}^{0,k}_{j,s}(x(l))$, we can write the empirical version of moment condition (3.14). Let $N$ be the number of markets with persuasion. For all $l = 1, \dots, L$ and $j = 1, \dots, J-1$, denote
$$g_{l,j,k}(\theta, \widetilde{ms}_m, D_m, X_m, \hat{\mathcal{P}}_s) = \left[\widetilde{ms}_{mj} - \sum_{k'=1}^K h^{k'}_j(\theta, \hat{\mathcal{P}}_s, x(l))\, d_{mk'}\right] d_{mk}\, \mathbb{1}(X_m = x(l)), \quad (4.5)$$
$$h^k_j(\theta, \hat{\mathcal{P}}_s, x(l)) = \sum_s \left[\hat{\mathcal{P}}^{0,k}_{j,s}(x(l), \theta) \sum_{m=1}^{N(x(l))} \frac{\tilde{F}(s | \epsilon_m; \theta)}{N(x(l))}\right], \quad (4.6)$$
where $\widetilde{ms}_m$ is the vector of share observations in market $m$, and $N(x(l))$ is the number of markets with persuasion such that $X_m = x(l)$. Then we can estimate $\theta$ by the usual GMM estimator:
$$\hat{\theta} = \arg\min \left(\frac{1}{N}\sum_{m=1}^N g_m(\theta)\right)' W \left(\frac{1}{N}\sum_{m=1}^N g_m(\theta)\right), \quad (4.7)$$
where $g_m(\theta)$ is the vector of moment functions $(g_{l,j})_{l,j}$ in (4.5).

In what follows, I derive the consistency of $\hat{\theta}$ when the persuasion strategy has a smooth parametric form $\tilde{F}(s^{ID}, \epsilon; \theta)$ and the signal $s^{ID}$ is discrete. Assumption 4.3.
The persuasion strategy satisfies, for some $C > 0$ and all values of $s$:

1. $\tilde{F}(s | \epsilon; \theta)$ is differentiable with respect to $\epsilon$, and the gradient is uniformly bounded in $\theta$:
$$\sup_{\theta \in \Theta,\, s} \left|\frac{\partial \tilde{F}(s | \epsilon; \theta)}{\partial \epsilon_j}\right| < C;$$
2. The $\delta^*_m(ms_m, D_m, X_m; \alpha_0, (\mathcal{P}^{0,k}_j(X_m))_{j,k})$ defined in Lemma 3.1 satisfies $\left|\partial \delta^*_{mj} / \partial \kappa\right| < C$ for all $\kappa \in \{\alpha^k_j(l), \mathcal{P}^{0,k}_j(x(l)) : j, k, l\}$ and all values of $(ms_m, D_m, X_m)$;

3. The partial derivatives with respect to the elements of $\theta$ satisfy
$$\sup_{\epsilon, s, i} \left|\frac{\partial \tilde{F}(s | \epsilon; \theta)}{\partial \theta_i}\right| < C.$$

The moment conditions with instruments $Z(D_m)$ are assumed to point identify the parameter $\theta$; the point identification conditions are discussed in Section 3. Assumption 4.4.
Let $g(\theta) = \left(g_{l,j}(\theta, \widetilde{ms}_m, D_m, X_m, \hat{\mathcal{P}}_s)\right)_{l=1,\dots,L;\ j=1,\dots,J-1}$ and define $L(\theta) = g(\theta)' W g(\theta)$. The following identification condition holds: for all $\zeta > 0$,
$$\inf_{d(\theta, \theta_0) > \zeta} L(\theta) - L(\theta_0) > 0.$$
Proposition 4.
Under Assumptions 4.2-4.4 and technical Assumption C.1, $\hat{\theta}$ is a consistent estimator of $\theta_0$. Remark 4.1.
The asymptotic distribution of $\hat{\theta}$ is not derived in this paper; there are two difficulties. First, the unconditional choice probability vector under persuasion $\hat{\mathcal{P}}_s$ is estimated using the sample of markets without persuasion, and its sampling error comes from two sources: (i) $\hat{\mathcal{P}}_s$ is estimated from the empirical version (4.4) of the optimization problem (2.10); (ii) the utility shocks in (4.4) are constructed from the estimator $\hat{\alpha}$. Second, $\mathcal{P}^{0,s}_{j,k}(x(l))$ can be local to the boundary of the parameter space under the true persuasion strategy, i.e. $\mathcal{P}^{0,s}_{j,k}(x(l)) \approx 1/\sqrt{n}$ for some $(j, k, l)$. In this case, the sampling distribution of $\hat{\mathcal{P}}^{0,s}_{j,k}(x(l))$ is difficult to characterize, and so is the influence of this sampling error on $\hat{\theta}$. Joint Estimation and Two Step Estimation
In this section, I briefly discuss how to estimate the persuasion strategy parameter $\theta$ and the preference parameters $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}, G)$ jointly using moment conditions (3.8)-(3.10) and (3.14). The objective function of the joint GMM estimation is simply the stack of $\gamma_m$ in (4.1) and $g_m$ in (4.7). For each $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}, \theta)$ in the parameter space, we need to find the $\delta^*_m$ for each market and construct the pseudo sample of $\{\epsilon_m\}_{m=1}^M$. Given the pseudo sample of $\{\epsilon_m\}_{m=1}^M$, we then solve the optimization problem (4.4) to get $h^k_j$ in (4.6). Given $\epsilon_m$ and $h^k_j$, we can evaluate the joint GMM objective function at this $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}, \theta)$.

The joint GMM estimation procedure introduces two extra computational burdens compared with the two-step procedure. First, the fixed point $\delta^*_m$ needs to be calculated at each parameter evaluation $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}_{j,k,l}, \theta)$ in the joint estimation, whereas in the two-step estimation we find the fixed point only for each $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}_{j,k,l})$. If the dimension of $\theta$ is large, the extra parameter $\theta$ can add a significant computational burden to the joint estimation. Second, the optimization problem (4.4) needs to be solved at each $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}_{j,k,l}, \theta)$ in the joint estimation; in contrast, in the two-step procedure we plug the estimator $(\hat{\alpha}, \{\hat{\mathcal{P}}_{j,k}(x(l))\}_{j,k,l})$ into (3.14), and the optimization problem (4.4) only needs to be solved for each $\theta$. Plugging in the estimator $(\hat{\alpha}, \{\hat{\mathcal{P}}_{j,k}(x(l))\}_{j,k,l})$ reduces the dimension of the parameter space for the second-step GMM estimation.

Joint estimation of $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}_{j,k,l}, \theta)$ also makes inference on $(\alpha, \{\mathcal{P}_{j,k}(x(l))\}_{j,k,l})$ difficult. The discussion under Proposition 4 reveals the difficulty of deriving the asymptotic distribution of $\hat{\theta}$.
The difficulty comes from the unknown limit distribution of $\hat{\mathcal{P}}^{0,s}_{j,k}(x(l))$ when $\mathcal{P}^{0,s}_{j,k}(x(l))$ is local to zero. The same issue arises for $(\hat{\alpha}, \{\hat{\mathcal{P}}_{j,k}(x(l))\}_{j,k,l})$ if we estimate all moment conditions jointly.

In this section, I apply the rational inattention discrete choice model with persuasion to the effect of Fox News on the 2000 presidential election (DellaVigna and Kaplan, 2007). Fox News started distributing its channel in 1996, and its twenty-four-hour cable program had penetrated about 20% of the towns in the United States by November 2000. The Fox News channel is perceived to provide political views that are to the right of mainstream news channels such as ABC and CNN. In the empirical application, I treat the entry of Fox News into the local cable markets as the presence of the persuader. The DMs' prior distribution $G$ is understood as the prior belief about the presidential candidates under mainstream news channels. The goal is to estimate the preference parameters of each demographic group and the persuasion strategy used by Fox News in these markets. The estimated persuasion strategy can reveal the degree of bias in Fox News programming.

Data

The election outcome data are taken from DellaVigna and Kaplan (2007), and the demographic data are a mixture of the original demographic data in DellaVigna and Kaplan (2007) and the 2000 U.S. census data. Each observation consists of a vector of presidential election vote results, a vector of demographic statistics corresponding to a town, and an indicator for the presence of Fox News. The presidential election vote results include the total votes cast, the number of votes for the Democratic Party, and the number of votes for the Republican Party. The demographic statistics include the number of people above 18 years old, the gender ratio, the ethnic group decomposition (African American, Hispanic, Asian, etc.), and the decomposition by education level.
The education level statistics are for eligible voters (18+ years old), but the ethnic group statistics cover both adults and children.

The original demographic data in DellaVigna and Kaplan (2007) are flawed: in about 15% of the towns, the number of votes cast exceeds the number of residents above 18 years old. The issue arises when a town name corresponds to multiple administrative levels; for example, some names are used for both a township and a city in different counties, and such towns tend to be matched incorrectly. I re-match the voting data with the 2000 U.S. census data to address this issue, but the problem is not solved completely: about 5% of the towns still record more votes than adults. As mentioned in DellaVigna and Kaplan (2007), this may be due to flaws in the process of collecting the election data.

I follow the data selection procedure in DellaVigna and Kaplan (2007) and discard towns: 1. without the CNN news channel; 2. where the number of precincts in 2000 differs from that in 1996 by more than 20%; 3. where the total number of votes in 2000 differs from that in 1996 by more than 100%; 4. with multiple cable systems; 5. where the number of people with high school education and above exceeds the number of adults; 6. where the number of votes exceeds the number of adults.

Throughout the application, I assume the choice set includes $J = 3$ options: {Rep, Dem, Out}. (The education level variable in their data set is not correct for some towns: for example, the proportions of residents with no more than high school education and with more than high school education can sum to more than 1.)
Abstaining from voting corresponds to the Out option.
First, I separate markets into two groups: with persuasion and without persuasion. If Fox News is available in the town, I assume the town is under the influence of the persuader. This assumes that the presence of Fox News influences the whole town. Since I only use observations of towns with one unique cable company, if the cable company carries Fox News, everyone in the town has access to the channel; while some residents may not watch the channel, the contents of the news program can spread through workplaces and places of entertainment. This also assumes that towns without Fox News cannot be influenced by persuasion. This assumption suits the historical context of 2000, when fixed broadband subscriptions in the United States accounted for around 2.5% of the population, so streaming of Fox News was not accessible to most voters in towns without Fox News.

The key assumption on markets without persuasion is Assumption 3.2. The i.i.d. assumption on $\epsilon_m$ requires that there be no spatial correlation conditional on the observed characteristics of the town. The variation in $\epsilon_m$ may come from differences in the geographic locations of towns and in the composition of their industries; for example, a cleaner-fuel policy may generate different perceptions in coal mining towns and in forest zones. The independence assumption $\epsilon_m \perp D_m$ requires that the composition of demographics does not influence the prior belief.

For markets with persuasion, Assumption 3.3 requires that Fox News use the same persuasion strategy for all towns, regardless of demographic composition. This assumption is justified because Fox News is a national program, so the persuasion strategy should be perceived similarly in all towns. Last, the assumption that the persuader draws persuasion

Note that this is not a restriction on the entry decision. In fact, Fox News can endogenously choose the towns in which it provides its channel, but this is out of the scope of this paper.
The model aims to estimate the persuasion strategy used by Fox News but does not model Fox News' utility to justify the persuasion and the entry. As long as the persuasion strategy is the same for all towns, the identification argument goes through whether entry was chosen optimally or exogenously.

signals $s^{ID} \sim_{i.i.d.} \tilde{F}_k$ requires that the signals be independent across towns. This assumption is hard to justify, since Fox News is a national program. However, Fox News reports on different aspects of the candidates (e.g. foreign policy, economic policy), and each town may focus on only one aspect of a candidate, which may result in an i.i.d. persuasion signal across towns.

I assume there are no product characteristics across towns. The utility is $u_{mkj} = \alpha_{j,k} + \epsilon_{mj}$, where the parameters $\alpha_{j,k}$ are the mean utility of candidate $j$, which differs across demographic groups $k$. The utility of the outside option is normalized to zero. I partition the decision makers in each town based on their education level at the time of the election: {High School and Lower, College Partial, College Complete}. The segmentation by education level can reflect differences in income levels and in the political spectrum. The estimates and their 95% confidence intervals are reported in Table 1.

Table 1: Estimated Mean Preference Parameters
Choice j     High School                  College Partial            College Complete
Rep          -0.1318 [-0.1540, -0.1050]   0.1369 [0.0816, 0.1848]    0.0306 [0.0079, 0.0538]
Dem          -0.0859 [-0.0983, -0.0707]   0.1260 [0.0693, 0.1725]    0.0702 [0.0529, 0.0857]

The estimation results show several interesting patterns. First, the group with a partial college degree has a slightly lower preference for the Democratic Party than for the Republican Party. The partial college group includes eligible voters who earn degrees from community or technical colleges. (A finer partition of the demographics would be desirable, but the U.S. census data do not provide the joint distribution of education with the other demographic characteristics.) So we see that both the most educated group and the least educated group prefer the Democratic Party, but the middle group seems to be indifferent between these two parties. Second, the College Partial group has a higher willingness to vote. However, this does not imply that the College Partial group votes more for the Democratic Party than those who complete a college education. Table 2 reports the estimated unconditional choice probabilities for each demographic group.

Table 2: Unconditional Choice Probability: With and Without Fox News

             High School          College Partial      College Complete
             No Fox    With Fox   No Fox    With Fox   No Fox    With Fox
Rep          0.1998    0.1610     0.5082    0.5488     0.3031    0.3415
Dem          0.1891    0.2086     0.2925    0.2498     0.3974    0.3634

The results in Table 2 cannot be generated by a random utility model with Logit shocks: such a model would predict that the College Partial group votes more for the Democratic Party than the College Complete group does, because $\alpha_{\text{Dem, College Partial}} > \alpha_{\text{Dem, College Complete}}$. (Note that the confidence interval of $\alpha_{\text{Dem},k}$ does not intersect with that of $\alpha_{\text{Rep},k}$ for $k \in \{$High School, College Complete$\}$.) The estimated density of the prior distribution $G$ is given in Figure ??.

The signal takes two values $\{+, -\}$: a '+' signal means 'candidate 1 is better than candidate 2' when it compares $\epsilon_1$ with $\epsilon_2$, and a '-' signal means the contrary. The persuasion strategy for the high school education group is given by
$$\Pr_{\tilde{F}_{HS}}(S^{ID} = - \,|\, \epsilon) = \begin{cases} 1 & \text{if } \epsilon_{rep} < \epsilon_{dem}, \\ \theta_{hs}^{\,\epsilon_{rep} - \epsilon_{dem}} & \text{if } \epsilon_{rep} \geq \epsilon_{dem}, \end{cases}$$
and the persuasion strategy for the college partial and college complete groups is given by
$$\Pr_{\tilde{F}_{College}}(S^{ID} = - \,|\, \epsilon) = \begin{cases} 0 & \text{if } \epsilon_{rep} > \epsilon_{dem}, \\ 1 - \theta_{c}^{\,-(\epsilon_{rep} - \epsilon_{dem})} & \text{if } \epsilon_{rep} \leq \epsilon_{dem}. \end{cases}$$
(A two-signal persuasion strategy is also justified by Gitmez and Molavi (2018), where the politician in their model has full control of the news media and voters are heterogeneous in their beliefs.)
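The two parametric families above are easy to evaluate; a minimal sketch with an illustrative $\theta$ (the paper's estimates are in Table 3):

```python
# Sketch of the two-signal persuasion families above; theta is illustrative.
def pr_minus_hs(eps_rep, eps_dem, theta):
    # always report '-' when Rep is worse; otherwise a false '-' whose
    # probability decays geometrically in the utility gap
    d = eps_rep - eps_dem
    return 1.0 if d < 0 else theta ** d

def pr_minus_college(eps_rep, eps_dem, theta):
    # never report '-' when Rep is better; otherwise '-' with prob 1 - theta^{-d}
    d = eps_rep - eps_dem
    return 0.0 if d > 0 else 1.0 - theta ** (-d)

theta = 0.95
print(pr_minus_hs(-1.0, 0.0, theta))      # Rep worse: '-' for sure
print(pr_minus_hs(2.0, 0.0, theta))       # false '-' with decayed probability
print(pr_minus_college(2.0, 0.0, theta))  # Rep better: never '-'
print(pr_minus_college(-2.0, 0.0, theta)) # '-' with probability 1 - theta^2
```

With $\theta$ close to 1, both probabilities move slowly in the gap $\epsilon_{rep} - \epsilon_{dem}$, which is consistent with the nearly uninformative signal discussed below Table 3.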
I use the same parametric family for the demographic groups with education higher than high school but treat the least educated group separately. This is because Table 2 shows that only the least educated group has a lower unconditional choice probability for the Republican Party and a higher unconditional choice probability for the Democratic Party after Fox News entered their towns.

The '-' signal in the persuasion strategy for the high school group is sent either when the Republican Party is indeed worse than the Democratic Party, or, with small probability, when the Republican Party is better.

The persuasion strategy for eligible voters with at least a partial college education has a cleaner interpretation. The positive signal $S^{ID} = +$ can be read as 'the Republican candidate is better than the Democratic candidate'. A positive signal is always sent when the Republican is indeed better, i.e. $\epsilon_{rep} > \epsilon_{dem}$, and a fake positive signal can also be sent when $\epsilon_{rep} < \epsilon_{dem}$, but its probability decays as the difference grows in absolute value.

The estimated persuasion strategy parameters are reported in Table 3, and the probability of the "+" signal for the two persuasion strategies is plotted in Figure 2. We should note that the persuasion strategy parameter $\theta$ is very close to 1, and the entropy of the marginal distribution of the signal is close to zero. The close-to-zero entropy indicates that the signal sent by Fox News does not carry much information. However, the relative scale of the entropy is still significantly large compared with the utility parameters $\alpha_{jk}$ for all three groups.

Table 3: Persuasion Strategy

                       High School    College Partial and Complete
Estimator $\hat{\theta}$

Note: The entropy numbers are calculated based on the marginal distribution of the signal.
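The entropy referred to in the note under Table 3 is the Shannon entropy of the signal's marginal distribution. A sketch with an illustrative, nearly degenerate marginal (not the paper's estimated marginal), matching the "close-to-zero entropy" discussion:

```python
import numpy as np

# Shannon entropy (in nats) of a discrete distribution; illustrative inputs.
def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                       # convention: 0 * log 0 = 0
    return float(-(p * np.log(p)).sum())

print(entropy([0.99, 0.01]))           # nearly uninformative signal: ~0.056
print(entropy([0.5, 0.5]))             # maximal binary entropy: log 2 ~ 0.693
```

A marginal close to a point mass yields entropy near zero, so the signal realization is almost deterministic and carries little information on average.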
The overall fit of the persuasion model can be seen from the difference between the unconditional choice probability in the data and the unconditional choice probability predicted by the persuasion strategy. Table 4 shows that the model predicts the unconditional choice probability quite well, except for the high school group's unconditional probability of choosing the Republican party.

Figure 2: Estimated probability of sending the "+" signal and the histogram of ε_rep − ε_dem

Table 4: Unconditional Choice Probability in Towns with Fox News: Model vs Data
        High School       College Partial    College Complete
        Model    Data     Model    Data      Model    Data
Rep     0.1853   0.1610   0.5427   0.5488    0.3335   0.3415
Dem     0.2090   0.2086   0.2614   0.2498    0.3708   0.3634

Costly information acquisition can lead the decision maker to choose the second-best option with some probability. If information is free (i.e. λ = 0, or the decision maker can perfectly observe (ε_rep, ε_dem)), the decision maker should be able to choose the option that maximizes his utility; this is defined as the first-best outcome. The persuasion signal has two influences on decision makers: it provides extra information that reduces the entropy of beliefs, but it also intentionally leads some decision makers to make wrong decisions. In this section, I analyze welfare by asking what percentage of voters cast votes consistent with their first-best choice before and after Fox News enters their town.

Formally, the first-best choice j^{m,fb}_k in a town m is defined as j^{m,fb}_k = argmax_{j∈J} α_{j,k} + ε^m_j, and P^k_{j=j^{fb}}(α + ε^m) is the proportion of voters that make the correct choice in the rational inattention model without persuasion in town m, while Σ_s F̃(s|ε) P^k_{j=j^{fb},s}(α + ε^m) is the proportion of voters that make the correct choice under Fox News persuasion.
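The welfare objects above can be sketched numerically. Assuming the conditional choice probabilities take the multinomial-logit form P_j(ε) ∝ P_{0,j} e^{(α_j + ε_j)/λ} implied by the rational inattention model, with function names and illustrative values of my own:

```python
import math

def ri_choice_probs(p0, v, lam=1.0):
    """Conditional choice probabilities of the rational-inattention form:
    P_j(eps) proportional to p0_j * exp(v_j / lam), where p0 is the
    unconditional choice probability vector and v_j = alpha_j + eps_j."""
    w = [p * math.exp(x / lam) for p, x in zip(p0, v)]
    s = sum(w)
    return [x / s for x in w]

def first_best_share(p0, alpha, eps, lam=1.0):
    """Share of voters in a town with shock eps whose vote matches their
    first-best choice j_fb = argmax_j alpha_j + eps_j."""
    v = [a + e for a, e in zip(alpha, eps)]
    j_fb = max(range(len(v)), key=lambda j: v[j])
    return ri_choice_probs(p0, v, lam)[j_fb]

# Two parties, uniform prior beliefs: costly information gives ~0.73,
# near-free information (small lam) pushes the share toward 1.
print(first_best_share([0.5, 0.5], [0.0, 0.0], [1.0, 0.0]))             # ~0.73
print(first_best_share([0.5, 0.5], [0.0, 0.0], [1.0, 0.0], lam=0.05))
```

Averaging `first_best_share` over prior draws of ε, with and without the persuasion signal mixed in, reproduces the town-level distributions plotted in figure 3.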
Since we have the estimated prior distribution G(ε_rep, ε_dem), we can obtain the distribution of P^k_{j=j^{fb}}(α + ε^m) and Σ_s F̃(s|ε) P^k_{j=j^{fb},s}(α + ε^m). The estimated distribution (across towns) is shown in figure 3. The patterns are quite different for the three groups. For voters with a high school education, persuasion does not really help them make better decisions overall. For voters with a partial college education, persuasion generates higher dispersion in the distribution of voters that vote for their first-best choice. It should be noted that even though the persuasion strategy is the same for voters with partial and complete college education, the persuasion strategy tightens the distribution of the first-best choice for voters who complete a college education.

Figure 3: Distribution of the percentage of voters that achieve their first-best choice

Conclusion

In this paper, I study the identification of the rational inattention discrete choice model with Bayesian persuasion. I derive the conditional moment conditions that identify the mean utility of each product and the prior distribution. I also show the identification of a parametric persuasion strategy when the persuader plays a sequential game with decision makers in the model. In the empirical application, I study the effect of Fox News in persuading voters to vote for the Republican Party, and I analyze the welfare change for voters before and after the influence of Fox News.

For future research, a natural next step is to unify the supply-side model with the identified persuasion strategy. If the supply side, which is Fox News in this context, is rational when it chooses the persuasion strategy, the optimal strategy should impose constraints on its utility parameters. Such parameters are crucial when we conduct a counterfactual analysis on the supply side.
For instance, in the IO context, the preference for persuasion strategy would allow us to model non-price competition.
References
Bagwell, Kyle and Garey Ramey (1988), "Advertising and limit pricing." The RAND Journal of Economics, 19, 59–71.
Berry, Steven, James Levinsohn, and Ariel Pakes (1995), "Automobile prices in market equilibrium." Econometrica, 63, 841–890.
Berry, Steven T (1994), "Estimating discrete-choice models of product differentiation." The RAND Journal of Economics, 25, 242–262.
Bloedel, Alexander W and Ilya R Segal (2018), "Persuasion with rational inattention." Working Paper.
Cover, Thomas M and Joy A Thomas (2006), Elements of Information Theory, 2nd edition. Wiley.
De Oliveira, Henrique, Tommaso Denti, Maximilian Mihm, and Kemal Ozbek (2017), "Rationally inattentive preferences and hidden information costs." Theoretical Economics, 12, 621–654.
DellaVigna, Stefano and Ethan Kaplan (2007), "The Fox News effect: Media bias and voting." The Quarterly Journal of Economics, 122, 1187–1234.
Dorfman, Robert and Peter O Steiner (1954), "Optimal advertising and optimal quality." The American Economic Review, 44, 826–836.
Gitmez, Arda and Pooya Molavi (2018), "Media capture: A Bayesian persuasion approach." Working Paper.
Goeree, Michelle Sovinsky (2008), "Limited information and advertising in the US personal computer industry." Econometrica, 76, 1017–1074.
Guerre, Emmanuel, Isabelle Perrigne, and Quang Vuong (2000), "Optimal nonparametric estimation of first-price auctions." Econometrica, 68, 525–574.
Jun, Sung Jae and Sokbae Lee (2018), "Identifying the effect of persuasion." Working Paper.
Kamenica, Emir and Matthew Gentzkow (2011), "Bayesian persuasion." American Economic Review, 101, 2590–2615.
Kamenica, Emir and Matthew Gentzkow (2016), "A Rothschild-Stiglitz approach to Bayesian persuasion." American Economic Review: Papers & Proceedings, 106, 597–601.
Matejka, Filip and Alisdair McKay (2015), "Rational inattention to discrete choices." American Economic Review, 105, 272–298.
McFadden, Daniel (1973), "Conditional logit analysis of qualitative choice behavior." In Frontiers in Econometrics, edited by P. Zarembka. Academic Press.
Nelson, Philip (1974), "Advertising as information." Journal of Political Economy, 82, 729–754.
Newey, Whitney K and Daniel McFadden (1994), "Large sample estimation and hypothesis testing." In Handbook of Econometrics, Vol. IV, edited by R. F. Engle and D. L. McFadden, 2111–2245.
Van der Vaart, Aad W (2000), Asymptotic Statistics. Cambridge University Press.
Wellner, Jon and Aad W. van der Vaart (2013), Weak Convergence and Empirical Processes: With Applications to Statistics. Springer Science & Business Media.
Xiang, Jia (2020), "Physicians as persuaders: Evidence from hospitals in China." Working Paper.
A Appendix 1: Data Compression Interpretation of Entropy Cost
The entropy of a discrete random variable is closely related to the expected number of binary questions needed to determine its realization. Consider the following example:
• X is supported on 4 points: X₁ = (H, H), X₂ = (H, L), X₃ = (L, H) and X₄ = (L, L).
• The probability of each realization is P₁ = P₄ = 1/3 and P₂ = P₃ = 1/6.
• Consider two ways of asking questions:
1. Q1: The state is: (A) First component is H; (B) First component is L. Q2: (A) Second component is H; (B) Second component is L.
2. Q1: The state is: (A) Both high; (B) Both low; (C) Neither. Q2: The state is: (A) (H, L); (B) (L, H).
Using the first approach, we always need two binary questions to pin down the realization. Using the second approach, we always ask one 3-adic question, and with probability 1/3 we need another binary question. If we treat a 3-adic question as equivalent to log₂ 3 binary questions, the expected number of binary questions we need to ask is

log₂ 3 + 1/3 = −(2/3) log₂(1/3) − (1/3) log₂(1/6),

which is the entropy number. In many examples the entropy number cannot be attained with an integer number of binary questions, but it is nonetheless a good approximation of the complexity of the random variable.

Now consider the entropy cost function defined in (2.5). The entropy H(G) is interpreted as the number of binary questions needed to pin down a draw from the prior distribution. (To see the conversion between question alphabets: if N binary questions cover all possible states of the world, the cardinality of the states is approximately 2^N; if instead we need N_M M-adic questions, then M^{N_M} ≈ 2^N, so N = N_M log₂ M. A more rigorous conversion argument can be established using large-scale data compression theory; see Cover and Thomas (2006), Chapter 5.) Given a signal s that the DM acquires from the world, the number of binary questions remaining is H(F(·|s)), so the expected number of questions remaining is E_s[H(F(·|s))]. Therefore, the entropy difference H(G) − E_s[H(F(·|s))] is interpreted as the expected number of binary questions answered by the signal s, and the unit cost of information λ is interpreted as the market price of asking a binary question.

The interpretation still works when the signal s is discrete but the state v is continuous. Consider an example where X ∼ U[−1, 1], and Y = 1 when X ≥ 0 while Y = 0 when X < 0. Since X is negative with probability 0.5, Y answers exactly one binary question: whether X is negative or not. Direct calculation shows that H(X) = 1 and H(X|Y) = 0, so the mutual information is I(X; Y) = H(X) − H(X|Y) = 1.

When the pair (s, v) is continuously distributed, the data compression argument needs to be modified slightly. The approach is to take a quantization of the random variables: the quantization of v slices the support of v into cubes of side length ∆.
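Both arguments can be checked numerically. A short sketch, assuming the four-point example assigns probability 1/3 to each matched state (H,H), (L,L) and 1/6 to each mixed state (the assignment that makes the three answers of the first 3-adic question equally likely), and taking V ~ U[−1,1] for the quantization limit:

```python
import math

# Entropy equals the expected number of binary questions under scheme 2.
p = {("H", "H"): 1/3, ("L", "L"): 1/3, ("H", "L"): 1/6, ("L", "H"): 1/6}
entropy = -sum(q * math.log2(q) for q in p.values())

# Scheme 2: one 3-adic question (= log2(3) binary questions), plus one extra
# binary question with probability 1/3 (when the first answer is "neither").
expected_questions = math.log2(3) + 1/3
print(entropy, expected_questions)   # both ~1.918

# Quantization: slicing U[-1,1] into bins of width delta, the discrete entropy
# satisfies H(V_delta) + log2(delta) -> differential entropy = 1 bit.
for delta in (0.1, 0.01):
    n_bins = round(2 / delta)        # support of U[-1,1] has length 2
    p_bin = 1.0 / n_bins             # each bin has probability delta/2
    h_disc = -n_bins * p_bin * math.log2(p_bin)
    print(h_disc + math.log2(delta))  # -> 1.0 (up to rounding)
```

The first two printed numbers coincide exactly, and the quantization correction recovers the 1-bit differential entropy for every bin width.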
As the quantization length ∆ → 0, the entropy of the discretized random vector, denoted V(∆), converges to the differential entropy of v in the following sense:

H(V(∆)) + log₂ ∆^J → H(G(v)) as ∆ → 0,

where J is the dimension of v. We can perform the same quantization for the signal variable s. When we calculate the entropy difference H(G) − E_s[H(F(·|s))], which is the mutual information, the effect of quantization cancels; see Cover and Thomas (2006), Chapter 8 for a discussion of quantization. We can then apply the data compression interpretation to the quantized version of (v, s).

B Appendix 2: Proofs of Section 3
B.1 The Contraction Mapping Lemma 3.1
Proof.
The proof is a minor adaptation of Berry et al. (1995). To show that the operator T is a contraction mapping, it suffices to show that the conditions of Theorem 1 in BLP hold. Let T_j : R^{J−1} → R denote the j-th component of the mapping T : R^{J−1} → R^{J−1} defined in (3.6). I use the following notation for the proof:

P^k_j(δ, X, P_{0,k}, ν_k, α) = P_{0,kj}(X) e^{δ_j + u(X_m, ν_k, α)} / Σ_{l∈J} P_{0,kl}(X) e^{δ_l + u(X_m, ν_k, α)}.   (B.1)

First note that

∂T_j/∂δ_j = 1 − (1/ms*_j) Σ_k P^k_j(δ, X, P_{0,k}, ν_k, α)(1 − P^k_j(δ, X, P_{0,k}, ν_k, α)) d_k ≥ 1 − (1/ms*_j) Σ_k P^k_j(δ, X, P_{0,k}, ν_k, α) d_k ≥ 0,

∂T_j/∂δ_l = (1/ms*_j) Σ_k P^k_j(δ, X, P_{0,k}, ν_k, α) P^k_l(δ, X, P_{0,k}, ν_k, α) d_k ≥ 0 for l ≠ j,

and for any j = 1, ..., J − 1, Σ_l ∂T_j/∂δ_l < 1.
The extra condition that the outside option is chosen with positive unconditional choice probability is not required in the proof of Berry et al. (1995), because when the shock is supported on an unbounded space, the outside option always has a positive choice probability. The last step is also slightly different from Berry (1994), because here the unconditional choice probability P_{0,kj} appears in the denominator.

B.2 Proof of Proposition 2
Proof.
Since all three moment conditions are conditional on X_m, and since by assumption 3.2 the product characteristics X_m are independent of the random utility shocks ε^m and of the demographic distribution vector D_m, I prove the proposition conditional on the value of X_m and drop X_m from the moment condition expressions whenever there is no confusion.

Constraint on P_{0,kj}. For each market m, we observe only the market share vector ms^m = (ms^m_1, ..., ms^m_J)′ and the demographic distribution D_m = (d_{m1}, ..., d_{mK}), where d_{mk} is the share of people in demographic group k in market m. Then in market m, the observation ms^m satisfies:

ms_{mj} = Σ_{k=1}^K P^k_j(ε^m) d_{mk}, ∀ j = 1, ..., J.
If we take expectations with respect to the G distribution and the demographic distribution on both sides of the above equation, we have

E_G[ ms_{mj} − (P^1_j(ε^m), ..., P^K_j(ε^m))(d_{m1}, ..., d_{mK})′ | D_m ] = 0.

By assumption 3.2, (d_{m1}, ..., d_{mK}) ⊥ (ε^m, X_m), so E_G[P^k_j(ε^m) d_{mk} | (d_{m1}, ..., d_{mK})] = d_{mk} E_G[P^k_j(ε^m)] = d_{mk} P_{0,kj}. Using the linearity of expectation, we can rewrite the above equation as:

E[ ms_{mj} − (P_{0,1j}, ..., P_{0,Kj})(d_{m1}, ..., d_{mK})′ | D_m ] = 0.

This is the moment condition (3.8).
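The moment condition can be verified on simulated data: build model-consistent markets and check that the sample analogue of (3.8) vanishes. Everything below (the probabilities P0, the sample sizes, the shock process standing in for ε^m) is illustrative:

```python
import random

random.seed(0)

J, K, M = 2, 3, 20000
# Hypothetical unconditional choice probabilities P0[k][j].
P0 = [[0.6, 0.4], [0.5, 0.5], [0.3, 0.7]]

g_bar = [0.0] * J
for _ in range(M):
    d = [random.random() for _ in range(K)]
    s = sum(d)
    d = [x / s for x in d]                 # demographic shares in market m
    ms = [0.0] * J
    for k in range(K):
        shock = random.gauss(0.0, 0.05)    # mean-zero deviation around P0
        pk = [P0[k][0] + shock, P0[k][1] - shock]   # conditional P^k_j(eps_m)
        for j in range(J):
            ms[j] += pk[j] * d[k]          # ms_mj = sum_k P^k_j(eps_m) d_mk
    # moment g_mj = ms_mj - sum_k P0[k][j] d_mk, averaged over markets
    for j in range(J):
        g_bar[j] += (ms[j] - sum(P0[k][j] * d[k] for k in range(K))) / M

print(g_bar)   # each component close to zero
```

Because the conditional probabilities fluctuate around P0 with mean zero, the averaged moment converges to zero as M grows, which is exactly what (3.8) asserts.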
Independent ε constraint
Lemma 3.1 establishes δ^m as a function of (α, β, P_{0,kj}), so we can write ε as the difference between δ and u. The moment condition (3.9) then follows directly from the assumption ε^m ⊥ D_m in assumption (3.2).

Optimality constraint
Lastly, I derive the condition implied by the fact that P_{0,kj} solves the optimization problem (2.10). Since P_{0,kj} is uniformly bounded away from zero and one, the first order condition of (2.10) is

∫_ε e^{δ_{mj} + u(x_{mj}, ν_k, α)} / ( Σ_{l=1}^J P_{0,kl} e^{δ_{ml} + u(x_{ml}, ν_k, α)} ) dG(ε) = 1.

Note that (2.10) is a convex optimization problem, so the first order condition is sufficient to characterize the solution. The first order condition can therefore be transformed into:

E[ e^{δ_j + u(x_{mj}, ν_k, α)} / ( Σ_{l∈J} P_{0,kl} e^{δ_l + u(x_{ml}, ν_k, α)} ) − 1 ] = 0,

which is the moment condition (3.10).

C Proofs of Proposition 4
Some Notation
Fix a θ and a persuasion strategy F̃(s_ID, ε; θ). Recall that I use P̂_{0,kj,s}(x(l); θ) to denote the estimated unconditional choice probability under persuasion signal s solved from (4.4), and P̂_s(θ) to denote the vector over all indices j, k, l, s. I use P̃_{0,kj,s}(x(l); θ) to denote the true unconditional choice probability under persuasion solved from (2.10), and P̃_s(θ) to denote the corresponding vector over all j, k, l, s. I use P_0 to denote the true unconditional choice probabilities without persuasion that correspond to the moment condition (3.8), and P̂_0 to denote its estimator. I use Ĝ to denote the empirical distribution of ε̂ and G to denote the true distribution of ε. I use B_r(·) to denote a neighborhood of radius r around (·).

C.1 Some Lemmas
Assumption C.1.
Fixing the indices k, l, s, let

M({P_j}_{j=1}^J, θ) = ∫_ε Σ_{j=1}^J P_j e^{α_{kj}(x(l)) + ε_j} F̃_k(s|ε; θ) dG(ε).

The following condition holds: ∀ θ ∈ Θ and ∀ κ > 0, there exists some ζ > 0 such that

inf_{d((P_j)_{j=1}^J, (P̃_{0,kj,s}(x(l);θ))_{j=1}^J) > κ} [ M({P̃_{0,kj,s}(x(l); θ)}_{j=1}^J, θ) − M({P_j}_{j=1}^J, θ) ] > ζ.
Fixing the indices k, l, s, let

M_n({P_j}_{j=1}^J, θ) = (1/M(x(l))) Σ_{m=1}^M Σ_{j=1}^J P_j e^{α_{kj,0}(x(l)) + ε_{mj}} F̃_k(s|ε_m; θ) 1(X_m = x(l)),

where M(x(l)) = Σ_{m=1}^M 1(X_m = x(l)). Suppose the assumptions in Proposition 4 hold; then

inf_{θ∈Θ} [ M_n({P̂_{0,kj,s}(x(l); θ)}_{j=1}^J, θ) − sup_{(P_j)_{j=1}^J ∈ ∆_{J−1}} M_n({P_j}_{j=1}^J, θ) ] = −o_p(1),

where ∆_{J−1} is the (J−1)-dimensional probability simplex.
M_n differs from the objective function of (4.4) because α_{kj,0}(x(l)) is the true value of α, while (4.4) uses α̂. This lemma shows that P̂_s(θ) is also an o_p(1)-maximizer of M_n.

Proof. Define

M̂_n({P_j}_{j=1}^J, θ) = (1/M(x(l))) Σ_m Σ_j P_j e^{α̂_{kj,0}(x(l)) + ε̂_{mj}} F̃_k(s|ε̂_m; θ) 1(X_m = x(l)),

which is the objective function in (4.4), and {P̂_{0,kj,s}(x(l))}_{j=1}^J is the maximizer of this objective function on the simplex ∆_{J−1}. Let {P*_j(θ)}_{j=1}^J be the maximizer of M_n({P_j}_{j=1}^J, θ); then we have

M̂_n({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ) ≥ M̂_n({P*_j(θ)}_{j=1}^J, θ)
= (1/M(x(l))) Σ_m Σ_j P*_j(θ) e^{α̂_{kj,0}(x(l)) + ε̂_{mj}} F̃_k(s|ε̂_m; θ) 1(X_m = x(l))
= (1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(1) 1(X_m = x(l)),   (C.1)

where the function f_{mj} is defined by

f_{mj}(t) = e^{α_{kj,0}(x(l)) + t(α̂_{kj,0}(x(l)) − α_{kj,0}(x(l))) + ε_{mj} + t(ε̂_{mj} − ε_{mj})} F̃_k(s|ε_m + t(ε̂_m − ε_m); θ).

By the mean value theorem, we can find t_{mj} ∈ [0, 1] such that f_{mj}(1) = f_{mj}(0) + (f_{mj})′(t_{mj}). The derivative with respect to t is

(f_{mj})′(t) = f_{mj}(t)[ α̂_{kj,0}(x(l)) − α_{kj,0}(x(l)) + ε̂_{mj} − ε_{mj} ] + e^{α_{kj,0}(x(l)) + t(α̂_{kj,0}(x(l)) − α_{kj,0}(x(l))) + ε_{mj} + t(ε̂_{mj} − ε_{mj})} Σ_{i=1}^J (∂F̃_k/∂ε_i)(ε̂_{mi} − ε_{mi}).   (C.2)

Now I bound the term ε̂_{mj} − ε_{mj}:

|ε̂_{mj} − ε_{mj}| = | δ*_j(ms^m, D_m, X_m, α̂, P̂_0) − δ*_j(ms^m, D_m, X_m, α_0, P_0) |
= | Σ_{j,k} (∂δ*_{mj}/∂α_{kj}(l))(α̂_{kj}(l) − α_{kj}(l)) + Σ_{j,k} (∂δ*_{mj}/∂P_{0,kj}(l))(P̂_{0,kj}(l) − P_{0,kj}(l)) |
≤ J K C max_{j,k}{ max{ |α̂_{kj}(l) − α_{kj}(l)|, |P̂_{0,kj}(l) − P_{0,kj}(l)| } },   (C.3)

where the inequality holds by Assumption 4.3. Moreover, by Assumption 4.3, |∂F̃_k/∂ε_i| < C also holds. Now denote the term max_{j,k}{max{|α̂_{kj}(l) − α_{kj}(l)|, |P̂_{0,kj}(l) − P_{0,kj}(l)|}} by o*_{α,P}; combining (C.2) and (C.3), we have

|(f_{mj})′(t)| ≤ J K C² e^{α_{kj,0}(x(l)) + t(α̂_{kj,0}(x(l)) − α_{kj,0}(x(l))) + ε_{mj} + t(ε̂_{mj} − ε_{mj})} × o*_{α,P}.

Plugging f_{mj}(1) back into (C.1), we get

(1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(1) 1(X_m = x(l))
= (1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(0) 1(X_m = x(l)) + (1/M(x(l))) Σ_m Σ_j P*_j(θ)(f_{mj})′(t_{mj}) 1(X_m = x(l))
≥ (1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(0) 1(X_m = x(l)) − J K C² |o*_{α,P}| (1/M(x(l))) Σ_m Σ_j P*_j(θ) e^{α_{kj,0}(x(l)) + t(α̂_{kj,0}(x(l)) − α_{kj,0}(x(l))) + ε_{mj} + t(ε̂_{mj} − ε_{mj})} 1(X_m = x(l)).

By Lemma 4.1, |o*_{α,P}| = o_p(1), and

(1/M(x(l))) Σ_m Σ_j P*_j(θ) e^{α_{kj,0}(x(l)) + t(α̂_{kj,0}(x(l)) − α_{kj,0}(x(l))) + ε_{mj} + t(ε̂_{mj} − ε_{mj})} 1(X_m = x(l)) →_p E[ Σ_j P*_j(θ) e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l)) ] ≤ E[ Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l)) ],

where the last inequality holds because P*_j(θ) ≤ 1. The key observation is that E[Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l))] is independent of the parameter θ. Therefore

(1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(1) 1(X_m = x(l)) ≥ (1/M(x(l))) Σ_m Σ_j P*_j(θ) f_{mj}(0) 1(X_m = x(l)) − o_p(1) = sup_{(P_j)_{j=1}^J ∈ ∆_{J−1}} M_n({P_j}_{j=1}^J, θ) − o_p(1),

where the last equality holds by the definition of M_n({P_j}_{j=1}^J, θ), since {P*_j(θ)}_{j=1}^J is the maximizer of M_n({P_j}_{j=1}^J, θ). In particular, the o_p(1) term J K C² |o*_{α,P}| E[Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l))] is independent of θ, so the result of the lemma follows.

Lemma C.2. sup_{θ∈Θ, (P_j)_{j=1}^J ∈ ∆_{J−1}} | M_n({P_j}_{j=1}^J, θ) − M({P_j}_{j=1}^J, θ) | = o_p(1).

Proof.
Let ((P_j)_{j=1}^J, θ) and ((P̄_j)_{j=1}^J, θ̄) be two points in ∆_{J−1} × Θ. Then

| Σ_j P_j e^{α_{kj,0}(x(l)) + ε_{mj}} F̃_k(s|ε_m; θ) 1(X_m = x(l)) − Σ_j P̄_j e^{α_{kj,0}(x(l)) + ε_{mj}} F̃_k(s|ε_m; θ̄) 1(X_m = x(l)) |
≤_(1) [ Σ_{i=1}^{dim(θ)} |θ̄_i − θ_i| |∂F̃_k/∂θ_i| + sup_j |P_j − P̄_j| ] Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l))
≤_(2) C dim(θ) ||((P̄_j)_{j=1}^J, θ̄) − ((P_j)_{j=1}^J, θ)||_∞ Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l))
≤ C C̄ dim(θ) ||((P̄_j)_{j=1}^J, θ̄) − ((P_j)_{j=1}^J, θ)|| Σ_j e^{α_{kj,0}(x(l)) + ε_{mj}} 1(X_m = x(l)),

where ||·||_∞ is the sup norm on a vector and C̄ is a constant such that ||·||_∞ ≤ C̄||·||. Inequality (1) follows from the mean value theorem and inequality (2) follows from Assumption 4.3. Then by Theorem 2.7.11 in Wellner and van der Vaart (2013), we have the uniform convergence.
If Assumption C.1 holds, then sup_{θ∈Θ} | P̂_s(θ) − P̃_s(θ) | = o_p(1).

Proof.
Lemma C.1 and Lemma C.2 imply that sup_{θ∈Θ} | M_n({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ) − M({P̃_{0,kj,s}(x(l))}_{j=1}^J, θ) | = o_p(1). So we have

sup_θ [ M({P̃_{0,kj,s}(x(l))}_{j=1}^J, θ) − M({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ) ] ≤ sup_θ [ M({P̃_{0,kj,s}(x(l))}_{j=1}^J, θ) − M_n({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ) ] + o_p(1) = o_p(1),

where the last equality holds by Lemma C.2. By Assumption C.1, for any κ > 0 with associated ζ > 0, the event d((P̂_{0,kj,s}(x(l); θ))_{j=1}^J, (P̃_{0,kj,s}(x(l); θ))_{j=1}^J) > κ is contained in the event sup_θ [M({P̃_{0,kj,s}(x(l))}_{j=1}^J, θ) − M({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ)] > ζ, therefore

Pr( d((P̂_{0,kj,s}(x(l); θ))_{j=1}^J, (P̃_{0,kj,s}(x(l); θ))_{j=1}^J) > κ ) ≤ Pr( sup_θ [M({P̃_{0,kj,s}(x(l))}_{j=1}^J, θ) − M({P̂_{0,kj,s}(x(l))}_{j=1}^J, θ)] > ζ ) → 0.

The result follows by taking the union over the finite indices k = 1, ..., K and l = 1, ..., L. (Such a norm constant can always be found because all norms on a finite-dimensional vector space are equivalent.)

Lemma C.4. Let F̂_k(s|θ) ≡ (1/M) Σ_{m=1}^M F̃_k(s|ε̂_m; θ) and let F̃_k(s|θ) ≡ ∫ F̃_k(s|ε; θ) dG(ε). The following holds under Assumption 4.3:

sup_{θ∈Θ} | F̂_k(s|θ) − F̃_k(s|θ) | = o_p(1).

Proof.
We look at the following expansion:

| F̂_k(s|θ) − F̃_k(s|θ) | = (1/M) | Σ_{m=1}^M [F̃(s|ε̂_m; θ) − F̃(s|ε_m; θ)] + Σ_{m=1}^M [F̃(s|ε_m; θ) − E_ε(F̃(s|ε_m; θ))] |
≤ | (1/M) Σ_m Σ_j (∂F̃/∂ε_j)(ε̂_{mj} − ε_{mj}) | + | (1/M) Σ_m [F̃(s|ε_m; θ) − E_ε(F̃(s|ε_m; θ))] |
≤ (C/M) | Σ_m Σ_j (ε̂_{mj} − ε_{mj}) | + | (1/M) Σ_m [F̃(s|ε_m; θ) − E_ε(F̃(s|ε_m; θ))] |,   (C.4)

where the last inequality holds by Assumption 4.3. Now we use the expansion of ε̂_{mj} − ε_{mj} in (C.3) to get

(C/M) | Σ_m Σ_j (ε̂_{mj} − ε_{mj}) | ≤ J² K C² | max_{j,k}{ max{ |α̂_{kj}(l) − α_{kj}(l)|, |P̂_{0,kj}(l) − P_{0,kj}(l)| } } | = o_p(1).

Note that F̃(s|ε_m; θ) is a Donsker class indexed by θ by Assumption 4.3, which implies

sup_{θ∈Θ} | (1/M) Σ_m [F̃(s|ε_m; θ) − E_ε(F̃(s|ε_m; θ))] | = o_p(1).

Combining the two terms in (C.4), we get sup_θ | F̂_k(s|θ) − F̃_k(s|θ) | = o_p(1).

Lemma C.5.
sup_{θ∈Θ, j=1,...,J, k=1,...,K, l=1,...,L} | P̂_{0,kj,s}(x(l); θ) F̂_k(s|θ) − P̃_{0,kj,s}(x(l); θ) F̃_k(s|θ) | = o_p(1).

Proof. This follows directly from Lemmas C.3 and C.4.
Lemma C.6.
Consider

g*_{l,j,k}(θ, m̃s_m, D_m, X_m, P̃_s) = [ m̃s_{mj} − Σ_{k=1}^K h*_{kj}(θ, P̃_s, x(l)) d_{mk} ] d_{mk} 1(X_m = x(l)),   (C.5)

h*_{kj}(θ, P̃_s, x(l)) = Σ_s [ P̃_{0,kj,s}(x(l), θ) F̃_k(s|θ) ].   (C.6)

Equations (C.5) and (C.6) differ from (4.5) and (4.6) because (C.5) and (C.6) use the true unconditional choice probability instead of its estimator. Define

L_n(θ) = ( (1/N) Σ_{m=1}^N g*(θ) )′ W ( (1/N) Σ_{m=1}^N g*(θ) ),

where g*(θ) collects g*_{l,j,k} for all l, j, k indices. Then θ̂ is an o_p(1)-minimizer of L_n(θ), i.e. L_n(θ̂) ≤ min_θ L_n(θ) + o_p(1).

Proof. Note that θ̂ = argmin_θ L̂_n(θ), where L̂_n(θ) is the objective function of (4.7). I first denote

∆_{l,j,k}(θ) = g*_{l,j,k}(θ, m̃s_m, D_m, X_m, P̃_s) − g_{l,j,k}(θ, m̃s_m, D_m, X_m, P̂_s),

where g_{l,j,k} is defined in (4.5). Using the expressions for g*_{l,j,k} and g_{l,j,k}, we have

sup_{θ∈Θ} |∆_{l,j,k}(θ)| ≤ sup_{θ∈Θ} | Σ_{k=1}^K Σ_s ( P̂_{0,kj,s}(x(l); θ) F̂_k(s|θ) − P̃_{0,kj,s}(x(l); θ) F̃_k(s|θ) ) 1(X_m = x(l)) | = o_p(1),   (C.7)

where the convergence follows from Lemma C.5. The difference L_n(θ) − L̂_n(θ) = ∆(θ)′ W ∆(θ), where ∆(θ) = (∆_{l,j,k}(θ))_{l,j,k}. Then by (C.7),

sup_θ | L_n(θ) − L̂_n(θ) | ≤ ||∆(θ)||² max eig(W) = o_p(1).

Now I look at L_n(θ̂).
Suppose we can find θ* such that L_n(θ*) ≤ inf_{θ∈Θ} L_n(θ) + o_p(1). Then

L_n(θ̂) = L̂_n(θ̂) + [L_n(θ̂) − L̂_n(θ̂)]
≤_(1) L̂_n(θ*) + [L_n(θ̂) − L̂_n(θ̂)]
= L_n(θ*) + [L_n(θ̂) − L̂_n(θ̂)] − [L_n(θ*) − L̂_n(θ*)]
=_(2) L_n(θ*) + o_p(1)
≤ inf_{θ∈Θ} L_n(θ) + o_p(1),

where inequality (1) holds by the definition of θ̂, and equality (2) holds because both bracketed terms are o_p(1): we have shown sup_θ |L_n(θ) − L̂_n(θ)| ≤ ||∆(θ)||² max eig(W) = o_p(1).

Lemma C.7.
Let L(θ) = E[g*(θ)]′ W E[g*(θ)]. Then sup_{θ∈Θ} | L_n(θ) − L(θ) | = o_p(1).
Define the difference

∆*_{l,j,k}(θ) = (1/N) Σ_m ( m̃s_{mj} 1(X_m = x(l)) d_{mk} − E[m̃s_{mj} 1(X_m = x(l)) d_{mk}] ) + Σ_{k′=1}^K ( (1/N) Σ_m [ d_{mk′} 1(X_m = x(l)) d_{mk} − E[d_{mk′} 1(X_m = x(l)) d_{mk}] ] ) P̃_{0,k′j,s}(θ) F̃_{k′}(s|θ).

Observe that P̃_{0,k′j,s}(θ) F̃_{k′}(s|θ) ∈ [0, 1] because it is the product of two probabilities. Moreover, d_{mk} ∈ [0, 1]. Therefore we can bound ∆*(θ), the vector of ∆*_{l,j,k} over all l, j, k indices, by

||∆*(θ)|| ≤ J K L | (1/N) Σ_m ( m̃s_{mj} − E[m̃s_{mj}] ) | + J K² L max_{k,k′} | (1/N) Σ_m ( d_{mk} d_{mk′} − E[d_{mk} d_{mk′}] ) |.   (C.8)

The right-hand side of (C.8) does not depend on θ. By applying the weak law of large numbers to the sample means of m̃s_{mj} and d_{mk} d_{mk′}, we have sup_{θ∈Θ} ||∆*(θ)|| = o_p(1). Then notice that L(θ) − L_n(θ) = ∆*(θ)′ W ∆*(θ), so we have

sup_{θ∈Θ} | L(θ) − L_n(θ) | ≤ max eig(W) ||∆*(θ)||² = o_p(1).
Proof.
The consistency of θ̂ follows from the identification assumption inf_{d(θ,θ₀)>ζ} L(θ) > L(θ₀), the fact that θ̂ is an o_p(1)-minimizer of L_n by Lemma C.6, and the uniform convergence sup_{θ∈Θ} |L_n(θ) − L(θ)| = o_p(1) by Lemma C.7. The conditions of Theorem 5.7 in Van der Vaart (2000) are therefore satisfied, so θ̂ →_p θ₀.

D Discussion of Computation
The estimators in the main text are constructed in two steps. While joint estimation of (α, β, P₀, θ) is possible, the computational burden is heavy. Markets with persuasion also provide identification power for the first stage parameters (α, β, P₀), but exploiting it requires running the contraction mapping each time I search over the higher dimensional parameter space that includes θ. Moreover, the estimation of θ requires solving the empirical optimization problem (4.4) for given first stage parameters. In the two-step estimation, I simply plug in the first stage estimator and solve (4.4) for different values of θ, while in joint estimation the optimization problem has to be repeated for each guessed value of (α, β, P₀).

The computational burden also comes from the contraction mapping, because I need to iterate over M markets. I therefore use the following trick to convert the M contraction mappings into one single contraction mapping.

Proposition 5.
Let T_m(δ) : R^d → R^d be a contraction mapping for each m = 1, ..., M. Then T ≡ (T_1, ..., T_M) : R^{dM} → R^{dM} is a contraction mapping acting on (δ_1, ..., δ_M).

Proof. Let C_m < 1 be the contraction constant such that |T_m(δ) − T_m(δ′)| ≤ C_m |δ − δ′|. Then C(M) = max_m C_m < 1 is a contraction constant for T.

The computational burden of this combined T can still be high because: 1. even though iterating on a stacked vector is faster than looping over M markets, the iteration is still slow when M is large; 2. the uniform contraction constant C(M) can be close to one, so the number of iterations needed to achieve a given tolerance may be large. The following algorithm helps reduce the running time:
• Set up a tolerance level tol and a threshold integer K_thr. Run the iteration on T, and count the number of markets m such that |T_m(δ_m) − δ_m| < tol; denote this number by K_con.
• When K_con > K_thr, collect the indices of the remaining markets and construct the new contraction T′ = {T_m}_remain. Iterate until convergence.
• Multiple thresholds for deciding the remaining markets can be set up to further boost the speed.
This algorithm exploits the fact that the contraction mapping is simply a stacking of individual market mappings. The intuition is that if C_m ∈ {C_small, C_large} and the number of markets that fall into the C_large group is relatively small, the algorithm keeps iterating only on the slow C_large markets and avoids excessive iteration on markets with C_small.
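The stacking-with-dropout algorithm above can be sketched as follows. This is a simplified scalar version with names of my own; the paper's markets would use vector-valued δ_m, and here a single per-market tolerance plays the role of the thresholds:

```python
def stacked_iterate(maps, deltas, tol=1e-10, max_iter=10_000):
    """Iterate M contraction mappings jointly, freezing each market's delta
    once its own update falls below tol (the dropout trick in the text)."""
    active = set(range(len(maps)))
    for _ in range(max_iter):
        if not active:
            break
        for m in list(active):
            new = maps[m](deltas[m])
            if abs(new - deltas[m]) < tol:
                active.discard(m)      # market m has converged; stop updating it
            deltas[m] = new
        # slow markets (large C_m) keep iterating after fast ones drop out
    return deltas

# Usage: two scalar contractions with very different moduli C_m
maps = [lambda d: 0.5 * d + 1.0,      # fixed point 2.0, C_m = 0.5
        lambda d: 0.99 * d + 0.01]    # fixed point 1.0, C_m = 0.99
out = stacked_iterate(maps, [0.0, 0.0])
print(out)
```

The fast market converges in a few dozen iterations and is dropped, so the remaining ~1800 iterations needed by the slow market are not wasted on already-converged components, which is exactly the saving the proposition's algorithm targets.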