Influence via Ethos: On the Persuasive Power of Reputation in Deliberation Online
Emaad Manzoor, George H. Chen, Dokyun Lee, Michael D. Smith
{emaad, georgechen, dokyun, mds}@cmu.edu
Carnegie Mellon University
Abstract
Deliberation among individuals online plays a key role in shaping the opinions that drive votes, purchases, donations and other critical offline behavior. Yet, the determinants of opinion-change via persuasion in deliberation online remain largely unexplored. Our research examines the persuasive power of ethos – an individual’s “reputation” – using a 7-year panel of over a million debates from an argumentation platform containing explicit indicators of successful persuasion. We identify the causal effect of reputation on persuasion by constructing an instrument for reputation from a measure of past debate competition, and by controlling for unstructured argument text using neural models of language in the double machine-learning framework. We find that an individual’s reputation significantly impacts their persuasion rate above and beyond the validity, strength and presentation of their arguments. In our setting, we find that having 10 additional reputation points causes a 31% increase in the probability of successful persuasion over the platform average. We also find that the impact of reputation is moderated by characteristics of the argument content, in a manner consistent with a theoretical model that attributes the persuasive power of reputation to heuristic information-processing under cognitive overload. We discuss managerial implications for platforms that facilitate deliberative decision-making for public and private organizations online.
Keywords:
Persuasion, reputation systems, double machine-learning, causal inference from text
Preliminary draft, comments are welcome. arXiv [econ.EM].

Introduction
Deliberation, “an extended conversation among two or more people in order to come to a better understanding of some issue” (Beauchamp, 2020), forms the grease of societal decision-making machinery, lubricating consensus among participants via fair and informed debate. The process of opinion exchange in deliberation alleviates polarization, minority under-representation and several other drawbacks of consensus formation arising from non-deliberative processes (such as in majority voting without discussion) by educating potentially uninformed participants and broadening their awareness of alternative perspectives (List et al., 2013; Thompson, 2008).

An increasing amount of deliberation takes place online (Davies and Gangadharan, 2009), both on social media and on specialized platforms developed for participatory democracy, knowledge curation and software planning, among others. While going virtual broadens participation, the increased visibility of “reputation” indicators online could distort the equitability of the deliberation process. For example, Marlow et al. (2013) find that project managers on Github (used for open-source software development by Google, Facebook and Microsoft, among several other technology firms) use visible reputation indicators when evaluating users’ feature requests and critiquing developers’ code contributions. At the same time, this creates opportunities for firms to exploit reputation indicators when directly interacting with consumers online to promote sales and mitigate churn (as gaming giant Electronic Arts does on Reddit, for example). These opportunities also extend to philanthropic organizations engaged in curbing the spread of misinformation online.

Whether reputation indeed has persuasive power in online deliberation is thus an important concern, but one that is difficult to quantify due to the difficulty of recognizing opinion-change and persuasion even on the rare occasions when it does occur.
We overcome this challenge by assembling a dataset of deliberation from the ChangeMyView online argumentation platform, containing over a million debates spanning 7 years from 2013 to 2019. Strict curation by a team of over 20 moderators ensures that debates on ChangeMyView are well-informed, balanced and civil, thus satisfying the key tenets of authentic deliberation (Fishkin and Luskin, 2005). The debates in our dataset cover a variety of topics, from politics and religion to comparisons of products and brands, reflecting the diverse interests of over 800,000 ChangeMyView users. ChangeMyView users initiate debates by sharing opinions, engage in dyadic deliberation with other users that challenge their opinion, and (uniquely) provide explicit indicators of successful persuasion for each challenger that persuaded them to change their opinion. For every user persuaded, challengers earn reputation points that are prominently displayed with their username on the platform. The screenshot in Figure 1 illustrates the deliberation process and the nature of reputation indicators on ChangeMyView.

For example, see the Stanford Online Deliberation Platform: https://stanforddeliberate.org/
For example, see Wikipedia Talk pages used to discuss Wikipedia edits: https://en.wikipedia.org/wiki/Help:Talk_pages
For example, see Github Issues used to plan open-source projects: https://guides.github.com/features/issues/
ChangeMyView: http://reddit.com/r/changemyview/

[Figure 1 panel labels: poster, reputation, challenger, indicator of successful persuasion]
Figure 1: A debate.
An opinion shared by poster togtogtog (left), a response by challenger miguelguajiro (top-right) and a reply by togtogtog indicating successful persuasion with the ∆ symbol (bottom-right). Displayed above miguelguajiro’s response is their reputation (110∆), which is the number of ∆s earned previously.

We use this dataset to analyze whether an individual’s reputation impacts their persuasiveness in deliberation online, beyond the content of their arguments. Our identification strategy to answer this question draws on four key components, enabled by several unique characteristics of our dataset:

I. Within-opinion variation: We exploit the availability of multiple challengers of each opinion to analyze within-opinion variation (via opinion fixed-effects). This controls for unobserved characteristics of the opinion (such as the topic) and the poster (such as their agreeability) that may introduce biases arising from users endogenously selecting which opinions to challenge.

II. Approximating persuasive ability:
We exploit the availability of multiple persuasion attempts for each user over time to measure and control for their past (lagged) persuasion rate, as a proxy for their unobserved persuasive ability (or skill) in each debate.
III. Instrumenting for reputation:
We derive an instrument for the reputation of the challenger in each debate to address potential confounding due to unobserved challenger characteristics that vary over time, and are hence not controlled for by their past persuasion rate.

IV. Controlling for the response text:
Each challenger’s response text is the primary medium through which their persuasive ability, linguistic fluency and other major determinants of persuasion are observed by the poster. By controlling for the response text nonparametrically, we control for and address potential confounding arising from all such determinants.

We instrument for the challenger’s reputation in each debate with their average position in the sequence of responses to opinions they challenged previously (their mean past position). For a given opinion, challengers responding earlier (at lower positions) exhaust the limited space of good arguments, making it harder for challengers responding later (at higher positions) to persuade the poster. Hence, we expect challengers with higher (worse) mean past positions to have lower reputations in the present, motivating our instrument’s relevance. While users can strategically select opinions to challenge that have fewer earlier challengers, our instrument remains exogenous after controlling for the user’s present position in each debate. To further alleviate concerns of instrument validity, we derive conservative bounds on our estimates with relaxed instrument validity assumptions using the plausibly-exogenous instrumental variable framework (Conley et al., 2012).

Text plays a key role in ensuring instrument validity. All confounders of the instrument must affect both the instrument and the debate outcome (whether the poster was persuaded). To affect the debate outcome, such confounders must operate through channels observable by the poster, the most prominent of which is the text of the challenger’s response.
Hence, “controlling for” the challenger’s response text blocks the causal pathways between such confounders and the debate outcome, ensuring that they do not violate instrument validity.

To operationalize this intuition in an ideal world, we would manually annotate, measure and control for every possible characteristic of the response text that could affect the debate outcome, which is infeasible at scale. An alternative is to control for a bag-of-words (Harris, 1954) vector of the response text, assuming that functions of this vector capture all text characteristics that determine the debate outcome. However, the high dimensionality of bag-of-words representations introduces statistical difficulties that prevent consistent estimation and valid inference.

Dimensionality-reduction techniques are commonly employed to alleviate these difficulties, whether manually via hand-selected features, or automatically via inverse-regression (Taddy, 2013), topic-modeling (Blei et al., 2003; Roberts, Stewart, and Airoldi, 2016; Roberts, Stewart, and Nielsen, 2018) and neural text embeddings (Mikolov et al., 2013). However, these techniques provide no guarantees that the confounders present in the original text are retained in the low-dimensional text representation, which raises concerns of omitted variable bias. In addition, there is often little substantive theory to guide the manual feature selection process. Automated dimensionality reduction techniques, including supervised ones such as feature selection via LASSO (Tibshirani, 1996), could result in inconsistent estimates due to model misspecification and invalid confidence intervals due to feature selection uncertainty that is not accounted for in the inference procedure (Belloni et al., 2014).

This resembles the mechanism of the cable news channel position instrument used to quantify the persuasive power of Fox News on voting Republican (Martin and Yurukoglu, 2017).
A bag-of-words representation of a document is a high-dimensional vector of the frequencies of all the words it contains. Its dimensionality is the size of the vocabulary of words in the document corpus, which is typically of the order of millions.
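To make the definition above concrete, the following minimal Python sketch builds bag-of-words vectors for two made-up responses. The function name `bag_of_words`, the whitespace tokenization and the example sentences are illustrative assumptions and are not taken from the analysis itself.

```python
from collections import Counter


def bag_of_words(documents):
    """Map each document to a vector of word counts over a shared vocabulary.

    Tokenization by lowercasing and splitting on whitespace is a simplifying
    assumption made only for this illustration.
    """
    tokenized = [doc.lower().split() for doc in documents]
    # The vocabulary is every distinct word in the corpus, in sorted order.
    vocabulary = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One dimension per vocabulary word; absent words get a count of 0.
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Hypothetical responses, invented for the example.
responses = [
    "taxes fund public goods",
    "public goods need public funding",
]
vocabulary, vectors = bag_of_words(responses)
print(vocabulary)
print(vectors)
```

Even in this toy corpus, each vector has one dimension per distinct word, which suggests how quickly the dimensionality grows with a realistic vocabulary.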
We depart from the focus on dimensionality-reduction and instead incorporate the response text as a control nonparametrically, using recent advances in semiparametric inference with machine learning models. Specifically, we estimate “nuisance functions” of the response text via machine learning to predict the debate outcome, challenger reputation and instrument, and partial-out their effects in the manner of Frisch-Waugh-Lovell (Frisch and Waugh, 1933; Lovell, 1963). This procedure was introduced as early as (Robinson, 1988) for parametric nuisance functions and recently extended to nonparametric nuisance functions estimated via machine learning (Chernozhukov et al., 2018; Van der Laan and Robins, 2003). The recent extensions show that the partialling-out procedure guarantees $\sqrt{n}$-consistent and asymptotically normal estimates, as long as each estimated nuisance function converges to the true nuisance function at the rate of $n^{-1/4}$ or better.

In particular, we use a recent econometric extension of the partialling-out procedure called double machine-learning (Chernozhukov et al., 2018) to estimate a partially-linear instrumental variable specification with text as a control. For our nuisance functions, we use neural networks with rectified linear unit (ReLU) activation functions (Nair and Hinton, 2010). These neural networks pass the input text through a series of intermediate layers, each of which learns a latent “representation” that captures textual semantics at different granularities. The networks are trained via backpropagation (Rumelhart et al., 1986) with first-order gradient-based techniques (Kingma and Ba, 2015) to minimize classification or regression loss functions.
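The partialling-out logic described above can be sketched on simulated data. This is a hedged illustration under strong simplifications, not the estimation code used in this paper: the "text" is a single scalar feature `x`, the nuisance learner is a binned-mean regressor rather than a ReLU network, the outcome is continuous rather than binary, and all names (`fit_binned_mean`, `cross_fit_residuals`, `TRUE_EFFECT`) are hypothetical.

```python
import math
import random


def fit_binned_mean(xs, targets, n_bins=20):
    """Stand-in nonparametric learner: mean of the target within bins of x in [0, 1)."""
    sums, counts = [0.0] * n_bins, [0] * n_bins
    for x, t in zip(xs, targets):
        b = min(int(x * n_bins), n_bins - 1)
        sums[b] += t
        counts[b] += 1
    overall = sum(targets) / len(targets)  # fallback for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * n_bins), n_bins - 1)]


def cross_fit_residuals(xs, targets, n_folds=2):
    """Residualize the target on x; each fold is predicted by a model fit on the others."""
    n = len(xs)
    residuals = [0.0] * n
    for fold in range(n_folds):
        train = [i for i in range(n) if i % n_folds != fold]
        model = fit_binned_mean([xs[i] for i in train], [targets[i] for i in train])
        for i in range(fold, n, n_folds):
            residuals[i] = targets[i] - model(xs[i])
    return residuals


random.seed(0)
TRUE_EFFECT = 0.5  # assumed effect of the treatment (reputation) on the outcome
xs, z, r, y = [], [], [], []
for _ in range(4000):
    x = random.random()       # scalar stand-in for the response text
    u = random.gauss(0, 1)    # unobserved confounder of treatment and outcome
    zi = x ** 2 + random.gauss(0, 1)            # instrument, depends on the text
    ri = zi + 2 * x + u + random.gauss(0, 1)    # treatment
    yi = TRUE_EFFECT * ri + math.cos(3 * x) - u + random.gauss(0, 1)
    xs.append(x); z.append(zi); r.append(ri); y.append(yi)

# Partial the text out of the instrument, treatment and outcome.
z_res = cross_fit_residuals(xs, z)
r_res = cross_fit_residuals(xs, r)
y_res = cross_fit_residuals(xs, y)

# Instrumental variable estimate on the residuals, and the naive estimate for contrast.
theta_iv = sum(a * b for a, b in zip(z_res, y_res)) / sum(a * b for a, b in zip(z_res, r_res))
theta_naive = sum(a * b for a, b in zip(r_res, y_res)) / sum(a * a for a in r_res)
print(round(theta_iv, 3), round(theta_naive, 3))
```

In this simulation the residual-on-residual instrumental variable estimate should land near the assumed effect, while the naive estimate is pulled away from it by the unobserved confounder `u`.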
Though recurrent (Hochreiter and Schmidhuber, 1997) and convolutional (Kim, 2014) neural networks are more commonly used for textual prediction tasks, neural networks with ReLU activation functions come with guaranteed $n^{-1/4}$ convergence rates (Farrell et al., 2018) that enable consistent estimation and valid inference in the double machine-learning framework.

Results.
We find a significant positive effect of reputation on persuasion. Our instrumental variable estimates indicate that having 10 additional units of reputation increases the probability of persuading a poster by 1.09 percentage points. This corresponds to a 31% increase over the platform average persuasion rate of 3.5%. Since each poster successfully persuaded increases a challenger’s reputation, the long-run effect of reputation on persuasion is compounded over time. The effect remains statistically significant across a range of specifications, including ones where the instrument exclusion restriction is relaxed. Our findings counter the prevailing notion on the ChangeMyView platform that the persuasive power of reputation can be ignored.

The estimated effect of reputation on persuasion is the local average treatment effect (LATE) (Imbens and Angrist, 1994) in the population of compliers, comprising debates where the challenger’s reputation (the treatment) is affected by their mean past position (the instrument). Such challengers are less persuasive at higher (later, worse) response positions and more persuasive at lower (earlier, better) response positions. Hence, we expect debates in the complier population to involve challengers with moderate to high persuasive ability, since challengers with low persuasive ability are unlikely to be any more persuasive at any response position.

To investigate possible mechanisms for this effect, we test the predictions of a theoretical model of persuasion with information-processing shortcuts called reference cues (Bilancini and Boncinelli, 2018). We examine how the proportional effects of a challenger’s reputation and skill vary with characteristics of the opinion and response content.
Using the challenger’s response text length as a proxy for the cognitive complexity of their arguments, we find that the reputation effect share (of the total effect magnitude of reputation and skill) increases from 82% to 89% from the first to the fourth response length quantile. This suggests that posters rely more on reputation when the challenger’s arguments are cognitively complex. This is consistent with the theoretical prediction that individuals will rely more on low-effort heuristic processing (using reputation as a proxy for the quality of the challenger’s response) instead of high-effort systematic processing (directly evaluating the challenger’s response) when subject to greater cognitive overload.

The theoretical model also predicts that individuals will rely less on low-effort heuristic processing when they are more involved in the issue being debated. We test this prediction using the opinion text length as a proxy for the issue-involvement of the poster and find that the reputation effect share decreases from 90% to 83% from the second to the fourth opinion length quantile. This is consistent with the prediction that more issue-involved posters will rely less on reputation. We find similar patterns using text complexity measures (such as the Flesch-Kincaid Reading Ease) as proxies for cognitive complexity and issue-involvement, instead of the response and opinion text length. Overall, our findings are consistent with reputation serving as a reference cue and used by posters as an information-processing shortcut under cognitive overload.

We also examine how the effect of reputation on persuasion is moderated by the total number of opinion challengers.
While we expect that having more challengers will increase the cognitive burden placed on the poster (and hence push them to rely more on heuristic information-processing), we find no evidence that posters rely more on reputation as the number of opinion challengers increases. We do find evidence that challengers with higher reputation have longer conversations with posters, which could be an important mediator of the effect of reputation on persuasion. We also find evidence that challengers with higher reputation are more likely to attract collaboration from other (non-poster) users, although reputation continues to have a significant (positive) direct effect on persuasion after excluding the potential effect of such collaboration.
Contributions and related work.
Our research contributes to the economics of persuasion, the practitioners of which comprise over a quarter of the United States’ GDP (McCloskey and Klamer, 1995), including lawyers, judges, lobbyists, religious workers and salespeople. Antioch et al. (2013) revise this number to 30 percent, after including marketing, advertising and political campaigning professionals. An extensive body of past work on persuasion spans the economics, marketing and political science literature (among others), and comprises both theoretical models (Kamenica, 2018; Kamenica and Gentzkow, 2011) and empirical analyses of the efficacy of persuasive communication via field or natural experiments (see DellaVigna and Gentzkow, 2010 for a survey).

Our work differs from previous empirical studies on the economics of persuasion in three ways. First, previous work focused on identifying the existence of persuasion by quantifying the causal effect of persuasive communication on some observable behavior, without the ability to observe individual-level opinion-change. Content and persuader-based moderators of persuasion were then analyzed conditional on non-zero persuasive effects having been identified (Bertrand et al., 2010; Landry et al., 2006). In our work, the explicit indicators of persuasion provided by posters allow us to sidestep the task of identifying persuasion, and directly analyze its determinants. Second, we observe attempts at persuasion made by thousands of unique individuals, in contrast with previous work. This enables a broader investigation of the impact of persuader and content characteristics, which are predicted to play an important role by belief-based persuasion models (Kamenica and Gentzkow, 2011; Mullainathan et al., 2008; Stigler, 1961).
Finally, we observe repeated attempts at persuasion made by each individual, which enables approximating and disentangling the impact of their persuasive ability from other factors.

More specifically, our work informs persuasive information design (Kamenica, 2018) in interactive settings by quantifying the impact of extraneous signals that could serve as low-effort information-processing heuristics (Chaiken, 1989; Petty and Cacioppo, 1986; Todorov et al., 2002). Such heuristics play an increasingly important role in this era of information overload (Jones et al., 2004), as emphasized by Cialdini in his seminal book on the principles of influence (Cialdini, 2007):

"Finally, each principle is examined as to its ability to produce a distinct kind of automatic, mindless compliance from people, that is, a willingness to say yes without thinking first. The evidence suggests that the ever-accelerating pace and informational crush of modern life will make this particular form of unthinking compliance more and more prevalent in the future. It will be increasingly important for the society, therefore, to understand the how and why of automatic influence."

Interactive persuasion channels are common today, with firms adopting online channels such as live-chat to triangulate consumers’ beliefs and influence them via dialogue. Interactive channels are often preferred for defensive marketing tasks (Hauser and Shugan, 1983) such as addressing complaints and mitigating churn. Some firms invest in interaction further and embed themselves as bona fide members of influential enthusiast-run online forums.
Marketing communication designed to persuade in such channels closely resembles the dyadic deliberation we examine in our work.

Our work is also related to research on the impact of certification and reputation systems (Dranove and Jin, 2010) in markets for labor (Kokkodis and Ipeirotis, 2016; Moreno and Terwiesch, 2014), knowledge (Dev et al., 2019), and other goods and services (Hui et al., 2016; Lu and Rui, 2018; Tadelis, 2016). Consumers studied in this line of research engage in costly information-processing to evaluate [...] online (Quattrociocchi et al., 2016). By quantifying how an individual’s reliance on heuristic and systematic information-processing varies with the cognitive complexity of the persuasive message content, our findings could inform online campaigns that involve persuasive information design aimed at reducing polarization by affecting opinion-change.

Finally, our work contributes an application to the nascent study of causal inference from text, and more broadly to the literature on text as data (Gentzkow et al., 2019; Netzer et al., 2019; Toubia et al., 2019). Our setting involves text as a control (see Keith et al., 2020 for a recent survey of work in this setting). Previous approaches to accommodate text as a control (though with treatments assumed to be exogenous) include (Sridhar and Getoor, 2019) which controls for topics in the text, (Roberts, Stewart, and Nielsen, 2018) which assumes a structural topic model (Roberts, Stewart, and Airoldi, 2016) of text and controls for its sufficient reduction (Taddy, 2013), and (C.
Shi et al., 2019; Veitch et al., 2019) which incorporate neural language models of text in the targeted learning inference framework (Van der Laan and Rose, 2011).

Our work also links the social science literature on persuasion with the computational natural language processing literature on argument-mining (Lippi and Torroni, 2016), where online argumentation platforms have been extensively studied (Atkinson et al., 2019; Jo et al., 2018; Luu et al., 2019; Srinivasan et al., 2019; Tan et al., 2016).

Outline.
We begin in Section 2 by introducing background, formalizing our conceptual framework and motivating our hypotheses. We then describe our dataset in Section 3 and detail our empirical strategy in Section 4, including a description of our estimation procedure and evidence supporting the validity of our instrument. We discuss our results in Section 5 and interpret them through the lens of a theoretical model of persuasion. We conclude by summarizing our findings, discussing managerial implications for platforms facilitating online deliberation for public and private organizations, and noting the limitations of our research in Section 6.

Background and Conceptual Framework
The ChangeMyView online argumentation platform was created in January, 2013 to foster good-faith discussions on polarizing issues and has received praise for helping combat the proliferation of echo chambers online. In this section, we formalize the process of deliberation on ChangeMyView and describe important platform features to motivate our empirical analyses in Section 4.

Opinion posters, opinion challengers and debates.
Our unit of analysis is a debate. Each debate is associated with an opinion shared by an opinion poster, which is titled with the poster’s primary claim and contains at least 500 characters of supporting arguments. A response to the opinion by a challenger initiates a debate between the poster and challenger. Other users can (but rarely do) join the ongoing discussion between a poster and a challenger with their own comments; we term such debates multi-party. Debates must follow several rules (detailed in Appendix A) enforced by over 20 moderators. Notable rules are: (i) the poster must personally hold a non-neutral opinion, (ii) the poster must engage with all challengers for at least 3 hours after sharing their opinion, and (iii) a challenger’s response must counter at least one claim made by the poster. Responses to an opinion are ordered chronologically and popularity votes on responses are hidden for the first 24 hours after an opinion is shared. These rules mitigate popularity biases, irrelevant digressions and hostility.
Opinion selection by users.
The titles of posted opinions and the identities of the posters who shared them are displayed in a paginated list on the platform’s homepage, ordered by a combination of recency and popularity votes. A tab on the homepage also allows users to order opinions by recency only. Clicking on an opinion title opens a new page displaying the opinion text and any ongoing or concluded debates between the poster and other challengers. Users could select opinions to challenge based on various factors such as the opinion text, their own topical preferences, the poster’s identity, and the number and status of the debates between the poster and other challengers.

The ∆-system. In mid-February, 2013, ChangeMyView introduced a reputation system called the ∆-system to incentivize challenging opinions on the platform. At any point in a debate, the poster may reply to the challenger indicating that their opinion has changed using the ∆ symbol or equivalent alternatives. We term debates where the poster awarded a ∆ to the challenger as successful and opinions that led to at least one successful debate as conceded. Due to the platform rules requiring active engagement, 98% of the ∆s from the poster in our dataset were awarded within 24 hours of the opinion being posted, with over 50% being awarded within just 90 minutes. This short delay reduces concerns of opinion-change occurring due to channels external to the debate. Each awarded ∆ grants the challenger a reputation point. Other non-poster users can (but rarely do) also award ∆s to any challenger and contribute to their reputation. The total reputation points earned previously, if non-zero, are displayed next to the challenger’s username with all of their responses on the platform.

“Civil discourse exists in this small corner of the internet”, The Atlantic, December 30, 2018.
Specifically, in decreasing order of the score: sign(upvotes − downvotes) log|upvotes − downvotes| + post-datetime / [...].

The poster’s decision.
Consider an opinion $p$ that is challenged by user $u$. The poster observes $u$’s username, reputation $r_{pu}$ and the text of their immediate response to the opinion. Based on this information, the poster may initiate a discussion with the challenger, elicit additional responses (which we do not model) and eventually award a ∆ if persuaded to change their opinion. We model the decision of the poster of opinion $p$ to award a ∆ to challenger $u$ as a function of an opinion-specific threshold $\tau_p$ and the perceived quality $\tilde{q}_{pu}$ of $u$’s response:

\[ Y_{pu} = \mathbb{I}[\tilde{q}_{pu} > \tau_p], \qquad \tilde{q}_{pu} = \alpha_r r_{pu} + \alpha_q q_{pu}, \qquad \alpha_r + \alpha_q = 1, \tag{1} \]

where $\mathbb{I}[x] = 1$ if $x$ is true and $\mathbb{I}[x] = 0$ otherwise. Here, $Y_{pu} = 1$ is the observed debate outcome if the poster awarded a ∆ to $u$ and $Y_{pu} = 0$ otherwise. The unobserved threshold $\tau_p$ encodes opinion-specific characteristics such as the opinion topic and the poster’s openness to persuasion. Based on (Bilancini and Boncinelli, 2018; Dewatripont and Tirole, 2005), we model the perceived quality $\tilde{q}_{pu}$ as a weighted linear combination of the challenger’s reputation $r_{pu}$ and the “true” response quality $q_{pu}$, which the poster can determine by evaluating the challenger’s response at some cognitive cost. Posters choose $\alpha_r$ and $\alpha_q$ endogenously based on this cognitive cost and their reliance on heuristic and systematic information-processing (Chaiken, 1989; Petty and Cacioppo, 1986). If $\alpha_r > 0$, reputation in this model serves as a reference cue: a proxy for the true response quality that can be processed with lesser effort than evaluating $q_{pu}$ directly.

“True” response quality. We model the true response quality $q_{pu}$ as a function of the user’s “skill” $s_{pu}$ at the time they challenged opinion $p$ and their position $t_{pu}$ in the sequence of challengers of opinion $p$. $t_{pu}$ captures the overall impact of previous challengers’ responses.
For example, challengers responding earlier could exhaust the limited space of good arguments, making it harder for later challengers to respond with arguments of similar quality. We formalize this as follows:

\[ t_{pu} = \sum_{u'} \mathbb{I}[u' \text{ challenged opinion } p \text{ before } u], \qquad q_{pu} = \gamma_s s_{pu} + \gamma_t t_{pu}. \tag{2} \]

We approximate $u$’s skill by the Laplace-smoothed (Manning et al., 2008) fraction of posters persuaded before opinion $p$, where $p$ is chronologically-ordered and $S_{p'u} = 1$ if $u$ challenged opinion $p'$: $s_{pu} = \sum_{p' \ldots}$
H2. The relative persuasive power of reputation, $\frac{\alpha_r}{\alpha_r + \alpha_q}$, increases as the cognitive cost of processing the challenger’s response increases.

H3. The relative persuasive power of reputation, $\frac{\alpha_r}{\alpha_r + \alpha_q}$, decreases as the involvement of the poster in the debated issue increases.

Confirming (H1) indicates that reputation has persuasive power, and confirming (H2) and (H3) lends support to the mechanism proposed by the model of (Bilancini and Boncinelli, 2018).
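The decision model in equations (1) and (2) can be traced through with a short Python sketch. All numeric values below are hypothetical and chosen only for illustration; in particular, the negative sign of `gamma_t`, the normalized reputation scale and the threshold are assumptions, not estimates from our data.

```python
def true_quality(skill, position, gamma_s, gamma_t):
    """Equation (2): true response quality from skill and position."""
    return gamma_s * skill + gamma_t * position


def perceived_quality(reputation, quality, alpha_r):
    """Equation (1): weighted combination with alpha_r + alpha_q = 1."""
    alpha_q = 1.0 - alpha_r
    return alpha_r * reputation + alpha_q * quality


def awards_delta(reputation, skill, position, threshold, alpha_r,
                 gamma_s=1.0, gamma_t=-0.05):
    """Return 1 if perceived quality exceeds the opinion-specific threshold, else 0."""
    quality = true_quality(skill, position, gamma_s, gamma_t)
    return int(perceived_quality(reputation, quality, alpha_r) > threshold)


# The same response (skill 0.4, three earlier challengers) under different settings.
# A poster who ignores reputation (alpha_r = 0) is not persuaded.
ignores_reputation = awards_delta(1.0, 0.4, 3, threshold=0.5, alpha_r=0.0)
# A poster who weighs reputation (alpha_r = 0.5) is persuaded by a high-reputation challenger...
high_reputation = awards_delta(1.0, 0.4, 3, threshold=0.5, alpha_r=0.5)
# ...but not by an otherwise identical challenger with no reputation.
no_reputation = awards_delta(0.0, 0.4, 3, threshold=0.5, alpha_r=0.5)
print(ignores_reputation, high_reputation, no_reputation)
```

The contrast between the last two calls is the sense in which reputation acts as a reference cue when $\alpha_r > 0$: the response itself is unchanged, yet the outcome differs.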
We collect all the discussions on the ChangeMyView platform between January, 2013 and October, 2019 using a combination of the official Reddit API and the third-party PushShift API (Baumgartner et al., 2020), in full compliance with their terms of service. We exclude submissions to ChangeMyView that are not opinions using the fact that opinion titles are required to be prefixed with “CMV:”. The excluded submissions encompass discussions about the platform, announcements of platform changes and celebrations of milestones. We also exclude the opinions and responses posted to ChangeMyView before the reputation system became fully functional on March 1, 2013.

Figure 2: Reputation and skill.
We extract indicators of successful persuasion from the debate text using the same extraction rules employed by ChangeMyView to programmatically parse ∆s and other alternative symbols. We use the extracted indicators to label debate success, to reconstruct each challenger’s reputation and to measure each challenger’s skill in each debate. Figure 2 shows the empirical variation in skill with reputation in our dataset, with each point indicating the reputation and skill for each challenger measured in a single debate, colored based on the number of debates they participated in previously. At values of skill outside the low and high extremes, there is a wide variation in the reputation (r = 0.[...], p < [...]). This variation is essential to disentangle the effects of reputation and skill on persuasion.

Code obtained from: https://github.com/alexames/DeltaBot

[...] s µ (based on equation 3) is likely to attenuate our estimates due to measurement error. Hence, we exclude all 118,277 such debates from our dataset. Our final dataset contains 91,730 opinions (23.5% of them conceded) shared by 60,573 unique posters, which led to 1,026,201 debates (3.5% of them successful) with 143,891 unique challengers. Table 1 reports descriptive statistics of our dataset, and Figure 3 reports user-level distributions of participation and debate success. Table 2 summarizes the notation that we will use in all subsequent sections.

                                               Mean        Standard Deviation   Median
Statistics of challengers in each debate
  Reputation r_pu
  Skill s_pu (%)                               3.0         3.7                  1.6
  Position t_pu
  Mean past position Z_pu
  Σ_{p′ [...]

Number of opinions                             91,730
Opinions conceded                              21,576
Opinions leading to more than 1 debate         84,998      (number of clusters with opinion fixed-effects)

Number of debates                              1,026,201
Successful debates                             36,187
Multi-party debates                            348,041
Number of debates per opinion                  11.2        12.7                 9
Successful debates per opinion                 0.4         0.9                  0
Number of unique posters                       60,573
Opinions per poster                            1.5         2.4                  1
Number of unique challengers                   143,891
Challengers with more than 1 debate            64,871      (number of clusters with user fixed-effects)
Number of debates per challenger               7.1         58.5                 1
Successful debates per challenger              0.3         3.2                  0

Table 1: Descriptive Statistics. Debates from March 1, 2013 to October 10, 2019.
[Figure 3: two panels, number of users against number of debates, and number of users against number of successful debates.]

Figure 3: Debate participation and success. Distribution of total and successful debates per user.

For completeness, we also report our main results including debates with deleted challengers in Appendix C.

Symbol        Description
$p$           Chronological opinion index
$u$           Chronological response index
$pu$          Tuple representing a debate: the $u$th response to opinion $p$
$\tau_p$      Opinion fixed-effect; captures unobserved opinion characteristics
$\rho_u$      Challenger fixed-effect; captures unobserved challenger characteristics
$r_{pu}$      Reputation of the challenger in debate $pu$; sum of the past ∆s earned
$s_{pu}$      Skill of the challenger in debate $pu$; smoothed lagged persuasion rate
$t_{pu}$      Position of the challenger in debate $pu$
$m_{pu}$      Calendar month-year fixed-effect for debate $pu$
$X_{pu}$      Vector representation of the text of the challenger’s immediate response in debate $pu$
$Y_{pu}$      Binary outcome of debate $pu$; $Y_{pu} = 1$ for successful debates, $Y_{pu} = 0$ otherwise
$S_{pu}$      Binary opinion selection indicator; $S_{pu} = 1$ if $u$ challenged opinion $p$, $S_{pu} = 0$ otherwise
$Z_{pu}$      Instrument (mean past position) for the challenger’s reputation in debate $pu$

Table 2: Summary of notation. List of recurring symbols introduced in Sections 2 and 4.
Equations (1) and (2) motivate baseline specifications that relate the observed debate outcome $Y_{pu}$ to the challenger’s reputation $r_{pu}$, skill $s_{pu}$ and position $t_{pu}$ as follows (constants omitted for brevity):

$$Y^*_{pu} = \tau_p + \beta_1 r_{pu} + \beta_2 s_{pu} + \beta_3 t_{pu} + \epsilon_{pu}, \qquad E[\epsilon_{pu} \mid \tau_p, r_{pu}, s_{pu}, t_{pu}] = 0$$

$$Y_{pu} = I[Y^*_{pu} > 0]$$

Here, $\tau_p$ is an opinion fixed-effect and $\epsilon_{pu}$ is an error term with zero conditional mean. Since the fixed-effects are at the opinion level and skill (a function of lagged dependent variables) is at the user level, these are not dynamic panel specifications, and are hence unaffected by Nickell bias (Nickell, 1981). Including the opinion fixed-effects excludes 6,732 debates from the sample, which were the only responses to their respective opinions. If distributional assumptions (such as Gumbel or Gaussian) on $\epsilon_{pu}$ hold and there are no unobserved confounders, the estimate of $\beta_1$ quantifies the change in the probability of persuading the poster of opinion $p$ upon increasing the challenger’s reputation by one unit, with all else equal. In Section 5, we report estimates from logistic and linear probability models.

While the assumption of no unobserved confounding is restrictive (and relaxed in Section 4.2), the baseline specifications address two important sources of confounding. First, controlling for the challenger’s skill controls for all challenger characteristics that affect persuasion (such as their rhetorical ability and linguistic fluency) and that do not vary with their tenure on ChangeMyView. To see why such characteristics confound the effect of reputation on persuasion, note that a user’s reputation largely depends on the number of posters persuaded previously: $r_{pu} \approx \sum_{p'} \ldots$
However, skill does not capture challenger characteristics that vary with their tenure on ChangeMyView. By assuming the absence of such characteristics, the baseline specifications implicitly assume that users do not learn to be more persuasive with experience on the platform. We provide empirical evidence to support this assumption by estimating the following linear probability model: $Y_{pu} = \rho_u + m_{pu} + \theta \sum_{p'} \ldots$

Dependent Variable: $I[u \text{ challenges} > 1 \text{ future opinion}]$
Reputation $r_{pu}$ (10 units)            . ( . ) ∗∗∗
User fixed-effects ($\rho_u$)             ✓
Month-year fixed-effects ($m_{pu}$)       ✓
No. of debates                            ,
$R^2$                                     .
Note: Standard errors displayed in parentheses. ∗∗∗ p < . ; ∗∗ p < . ; ∗ p < .

Table 4: Reputation/opinion selection correlation.
The shaded nodes $r_{pu}$, $Y_{pu}$ and $S_{pu}$ correspond to the reputation, debate outcome and opinion selection indicator respectively. The unshaded node $U_p$ is any unobserved opinion characteristic that could directly affect both opinion selection and debate success, such as the opinion topic.

In the causal graph in Figure 4, $S_{pu}$ is a collider. A collider is any node $C$ that is a common outcome in causal substructures of the form $X \to C \leftarrow Y$. Conditioning on $C$ opens a causal pathway between $X$ and $Y$ that would otherwise be blocked. If reputation is correlated with opinion selection (depicted by the undirected edge $r_{pu} \leftrightarrow S_{pu}$), conditioning on the collider $S_{pu}$ (which our specifications do implicitly) opens the confounding causal pathway $U_p \to S_{pu} \leftrightarrow r_{pu}$. This confounds the effect of reputation on the debate outcome, since $U_p$ now affects both $Y_{pu}$ and $r_{pu}$ (via $S_{pu}$).

We test for correlation between $r_{pu}$ and $S_{pu}$ by estimating the following linear probability model of a user challenging more than one opinion after opinion $p$:

$$I[u \text{ challenges} > 1 \text{ future opinion}] = \rho_u + m_{pu} + \theta r_{pu} + \epsilon_{pu}$$

where $\rho_u$ is a user fixed-effect, $m_{pu}$ is a calendar month-year fixed-effect and $\epsilon_{pu}$ is a Gaussian error term. The estimate of $\theta$ in Table 4 suggests a significant positive correlation between $r_{pu}$ and $S_{pu}$. This correlation may arise either because users that were successful in the past (and hence have higher reputation) are more likely to challenge opinions in the future, or because more active users are likely to have higher reputation (a mechanical relationship). Fortunately, the opinion fixed-effect $\tau_p$ controls for all opinion characteristics, including the unobserved $U_p$, thus addressing potential confounding.

In summary, our baseline specifications address potential confounding due to (i) time-invariant challenger characteristics that affect persuasion, and (ii) users endogenously selecting which opinions to challenge.
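The collider argument can be illustrated with a small simulation. The sketch below is not part of the original analysis: the variable names, the selection rule and the threshold are hypothetical. It draws a reputation-like variable and an unobserved opinion characteristic independently, lets selection depend on both, and then compares their correlation in the full sample with their correlation among selected observations only.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical stand-ins: independent by construction.
reputation = rng.normal(size=n)      # plays the role of r_pu
topic_appeal = rng.normal(size=n)    # plays the role of the unobserved U_p

# Selection (the collider S_pu) depends on both variables plus noise.
selected = (reputation + topic_appeal + rng.normal(size=n)) > 1.0

# Correlation in the full sample versus among selected observations only.
corr_all = np.corrcoef(reputation, topic_appeal)[0, 1]
corr_selected = np.corrcoef(reputation[selected], topic_appeal[selected])[0, 1]

print(f"full sample: {corr_all:+.3f}, selected only: {corr_selected:+.3f}")
```

In this toy setup the two variables are uncorrelated overall but become negatively correlated once we restrict attention to selected observations, which is the kind of spurious dependence that the opinion fixed-effect is meant to absorb.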
In the next section, we introduce specifications that instrument for the challenger’s reputation in each debate. The instrumental variable specifications inherit the robustness of the baseline specifications to confounding from time-invariant challenger characteristics and endogenous opinion selection, while further addressing potential confounding due to time-varying challenger characteristics that affect debate success.

4.2 Instrumental Variable Specifications

Our instrumental variable specifications address confounding due to unobserved user characteristics that affect persuasion and vary with their experience on the platform. Estimates from these specifications quantify the local average treatment effect (LATE) of reputation on debate success if instrument relevance, exogeneity, exclusion and monotonicity hold (Imbens and Angrist, 1994). In this section, we derive our instrument and provide empirical evidence to support its validity.

Our instrument is motivated by the fact that a user’s reputation largely depends on the number of posters persuaded previously, since other users who are not the poster rarely award ∆s: $r_{pu} \approx \sum_{p'} \ldots$

Mean past position as an instrument for reputation.
An immediate concern is users selecting opinions to challenge based on their anticipated position in the sequence of challengers, since users can observe the number of ongoing and concluded debates with the poster before deciding to challenge an opinion. We characterize this scenario using the causal graph in Figure 5, which extends the causal graph in Figure 4 with shaded nodes $Z_{pu}$ (for the instrument) and $t_{pu}$ (for the challenger’s present position). $t_{pu}$ affects the debate outcome $Y_{pu}$, based on equation (2). Recall from Section 4.1 that our specifications implicitly condition on the collider $S_{pu} = 1$. If the instrument is correlated with opinion selection (depicted by the undirected edge $Z_{pu} \leftrightarrow S_{pu}$) and users select opinions to challenge based on their anticipated position (depicted by the edge $t_{pu} \to S_{pu}$), conditioning on $S_{pu}$ will open the confounding causal pathway $t_{pu} \to S_{pu} \leftrightarrow Z_{pu}$. Hence, it is essential to control for the challenger’s present position $t_{pu}$, which could otherwise confound the instrument.

The causal graph in Figure 5 reveals a second source of instrument confounding that has received recent attention (Hughes et al., 2019; Swanson, 2019). If the instrument is correlated with opinion selection (depicted by the undirected edge $Z_{pu} \leftrightarrow S_{pu}$) and some unobserved opinion characteristic $U_p$ (such as the opinion topic) affects both opinion selection and debate success, conditioning on $S_{pu}$ opens the confounding causal pathway $U_p \to S_{pu} \leftrightarrow Z_{pu}$ that violates instrument exogeneity.

We can test for correlation between the instrument and opinion selection by estimating the following linear probability model of a user challenging more than one opinion after opinion $p$, where $\rho_u$ is a user fixed-effect, $m_{pu}$ is a calendar month-year fixed-effect and $\epsilon_{pu}$ is a Gaussian error term:

$$I[u \text{ challenges} > 1 \text{ future opinion}] = \rho_u + m_{pu} + \theta_1 Z_{pu} + \theta_2 r_{pu} + \epsilon_{pu}$$

The estimate of $\theta_1$ in Table 6 suggests a small but significant negative correlation between $Z_{pu}$ and $S_{pu}$, justifying our concerns of endogenous opinion selection violating instrument exogeneity. Fortunately (as discussed in Section 4.1), the opinion fixed-effect $\tau_p$ controls for all opinion characteristics, including unobserved $U_p$. This alleviates concerns of instrument exogeneity being violated due to endogenous opinion selection.

[Figure 5: causal graph over the nodes $Y_{pu}$, $r_{pu}$, $U_p$, $S_{pu}$, $t_{pu}$ and $Z_{pu}$.]

Figure 5: Instrument confounding via opinion selection.

Dependent Variable: $I[u \text{ challenges} > 1 \text{ future opinion}]$
Mean past position $Z_{pu}$               − . ( . ) ∗∗∗
Reputation $r_{pu}$ (10 units)            . ( . ) ∗∗∗
User fixed-effects ($\rho_u$)             ✓
Month-year fixed-effects ($m_{pu}$)       ✓
No. of debates                            ,
$R^2$                                     .
Note: Standard errors displayed in parentheses. ∗∗∗ p < . ; ∗∗ p < . ; ∗ p < .

Table 6: Instrument/opinion selection correlation.
Another plausible concern is that the instrument affects the debate outcome via channels that do not include the user’s reputation, which would violate the instrument exclusion restriction. For example, if users learn to be more persuasive from the earlier challengers of an opinion, a user with a high mean past position could be more persuasive in the present than one with a low mean past position.

[Causal graph on the right: nodes $r_{pu}$, $Z_{pu}$, $Y_{pu}$, $V$ and $X_{pu}$, with the text $X_{pu}$ decomposed into components $a$, $b$, $c$ and $d$.]

We address this concern in two ways. First, note that any user characteristic correlated with successful persuasion is likely to affect the debate outcome through the text of their responses. Hence, controlling for the response text will block direct channels of influence between the instrument and the debate outcome. This is formalized by the causal graph on the right. Here, the reputation $r_{pu}$, debate outcome $Y_{pu}$, response text $X_{pu}$ and instrument $Z_{pu}$ are observed. $V$ contains all unobserved confounders of the instrument or reputation (or both) that affect the outcome through the text $X_{pu}$. If we decompose the text into conceptual components $a$, $b$, $c$ and $d$, it is sufficient to control for $a$ to block the $Z_{pu} \leftrightarrow V \to a \to Y_{pu}$ causal pathway.

We operationalize this idea by estimating the following partially-linear instrumental variable specification with endogenous $r_{pu}$, as formulated by (Chernozhukov et al., 2018):

$$Y_{pu} = \beta_1 r_{pu} + \beta_2 s_{pu} + \beta_3 t_{pu} + g(\tau_p, X_{pu}) + \epsilon_{pu}, \qquad E[\epsilon_{pu} \mid Z_{pu}, \tau_p, s_{pu}, t_{pu}, X_{pu}] = 0$$

$$Z_{pu} = \alpha_1 s_{pu} + \alpha_2 t_{pu} + h(\tau_p, X_{pu}) + \epsilon'_{pu}, \qquad E[\epsilon'_{pu} \mid \tau_p, s_{pu}, t_{pu}, X_{pu}] = 0$$

In this specification, the high-dimensional covariates $\tau_p$ (the opinion fixed-effects) and $X_{pu}$ (a vector representation of $u$’s response text) have been moved into the arguments of the “nuisance functions” $g(\cdot)$ and $h(\cdot)$. As earlier, $r_{pu}$ is $u$’s reputation, $s_{pu}$ is $u$’s skill, $t_{pu}$ is $u$’s position and $Z_{pu}$ (the instrument) is the mean past position of $u$ before opinion $p$. $\epsilon_{pu}$ and $\epsilon'_{pu}$ are error terms with zero conditional mean. $\beta_1$ is the parameter of interest, quantifying the causal effect of reputation on persuasion.

No distributional assumptions are placed on $\epsilon_{pu}$ and $\epsilon'_{pu}$, and hence, this specification does not assume any functional form (in contrast with logit, probit and linear probability models). $g(\cdot)$ and $h(\cdot)$ can be flexible nonparametric functions. We discuss estimation and inference in Section 4.3.

Second, we use the “plausibly exogenous” instrumental variable framework (Conley et al., 2012) to relax the instrument exclusion restriction and include $Z_{pu}$ directly in the debate outcome model:

$$Y_{pu} = \tau_p + \beta_1 r_{pu} + \beta_2 s_{pu} + \beta_3 t_{pu} + \gamma Z_{pu} + \epsilon_{pu}, \qquad E[\epsilon_{pu} \mid Z_{pu}, \tau_p, s_{pu}, t_{pu}] = 0$$

$$Z_{pu} = \tau_p + \alpha_1 s_{pu} + \alpha_2 t_{pu} + \epsilon'_{pu}, \qquad E[\epsilon'_{pu} \mid \tau_p, s_{pu}, t_{pu}] = 0$$

where $\gamma \neq 0$ encodes by how much the exclusion restriction is violated. For a fixed $\gamma$ and conditionally exogenous instrument, the effect of reputation on debate success can be quantified via two-stage least-squares estimation of the following regression, using $Z_{pu}$ as an instrument for $r_{pu}$:

$$(Y_{pu} - \gamma Z_{pu}) = \tau_p + \beta_1 r_{pu} + \beta_2 s_{pu} + \beta_3 t_{pu} + \epsilon_{pu}$$

If users indeed learn to be more persuasive from earlier challengers, we would expect $\gamma > 0$. We report estimates of $\beta_1$ from the specification above for a range of $\gamma$ values in Section 5.

While instrument relevance, exogeneity and exclusion are sufficient to guarantee identification of the effect of reputation on debate success, we also require instrument monotonicity to interpret our estimate as a local average treatment effect (LATE) (Imbens and Angrist, 1994).
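The plausibly exogenous regression above can be sketched on simulated data. This is a simplified illustration and not the estimation code used for the reported results: it omits the fixed-effects and the skill and position controls, uses a single instrument, and all data-generating values (`beta_true`, `gamma_true`, the coefficients in the simulation) are hypothetical.

```python
import numpy as np

def iv_slope(y, r, z):
    """Just-identified IV (equivalently 2SLS) slope of y on r, instrumenting r with z."""
    z_centered = z - z.mean()
    return float(z_centered @ (y - y.mean()) / (z_centered @ (r - r.mean())))

def plausibly_exogenous_grid(y, r, z, gammas):
    """IV estimate of beta from the adjusted outcome (y - gamma * z), for each assumed gamma."""
    return {float(g): iv_slope(y - g * z, r, z) for g in gammas}

# Simulated data in which the instrument also affects the outcome directly.
rng = np.random.default_rng(0)
n = 200_000
beta_true, gamma_true = 0.5, 0.2
u = rng.normal(size=n)                      # unobserved confounder of r and y
z = rng.normal(size=n)                      # instrument
r = -0.8 * z + u + rng.normal(size=n)       # endogenous treatment
y = beta_true * r + gamma_true * z + u + rng.normal(size=n)

estimates = plausibly_exogenous_grid(y, r, z, gammas=[0.0, 0.1, 0.2, 0.3])
for gamma, beta_hat in estimates.items():
    print(f"gamma = {gamma:.1f}: beta_hat = {beta_hat:.3f}")
```

In this simulation, the estimate obtained under the true value of $\gamma$ recovers the true effect, while the estimate that imposes the exclusion restriction ($\gamma = 0$) does not. Reporting estimates over a grid of $\gamma$ values shows how sensitive the conclusion is to the assumed degree of violation.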
Instrument monotonicity will be violated if there exists a subpopulation of debates where increasing the mean past position of the challenger would increase their present reputation, and decreasing their mean past position would decrease their present reputation (members of this subpopulation are called defiers). Such challengers are more likely to persuade a poster when they respond later. The large and precisely-estimated negative within-user correlation between the number of earlier challengers and debate success ( $\hat{\theta}$ = − . ± . ) in Table 3 suggests that the existence of such challengers is unlikely.

The LATE is the effect of reputation on debate success for compliers, comprising debates where the challenger’s reputation is indeed affected by their mean past position. The challengers in these debate subpopulations are more persuasive at earlier (lower) positions, and less persuasive at later (higher) positions. Hence, we expect that the compliers exclude debates with challengers having low persuasive ability, who are unlikely to be more or less persuasive in any position. We also expect challengers with moderate to high persuasive ability to benefit more from an increase in their reputation than challengers with low persuasive ability, since a high reputation is unlikely to substitute for low persuasive ability. Hence, we expect the LATE to be larger than the average treatment effect of reputation on debate success.

(Conley et al., 2012) proposes four inference strategies that incorporate plausibly exogenous instruments. The inference strategy we use relies on the fewest assumptions and provides the most conservative estimates of $\beta_1$.

4.3 Estimation and Inference

Our baseline linear probability model and linear instrumental variable specifications can be estimated using ordinary least-squares and two-stage least-squares respectively, adapted to accommodate high-dimensional fixed-effects (Correia, 2016).
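As a rough illustration of two-stage least-squares with fixed-effects, the sketch below absorbs a single set of group fixed-effects by demeaning within groups (the within transformation) and then solves the just-identified 2SLS problem. This is a simpler stand-in for the high-dimensional fixed-effects estimator of (Correia, 2016), not a reimplementation of it; the function names and the simulated data are hypothetical.

```python
import numpy as np

def demean_by_group(values, groups):
    """Subtract each group's mean (groups are integer codes 0..G-1, all non-empty)."""
    sums = np.bincount(groups, weights=values)
    counts = np.bincount(groups)
    return values - (sums / counts)[groups]

def fe_2sls(y, r, controls, z, groups):
    """2SLS of y on endogenous r and exogenous controls, instrumenting r with z,
    after absorbing one set of group fixed-effects by within-group demeaning."""
    def within(*arrays):
        return np.column_stack([demean_by_group(a, groups) for a in arrays])
    regressors = within(r, *controls)
    instruments = within(z, *controls)   # z plus the exogenous controls
    y_within = demean_by_group(y, groups)
    # Just-identified case: coefficients solve (W'X) b = W'y.
    return np.linalg.solve(instruments.T @ regressors, instruments.T @ y_within)

# Simulated debates grouped by opinion; all coefficients below are hypothetical.
rng = np.random.default_rng(0)
n, n_groups = 60_000, 3_000
groups = np.arange(n) % n_groups
tau = rng.normal(size=n_groups)[groups]      # opinion fixed-effect
u = rng.normal(size=n)                       # unobserved confounder
s = rng.normal(size=n)                       # exogenous control (e.g. skill)
z = tau + rng.normal(size=n)                 # instrument, correlated with tau
r = -0.8 * z + 0.5 * s + tau + u + rng.normal(size=n)
y = 0.5 * r + 0.3 * s + 2.0 * tau + u + rng.normal(size=n)

coef = fe_2sls(y, r, [s], z, groups)
print(coef)  # first entry: effect of r; second entry: effect of s
```

With one set of fixed-effects, demeaning is equivalent to including a dummy variable per group; with several sets of fixed-effects, iterative procedures such as the one cited in the text are needed instead.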
In this section, we describe how the double machine-learning framework (Chernozhukov et al., 2018) can be used to consistently estimate the effects of reputation, skill and position in the partially-linear instrumental variable specification.

Double machine-learning extends the partialling-out procedure of Frisch-Waugh-Lovell (Frisch and Waugh, 1933; Lovell, 1963) to use flexible nonparametric functions estimated via machine learning. We first describe the basic setup assuming reputation is conditionally exogenous (given the response text), ignoring the opinion fixed-effects, and ignoring the effects of skill and position. Consider the following partially-linear probability model:

$$Y_{pu} = \beta r_{pu} + g(X_{pu}) + \epsilon_{pu}, \qquad E[\epsilon_{pu} \mid r_{pu}, X_{pu}] = 0$$

$$r_{pu} = h(X_{pu}) + \epsilon'_{pu}, \qquad E[\epsilon'_{pu} \mid X_{pu}] = 0$$

where $Y_{pu}$ is the debate outcome, $r_{pu}$ is the challenger’s reputation, $X_{pu}$ is their response text and $\epsilon_{pu}$ and $\epsilon'_{pu}$ are Gaussian error terms with zero conditional mean. $g(\cdot)$ and $h(\cdot)$ are unknown nonparametric functions. We are interested in consistently estimating and performing valid inference on $\beta$.

If $g(\cdot)$ and $h(\cdot)$ were fixed and known, consistent estimation would be possible by solving for $\beta$ in an empirical version of the following moment condition (equivalent to ordinary least-squares estimation):

$$E[(Y_{pu} - g(X_{pu}) - \beta r_{pu})\, r_{pu}] = 0$$

However, $g(\cdot)$ is unknown and needs to be jointly estimated with $\beta$. A solution is to first estimate $g(\cdot)$ on a separate subsample $S'$ of the data, and then estimate $\beta$ by solving an empirical version of the moment condition above on the remaining subsample $S$.
This procedure, called sample-splitting, eliminates the “overfitting-bias” introduced in the process of estimating $g(\cdot)$.

If $g(\cdot)$ is estimated via machine learning, the procedure above results in inconsistent estimates $\hat{\beta}$. (Chernozhukov et al., 2018) decomposes the scaled bias of $\hat{\beta}$ into the following two terms:

$$\sqrt{n}(\hat{\beta} - \beta) = \underbrace{\Big(\frac{1}{n} \sum_{(p,u) \in S} r_{pu}^2\Big)^{-1} \frac{1}{\sqrt{n}} \sum_{(p,u) \in S} r_{pu}\, \epsilon_{pu}}_{\text{Term } a} + \underbrace{\Big(\frac{1}{n} \sum_{(p,u) \in S} r_{pu}^2\Big)^{-1} \frac{1}{\sqrt{n}} \sum_{(p,u) \in S} r_{pu}\, \big(g(X_{pu}) - \hat{g}(X_{pu})\big)}_{\text{Term } b}$$

Term $a$ converges at a $n^{-1/2}$ rate to a zero-mean Gaussian. However, by virtue of $g(\cdot)$ being estimated via machine learning, term $b$ will typically converge to zero at a rate slower than $n^{-1/2}$ due to the slow convergence of the estimation error $g(X_{pu}) - \hat{g}(X_{pu})$. This is called the “regularization bias” of $\hat{g}(\cdot)$.

Double machine-learning eliminates regularization bias via a procedure called orthogonalization. $\beta$ is estimated by solving an empirical version of the following “Neyman-orthogonal” moment condition:

$$E\big[\big((Y_{pu} - E[Y_{pu} \mid X_{pu}]) - \beta (r_{pu} - E[r_{pu} \mid X_{pu}])\big)\,(r_{pu} - E[r_{pu} \mid X_{pu}])\big] = 0$$

The empirical version of this moment condition can be solved via a procedure similar to the residual-on-residuals regression of (Robinson, 1988). The procedure is as follows (where $S$ and $S'$ are disjoint subsamples of the data, and $m(\cdot)$ and $l(\cdot)$ are nonparametric functions):

1. Estimate the conditional expectation function $l(X_{pu}) = E[Y_{pu} \mid X_{pu}]$ on $S'$ to get $\hat{l}(\cdot)$.
2. Estimate the conditional expectation function $m(X_{pu}) = E[r_{pu} \mid X_{pu}]$ on $S'$ to get $\hat{m}(\cdot)$.
3. Estimate the outcome residual $\tilde{Y}_{pu} = Y_{pu} - \hat{l}(X_{pu})$ on $S$.
4. Estimate the treatment residual $\tilde{r}_{pu} = r_{pu} - \hat{m}(X_{pu})$ on $S$.
5. Regress $\tilde{Y}_{pu}$ on $\tilde{r}_{pu}$ to obtain $\hat{\beta}$.

Note that we no longer need to estimate $g(\cdot)$, and instead need to estimate the conditional expectations $l(\cdot)$ and $m(\cdot)$ that can be arbitrary nonparametric functions of $X_{pu}$ (such as neural networks). This procedure can be extended to include skill and position as controls by estimating additional conditional expectation functions to predict the challenger’s skill and position from their response text on $S'$, estimating the residuals $\tilde{s}_{pu}$ and $\tilde{t}_{pu}$ on $S$, and then regressing $\tilde{Y}_{pu}$ on $\tilde{r}_{pu}$, $\tilde{s}_{pu}$ and $\tilde{t}_{pu}$.

The resulting estimate $\hat{\beta}$ is $\sqrt{n}$-consistent and asymptotically normal. (Chernozhukov et al., 2018) shows that term $b$ of the scaled bias of $\hat{\beta}$ is now given by the following expression:

$$\Big(\frac{1}{n} \sum_{(p,u) \in S} V_{pu}^2\Big)^{-1} \frac{1}{\sqrt{n}} \sum_{(p,u) \in S} \underbrace{\big(m(X_{pu}) - \hat{m}(X_{pu})\big)}_{\hat{m}(\cdot) \text{ estimation error}}\; \underbrace{\big(l(X_{pu}) - \hat{l}(X_{pu})\big)}_{\hat{l}(\cdot) \text{ estimation error}}$$

This contains the product of nuisance function estimation errors. Hence, orthogonalization enables $\sqrt{n}$-consistent estimation of $\hat{\beta}$ as long as the product of the convergence rates of $\hat{m}(\cdot)$ and $\hat{l}(\cdot)$ is $n^{-1/2}$. This is more viable than requiring each nuisance function to converge at a $n^{-1/2}$ rate.

If $r_{pu}$ is endogenous and $Z_{pu}$ is a valid instrument for $r_{pu}$, (Chernozhukov et al., 2018) proposes the following Neyman-orthogonal moment condition to estimate $\beta$ in a partially-linear instrumental variable specification:

$$E\big[\big((Y_{pu} - E[Y_{pu} \mid X_{pu}]) - \beta (r_{pu} - E[r_{pu} \mid X_{pu}])\big)\,(Z_{pu} - E[Z_{pu} \mid X_{pu}])\big] = 0 \tag{6}$$

By a similar bias derivation, the estimated $\hat{\beta}$ is shown to be $\sqrt{n}$-consistent and asymptotically normal, as long as the instrument is valid and the product of the nuisance function convergence rates is $n^{-1/2}$.

We now detail our overall estimation procedure for the partially-linear instrumental variable specification.
We include the opinion fixed-effect $\tau_p$, skill $s_{pu}$ and position $t_{pu}$ as controls. $S$ and $S'$ are disjoint subsamples of the data, and $m_r(\cdot)$, $m_s(\cdot)$, $m_t(\cdot)$, $m_p(\cdot)$, $l(\cdot)$ and $q(\cdot)$ are nonparametric functions that we detail in the next subsection. The procedure is as follows:

1. Estimate the following conditional expectation functions on sample $S'$:
   i. $l(X_{pu}, \tau_p) = E[Y_{pu} \mid X_{pu}, \tau_p]$ to get $\hat{l}(\cdot)$.
   ii. $q(X_{pu}, \tau_p) = E[Z_{pu} \mid X_{pu}, \tau_p]$ to get $\hat{q}(\cdot)$.
   iii. $m_r(X_{pu}, \tau_p) = E[r_{pu} \mid X_{pu}, \tau_p]$ to get $\hat{m}_r(\cdot)$.
   iv. $m_s(X_{pu}, \tau_p) = E[s_{pu} \mid X_{pu}, \tau_p]$ to get $\hat{m}_s(\cdot)$.
   v. $m_t(X_{pu}, \tau_p) = E[t_{pu} \mid X_{pu}, \tau_p]$ to get $\hat{m}_t(\cdot)$.
2. Estimate the following residuals on sample $S$:
   i. $\tilde{Y}_{pu} = Y_{pu} - \hat{l}(X_{pu}, \tau_p)$.
   ii. $\tilde{Z}_{pu} = Z_{pu} - \hat{q}(X_{pu}, \tau_p)$.
   iii. $\tilde{r}_{pu} = r_{pu} - \hat{m}_r(X_{pu}, \tau_p)$.
   iv. $\tilde{s}_{pu} = s_{pu} - \hat{m}_s(X_{pu}, \tau_p)$.
   v. $\tilde{t}_{pu} = t_{pu} - \hat{m}_t(X_{pu}, \tau_p)$.
3. Run a two-stage least-squares regression of $\tilde{Y}_{pu}$ on $\tilde{r}_{pu}$, $\tilde{s}_{pu}$, $\tilde{t}_{pu}$ using $\tilde{Z}_{pu}$ as an instrument for $\tilde{r}_{pu}$ to obtain the estimated local average treatment effects of reputation, skill and position on debate success.

We partition the debates for opinions with more than one response (mirroring the data used in the specifications with opinion fixed-effects) uniformly at random into an estimation subsample $S'$ containing 10% of the debates (101,946 debates) and an inference subsample $S$ containing 90% of the debates (917,523 debates), ensuring that every opinion is represented in both $S$ and $S'$. In the next section, we describe how we use neural networks with rectified linear unit (ReLU) activation functions for the nonparametric functions $m_r(\cdot)$, $m_s(\cdot)$, $m_t(\cdot)$, $m_p(\cdot)$, $l(\cdot)$ and $q(\cdot)$, which have been shown to converge at $n^{-1/4}$ rates (Farrell et al., 2018), which enables consistent estimation and valid inference.
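The three-step procedure above can be sketched in a few lines. The sketch below is a simplified illustration under stated assumptions and is not the paper’s implementation: the covariate is a single simulated number standing in for the response text and fixed-effects, the nuisance functions are polynomial ridge regressions standing in for the neural networks, and the skill and position residuals are omitted. All function names and simulation values are hypothetical.

```python
import numpy as np

def poly_features(x, degree=5):
    """Polynomial basis 1, x, ..., x^degree for a one-dimensional covariate."""
    return np.column_stack([x ** d for d in range(degree + 1)])

def fit_nuisance(x, target, penalty=1e-3):
    """Ridge regression on the polynomial basis; returns a prediction function."""
    features = poly_features(x)
    gram = features.T @ features + penalty * np.eye(features.shape[1])
    weights = np.linalg.solve(gram, features.T @ target)
    return lambda x_new: poly_features(x_new) @ weights

def dml_pliv(y, r, z, x, frac_estimation=0.1, seed=0):
    """Sample-split double machine-learning for a partially-linear IV model."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    n_est = int(frac_estimation * len(y))
    s_prime, s = order[:n_est], order[n_est:]   # estimation and inference subsamples
    # Step 1: estimate conditional expectations on S'.
    l_hat = fit_nuisance(x[s_prime], y[s_prime])
    m_hat = fit_nuisance(x[s_prime], r[s_prime])
    q_hat = fit_nuisance(x[s_prime], z[s_prime])
    # Step 2: compute residuals on S.
    y_res = y[s] - l_hat(x[s])
    r_res = r[s] - m_hat(x[s])
    z_res = z[s] - q_hat(x[s])
    # Step 3: just-identified 2SLS of y_res on r_res with instrument z_res.
    return float(z_res @ y_res / (z_res @ r_res))

# Simulated data with confounding through x and through an unobserved u.
rng = np.random.default_rng(0)
n = 50_000
x = rng.uniform(-2, 2, size=n)
u = rng.normal(size=n)
z = np.cos(x) + rng.normal(size=n)
r = -0.7 * z + x ** 2 + u + rng.normal(size=n)
y = 0.5 * r + np.sin(x) + u + rng.normal(size=n)

beta_hat = dml_pliv(y, r, z, x)
print(f"beta_hat = {beta_hat:.3f}")  # the simulated effect of r on y is 0.5
```

The 10%/90% split mirrors the estimation and inference subsamples described in the text; any learner with suitable convergence properties could replace `fit_nuisance`.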
A fully-connected neural network with $h$ hidden layers is parameterized by matrices $W_1, \ldots, W_{h+1}$ and activation functions (called activations) $a_1, \ldots, a_{h+1}$. The hidden layer sizes $s_1, \ldots, s_h$ are architectural hyperparameters that determine the sizes of the matrices $W_1, \ldots, W_{h+1}$ as follows, where $D$ and $O$ are the dimensionalities of the neural network input and output, respectively:

Size of $W_1 = D \times s_1$
Size of $W_i = s_{i-1} \times s_i$ for $i = 2, \ldots, h$
Size of $W_{h+1} = s_h \times O$

[Figure 6 diagram: input $[X_{pu}, \tau_p] \in \mathbb{R}^D$; hidden layer with $W_1 \in \mathbb{R}^{D \times s_1}$ and $a_1(\cdot)$ producing $R_1 \in \mathbb{R}^{s_1}$; output layer with $W_2 \in \mathbb{R}^{s_1 \times 1}$ and $a_2(\cdot)$; predicted output one of $\hat{Y}_{pu} \in \{0,1\}$, $\hat{r}_{pu} \in \mathbb{Z}_+$, $\hat{s}_{pu} \in [0,100]$, $\hat{t}_{pu} \in \mathbb{R}$, $\hat{Z}_{pu} \in \mathbb{R}_+$.]

Figure 6: A neural network with one hidden layer ($h = 1$). The neural network transforms the $D$-dimensional input, a concatenation of the response text vector $X_{pu}$ and the fixed-effects indicator vector for $\tau_p$, into a 1-dimensional output. $W_1$ and $W_2$ are parameters to be estimated. $a_1(\cdot)$ and $a_2(\cdot)$ are activation functions.

Each layer $i$ multiplies the intermediate vector $R_{i-1}$ produced by the previous layer with $W_i$, and applies the activation function $a_i(\cdot)$ to produce $R_i = a_i(R_{i-1} W_i)$. Figure 6 illustrates a neural network with one hidden layer ($h = 1$), input dimensionality $D$ and output dimensionality $O = 1$. The neural network transforms the input, a concatenation of the response text vector $X_{pu}$ and the fixed-effects indicator vector for $\tau_p$, into the 1-dimensional predicted output $a_2(a_1([X_{pu}, \tau_p] \times W_1) \times W_2)$.

We estimate five neural networks with rectified linear unit (ReLU) activations to predict (i) debate success $Y_{pu} \in \{0, 1\}$, (ii) reputation $r_{pu} \in \mathbb{Z}_+$, (iii) skill $s_{pu} \in [0, 100]$ (as a percentage), (iv) position $t_{pu} \in \mathbb{R}$ (standardized to have zero-mean and unit-variance) and (v) the instrument $Z_{pu} \in \mathbb{R}_+$ from the response text $X_{pu}$ and opinion fixed-effects $\tau_p$. Though recurrent (Hochreiter and Schmidhuber, 1997) and convolutional (Kim, 2014) neural networks are more popular for textual prediction tasks, ReLU neural networks have guaranteed $n^{-1/4}$ convergence rates (Farrell et al., 2018) that we require for consistent estimation and valid inference. Hence, we set each of the hidden layer activations $a_1(\cdot), \ldots, a_h(\cdot)$ to the rectifier function $a_i(x) = \max(0, x)$. Since the output of each neural network is one-dimensional, we set the size of the output layer matrix $W_{h+1}$ to $s_h \times 1$.
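The layer recursion $R_i = a_i(R_{i-1} W_i)$ and the matrix sizes given above can be made concrete with a short sketch. It is illustrative only: the toy dimensions and the Gaussian initialization are assumptions made for brevity (they are not the dimensions or the initialization scheme used in our networks), and the helper names are hypothetical.

```python
import numpy as np

def relu(x):
    """Rectifier activation a_i(x) = max(0, x), applied element-wise."""
    return np.maximum(0.0, x)

def sigmoid(x):
    """Logistic sigmoid, mapping real values into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def init_weights(input_dim, hidden_sizes, output_dim, rng):
    """Create W_1 (D x s_1), W_i (s_{i-1} x s_i) and W_{h+1} (s_h x O)."""
    sizes = [input_dim, *hidden_sizes, output_dim]
    return [rng.normal(scale=0.1, size=(rows, cols))
            for rows, cols in zip(sizes[:-1], sizes[1:])]

def forward(inputs, weights, output_activation):
    """Apply R_i = a_i(R_{i-1} W_i): ReLU on hidden layers, then the output activation."""
    hidden = inputs
    for weight in weights[:-1]:
        hidden = relu(hidden @ weight)
    return output_activation(hidden @ weights[-1])

# Toy dimensions: D = 12 inputs, two hidden layers of size 8, one output.
rng = np.random.default_rng(0)
D, hidden_sizes, O = 12, [8, 8], 1
weights = init_weights(D, hidden_sizes, O, rng)
inputs = rng.normal(size=(5, D))                 # a mini-batch of 5 input vectors
predictions = forward(inputs, weights, sigmoid)  # e.g. predicted success probabilities

print([w.shape for w in weights])  # [(12, 8), (8, 8), (8, 1)]
print(predictions.shape)           # (5, 1)
```

Swapping `sigmoid` for another output activation changes the range of the prediction without changing the rest of the network.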
Output layer activations and loss functions. For the debate success prediction network with the binary target $Y_{pu} \in \{0, 1\}$, we set the output layer activation to the logistic sigmoid function: $a_{h+1}(x) = (1 + e^{-x})^{-1} \in [0, 1]$. For the skill prediction network with the bounded target $s_{pu} \in [0, 100]$, we set the output layer activation to the scaled logistic sigmoid function: $a_{h+1}(x) = (1 + e^{-x})^{-1} \times 100 \in [0, 100]$. For the reputation and instrument prediction networks with nonnegative targets $r_{pu} \in \mathbb{Z}_+$ and $Z_{pu} \in \mathbb{R}_+$, we set the output layer activation to the rectifier function: $a_{h+1}(x) = \max(0, x)$. For the position prediction network with unbounded target $t_{pu} \in \mathbb{R}$, we set the output layer activation to the identity function: $a_{h+1}(x) = x$. We estimate the parameters $W_1, \ldots, W_{h+1}$ for the debate success prediction network by minimizing the binary cross-entropy loss $-[Y_{pu} \log(\hat{Y}_{pu}) + (1 - Y_{pu}) \log(1 - \hat{Y}_{pu})]$ (where $\hat{Y}_{pu}$ is the predicted output), and for the other networks by minimizing the mean squared error.

Neural network input. We follow recommendations from the text-as-data literature (Gentzkow et al., 2019) and construct a term-frequency inverse-document-frequency (TF-IDF) matrix from the text of the challengers’ responses. We preprocess the text to remove links, numbers, pronouns, punctuation and text formatting symbols, and replace each word with its lower-cased stem (for example, “economically” and “economics” will be replaced by the stem “economic”). We exclude very rare words (present in less than 0.1% of the responses) and very frequent words (present in more than 99.9% of the responses), since these words will contribute negligibly towards more accurate predictions. The final vocabulary contains 4,926 distinct words. Each row of the TF-IDF matrix corresponds to a vector $X_{pu} \in \mathbb{R}^{4,926}$. We also construct an indicator vector $\tau_p \in \{0, 1\}^{84,998}$ (since there are 84,998 unique opinion clusters), where only the $p$th element of $\tau_p$ is set to 1 and the rest are set to zero. The concatenation of these two vectors, $[X_{pu}, \tau_p] \in \mathbb{R}^{89,924}$, is passed as input to the neural networks.

Optimization.
We train each network via backpropagation (Rumelhart et al., 1986) with the Adam gradient-based optimization algorithm (Kingma and Ba, 2015), iterating over mini-batches of the training data. We begin the optimization process by initializing the parameters using the Kaiming uniform initialization scheme (He et al., 2015), which has been shown to perform well both empirically and theoretically (Hanin and Rolnick, 2018). We perform batch-normalization (Ioffe and Szegedy, 2015) on each layer’s output after applying the activation function to prevent internal covariate shift and accelerate convergence. To prevent overfitting to the training data, we apply weight-decay (a form of $L_2$-norm penalization) (Krogh and Hertz, 1992) to all the parameters, along with early-stopping (halting the training process once the out-of-sample predictive power starts decreasing with training iterations). We do not employ dropout regularization (Srivastava et al., 2014), since it reduces out-of-sample predictive power when combined with batch-normalization (Li et al., 2019).

Architectural and optimization hyperparameters.
The number of hidden layers $h$, hidden layer sizes $s_1, \ldots, s_h$, weight-decay penalty, optimization learning rate and mini-batch size are architectural and optimization hyperparameters that need to be tuned empirically. Hence, we further partition the debates in the estimation subsample $S'$ uniformly at random into a training subsample $S'_{train}$ containing 75% of the debates (76,459 debates) and a validation subsample $S'_{val}$ containing 25% of the debates (25,487 debates). During the hyperparameter tuning process, we train the neural network on $S'_{train}$ and evaluate its loss at each training iteration on both $S'_{train}$ and $S'_{val}$.

We fix the size of the hidden layers $s_1, \ldots, s_h$ to the dimensionality of the response text vector $X_{pu}$ (= 4,926) and tune the number of hidden layers for each neural network. Deep, fixed-width ReLU networks of this type have been shown to generalize well both empirically and theoretically (Hanin, 2019; Safran and Shamir, 2017). For each neural network, we evaluate the training loss (for at most 5,000 mini-batch iterations with early-stopping) with an increasing number of hidden layers, until the training loss no longer improves. Each neural network with the number of hidden layers thus found has enough representational capacity to capture patterns in the training data, but is likely to have overfit the training data and suffer from poor out-of-sample predictive power.

Prediction target                                    Number of hidden layers    Hidden layer activation    Output layer activation    Loss function
Debate success $Y_{pu} \in \{0, 1\}$
Reputation $r_{pu} \in \mathbb{Z}_+$
Skill $s_{pu} \in [0, 100]$ (percentage)             3                          ReLU                       Sigmoid                    Mean squared error
Position $t_{pu} \in \mathbb{R}$ (standardized)      3                          ReLU                       Identity                   Mean squared error
Instrument $Z_{pu} \in \mathbb{R}_+$

Table 7: Architectural hyperparameters. The input layer matrix $W_1$ of each neural network has size 89,924 × 4,926, where 4,926 is the dimensionality of $X_{pu}$ (the vocabulary size). Each of the $h$ hidden layer matrices $W_2, \ldots, W_h$ has size 4,926 × 4,926, and the output layer matrix $W_{h+1}$ has size 4,926 × 1.

Prediction target                                    Selected hyperparameters        Loss on $S'_{train}$    Loss on $S'_{val}$    Loss on $S$
Debate success $Y_{pu} \in \{0, 1\}$
Reputation $r_{pu} \in \mathbb{Z}_+$
Skill $s_{pu} \in [0, 100]$ (percentage)             0.0001    50,000    10          3.672                   3.764                 3.707
Position $t_{pu} \in \mathbb{R}$ (standardized)      0.0001    50,000    10          0.658                   0.789                 0.796
Instrument $Z_{pu} \in \mathbb{R}_+$

Table 8: Optimization hyperparameters. The subsample losses on $S'_{train}$, $S'_{val}$ and $S$ are reported after training each neural network with the selected hyperparameters for at most 5,000 mini-batch iterations (with early-stopping) on $S'_{train}$. The binary cross-entropy subsample loss is reported for the network predicting $Y_{pu}$ and the root mean squared prediction error is reported for the other networks.

Hence, after having selected the number of hidden layers for each neural network via the aforementioned procedure, we evaluate the validation loss of each neural network (for at most 5,000 mini-batch iterations with early-stopping) with an increasingly large weight-decay penalty (in the logarithmically-spaced range . , . , . , . . . ), until the validation loss no longer improves. The final neural network thus found will have sufficient representational capacity and be sufficiently regularized to generalize well out-of-sample. During the process of tuning the number of hidden layers and the weight-decay penalty, we also empirically evaluate and select the values of the learning rate and mini-batch size that deliver the minimum validation loss with fast and stable convergence.

Table 7 summarizes the selected architectural hyperparameters. Table 8 summarizes the selected optimization hyperparameters and the losses on each data subsample, which reflect the extent to which each target is correlated with potential confounders present in response text. After fixing the selected hyperparameters, we re-estimate the neural networks on the full estimation subsample $S'$, estimate the prediction residuals on the inference sample $S$ and run a two-stage least-squares regression with these residuals, as described in the double machine-learning procedure in Section 4.3.

5 Results
Dependent Variable: Debate Success Y_pu

                                    (1)     (2)     (3)     (4)     (5)
Reputation r_pu (10 units)          +***    +***    +***    +***    +***
Skill s_pu (percentage)             +***    +***    +***    +***    +***
Position t_pu (std. deviations)     −***    −***    −***    −***    −***
Response text (X_pu)                ✗       ✗       ✗       ✗       ✓
Instrument (Z_pu)                   ✗       ✗       ✗       ✓       ✓
Opinion fixed-effects (τ_p)         ✗       ✗       ✓       ✓       ✓

[Coefficient magnitudes, standard errors, the number of debates and the R² values (reported for columns (1) to (3) only) did not survive extraction; only signs and significance stars are shown.]

Note: Standard errors (clustered by opinion) displayed in parentheses. Significance is marked with ***, ** and * (thresholds lost in extraction).

Table 9: Main results.
Estimated effects of reputation, skill and position on debate success with a logit model without opinion fixed-effects (1), linear probability models without (2) and with (3) opinion fixed-effects, linear instrumental variable (4) and partially-linear instrumental variable (5) specifications. Position is standardized to have zero-mean and unit-variance. Average marginal effects and pseudo-R² are reported for the logit model. The instrument Z_pu is the mean past position of user u before they challenged opinion p.

Table 9 reports the estimated (marginal) effects of reputation, skill and position on debate success. We exclude opinion fixed-effects from the logit model to prevent dropping debates for opinions where none of the challengers persuaded the poster, a step that is required to estimate conditional logit models (Chamberlain, 1980). For all specifications, we find that the effects are precisely estimated, statistically significant and have the expected signs. The estimates from the baseline specifications in columns (1) to (3) indicate that a challenger having 10 additional units of reputation is 0.06 to 0.14 percentage points more likely to persuade a poster on average than another challenger of the same opinion, with all else equal. This corresponds to a 1.7% to 4.0% increase over the platform average debate success rate of 3.5%. Keeping in mind the difficulty of persuasion (the average persuasion rate of a user in our dataset is 0.6%, and the median persuasion rate of a user is 0.0%), this increase is practically significant.

Three additional observations are worth noting. First, we expect the estimated effect of skill to be attenuated due to measurement error in all specifications. Second, comparing columns (2) and (3), we find that including the opinion fixed-effects increases the estimated effect of reputation. This suggests that endogenous opinion selection (discussed in Section 4.1) biases the estimated effect of reputation downwards.
Third, the estimated effect of reputation is an average across all challengers, including those with low (and high) persuasive ability who are unlikely to benefit from additional reputation. We expect the impact of reputation to be higher for challengers with moderate persuasive ability.

The estimated effects from the baseline specifications in columns (1) to (3) of Table 9 may be confounded by unobserved time-varying challenger characteristics (discussed in Section 4.2). Hence, we use the challenger’s mean past position as an instrument for their reputation, and report the estimated local average treatment effects (LATEs) of reputation, skill and position on debate success with a linear instrumental variable specification in column (4). The estimates indicate that a challenger having 10 additional units of reputation is 1.09 percentage points more likely on average to persuade a poster than another challenger of the same opinion, with all else equal. This corresponds to a 31% increase over the platform average debate success rate of 3.5%. Compared to the linear probability model estimates, the estimated LATE of reputation is 7.8 times larger. We attribute this increase to the compliers (debate subpopulations where the challenger’s reputation is indeed affected by the instrument) having challengers with moderate to high persuasive ability, who benefit more from an increase in their reputation than those with low persuasive ability.

The instrument exclusion restriction will be violated if the mean response position of a challenger in the past has a direct effect on their probability of success in the present debate. This is possible if, for example, users become more persuasive by learning from the earlier responders to the opinions they challenged previously (as discussed in Section 4.2). However, the persuasive ability acquired in this manner is likely to affect debate success through the text of the challenger’s responses.
Hence, we alleviate concerns of instrument exclusion being violated by controlling for the response text in a partially-linear instrumental variable specification, and report the estimated LATEs in column (5) of Table 9. The estimated LATEs from this specification are less precise and differ slightly from the estimates in column (4), but are statistically indistinguishable at the 5% level (using a two-tailed Z-test). This suggests that the instrument exclusion restriction is not violated by factors present in the text.

We supplement these results by reporting LATE estimates using the plausibly-exogenous methodology (Conley et al., 2012), which assumes a known direct effect γ of the instrument on debate success, for a range of γ in Table 10. If users become more persuasive by learning from earlier challengers, we expect γ > 0. Table 10 indicates that the estimated reputation LATE after correcting for exclusion restriction violations of this type is larger, in which case our linear instrumental variable specification (without this correction) under-estimates the true LATE. At one negative value of γ (the second column of Table 10), the estimated LATE of reputation after correcting for the exclusion restriction violation becomes insignificant. This is expected, since correcting for this value of γ completely eliminates the “reduced form” effect of the instrument on debate success (reported in Appendix D). Setting the “reduced form” effect of the instrument on the outcome to zero in this way necessarily renders the estimated treatment effect insignificant (Angrist and Krueger, 2001). For more negative values of γ, the estimated LATE of reputation after correcting for exclusion restriction violations is negative, which is intuitively implausible.

In summary, our main results confirm hypothesis H1 (Section 2): that reputation has persuasive power on the ChangeMyView platform.
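The logic of the plausibly-exogenous correction with a known γ can be illustrated with a minimal sketch on synthetic data. This is not the estimation code behind Table 10: it assumes a single instrument, no fixed effects or controls, and point estimates only, and the function names (`iv_estimate`, `plausibly_exogenous_iv`, `reduced_form`) and all simulated coefficients are hypothetical. The idea is to subtract the assumed direct effect γ·Z from the outcome before running the usual instrumental variable estimator.

```python
import numpy as np

def iv_estimate(y, r, z):
    """Just-identified IV (Wald) estimate of the effect of r on y with instrument z."""
    zc = z - z.mean()
    return float(zc @ (y - y.mean()) / (zc @ (r - r.mean())))

def plausibly_exogenous_iv(y, r, z, gamma):
    """IV estimate after removing an assumed known direct effect gamma * z from y."""
    return iv_estimate(y - gamma * z, r, z)

def reduced_form(y, z):
    """OLS slope of the outcome on the instrument (the 'reduced form' effect)."""
    zc = z - z.mean()
    return float(zc @ (y - y.mean()) / (zc @ zc))

# Synthetic data (all coefficients are illustrative assumptions).
rng = np.random.default_rng(0)
n = 200_000
z = rng.normal(size=n)                  # instrument
u = rng.normal(size=n)                  # unobserved confounder
r = -0.5 * z + u + rng.normal(size=n)   # treatment, with a negative first stage
beta_true, gamma_true = 0.3, 0.05       # true effect and true direct effect of z
y = beta_true * r + gamma_true * z - 0.4 * u + rng.normal(size=n)

naive = plausibly_exogenous_iv(y, r, z, 0.0)             # ignores the violation
corrected = plausibly_exogenous_iv(y, r, z, gamma_true)  # uses the true gamma
rf = reduced_form(y, z)
at_rf = plausibly_exogenous_iv(y, r, z, rf)              # gamma = reduced form

print(f"naive={naive:.3f} corrected={corrected:.3f} at_reduced_form={at_rf:.3e}")
```

In this simulation the uncorrected estimate falls below the corrected one when γ > 0 and the first stage is negative, and choosing γ equal to the reduced-form slope drives the corrected estimate to zero by construction, which mirrors the two observations made in the text.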
In the rest of this section, we investigate potential mechanisms that could explain the persuasive power of reputation.

Dependent Variable: Debate Success Y_pu

Exclusion violation (γ): six values, three negative followed by three non-negative (magnitudes lost in extraction)

                                    (1)     (2)     (3)     (4)     (5)     (6)
Reputation r_pu (10 units)          −***    −       +***    +***    +***    +***
Skill s_pu (percentage)             +***    +***    +***    −       −***    −***
Position t_pu (std. deviations)     −***    −***    −***    −***    −***    −***
Response text (X_pu)                ✗       ✗       ✗       ✗       ✗       ✗
Instrument (Z_pu)                   ✓       ✓       ✓       ✓       ✓       ✓
Opinion fixed-effects (τ_p)         ✓       ✓       ✓       ✓       ✓       ✓

[Coefficient magnitudes, standard errors and the number of debates (identical across columns) did not survive extraction; only signs and significance stars are shown. Entries without stars are not statistically significant.]

Note: Standard errors (clustered by opinion) displayed in parentheses. Significance is marked with ***, ** and * (thresholds lost in extraction).

Table 10: Results with a plausibly-exogenous instrument.
Estimated effects of reputation, skill and position with a linear instrumental variable specification and plausibly-exogenous instrument. All specifications include opinion fixed-effects. Position is standardized to have zero-mean and unit-variance. The instrument Z_pu is the mean past position of user u before they challenged opinion p.

Motivated by a theoretical model of persuasion with reference cues (Bilancini and Boncinelli, 2018), we investigate whether reputation serves as a reference cue (an information-processing shortcut) for the challenger’s response quality by examining patterns in the heterogeneity of the effects of reputation and skill with the content of the debate. These patterns reflect how the poster’s usage of heuristic and systematic information-processing (Chaiken, 1989) varies with content characteristics that affect information-processing effort. The model predicts that individuals rely more on low-effort heuristic processing than on high-effort systematic processing when subject to greater cognitive overload. In our setting, this translates to posters relying more on a challenger’s reputation than on their skill when the challenger’s response is more cognitively complex (hypothesis H2 in Section 2). The model also predicts that individuals rely less on heuristic processing than on systematic processing when they are more involved in the issue being debated (hypothesis H3 in Section 2).

We test both predictions by examining how the relative estimated effects of reputation and skill on debate success vary with the cognitive complexity of the challenger’s response and with the issue-involvement of the poster. Using the length (in characters) of the challenger’s response text and of the poster’s opinion text as proxies for cognitive complexity and issue-involvement, respectively, Table 11 reports LATE estimates of the effects of reputation and skill interacted with the response and opinion length (binned into quantiles).
Note that the specifications in Table 11 do not include opinion fixed-effects, which would absorb all variation in the opinion length. They include calendar month-year fixed-effects, since unobserved temporal variation is no longer accounted for without the opinion fixed-effects.

The estimated LATEs in columns (1) and (2) of Table 11 quantify how the effects of reputation and skill (separately) vary with the response length, which serves as a proxy for the cognitive complexity of the challenger’s response. The effects of both reputation and skill increase with the response length. This is expected since, with longer responses, the content explains more of the debate outcome than other factors. However, the effect of reputation increases more than that of skill from the first to the fourth response length quantile: by +0.0133 for reputation compared to +0.0030 for skill. Measuring the share of reputation in the combined effect magnitude (the effect magnitude of reputation divided by the sum of the effect magnitudes of reputation and skill) at each response length quantile reveals that the reputation effect share increases from 82% to 89%. This is consistent with the poster’s increased reliance on reputation as a heuristic shortcut and decreased reliance on systematic evaluation of argument quality. This pattern supports hypothesis H2.

The estimates in columns (3) and (4) quantify how the effects of reputation and skill vary with the opinion length, where we assume that longer opinions are correlated with the poster being more involved in the issue being debated. The effect of reputation remains largely the same after the second opinion length quantile, while that of skill increases significantly in each subsequent opinion length quantile. The share of reputation in the combined effect magnitude (defined as above) decreases from 90% to 83% from the second to the fourth opinion length quantile.
This is consistent with the decreased reliance of the poster on reputation as a heuristic shortcut and the increased reliance on systematic evaluation of argument quality as the issue-involvement of the poster increases. This pattern supports hypothesis H3.

Similar trends are observed if reading complexity measures (such as the Flesch-Kincaid Reading Ease) are used to proxy for the cognitive complexity of the challenger’s response and the issue-involvement of the poster. The negative estimated effects of skill for opinions and responses in the first length quantile could be attributed to higher skilled users preferring to write longer responses and to challenge longer opinions. These preferences are a form of endogenous selection on the interaction terms, and will bias our estimates downwards. Since we only examine trends in the estimated effects, and do not interpret their absolute values, these biases are not a major concern.

We now examine if reputation serves as a way for cognitively-overloaded posters to select challengers to engage with. Table 12 reports the LATE estimates of reputation interacted with the total number of challengers of the opinion, binned into quantiles. The specification in Table 12 does not include opinion fixed-effects, since they absorb all variation in the number of challengers, but includes calendar month-year fixed-effects and the response and opinion length as controls. We expect that, under the larger cognitive burden of having to respond to more challengers, posters would rely on reputation as a filtering heuristic. However, the estimates in Table 12 indicate a decrease in the effect of reputation as the number of opinion challengers increases, which likely reflects the preference of reputed users for challenging opinions with fewer existing challengers.
Hence, we find no support for the hypothesis that posters use reputation to select challengers to engage with.

Dependent Variable: Debate Success Y_pu
Moderator M_pu: Response Text Length in columns (1) and (2), Opinion Text Length in columns (3) and (4)

                                        Response length    Opinion length
Reputation r_pu (10 units), main        +***               +***
  × I[M_pu ∈ 1st quantile]              +***               −
  × I[M_pu ∈ 2nd quantile]              +***               +***
  × I[M_pu ∈ 3rd quantile]              +***               +***
  × I[M_pu ∈ 4th quantile]              +***               +***
Skill s_pu (percentage), main           +***               +***
  × I[M_pu ∈ 1st quantile]              −***               −***
  × I[M_pu ∈ 2nd quantile]              −                  +***
  × I[M_pu ∈ 3rd quantile]              +                  +***
  × I[M_pu ∈ 4th quantile]              +***               +***

                                        (1)     (2)     (3)     (4)
Position t_pu (std. deviations)         −***    −***    −***    −***
Response Length (characters)
  I[∈ 1st quantile] (baseline)
  I[∈ 2nd quantile]                     +***    +***    +***    +***
  I[∈ 3rd quantile]                     +***    +***    +***    +***
  I[∈ 4th quantile]                     +***    +***    +***    +***
Opinion Length (characters)
  I[∈ 1st quantile] (baseline)
  I[∈ 2nd quantile]                     +***    +***    +*      +***
  I[∈ 3rd quantile]                     +***    +***    +***    +***
  I[∈ 4th quantile]                     +***    +***    +***    +***
Response text (X_pu)                    ✗       ✗       ✗       ✗
Instrument (Z_pu)                       ✓       ✓       ✓       ✓
Opinion fixed-effects (τ_p)             ✗       ✗       ✗       ✗
Month-year fixed-effects                ✓       ✓       ✓       ✓

[Coefficient magnitudes, standard errors and the number of debates did not survive extraction; only signs and significance stars are shown. Entries without stars are not statistically significant.]

Note: Standard errors (clustered by opinion) displayed in parentheses. Significance is marked with ***, ** and * (thresholds lost in extraction).

Table 11: Heterogeneity with response and opinion length.
Variation in the estimated LATEs of reputation, skill and position on debate success with response and opinion length (binned into quantiles).

Dependent Variable: Debate Success Y_pu

Reputation r_pu (10 units)
  × I[Number of opinion challengers ∈ 1st quantile]    +***
  × I[Number of opinion challengers ∈ 2nd quantile]    +***
  × I[Number of opinion challengers ∈ 3rd quantile]    +***
  × I[Number of opinion challengers ∈ 4th quantile]    +***
Skill s_pu (percentage)                                +***
Position t_pu (std. deviations)                        −***
Number of opinion challengers
  I[∈ 1st quantile] (baseline)
  I[∈ 2nd quantile]                                    −***
  I[∈ 3rd quantile]                                    −***
  I[∈ 4th quantile]                                    −***
Response Length (characters)
  I[∈ 1st quantile] (baseline)
  I[∈ 2nd quantile]                                    +***
  I[∈ 3rd quantile]                                    +***
  I[∈ 4th quantile]                                    +***
Opinion Length (characters)
  I[∈ 1st quantile] (baseline)
  I[∈ 2nd quantile]                                    +***
  I[∈ 3rd quantile]                                    +***
  I[∈ 4th quantile]                                    +***
Response text (X_pu)                                   ✗
Instrument (Z_pu)                                      ✓
Opinion fixed-effects (τ_p)                            ✗
Month-year fixed-effects                               ✓

[Coefficient magnitudes, standard errors and the number of debates did not survive extraction; only signs and significance stars are shown.]

Note: Standard errors (clustered by opinion) displayed in parentheses. Significance is marked with ***, ** and * (thresholds lost in extraction).

Table 12: Heterogeneity with the number of opinion challengers.
Variation in the estimated LATEs of reputation, skill and position with the number of opinion challengers (binned into quantiles).

Dependent Variable: Conversation Tree Length

                                        (1)     (2)        (3)
Reputation r_pu (10 units)              +***    +***       +**
Skill s_pu (percentage)                 −***    −***       −
Position t_pu (std. deviations)         −***    −***       −***
Conditional on debate success Y_pu
  Unsuccessful debates (Y_pu = 0)       ✗       ✓          ✗
  Successful debates (Y_pu = 1)         ✗       ✗          ✓
Response text (X_pu)                    ✗       ✗          ✗
Instrument (Z_pu)                       ✓       ✓          ✓
Opinion fixed-effects (τ_p)             ✓       ✓          ✓
No. of debates                          [lost]  982,867    [lost]

[Coefficient magnitudes and standard errors did not survive extraction; only signs and significance stars are shown. Entries without stars are not statistically significant.]

Note: Standard errors (clustered by opinion) displayed in parentheses. Significance is marked with ***, ** and * (thresholds lost in extraction).

Table 13: Conversation tree length as the outcome.
Estimated effects of reputation, skill and position on the conversation tree length with the linear instrumental variable specification. All specifications include opinion fixed-effects. Position is standardized to have zero-mean and unit-variance. The instrument Z_pu is the mean past position of user u before they challenged opinion p.

We also examine if having higher reputation leads to longer conversations with the challenger, which could mediate the effect of reputation on debate success. We estimate the effect of reputation on the conversation tree length: the maximum number of turns of dialogue in the conversation tree initiated by the challenger’s response. It equals 1 if no one responds to the challenger, and is greater than 1 otherwise. It is positively (but weakly) correlated with debate success.

Table 13 reports the estimated LATEs of reputation, skill and position on the conversation tree length. The estimates in column (1) indicate that having 10 additional units of reputation leads to conversation trees that are 0.85 turns longer on average, with all else equal. Since the poster must use one turn of dialogue when awarding a ∆ to the challenger, this estimate may simply reflect the direct effect of reputation on persuading the poster. Hence, columns (2) and (3) report the estimated LATEs separately for unsuccessful and successful debates, and we find that having 10 additional units of reputation leads to conversation trees that are 0.85 turns longer on average even for unsuccessful debates. Hence, it is plausible that reputation affects debate success via longer conversations.

The estimated LATEs also indicate that an additional percentage point of skill leads to conversation trees that are up to 0.11 turns shorter on average for both unsuccessful and successful debates.
This suggests that higher skilled challengers are either quicker to abandon futile conversations, or are able to persuade the poster in fewer turns of dialogue.

Dependent Variable: I[Debate pu is multi-party] in column (1), Debate Success Y_pu in column (2)

                                        (1)     (2)
Reputation r_pu (10 units)              +***    +***
Skill s_pu (percentage)                 −***    +***
Position t_pu (std. deviations)         −***    −***
Only single-party debates               ✗       ✓
Response text (X_pu)                    ✗       ✗
Instrument (Z_pu)                       ✓       ✓
Opinion fixed-effects (τ_p)             ✓       ✓

[Coefficient magnitudes, standard errors and the number of debates did not survive extraction; only signs and significance stars are shown.]

Note: Standard errors (clustered by opinion) displayed in parentheses. Significance is marked with ***, ** and * (thresholds lost in extraction).

Table 14: Single-party vs. multi-party debates.
Estimated effects of reputation, skill and position on whether a debate is multi-party in column (1), and on debate success for single-party debates only in column (2), with a linear instrumental variable specification. All specifications include opinion fixed-effects. Position is standardized to have zero-mean and unit-variance. The instrument Z_pu is the mean past position of user u before they challenged opinion p.

Finally, we examine the effect of reputation on attracting collaboration from other users. Recall from Section 3 that other (non-poster) users may join an ongoing debate between the challenger and poster; we term such debates multi-party. Of multi-party debates, 6.1% are successful, as compared to 2.4% of non-multi-party debates, indicating a positive association between debate success and whether the debate is multi-party. If this association is causal, part of the effect of reputation on debate success may be due to attracting collaboration from other users. The estimated LATEs in column (1) of Table 14 indicate that having 10 additional units of reputation increases the probability of the debate being multi-party by 11%. The mechanism via which higher reputation challengers attract other users to join their debates may be complex. For example, higher reputation challengers may engage in longer conversations that make it easier for other users to join.

To exclude the effect of reputation that is potentially due to attracting collaboration from other users, we report the estimated LATEs of reputation, skill and position on debate success for single-party debates only in column (2) of Table 14. The estimates share the same sign as the overall LATEs reported in Table 9, with smaller magnitudes. This could be due to single-party debates occurring on less popular topics, which are more difficult to persuade in, and hence less likely to attract users who may collaborate with the challengers in ongoing debates.
Nevertheless, reputation continues to have a significant positive effect on debate success after excluding the potential effect of collaboration from other non-poster users.
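The relative effect sizes quoted for Table 9 in this section follow from dividing each estimated effect (in percentage points) by the platform average success rate. The short check below reproduces that arithmetic from the figures stated in the text. The helper name `relative_increase` is hypothetical, and the 7.8-fold ratio is reproduced under the assumption that the comparison is against the largest baseline estimate of 0.14 percentage points.

```python
def relative_increase(effect_pp: float, base_rate_pct: float) -> float:
    """Relative increase (in %) implied by an effect in percentage points
    over a base rate that is itself expressed in %."""
    return 100.0 * effect_pp / base_rate_pct

BASE_RATE = 3.5  # platform average debate success rate (%), as stated in the text

baseline_low = relative_increase(0.06, BASE_RATE)   # smallest baseline estimate
baseline_high = relative_increase(0.14, BASE_RATE)  # largest baseline estimate
iv_relative = relative_increase(1.09, BASE_RATE)    # linear IV estimate, column (4)
iv_to_lpm_ratio = 1.09 / 0.14                       # assumed basis of "7.8 times"

print(round(baseline_low, 1), round(baseline_high, 1),
      round(iv_relative), round(iv_to_lpm_ratio, 1))
# prints: 1.7 4.0 31 7.8
```

The same division applies to any of the other percentage-point effects in this section, provided the base rate used is the one relevant to the subsample in question.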
Conclusion
Using a 7-year panel of over a million debates from a large online argumentation platform containing explicit indicators of successful persuasion, we quantify the persuasive power of ethos (one of the three modes of persuasion described in Aristotle’s Rhetoric) in deliberation online. Specifically, we identify the causal effect of the reputation of an opinion challenger on persuading a poster on the ChangeMyView platform, using the mean position of the challenger in past debates as an instrument for their present reputation. We address endogenous opinion selection using opinion fixed-effects, and ensure instrument validity by controlling for the text of the challenger’s response using neural models of text in a partially-linear instrumental variable specification.

We find a statistically significant positive effect of reputation on debate success, with 10 additional units of reputation increasing the probability of persuading the poster by 31% over the mean platform persuasion rate of 3.5%. The relative effect of the challenger’s reputation with respect to their skill increases with the cognitive complexity of their response (proxied for by the length of their response text), and decreases with the issue-involvement of the poster (proxied for by the length of the opinion text), confirming the predictions of a theoretical model of persuasion where reputation serves as a reference cue (Bilancini and Boncinelli, 2018). We find no evidence that posters use reputation as a way to select which challengers to engage with, but we do find evidence that reputation induces longer conversations with the challenger. While we find evidence that reputation attracts collaboration with the challenger from other users (which may in turn affect persuasion), the estimated effect of reputation continues to be positive and statistically significant after excluding debates where such collaboration occurred.
These findings suggest that reputation serves as a proxy for argument quality and validity, and is used by cognitively-overloaded posters as a low-effort information-processing heuristic to evaluate a challenger’s arguments.

Our findings have implications for a variety of platforms that facilitate deliberative decision-making online, including the Stanford Online Deliberation Platform (https://stanforddeliberate.org/) used by government organizations to elicit public opinion, and GitHub used by technology firms engaged in remote collaborative software development (Marlow et al., 2013). Specifically, our results suggest that if the participants in deliberation sessions can observe characteristics of other participants, such as past organizational contributions or awards, such characteristics may serve as reference-cues that endow “reputed” participants with additional persuasive power. As a consequence, the degree of consensus within each deliberation session could increase, while distorting the average consensus opinions over a sequence of deliberation sessions towards those held by reputed participants. Such an outcome is undesirable in practice, and violates one of the key tenets of authentic deliberation (Fishkin and Luskin, 2005). In general, our findings may be of interest to online platforms that employ reputation systems and are concerned about their unintended effects (Chen, 2018; Z. Shi et al., 2020).

References

Angrist, Joshua and Alan B Krueger (2001). “Instrumental variables and the search for identification: From supply and demand to natural experiments”. In: Journal of Economic Perspectives.
[Entry partially lost in extraction] Economic Round-up.
Atkinson, David, Kumar Bhargav Srinivasan, and Chenhao Tan (2019). “What Gets Echoed? Understanding the “Pointers” in Explanations of Persuasive Arguments”. In: EMNLP-IJCNLP.
Baumgartner, Jason, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn (2020). “The PushShift Reddit Dataset”. In: arXiv preprint arXiv:2001.08435.
Beauchamp, Nick (2020). “Modeling and Measuring Deliberation Online”. In: The Oxford Handbook of Networked Communication.
Belloni, Alexandre, Victor Chernozhukov, and Christian Hansen (2014). “High-dimensional methods and inference on structural and treatment effects”. In: Journal of Economic Perspectives.
Bertrand, Marianne, Dean Karlan, Sendhil Mullainathan, Eldar Shafir, and Jonathan Zinman (2010). “What’s advertising content worth? Evidence from a consumer credit marketing field experiment”. In: The Quarterly Journal of Economics.
Bilancini, Ennio and Leonardo Boncinelli (2018). “Rational attitude change by reference cues when information elaboration requires effort”. In: Journal of Economic Psychology.
Blei, David M, Andrew Y Ng, and Michael I Jordan (2003). “Latent dirichlet allocation”. In: Journal of Machine Learning Research.
Chaiken, Shelly (1989). “Heuristic and systematic information processing within and beyond the persuasion context”. In: Unintended Thought.
Chamberlain, Gary (1980). “Analysis of Covariance with Qualitative Data”. In: The Review of Economic Studies.
Chen, Yiwei (2018). “User-Generated Physician Ratings: Evidence from Yelp”.
Chernozhukov, Victor, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins (2018). “Double/debiased machine learning for treatment and structural parameters”. In: The Econometrics Journal.
Cialdini, Robert B (2007). Influence: The psychology of persuasion. Collins New York.
Conley, Timothy G, Christian B Hansen, and Peter E Rossi (2012). “Plausibly exogenous”. In: Review of Economics and Statistics.
Correia, Sergio (2016). Linear Models with High-Dimensional Fixed Effects: An Efficient and Feasible Estimator. Tech. rep. Working Paper.
Davies, Todd and Seeta Gangadharan (2009). Online Deliberation: Design, Research, and Practice. Center for the Study of Language and Information.
DellaVigna, Stefano and Matthew Gentzkow (2010). “Persuasion: Empirical Evidence”. In: Annual Reviews of Economics.
Dev, Himel, Karrie Karahalios, and Hari Sundaram (2019). “Quantifying Voter Biases in Online Platforms: An Instrumental Variable Approach”. In: Proceedings of the ACM on Human-Computer Interaction, CSCW.
Dewatripont, Mathias and Jean Tirole (2005). “Modes of communication”. In: Journal of Political Economy.
Dranove, David and Ginger Zhe Jin (2010). “Quality disclosure and certification: Theory and practice”. In: Journal of Economic Literature.
Farrell, Max H, Tengyuan Liang, and Sanjog Misra (2018). “Deep Neural Networks for Estimation and Inference”. In: arXiv preprint arXiv:1809.09953.
Fishkin, James S and Robert C Luskin (2005). “Experimenting with a democratic ideal: Deliberative polling and public opinion”. In: Acta Politica.
Frisch, Ragnar and Frederick V Waugh (1933). “Partial time regressions as compared with individual trends”. In: Econometrica.
Gentzkow, Matthew, Bryan Kelly, and Matt Taddy (2019). “Text as data”. In: Journal of Economic Literature.
Hanin, Boris (2019). “Universal function approximation by deep neural nets with bounded width and relu activations”. In: Mathematics.
[Entry partially lost in extraction] Advances in Neural Information Processing Systems, pp. 571–581.
Harris, Zellig S (1954). “Distributional structure”. In: Word.
Hauser, John R and Steven M Shugan (1983). “Defensive marketing strategies”. In: Marketing Science.
He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun (2015). “Delving deep into rectifiers: Surpassing human-level performance on imagenet classification”. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1026–1034.
Heckman, James (1979). “Sample Selection Bias as a Specification Error”. In: Econometrica.
Hernán, Miguel A, Sonia Hernández-Díaz, and James M Robins (2004). “A structural approach to selection bias”. In: Epidemiology.
Hochreiter, Sepp and Jürgen Schmidhuber (1997). “Long short-term memory”. In: Neural Computation.
Hughes, Rachael A, Neil M Davies, George Davey Smith, and Kate Tilling (2019). “Selection bias when estimating average treatment effects using one-sample instrumental variable analysis”. In: Epidemiology.
Hui, Xiang, Maryam Saeedi, Zeqian Shen, and Neel Sundaresan (2016). “Reputation and regulations: evidence from ebay”. In: Management Science.
Imbens, Guido and Joshua Angrist (1994). “Identification and estimation of local average treatment effects”. In: Econometrica.
Ioffe, Sergey and Christian Szegedy (2015). “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In: International Conference on Machine Learning, pp. 448–456.
Jo, Yohan, Shivani Poddar, Byungsoo Jeon, Qinlan Shen, Carolyn Rose, and Graham Neubig (2018). “Attentive Interaction Model: Modeling Changes in View in Argumentation”. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 103–116.
Jones, Quentin, Gilad Ravid, and Sheizaf Rafaeli (2004). “Information overload and the message dynamics of online interaction spaces: A theoretical model and empirical exploration”. In: Information Systems Research.
Kamenica, Emir (2018). “Bayesian persuasion and information design”. In: Annual Reviews of Economics.
Kamenica, Emir and Matthew Gentzkow (2011). “Bayesian persuasion”. In: American Economic Review.
Keith, Katherine A., David Jensen, and Brendan O’Connor (2020). Text and Causal Inference: A Review of Using Text to Remove Confounding from Causal Estimates. arXiv: 2005.00649 [cs.CL].
Kim, Yoon (2014). “Convolutional Neural Networks for Sentence Classification”. In: EMNLP.
Kingma, Diederik P and Jimmy Ba (2015). “Adam: A Method for Stochastic Optimization”. In: ICLR.
Kokkodis, Marios and Panagiotis G Ipeirotis (2016). “Reputation transferability in online labor markets”. In: Management Science.
Krogh, Anders and John A Hertz (1992). “A simple weight decay can improve generalization”. In: Advances in Neural Information Processing Systems, pp. 950–957.
Landry, Craig E, Andreas Lange, John A List, Michael K Price, and Nicholas G Rupp (2006). “Toward an understanding of the economics of charity: Evidence from a field experiment”. In: The Quarterly Journal of Economics.
Li, Xiang, Shuo Chen, Xiaolin Hu, and Jian Yang (2019). “Understanding the disharmony between dropout and batch normalization by variance shift”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2682–2690.
Lippi, Marco and Paolo Torroni (2016). “Argumentation mining: State of the art and emerging trends”. In: ACM Transactions on Internet Technology (TOIT).
List, Christian, Robert C Luskin, James S Fishkin, and Iain McLean (2013). “Deliberation, single-peakedness, and the possibility of meaningful democracy: evidence from deliberative polls”. In: The Journal of Politics.
Lovell, Michael C (1963). “Seasonal adjustment of economic time series and multiple regression analysis”. In: Journal of the American Statistical Association.
Lu, Susan F and Huaxia Rui (2018). “Can we trust online physician ratings? Evidence from cardiac surgeons in Florida”. In: Management Science.
Luu, Kelvin, Chenhao Tan, and Noah A Smith (2019). “Measuring Online Debaters’ Persuasive Skill from Text over Time”. In: Transactions of the Association for Computational Linguistics.
Ma, Liye, Baohong Sun, and Sunder Kekre (2015). “The Squeaky Wheel Gets the Grease: An empirical analysis of customer voice and firm intervention on Twitter”. In: Marketing Science.
Manning, Christopher D, Prabhakar Raghavan, and Hinrich Schütze (2008). Introduction to Information Retrieval. Cambridge University Press.
Marlow, Jennifer, Laura Dabbish, and Jim Herbsleb (2013). “Impression formation in online peer production: activity traces and personal profiles in github”. In: CSCW.
Martin, Gregory J and Ali Yurukoglu (2017). “Bias in cable news: Persuasion and polarization”. In: American Economic Review.
McCloskey, Donald and Arjo Klamer (1995). “One quarter of GDP is persuasion”. In: The American Economic Review.
Mikolov, Tomas, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean (2013). “Distributed representations of words and phrases and their compositionality”. In: Advances in Neural Information Processing Systems.
Moreno, Antonio and Christian Terwiesch (2014). “Doing business with strangers: Reputation in online service marketplaces”. In: Information Systems Research.
Mullainathan, Sendhil, Joshua Schwartzstein, and Andrei Shleifer (2008). “Coarse thinking and persuasion”. In: The Quarterly Journal of Economics.
Nair, Vinod and Geoffrey E Hinton (2010). “Rectified linear units improve restricted boltzmann machines”. In:
ICML .Netzer, Oded, Alain Lemaire, and Michal Herzenstein (2019). “When words sweat: Identifying signalsfor loan default in the text of loan applications”. In:
Journal of Marketing Research .Nickell, Stephen (1981). “Biases in dynamic models with fixed effects”. In:
Econometrica .Pearl, Judea (2009).
Causality . Cambridge university press.Petty, Richard E and John T Cacioppo (1986). “The elaboration likelihood model of persuasion”. In:
Communication and persuasion .Quattrociocchi, Walter, Antonio Scala, and Cass R Sunstein (2016). “Echo chambers on Facebook”. In:
Available at SSRN 2795110 .Roberts, Margaret E, Brandon M Stewart, and Edoardo M Airoldi (2016). “A model of text for experi-mentation in the social sciences”. In:
Journal of the American Statistical Association .Roberts, Margaret E, Brandon M Stewart, and Richard A Nielsen (2018). “Adjusting for confoundingwith text matching”. In:
American Journal of Political Science .Robinson, Peter M (1988). “Root-N-consistent semiparametric regression”. In:
Econometrica .Rumelhart, David E, Geoffrey E Hinton, and Ronald J Williams (1986). “Learning representations byback-propagating errors”. In:
Nature .Safran, Itay and Ohad Shamir (2017). “Depth-width tradeoffs in approximating natural functions withneural networks”. In:
Proceedings of the 34th International Conference on Machine Learning-Volume 70 .JMLR. org, pp. 2979–2987.Shi, Claudia, David Blei, and Victor Veitch (2019). “Adapting neural networks for the estimation oftreatment effects”. In:
Advances in Neural Information Processing Systems .Shi, Zijun, Kannan Srinivasan, and Kaifu Zhang (2020). “Design of Platform Reputation Systems:Optimal Information Disclosure”. In: Available at SSRN: https://ssrn.com/abstract=3557086.Shugars, Sarah and Nicholas Beauchamp (2019). “Why Keep Arguing? Predicting Engagement inPolitical Conversations Online”. In:
SAGE Open .38ridhar, Dhanya and Lise Getoor (2019). “Estimating causal effects of tone in online debates”. In:
Proceedings of the 28th International Joint Conference on Artificial Intelligence .Srinivasan, Kumar Bhargav, Cristian Danescu-Niculescu-Mizil, Lillian Lee, and Chenhao Tan (2019).“Content removal as a moderation strategy: Compliance and other outcomes in the ChangeMyViewcommunity”. In:
Proceedings of the ACM on Human-Computer Interaction
CSCW.Srivastava, Nitish, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov (2014).“Dropout: A Simple Way to Prevent Neural Networks from Overfitting”. In:
Journal of MachineLearning Research
15, pp. 1929–1958.Stigler, George J (1961). “The economics of information”. In:
Journal of political economy .Stock, James H and Motohiro Yogo (2005). “Testing for weak instruments in Linear Iv regression”. In:
Andrews DWK Identification and Inference for Econometric Models .Swanson, Sonja A (2019). “A practical guide to selection bias in instrumental variable analyses”. In:
Epidemiology .Taddy, Matt (2013). “Multinomial inverse regression for text analysis”. In:
Journal of the AmericanStatistical Association .Tadelis, Steven (2016). “Reputation and feedback systems in online platform markets”. In:
AnnualReview of Economics .Tan, Chenhao, Vlad Niculae, Cristian Danescu-Niculescu-Mizil, and Lillian Lee (2016). “Winningarguments: Interaction dynamics and persuasion strategies in good-faith online discussions”. In:
Proceedings of the 25th international conference on world wide web .Thompson, Dennis F (2008). “Deliberative democratic theory and empirical political science”. In:
Annual Reviews in Political Science .Tibshirani, Robert (1996). “Regression shrinkage and selection via the lasso”. In:
Journal of the RoyalStatistical Society: Series B (Methodological)
The persuasion handbook: Developments in theory andpractice .Toubia, Olivier, Garud Iyengar, Renée Bunnell, and Alain Lemaire (2019). “Extracting features ofentertainment products: A guided latent dirichlet allocation approach informed by the psychologyof media consumption”. In:
Journal of Marketing Research .Van der Laan, Mark J and James M Robins (2003).
Unified methods for censored longitudinal data andcausality . Springer Science & Business Media.Van der Laan, Mark J and Sherri Rose (2011).
Targeted learning: causal inference for observational andexperimental data .Veitch, Victor, Dhanya Sridhar, and David M Blei (2019). “Using text embeddings for causal inference”.In: arXiv preprint arXiv:1905.12741 . 39 ppendix A: Platform Rules
Rules for shared opinions:
• Rule A: Explain the reasoning behind your view (500+ characters required).
• Rule B: You must personally hold the view and demonstrate that you are open to it changing.
• Rule C: Submission titles must adequately sum up your view.
• Rule D: Posts cannot express a neutral stance, suggest harm against a specific person, or be self-promotional.
• Rule E: Only post if you are willing to have a conversation with those who reply to you, and are available to do so within 3 hours after posting.

An opinion may also be removed if it violates any of the following:
1. It was posted by a brand new account on a highly controversial topic.
2. The user already has an active opinion from the last 24 hours. This is to encourage posters to stay engaged with their posts and continue discussion.
3. It’s identical in principle to another post made within 24 hours before it.
4. Anything that is clearly spam or posted by a bot/novelty account.

Opinions will never be removed based on topic or content, so long as they follow the rules above.

Rules for responses to opinions:
• Rule 1: Direct responses to a submission must challenge or question at least one aspect of the submitted view.
• Rule 2: Don’t be rude or hostile to other users.
• Rule 3: Refrain from accusing the poster or anyone else of being unwilling to change their view.
• Rule 4: Award a delta when acknowledging a change in your view, and not for any other reason.
• Rule 5: Responses must contribute meaningfully to the conversation.

A response may also be removed if it violates any of the following:
1. It is a deliberate attempt to disrupt discussion.
2. Anything that is clearly spam or posted by a bot/novelty account.

Appendix B: Correlation Table
All correlations are significant at p < . .

                              (1)     (2)     (3)     (4)     (5)
(1) Reputation                1.00
(2) Skill                     0.29    1.00
(3) Position                 -0.11   -0.14    1.00
(4) Mean past position       -0.11   -0.14    0.22    1.00
(5) Number of past debates    0.88    0.16   -0.12   -0.10    1.00

Appendix C: Results Including Debates by Deleted Challengers
Dependent Variable: Debate Success Y_pu

                                   (1)       (2)       (3)       (4)
Reputation r_pu (10 units)         .***      .***      .***      .***
                                   (.)       (.)       (.)       (.)
Skill s_pu (percentage)            .***      .***      .***      .***
                                   (.)       (.)       (.)       (.)
Position t_pu (std. deviations)    −.***     −.***     −.***     −.***
                                   (.)       (.)       (.)       (.)
Response text (X_pu)               ✗         ✗         ✗         ✗
Instrument (Z_pu)                  ✗         ✗         ✗         ✓
Opinion fixed-effects (τ_p)        ✗         ✗         ✓         ✓
No. of debates                     , ,478    1, ,478   1, ,968   1, ,
R²                                 .054      0.013     0.        -

Note: Standard errors (clustered by opinion) displayed in parentheses. *** p < . ** p < . * p < .

Table 15: Main results. Estimated effects of reputation, skill and position on debate success with a logit model without opinion fixed-effects (1), linear probability models without (2) and with (3) opinion fixed-effects, linear instrumental variable (4) specifications. Position is standardized to have zero-mean and unit-variance. Average marginal effects and pseudo-R² are reported for the logit model. The instrument Z_pu is the mean past position of user u before they challenged opinion p.

Appendix D: Instrumental Variable Specification

                                   Reduced-Form
Dependent Variable: Reputation Y_pu
Mean past position Z_pu            −. (.)***
Skill s_pu (percentage)            . (.)***
Position t_pu (std. deviations)    −. (.)***
Opinion fixed-effects (τ_p)        ✓
No. of debates                     , ,
R²                                 .

Note: Standard errors displayed in parentheses. *** p < . ** p < . * p < .

Table 16: Reduced-form estimates.