Adjusting for Confounders with Text: Challenges and an Empirical Evaluation Framework for Causal Inference
Galen Weld*=, Peter West*=, Maria Glenski†, David Arbour‡, Ryan A. Rossi‡, and Tim Althoff*
* Paul G. Allen School of Computer Science and Engineering
† Pacific Northwest National Laboratory
‡ Adobe Research
= Contributed equally
Abstract
Leveraging text, such as social media posts, for causal inferences requires the use of NLP models to 'learn' and adjust for confounders, which could otherwise impart bias. However, evaluating such models is challenging, as ground truth is almost never available. We demonstrate the need for empirical evaluation frameworks for causal inference in natural language by showing that existing, commonly used models regularly disagree with one another on real world tasks. We contribute the first such framework, generalizing several challenges across these real world tasks. Using this framework, we evaluate a large set of commonly used causal inference models based on propensity scores and identify their strengths and weaknesses to inform future improvements. We make all tasks, data, and models public to inform applications and encourage additional research.
A frequent goal for computational social science practitioners is to understand the causal effect of intervening on a treatment of interest. Researchers often operationalize this by estimating the average treatment effect (ATE) of a specific treatment variable (e.g. therapy) on a specific outcome (e.g. suicide) (Pearl, 1995, 2009; Rosenbaum, 2010; Keith et al., 2020). A major challenge is adjusting for confounders (e.g. comments mentioning depression) that affect both the treatment and outcome (depression affects both an individual's propensity to receive therapy and their risk of suicide) (Keith et al., 2020). Without adjusting for depression as a confounder, we might look at suicide rates among therapy patients and those not receiving therapy, and wrongly conclude that therapy causes suicide. The gold standard for avoiding confounders is to assign treatment via a randomized controlled trial (RCT). Unfortunately, in many domains, assigning treatments in this manner is not feasible (e.g. due to ethical or practical concerns). Instead, researchers conduct observational studies (Rosenbaum, 2010), using alternate methods to adjust for confounders.

Figure 1: Causal graph representing the context of our evaluation framework. All edges have known probabilities. While our framework naturally generalizes to more complex scenarios, we chose binary treatments and outcomes, and a binary latent confounder, as even in this simple scenario, current models struggle.

Text (e.g. users' social media histories) can be used to adjust for confounding by training a model to recognize confounders (or proxies for confounders) in the text, so that similar treated and untreated observations can be compared. However, a recent review (Keith et al., 2020) finds that evaluating the performance of such models is "a difficult and open research question" as true ATEs are almost never known (with the extremely rare exception of constructed observational studies, conducted with a parallel RCT), and so, unlike in other NLP tasks, we cannot know the correct answer. In this work, we find that this challenge is amplified, as models disagree with one another on real world tasks (§3) – how do we know which is correct?

As ground truth is almost never available, the only practical method to evaluate causal inference models is with semi-synthetic data, where synthetic treatments and outcomes are assigned to real observations, as in Fig. 1 (Dorie et al., 2019; Jensen, 2019; Gentzel et al., 2019). While widely-used semi-synthetic benchmarks have produced positive results in the medical domain (Dorie et al., 2019), no such benchmark exists for causal inference models using text (Gentzel et al., 2019; Dorie et al., 2019; Jensen, 2019; Keith et al., 2020).

In this work, we contribute the first evaluation framework for causal inference with text, consisting of five tasks inspired by challenges from a wide range of studies (Keith et al., 2020) (§4, §5), built from real Reddit users' profiles and perturbed with synthetic posts to create increasing levels of difficulty (§5.2).

For NLP researchers: We make our tasks, data and models publicly available to encourage the development of stronger models for causal inference with text, and identify areas for improvement.

For CSS practitioners: We identify strengths and weaknesses of commonly used models, identifying those best suited for specific applications, and make these publicly available.

Causal Inference Primer.
We formalize causal inference using notation from Pearl (1995). Given a series of $n$ observations (in our context, a social media user), each observation is a tuple $O_i = (Y_i, T_i, X_i)$, where $Y_i$ is the outcome (e.g. did user $i$ develop a suicidal ideation?), $T_i$ is the treatment (e.g. did user $i$ receive therapy?), and $X_i$ is the vector of observed covariates (e.g. user $i$'s textual social media history).

The Fundamental Problem of Causal Inference is that each user is either treated or untreated, and so we can never observe both outcomes. Thus, we cannot compute the $ATE = \frac{1}{n}\sum_{i=1}^{n} Y_i[T_i{=}1] - Y_i[T_i{=}0]$ directly, and must estimate it by finding comparable treated and untreated observations. To do so, it is common practice to use a model to estimate the propensity score, $\hat{p}(X_i) \approx p(T_i = 1 \mid X_i)$, for each observation $i$. As treatments are typically known, propensity score models are effectively supervised classifiers, predicting $T_i$ given $X_i$. Matching, stratifying, or weighting using these propensity scores will produce an unbiased ATE estimate, provided key assumptions hold (e.g. the common support assumption) (Rosenbaum, 2010; Hill and Su, 2013). In practice, verifying these assumptions is difficult, hence the need for empirical evaluation.
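To make this setup concrete, here is a minimal sketch (our illustration, not part of the paper): a propensity score model is simply a supervised classifier predicting T_i from the text covariates X_i. The variable names and the bag-of-n-grams choice are assumptions made for the example.

```python
# Minimal sketch (assumed setup, not the paper's exact pipeline): estimate
# propensity scores p(T=1 | X) from users' text with a supervised classifier.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Hypothetical toy data: one concatenated history per user, plus binary treatments.
histories = ["I feel depressed lately ...", "went hiking, great weather ..."]
treatments = [1, 0]  # T_i: e.g. did user i receive therapy?

# Bag-of-n-grams representation of each user's history (X_i).
vectorizer = CountVectorizer(ngram_range=(1, 2), binary=True)
X = vectorizer.fit_transform(histories)

# The propensity model is just a classifier predicting T_i from X_i.
model = LogisticRegression(max_iter=1000).fit(X, treatments)
propensity_scores = model.predict_proba(X)[:, 1]  # estimated p(T_i = 1 | X_i)
```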
Causal Inference and NLP. Until recently, there has been little interaction between causal inference researchers and the NLP research community (Keith et al., 2020). There are many ways to consider text in a causal context, such as text as a mediator (Veitch et al., 2019), text as treatment (Wood-Doughty et al., 2018; Egami et al., 2018; Fong and Grimmer, 2016; Tan et al., 2014), text as outcome (Egami et al., 2018; Zhang et al., 2018), and causal discovery from text (Mani and Cooper, 2000; Mirza and Tonelli, 2016). However, we narrow our focus to text as a confounder, as in Keith et al. (2020). This is an important area of research because the challenge of adjusting for confounding underlies most causal contexts, such as text as treatment or outcome (Keith et al., 2020). Effective adjustment for confounding with text enables causal inference in any situation where observations can be represented with text – e.g. social media, news articles, and dialogue.
Adjusting for Confounding with Text.
A recent ACL review (Keith et al., 2020, Table 1) summarizes common practices across a diverse range of studies. Models and text representations used in these applications do not yet leverage recent breakthroughs in NLP, and generally fall into three groups: those using uni- and bi-gram representations (De Choudhury et al., 2016; Johansson et al., 2016; Olteanu et al., 2017), those using LDA or topic modeling (Falavarjani et al., 2017; Roberts et al., 2020; Sridhar et al., 2018), and those using neural word embeddings such as GLoVe (Pham and Shen, 2017) and BERT (Veitch et al., 2019). Three classes of estimators are commonly used to compute the ATE from text data: inverse probability of treatment weighting (IPTW), propensity score stratification, and matching, either using propensity scores or some other distance metric. In this work, we evaluate at least one variant of every commonly used model; our framework supports any ATE estimation method, including those computed using non-propensity-score-based matching, such as TIRM (Roberts et al., 2020) and exact matches (Mozer et al., 2020).

Figure 2: Treatment accuracy and ATE for both real world experiments, with bootstrapped 95% confidence intervals. Note that for the Gender Experiment, the models with the highest accuracy have the lowest ATE.
Evaluation of Causal Inference.
In rare specialized cases, researchers can use the unbiased outcomes of a parallel RCT to evaluate those of an observational study, as in Eckles and Bakshy (2017). This practice is known as a constructed observational study, and, while useful, is only possible where parallel RCTs can be conducted. Outside these limited cases, proposed models are typically evaluated on synthetic data generated by their authors. These synthetic datasets often favor the proposed model, and do not reflect the challenges faced by real applications (Keith et al., 2020). Outside of the text domain, widely used evaluation datasets have been successful, most notably the 2016 Atlantic Causal Inference Competition (Dorie et al., 2019), and a strong case has been made for the empirical evaluation of causal inference models (Gentzel et al., 2019; Jensen, 2019). In the text domain, matching approaches have been evaluated empirically (Mozer et al., 2020), but this approach evaluates only the quality of matches, not the causal effect estimates. In contrast, our work applies to all estimators, not just matching, and evaluates the entire causal inference pipeline.
Recent causal inference papers (Veitch et al., 2019; Roberts et al., 2020; De Choudhury et al., 2016; Chandrasekharan et al., 2017; Bhattacharya and Mehrotra, 2016) have used social media histories to adjust for confounding. Each of these papers uses a different model: BERT in Veitch et al. (2019), topic modeling in Roberts et al. (2020), and logistic regression in De Choudhury et al. (2016). For all of these studies, ground truth causal effects are unavailable, and so we cannot tell if the chosen model was correct. However, we can compute their prediction accuracy on propensity scores, and see if their ATE estimates agree; if they do not, then at most one disagreeing model can be correct.
Methods.
We conducted two experiments using real world data from Reddit, inspired by these recent papers. In the Moderation Experiment, we test if having a post removed by a moderator impacts the amount a user later posts to the same community. In the Gender Experiment, we use data from Veitch et al. (2019) to study the impact of the author's gender on the score of their posts. For details on data collection, see §A.

Results.
Comparing the performance of nine different models (Fig. 2), we find that all models have similar treatment accuracy in the Moderation Experiment. However, the models using 1,2-gram features perform better in the Gender Experiment than the LDA and SHERBERT models. Most importantly, we see that all models have mediocre treatment accuracy (Fig. 2a,c) and the models with the highest treatment accuracy produce the lowest ATE estimates (Fig. 2b,d), which in many cases disagree entirely with estimates from other models.
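For reference, the bootstrapped confidence intervals in Fig. 2 can be obtained with a standard percentile bootstrap over users. The sketch below is illustrative only: the function `estimate_ate` and the numpy-array inputs are assumptions, not code from the paper.

```python
# Sketch of a percentile bootstrap 95% CI for an ATE estimate, assuming
# `estimate_ate(treatments, outcomes, scores)` is some estimator of interest.
import numpy as np

def bootstrap_ci(treatments, outcomes, scores, estimate_ate, n_boot=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(treatments)
    estimates = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample users with replacement
        estimates.append(estimate_ate(treatments[idx], outcomes[idx], scores[idx]))
    return np.percentile(estimates, [2.5, 97.5])
```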
Implications.
This should come as a great concern to the research community. We do not know which model may be correct, and we do not know whether there may be a more accurate model that would even further decrease the estimated treatment effect. We derive theoretical bounds and compute them, finding that in 99+% of cases, these bounds are looser than those computed empirically using our framework (§C), making them less useful for model selection. This concern motivates our research questions (§1) and underlines the importance and urgency of empirical evaluation for causal inference in natural language. Next, we describe key challenges in adjusting for confounding with text and present a principled evaluation framework that highlights these challenges and generates actionable insights for future research.
Using the common setting of real social media histories (De Choudhury et al., 2016; Olteanu et al., 2017; Veitch et al., 2019; Choudhury and Kiciman, 2017; Falavarjani et al., 2017; Kiciman et al., 2018; Saha et al., 2019; Roberts et al., 2020), we identify five challenges consistently present when representing natural language for causal inference:

1. Linguistic Complexity: Different expressions can be indicative of important underlying commonalities and signals. Someone who struggles with mental health might write "I feel depressed" or "I am isolated from my peers," which have distinct meanings but both may be indicative of depression. Can models recognize that both are relevant?

2. Signal Intensity: Some users only have a few posts that contain a specific signal (such as poor mental health) whereas others may have many posts with this signal. Signals are especially weak when posts containing the signal constitute only a small fraction of a user's posts. Can models detect weak signals?

3. Strength of Selection Effect: Many studies have few comparable treated and untreated users (§2) (Li et al., 2018; Crump et al., 2009). Can models adjust for strong selection effects?

4. Sample Size: Observational studies often face data collection limitations. Can models perform well with limited data samples?

5. Placebo Test: Oftentimes, no causal effect is present between a given treatment and an outcome. Do models falsely predict causality when none is present?
While natural language is far more complex than any finite set of challenges can capture, the five we have chosen to highlight are challenges that regularly need to be addressed in causal inference tasks that use natural language (Keith et al., 2020). They also cover three key concepts of model performance: generalizability (linguistic complexity), sensitivity (signal intensity, strength of selection effect), and usability (sample size, placebo test) that are critical for comprehensive evaluation. (In Keith et al. (2020, Table 1), 8/12 studies had fewer than 5,000 observations, and 4/12 had fewer than 1,000.) To produce our evaluation framework, we derive a concrete task from each challenge.

We generate five tasks, each with discrete levels of difficulty, and corresponding semi-synthetic task datasets based on real social media histories. Without the semi-synthetic component, it would not be possible to empirically evaluate a model, as we would not know the true ATE. By basing our user histories on real data, we are able to include much of the realism of unstructured text found 'in the wild.' This semi-synthetic approach to evaluation preserves the best of both worlds: the empiricism of synthetic data with the realism of natural data (Jensen, 2019; Gentzel et al., 2019).
The method for generating a semi-synthetic dataset can be arbitrarily complex; however, for simplicity and clarity, we generate our datasets according to a simplified model of the universe, where all confounding is present in the text, and where there are only two types of people, class 1 and class 2 (Fig. 1). In the context of mental health, for example, these two classes could simply be people who struggle with depression (class 1), and those who don't (class 2). If models struggle on even this simple two-class universe, as we find, then it is highly unlikely they will perform better in the more complex real world. In this universe, the user's (latent) class determines the probability of treatment and outcome conditioned on treatment. Dependent on class, but independent of treatment and outcome, is the user's comment history, which contains both synthetic and real posts that are input to the model to produce propensity scores.

We produce each dataset using a generative process (§B). For each task, we start with the same collection of real world user histories from public Reddit profiles. We randomly assign (with .5/.5 probability) each user to class 1 or class 2. Into each profile, we insert synthetic posts using a function f_n for class n specific to each task, described in §5.2. Treatments and outcomes are then assigned according to the class-conditional probabilities in §B. These outcomes and treatments could represent anything of interest, and they need not be binary.

To estimate the ATE, there must be overlap between the treated and untreated groups (§2), so we do not simply assign all users in class 1 treated and all users in class 2 untreated; instead, we assign treatment with a biased coin-flip: treated with P = . and untreated with P = . for class 1, and the opposite for class 2. These true propensities are not more extreme than those commonly accepted in practice (Crump et al., 2009; Lee et al., 2011; Yang and Ding, 2018). Once a treatment has been assigned according to the class' probabilities, a positive outcome is assigned with probability . (treated) and . (untreated) for class 1, and . for both treatments for class 2. These probabilities are the 'default' and are used in all tasks except Tasks 5.2.3 and 5.2.5, where we vary them to explore those challenges specifically. The objective for the models is to recover accurate propensity scores from these histories.

We use Reddit user histories as the real world component of our semi-synthetic datasets. Reddit was selected as our natural data source due to its use in De Choudhury et al. (2016), its public nature, and its widespread use in the research community. We downloaded all Reddit comments for the 2014 and 2015 calendar years from the Pushshift archives (Baumgartner et al., 2020) and grouped comments by user. After filtering out users with fewer than 10 comments, we randomly sampled 8,000 users and truncated users' histories to a maximum length of 60 posts for computational practicality. (The resulting set of users had a mean of 41 posts/user, a mean of 37.37 tokens/post, and a mean of 1523.28 tokens/user.) These users were randomly partitioned into three sets: a 3,200 user training set, an 800 user validation set, and a 4,000 user test set used to compute Treatment Accuracy and ATE Bias.
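A minimal sketch of this generative process is given below. It is an illustration only: the probability values are placeholders (the actual defaults appear in §B), and `insert_post_fns` stands in for the task-specific functions f_1 and f_2.

```python
# Sketch of the semi-synthetic generative process. The probability values
# below are placeholders, not the paper's defaults (see Appendix B).
import random

P_TREAT = {1: 0.8, 2: 0.2}                      # placeholder P(T=1 | class)
P_OUTCOME = {(1, 1): 0.8, (1, 0): 0.2,          # placeholder P(Y=1 | class, T)
             (2, 1): 0.5, (2, 0): 0.5}

def generate_user(history, insert_post_fns):
    """Assign a latent class, perturb the real history, and draw T and Y."""
    cls = random.choice([1, 2])                       # latent class, P = .5/.5
    history = insert_post_fns[cls](history)           # f_1 or f_2 modifies the text
    treated = int(random.random() < P_TREAT[cls])     # biased coin-flip treatment
    outcome = int(random.random() < P_OUTCOME[(cls, treated)])
    return history, treated, outcome, cls             # cls is hidden from models
```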
When generating semi-synthetic tasks, we insert three types of synthetic posts (§D), representative of major life events that could impact mental health, into real users' histories:

• Sickness Posts describe being ill (e.g. 'The doctor told me I have AIDS'). We vary both the illness and the way it is expressed.

• Social Isolation Posts indicate a sense of isolation or exclusion ('I feel so alone, my last friend said they needed to stop seeing me.').

• Death Posts describe the death of a companion (e.g. 'I just found out my Mom died'). We vary the phrasing as well as the companion.

A complete list of all posts of each type is in §D.

We consider five tasks focused around the common challenges for text-based causal inference methods previously highlighted in §4.

5.2.1 Linguistic Complexity
This task tests a model's resilience to the linguistic complexity of text inputs, i.e. the ability to recognize synonyms and the shared importance of dissimilar phrases. We increase the difficulty in four steps by increasing the diversity of synthetic sentences inserted into user histories assigned to class 1 (i.e. the linguistic complexity of the dataset): f_1 initially appends the same Sickness Post to the end of each class 1 user's history; at the second level of difficulty, f_1 selects a Sickness Post uniformly at random; at the third level, f_1 selects either a Sickness or Social Isolation Post; and at the fourth level, f_1 selects a Sickness, Social Isolation, or Death Post. For each level of difficulty, f_2 is the identity function, i.e. user histories assigned to class 2 are unchanged.

5.2.2 Signal Intensity
This task tests a model's ability to distinguish between the number of similar posts in a history. There are two levels of difficulty. At the easier level, f_1 appends 10 randomly sampled (with replacement) Sickness Posts, while f_2 is the identity function. At the harder level, f_1 appends only three Sickness Posts, while f_2 appends one.

5.2.3 Strength of Selection Effect
In this and the following tasks, we do not vary f_1 or f_2 across difficulty levels. For Strength of Selection Effect, we make causal inference more challenging by increasing the strength of the selection effect, decreasing the overlap between treated and untreated users (§2). For the weaker selection effect (easier), we assign a larger fraction of class 1 to the treated group and class 2 to the control group. For the stronger selection effect (harder), we increase this split for class 1 to .95/.05. For both the weak and strong selection effects, we use f_1 to append a single random Sickness Post and f_2 as the identity function. Outcome probabilities, conditioned on treatment, are unchanged from the defaults described above.

5.2.4 Sample Size
In this task, we test how the models' performance drops off as the amount of available training data is reduced. As before, we use f_1 to append a single random Sickness Post and f_2 as the identity function. For the easiest case, we train on all 3,200 users' histories in the training set. We then create smaller training sets by randomly sampling subsets with 1,600 and 800 users.

5.2.5 Placebo Test
The final task assesses a model's tendency to predict a treatment effect when none is present. To do so, we must have asymmetric treatment probabilities between class 1 and class 2. Without this asymmetry, the unadjusted estimate would be equal to the true ATE of zero. We use the same asymmetric class 1 treatment split as in §5.2.3, with P(Y=1 | T=0, class=1) = . , P(Y=1 | T=1, class=2) = . , and the opposite for Y=0. This gives a treatment effect of +.9 to class 1 and a treatment effect of -.9 to class 2, making the true ATE for the entire task equal 0. As in previous tasks, f_1 appends one random Sickness Post and f_2 is the identity function.

We evaluate commonly used text representations, propensity score models, and ATE estimators.
The Oracle uses the true propensity scores, which are known in our semi-synthetic evaluation framework (§5). We also include the Unadjusted Estimator, which uses the naive method of not adjusting for selection effects, producing an estimated treatment effect of $\bar{Y}_{T=1} - \bar{Y}_{T=0}$, and as such is a lower bound for models that attempt to correct for selection effects.

We train a Simple Neural Net (with one fully connected hidden layer) in four variants with different text representations: 1-grams with a binary encoding, 1,2-grams with a binary encoding, 1,2-grams with counts, and Latent Dirichlet Allocation (LDA) features (Blei et al., 2003) based on 1,2-grams, counted. We also train Logistic Regression models on the same four text representations.

Finally, we propose and evaluate a novel cauSal HiERarchical variant of BERT, which we call
SHERBERT. SHERBERT expands upon Causal BERT, proposed by Veitch et al. (2019), which is too computationally intensive to scale to user histories containing more than 250 tokens, let alone ones orders of magnitude longer, such as in our tasks. In SHERBERT, we use one pretrained BERT model per post to produce a post-embedding (Appendix Fig. 5), followed by two hierarchical attention layers to produce a single embedding for the entire history, with a final linear layer to estimate the propensity score. This architecture is similar to HIBERT (Zhang et al., 2019), but is faster to train on long textual histories, as SHERBERT fixes the pretrained BERT components.
We consider three commonly used ATE estimators – IPTW, stratification, and matching. All three estimators use propensity scores (§2) but differ in how they weight or group relevant samples.
Inverse Propensity of Treatment Weighting estimates the ATE by weighting each user by their relevance to selection effects:

$$\widehat{ATE}_{IPTW} = \sum_{i=1}^{n} \frac{(2T_i - 1)\, Y_i}{\hat{p}_{T_i}(X_i)} \cdot \left[ \sum_{j=1}^{n} \frac{1}{\hat{p}_{T_j}(X_j)} \right]^{-1}$$

where $T_i$, $Y_i$, and $X_i$ are treatment, outcome, and features for sample $i$, and $\hat{p}_T(X)$ is the estimated propensity for treatment $T$ on features $X$. Use of the Hájek estimator (1970) adjustment improves stability compared to simple inverse propensity weighting.
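A direct implementation of this estimator might look as follows (a sketch of the formula above, assuming numpy arrays of binary treatments, binary outcomes, and estimated propensities; not code from the paper's repository).

```python
# Sketch of the Hajek-normalized IPTW ATE estimate described above.
# `propensity` holds p_hat(T=1 | X_i); `treatments` and `outcomes` are 0/1 arrays.
import numpy as np

def ate_iptw(treatments, outcomes, propensity):
    # p_hat_{T_i}(X_i): probability of the treatment actually received.
    p_received = np.where(treatments == 1, propensity, 1.0 - propensity)
    signs = 2 * treatments - 1                     # +1 for treated, -1 for untreated
    weights_sum = np.sum(1.0 / p_received)         # Hajek normalization term
    return np.sum(signs * outcomes / p_received) / weights_sum
```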
Stratification divides users into strata based on their propensity score, and the ATE for each is averaged:

$$\widehat{ATE}_{strat} = \frac{1}{n} \sum_{k} n_k \cdot \widehat{ATE}_k$$

where $n$ is the total number of users, $n_k$ is the number of users in the k-th stratum, and $\widehat{ATE}_k$ is the unadjusted ATE within the k-th stratum. We report results on 10 strata divided evenly by percentile, but results are qualitatively similar for other $k$.

Matching can be considered as a special case of stratification, where each stratum contains only one treated user. As matching produces extremely similar results to stratification, we include details of our approach and plots of the results in §F.1.
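A corresponding sketch of the stratified estimator, again illustrative rather than the authors' implementation, using 10 percentile-based strata:

```python
# Sketch of propensity score stratification with 10 percentile-based strata.
import numpy as np

def ate_stratified(treatments, outcomes, propensity, n_strata=10):
    edges = np.percentile(propensity, np.linspace(0, 100, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, propensity, side="right") - 1, 0, n_strata - 1)
    total, n = 0.0, len(treatments)
    for k in range(n_strata):
        mask = strata == k
        treated, control = mask & (treatments == 1), mask & (treatments == 0)
        if treated.any() and control.any():   # unadjusted ATE within the stratum
            ate_k = outcomes[treated].mean() - outcomes[control].mean()
            total += mask.sum() * ate_k
    return total / n
```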
Figure 3: Results for tasks, with bootstrapped 95% confidence intervals, perturbed along the x-axis for readability. Within each plot, difficulty increases from left to right. SHERBERT generally does well, especially on Strength of Selection Effect and Absence of Non-Zero Treatment Effect, but struggles on Signal Intensity. (Panels show Treatment Accuracy and Bias of ATE, for IPTW and Stratified estimators, on the Linguistic Complexity, Signal Intensity, Strength of Selection Effect, Sample Size, and Placebo Test tasks; each panel compares the nine propensity models with the Oracle propensity, the Theoretical Optimum, and the Unadjusted Estimator.)

Our semi-synthetic tasks are generated such that we know the true ATE and thus can compute the Bias of $\widehat{ATE}$. A bias of zero is optimal, indicating a correct estimated ATE. The greater the bias, positive or negative, the worse the model performance. This is the primary metric we use in evaluation, and we compute it for both $\widehat{ATE}_{strat}$ and $\widehat{ATE}_{IPTW}$. We also consider Treatment Accuracy, the accuracy of the model's predictions of binary treatment assignment. While higher accuracy is often better, high accuracy does not guarantee low bias. We include additional metrics (Spearman Correlation of Estimated Propensity Scores and Mean Squared Error of IPTW for each task) in §F.2.
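For clarity, the two primary metrics reduce to the following (a trivial sketch with assumed numpy inputs, not from the paper's code):

```python
# Sketch of the two primary metrics: ATE bias and treatment accuracy.
import numpy as np

def ate_bias(estimated_ate, true_ate):
    return estimated_ate - true_ate          # zero is optimal; sign indicates direction

def treatment_accuracy(propensity, treatments, threshold=0.5):
    predicted = (propensity >= threshold).astype(int)
    return float(np.mean(predicted == treatments))
```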
Transformers better model relevant linguistic variation. Many trends in the results manifest in the Linguistic Complexity task (§5.2.1).

Transformer models struggle with counting and ordering. This is most visible in the Signal Intensity task (§5.2.2), where SHERBERT struggles (Fig. 3d–f).

High accuracy often reflects strong selection effects, not low ATE bias. In the Strength of Selection Effect task (§5.2.3), a stronger selection effect makes it easier to distinguish between the two groups. We see corresponding increases in Treatment Accuracy (Fig. 3g); however, bias worsens (Fig. 3h,i). In the context of observational studies, models with high treatment accuracy should be used with extreme caution: high accuracy likely reflects that the common support assumption is violated, preventing causal inference. This highlights the importance of empirical evaluation of the complete causal inference pipeline.

Transformer models fail with limited data. The Sample Size task (§5.2.4) shows this: SHERBERT's performance drops sharply as the number of training users is reduced (Fig. 3j).

Models predict causality when none is present. Alarmingly, in the Placebo Test (§5.2.5), most models fail to include the true effect (ATE = 0) in their 95% confidence intervals (Fig. 3n,o), including high accuracy models using bigram features (Fig. 3m). This result is of greatest concern, as eight out of nine methods falsely claim a non-zero effect.
Models have greater impact than estimators.
Each estimator evaluated produced overall similar results (Fig. 6), with the quality of the propensity scores being far more impactful. However, IPTW is more sensitive to extreme propensity scores (Fig. 3h). See §F.2 for more details.
Causal inferences are difficult to evaluate in the absence of ground truth causal effects – a limitation of virtually all real world observational studies. Despite this absence, we can compare different models' estimates and demonstrate that different models regularly disagree with one another. Empirical evaluation requires knowledge of the true treatment effects. Our proposed evaluation framework is reflective of five key challenges for causal inference in natural language. We evaluate every commonly used propensity score model to produce key insights:

For NLP Researchers, we find that continued development of transformer-based models offers a promising path towards rectifying deficiencies of existing models. Models are needed that can effectively represent the order of text, variability in expression, and the counts of key tokens. Given the limited availability of training data in many causal inference applications, more research is needed in adapting pretrained transformers to small data settings (Gururangan et al., 2020). We hope our public framework will provide a principled method for evaluating future NLP models for causal inference.

For CSS Practitioners, we find that transformer-based models such as SHERBERT, which we make publicly available, perform the best in all cases except those with very limited data. Models with high accuracy should be applied with great care, as this is likely indicative of a strong and unadjustable selection effect. Many models failed our placebo test by making false causal discoveries, a major problem (Aarts et al., 2015; Freedman et al., 2015).

References
Alexander Aarts, Joanna Anderson, Christopher Anderson, Peter Attridge, Angela Attwood, Jordan Axt, Molly Babel, Štěpán Bahník, Erica Baranski, Michael Barnett-Cowan, Elizabeth Bartmess, Jennifer Beer, Raoul Bell, Heather Bentley, Leah Beyan, Grace Binion, Denny Borsboom, Annick Bosch, Frank Bosco, and Mike Penuliar. 2015. Estimating the reproducibility of psychological science. Science, 349.

Alberto Abadie and Guido W. Imbens. 2016. Matching on the estimated propensity score. Econometrica, 84(2):781–807.

David Arbour and Drew Dimmery. 2019. Permutation weighting.

Jason Baumgartner, Savvas Zannettou, Brian Keegan, Megan Squire, and Jeremy Blackburn. 2020. The Pushshift Reddit dataset.

P. Bhattacharya and R. Mehrotra. 2016. The information network: Exploiting causal dependencies in online information seeking. In Proceedings of the 2016 ACM on Conference on Human Information Interaction and Retrieval, CHIIR '16.

D. M. Blei, A. Y. Ng, and M. I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res., 3:993–1022.

Glenn W. Brier. 1950. Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1):1–3.

Eshwar Chandrasekharan, Umashanthi Pavalanathan, Anirudh Srinivasan, Adam Glynn, Jacob Eisenstein, and Eric Gilbert. 2017. You can't stay here: The efficacy of Reddit's 2015 ban examined through hate speech. Proc. ACM Hum.-Comput. Interact., 1(CSCW).

Eshwar Chandrasekharan, Mattia Samory, Shagun Jhaver, Hunter Charvat, Amy Bruckman, Cliff Lampe, Jacob Eisenstein, and Eric Gilbert. 2018. The Internet's hidden rules: An empirical study of Reddit norm violations at micro, meso, and macro scales. Proc. ACM Hum.-Comput. Interact., 2(CSCW).

Munmun De Choudhury and Emre Kiciman. 2017. The language of social support in social media and its effect on suicidal ideation risk. Proceedings of the International AAAI Conference on Weblogs and Social Media, 2017:32–41.

Richard K. Crump, V. Joseph Hotz, Guido W. Imbens, and Oscar A. Mitnik. 2009. Dealing with limited overlap in estimation of average treatment effects. Biometrika, 96(1):187–199.

Munmun De Choudhury, Emre Kiciman, Mark Dredze, Glen Coppersmith, and Mrinal Kumar. 2016. Discovering shifts to suicidal ideation from mental health content in social media. In Proceedings of the 2016 CHI Conference on Human Factors in Computing Systems, CHI '16.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In NAACL.

Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, and Dan Cervone. 2019. Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statist. Sci., 34(1):43–68.

Dean Eckles and Eytan Bakshy. 2017. Bias and high-dimensional adjustment in observational studies of peer effects.

Naoki Egami, Christian J. Fong, Justin Grimmer, Margaret E. Roberts, and Brandon M. Stewart. 2018. How to make causal inferences using texts.

Seyed Amin Mirlohi Falavarjani, Hawre Hosseini, Zeinab Noorian, and Ebrahim Bagheri. 2017. Estimating the effect of exercising on users' online behavior. In AAAI 2017.

Christian Fong and Justin Grimmer. 2016. Discovery of treatments from text corpora. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1600–1609, Berlin, Germany. Association for Computational Linguistics.

Leonard P. Freedman, Iain M. Cockburn, and Timothy S. Simcoe. 2015. The economics of reproducibility in preclinical research. PLOS Biology, 13(6):1–9.

Amanda Gentzel, Dan Garant, and David Jensen. 2019. The case for evaluating causal models using interventional measures and empirical data. In H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Systems 32, pages 11722–11732. Curran Associates, Inc.

Suchin Gururangan, Ana Marasović, Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, and Noah A. Smith. 2020. Don't stop pretraining: Adapt language models to domains and tasks.

J. Hájek. 1970. A characterization of limiting distributions of regular estimates. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete.

Jennifer Hill and Yu-Sung Su. 2013. Assessing lack of common support in causal inference using Bayesian nonparametrics: Implications for evaluating the effect of breastfeeding on children's cognitive outcomes. Ann. Appl. Stat., 7(3):1386–1420.

David Jensen. 2019. Comment: Strengthening empirical evaluation of causal inference methods. Statist. Sci., 34(1):77–81.

Fredrik Johansson, Uri Shalit, and David Sontag. 2016. Learning representations for counterfactual inference. In ICML.

Katherine A. Keith, David Jensen, and Brendan O'Connor. 2020. Text and causal inference: A review of using text to remove confounding from causal estimates. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Seattle, WA. Association for Computational Linguistics.

Emre Kiciman, Scott Counts, and Melissa Gasser. 2018. Using longitudinal social media analysis to understand the effects of early college alcohol use. In ICWSM.

Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization.

Brian K. Lee, Justin Lessler, and Elizabeth A. Stuart. 2011. Weight trimming and propensity score weighting. PLOS ONE, 6(3):1–6.

Fan Li, Laine E. Thomas, and Fan Li. 2018. Addressing extreme propensity scores via the overlap weights. American Journal of Epidemiology, 188(1):250–257.

Subramani Mani and Gregory F. Cooper. 2000. Causal discovery from medical textual data. Proceedings, AMIA Symposium, pages 542–6.

Paramita Mirza and Sara Tonelli. 2016. CATENA: CAusal and TEmporal relation extraction from NAtural language texts. In Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages 64–75, Osaka, Japan. The COLING 2016 Organizing Committee.

Reagan Mozer, Luke Miratrix, Aaron Russell Kaufman, and L. Jason Anastasopoulos. 2020. Matching with text data: An experimental evaluation of methods for matching documents and of measuring match quality. Political Analysis, pages 1–24.

Alexandra Olteanu, Onur Varol, and Emre Kiciman. 2017. Distilling the outcomes of personal experiences: A propensity-scored analysis of social media. In Proceedings of the 2017 ACM Conference on Computer Supported Cooperative Work and Social Computing, CSCW '17.

Judea Pearl. 1995. Causal diagrams for empirical research. Biometrika, 82(4).

Judea Pearl. 2009. Causality. Cambridge University Press.

Thai T. Pham and Yuanyuan Shen. 2017. A deep causal inference approach to measuring the effects of forming group loans in online non-profit microfinance platform.

Margaret E. Roberts, Brandon M. Stewart, and Richard A. Nielsen. 2020. Adjusting for confounding with text matching.

Paul R. Rosenbaum. 2010. Design of Observational Studies. Springer.

Koustuv Saha, Benjamin Sugar, John B. Torous, Bruno D. Abrahao, Emre Kiciman, and Munmun De Choudhury. 2019. A social media study on the effects of psychiatric medication use. Proceedings of the International AAAI Conference on Weblogs and Social Media, 13:440–451.

Dhanya Sridhar, Aaron Springer, Victoria Hollis, Steve Whittaker, and Lise Getoor. 2018. Estimating causal effects of exercise from mood logging data.

Chenhao Tan, Lillian Lee, and Bo Pang. 2014. The effect of wording on message propagation: Topic- and author-controlled natural experiments on Twitter. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 175–185, Baltimore, Maryland. Association for Computational Linguistics.

Victor Veitch, Dhanya Sridhar, and David M. Blei. 2019. Using text embeddings for causal inference. CoRR, abs/1905.12741.

Yongji Wang, Hongwei Cai, Chanjuan Li, Zhiwei Jiang, Ling Wang, Jiugang Song, and Jielai Xia. 2013. Optimal caliper width for propensity score matching of three treatment groups: A Monte Carlo study. PLOS ONE, 8(12):1–7.

Zach Wood-Doughty, Ilya Shpitser, and Mark Dredze. 2018. Challenges of using text classifiers for causal inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4586–4598, Brussels, Belgium. Association for Computational Linguistics.

S. Yang and P. Ding. 2018. Asymptotic inference of causal effects with observational studies trimmed by the estimated propensity scores. Biometrika, 105(2):487–493.

Justine Zhang, Jonathan Chang, Cristian Danescu-Niculescu-Mizil, Lucas Dixon, Yiqing Hua, Dario Taraborelli, and Nithum Thain. 2018. Conversations gone awry: Detecting early signs of conversational failure. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1350–1361, Melbourne, Australia. Association for Computational Linguistics.

Xingxing Zhang, Furu Wei, and Ming Zhou. 2019. HIBERT: Document level pre-training of hierarchical bidirectional transformers for document summarization. In ACL.

A Moderation and Gender Experiments – Data Collection Details
A.1 Moderation Experiment
In the Moderation Experiment, we test if having a post removed by a moderator impacts the amount a user later posts to the same community. For this experiment, we use 13,786 public Reddit histories (all of which contain more than 500 tokens) from users in /r/science from 2015–2017 who had not had a post removed prior to 2018. Our treated users are those who have had a post removed in 2018. Our untreated users are those who have not had a post removed in 2018 (nor before). The outcome of interest is the number of posts they made in 2019.

To determine which users have had posts removed, we utilize the Pushshift Reddit API (Baumgartner et al., 2020). The data accessible via this API, in combination with publicly available Pushshift dump archives, allow us to compare two snapshots of each Reddit post: one snapshot made within a few seconds of posting, and one made approximately 2 months later. By comparing these two versions, we can tell a) which user made the post, and b) if it was removed. This approach is similar to that of Chandrasekharan et al. (2018).

This experiment mimics the setup in De Choudhury et al. (2016), where each user is represented by their entire Reddit comment history within specific subreddits. While De Choudhury et al. (2016) has been influential in our work, their dataset is not public, and publicly available comparable data contains only a relatively small set of Reddit users, leading to underpowered experiments with large, uninformative confidence intervals that fail to reproduce the findings in the original paper.
A.2 Gender Experiment
In the Gender Experiment, we use the dataset made public by Veitch et al. (2019), which consists of single posts from three subreddits: /r/okcupid, /r/childfree, and /r/keto. Each post is annotated with the gender (male or female) of the poster, which is considered the treatment. The outcome is the score of the post (number of 'upvotes' minus number of 'downvotes').

B Model of Conditional Probabilities used for Assignment of Treatment and Outcome
Figure 4: The latent class is used to assign treatments and outcomes to users, and to modify their histories (§5.1). The latent class itself is hidden from the model.

Below are the 'default' outcome probabilities used in the synthetic data generation process, conditioned on user class:

P(Y=1 | T=0, class=1) = . , P(Y=0 | T=0, class=1) = .
P(Y=1 | T=1, class=1) = . , P(Y=0 | T=1, class=1) = .
P(Y=1 | T=0, class=2) = . , P(Y=0 | T=0, class=2) = .
P(Y=1 | T=1, class=2) = . , P(Y=0 | T=1, class=2) = .

These probabilities are used unless otherwise indicated in §5.2.

C Theoretical Bounds
We leverage recent results of Arbour and Dimmery (2019) to bound the expected bias of the ATE, $\bar{Y}[T{=}1] - \bar{Y}[T{=}0]$, by considering the weighted risk of the propensity score:

$$\Bigl|\, \mathbb{E}\bigl[\hat{Y}(T)\bigr] - \mathbb{E}\bigl[Y(T)\bigr] \,\Bigr| \;\le\; \Bigl|\, \mathbb{E}\Bigl[ \frac{Y\, p(T \mid X)\, S\bigl(\hat{p}(T \mid X),\, p(T \mid X)\bigr)}{\hat{p}(T \mid X)^{2}} \Bigr] \Bigr|$$

where $\hat{p}$ and $p$ are the estimated and true propensity score, and $S$ is the Brier score (1950). Conceptually, this bound suggests that the bias grows as a function of the Brier score between estimated and true propensity score (numerator), and the inverse of the squared estimate of the propensity score, significantly penalizing very small scores.

Findings.
We compute these bounds using the estimated propensity score and find that they are largely uninformative in practice. In 250/252 cases, the empirical confidence interval (Fig. 6) provides a tighter bound than the theoretical bound, and in 230/252 cases even the Unadjusted Estimator's bias falls within the theoretical bound.

Details of Derivation.
The central challenge is estimating the error of the counterfactual quantities, $Y(1)$ and $Y(0)$. Recall that in the case of weighting estimators, when the true propensity score $p(\cdot)$ is available, these are estimated as $\mathbb{E}[y(T)] = \mathbb{E}\bigl[\frac{Y}{p(T)}\bigr]$, where $y$ is the observed outcome. For the problem addressed in this paper, the propensity must be estimated. Estimating the error for each potential outcome under an estimated propensity score results in a bias of

$$\Bigl|\, \mathbb{E}\bigl[\hat{Y}(T)\bigr] - \mathbb{E}\bigl[Y(T)\bigr] \,\Bigr| = \Bigl|\, \mathbb{E}\Bigl[\frac{Y}{p(T \mid X)}\Bigr] - \mathbb{E}\Bigl[\frac{Y}{\hat{p}(T \mid X)}\Bigr] \Bigr|$$

following Proposition 1 of Arbour and Dimmery (2019). More concretely, an empirical upper bound can be obtained for Equation 1 given a lower bound on the true propensity score. Specifically, replacing $p$ with the lower bound and using the weighted cross-validated Brier score will provide a conservative bound on the bias of the counterfactual. This bound can be tightened with further assumptions, for example by assuming instance level bounds on $p$ instead of a global bound. Balancing weights may also be used to estimate the bias directly using only empirical quantities (Arbour and Dimmery, 2019).

Note that due to the evaluation framework in this paper, the true propensity score $p$ is known, and therefore we do not need to apply loose bounds. Writing $\hat{p}(T) = p(T) + (\hat{p}(T) - p(T))$ and simplifying, the bias of each counterfactual satisfies

$$\Bigl|\, \mathbb{E}\bigl[\hat{Y}(T)\bigr] - \mathbb{E}\bigl[Y(T)\bigr] \,\Bigr| \;\le\; \Bigl|\, \mathbb{E}\Bigl[\frac{y\, p(T)\, S(\hat{p}(T), p(T))}{\hat{p}(T)^{2}}\Bigr] \Bigr| \qquad (1)$$

After obtaining the bounds on the individual counterfactual quantities, the corresponding lower and upper bias bounds on the average treatment effect can be constructed by considering

$$\hat{Y}(0) + \Bigl|\, \mathbb{E}\Bigl[\frac{y\, p(T{=}0 \mid X)\, S(\hat{p}(T{=}0 \mid X), p(T{=}0 \mid X))}{\hat{p}(T{=}0 \mid X)^{2}}\Bigr] \Bigr| \qquad (2)$$

$$\hat{Y}(1) - \Bigl|\, \mathbb{E}\Bigl[\frac{y\, p(T{=}1 \mid X)\, S(\hat{p}(T{=}1 \mid X), p(T{=}1 \mid X))}{\hat{p}(T{=}1 \mid X)^{2}}\Bigr] \Bigr| \qquad (3)$$

and

$$\hat{Y}(0) - \Bigl|\, \mathbb{E}\Bigl[\frac{y\, p(T{=}0 \mid X)\, S(\hat{p}(T{=}0 \mid X), p(T{=}0 \mid X))}{\hat{p}(T{=}0 \mid X)^{2}}\Bigr] \Bigr| \qquad (4)$$

$$\hat{Y}(1) + \Bigl|\, \mathbb{E}\Bigl[\frac{y\, p(T{=}1 \mid X)\, S(\hat{p}(T{=}1 \mid X), p(T{=}1 \mid X))}{\hat{p}(T{=}1 \mid X)^{2}}\Bigr] \Bigr| \qquad (5)$$

respectively.

D Templates for Synthetic Posts
As described in §5.2, sickness, social isolation, and death posts are used to generate our evaluation tasks. These synthetic posts are selected and inserted into social media histories of real world users by randomly sampling a template and word pair, or, in the case of Social Isolation Posts, by randomly sampling a complete post.

D.1 Sickness Posts
Sickness Posts are created by randomly sampling a Sickness Word and inserting it into a randomly sampled Sickness Template.

Sickness Templates are sampled from: { The doctor told me I have x, I was at the hospital earlier and I have x., I got diagnosed with x last week., Have anyone here dealt with x? I just got diagnosed., How should I handle a x diagnosis?, How do I tell my parents I have x? }

Sickness Words are sampled from: { cancer, leukemia, HIV, AIDS, Diabetes, lung cancer, stomach cancer, skin cancer, parkinson's }

D.2 Social Isolation Posts
Social Isolation Posts are randomly sampled from the following set of complete synthetic posts: { My friends stopped talking to me., My wife just left me., My parents kicked me out of the house today., I feel so alone, my last friend said they needed to stop seeing me., My partner decided that we shouldn't talk anymore last night., My folks just cut me off, they won't talk to me anymore., I just got a message from my brother that said he can't talk to me anymore. He was my last contact in my family., My last friend at work quit, now there's no one I talk to regularly., I tried calling my Mom but she didn't pick up the phone. I think my parents may be done with me., I got home today and my partner was packing up to leave. Our apartment feels so empty now. }

D.3 Death Posts
Death Posts are created by randomly sampling a Death Word and inserting it into a Death Template.

Death Templates are sampled from: { My x just died, I just found out my x died, My x died last weekend, What do you do when your x dies? This happened to me., Has anyone else had a x die recently?, I lost my x yesterday., My x passed away recently., I am in shock. My x is gone. }

Death Words are sampled from: { Mom, Mother, Mama, Father, Dad, Papa, Brother, Wife, girlfriend, partner, spouse, husband, son, daughter, best friend }
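A minimal sketch of this sampling procedure (illustrative only; the template and word lists below are abbreviated from the full sets in D.1–D.3):

```python
# Sketch of synthetic post generation by sampling a template and a word.
# The lists here are abbreviated; the full sets are given in D.1-D.3.
import random

SICKNESS_TEMPLATES = ["The doctor told me I have {x}",
                      "I got diagnosed with {x} last week."]
SICKNESS_WORDS = ["cancer", "leukemia", "HIV"]

DEATH_TEMPLATES = ["My {x} just died", "I just found out my {x} died"]
DEATH_WORDS = ["Mom", "Father", "best friend"]

SOCIAL_ISOLATION_POSTS = ["My friends stopped talking to me.",
                          "My wife just left me."]

def sample_post(kind):
    if kind == "sickness":
        return random.choice(SICKNESS_TEMPLATES).format(x=random.choice(SICKNESS_WORDS))
    if kind == "death":
        return random.choice(DEATH_TEMPLATES).format(x=random.choice(DEATH_WORDS))
    return random.choice(SOCIAL_ISOLATION_POSTS)
```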
E Model Implementation, Tuning, and Parameters

E.1 SHERBERT Architecture
Fig. 5 depicts the architecture of SHERBERT as part of the broader ATE estimation pipeline.
Figure 5: The complete ATE estimation pipeline, with tokens input at the bottom and an estimated propensity at the top. ATE estimates are computed with IPTW, Stratification, and Matching; simpler propensity models (e.g., bag-of-n-grams with Logistic Regression) can fill the same role in the pipeline. (Stages, from bottom to top: tokens → BERT word vectors → post attention → post vectors → user attention → user vector → linear + activation layers → propensity → treatment effect.)

Our work attempts to expand the success of large pretrained transformers to long history lengths using hierarchical attention, a problem also explored by the HIBERT model in Zhang et al. (2019). Essentially, SHERBERT differs from HIBERT in that SHERBERT trains a light-weight hierarchical attention on top of the pretrained BERT model (Devlin et al., 2019), whereas HIBERT is trained from scratch. This results in a relatively simple training procedure for SHERBERT, and lighter limitations on history length, both at the local (50 words for HIBERT vs. 512 wordpiece tokens for SHERBERT) and global (30 sentences for HIBERT vs. 60 for SHERBERT) scales. This reflects differing tradeoffs: where HIBERT has a more sophisticated attention mechanism for combining local and global information, SHERBERT sacrifices some complexity for fast and simple training and longer text histories.
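The sketch below illustrates this kind of hierarchical architecture in PyTorch. It is an approximation for illustration, not the released SHERBERT implementation: the class and helper names are ours, and details such as how word vectors are pooled into post vectors may differ from the actual model.

```python
# Rough sketch of a SHERBERT-style model: a frozen pretrained BERT encodes each
# post, hierarchical attention pools posts into a user vector, and a small head
# outputs an estimated propensity score. Names and details are illustrative.
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class AttentionPool(nn.Module):
    """Dot-product attention pooling: collapses a sequence of vectors into one."""
    def __init__(self, dim):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))

    def forward(self, x):                 # x: (seq_len, dim)
        weights = torch.softmax(x @ self.query, dim=0)
        return weights @ x                # (dim,)

class HierarchicalPropensityModel(nn.Module):
    def __init__(self, bert_name="bert-base-uncased", hidden=1000):
        super().__init__()
        self.tokenizer = AutoTokenizer.from_pretrained(bert_name)
        self.bert = AutoModel.from_pretrained(bert_name)
        for p in self.bert.parameters():  # pretrained BERT is kept fixed
            p.requires_grad = False
        dim = self.bert.config.hidden_size
        self.post_pool = AttentionPool(dim)   # word vectors -> post vector
        self.user_pool = AttentionPool(dim)   # post vectors -> user vector
        self.head = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, posts):             # posts: list of strings for one user
        post_vecs = []
        for text in posts:
            enc = self.tokenizer(text, return_tensors="pt",
                                 truncation=True, max_length=512)
            with torch.no_grad():
                words = self.bert(**enc).last_hidden_state[0]   # (tokens, dim)
            post_vecs.append(self.post_pool(words))
        user_vec = self.user_pool(torch.stack(post_vecs))       # (dim,)
        return torch.sigmoid(self.head(user_vec))               # estimated propensity
```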
E.2 Practicality of Models
SHERBERT trades off practicality for performance in comparison to simpler models. For instance, in most experiments we found SHERBERT takes 10–12 hours to train, sometimes requiring multiple starts to converge to a reasonable model. In contrast, training all other models collectively requires less than 1 hour. Further, the performance of SHERBERT sharply suffered as the number of users was reduced (Fig. 3j). While effectively training SHERBERT on 1 GPU (Tesla V100) in under 24 hours is quite practical compared to contemporary text pretraining regimes (Devlin et al., 2019), these issues should be considered when deciding on a causal text model.
E.3 Hyperparameters
A complete description of parameters and hyperparameters is included in the code repository at the Dataset Website. Basic details are included here.

In producing n-gram features, a count threshold of 10 is used to filter out low frequency words, and word tokenization is done using the NLTK word tokenizer. In producing LDA features, we use the Scikit-Learn implementation, with 20 topics. To produce BERT word embedding features, we use the uncased model of the 'base' size. All models use the Adam optimizer (Kingma and Ba, 2014), with various learning rates decided empirically depending on model and task to maximize treatment accuracy on the validation set. For the simple neural network model, we use a hidden size of 10. For SHERBERT, we use hidden sizes of 1000 and dot-product attention.
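As an illustration of these settings (a sketch under stated assumptions, not the repository code; in particular, interpreting the count threshold as scikit-learn's `min_df` is our assumption):

```python
# Sketch of the n-gram and LDA feature settings described above:
# NLTK word tokenization, a count threshold of 10, and 20 LDA topics.
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

ngram_vectorizer = CountVectorizer(tokenizer=word_tokenize,
                                   ngram_range=(1, 2),
                                   min_df=10)          # drop low-frequency n-grams
lda = LatentDirichletAllocation(n_components=20)       # 20 topics

# Usage on a corpus of user histories (needs enough documents to satisfy min_df):
# counts = ngram_vectorizer.fit_transform(histories)
# lda_features = lda.fit_transform(counts)
```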
F Additional Estimators and Metrics
In order to further detail our findings, we include results for an additional estimator (§F.1) and two additional metrics (§F.2).

F.1 Matching Estimator
Matching can be considered as a special case of stratification, where each stratum contains only one treated user. Each user is matched to the nearest observation with the opposite treatment, with replacement, as in Abadie and Imbens (2016, pg. 784):

$$\widehat{ATE}_{match} = \frac{1}{n} \sum_{i=1}^{n} (2T_i - 1)\,(Y_i - Y_j)$$

where $j$ is the matched observation, i.e. $j = \arg\min_{j \in \{1 \ldots n\},\, T_j \neq T_i} \bigl|\hat{p}(X_i) - \hat{p}(X_j)\bigr|$.

A recent evaluation of matching techniques for text found no significant difference in match quality between matches produced with and without replacement (Mozer et al., 2020). We use a caliper value of . × the standard deviation of propensity scores in the population, as was found to perform the best by Wang et al. (2013) and recommended by Rosenbaum (2010, pg. 251). For each of the five tasks, the matching estimator produces results extremely similar to those of the stratified estimator (Fig. 6).
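A sketch of nearest-neighbor propensity score matching with replacement (an illustration of the estimator above, not the authors' implementation; the caliper handling is simplified):

```python
# Sketch of 1-nearest-neighbor propensity score matching with replacement.
import numpy as np

def ate_matching(treatments, outcomes, propensity, caliper=None):
    n = len(treatments)
    diffs = []
    for i in range(n):
        opposite = np.where(treatments != treatments[i])[0]      # candidates with T_j != T_i
        distances = np.abs(propensity[opposite] - propensity[i])
        j = opposite[np.argmin(distances)]                       # nearest match, with replacement
        if caliper is not None and abs(propensity[j] - propensity[i]) > caliper:
            continue                                             # discard matches outside the caliper
        sign = 2 * treatments[i] - 1
        diffs.append(sign * (outcomes[i] - outcomes[j]))
    return float(np.mean(diffs)) if diffs else float("nan")
```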
F.2 Mean Squared Error of IPTW and Spearman Correlation

In addition to Treatment Accuracy and Bias, we consider two further metrics. Mean Squared Error of IPTW shows the absolute error in the calibration of a model's causal weights:

$$MSE_{IPTW} = \frac{1}{n} \sum_{i=1}^{n} \left( \left[\sum_{j=1}^{n} \frac{1}{\hat{p}_{T_j}(X_j)}\right]^{-1} \frac{1}{\hat{p}_{T_i}(X_i)} - \left[\sum_{j=1}^{n} \frac{1}{p_{T_j}(X_j)}\right]^{-1} \frac{1}{p_{T_i}(X_i)} \right)^{2}$$

Notation is the same as above, with $p$ as the true propensity, which is known in our semi-synthetic tasks. The MSE is fairly correlated with the Treatment Accuracy, with MSE increasing as accuracy decreases as the tasks become more difficult. This is especially evident in Fig. 7b,k.

Spearman Correlation instead shows the relative calibration of a model's propensity scores. Propensity scores may have poor absolute calibration, but still have meaningful relative ordering, in which case the Spearman Rank Correlation is close to its maximum value of 1. The Spearman Correlation coefficient is simply the Pearson correlation coefficient between the rank variables for the estimated and actual propensity scores. We find that Spearman Correlation is also quite correlated with the Treatment Accuracy (Fig. 7c,i).
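The two metrics can be computed as follows (an illustrative sketch with assumed numpy inputs, using SciPy for the rank correlation):

```python
# Sketch of the two additional metrics: MSE of normalized IPTW weights and
# Spearman rank correlation between estimated and true propensity scores.
import numpy as np
from scipy.stats import spearmanr

def normalized_iptw_weights(treatments, propensity):
    p_received = np.where(treatments == 1, propensity, 1.0 - propensity)
    weights = 1.0 / p_received
    return weights / weights.sum()          # Hajek-normalized weights

def mse_of_iptw(treatments, est_propensity, true_propensity):
    w_hat = normalized_iptw_weights(treatments, est_propensity)
    w_true = normalized_iptw_weights(treatments, true_propensity)
    return float(np.mean((w_hat - w_true) ** 2))

def propensity_spearman(est_propensity, true_propensity):
    return float(spearmanr(est_propensity, true_propensity)[0])
```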
Figure 6: Comparison of bias computed using IPTW, Stratification, and Propensity Score Matching, for each task. Note that matching produces extremely similar results to stratification.
Figure 7: Treatment Accuracy, MSE of IPTW weights, and Spearman correlation of estimated propensity scores for each task.