Generating Synthetic Text Data to Evaluate Causal Inference Methods
Zach Wood-Doughty (zach@cs.jhu.edu)
Johns Hopkins University, Baltimore, MD 21218, USA

Ilya Shpitser (ilyas@cs.jhu.edu)
Johns Hopkins University, Baltimore, MD 21218, USA

Mark Dredze (mdredze@cs.jhu.edu)
Johns Hopkins University, Baltimore, MD 21218, USA
Abstract
Drawing causal conclusions from observational data requires making assumptions about the true data-generating process. Causal inference research typically considers low-dimensional data, such as categorical or numerical fields in structured medical records. High-dimensional and unstructured data such as natural language complicates the evaluation of causal inference methods; such evaluations rely on synthetic datasets with known causal effects. Models for natural language generation have been widely studied and perform well empirically. However, existing methods are not immediately applicable to producing synthetic datasets for causal evaluations, as they do not allow for quantifying a causal effect on the text itself. In this work, we develop a framework for adapting existing generation models to produce synthetic text datasets with known causal effects. We use this framework to perform an empirical comparison of four recently-proposed methods for estimating causal effects from text data. We release our code and synthetic datasets.
1. Introduction
Causal understanding is necessary for reasoning about hypothetical interventions (Pearl and Mackenzie, 2018). As machine learning (ML) methods demonstrate predictive success in complex domains, there is considerable interest in relying on ML to make real-world decisions. However, predictive models cannot be relied upon for decision-making without considering how confounding or selection biases may affect the models' predictions (Char et al., 2018; Chen and Asch, 2017; Liu et al., 2019; Subbaswamy et al., 2019). Real-world interventions require causal reasoning, but causal reasoning requires evaluations that go beyond traditional ML metrics such as test set accuracy. In particular, causal methods rely on untestable assumptions about the data-generating process (DGP) that produced the data. Violations of the methods' assumptions may lead to biased predictions or estimates. Researchers need complete knowledge of a DGP to test the assumptions of a causal method, but such knowledge is often impossible for real-world datasets. Thus while synthetic data has its limitations (Jensen et al., 2019; Gentzel et al., 2019), it plays a crucial role in understanding how a causal method performs when its assumptions are met or violated.
Recently, causal inference evaluations have tested proposed methods against held-out synthetic DGPs (Hahn et al., 2019; Dorie et al., 2019; Shimoni et al., 2018). These synthetic datasets are designed to test different empirical properties of the methods, such as the coverage of confidence intervals or the finite-sample variance of an estimator.

Synthetic datasets are rarely used for predictive tasks when empirical data is widely available. The enormous quantities of text and image data have been curated to produce widely-used datasets for ML and natural language processing (NLP) research (Deng et al., 2009; Brown et al., 2020). However, synthetic datasets have been used in predictive tasks to explore how models handle edge cases or low-resource settings (Elman, 1990; Patki et al., 2016; Khayrallah and Koehn, 2018; Wang and Eisner, 2018; Kim and O'Neill-Brown, 2019; Winata et al., 2019). This is especially true in domains where data is not as widely available, such as clinical settings (Boag et al., 2016; Belinkov and Bisk, 2018; Melamud and Shivade, 2019).

Causal methods have only recently been applied to natural language datasets. Keith et al. (2020) provides a comprehensive overview of recent work, focusing specifically on cases where text data can be used to adjust for (otherwise unobserved) confounding. Text data provides a particularly difficult domain for evaluating causal methods because it requires modeling causal relationships between structured variables and text: "what caused the author to write the text this way?" While there is plentiful text data for training predictive models, we cannot directly measure the underlying processes that humans use to produce or adapt their language in complex domains. Synthetic DGPs need to balance 'realism and control' (Wendling et al., 2018): the goal of producing realistic text data against the competing goal of completely specifying the causal effects that produce the text. Past methods evaluated on synthetic data have only satisfied one such goal, either by producing particularly unrealistic text with known effects (Yao et al., 2019; Wood-Doughty et al., 2018; Johansson et al., 2016) or by using real-world text without a fully-specified DGP (Veitch et al., 2020; Mozer et al., 2018; Weld et al., 2020).

We introduce a synthetic framework for evaluating causal methods that incorporate text data, exploring desiderata of synthetic text DGPs and tradeoffs between competing goals. We introduce two nontrivial synthetic DGPs, one which samples a bag-of-words from a Latent Dirichlet Allocation (LDA) topic model, and another which samples full sentences from GPT-2 (Blei et al., 2003; Radford et al., 2019). These two underlying generative models allow us to test how causal methods perform when their assumptions are violated (e.g. whether word order matters) (Wallach, 2006). We use our framework to compare four causal methods that rely on text, addressing a known gap in the empirical evaluation of such methods (Keith et al., 2020). We explore how existing methods' empirical performance depends on their assumptions, and show that when the causal estimator depends on a text classifier model, better classification accuracy of that classifier does not necessarily imply better causal estimates. We release our code and synthetic datasets (https://github.com/zachwooddoughty/causal_text_dgps) to facilitate further development and evaluation of causal methods for language data.
Figure 1: The causal DAG we consider. A is our treatment, Y is our outcome, C and U are confounders, and T is the raw text, which is influenced by U. The counterfactual p(Y(a)) cannot be non-parametrically identified from p(C, A, Y) alone due to unobserved confounding from U. Methods may make parametric assumptions on the relationship between T and U in order to estimate the causal effect, or assume knowledge of p(U | T). We parameterize p(T | U) with text generation models in § 4. We discuss the limitations of this DAG model and extensions to other models in § 9.1.
2. Clinical Notes: A Motivating Example
We begin by motivating causal inference for text data through an example. Free text notes in medical records contain information about patients' histories, possible diagnoses, or patient-doctor relationships (Rajkomar et al., 2018; McVeigh et al., 2016). Importantly, such information often does not appear anywhere else in a patient's medical record, and thus is inaccessible to retrospective causal analyses that do not use the free text data (Wu et al., 2013; Rosenbloom et al., 2011; Zheng et al., 2011).

In this domain, assumptions about the DGP correspond to assumptions about how clinical notes are written. Unless we have the requisite domain expertise to precisely model the style, vocabulary, and semantics in the true DGP, we must be particularly conservative about the assumptions we make. Synthetic DGPs allow us to test how a method performs when its assumptions are violated, which is essential to understanding whether to trust a real-world application. While empirical success on synthetic data does not guarantee similar performance on real data, any proposed method to draw causal inferences from medical notes should first be validated on synthetic datasets that can capture at least some of the complexity of human language. The goal of this work is the development of synthetic DGPs for language data which make it possible to evaluate causal inference methods.
3. Overview of Causal Assumptions
While randomized control trials are the gold standard for determining causal effects, they are often unethical, impossible, or prohibitively expensive. Causal methods use non-randomized, observational data and assumptions about the DGP to draw conclusions about hypothetical interventions. The ability to make causal conclusions from observational data is transformative, but comes at a cost. The methods require assumptions about the underlying DGP, and violation of these assumptions can invalidate the model's conclusions. These assumptions are often represented by a directed acyclic graph (DAG; Pearl, 2009) like Figure 1.

Imagine we want to study whether maternal vitamin D deficiency is a risk factor for the pregnancy complication preeclampsia (Bodnar et al., 2014; Silva et al., 2008). In Figure 1,
the treatment A is a binary measure of vitamin D deficiency and the outcome Y is the onset of preeclampsia. C and U, age above 35 years and socioeconomic status (SES), are confounders that influence both A and Y. Suppose SES is not directly recorded in structured (i.e. tabular) records, but can be inferred from physician's text notes about the patient. While for simplicity we will assume A, C, U, and Y are binary variables, we let T denote the raw text of the clinical notes. The edge from U to T assumes that the clinician's note-taking is influenced by the underlying U value; the lack of edges between {A, C, Y} and T reflects a simplifying assumption. The relationship between U and T is complex and essential to the methods we will consider.

In this setting, the target of interest is the average treatment effect: how much more likely, on average, patients would be to suffer preeclampsia if they were to have a vitamin D deficiency. We write this as E[Y(1)] − E[Y(0)], where Y(1) is a counterfactual random variable representing "preeclampsia status if a patient, possibly contrary to fact, had a vitamin D deficiency." This counterfactual variable's distribution can be identified as:

p(Y(a)) = Σ_{C,U} p(Y | A = a, C, U) p(C, U)    (1)

All confounders (common causes) must be included in Eq. (1) to draw valid causal inferences (Pearl, 2009). If we have no information on U and only observe p(C, A, Y) = Σ_U p(Y, A, C, U), it is generally impossible to write p(Y(a)) as a function of the observed data (Pearl, 2009). In this case, we say p(Y(a)) is not identified; it is impossible to derive a consistent estimator for the causal effect. In real-world applications, an estimator for an unidentified effect may return arbitrarily bad estimates. For a known DAG model, we can use the ID algorithm to determine whether a causal effect is identified given which variables are observed (Shpitser and Pearl, 2006).

In Figure 1, we need nontrivial assumptions to identify p(Y(a)) from p(Y, A, C, T). The joint p(U, T) determines whether identification is possible. If U ⊥ T (T provides no information on U), the causal effect is not identified and no method will succeed; if T is an exact copy of U, then it should be trivial to recover the causal effect by replacing U with T in Eq. (1). When T is not an exact copy of U, we may be able to treat it as a noisy, high-dimensional proxy for the unobserved confounder U. Depending on the empirical relationship between the text and the structured variables, methods that observe T instead of U may be biased.

For real-world data, we cannot validate assumptions about the DGP. Therefore, while applying a causal method to the data will produce conclusions given our assumptions, it cannot validate the efficacy of the method itself. This is the role of the synthetic DGP; we can compare the method's assumptions to a known ground truth to explore how causal methods succeed or fail as the relationship between text and structured data varies.
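To make Eq. (1) concrete, the following is a minimal sketch of the back-door adjustment for the all-binary setting of Figure 1, assuming the full joint p(Y, A, C, U) is available as a lookup table; the function and variable names are illustrative rather than taken from our released code.

```python
import itertools

def average_treatment_effect(joint):
    """Back-door adjustment (Eq. 1) for binary Y, A, C, U.

    `joint[(y, a, c, u)]` holds p(Y=y, A=a, C=c, U=u); the 16 entries sum to 1.
    Returns E[Y(1)] - E[Y(0)] when both confounders C and U are adjusted for.
    """
    def p_y1_given(a, c, u):
        # p(Y=1 | A=a, C=c, U=u)
        return joint[(1, a, c, u)] / (joint[(0, a, c, u)] + joint[(1, a, c, u)])

    def p_cu(c, u):
        # p(C=c, U=u), marginalizing over Y and A
        return sum(joint[(y, a, c, u)] for y in (0, 1) for a in (0, 1))

    return sum((p_y1_given(1, c, u) - p_y1_given(0, c, u)) * p_cu(c, u)
               for c, u in itertools.product((0, 1), repeat=2))
```

Dropping U from the adjustment set (summing only over C) yields the confounded contrast that the synthetic DGPs in § 6 are designed to make misleading.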
4. Causal Effects in Text Generation
Recent work in natural language generation has introduced language models with enormous empirical gains in perplexity and according to human judgments (Radford et al., 2019; Hashimoto et al., 2019; Brown et al., 2020). Language models generate text sequences token by token, where token i is sampled conditional on the previous i − 1 tokens; the first token is often sampled conditional on some initial context. These existing methods, however, do not produce datasets with known causal effects on the text itself; we must first produce a formal definition for the causal effect of a structured variable on the text generation process. In our clinical example, such an effect represents how a doctor's notes would have changed had a patient, counterfactually, been of high SES. By controlling the effect of U on T in Figure 1, we can evaluate how causal methods perform when their assumptions are met or violated.

Figure 2: Causal effect strengths and Trivial text generation. Blue and red bars correspond to U = 0 and U = 1. As τ increases, the ranked preferences for U = 0 and U = 1 diverge; as δ increases, the distribution puts more weight on the ranked preferences. The x-axis indexes the 16 words in the vocabulary, with each bar indicating the probability that a word shows up at least once in a 16-word sequence. When τ = δ = 0, the Ṽ order matches the x-axis order. As τ increases, the Ṽ orders diverge. As δ increases, both distributions become more concentrated on higher-ranked words.

We want our marginal p(T) to conform to a language model that generates text according to a learned distribution, but want to parameterize p(T | U) such that we can force the generation to smoothly diverge from its learned distribution to depend on U. We want a causal effect of U on T to make some words or topics more likely and others less so. That is, texts generated when U = 1 should differ from texts generated when U = 0. We will introduce τ and δ as hyperparameters that control our causal effects. Intuitively, τ controls rankings over the vocabulary; the larger τ is, the more the ranked preference for U = 0 differs from that for U = 1. We can conceptualize δ as controlling how much the model indulges its preference; the larger δ is, the more likely p(T | U = u) samples according to these ranked preferences rather than from the pre-trained language model distribution.

Table 1: p(U | T) classifier accuracy for the nine examples of our trivial DGP. When both τ and δ are small, accuracy is near random chance. If one of the two parameters is large, accuracy improves; if both are large, accuracy nears 100%.

To formalize this, let V = {x_1, . . . , x_N} be a vocabulary of N words. The learned language model provides an initial distribution p(V) and uses it to generate the sequences that comprise p(T). Let Ṽ be an ordering over V. For a binary U, we choose two orderings, Ṽ_{u=0} and Ṽ_{u=1}. Our τ parameter controls the correlation between those two orderings. When τ = 0, the orderings are the same; when τ = 1, they are exact reversals of each other. For a given τ, we sample these orderings such that their Kendall Tau correlation is approximately 1 − τ.

For a given Ṽ and our choice of δ, we will construct a new distribution over the vocabulary. Define f_Ṽ(x_i) as a mapping from a vocabulary item x_i to the position of that item in the ordering. If x_i is the first item in the Ṽ ordering, then f_Ṽ(x_i) = 1. Now define a 'modified Zipfian distribution' as p̃(x_i) ∝ f_Ṽ(x_i)^(−δ/(1−δ)). When δ = 0, this is simply a uniform distribution over the vocabulary; when δ = 1, it is a point mass on the first item in its preference. Now, given our language model's learned p(V), we construct a new distribution:

p′(x_i; Ṽ, δ) ∝ p(x_i) ⊗ f_Ṽ(x_i)^(−δ/(1+δ))    (2)

where ⊗ indicates element-wise multiplication. (For any value of δ, we normalize the resulting distribution so its probabilities are bounded away from 0 and 1.) The distribution p′ represents an average between the initial p(V) and the modified Zipfian defined by Ṽ and δ. We define a function h which takes an initial text generation distribution p(V) and values for τ and δ, and returns new distributions p′_u following Eq. (2). We write this as:

h : (p(V), τ, δ) → {p′(V; Ṽ_{u=0}, δ), p′(V; Ṽ_{u=1}, δ)}    (3)

Both τ and δ live in the [0, 1] domain. We can conceptualize τ as controlling the 'preference' over words in the vocabulary and δ as controlling the 'strength' of that preference. If either hyperparameter is 0, the structured variable U has no effect on the text generation. If τ = 0, then Ṽ_{u=0} = Ṽ_{u=1}; while δ will change the word probabilities, it will change them equally for either value of U. Similarly, if δ is 0, then no matter how different Ṽ_{u=0} is from Ṽ_{u=1}, h(p(V), τ, 0) ignores those preferences and returns the language model's learned p(x_i).
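The following is a minimal sketch of one way to implement Eqs. (2) and (3), assuming the exponent forms reconstructed above; the ordering-sampling scheme shown here is a crude approximation to the Kendall Tau target, and the function names are illustrative rather than those of our released code.

```python
import numpy as np

def zipf_blend(p, ranks, delta):
    """Eq. (2): reweight a base distribution `p` by ranks**(-delta/(1+delta))."""
    q = np.asarray(p, dtype=float) * np.asarray(ranks, dtype=float) ** (-delta / (1.0 + delta))
    return q / q.sum()

def sample_orderings(n, tau, rng):
    """Return two rank vectors (for U=0 and U=1) whose Kendall Tau correlation
    is roughly 1 - tau: start from the identity and apply random adjacent swaps."""
    ranks_u0 = np.arange(1, n + 1, dtype=float)
    perm = list(range(n))
    for _ in range(int(tau * n * (n - 1) / 4)):
        i = int(rng.integers(0, n - 1))
        perm[i], perm[i + 1] = perm[i + 1], perm[i]
    ranks_u1 = np.empty(n)
    for rank, item in enumerate(perm, start=1):
        ranks_u1[item] = float(rank)
    return ranks_u0, ranks_u1

def h(p, tau, delta, rng=None):
    """Eq. (3): map (p(V), tau, delta) to the pair {p'_0(V), p'_1(V)}."""
    rng = rng or np.random.default_rng(0)
    ranks_u0, ranks_u1 = sample_orderings(len(p), tau, rng)
    return zipf_blend(p, ranks_u0, delta), zipf_blend(p, ranks_u1, delta)
```

With δ = 0 the exponent vanishes and both outputs equal p(V); with τ = 0 no swaps are applied and the two outputs coincide, matching the degenerate cases described above.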
Table 2: p(U | T) classification accuracy for LDA text. Increasing τ and δ values lead to increased classification accuracy, with exceptions when δ increases but τ decreases. If either the topic or word effects are particularly large, classification accuracy exceeds 90%; when both are large, it quickly approaches 100%.

Figure 2 shows how δ and τ control a trivial text generation model. We sample nine datasets of 10k sequences of 16 tokens. Our initial p(V) distribution is simply uniform over the vocabulary of 16 tokens. Each cell in the figure shows how the Trivial p(T | U) distributions change as we vary δ and τ. When δ is large but τ is small, some words are much more likely than others, but the two distributions only differ on a single word. When τ is large but δ is small, the distributions differ by a small amount on many words. If we want to explore how causal methods perform in Figure 1, we can control the p(T | U) distribution with δ and τ. As we turn to more complicated p(V) distributions, we want a better way to interpret the text generated with a given choice of these hyperparameters.

Our approach differs from past (semi-)synthetic text datasets for causal evaluation. Wood-Doughty et al. (2018) sampled synthetic 'texts' in a bag-of-words manner similar to our Trivial distribution above, except without the ability to control the strength of the p(T | U) relationship. Veitch et al. (2020) used real text from Reddit or academic papers and sampled synthetic outcomes conditional on metadata related to each text, but without the ability to measure or specify the causal relationship between the text and its metadata. Weld et al. (2020) generate semi-synthetic data by inserting template-based posts into the actual post history of a social media user. These synthetic interventions are discrete, however; there is no way to specify a real-valued causal effect and manipulate it arbitrarily. The flexibility of our approach allows us to explore how methods perform as we vary the causal effect on the text.

The δ and τ hyperparameters completely control the effect of the structured variables on the text, but are not particularly interpretable. How do we know if particular δ or τ values are realistic? What values best mimic a real clinical notes DGP?
Table 3: p(U | T) classification accuracy for GPT-2 text. Accuracy is much lower than on trivial or LDA data. Increasing τ and δ values generally leads to increased classification accuracy, but this is not monotonic. When increasing (δ_word, τ_word) from (0.5, 0.15) to (0.7, 0.15) and reducing (δ_template, τ_template) from (0.9, 0.15) to (0.7, 0.15), we see a notable decrease in accuracy even when (δ_template, τ_template) returns to (0.9, 0.15). This is because GPT-2 word and template effects can conflict; because the language model tries to maintain grammatical structure, certain templates make it unlikely to sample certain words.

Rather than adapt our hyperparameters to a specific natural language domain, we will use text classification accuracy as a lens that can be equally applied to both synthetic and real-world text. Given a synthetic dataset, we will train a classifier with T as the features and U as the labels. Considering the accuracy of such a classifier will let us compare a synthetic dataset to a real dataset; past work has extensively considered the task of classifying clinical concepts from unstructured text (Liu et al., 2018; Meystre et al., 2008; Afzal et al., 2018; Savova et al., 2010). A synthetic dataset in which a text classifier achieves 99% accuracy is unrealistic, implying δ and τ are too large. Similarly, if δ and τ are too small, a p(U | T) classifier will be no better than chance.

Table 1 shows binary classification accuracy of a simple bag-of-words model trained on the datasets from Figure 2. We use a train/dev/test split of 8k/1k/1k sequences for this and all subsequent text classification experiments. Accuracy improves above random chance as either δ or τ increase, and quickly maxes out when both are large. Classification accuracy on this task provides a useful way to abstract away the underlying DGP as we introduce more complicated synthetic datasets.

For a slightly more complicated synthetic DGP, we consider Latent Dirichlet Allocation (LDA), one of the most widely-used models of text (Blei et al., 2003). It provides a generative model of text that clusters the distribution over the vocabulary into a distribution over topics. The LDA model ignores word order, so each sampled word drawn from the trained model is independent. This results in generated texts that have no grammatical structure. We train an LDA model on a set of 250,000 documents which was released as part of the training data for GPT-2 (Radford et al., 2019).

To define a p(T | U) distribution that uses LDA, we will define causal effects for both the words and the topics. Let V_word be the word vocabulary and V_topic be the set of learned topics. Then p_LDA(V_topic) is LDA's learned baseline distribution over the topics, and p_LDA(V_word | t ∈ V_topic) is the learned distribution over words for topic t.

δ_word   The child was known for . . .
0.0      his role in the very real Peter Pan film that skyrocketed
0.1      his role in the flamboyant sleuth Jackie Turner's hit
0.15     his German business, and her books were sold in Bavaria
0.25     her ability to play, run and shoot gags involving giant
0.4      her ability to see. She began training one more spring
0.45     her ability to see in one eye; her ability conquer magic
0.5      her ability to disown her magic ability and her identify
0.6      her ability one ability her magic ability her magic
Figure 3: DistilGPT-2 generation when we fix the random seed, template, and Ṽ_word but vary δ_word. We construct Ṽ_word so the most-preferred words are her, magic, and ability. The model switches from his to her pronouns as δ increases. As δ further increases, sentence fluency decreases.

We introduce causal effects with h from Eq. (3). To sample a word from our modified LDA model when U = u, we first sample a topic t from h(p_LDA(V_topic), τ_topic, δ_topic). Then, instead of sampling from the original LDA distribution p_LDA(V_word | t), we sample from h(p_LDA(V_word | t), τ_word, δ_word).

How do these τ and δ hyperparameters control the generated text? Table 2 shows text classification results. We see that in general, larger τ and δ lead to higher accuracy, yet there are exceptions. Within a given row or column, when δ increases but τ decreases, we see a brief drop in accuracy. We can conceptualize this with the plots in Figure 2; as δ increases, the effect of U on T grows and the word distribution changes from its learned distribution, but as τ decreases, it decreases the difference between the U = 0 and U = 1 distributions. If we were to plot τ_word against δ_word and hold topic effects constant, we would see that accuracy monotonically increases as either word effect hyperparameter increases.

One of the primary drawbacks of LDA is that it only models topics, and has no sense of word order or syntax. Therefore, we consider a more complex DGP by extending our synthetic data framework to more complicated neural models that are widely used for text generation. GPT-2 is a large neural language model that has improved the state-of-the-art on several benchmark evaluations (Radford et al., 2019). It uses 1.5 billion parameters to encode a context sentence into an internal representation and then uses that representation to predict a distribution over the next word in the sentence. Once a word has been sampled from that distribution, it is fed back into the model as additional context, and the sampling process continues. Word order is thus intrinsic to the sentences generated by GPT-2. To save computation time, we use a smaller 82M-parameter DistilGPT-2 model (Sanh et al., 2019). We discuss extensions to more recent neural language models in § 9.2.

While the model can take as input an arbitrary context sentence or phrase, we follow Sheng et al. (2019) and use a set of simple templates to seed the generation of the GPT-2 model. The templates are a combination of a subject (e.g. 'the person') and the beginning
of a verb phrase (e.g. 'was known for'). Our V_template has 60 templates. We treat GPT-2 as a black box which inputs a distribution over these 60 templates and outputs a distribution over the words in the vocabulary. As with our LDA model, we will introduce causal effects which influence these inputs and outputs, but otherwise leave the model untouched.

We start with an initial uniform distribution over the 60 templates. From this initially uniform p_GPT-2(V_template), we sample a template t from h(p_GPT-2(V_template), τ_template, δ_template). Then, we feed that template into the GPT-2 model as context, and it produces a distribution over words: p_GPT-2(V_word | t). We then sample the first word from h(p_GPT-2(V_word | t), τ_word, δ_word). We then feed that sampled word, w, back into the GPT-2 model and sample the next word, conditioning on both the template and the first sampled word, from h(p_GPT-2(V_word | w, t), τ_word, δ_word).

Table 3 shows how text classification accuracy changes as our τ and δ parameters change. As in Table 2, larger τ and δ values lead to better classification accuracy, with some exceptions. Every p(U | T) accuracy drop on LDA data in Table 2 co-occurred with a drop in a τ or δ effect. With GPT-2, we see one case where causal effects strictly increase but text classification accuracy decreases. When τ_word and δ_template are held fixed while δ_word increases from 0.5 to 0.7 and τ_template increases from 0.05 to 0.15, text classification accuracy drops from 85% to 78%. A likely explanation for this is that the GPT-2 templates do not affect individual word probabilities, but provide context that affects the entire sequence. The template fragment 'worked as a' likely increases occupation-related words, where the fragment 'was known for' may not. These non-monotonic effects may complicate the ability of our simple bag-of-words model to differentiate the two distributions.

We also see that while the formal definitions of τ and δ are the same between LDA and GPT-2, the values must be much larger for the classifier to reach 90% test set accuracy. This reflects the mismatch between the bag-of-words assumption of our text classifier and the more complex text sequences of GPT-2.

As GPT-2 produces more fluent text than LDA, we can also visualize the effect of δ_word by slightly varying its value while repeatedly sampling from the model. Figure 3 shows how the generation changes when we fix the template and GPT-2's random seeds, and increase δ_word for a given Ṽ_word preference.
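Putting the pieces together, here is a minimal sketch of this generation procedure, reusing `zipf_blend` and `sample_orderings` from the earlier sketch; `next_word_distribution` is a hypothetical black-box wrapper around DistilGPT-2 (or the LDA word distributions), not an interface from our released code.

```python
def generate_with_confounder(u, templates, next_word_distribution, vocab_size,
                             tau_t, delta_t, tau_w, delta_w, length=16, rng=None):
    """Sample one synthetic text conditional on the confounder U = u (0 or 1).

    `templates` is a list of token-id lists; `next_word_distribution(token_ids)`
    returns a probability vector over `vocab_size` token ids.
    """
    rng = rng or np.random.default_rng()
    # The two orderings are fixed per dataset; here we draw them once per call.
    template_ranks = sample_orderings(len(templates), tau_t, rng)[u]
    word_ranks = sample_orderings(vocab_size, tau_w, rng)[u]

    # Perturb the uniform template distribution and pick a seed template.
    p_template = np.full(len(templates), 1.0 / len(templates))
    t = rng.choice(len(templates), p=zipf_blend(p_template, template_ranks, delta_t))
    token_ids = list(templates[t])

    # Generate word by word, perturbing each conditional word distribution.
    for _ in range(length):
        p_word = next_word_distribution(token_ids)
        w = rng.choice(vocab_size, p=zipf_blend(p_word, word_ranks, delta_w))
        token_ids.append(int(w))
    return t, token_ids
```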
5. Causal Methods with Text
We have introduced a framework for producing datasets where we can provide fine-grained control over how structured variables influence the text. We can use this framework to evaluate existing methods for estimating causal effects with text data. We will first provide an overview of four such approaches, and then use our framework to conduct a range of simulation studies that explore how well these methods perform as we vary the p(T | U) relationship.

Each method relies on sample-splitting for robust inference (Chernozhukov et al., 2016; Anderson and Magruder, 2017). In particular, we will split the dataset in half, use one split to train and validate a simple bag-of-words logistic regression model, and then use the other split to estimate our causal effect. Then we will flip the splits to get a second effect estimate on the first split, and report the average of the two. As we only use simple models for these evaluations, we leave full implementation and training details to our released code.
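As a minimal sketch of this sample-splitting scheme (assuming scikit-learn for the bag-of-words logistic regression; `estimate_effect` is a placeholder for any of the four estimators below, not a function from our released code):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def split_and_estimate(texts, records, estimate_effect):
    """Train the nuisance model on one half, estimate the effect on the other,
    then swap the halves and average the two estimates."""
    n = len(texts)
    halves = [list(range(0, n // 2)), list(range(n // 2, n))]
    estimates = []
    for train_idx, eval_idx in (halves, halves[::-1]):
        vectorizer = CountVectorizer()
        features = vectorizer.fit_transform([texts[i] for i in train_idx])
        labels = [records[i]["A"] for i in train_idx]  # e.g. a propensity-style model
        model = LogisticRegression(max_iter=1000).fit(features, labels)
        estimates.append(estimate_effect(model, vectorizer,
                                         [texts[i] for i in eval_idx],
                                         [records[i] for i in eval_idx]))
    return sum(estimates) / 2.0
```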
Matching is a popular causal method (Stuart, 2010), which has been recently applied to text datasets (Roberts et al., 2018; Mozer et al., 2018; Yao et al., 2019; Wang and Culotta, 2019). Matching adjusts for confounding by estimating the causal effect among patients who are similar, where similarity can be defined by confounders or by their propensity to have received the treatment. We consider two types of text matching: propensity score matching and representation matching.

If U were observed in Figure 1, valid propensity score matching would proceed by learning a model for p(A | C, U) and matching patients based on the estimated propensity. With U unobserved, we will instead match on a propensity score modeled as p(A | C, T). This method will be biased in general because matching requires the true propensity score. However, if there exists a function that maps our estimated p(A | C, T) to the true propensity p(A | C, U), this approach can be unbiased. To implement this method, we model the propensity p(A | C, T) with a bag-of-words classifier. We then match on the estimated propensity using full matching as implemented in the R package optmatch, following Mozer et al. (2018).

Representation matching attempts to adjust for confounding by matching patients on their covariates (C, U) and then taking p(Y | A) within each matched group as an unbiased estimate of p(Y(a)). As U is unobserved, we can instead match on both C and a learned representation of T. The intuition is that if two patients have similar T representations, they are likely to have the same value of U. However, this method will be biased in general if two values of U can produce the same T representation. For our experiments, we use an LDA topic model representation of T and perform full matching using cosine similarity, following Mozer et al. (2018).

Rather than matching on the propensity score, we can directly use it in an inverse propensity weighting (IPW) model (Rosenbaum and Rubin, 1983). This approach reweights the observed data by the inverse of the true propensity model; if the true propensity p(A | C, U) is used, this is a consistent estimator for Eq. (1). When we replace p(A | C, U) with p(A | C, T), our estimates are no longer guaranteed to converge to the ground truth. Instead, we must assume that if the effect of U on T is strong, then the learned propensity score will suffice to reweight the examples. This approach is similar to the bag-of-words method used by Veitch et al. (2020). In initial experiments, we found that more powerful neural models performed poorly on our datasets of only 10k examples. This method follows other work in controlling for high-dimensional confounders (Hill et al., 2011; McCaffrey et al., 2004; Low et al., 2016). Our implementation again models p(A | C, T) as a bag-of-words classifier. We truncate propensity weights and report the mean of 100 bootstrap estimates (Lee et al., 2011).
Table 4: Causal estimation error for the four estimation methods on our trivial DGP. All methods approach zero error as δ and τ values increase.

Our fourth causal method assumes access to a text classifier model p(U | T) that can impute U*, a noisy proxy for the true U. The method uses the classifier and an estimate of the error rate of the classifier to correct for the bias induced by the imperfect classifications (Pearl, 2010). Importantly, this approach requires more information than text matching or IPW, as we must have access to either a pre-trained classifier with known error rate or enough labeled p(U, T) data to train a classifier. In many cases, such labeled data may be difficult or impossible to collect. We train a logistic regression classifier for p(U | T), using half the training split to train the classifier and the other half to estimate the classifier's error rates. Our implementation uses code released by Wood-Doughty et al. (2018).
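To illustrate the core of this correction, the following is a minimal sketch of the classic misclassification adjustment for a binary proxy; the full estimator (following Wood-Doughty et al., 2018) applies such corrections to the joint distribution rather than a single marginal, so this snippet is illustrative only.

```python
def corrected_prevalence(p_star, tpr, fpr):
    """Recover p(U = 1) from the observed proxy proportion p(U* = 1).

    Uses p(U* = 1) = tpr * p(U = 1) + fpr * (1 - p(U = 1)), where tpr and fpr
    are the classifier's estimated true- and false-positive rates. Identified
    only when tpr != fpr, i.e. when the classifier is better than chance.
    """
    p = (p_star - fpr) / (tpr - fpr)
    return min(max(p, 0.0), 1.0)  # clip to a valid probability
```

Under non-differential error, applying this correction within each (A, C, Y) stratum of the estimation split yields a corrected p(U | A, C, Y), which can then be combined with the observed p(A, C, Y) and plugged into Eq. (1).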
6. Evaluating Causal Methods with Text
We have introduced a framework for producing synthetic text datasets and discussed four past methods that have been proposed for estimating causal effects from text datasets. We will now apply each of these four methods – text propensity score matching (Prop), text representation matching (Rep.), IPW, and measurement error (ME) – to the synthetic datasets we have introduced. Our released code reproduces these experiments.
In § 4, we introduced hyperparameters that control the causal effect of a structured variable on a text generation model. To build our datasets, we first define p(Y, A, C, U) and then define the text distribution p(T | U). We limit ourselves to the DAG in Figure 1 and only consider binary structured variables.

We choose the parameters of p(Y, A, C, U) randomly, subject to three constraints. First, we ensure that the true distribution-level causal effect (1) is equal to 0.1; given C and U, the treatment increases the likelihood of the outcome by 0.1. Second, we ensure that our dataset exhibits Simpson's paradox: if we estimate (1) without conditioning on U, the causal effect should appear to be negative, so methods that cannot exploit the relationship between U and T will fail to estimate the causal effect. Finally, we ensure that p(U = 1) = 0.5, making the marginal distribution of U maximally uninformative. These constraints allow for consistency across experimental evaluations; each structured distribution should be comparable.
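One possible way to generate such a structured distribution is the rejection-sampling sketch below, which reuses `average_treatment_effect` and the `itertools` import from the sketch in § 3; the tolerance, the uniform ranges, and the rejection scheme itself are illustrative assumptions, and our released code may enforce the constraints differently.

```python
def naive_contrast(joint):
    """The confounded comparison: adjust for C only, marginalizing over U."""
    def p(y=None, a=None, c=None):
        return sum(joint[(yy, aa, cc, uu)]
                   for yy, aa, cc, uu in itertools.product((0, 1), repeat=4)
                   if (y is None or yy == y) and (a is None or aa == a)
                   and (c is None or cc == c))
    return sum((p(y=1, a=1, c=c) / p(a=1, c=c) - p(y=1, a=0, c=c) / p(a=0, c=c)) * p(c=c)
               for c in (0, 1))

def sample_structured_distribution(rng, true_effect=0.1, tol=0.005):
    """Rejection-sample a joint p(Y, A, C, U) meeting the three constraints."""
    while True:
        p_c1 = rng.uniform(0.2, 0.8)
        p_a = {(c, u): rng.uniform(0.1, 0.9) for c in (0, 1) for u in (0, 1)}
        p_y = {(a, c, u): rng.uniform(0.1, 0.9)
               for a in (0, 1) for c in (0, 1) for u in (0, 1)}
        joint = {}
        for y, a, c, u in itertools.product((0, 1), repeat=4):
            pc = p_c1 if c else 1 - p_c1
            pu = 0.5  # constraint: p(U = 1) = 0.5
            pa = p_a[(c, u)] if a else 1 - p_a[(c, u)]
            py = p_y[(a, c, u)] if y else 1 - p_y[(a, c, u)]
            joint[(y, a, c, u)] = pc * pu * pa * py
        if abs(average_treatment_effect(joint) - true_effect) < tol and naive_contrast(joint) < 0:
            return joint
```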
Table 5: Estimation error for each causal method on LDA synthetic data, averaged over the combination of four structured distributions and four text distributions for each cell. All methods reduce estimation error as the δ and τ effects increase in strength, but only measurement error achieves near-zero error for any effect strength.

Because we have a complex method for producing our text distribution p(T | U) and we enforce non-trivial constraints on p(Y, A, C, U), we carefully seed the random number generation required to produce these synthetic distributions. In particular, our sampling of text distributions and structured distributions are orthogonal. We consider four separate structured distributions that meet our above constraints, which we reuse in our evaluations across all three text distribution settings: the trivial 16-word vocabulary, the LDA model, and the GPT-2 model.

All results in Tables 4, 5, and 6 show the absolute-value divergence of the methods' estimates from an oracle with access to the full structured distribution p(Y, A, C, U). The causal estimate errors for a given (τ, δ) pair are averaged over the 16 synthetic distributions that combine our four structured distributions and four text distributions.

Table 4 shows how the four causal methods perform on the trivial 16-word vocabulary dataset we introduced in § 4. We see that when the p(T | U) relationship is very weak (δ_w = τ_w = 0), all four methods perform about as poorly as they would if they had ignored the text entirely. As the p(T | U) relationship becomes stronger, all four methods improve. The text matching and measurement error methods are able to perfectly estimate the true causal effect when the effect of U on T becomes overwhelmingly strong. The IPW method does worse, but does correct for the U confounding as the p(T | U) relationship strengthens. It is not surprising that the measurement error approach works here, as Table 1 and Figure 2 showed us that p(U | T) classification can achieve perfect accuracy on this trivial dataset. The success of the text matching approach highlights that even though p(A | C, T) is not the true propensity score, the relationship between U and T is strong enough to allow the method to correct for the confounding.

Table 5 shows how the four causal methods perform on synthetic datasets using the LDA text generation we introduce in § 4.2. These results are less encouraging. Our text generated from LDA is word-order independent, so simple bag-of-words models p(A | C, T) should be powerful enough to capture the text's complexity. Even so, the matching methods struggle to correct for U's confounding, though they slightly improve as τ and δ increase. Compared to the trivial setting, in LDA there is much less direct relationship between U and the sampled text. Thus Representation matching is more likely to match two texts with different U values, and in Propensity the estimated p(A | C, T) diverges from the true propensity. That Propensity outperforms Representation when it did not for Trivial text suggests that propensity matching may be more effective because it matches in a single dimension (Roberts et al., 2018). The IPW method, on the other hand, does extremely poorly when the effects of U on T are small. Because a naïve estimator that ignores the text can achieve a causal error of 0.20, the IPW estimator actually worsens the confounding bias. The measurement error approach is effective when τ and δ are large enough.

Table 6: Causal estimation error for each method on the GPT-2 synthetic data. The measurement error method's estimates approach zero only for the largest values of δ and τ. Neither Propensity nor IPW correct more than half the confounding of a naive estimator, and Representation barely reduces the confounding bias at all.

Table 6 shows how the four causal methods perform on synthetic datasets using the GPT-2 text generation we introduce in § 4.3. Here we see that neither the matching nor IPW methods ever noticeably improve. The measurement error method is still effective, but only when the effect of U on T is strongest. While GPT-2 clearly does not produce language at the complexity of real-world datasets, we can better understand the assumptions made by these causal models by exploring how they perform as the underlying text generation becomes more complex. On this data, the simple bag-of-words models we consider are not flexible enough to fully capture the complexity of the text. Even though Table 3 shows us that a bag-of-words classifier can effectively learn this more complicated p(U | T) when the word and template effects are large enough, the p(A | C, T) model learned for the IPW and matching methods does not capture information on the true propensity. The measurement error method and its p(U | T) classifier can provide unbiased estimates, but only when δ and τ effects are strongest.

Table 7: Pearson correlation between absolute causal estimation error and the test accuracy of the text classifier that the estimation method relies on. On the Trivial text data, all methods have a negative correlation: increased test accuracy implies lower estimation error. As the text DGP increases in complexity to LDA and GPT-2, this correlation dwindles and then reverses for the Propensity and IPW methods, but remains stable for the measurement method.
7. Text Classification Accuracy and Estimation Error
Our propensity score matching, IPW, and measurement error methods all rely in part upon a text classifier to estimate the causal effect. However, better performance (as measured by classification accuracy) of this classifier does not necessarily translate into lower causal estimation error. For both propensity score matching and IPW, the text classifier models p(A | C, T). For the measurement error estimator, the text classifier models p(U | T). For the binary A and U we consider, we can easily characterize these models in terms of their classification accuracy. The density plots in Figure 4 show the relationship between text classifier accuracy and the causal estimation error.

Across all three DGPs, we see that when the p(U | T) classifier has accuracy greater than 80%, our estimate of the causal effect is within 0.05 of the truth. If we could achieve 100% classifier accuracy for the measurement method, it would imply that we had access to the true p(A, Y, C, U), and can trivially estimate the causal effect.

However, for the propensity and IPW methods, better classification accuracy does not imply lower estimation error. In fact, better classification accuracy of p(A | C, T) is orthogonal to our goals of low causal estimation error. Instead, we need p(A | C, T) to converge to the true p(A | C, U), which is untestable without observing U.
Figure 4: Joint and marginal density plots of text classifier accuracy and mean absolute causal estimation error for each DGP and each estimation method that relies on a text classifier. Each dot represents one experiment. Figure 5 shows a zoomed-out plot for LDA+IPW; all other plots contain all data. Colors indicate the four structured variable random seeds used to create the true data-generating distributions. For the IPW and Prop methods, the visible clusters show that the relationship between classifier accuracy and causal error is highly dependent on the random seed for structured variables. Thus, for a real-world analysis with an unknown DGP, better classifier accuracy does not imply lower causal error. For the ME method, classifier accuracy and causal error are not clustered by the underlying DGP.
Figure 5: Zoomed-out version of Figure 4 for the IPW estimator on LDA data. For one random seed for structured variables (the blue cluster), causal error is quite large.

Table 7 shows that as we increase the complexity of our DGP from the Trivial text to LDA and then to GPT-2, we can also empirically see that the correlation between classifier accuracy and estimation error degrades for the Propensity and IPW methods. For the Propensity and IPW methods on GPT-2 data, classifier accuracy is positively correlated with estimation error, suggesting that the p(A | C, T) classifier has overfit and diverged from the true p(A | C, U) propensity.
8. Availability and Use of Labeled U Data
Our empirical results have demonstrated that the measurement error estimator performs the best on our synthetic datasets. However, this method relies upon access to labeled p(U | T) data. This finding raises two questions: how much labeled data does the measurement error method require, and could other methods perform as well or better if given access to such labeled p(U, T) data?

We run additional experiments where we limit the amount of labeled data that our estimator has access to. Of a dataset of 10,000 total examples, we use n of them to train and validate a classifier p(U | T) and use (10,000 − n) to compute our estimate of the causal effect. Our previous experiments have considered n = 5,000; in Table 8 we plot estimation error as we vary n from 50 to 5,000. For the DGPs with the strongest causal effects, the mean absolute error remains small even as we substantially reduce the number of examples. Estimation error on DGPs with weaker causal effects is more sensitive to the number of examples.

We then compare these evaluations on limited labeled data against a baseline that assumes access to an equal amount of data on the full p(U, C, A, Y) distribution. Suppose we can pay clinicians to annotate n patient records for the unobserved confounder U; should we use those examples for the measurement error method, or should we just directly compute the causal effect using Equation (1), ignoring the text entirely?
                          Labeled p(U, T) examples
δ_w   τ_w   δ_t   τ_t  |   50   100   200   300   400   500  1000  1500  2000  2500  5000
0.0   0.00  0.7   0.45 | 0.19  0.16  0.14  0.11  0.11  0.10  0.08  0.08  0.07  0.11  0.10
0.2   0.03  0.7   0.15 | 0.19  0.18  0.16  0.17  0.16  0.16  0.16  0.14  0.10  0.10  0.10
0.2   0.03  0.7   0.45 | 0.18  0.17  0.13  0.09  0.11  0.10  0.10  0.08  0.07  0.06  0.04
0.2   0.15  0.7   0.15 | 0.19  0.17  0.13  0.15  0.13  0.09  0.06  0.04  0.03  0.03  0.03
0.5   0.05  0.5   0.05 | 0.17  0.16  0.13  0.11  0.12  0.09  0.04  0.03  0.04  0.03  0.03
0.5   0.05  0.7   0.05 | 0.18  0.16  0.14  0.13  0.13  0.12  0.06  0.03  0.04  0.04  0.04
0.5   0.05  0.7   0.45 | 0.18  0.14  0.12  0.10  0.09  0.10  0.06  0.03  0.03  0.03  0.03
0.5   0.15  0.7   0.05 | 0.10  0.05  0.06  0.04  0.04  0.02  0.02  0.01  0.01  0.01  0.01
0.5   0.15  0.9   0.15 | 0.11  0.07  0.06  0.05  0.04  0.03  0.03  0.02  0.02  0.02  0.02
0.7   0.15  0.7   0.15 | 0.10  0.08  0.05  0.04  0.04  0.03  0.02  0.02  0.02  0.02  0.01
0.7   0.15  0.9   0.15 | 0.10  0.07  0.04  0.04  0.04  0.03  0.02  0.02  0.02  0.02  0.01
p(U, C, A, Y) Baseline | 0.28  0.14  0.05  0.03  0.04  0.03  0.03  0.02  0.01  0.01  0.01
Table 8: Measurement error method's mean absolute estimation error on GPT-2 data as we vary the amount of labeled data used. Train and validation data is split evenly; we train the p(U | T) classifier with half and estimate its error rate on the other half. The last column is equivalent to the last row of Table 6. The p(U, C, A, Y) baseline ignores the text and simply computes the causal effect using Equation (1).

The p(U, C, A, Y) baseline in Table 8 suggests that as soon as we have at least 200 examples, this baseline is as good on average as the measurement error method, even on DGPs with the strongest U → T causal effects. Figure 6 shows this baseline in more detail, compared against two of the DGPs in Table 8. In particular, this figure shows the 95% confidence interval for the three methods. For the DGP with large causal effects, the measurement error method is quite comparable to the baseline once n is large enough; for DGPs with weaker U → T causal effects, the measurement error method is strictly worse than the baseline.

The measurement error method is the only approach that achieves success on our GPT-2 DGPs, but requires access to p(U | T) labels. If this method can be matched by a baseline that ignores the text entirely, it may seem that incorporating NLP methods into causal inference is not worth the effort. But our results are not entirely pessimistic, and the flaws they do reveal point to many opportunities for future work. The p(U, C, A, Y) baseline importantly requires access to the full joint, whereas the measurement error method only requires data on the p(U | T) conditional. This has many practical implications. For example, if researchers at a hospital cannot collect U annotations for their data due to patient privacy restrictions, they still may be able to apply a p(U | T) classifier to that data. Thus if we can leverage existing anonymized clinical datasets as the p(U, T) data, we can produce analyses that would otherwise be impossible.
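A minimal sketch of the experiment behind Table 8 follows; the function names (`estimate_me`, `estimate_baseline`) are illustrative placeholders for the measurement error estimator and the Equation (1) baseline, not functions from our released code.

```python
def labeled_data_sweep(records, estimate_me, estimate_baseline,
                       sizes=(50, 100, 200, 300, 400, 500, 1000, 1500, 2000, 2500, 5000)):
    """For each labeled-data budget n, compare the two uses of n annotations.

    The measurement-error route trains p(U | T) on n/2 labeled rows, estimates
    its error rates on the other n/2, and estimates the effect on the remaining
    unlabeled rows; the baseline spends the same n annotations on full
    p(U, C, A, Y) rows and applies Equation (1) directly, ignoring the text.
    """
    results = {}
    for n in sizes:
        labeled, unlabeled = records[:n], records[n:]
        results[n] = {
            "measurement_error": estimate_me(train=labeled[: n // 2],
                                             error_estimation=labeled[n // 2:],
                                             estimation=unlabeled),
            "baseline": estimate_baseline(labeled),
        }
    return results
```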
Figure 6: A closer look at three rows from Table 8, plotting causal estimation error against the number of labeled p(U, T) examples. Solid lines plot mean (not mean absolute) causal error; shaded regions show the 95% confidence interval from 100 bootstrap samples. Measurement error results are averaged over four structured variable distributions and four text distributions. The baseline ignores the text and is averaged over four structured distributions. Even for text data with the strongest causal effects we consider, the measurement error approach is not noticeably better than the p(U, C, A, Y) baseline once we have at least 200 labeled examples.

There are also many opportunities to develop new approaches that outperform the four methods we evaluated. We should expect that some access to labeled data should make it possible to learn a propensity score or text representation that provides for lower estimation error when primarily using data without labeled U. An unsupervised text representation such as LDA could be augmented with labeled p(U, T) data so that learned topics are more discriminative of the underlying U (Blei and McAuliffe, 2007). Similarly, if we were given access to some labeled p(U, T, C) data, we could train a propensity score model such that predicted propensities must be roughly equal for examples with the same U.

We can also explore approaches that combine these four methods to produce new multiply-robust methods. Many causal estimators use multiple models and are provably unbiased if at least one or more of those models are correctly specified (Bang and Robins, 2005; Vansteelandt et al., 2008). Can we develop a new matching method that is unbiased if either the propensity model or the representation model is unbiased? Can we effectively combine all four methods we considered into a single multiply-robust estimator?
9. Limitations and Extensions
Our evaluation framework and experimental results provide new insights into how existing estimators perform on synthetic text datasets. In generating our synthetic datasets and evaluating these methods, we have made simplifying assumptions. Many of these assumptions may limit the efficacy of our work to certain applications, yet most such assumptions can be relaxed by extending our work.

Figure 7: Causal DAG models to which our evaluation framework could be extended. (a) A DAG in which all structured variables influence text generation. (b) A DAG in which the text acts as either a treatment or outcome.
We only sample datasets from synthetic DGPs corresponding to the DAG model in Figure 1. There are of course infinitely many DAG models that could be considered, but we point out a few important generalizations that would complicate our methods for sampling data or evaluating methods.

Figure 7a extends Figure 1 by adding causal effects from all structured variables to the text data. Such a DAG complicates our approach for sampling text from a language model conditional on the structured variables. In § 4 we parameterized p(T | U) with our two types of hyperparameters: δ and τ. Figure 7a requires sampling from p(T | U, C, A, Y), which may require a different hyperparameter formulation. Our implementation assumes U is binary; the immediate extension to a continuous-valued U simply requires replacing the two orderings (Ṽ_{u=0} and Ṽ_{u=1}) with a continuous function of U that outputs an ordering Ṽ_u. If T is sampled conditional on multiple structured variables, then we need a function that maps from those variables to an ordering over the vocabulary. In such a setting, we need one or more τ hyperparameters that control how sensitive this function is to changes in one or more structured variables.

The DAG in Figure 7a also changes the assumptions for the causal methods we consider. The Propensity and Representation methods, like any matching estimator, require matching only on pre-treatment covariates: variables that are non-descendants of the treatment A. Matching on post-treatment variables can introduce significant bias (Rosenbaum, 1984; Stuart, 2010). If the text data is influenced by both U and A, it cannot be easily used for matching. Similarly, for the IPW model (or an outcome model), if the text is a collider (a descendant of both A and Y), conditioning on it may introduce bias (Greenland, 2003).

Within the context of the measurement error estimator, Figure 7a violates our previous assumption of non-differential measurement error (Carroll et al., 2006; Wood-Doughty et al., 2018). Thus, rather than estimating two (assuming U is binary) marginal error rates p(U* = 1 | U = 0) and p(U* = 0 | U = 1), we must estimate several conditional error rates of the form p(U* = u′ | U = u, A = a, C = c, Y = y). Estimating such error rates requires data on the full joint p(U, C, A, Y, T) which, as discussed in § 8, reduces the efficacy of these methods compared to simpler approaches that ignore the text data entirely.
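To make the continuous-U extension concrete, the sketch below shows one possible construction (an assumption of ours, not something specified in this work or its released code): interpolate the two endpoint rank vectors from the earlier sketch and re-rank.

```python
def ordering_for_continuous_u(ranks_u0, ranks_u1, u):
    """Map a continuous u in [0, 1] to an ordering over the vocabulary.

    Each word is scored by (1 - u) * rank_u0 + u * rank_u1, and the scores are
    converted back into a 1-indexed rank vector; u = 0 and u = 1 recover the
    two endpoint orderings (assuming no ties).
    """
    scores = (1.0 - u) * np.asarray(ranks_u0) + u * np.asarray(ranks_u1)
    order = np.argsort(scores)  # word indices, most- to least-preferred
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    return ranks
```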
In the DAG in Figure 7b, the text T can be seen as a treatment or an outcome; p(T(a)) is the counterfactual distribution over T if we intervene on A, and p(Y(t)) is the counterfactual distribution over Y if we intervene on T. Because our framework currently does not support sampling structured variables conditional on the text, we cannot sample from p(Y | T, U, C). The causal estimators we consider do not make the necessary assumptions to estimate the high-dimensional effects of A on T or of T on Y (Nabi et al., 2017; Egami et al., 2018).

Recent years have seen an explosion in both the frequency and size of neural language models (Bender et al., 2021). While the only such model we have considered is a compressed version of GPT-2 (Sanh et al., 2019; Radford et al., 2019), our framework for adding causal effects can be easily extended to new language models such as GPT-3 or Switch-C (Brown et al., 2020; Fedus et al., 2021). All our approach assumes is that the model takes as input an initial context and then, for each word, outputs a distribution over the vocabulary. Our causal effects simply adjust the distribution over context inputs and the distribution over the word logits.

Other work on language modeling has focused on controllable text generation, which can produce sentences that follow a specified style (Xu et al., 2020; Keskar et al., 2019; Kedzie and McKeown, 2020). For example, the approach from Dathathri et al. (2019) specifies a topic (e.g. politics) and a sentiment (e.g. negative) which guides the text generation. Such an approach could help generate synthetic datasets which are more domain-specific (see § 9.4). In any future work analyzing synthetic text generated from large-scale language models, researchers should be careful to examine how such models learn and reproduce societal biases encoded in the training data (Sheng et al., 2019; Bender et al., 2021).
We have mentioned in § 8 that future work should consider multiply-robust estimators with better asymptotic properties. Our evaluations could also be extended by implementing more flexible (e.g. neural) nuisance models that capture the relationship between the structured variables and the text. Veitch et al. (2020) proposed causal methods that leverage existing text embeddings which have been widely successful in many predictive tasks. Such neural models may require new assumptions – such as with respect to smoothness (Farrell et al., 2021) – but have demonstrated empirical performance greatly surpassing that of the bag-of-words logistic regression models we have considered (Rajpurkar et al., 2016). Such neural models often require large datasets for training or pre-training, and in our initial experiments, such models did not outperform logistic regression on our small datasets. Future work could combine pre-training on large datasets (Lee et al., 2020) with fine-tuning on our small datasets (Jin et al., 2019). We could also compare against stronger baselines that ignore the text but leverage all available data, such as the estimator of Yang and Ding (2020) which combines both a small dataset that includes the unobserved confounder and a large dataset that does not. Such an estimator should outperform the p(U, C, A, Y) baseline we considered in § 8 by leveraging the additional data that does not contain U.
Our synthetic DGPs enable new evaluations for causal methods for text, but synthetic data in general is not without its inherent limitations. One barrier that prevents generalizability of results on synthetic data to real-world data is that often synthetic DGPs are explicitly designed to demonstrate the utility of a proposed method, and thus other assumptions that could expose the method's flaws may be ignored by the creator (Gentzel et al., 2019). While our framework addresses some of these concerns by making it easy to randomize the DGP parameterization and enabling extensions to new language models, there is more that can be done. Gentzel et al. (2019) suggest semi-synthetic datasets that, for example, use p(U, C) data from a real-world study and then sample p(A, Y | U, C) synthetically so the causal effects are known (Dorie et al., 2019; Shimoni et al., 2018). While our framework could adopt this approach and use empirical p(U, C) data, if we use empirical text data we lose any knowledge of the causal relationships between text and structured variables.

Within the synthetic framework we have proposed, there are many ways to make our synthetic DGPs more realistic for applications to specific domain areas. We have used EHR data and clinical notes as a motivating example throughout, but our DGPs are unrelated to such applications. Suppose we have an EHR dataset with physiological measurements and clinical notes. If we want to conduct a retrospective causal analysis using text, we might first develop a synthetic DGP that tries to approximate the empirical dataset (Neal et al., 2020). To adapt the synthetic DGPs from this work to this application, we might consider using a language model fine-tuned on clinical notes (Lee et al., 2020) or adapted to the complex vocabulary and style of the domain (Ruch et al., 2003; Melamud and Shivade, 2019; Boag et al., 2016; Choi et al., 2017). If our clinical data has a structured variable U that we believe influences the text T, we might incorporate controllable generation techniques to parameterize p(T | U) more realistically, for example by choosing a vocabulary preference Ṽ_u that reflects which words are more commonly used when describing patients with different values of U. Such adaptations could make inferences drawn from synthetic data more robust or make evaluations more interpretable to domain experts.
10. Conclusions
Our experiments demonstrate the importance of accurate assumptions in a causal analysis. All four causal methods can control for unobserved confounding in a trivial text generation setting, but as our generative p(T | U) increases in complexity, the implicit assumptions of the matching and IPW methods render them biased. Although the matching and IPW methods use the same p(A | C, T) propensity score model, the matching approaches are superior in the trivial and LDA settings. Even though the trained models are identical, the underlying assumptions are different. Because it draws on additional data, the measurement error approach can make fewer assumptions, remaining effective as long as its p(U | T) classifier is accurate. These results do not imply that text matching and IPW methods cannot control for unobserved confounding, but rather that we should be cautious and clear about the assumptions we make about our models and the underlying DGP. Evaluating on synthetic data can help clarify these assumptions.

As NLP research advances the state of the art in predictive modeling, such tools offer the potential to influence human decision-making and guide our understanding of the world. Such models rely on assumptions that may be irrelevant for a supervised learning benchmark and yet essential to any real-world application. Explicitly adopting a causal inference perspective on natural language datasets can help enable inferences that are robust to confounding and other biases. We hope our evaluation framework and released code will support further research in these directions.
References
Naveed Afzal, Vishnu Priya Mallipeddi, Sunghwan Sohn, Hongfang Liu, Rajeev Chaudhry, Christopher G Scott, Iftikhar J Kullo, and Adelaide M Arruda-Olson. Natural language processing of clinical notes for identification of critical limb ischemia. International Journal of Medical Informatics, 111:83–89, 2018.
Michael L Anderson and Jeremy Magruder. Split-sample strategies for avoiding false discoveries. Technical report, National Bureau of Economic Research, 2017.
Heejung Bang and James M Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61(4):962–973, 2005.
Yonatan Belinkov and Yonatan Bisk. Synthetic and natural noise both break neural machine translation. In International Conference on Learning Representations, 2018.
Emily M. Bender, Timnit Gebru, Angelina McMillan-Major, and Shmargaret Shmitchell. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of FAccT, 2021.
David M Blei and Jon D McAuliffe. Supervised topic models. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 121–128, 2007.
David M Blei, Andrew Y Ng, and Michael I Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993–1022, 2003.
Willie Boag, Tristan Naumann, and Peter Szolovits. Towards the creation of a large corpus of synthetically-identified clinical notes. In Machine Learning for Health Workshop at NeurIPS, 2016.
Lisa M Bodnar, Hyagriv N Simhan, Janet M Catov, James M Roberts, Robert W Platt, Jill C Diesel, and Mark A Klebanoff. Maternal vitamin D status and the risk of mild and severe preeclampsia. Epidemiology (Cambridge, Mass.), 25(2):207, 2014.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In Advances in Neural Information Processing Systems, 2020.
Raymond J Carroll, David Ruppert, Leonard A Stefanski, and Ciprian M Crainiceanu. Measurement Error in Nonlinear Models: A Modern Perspective. CRC Press, 2006.
Danton S Char, Nigam H Shah, and David Magnus. Implementing machine learning in health care – addressing ethical challenges. The New England Journal of Medicine, 378(11):981, 2018.
Jonathan H Chen and Steven M Asch. Machine learning and prediction in medicine – beyond the peak of inflated expectations. The New England Journal of Medicine, 376(26):2507, 2017.
Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, and Whitney K Newey. Double machine learning for treatment and causal parameters. Technical report, cemmap working paper, 2016.
Edward Choi, Siddharth Biswal, Bradley Malin, Jon Duke, Walter F Stewart, and Jimeng Sun. Generating multi-label discrete patient records using generative adversarial networks. In Machine Learning for Healthcare Conference, pages 286–305. PMLR, 2017.
Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, and Rosanne Liu. Plug and play language models: A simple approach to controlled text generation. In International Conference on Learning Representations, 2019.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
Vincent Dorie, Jennifer Hill, Uri Shalit, Marc Scott, Dan Cervone, et al. Automated versus do-it-yourself methods for causal inference: Lessons learned from a data analysis competition. Statistical Science, 34(1):43–68, 2019.
Naoki Egami, Christian J Fong, Justin Grimmer, Margaret E Roberts, and Brandon M Stewart. How to make causal inferences using texts. arXiv preprint arXiv:1802.02163, 2018.
Jeffrey L Elman. Finding structure in time. Cognitive Science, 14(2):179–211, 1990.
Max H Farrell, Tengyuan Liang, and Sanjog Misra. Deep neural networks for estimation and inference. Econometrica, 89(1):181–213, 2021.
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. arXiv preprint arXiv:2101.03961, 2021.
Amanda Gentzel, Dan Garant, and David Jensen. The case for evaluating causal models using interventional measures and empirical data. In Advances in Neural Information Processing Systems, pages 11722–11732, 2019.
Sander Greenland. Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology, pages 300–306, 2003.
P Richard Hahn, Vincent Dorie, and Jared S Murray. Atlantic Causal Inference Conference (ACIC) data analysis challenge 2017. arXiv preprint arXiv:1905.09515, 2019.
Tatsunori Hashimoto, Hugh Zhang, and Percy Liang. Unifying human and statistical evaluation for natural language generation. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 1689–1701, 2019.
Jennifer Hill, Christopher Weiss, and Fuhua Zhai. Challenges with propensity score strategies in a high-dimensional setting and a potential alternative. Multivariate Behavioral Research, 46(3):477–513, 2011.
David Jensen et al. Comment: Strengthening empirical evaluation of causal inference methods. Statistical Science, 34(1):77–81, 2019.
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. PubMedQA: A dataset for biomedical research question answering. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2567–2577, 2019.
Fredrik Johansson, Uri Shalit, and David Sontag. Learning representations for counterfactual inference. In International Conference on Machine Learning, pages 3020–3029, 2016.
Chris Kedzie and Kathleen McKeown. Controllable meaning representation to text generation: Linearization and data augmentation strategies. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 5160–5185, Online, November 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.419.
Katherine Keith, David Jensen, and Brendan O'Connor. Text and causal inference: A review of using text to remove confounding from causal estimates. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5332–5344, Online, July 2020. Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.474.
Nitish Shirish Keskar, Bryan McCann, Lav R Varshney, Caiming Xiong, and Richard Socher. CTRL: A conditional transformer language model for controllable generation. arXiv preprint arXiv:1909.05858, 2019.
Huda Khayrallah and Philipp Koehn. On the impact of various types of noise on neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pages 74–83, 2018.
Jungi Kim and Patricia O'Neill-Brown. Improving American Sign Language recognition with synthetic data. In Proceedings of Machine Translation Summit XVII Volume 1: Research Track, pages 151–161, Dublin, Ireland, August 2019. European Association for Machine Translation.
Brian K Lee, Justin Lessler, and Elizabeth A Stuart. Weight trimming and propensity score weighting. PLoS ONE, 6(3), 2011.
Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4):1234–1240, 2020.
Jingshu Liu, Zachariah Zhang, and Narges Razavian. Deep EHR: Chronic disease prediction using medical notes. In Machine Learning for Healthcare Conference, pages 440–464, 2018.
Vincent X Liu, David W Bates, Jenna Wiens, and Nigam H Shah. The number needed to benefit: estimating the value of predictive analytics in healthcare. Journal of the American Medical Informatics Association, 26(12):1655–1659, 2019.
Yen Sia Low, Blanca Gallego, and Nigam Haresh Shah. Comparing high-dimensional confounder control methods for rapid cohort studies from electronic health records. Journal of Comparative Effectiveness Research, 5(2):179–192, 2016.
Daniel F McCaffrey, Greg Ridgeway, and Andrew R Morral. Propensity score estimation with boosted regression for evaluating causal effects in observational studies. Psychological Methods, 9(4):403, 2004.
Katharine H McVeigh, Remle Newton-Dame, Pui Ying Chan, Lorna E Thorpe, Lauren Schreibstein, Kathleen S Tatem, Claudia Chernov, Elizabeth Lurie-Moroni, and Sharon E Perlman. Can electronic health records be used for population health surveillance? Validating population health metrics against established survey data. eGEMs, 4(1), 2016.
Oren Melamud and Chaitanya Shivade. Towards automatic generation of shareable synthetic clinical notes using neural language models. In Proceedings of the 2nd Clinical Natural Language Processing Workshop, pages 35–45, 2019.
Stéphane M Meystre, Guergana K Savova, Karin C Kipper-Schuler, and John F Hurdle. Extracting information from textual documents in the electronic health record: a review of recent research. Yearbook of Medical Informatics, 17(01):128–144, 2008.
Reagan Mozer, Luke Miratrix, Aaron Russell Kaufman, and L Jason Anastasopoulos. Matching with text data: An experimental evaluation of methods for matching documents and of measuring match quality. arXiv preprint arXiv:1801.00644, 2018.
Razieh Nabi, Todd McNutt, and Ilya Shpitser. Semiparametric causal sufficient dimension reduction of high dimensional treatments. arXiv preprint arXiv:1710.06727, 2017.
Brady Neal, Chin-Wei Huang, and Sunand Raghupathi. RealCause: Realistic causal inference benchmarking. arXiv preprint arXiv:2011.15007, 2020.
Neha Patki, Roy Wedge, and Kalyan Veeramachaneni. The synthetic data vault. In 2016 IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 399–410. IEEE, 2016.
Judea Pearl. Causality. Cambridge University Press, 2009.
Judea Pearl. On measurement bias in causal inference. In Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence, pages 425–432, 2010.
Judea Pearl and Dana Mackenzie. The Book of Why: The New Science of Cause and Effect. Basic Books, 2018.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners. OpenAI Blog, 1(8):9, 2019.
Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M Dai, Nissan Hajaj, Michaela Hardt, Peter J Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, et al. Scalable and accurate deep learning with electronic health records. NPJ Digital Medicine, 1(1):18, 2018.
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, 2016.
Margaret E Roberts, Brandon M Stewart, and Richard A Nielsen. Adjusting for confounding with text matching. American Journal of Political Science, 2018.
Paul R Rosenbaum. The consequences of adjustment for a concomitant variable that has been affected by the treatment. Journal of the Royal Statistical Society: Series A (General), 147(5):656–666, 1984.
Paul R Rosenbaum and Donald B Rubin. The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1):41–55, 1983.
S Trent Rosenbloom, Joshua C Denny, Hua Xu, Nancy Lorenzi, William W Stead, and Kevin B Johnson. Data from clinical notes: a perspective on the tension between structure and flexible documentation. Journal of the American Medical Informatics Association, 18(2):181–186, 2011.
Patrick Ruch, Robert Baud, and Antoine Geissbühler. Using lexical disambiguation and named-entity recognition to improve spelling correction in the electronic patient record. Artificial Intelligence in Medicine, 29(1-2):169–184, 2003.
Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108v4, 2019.
Guergana K Savova, James J Masanz, Philip V Ogren, Jiaping Zheng, Sunghwan Sohn, Karin C Kipper-Schuler, and Christopher G Chute. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. JAMIA, 17(5):507–513, 2010. doi: 10.1136/jamia.2009.001560. URL https://doi.org/10.1136/jamia.2009.001560.
Emily Sheng, Kai-Wei Chang, Prem Natarajan, and Nanyun Peng. The woman worked as a babysitter: On biases in language generation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 3398–3403, 2019.
Yishai Shimoni, Chen Yanover, Ehud Karavani, and Yaara Goldschmidt. Benchmarking framework for performance-evaluation of causal inference analysis. arXiv preprint arXiv:1802.05046, 2018.
Ilya Shpitser and Judea Pearl. Identification of joint interventional distributions in recursive semi-Markovian causal models. In AAAI, pages 1219–1226, 2006.
Lindsay M Silva, Marianne Coolman, Eric AP Steegers, Vincent WV Jaddoe, Henriette A Moll, Albert Hofman, Johan P Mackenbach, and Hein Raat. Low socioeconomic status is a risk factor for preeclampsia: the Generation R Study. Journal of Hypertension, 26(6):1200–1208, 2008.
Elizabeth A Stuart. Matching methods for causal inference: A review and a look forward. Statistical Science: A Review Journal of the Institute of Mathematical Statistics, 25(1):1, 2010.
Adarsh Subbaswamy, Peter Schulam, and Suchi Saria. Preventing failures due to dataset shift: Learning predictive models that transport. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3118–3127, 2019.
Stijn Vansteelandt, Tyler J VanderWeele, Eric J Tchetgen, and James M Robins. Multiply robust inference for statistical interactions. Journal of the American Statistical Association, 103(484):1693–1704, 2008.
Victor Veitch, Dhanya Sridhar, and David Blei. Adapting text embeddings for causal inference. In Conference on Uncertainty in Artificial Intelligence, pages 919–928. PMLR, 2020.
Hanna M Wallach. Topic modeling: beyond bag-of-words. In ICML, pages 977–984, 2006.
Dingquan Wang and Jason Eisner. Synthetic data made to order: The case of parsing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 1325–1337, Brussels, Belgium, October-November 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1163.
Zhao Wang and Aron Culotta. When do words matter? Understanding the impact of lexical choice on audience perception using individual treatment effect estimation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 7233–7240, 2019.
Galen Weld, Peter West, Maria Glenski, David Arbour, Ryan Rossi, and Tim Althoff. Adjusting for confounders with text: Challenges and an empirical evaluation framework for causal inference. arXiv preprint arXiv:2009.09961, 2020.
T Wendling, K Jung, A Callahan, A Schuler, NH Shah, and B Gallego. Comparing methods for estimation of heterogeneous treatment effects using observational data from health care databases. Statistics in Medicine, 37(23):3309–3324, 2018.
Genta Indra Winata, Andrea Madotto, Chien-Sheng Wu, and Pascale Fung. Code-switched language models using neural based synthetic data from parallel sentences. In Proceedings of the 23rd Conference on Computational Natural Language Learning (CoNLL), pages 271–280, Hong Kong, China, November 2019. Association for Computational Linguistics. doi: 10.18653/v1/K19-1026.
Zach Wood-Doughty, Ilya Shpitser, and Mark Dredze. Challenges of using text classifiers for causal inference. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4586–4598, Brussels, Belgium, 2018. Association for Computational Linguistics. doi: 10.18653/v1/D18-1488.
Chia-Yi Wu, Chin-Kuo Chang, Debbie Robson, Richard Jackson, Shaw-Ji Chen, Richard D Hayes, and Robert Stewart. Evaluation of smoking status identification using electronic health records and open-text information in a large mental health case register. PLoS ONE, 8(9), 2013.
Peng Xu, Mostofa Patwary, Mohammad Shoeybi, Raul Puri, Pascale Fung, Animashree Anandkumar, and Bryan Catanzaro. Controllable story generation with external knowledge using large-scale language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2831–2845, 2020.
Shu Yang and Peng Ding. Combining multiple observational data sources to estimate causal effects. Journal of the American Statistical Association, 115(531):1540–1554, 2020.
Liuyi Yao, Sheng Li, Yaliang Li, Hongfei Xue, Jing Gao, and Aidong Zhang. On the estimation of treatment effect with text covariates. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 4106–4113. AAAI Press, 2019.
Kai Zheng, David A Hanauer, Rema Padman, Michael P Johnson, Anwar A Hussain, Wen Ye, Xiaomu Zhou, and Herbert S Diamond. Handling anticipated exceptions in clinical care: investigating clinician use of 'exit strategies' in an electronic health records system. Journal of the American Medical Informatics Association, 18(6):883–889, 2011.