Comment: Reflections on the Deconfounder
Alexander D'Amour, Google Research, Cambridge, MA ([email protected])
I would like to congratulate the authors on their illuminating article, and thank the editors for the opportunity to discuss the paper. The deconfounder method that this article presents is appealing: a number of important scientific investigations and high-stakes decisions fit into its template. Indeed, as the authors note, instances of the deconfounder have already been deployed without explicit causal language in a number of applied settings. By bringing to light the implicit causal argument that underlies this approach, the authors have sparked an important conversation with potentially far-reaching consequences. It is thus important to carefully outline when we expect the deconfounder method to succeed in characterizing causal relationships and when we expect it to fail.

I have personally been in conversation with the authors over the past two years about this work, and this discussion has yielded some interesting insights, some of which have been published (D'Amour, 2019), and some of which now appear in the current version of the article and in follow-up work (Wang and Blei, 2019). The aim of this note is to draw out some conclusions from this conversation about the role that the deconfounder can play in practical causal inference. In particular, I will make three points here. First, in my role as the critic in this conversation, I will summarize some arguments about the lack of causal identification in the bulk of settings where the "informal" message of the paper suggests that the deconfounder could be used. This is a point that is discussed at length in D'Amour (2019), which motivated the results concerning causal identification in Theorems 6-8. Second, I will argue that adding parametric assumptions to the working model in order to obtain identification of causal parameters (a strategy followed in Theorem 6 and in the experimental examples) is a risky strategy, and should only be done when extremely strong prior information is available. Finally, I will consider the implications of the nonparametric identification results provided for a narrow, but non-trivial, set of causal estimands in Theorems 7 and 8. I will highlight that these results may be even more interesting from the perspective of detecting causal identification from observed data, under relatively weak assumptions about confounders.

Throughout this note, I will draw connections to sensitivity analysis methods that probe the implications of unobserved confounding. This is a natural lens through which to study the deconfounder because many sensitivity analysis methods posit a latent variable model similar to the one that the deconfounder deploys as a working model (see, e.g., Rosenbaum and Rubin, 1983). Well-designed sensitivity analyses can reveal how specific assumptions restrict the range of causal conclusions that are compatible with the observed data, and are thus useful for understanding what is lost when assumptions like "no unobserved confounders" are relaxed to "no unobserved single-cause confounders." Thus, I believe, as the authors suggest, that sensitivity analysis should be a core part of any workflow that deploys the deconfounder, and I discuss at various places how sensitivity analysis could be used effectively in this setting.
Preliminaries
Following the paper, I will denote causes as A := (A^(1), ..., A^(m)), taking specific values a = (a^(1), ..., a^(m)); potential outcomes as Y(a); and latent confounders as Z. To avoid measure-theoretic considerations when writing conditioning statements, I will consider the treatments A^(k) to be discrete. I will write observed outcomes as Y^obs, where, under the stable unit treatment value assumption (SUTVA), Y^obs = Y(A).

Throughout, I will consider models of the joint distribution P(A, Y^obs, Z), which I will refer to as latent variable models. I will assume that unconfoundedness is satisfied conditional on Z:

    Y(a) ⊥⊥ A | Z,  Z-a.e., ∀ a.

Thus, if the latent variable model is fully specified, the potential outcome distributions P(Y(a)) are also specified by the following adjustment formula, which "adjusts" for the confounder Z:

    P(Y(a)) = E[ P(Y^obs | Z, A = a) ]  ∀ a.    (1)

I will refer to the integrand in (1), P(Y^obs | Z, A = a), as the outcome model. If the confounder Z is observed, and the overlap condition is satisfied, then P(Y(a)) is identified from observed data. The question at hand is whether P(Y(a)) can be identified when Z is unobserved.
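To fix ideas, here is a minimal numerical sketch of the adjustment formula (1) in a toy discrete model with a single binary cause, a binary confounder, and a binary outcome. All probability tables are illustrative assumptions, not quantities from the paper; the point is only that the adjusted distribution P(Y(a)) differs from the naive conditional P(Y^obs | A = a) when Z confounds A.

```python
import numpy as np

# Toy ingredients (illustrative assumptions): a binary confounder Z,
# a binary cause A, and a binary outcome Y.
p_z = np.array([0.5, 0.5])                # P(Z = z)
p_a_given_z = np.array([[0.8, 0.2],       # P(A = a | Z = z); rows index z
                        [0.3, 0.7]])
p_y1_given_za = np.array([[0.2, 0.4],     # P(Y = 1 | Z = z, A = a)
                          [0.5, 0.7]])

# Adjustment formula (1): P(Y(a) = 1) = sum_z P(Y = 1 | Z = z, A = a) P(Z = z).
p_y1_do = p_y1_given_za.T @ p_z           # indexed by a

# Naive conditional: P(Y = 1 | A = a), which reweights by P(Z = z | A = a).
p_za = p_a_given_z * p_z[:, None]         # joint P(Z = z, A = a)
p_z_given_a = p_za / p_za.sum(axis=0)
p_y1_given_a = (p_y1_given_za * p_z_given_a).sum(axis=0)

print("adjusted P(Y(a)=1):", p_y1_do)        # [0.35, 0.55]
print("naive P(Y=1 | A=a):", p_y1_given_a)   # differs under confounding
```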
Fundamental Limitations of the Deconfounder Approach

I will begin by summarizing the argument in D'Amour (2019) critiquing the "informal" message about the deconfounder approach (stated most explicitly in the informal statement of Theorem 6 and in Section 3.4). Specifically, this message asserts that, under the "no unobserved single-cause confounders" assumption, any well-fitting latent variable model P(Y^obs, A, Z) will yield the correct potential outcome distribution P(Y(a)) via the adjustment formula (1). This informal story is motivated by strong intuition. Lemmas 1-3 establish that multi-cause confounding leaves an observable "imprint" of dependence between the causes A. Thus, it seems natural that we might be able to gain some information about, and even adjust for, an unobserved multi-cause confounder Z by modeling the dependence between the causes A.

Unfortunately, this intuition can only be carried so far: while a factor model for the causes A can recover information about multi-cause confounders from observed data, the potential outcome distributions P(Y(a)) are not non-parametrically identified, except in cases where all confounding is observed. Thus, without additional unverifiable assumptions, no method can recover the distributions P(Y(a)) when there is unobserved confounding. In this section, I briefly demonstrate why this is the case. For a more in-depth argument about lack of identification in this setting, with concrete examples, see D'Amour (2019).

As I show formally below, the key difficulty is that the causes A cannot be used simultaneously as measurements of the unobserved confounder Z and as treatments whose effects are being estimated. If the event A = a provides only a noisy measurement of Z, there is ambiguity in how the outcome model P(Y^obs | Z, A = a) should align the variability in the residual distributions P(Y^obs | A = a) and P(Z | A = a); there are many specifications of the residual dependence between Y^obs and Z that are compatible with the observed data. This is a classic problem that arises when confounders are measured with error (see, e.g., Ogburn and Vanderweele, 2012). On the other hand, if the event A = a provides a perfect measurement of Z, in the sense that there is some function ẑ such that Z = ẑ(A), then the overlap condition fails. In this case, P(Y^obs | Z, A = a) is only identified at Z = ẑ(a), because the event Z ≠ ẑ(a) has zero probability given A = a in the observed data.

Let us now make this argument formal. To do so, we will account for how the two deconfounder assumptions of (a) good model fit and (b) "no unobserved single-cause confounders" constrain the factor model and its implications for the potential outcomes P(Y(a)). This accounting is convenient if we rewrite the joint distribution using copula densities c(V, W) = P(V, W) / (P(V) P(W)), which characterize the dependence between random variables independently of their marginal distributions:

    P(Y^obs, A, Z) = P(A, Y^obs) · P(Z) c(Z, A) · c(Y^obs, Z | A).    (2)
                     [observed]    [factor model]  [outcome copula]

Each factor in this decomposition corresponds to a different assumption. The requirement of good model fit constrains only the first term, which specifies the distribution of observable quantities, while the "no unobserved single-cause confounders" assumption constrains the second term by constraining the causes to be conditionally independent given Z (Lemma 2).
This leaves the outcome-confounder copula density c(Y^obs, Z | A) = P(Y^obs, Z | A) / (P(Y^obs | A) P(Z | A)) unconstrained. This copula specifies the residual dependence between Y^obs and Z after conditioning on the causes A, and plays a key role in specifying the outcome model P(Y^obs | A, Z).

To complete the argument, note that the potential outcome distributions P(Y(a)) implied by the latent variable model are sensitive to the specification of this copula. Specifically, the estimand in (1) can be written as

    P(Y(a)) = ∫ P(Y^obs | A = a) c(Y^obs, Z | A = a) dP(Z).

Plugging in different specifications of the copula here yields different conclusions about P(Y(a)). Whenever P(Y(a)) ≠ P(Y^obs | A = a), there are multiple specifications of the copula that yield different conclusions about the potential outcomes. (To see this, note that the independence copula c(Y^obs, Z | A = a) = 1 implies that P(Y(a)) = P(Y^obs | A = a); thus, because P(Y(a)) ≠ P(Y^obs | A = a), this copula and the true copula yield different conclusions about P(Y(a)).) Thus, P(Y(a)) is not identified unless there is no confounding and P(Y(a)) = P(Y^obs | A = a).

We can now revisit the tension between the roles of the causes A as measurements of Z and as treatments. In cases where Z can only be inferred inexactly (i.e., P(Z | A = a) is non-degenerate), the marginals P(Y^obs | A = a) and P(Z | A = a) put some constraints on the outcome model P(Y^obs | Z, A = a), but the ambiguity in the copula implies that this model is not identified for any value of Z. In cases where Z can be reconstructed deterministically from the causes by some function ẑ(a) (i.e., P(Z | A = a) is degenerate), the outcome model P(Y^obs | Z, A = a) is identified only when Z = ẑ(a), and the copula is undefined whenever Z ≠ ẑ(a) because this event has zero probability.

The upshot of this argument is that neither the deconfounder nor any other estimation method can adjust for unobserved confounding when estimating P(Y(a)) under the "no unobserved single-cause confounders" assumption alone, even when the confounder Z can be recovered exactly from the causes A. (Indeed, the "no unobserved single-cause confounders" assumption does not uniquely identify the factor model by itself. Some structure also needs to be put on the latent variable, and even then, the factor model may not be identified; see D'Amour (2019) for an example where the factor model P(A, Z) is itself not identified.) Although the single-cause confounding assumption does put some non-trivial structure on the latent variable model, it is not enough for causal estimation.

This lack of identification leaves practitioners looking to apply the deconfounder with two options: either make additional assumptions about the latent variable model P(Y^obs, A, Z) so that P(Y(a)) is identified, or seek out causal comparisons where all of the confounding is effectively observed. In the Theory section of the paper, the authors consider both of these paths, and I will discuss each in turn.

I now turn to the subject of parametric identification of causal parameters, and offer some cautions about employing this strategy. Parametric identification is a natural strategy when the causal parameters of interest are not non-parametrically identified. One obtains parametric identification by adding parametric assumptions to the working model that constrain the implied potential outcome distributions P(Y(a)) to be unique.
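Before turning to the parametric route, a small numerical sketch makes the copula argument concrete. A single binary cause stands in for the vector of causes (the multi-cause structure is not needed for this point); the factor model and observed margins below are illustrative assumptions held fixed, and only the unidentified coupling between Y^obs and Z given A is varied.

```python
import numpy as np

# Factor model (fixed, illustrative): P(Z = 1) and P(A = 1 | Z = z).
p_z1 = 0.5
p_a1_given_z = np.array([0.2, 0.8])
p_z = np.array([1 - p_z1, p_z1])

# Observed outcome margins (fixed, illustrative): P(Y = 1 | A = a).
p_y1_given_a = np.array([0.3, 0.6])

# P(Z = 1 | A = a) from Bayes' rule.
p_a1 = p_a1_given_z @ p_z
p_a = np.array([1 - p_a1, p_a1])
p_z1_given_a = p_a1_given_z * p_z1 / p_a

def adjusted(coupling):
    """P(Y(a) = 1) = sum_z P(Y=1 | Z=z, A=a) P(Z=z), for a = 0, 1."""
    out = []
    for a in (0, 1):
        py1, pz1 = p_y1_given_a[a], p_z1_given_a[a]
        joint11 = coupling(py1, pz1)           # P(Y=1, Z=1 | A=a)
        # Outcome model implied by this coupling:
        py1_z1 = joint11 / pz1
        py1_z0 = (py1 - joint11) / (1 - pz1)
        out.append(py1_z0 * p_z[0] + py1_z1 * p_z[1])
    return np.array(out)

independence = lambda py1, pz1: py1 * pz1      # copula c = 1
comonotone   = lambda py1, pz1: min(py1, pz1)  # maximal positive dependence

print("P(Y(a)=1), independence copula:", adjusted(independence))  # [0.3, 0.6]
print("P(Y(a)=1), comonotone copula:  ", adjusted(comonotone))    # [0.5625, 0.375]
```

Both specifications reproduce the observed data and the factor model exactly, yet they imply different, even directionally opposite, causal conclusions; an identifying parametric assumption amounts to silently selecting one such coupling.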
The authors employ this parametric identification strategy in the experimental demonstrations of the deconfounder, as well as in the formal result in Theorem 6. In Theorem 6, the copula c(Y^obs, Z | A) is restricted by assuming that there is no interaction between the causes A and the latent variable Z in the outcome model (i.e., that they combine linearly), and by assuming that the confounder is piecewise constant in A. In the paper's experiments, the authors assume a parametric factor model (e.g., a quadratic factor model for the genome-wide association study simulation) and a true linear outcome model. In the cases of Theorem 6 and the GWAS simulation study, the authors prove that these parametric assumptions are sufficient for identification.

Parametric identification can be a risky strategy to employ in practice. Specifically, the fact that the parametric assumptions are necessary to identify causal parameters implies that some aspects of these assumptions are not testable in the observed data. The decomposition in (2) makes this clear: given that the observed data are insufficient to identify the causal parameters, the parametric assumptions must restrict some of the unidentified portions of the latent variable model. Thus, to have confidence in this approach, one needs to have confidence in the parametric model used to identify causal effects as a true model of the world, not merely as an acceptable description of the observed data. This is because the identifying parametric assumptions specify not only a descriptive model of the observed data, but also a structural model for unobserved counterfactual outcomes. Relying on parametric identification may be feasible in cases where one has strong prior knowledge (e.g., about the quantity represented by the unmeasured confounder, or about the specific distributions of measurement errors), but such knowledge is often unavailable.

In addition, uncertainty estimates that are based directly on the parametric specification, e.g., Bayesian credible sets, do not capture the full extent of uncertainty about causal effects that remains given the data. Specifically, these uncertainty estimates only quantify uncertainty within the specified model, and do not include the fundamental uncertainty associated with the lack of non-parametric identification of the potential outcome distributions P(Y(a)). As a result, unless the prior information used to specify the parametric assumptions is very strong, these uncertainty estimates will understate the degree of uncertainty about a causal parameter estimate. This is a standard critique of parametric uncertainty quantification, but it carries extra weight in a context where conclusions depend on untestable aspects of the parametric model. For example, for the parametrically identified latent variable model in the GWAS example, as the sample size grows, the posterior for the causal parameter will concentrate around a single value, even though there exists a range of outcome models, corresponding to different copulas c(Y^obs, Z | A = a), that are equally compatible with the observed data but would concentrate on different causal parameters. In fact, even small, seemingly benign parametric choices can mask alternative causal explanations. Lessons from latent variable models in the missing data and causal inference literatures are instructive here.
For example, analyses of the widely-used Heckman selection model (Heckman, 1979) have noted that the tail thickness of priors on latent variables can induce starkly different conclusions that are hidden by using the Gaussian default (Little and Rubin, 2015; Ding, 2014). See also the discussions in Robins et al. (2000) and Linero and Daniels (2017) for other examples.

Here, sensitivity analysis can be a useful tool to account for the fundamental uncertainty due to non-identification of the causal estimand. When performed with parametric models, sensitivity analyses perturb the parametric assumptions made in the estimating model in order to understand what other causal conclusions could be obtained under different parametric specifications. Performing sensitivity analyses on deconfounder estimates is straightforward: a number of sensitivity analysis approaches employ a working model with the same latent variable structure (e.g., Rosenbaum and Rubin, 1983; Imbens, 2003; Dorie et al., 2016; Cinelli and Hazlett, 2018). However, sensitivity analyses can also fall victim to spurious parametric identification if the perturbations are not appropriately parameterized (Gustafson et al., 2018). To avoid this issue, it can be useful to employ sensitivity analysis strategies that cleanly separate the portions of the model that are identified by the observed data from those that are identified only by parametric assumptions (Franks et al., 2019; Robins et al., 2000; Linero and Daniels, 2017). In the context of the deconfounder, the decomposition in (2) is a promising place to start, and is the subject of current work.
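As one minimal illustration of such a separation, the sketch below treats the coupling between Y^obs and Z as an explicit sensitivity parameter and sweeps it over a one-parameter family, holding the identified observed-data distribution and factor model fixed. The binary setup and its numbers continue the illustrative example above; the Fréchet-style interpolation is one crude choice of parameterization among many.

```python
import numpy as np

# Sensitivity sweep over the unidentified copula in (2), in the same toy
# binary setup as above: lam = 0 is the independence copula, lam = 1 is
# maximal positive dependence. All numbers are illustrative assumptions.
p_z = np.array([0.5, 0.5])                  # P(Z = z)
p_z1_given_a = np.array([0.2, 0.8])         # P(Z = 1 | A = a), from factor model
p_y1_given_a = np.array([0.3, 0.6])         # observed P(Y = 1 | A = a)

def p_y_do(a, lam):
    py1, pz1 = p_y1_given_a[a], p_z1_given_a[a]
    # Interpolate between independence and comonotone couplings:
    joint11 = (1 - lam) * py1 * pz1 + lam * min(py1, pz1)  # P(Y=1, Z=1 | A=a)
    py1_z = np.array([(py1 - joint11) / (1 - pz1), joint11 / pz1])
    return py1_z @ p_z                      # adjustment formula (1)

effects = [p_y_do(1, lam) - p_y_do(0, lam) for lam in np.linspace(0, 1, 21)]
print(f"implied effect ranges over [{min(effects):.3f}, {max(effects):.3f}]")
```

The resulting sensitivity region, rather than a single point estimate, is the honest summary of what the observed data alone support.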
Toward a More Selective Deconfounder Workflow

A more cautious alternative to pursuing parametric identification is to seek out causal questions that have definitive answers under the "no unobserved single-cause confounders" assumption. The authors take this path in Theorems 7 and 8, in a setting where the latent confounder Z can be deterministically reconstructed as a function of the causes, ẑ(A). Here, however, the factor model seems less interesting as a tool for calculating causal effects, and more interesting as a tool for establishing empirically when no unobserved confounding is present. In my opinion, this is the more interesting thread to follow.

To review, in Theorem 7 the authors consider partitioning the causes into a set of focal causes A_{1:k}, whose effects will be estimated, and a set of auxiliary causes A_{k+1:m}, which will serve as measurements of the latent confounder. The theorem then states that if the latent confounder can be written as a function of the auxiliary causes alone, Z = ẑ(A_{k+1:m}), then the distributions of potential outcomes defined with respect to the subset of focal causes, P(Y(a_{1:k})), are identifiable subject to an overlap condition. (This is not quite how the theorem is stated, but this functional restriction is implied by the theorem's overlap condition.) Meanwhile, Theorem 8 states that certain counterfactual potential outcome distributions of the form P(Y(a) | A = a') are identifiable as long as the causes a and a' map to the same value of the latent confounder, i.e., ẑ(a) = ẑ(a').

In these results, the authors focus on the role of the factor model in the identification of causal estimands under the "no unobserved single-cause confounders" assumption. However, the factor model is not essential for this point. Note that Theorems 7 and 8 both imply that the causal parameters can be identified in terms of the causes A alone, because it is assumed that the confounder Z can be written as a function of A. Written with slightly more generality, the identification result in Theorem 7 implies

    P(Y(a_{1:k})) = E[ P(Y^obs | A_{1:k} = a_{1:k}, A_{k+1:m}) ],    (3)

while the identification result in Theorem 8 implies

    P(Y(a') | A = a) = P(Y^obs | A = a')  ∀ (a, a') s.t. ẑ(a) = ẑ(a').    (4)

To me, the more interesting point is that the factor model can be used in some cases to determine empirically whether some of the assumptions of the theorems are met. For example, the setting of Theorem 7 can be framed as a problem where the unobserved confounder Z is measured with proxies A_{k+1:m}. It is well understood that in the limit where Z is perfectly recovered by the proxies, the potential outcome distribution P(Y(a_{1:k})) is identified (Ogburn and Vanderweele, 2012); however, in single-cause problems, one cannot determine whether this condition has been met. Similarly, Theorem 8 can be framed as a setting where one imputes a set of counterfactual outcomes within a subpopulation where there is no confounding because, within this subpopulation, the confounder is fixed. Here, too, in single-cause problems, one cannot definitively identify such subpopulations from observed data.
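To make the estimator implied by (3) concrete, the following simulation sketches a toy instance of the Theorem 7 setting with one focal cause and one auxiliary cause, where the confounder coincides with the auxiliary cause, so that ẑ is recovered trivially. The data-generating numbers are assumptions chosen for illustration.

```python
import numpy as np
rng = np.random.default_rng(0)

# Toy data consistent with the Theorem 7 setting: the confounder Z is an
# exact function of the auxiliary cause (here, Z = A2), so adjusting for
# the auxiliary cause as in (3) recovers the focal effect. Illustrative only.
n = 200_000
z = rng.binomial(1, 0.5, n)
a1 = rng.binomial(1, 0.2 + 0.6 * z)          # focal cause, confounded by Z
a2 = z.copy()                                # auxiliary cause; zhat(a2) = a2
y = rng.binomial(1, 0.1 + 0.3 * a1 + 0.4 * z)

def adjusted_mean(a1_val):
    # Empirical version of (3): E_{A2}[ E[Y | A1 = a1_val, A2] ].
    out = 0.0
    for a2_val in (0, 1):
        cell = (a1 == a1_val) & (a2 == a2_val)
        out += y[cell].mean() * (a2 == a2_val).mean()
    return out

naive = y[a1 == 1].mean() - y[a1 == 0].mean()
adj = adjusted_mean(1) - adjusted_mean(0)
print(f"naive contrast: {naive:.3f}; adjusted via (3): {adj:.3f}; truth: 0.300")
```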
Interestingly, the theory of multi-cause confounding presented in the paper suggests that these assumptions can be empirically validated under some restrictions on the causal DAG relating A to Y^obs, together with the "no unobserved single-cause confounders" assumption. For example, this theory supports the following proposition.
Proposition 1. Suppose there are no single-cause confounders, and that the structural relationships between the causes A, the latent confounder Z, and the observed outcome Y^obs can be represented by the DAG in Figure 1. Suppose that, in addition to the causes A, we also observe auxiliary covariates X that are conditionally independent of the causes A given the multi-cause confounder Z. Then for any function ẑ(A, X) such that the causes A are mutually independent conditional on ẑ(A, X), the conditional independence

    A ⊥⊥ Y(a) | ẑ(A, X)

also holds for each a.

[Figure 1: DAG assumed in Proposition 1, representing the relationship between the causes A, the latent confounder Z, the covariates X, and the observed outcome Y^obs.]

Theorems 7 and 8 can be written as consequences of this proposition. The proposition is potentially useful because it shows that the absence of certain confounding structures has observable implications. This insight is closely related to the literature on negative controls (see, e.g., Lipsitch et al., 2010).

This result suggests that one can use a workflow similar to the deconfounder's to determine, at least in principle, whether identification statements like (3) or (4) are valid in a given setting. Specifically, one can obtain a function ẑ(A, X) (perhaps by fitting a factor model), then test whether the causes A appear to be mutually independent conditional on ẑ(A, X). If one is satisfied that this is true, (3) or (4) can be applied. Importantly, this procedure is truly agnostic to the parametric specification of the model used to obtain ẑ(A, X): all of the conditions are functions of observables alone.

While this procedure resembles the deconfounder workflow, it has a different use case. Instead of enabling causal inference in a wide range of cases, it would be used to determine whether one can proceed with unconfounded inference at all, and it can potentially give "no" as an answer. Still, this sort of procedure can prove useful in complex data contexts, where it can be valuable to surface causal questions that can be adequately answered with the available data. In a specific example of this approach, Sharma et al. (2018) propose a similar testing procedure to uncover unconfounded comparisons, and use it to evaluate the causal effect of a recommender system on purchasing rates for certain products.

In outlining this procedure, I have belabored the point that it is a workflow "in principle" because it could prove tricky to implement. The observable implication that needs to be tested is a complex conditional independence statement, and such statements are notoriously difficult to test in practice (Shah and Peters, 2018). In particular, one would receive the "green light" to estimate a causal parameter by failing to reject the null of conditional independence, which can only be relied upon if the test has acceptably high power; designing such tests is difficult, and in some settings impossible.
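To illustrate the shape of this "in principle" workflow, the sketch below obtains a crude ẑ(A) from a first principal component (standing in for a fitted factor model; no covariates X are used here) and then checks a necessary implication of Proposition 1: the causes should be nearly uncorrelated within strata of ẑ. This is a diagnostic, not a formal conditional independence test, and passing it is necessary but not sufficient.

```python
import numpy as np
from itertools import combinations
rng = np.random.default_rng(1)

# Simulated causes sharing a single latent confounder Z (illustrative).
n, m = 50_000, 5
z = rng.normal(size=n)
A = (z[:, None] + rng.normal(size=(n, m)) > 0).astype(float)

# Step 1: obtain zhat(A). A first principal component stands in for a
# fitted factor model here; any substitute confounder estimate would do.
A_c = A - A.mean(axis=0)
zhat = A_c @ np.linalg.svd(A_c, full_matrices=False)[2][0]

# Step 2: probe the necessary implication -- residual pairwise dependence
# between causes within strata of zhat should shrink markedly relative to
# the marginal dependence. Residual correlation here reflects both the
# coarseness of this zhat and any genuine violations.
def max_pairwise_corr(X):
    C = np.corrcoef(X, rowvar=False)
    return max(abs(C[j, k]) for j, k in combinations(range(m), 2))

strata = np.digitize(zhat, np.quantile(zhat, [0.2, 0.4, 0.6, 0.8]))
within = max(
    max_pairwise_corr(A[strata == s]) for s in range(5)
    if A[strata == s].std(axis=0).min() > 0
)
print(f"marginal max |corr| between causes: {max_pairwise_corr(A):.3f}")
print(f"within-stratum max |corr|:          {within:.3f}")
```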
Here, it can again be helpful to turn back to sensitivity analysis. Instead of attempting to rule out all possible forms of dependence between the causes A conditional on ẑ(A, X), a sensitivity analysis could explore a number of candidate models for the residual dependence between the causes and relate these models to the confounding induced by the unobserved confounder Z. For example, one could examine the range of causal effects that would be compatible with the assumption that, conditional on ẑ(A, X), the causes A are no more predictive of a potential outcome Y(a) than any leave-one-out set of the causes A_{-k} is of a held-out cause A^(k). This sort of calibration argument is common in more standard sensitivity analyses (Imbens, 2003; Dorie et al., 2016; Franks et al., 2019; Cinelli and Hazlett, 2018). In cases where dependence between the causes can be ruled out conclusively, this approach would yield a sensitivity region that collapses to a point; in the more likely case where many dependences cannot be ruled out, this approach would represent that uncertainty with a wider sensitivity region. It should be noted that constructing a plausible sensitivity analysis of this type would require deep domain knowledge to justify the analogy between the different dependences between variables. Negative control methods and related identification strategies (Lipsitch et al., 2010; Miao et al., 2018) could be framed as particularly successful executions of this type of argument.

In writing this paper, the authors have drawn attention to a problem that is simultaneously scientifically important, methodologically interesting, and conceptually subtle. Although I have taken on the role of critic in our conversations, I believe their contribution here is important. I remain skeptical about the deconfounder as a method for causal point estimation, but I believe that the authors' characterization of multi-cause confounding could yield fruitful developments in sensitivity analysis, and potentially in obtaining identification results in more complex settings. This work has certainly inspired me to pay more attention to this problem, and to consider how new methods and tools can be developed to help practitioners draw principled causal conclusions in this setting.
References
Carlos Cinelli and Chad Hazlett. Making sense of sensitivity: Extending omitted variable bias. Technical report, Working Paper, 2018.

Peng Ding. Bayesian robust inference of sample selection using selection-t models. Journal of Multivariate Analysis, 124:451-464, 2014.

Vincent Dorie, Masataka Harada, Nicole Bohme Carnegie, and Jennifer Hill. A flexible, interpretable framework for assessing sensitivity to unmeasured confounding. Statistics in Medicine, 35(20):3453-3470, 2016.

Alexander D'Amour. On multi-cause causal inference with unobserved confounding: Counterexamples, impossibility, and alternatives. In The 22nd International Conference on Artificial Intelligence and Statistics, pages 3478-3486, 2019.

Alex Franks, Alex D'Amour, and Avi Feller. Flexible sensitivity analysis for observational studies without observable implications. Journal of the American Statistical Association, 2019.

Paul Gustafson and Lawrence C McCandless. When is a sensitivity parameter exactly that? Statistical Science, 33(1):86-95, 2018.

James J Heckman. Sample selection bias as a specification error. Econometrica, 47(1):153-161, 1979.

Guido W Imbens. Sensitivity to exogeneity assumptions in program evaluation. American Economic Review, 93(2):126-132, 2003.

Antonio R Linero and Michael J Daniels. Bayesian approaches for missing not at random outcome data: The role of identifying restrictions. 2017.

Marc Lipsitch, Eric Tchetgen Tchetgen, and Ted Cohen. Negative controls: A tool for detecting confounding and bias in observational studies. Epidemiology, 21(3):383, 2010.

Roderick JA Little and Donald B Rubin. Statistical Analysis with Missing Data. John Wiley & Sons, 2015.

Wang Miao, Zhi Geng, and Eric J Tchetgen Tchetgen. Identifying causal effects with proxy variables of an unmeasured confounder. Biometrika, 105(4):987-993, 2018.

Elizabeth L Ogburn and Tyler J Vanderweele. Bias attenuation results for nondifferentially mismeasured ordinal and coarsened confounders. Biometrika, 100(1):241-248, 2012.

James M Robins, Andrea Rotnitzky, and Daniel O Scharfstein. Sensitivity analysis for selection bias and unmeasured confounding in missing data and causal inference models. In Statistical Models in Epidemiology, the Environment, and Clinical Trials, pages 1-94. Springer, 2000.

Paul R Rosenbaum and Donald B Rubin. Assessing sensitivity to an unobserved binary covariate in an observational study with binary outcome. Journal of the Royal Statistical Society: Series B (Methodological), 45(2):212-218, 1983.

Rajen D Shah and Jonas Peters. The hardness of conditional independence testing and the generalised covariance measure. arXiv preprint arXiv:1804.07203, 2018.

Amit Sharma, Jake M Hofman, and Duncan J Watts. Split-door criterion: Identification of causal effects through auxiliary outcomes. The Annals of Applied Statistics, 12(4):2699-2733, 2018.

Yixin Wang and David M Blei. Multiple causes: A causal graphical view. arXiv preprint arXiv:1905.12793, 2019.