Computational Causal Inference
Jeffrey C. Wong, Netflix
Abstract
We introduce computational causal inference as an interdisciplinary field across causal inference, algorithm design, and numerical computing. The field aims to develop software specializing in causal inference that can analyze massive datasets with a variety of causal effects, in a performant, general, and robust way. The focus on software improves research agility, and enables causal inference to be easily integrated into large engineering systems. In particular, we use computational causal inference to deepen the relationship between causal inference, online experimentation, and algorithmic decision making.

This paper describes the new field, the demand, opportunities for scalability, open challenges, and begins the discussion for how the community can unite to solve challenges for scaling causal inference and decision making.
Causal inference and machine learning have a symbiotic relationship that is growing deeper. Companies are using machine learning to improve content recommendations, sales, business operations, and to personalize user experiences. These companies will test new algorithms online in order to determine whether the algorithms cause a positive effect for the company. In this capacity, causal inference for online experiments serves as an honest and independent evaluator for an algorithm. However, recent interdisciplinary work in the combination of machine learning and causal inference has shown a much deeper synergy between the two fields. Predictions from machine learned models have been debiased by utilizing inverse propensity weights (Dudik, Langford, and Li 2011), which are frequently found in studies of causal effects. At the same time, causal inference methods have benefited from methods for modeling high dimensional relationships in order to determine heterogeneity in treatment effects, such as in Wager and Athey 2018. Frameworks such as Pearl's do-calculus (Pearl 2012) have also created clear programmatic structure for answering causal effects queries when relationships in data can be modeled as a graph.

The intersection of machine learning and causal inference is particularly strong in the symbiotic relationship between algorithmic policy making and experimentation platforms. In policy making, we are presented with a decision to make; we wish to construct an algorithm that receives features as input and outputs an action to take. The action can be personalized, for example in contextual bandits (Li, Chu, Langford, and Schapire 2010), and should be the optimal action that maximizes a reward function. These algorithms are tested in an online experiment that reports the causal effect on a key performance indicator (KPI) due to the new algorithm. Experimentation and policy making are highly aligned when the reward function for the policy algorithm and the KPI used for an online experiment are the same, thus the policy algorithm must determine the action with the largest causal effect on the KPI. Recent research in contextual bandits (Dimakopoulou et al. 2017) shows new policy algorithms that incorporate methods from the causal inference literature in order to reduce bias in the reward estimator and increase its robustness. In Siddiqi 2019, LinkedIn, Netflix, Facebook, and Dropbox described engineering systems that can support algorithmic policy making, how they are utilized for their products, and their deepening relationship to attribution, a well known causal inference problem in many industries. Similarly, fields like artificial intelligence are also dependent on a rich analysis of causal effects.

It is common to find mature software and engineering systems for an experimentation platform (Fabijan et al. 2017, Kohavi et al. 2013, Deng, Lu, and Litz 2017, Diamantopoulos et al. 2019). However, it is much less common to find mature software for statistics and causal inference that integrates into such systems. One of the main challenges is the lack of software dedicated to estimating causal effects that scales well. For example, policy algorithms train models over high dimensional feature sets, then use them to evaluate several actions and counterfactuals for different combinations of features, a task that demands a computationally efficient engine.
The lack of performant software for such a daunting task creates engineering risk, as well as slow and challenging iteration cycles.

In order to achieve broad adoption of causal inference methods in fields such as experimentation platforms and policy making, the methods need to be general, performant, and robust in software that is easy to use. This requires an interdisciplinary field across causal inference, algorithm design, and numerical computing, which we introduce as computational causal inference (CompCI). The field focuses on software that scales causal inference methods so that they are practical to use broadly in research and in engineering systems. We aim to improve research agility, as well as software that reports treatment effects for online experiments and produces policies that maximize the causal effect on a KPI. By adopting strategies for improving computational performance, causal models can be trained and evaluated efficiently, frequently 30 times faster than off-the-shelf strategies. In addition, there are strategies that can make a single machine scale well, greatly reducing the overhead and maintenance burden in large engineering systems. The combined simplicity and performance reduce friction to apply causal inference in policy making.

This paper discusses the exact algorithmic and human need for scalable causal effects, specifically in performant, general, and robust estimation. Our introduction of computational causal inference calls for community involvement to solve open challenges in software and methods research for causal inference. We review the state of causal effects software in the industry, with exact references to code. To lead an open discussion, we propose a general framework to structure a software library around causal effects, and specific ways to optimize fitting models and estimating distributions of treatment effects. Early development at Netflix shows promising results. Finally, we share the major challenges we hope the broader community will unite on and solve in the field of computational causal inference.

There are two significant classes of engineering systems that motivate the need for performant computation for causal effects: experimentation platforms, and algorithmic policy making engines.
First, an experimentation platform (XP) that reports on multiple online and controlled experiments needs to be able to estimate causal effects at scale. For each experiment, an XP models a variety of causal effects, for example the common average treatment effect, conditional average treatment effects, and time dynamic treatment effects, seen in Figure 1. These effects help the business understand its user base, different segments in the user base, and how they change over time. The volume of the data demands large amounts of memory, and the estimation of the treatment effects on such volume of data can be overwhelming.

Figure 1: Types of treatment effects. (a) Average treatment effect. (b) Conditional average treatment effects report the average effect per segment. (c) Time dynamic treatment effects report the average effect per segment per unit of time.
Ordinary Least Squares (OLS) is usually the foundation for measuring average treatment effects, and extends elegantly into conditional average effects and time dynamic effects (Wong, R. Lewis, and Wardrop 2019). The first step is to fit OLS, and the second is to query it for counterfactual predictions across all potential outcomes. The computational complexity for fitting OLS is usually O(np^2), where n is the number of observations and p is the number of features. In practice, an XP can encounter scenarios where n is extremely large. The OLS model requires interaction terms in order to estimate conditional average treatment effects, making p large. To measure treatment effects over time, we must observe users for multiple time periods, dramatically increasing the number of observations in the analysis; an analysis of causal effects that took n observations can easily become an analysis that takes 3n or 9n observations. After fitting such a large OLS model with covariates, X in R^{n x p}, and dependent variable, y, we must evaluate the model for different potential outcomes. Suppose the set of potential treatments is A = {A_0, A_1, ..., A_K}, with A_0 being the control experience. The conditional average treatment effects are the conditional differences E[y | A_i, X] - E[y | A_0, X] for all i in {1, ..., K}. For each conditional difference, the expectation scores the counterfactual feature matrix of size n x p, where the treatment variable is set to A_i. Generating these counterfactual feature matrices and predictions is again a memory and computationally heavy operation. An XP repeats this lengthy exercise for multiple dependent variables and for multiple experiments, culminating in large amounts of computation.

Second, policy algorithms support engineering systems through automated decision making by recommending actions that cause a system to incrementally reach a better state. They have similar computational complexity to that of an XP. Large applications in product recommendations and artwork have been discussed in Siddiqi 2019, Krishnan 2016 and Nelson 2016. The setup for policy algorithms begins with n users, and for each user we must decide an action from the set A = {A_0, A_1, ..., A_K}. Each user has features, x, and each action generates a reward, R(A, x), with respect to a KPI. A deterministic policy, pi(x), is a function that receives x and returns an action that is believed to generate the optimal reward. Given the current policy, pi_0, we want to know whether there are alternate policies that achieve a larger reward than R(pi_0(x), x), that is, we seek to optimize pi*(x) = argmax_{pi(x)} R(pi(x), x) - R(pi_0(x), x). This formulation can be thought of as a treatment effect, with pi_0 being the control policy and pi(x) the treatment policy.

There are many other ways causal effects problems overlap with policy algorithms, such as:

1. Identifying the best action that improves over pi_0 requires measuring treatment effects.
2. Personalized policies seek heterogeneity in treatment effects. Constant treatment effects yield policies that are also constant.
3. The effect of an action can be a function of time, and can be a function of the actions previously taken. This is similar to analyzing the causal effects of digital ads, which can vary in time and can have compounding or diminishing effects, for example in R. A. Lewis and Wong 2018.
4. A policy algorithm may suggest to take an action, pi*(x).
However, the action that is executed may be a different action, or no action at all. This is a case of noncompliance with the treatment, a common phenomenon in many other fields.
5. A policy algorithm usually assumes that all actions in A can be used with all n users. However, some users may not be qualified for certain actions. Furthermore, the list of actions they are qualified for may change over time. This is similar to modeling causal effects for an at-risk group.

To estimate personalized policies using causal effects, we first fit a causal model. Then, we query it across all possible actions to predict individual treatment effects conditioning on a user's features. One way to build a policy from the individual treatment effects is to find the actions that yield the largest treatment effect. The analysis of causal effects is similar to that in an XP, with the exception that an XP analyzes causal effects retrospectively, and policy algorithms predict causal effects prospectively. Policy algorithms naturally inherit all of the computational complexity that XPs have, and frequently have greater computational complexity than an XP.

The evaluation of a policy is different than the evaluation of a single treatment in an XP, and introduces greater computational complexity for policy engines. Policy algorithms assign variable treatments to different users, whereas an XP reports effects conditioning on a fixed treatment. In order to test if the personalized policy, pi*, is better than pi_0, we can test the hypothesis

H_0: sum_i [ R(pi*(x_i), x_i) - R(pi_0(x_i), x_i) ] = 0 against H_A: sum_i [ R(pi*(x_i), x_i) - R(pi_0(x_i), x_i) ] > 0.

The hypothesis test keeps the reward evaluation general, even in the case when data is autocorrelated. In this case, estimating the distribution of the sum of the treatment effects can be numerically challenging, for example in the case of clustered covariances for OLS (Newey and West 1986). In the more challenging case when a closed form solution for the distribution of the treatment effect does not exist, we may be able to defer to the bootstrap (Efron and R. J. Tibshirani 1994). However, the bootstrap requires fitting the model multiple times, further highlighting the causal inference and numerical challenges in policy engines.

In summary, experimentation platforms use rich causal effects to inform business strategy. These effects are numerically challenging to estimate. Policy engines are similar to XPs, utilizing causal effects for algorithmic decision making, but have even greater numerical challenges. Computational causal inference will provide a focus on scalable algorithms for causal inference, improving engineering systems for both fields.

Computational causal inference's focus on numerical computing delivers agility that enables people to innovate, be productive, and quickly recover from mistakes. This unique focus in CompCI simultaneously serves both the engineering needs in the industry as well as the needs to iterate quickly in research.

New industry engineering systems need to be assessed for risk and for their ability to scale before they are deployed. Sustaining an engineering system is a long term commitment, so the system needs to be capable of continuously scaling with the industry for years. Massive computations, such as the ones we outlined for XPs and policy engines, are risky, and it is unclear whether simply acquiring more hardware can solve scaling challenges.
The risk causes teams to debate the gains and costs of integrating causal inference into their systems. Often, there are three major costs to a team:

1. Instability or latency in the product for the end consumers.
2. The risk that scaling becomes too expensive in terms of money, hardware, and/or time, and will require a significant redesign in the future. The redesign may include the redesign of other engineering services in the ecosystem.
3. The associated loss in team productivity due to managing such large computations.

Alternatively, teams may create more scalable, but nuanced, approximations that deviate from rigorous mathematical frameworks in causal inference. Such deviations can create challenges in the future, where it becomes hard to extend, and hard to absorb new innovations from the causal inference field. CompCI preempts the scalability challenges by optimizing the numerical algorithms for causal effects, reducing the risk in developing and maintaining engineering systems that are based on causal inference. Secondly, CompCI's approach allows failed tasks to be restarted and rerun quickly, improving developer operations and maintainability of the system.

Fast computing helps researchers become productive and innovative. First, fast or interactive computing maximizes cognitive flow (Gray and Rumpe 2017). Scalable computation that runs on a single machine removes the cognitive overhead of managing distributed environments. Attention becomes concentrated on a single machine that returns results in seconds, facilitating a two way communication between the researcher and the data. This helps the researcher transform thought into code, and results into new thought, ultimately improving productivity. Second, fast computing empowers researchers to innovate by making failure less costly. Innovations are always preceded by a long series of failures, which can bear high mental and emotional costs. The challenges to success can be exacerbated when iterations take hours instead of seconds. Fast software not only saves time for the researcher, it makes it easier to recover from mistakes, lowering psychological barriers to innovate and amplifying the researcher's determination. To support the possibility of such an experience, we outline computational strategies in section 7 that greatly improve the performance of causal inference software.

Finally, CompCI provides a path for researchers and industry practitioners to use the same software, which is easy to iterate on and can run large problems on a single machine. This is a powerful productivity advantage for XPs and algorithmic policy making, which are receiving significant development from both academic and industry communities.
Computational causal inference is an interdisciplinary field with a broad audience. Its impact benefits engineers in industry who are developing large scale systems, as well as methods researchers. The relationship between the two communities becomes stronger by consolidating software methods, where CompCI software can be deployed in distributed environments, and on a single machine. Furthermore, the community is better able to leverage causal inference across wider applications.

There are several unsolved challenges in CompCI, and we believe a community of domain experts is required to solve them. As an interdisciplinary field, we invite causal inference scientists, experimenters, statisticians, policy makers, algorithms designers, and software engineers to help advance the methods, application, scalability, and deployment of causal effects. As examples, researchers with experience in causal identification with complex data, for example in econometrics, experimenters with experience analyzing imperfect experiments, for example in clinical trials with noncompliers and defiers, and engineers in numerical computing can have a tremendous impact on CompCI software. We discuss examples of major challenges in a later section.

The remainder of this paper describes ways the community can unite to strengthen the CompCI field. We take the initiative to begin the discussion on major topics, with the aim to evolve a solution together with other experts in the community.
Many resources have been devoted to improving the performance of machine learned models, and to integrating machine learned models with other engineering systems seamlessly. For instance, Xgboost has been designed from its inception to be a highly performant tree boosting algorithm (T. Chen and Guestrin 2016), and contributors have added methods to apply GPU acceleration (Mitchell and Frank 2017). Nvidia has invested in multiple iterations of cuDNN (Chetlur et al. 2014), a library of deep learning primitives. TensorFlow (Abadi et al. 2016) has also received much attention on computing (Mo et al. 2017, Sergeev and Del Balso 2018, Jia et al. 2018). However, fewer resources have been devoted to computational performance and engineering for causal effects.

Spark, Python, and R are large contributors to programmatic access to industry data science. Spark is a distributed computation engine that scales well for data engineering and parts of data science, though there is not much development for causal inference. Python and R have rich ecosystems for modeling causal effects, for example generalized linear models, linear mixed effects models, instrumental variables, matching, propensity scores, doubly robust estimators and regression discontinuity. The R library grf (Wager and Athey 2018) represents cutting edge research that estimates heterogeneous treatment effects using random forests.

At the time of writing, each of these models faces scaling challenges. However, combining best practices from numerical computing can greatly enhance scalability. For many problems, the improved performance affords the luxury of developing in single machine computing environments, and makes research and development more agile. Below, we highlight the state of common software implementations for causal effects models.
1. Ordinary Least Squares (OLS) assumes a model of the form y = beta_0 + A beta_1 + X beta_2 + epsilon, where epsilon ~ N(0, Sigma) and Sigma is block diagonal. It estimates the parameters using the normal equations, fitting beta_hat = (M^T M)^{-1} M^T y, where M = [1, A, X] in R^{n x (1 + K + p)} is the model matrix concatenating a ones vector for the intercept with the treatment variables, A, and covariates, X. The covariances of the parameters are cov(beta_hat) = (M^T M)^{-1} M^T Sigma M (M^T M)^{-1}.
2. To fit the model, both StatsModels and R will convert a dataframe to a model matrix, M, then run a matrix decomposition on M (Seabold and Perktold 2019a and R-Core 2019), for example the SVD or QR decompositions.
3. StatsModels fits this model using numpy arrays, which represent a dense matrix. R uses numeric vectors. Both data structures are dense, so neither language is optimized for storage when the feature matrix has many zeroes, nor are they optimized for sparse linear algebra.
4. We can compute the difference in counterfactuals, E[y | A = A_i, X] - E[y | A = A_0, X], by naively constructing two counterfactual matrices, M(A = A_i) and M(A = A_0). In these two matrices we set the treatment variable to A_i and A_0 respectively. Then the treatment effect is the difference in the predicted counterfactual outcomes, M(A = A_i) beta_hat - M(A = A_0) beta_hat. If there are K actions to evaluate, there would be K counterfactual matrices to generate, all of which would be dense matrices and would suffer from inefficient storage and linear algebra. (A small sketch of this naive approach appears at the end of this section.)

With these constraints it is difficult to use linear models to iterate and find treatment effect heterogeneity. On a problem with a large number of observations and p = 200 features, a dense linear algebra solver spends 30 minutes computing 1000 CATEs.

5.2 Two Stage Least Squares in StatsModels and R's AER

Two stage least squares (Angrist and Imbens 1995) is a model that can estimate a local average treatment effect when the treatment variable is endogeneous. The model estimates the system

y = beta_0 + A beta_1 + X beta_2 + epsilon,    (1)
A = gamma_0 + Z gamma_1 + X gamma_2 + nu.    (2)

This is estimated by first fitting an OLS model to the first stage: A = gamma_0 + Z gamma_1 + X gamma_2 + nu. The fitted values, A_hat = gamma_hat_0 + Z gamma_hat_1 + X gamma_hat_2, are used to estimate the second stage model: y = beta_0 + A_hat beta_1 + X beta_2 + epsilon. This common implementation is also not scalable for sparse data. When A and Z are sparse, the first stage can be solved efficiently with sparse linear algebra, but the implementations in StatsModels and AER (Seabold and Perktold 2019b and Kleiber and Zeileis 2008) use dense algebra. Furthermore, the design of the algorithm relies on materializing A_hat in memory, which is dense because gamma_hat is dense. By fitting two naive OLS models this implementation forces the second stage to be fit using dense methods, even when A and Z are originally sparse.

Generalized Random Forests (grf) (Athey, J. Tibshirani, Wager, et al. 2019) is a rich model based on random forests that can estimate heterogeneity in treatment effects. The software has a highly optimized C++ core and was designed with scalability in mind. However, heterogeneous treatment effect estimation is inherently complex, making it difficult for large problems. In practice, evaluating causal effects with K treatments and m KPIs requires K * m calls to grf (Zhou et al. 2020). A single call to grf on a problem with a large n, p = 10, and num.trees = 4000 takes 2 hours to fit the ensemble with 32 active cores.
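To make the storage and compute costs concrete, the following is a small illustrative sketch, not the implementation of any of the libraries above, of the naive workflow from items 1 and 4: build a dense model matrix, fit beta_hat with the normal equations, and score one dense counterfactual matrix per action. All sizes, column layouts, and helper names are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(0)
n, p, K = 10_000, 5, 3                  # illustrative sizes: users, covariates, treatments

X = rng.normal(size=(n, p))             # covariates
a = rng.integers(0, K + 1, size=n)      # assigned treatment, 0 = control
A = np.eye(K + 1)[a][:, 1:]             # dense one-hot treatment indicators, control dropped
y = 1.0 + X @ rng.normal(size=p) + A @ np.array([0.1, 0.3, -0.2]) + rng.normal(size=n)

def model_matrix(A, X):
    # M = [1, A, X]: intercept, treatment indicators, covariates, all dense.
    return np.hstack([np.ones((len(X), 1)), A, X])

# Item 1: fit beta_hat = (M^T M)^{-1} M^T y with dense linear algebra.
M = model_matrix(A, X)
beta_hat = np.linalg.solve(M.T @ M, M.T @ y)

# Item 4: one dense counterfactual matrix per action, then difference the predictions.
M0 = model_matrix(np.zeros((n, K)), X)          # everyone assigned to control
effects = {}
for i in range(K):
    Ai = np.zeros((n, K)); Ai[:, i] = 1.0       # everyone assigned to treatment i
    Mi = model_matrix(Ai, X)
    effects[i] = np.mean(Mi @ beta_hat - M0 @ beta_hat)

print(effects)   # close to the simulated effects 0.1, 0.3, -0.2

Each loop iteration allocates another n x (1 + K + p) dense matrix, which is exactly the storage pattern the strategies in the following sections aim to eliminate.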
In order to create leverage, computational causal inference needs to generalize a software framework for computing different causal effects from different models, then needs to optimize that framework.

We begin our discussion by using the potential outcomes framework to measure treatment effects. First, we assume a model for the relationship between a KPI, y, a treatment variable, A, and other exogeneous features, X. For simplicity, let A be a binary indicator variable where A = 1 represents a user receiving the treatment experience, and A = 0 the control experience. We can estimate the difference between providing the treatment experience to the user, and not providing the treatment experience, by taking the difference in conditional expectations

E[y | A = 1, X] - E[y | A = 0, X].

If a model, y = f(A, X), has already been assumed a priori, then this definition for the conditional treatment effect is simple. However, experimenting across many models is difficult. There are many models that can estimate E[y | A, X]. Each of these has different requirements for the input data, each has different options, each has a different estimation strategy, and each has a different integration strategy into engineering systems. CompCI needs to generalize a software framework beyond the potential outcomes framework.

Design patterns in machine learning frameworks are leading examples for how software can democratize and provide programmatic access across many models. First, these frameworks have several built-in models, but also expose an API to define an arbitrary model as the minimization of a cost function. The frameworks then apply generic numerical optimization functions to these arbitrary models. The TensorFlow tutorial page (Tensorflow 2020) shows simple and composable code that allows the user to specify the form of a model, then train it without needing to derive an estimation strategy.

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10)
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)
model.fit(train_images, train_labels, epochs=10)

The framework then provides a single entrypoint to make predictions using an arbitrary model.

probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])
predictions = probability_model.predict(test_images)
This simplicity makes integration with other engineering systems seamless; deploying a change to the form of the model automatically changes the estimation strategy and predictions on new features, and is a single point of change for a developer.

CompCI seeks similar software that can generalize the software framing of causal inference problems, create a structured framework for computing causal effects, and make software deployment simple. Frameworks like TensorFlow already simplify the process of training an arbitrary model. After a model is trained, the conditional treatment effect is the difference in conditional expectations comparing the treatment experience to the control experience. However, causal inference problems have two additional layers of complexity that demand a powerful software framework.

First, there are conditions on when the conditional difference can be safely interpreted as a causal effect. Say the model is parameterized by theta_f so that y = f(A, X; theta_f). If the conditional difference is written as a function g(A, X; theta_g), with theta_g a subset of theta_f, it is a causal effect if theta_g is identified. In parametric models, a parameter theta* is identified (Koopmans 1949) if it has an estimator theta_hat* that is consistent for theta*, and the convergence of theta_hat* to theta* does not depend on other parameters. Identification is a property that varies by model, making it challenging for a framework to detect whether an arbitrary model has parameters that can be identified. In most cases it requires declaring assumptions about the data generating process, which should be made understandable to the framework in order to provide safe estimation of treatment effects. After determining a collection of models that have causal identification, the second software challenge is to estimate the distribution of the treatment effect from an arbitrary model. A possible solution to this is to estimate the distribution through the bootstrap (Efron and R. J. Tibshirani 1994). Together, arbitrary model training, safe identification of causal effects, the bootstrap, and the potential outcomes framework can create a general framework for computing treatment effects that can be leveraged in an experimentation platform and an algorithmic policy making engine.

In addition to providing a general framework for measuring causal effects, CompCI software must be scalable and performant. Without this second quality, the software will still be difficult to integrate into engineering systems. The previous State of the Industry section shows common demands in computing:

1. Optimize for sparse data.
2. Efficiently build, and predict with, counterfactual feature matrices.
3. Vectorize or parallelize for multiple KPIs and multiple treatments.
4. Estimate the distribution of the treatment effect efficiently.

Below we provide an overview of strategies that can address these themes; when combined, these strategies yield significant performance improvements to causal inference software. Greater details on each strategy can be found in our related CompCI papers.
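As an illustration of the model-agnostic pattern described above, and not an existing CompCI API, the sketch below treats the fitted model as a black box: any regressor with fit and predict can be scored on counterfactual copies of the data, and a plain bootstrap supplies the distribution of the effect. Identification (for example, a randomized treatment) is assumed, and the function names and simulated data are illustrative.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def average_treatment_effect(model_factory, A, X, y):
    # Fit an arbitrary model on [A, X], then take the difference in conditional
    # expectations E[y | A=1, X] - E[y | A=0, X] by scoring counterfactual copies.
    M = np.column_stack([A, X])
    model = model_factory().fit(M, y)
    M1 = M.copy(); M1[:, 0] = 1.0    # counterfactual: everyone treated
    M0 = M.copy(); M0[:, 0] = 0.0    # counterfactual: everyone on control
    return float(np.mean(model.predict(M1) - model.predict(M0)))

def bootstrap_effect(model_factory, A, X, y, n_boot=30, seed=0):
    # Plain bootstrap of the effect; the model is refit on each resample.
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        draws.append(average_treatment_effect(model_factory, A[idx], X[idx], y[idx]))
    return np.array(draws)

# Illustrative usage with a simulated randomized treatment and a true effect of 0.5.
rng = np.random.default_rng(1)
n = 2_000
X = rng.normal(size=(n, 3))
A = rng.integers(0, 2, size=n).astype(float)
y = 0.5 * A + X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)

draws = bootstrap_effect(GradientBoostingRegressor, A, X, y)
print(draws.mean(), np.percentile(draws, [2.5, 97.5]))

The same two functions work unchanged if the model class is swapped, which is the property the framework needs; the cost is the repeated refitting, which motivates the more efficient strategies below.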
No matter the form of the causal model, a large volume of observations is a significant contributor to computational complexity. For a certain class of models, it is possible to reduce the volume of data while still preserving estimators for average effects, conditional average effects, and time dynamic effects. Since these effects are averages of counterfactual predictions, it is possible to aggregate data at the beginning of the modeling process and still return estimates of the treatment effects.

In the simplest example, the two-sample t-test can be recast as the simple least squares model y = alpha + A beta + epsilon. This models y's conditional means per unique condition: the treatment group, A = 1, and the control group, A = 0. The beta coefficient represents the average difference between the groups. The model can be estimated using size-n arrays, or its data can be aggregated to just the means and variances of the treatment and control groups. This simple example leads to an elegant generalization for data compression using conditional sufficient statistics. A larger linear model that conditions on more features, X, can operate on matrices with n rows, or its data can be aggregated to just the mean and variance per unique condition. A modification to the standard OLS algorithm allows linear models to be estimated on aggregated data without loss. Strategies for compressing for average effects, conditional average effects, and time dynamic effects are discussed in a related CompCI paper (Wong, Forsell, and R. Lewis 2020); a toy sketch of the compression idea appears at the end of this section.

There are three opportunities for sparse data optimization in the modeling stack. First, the creation of a feature matrix from data can be optimized for sparse features. Second, model training can be optimized for sparse algebra. Third, estimating the treatment effects can also be optimized for sparse algebra.

Data is frequently retrieved from a data warehouse in the form of a dataframe, which is then encoded as a feature matrix. A significant optimization is to convert data from warehouses directly into feature matrices, without constructing dataframes. For example, parquet files are serialized with the arrow format (Arrow 2020) for columnar storage, and are aligned with the common columnar storage formats for matrices. Software that constructs a feature matrix from a parquet file, eliminating overhead from dataframes, would have great impact.

Feature matrices frequently contain many zeroes, especially when A or X contain categorical variables that are one-hot encoded. The software library creating the feature matrix, M, must be optimized so that the feature matrix is sparse, which will decrease storage costs and will improve subsequent modeling steps. This is done in SparkML (Spark 2019), and R's Matrix library (Bates and Maechler 2017). After constructing M, estimating the model with M should optimize any linear algebra operations as sparse linear algebra. For example, high performance computing for linear models can be achieved using sparse feature matrices and sparse Cholesky decompositions in Eigen (Guennebaud, Jacob, et al. 2010).

Finally, the estimation of treatment effects through the difference in conditional expectations should also optimize for the creation of sparse feature matrices and sparse linear algebra. Because the difference in conditional expectations holds all X variables constant, and only varies the A variable, the conditional difference can be represented as operations on a sparse matrix.
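Returning to the compression strategy at the start of this section, the following toy sketch shows the lossless property for point estimates when all regressors are categorical, so that rows within a (segment, treatment) cell are identical. It is only an illustration of the idea, not the method of Wong, Forsell, and R. Lewis 2020, which also carries the variances needed for standard errors; the column names and sizes are assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
# Categorical covariate (4 segments) and a binary treatment: rows within a
# (segment, treatment) cell share identical regressors, so the cell mean,
# count, and variance are sufficient statistics for the OLS fit.
segment = rng.integers(0, 4, size=n)
treat = rng.integers(0, 2, size=n)
y = 1.0 + 0.5 * treat + 0.2 * segment + rng.normal(size=n)

df = pd.DataFrame({"segment": segment, "treat": treat, "y": y})

def design(d):
    # Model matrix [1, treat, segment] for this illustrative model.
    return np.column_stack([np.ones(len(d)), d["treat"], d["segment"]])

# Full-data OLS.
M, Y = design(df), df["y"].to_numpy()
beta_full = np.linalg.solve(M.T @ M, M.T @ Y)

# Compressed OLS: one row per unique condition, weighted by its count.
agg = df.groupby(["segment", "treat"], as_index=False).agg(y=("y", "mean"), w=("y", "size"))
Mc, Yc, w = design(agg), agg["y"].to_numpy(), agg["w"].to_numpy()
beta_compressed = np.linalg.solve(Mc.T @ (w[:, None] * Mc), Mc.T @ (w * Yc))

print(np.allclose(beta_full, beta_compressed))  # True: identical point estimates

The compressed fit operates on one row per unique condition, 8 rows here instead of 100,000, which is where the memory and compute savings come from.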
Estimating conditional average treatment effects requires constructing counterfactual feature matrices to simulate the difference between treatment and control experiences. Even with sparse data optimizations this can be a large computational task. A related CompCI paper, Wong, R. Lewis, and Wardrop 2019, shows it is possible to leverage structure in linear models to estimate conditional average effects across multiple treatments, without allocating memory for large counterfactual matrices, by reusing the model matrix used for training.
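The following is a small illustration of the structure such methods exploit, not the algorithm from Wong, R. Lewis, and Wardrop 2019 itself: in a linear model with treatment-by-covariate interactions, the difference between the two counterfactual predictions collapses to the treatment and interaction coefficients, so per-user effects can be computed from X alone without allocating counterfactual copies of M. The model form and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50_000, 3
X = rng.normal(size=(n, p))
a = rng.integers(0, 2, size=n).astype(float)      # single binary treatment for brevity

# Model: y = b0 + a*bA + X @ bX + (a * X) @ bI + noise  (treatment interacted with X).
M = np.column_stack([np.ones(n), a, X, a[:, None] * X])
beta_true = np.concatenate([[1.0, 0.5], [1.0, -0.5, 0.2], [0.3, 0.0, -0.1]])
y = M @ beta_true + rng.normal(size=n)
beta = np.linalg.solve(M.T @ M, M.T @ y)

# Naive route: build counterfactual matrices M1 and M0 and difference the predictions.
M1 = M.copy(); M1[:, 1] = 1.0; M1[:, 2 + p:] = X
M0 = M.copy(); M0[:, 1] = 0.0; M0[:, 2 + p:] = 0.0
cate_naive = M1 @ beta - M0 @ beta

# Structured route: the difference only involves the treatment and interaction
# coefficients, so it can be computed from X alone, with no counterfactual copies.
b_treat, b_inter = beta[1], beta[2 + p:]
cate_structured = b_treat + X @ b_inter

print(np.allclose(cate_naive, cate_structured))   # True

The structured route touches only the p interaction columns per user instead of rebuilding two n x (2 + 2p) matrices, and the same cancellation extends to multiple treatments.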
In both applications, an experimentation platform and an algorithmic policy making engine, there are multiple treatments to consider. The experimenter wants to analyze the effect of each treatment and identify the best one. There may also be multiple KPIs that are used to determine the best treatment. The evaluation of many different treatment effects for many outcomes can usually be done in a vectorized way, where computational overhead is minimized and iteration over KPIs and causal effects has minimal incremental computation. For example, OLS estimates a set of parameters that analyze multiple treatments simultaneously by one-hot encoding the treatment variable and using heteroskedasticity-consistent covariances (Eicker 1967; Huber 1967; White et al. 1980). The normal equations for OLS can be extended to analyze multiple KPIs simultaneously with minimal incremental computation by computing beta_hat = (M^T M)^{-1} M^T Y, where Y is a matrix of KPIs that share a common (M^T M)^{-1}.

We can estimate the sampling distribution of treatment effects generically using the bootstrap. To do this at scale, we may implement the bag of little bootstraps (Kleiner et al. 2014), an efficient way to compute the bootstrap by dividing the data into multiple small partitions, then bootstrapping within each partition. This method can be run in parallel and is scalable. Furthermore, it is generic and can be applied to models without knowing specific properties a priori.

By integrating into a general framework for measuring treatment effects, the bag of little bootstraps becomes an engineering abstraction that allows developers to focus on causal identification and parameter estimation, without having to write specialized functions for estimating the distribution of treatment effects. It is a fundamental component of a simple and unified framework.
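As a sketch of the procedure just described, a simplified reading of Kleiner et al. 2014 rather than their reference implementation, the bag of little bootstraps below estimates an interval for a difference-in-means treatment effect: each small partition is resampled with multinomial weights that sum to the full n, and the per-partition interval endpoints are averaged. Partition counts, the statistic, and the names are illustrative.

import numpy as np

def blb_ci(A, y, n_partitions=10, n_boot=50, alpha=0.05, seed=0):
    # Bag of little bootstraps for the difference-in-means treatment effect.
    # Each partition draws multinomial weights that sum to the full sample size n,
    # and the per-partition confidence bounds are averaged across partitions.
    rng = np.random.default_rng(seed)
    n = len(y)
    parts = np.array_split(rng.permutation(n), n_partitions)
    lowers, uppers = [], []
    for part in parts:
        Ab, yb, b = A[part], y[part], len(part)
        stats = []
        for _ in range(n_boot):
            # Multinomial weights: a size-n resample supported on the b rows.
            w = rng.multinomial(n, np.full(b, 1.0 / b))
            treated, control = w * Ab, w * (1 - Ab)
            stats.append((treated @ yb) / treated.sum() - (control @ yb) / control.sum())
        lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        lowers.append(lo); uppers.append(hi)
    return np.mean(lowers), np.mean(uppers)

# Illustrative usage on simulated data with a true effect of 0.3.
rng = np.random.default_rng(1)
n = 200_000
A = rng.integers(0, 2, size=n).astype(float)
y = 0.3 * A + rng.normal(size=n)
print(blb_ci(A, y))   # interval should cover 0.3

Because each partition only holds n / n_partitions rows, the inner loop never materializes a full-size resample, and the partitions can be processed in parallel.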
In addition to the above strategies, CompCI should leverage conventional wisdom from high performance numericalcomputing.
Memory allocations and deallocations can consume a significant amount of time in numerical computing. For example, one software implementation for computing the variance of a vector can use the sum of the vector and the sum of its squared elements. Allocating a vector to represent the squared elements would be inefficient because that vector will be reduced through the sum function. Numerical algorithms should be designed with the end target in mind, and minimize memory allocations along the way.

Conditional average treatment effects can be thought of as the average treatment effect among a subset of users. This can be computed by taking the subset of the feature matrix, computing counterfactual predictions, then taking the difference. To minimize memory allocations, the subset of the feature matrix should not create a deep copy of the data; it should be a view of the original data. Among linear models, another implementation is to never subset the feature matrix directly, and instead multiply it by a vector of ones and zeros to select the observations that belong to a particular subset.
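A tiny illustration of the last point, with assumed per-user effect predictions already in hand: the subset average can be computed with a selection vector and two dot products on the original buffer, instead of materializing a copy of the subset.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
effects = rng.normal(0.1, 1.0, size=n)          # per-user treatment effect predictions
segment = rng.integers(0, 10, size=n)           # segment label per user

# Copying route: materializes a new array for the subset before averaging.
cate_copy = effects[segment == 3].mean()

# Selection-vector route: a 0/1 vector turns the subset average into two dot
# products on the original buffer, with no per-segment copy of the data.
s = (segment == 3).astype(np.float64)
cate_select = (s @ effects) / s.sum()

print(np.isclose(cate_copy, cate_select))       # True

The same trick applies to the feature matrix itself: s @ ((M_i - M_0) beta_hat) divided by s.sum() yields a segment's conditional effect without slicing M.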
Computations can be optimized by making use of cache memory. One way to do this is to load data so that it is accessed sequentially, improving spatial locality. For example, when computing the treatment effect for a specific subpopulation where X = x*, spatial locality can be improved by loading data that is sorted a priori so that users with X = x* are in consecutive memory blocks.

Netflix's experimentation platform has been investing heavily in CompCI software. Estimating treatment effects on large data with their software is approximately 30 times more performant than off-the-shelf libraries. Estimating 1000 conditional average effects with large n and p of roughly 200 returns in 10 seconds on a single machine.

As an emerging field, computational causal inference has a plethora of challenges, including generalizability of causal inference, numerical computation, statistics and probability theory, software design, and applications of causal inference. We believe experts from different domains are needed to solve remaining problems in CompCI. Below is a list of significant challenges in this new field that we wish to open to the community.

1. Structuring software to enable detection of causal identification can create safe, scalable, and programmatic access to causal effects. It is a challenge because the framework must know some properties of the models implemented. Previous literature exists for identification in causal graphs where the data generating process can be represented as a directed and acyclic graph (DAG) (Pearl 1995 and Tian and Pearl 2002). In such a graph, data are represented as nodes, and causal effects are represented as edges. The backdoor criterion allows an experimenter to query a DAG and know if the causal effect on y due to X is identified (the adjustment formula it licenses is recalled after this list). This can shift the responsibility of causal identification from the model to the data, making it easy for a framework for causal effects to adopt. However, it is unclear how identification in a DAG extends to causal graphs when relationships can be cyclic, which is a common phenomenon in machine learned systems that have immediate feedback loops.

2. One of the ways machine learning has been democratized is the fact that most cost functions for models are differentiable, and can be optimized through stochastic gradient descent (SGD) (Bottou 2010). This leads to minimal friction for developers that want to create, or iterate on, a model: given only a functional form and a cost function, a model can be estimated and predictions can be generated. Similarly, we could structure CompCI software around the class of models with causal identification that are also differentiable, then train them generically using SGD. However, SGD requires hyperparameter tuning, and it is unclear how the risk of poor convergence affects bias in the causal effects estimators. In particular, we do not have a robust understanding of the convergence specifically on the causal parameters of interest.

3. Tracing causal effects for n users over T time periods requires a dataset to grow to n * T observations. When computing the distribution of the treatment effect, a model must also acknowledge the autocorrelation in the data. This becomes both statistically and computationally challenging. A solution to this will prevent overconfidence in the estimates, and is also relevant for offline policy evaluation, since most methods make a simplifying assumption that the data is not autocorrelated.

4. In algorithmic policy making, policy makers may hypothesize that additional treatments have compounding or diminishing returns. While the first treatment can be randomized, experiencing a second treatment is not necessarily random; for example, there can be a selection bias in who returns for a second round of treatment. Developing software that can determine identification for marginal treatment effects is hard, but will create safer software, and can be used in common situations where a user returns for multiple treatments.
Instrumental variable methods can sometimes be used to estimate marginal effects when the first round of treatment is randomized but subsequent rounds are not, however these methods are difficult to scale. A solution for this is a significant contribution to attribution analysis.

5. The set of treatments that a user can experience can change over time and with context. When learning from historic data, we must distinguish between a user not experiencing treatment A, and treatment A not being available at that time. If we have the ability to randomize the data, we must also understand how the probability of experiencing treatment A varied with context, availability, and time. Software that understands conditional randomization, and that the conditions change in time, is extremely important for policy engines that evolve with a business over time.
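For reference, the backdoor adjustment mentioned in challenge 1 is the identity that makes a DAG query actionable: if a set of observed variables Z satisfies the backdoor criterion relative to (X, y), then the interventional distribution can be computed from purely observational quantities,

P(y | do(X = x)) = sum_z P(y | X = x, Z = z) P(Z = z),

so a framework that can verify the criterion on a declared graph can also construct the estimator mechanically.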
10 Conclusion
Computational causal inference, CompCI, is an interdisciplinary field across causal inference, algorithm design, and numerical computing. CompCI aims to unite the community of engineers, scientists, experimenters, statisticians and many others to develop mature software for causal inference. The field addresses engineering needs and human needs for scalability, and directly benefits the deepening relationship between experimentation and personalization in products and algorithms. For example, experimentation platforms and policy engines both use causal inference to drive innovation, automation, and personalization in a company.

We proposed that the design for CompCI software follow design principles from machine learning and the potential outcomes framework, which would separate the framing and identification of causal effects from the estimation strategy. By doing so we can develop a general and public framework for a wide variety of causal effects, then optimize its internal numerical engine privately. This software design would allow users to estimate causal effects by simply specifying the form of a model, and assumptions about the data that can be used to determine if effects are identified. Furthermore, we shared details of numerical computing that are crucial for improving the performance of causal inference algorithms. Mature CompCI software will minimize the amount of friction in researching, developing and applying causal effects to large datasets.

We have started the discussion and made a large investment in increasing the performance of causal inference algorithms. Other fields in physical sciences, social sciences, experimentation, statistics and engineering have local expertise that can be contributed to this interdisciplinary field. As an emerging field, we left several open challenges as a call for others to contribute and help develop this community.

References
[A+19] Susan Athey, Julie Tibshirani, Stefan Wager, et al. "Generalized random forests". In: The Annals of Statistics (2019).
[Aba+16] Martín Abadi et al. "TensorFlow: A system for large-scale machine learning". In: USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2016, pp. 265–283.
[AI95] Joshua D Angrist and Guido W Imbens. "Two-stage least squares estimation of average causal effects in models with variable treatment intensity". In: Journal of the American Statistical Association (1995).
[Arr20] Apache Arrow. https://arrow.apache.org/. 2020.
[BM17] Douglas Bates and Martin Maechler. Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.2-12. 2017. url: https://CRAN.R-project.org/package=Matrix.
[Bot10] Léon Bottou. "Large-scale machine learning with stochastic gradient descent". In: Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.
[CG16] Tianqi Chen and Carlos Guestrin. "Xgboost: A scalable tree boosting system". In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2016, pp. 785–794.
[Che+14] Sharan Chetlur et al. "cuDNN: Efficient primitives for deep learning". In: arXiv preprint arXiv:1410.0759 (2014).
[Dia+19] Nikos Diamantopoulos et al. "Engineering for a Science-Centric Experimentation Platform". In: arXiv preprint arXiv:1910.03878 (2019).
[Dim+17] Maria Dimakopoulou et al. "Estimation considerations in contextual bandits". In: arXiv preprint arXiv:1711.07077 (2017).
[DLL11] Miroslav Dudik, John Langford, and Lihong Li. "Doubly robust policy evaluation and learning". In: arXiv preprint arXiv:1103.4601 (2011).
[DLL17] Alex Deng, Jiannan Lu, and Jonathan Litz. "Trustworthy analysis of online A/B tests: Pitfalls, challenges and solutions". In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 2017, pp. 641–649.
[Eic67] Friedhelm Eicker. "Limit theorems for regressions with unequal and dependent errors". In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1967, pp. 59–82.
[ET94] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC Press, 1994.
[Fab+17] Aleksander Fabijan et al. "The benefits of controlled experimentation at scale". IEEE, 2017, pp. 18–26.
[G+10] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org. 2010.
[GR17] Jeff Gray and Bernhard Rumpe. The importance of flow in software development. 2017.
[Hub67] Peter J. Huber. "The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions". In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1967, pp. 221–234.
[Jia+18] Chengfan Jia et al. "Improving the performance of distributed TensorFlow with RDMA". In: International Journal of Parallel Programming (2018).
[Kle+14] Ariel Kleiner et al. "A scalable bootstrap for massive data". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2014).
[Koh+13] Ron Kohavi et al. "Online controlled experiments at large scale". In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013, pp. 1168–1176.
[Koo49] Tjalling C Koopmans. "Identification problems in economic model construction". In: Econometrica, Journal of the Econometric Society (1949), pp. 125–144.
[Kri16] Gopal Krishnan. Selecting the best artwork for videos through A/B testing. 2016. url: https://medium.com/netflix-techblog/selecting-the-best-artwork-for-videos-through-a-b-testing-f6155c4595f6 (visited on 08/05/2019).
[KZ08] Christian Kleiber and Achim Zeileis. Applied Econometrics with R. ISBN 978-0-387-77316-2. New York: Springer-Verlag, 2008. url: https://CRAN.R-project.org/package=AER.
[Li+10] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. "A contextual-bandit approach to personalized news article recommendation". In: Proceedings of the 19th International Conference on World Wide Web. 2010, pp. 661–670.
[Li+11] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. "Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms". In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. ACM. 2011, pp. 297–306.
[LW18] Randall A Lewis and Jeffrey Wong. "Incrementality bidding & attribution". In: Available at SSRN 3129350 (2018).
[MF17] Rory Mitchell and Eibe Frank. "Accelerating the XGBoost algorithm using GPU computing". In: PeerJ Computer Science (2017).
[Nel16] Nick Nelson. The Power Of A Picture. 2016. url: https://media.netflix.com/en/company-blog/the-power-of-a-picture (visited on 08/05/2019).
[NW86] Whitney K Newey and Kenneth D West. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. 1986.
[Pea12] Judea Pearl. "The do-calculus revisited". In: arXiv preprint arXiv:1210.4852 (2012).
[Pea95] Judea Pearl. "Causal diagrams for empirical research". In: Biometrika (1995).
[RC19] R-Core. Source code for lm in R. 2019. url: https://github.com/wch/r-source/blob/master/src/library/stats/R/lm.R.
[SD18] Alexander Sergeev and Mike Del Balso. "Horovod: fast and easy distributed deep learning in TensorFlow". In: arXiv preprint arXiv:1802.05799 (2018).
[Sid19] Faisal Siddiqi. Infra for Contextual Bandits and Reinforcement Learning. 2019. url: https://netflixtechblog.com/ml-platform-meetup-infra-for-contextual-bandits-and-reinforcement-learning-4a90305948ef (visited on 06/06/2020).
[SP19a] Skipper Seabold and Josef Perktold. Source code for statsmodels.regression.linear_model.OLS. 2019. url: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py.
[SP19b] Skipper Seabold and Josef Perktold. Source code for statsmodels.sandbox.regression.gmm. 2019.
[Spa19] Apache Spark. ML Features. 2019. url: https://spark.apache.org/docs/latest/ml-features.html.
[Ten20] Tensorflow. Basic classification: Classify images of clothing. 2020 (visited on 06/20/2020).
[TP02] Jin Tian and Judea Pearl. "A general identification condition for causal effects". In: AAAI/IAAI. 2002, pp. 567–573.
[WA18] Stefan Wager and Susan Athey. "Estimation and inference of heterogeneous treatment effects using random forests". In: Journal of the American Statistical Association (2018).
[Whi+80] Halbert White et al. "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity". In: Econometrica (1980).
[WLW19] Jeffrey Wong, Randall A Lewis, and Matthew Wardrop. arXiv preprint arXiv:1910.01305 (2019).
[Zho+20] Zhengyuan Zhou et al. 2020.