Computational Causal Inference
Jeffrey C. Wong, Netflix
Abstract
We introduce computational causal inference as an interdisciplinary field across causal inference, algorithm design, and numerical computing. The field aims to develop software specializing in causal inference that can analyze massive datasets with a variety of causal effects, in a performant, general, and robust way. The focus on software improves research agility, and enables causal inference to be easily integrated into large engineering systems. In particular, we use computational causal inference to deepen the relationship between causal inference, online experimentation, and algorithmic decision making.

This paper describes the new field, the demand, opportunities for scalability, open challenges, and begins the discussion for how the community can unite to solve challenges for scaling causal inference and decision making.
Causal inference and machine learning have a symbiotic relationship that is growing deeper. Companies are using machine learning to improve content recommendations, sales, business operations, and to personalize user experiences. These companies will test new algorithms online in order to determine whether the algorithms cause a positive effect for the company. In this capacity, causal inference for online experiments serves as an honest and independent evaluator for an algorithm. However, recent interdisciplinary work in the combination of machine learning and causal inference has shown a much deeper synergy between the two fields. Predictions from machine learned models have been debiased by utilizing inverse propensity weights (Dudik, Langford, and Li 2011), which are frequently found in studies of causal effects. At the same time, causal inference methods have benefited from methods for modeling high dimensional relationships in order to determine heterogeneity in treatment effects, such as in Wager and Athey 2018. Frameworks such as Pearl's do-calculus (Pearl 2012) have also created clear programmatic structure for answering causal effects queries when relationships in data can be modeled as a graph.

The intersection of machine learning and causal inference is particularly strong in the symbiotic relationship between algorithmic policy making and experimentation platforms. In policy making, we are presented with a decision to make; we wish to construct an algorithm that receives features as input and outputs an action to take. The action can be personalized, for example in contextual bandits (Li, Chu, Langford, and Schapire 2010), and should be the optimal action that maximizes a reward function. These algorithms are tested in an online experiment that reports the causal effect on a key performance indicator (KPI) due to the new algorithm. Experimentation and policy making are highly aligned when the reward function for the policy algorithm and the KPI used for an online experiment are the same, thus the policy algorithm must determine the action with the largest causal effect on the KPI. Recent research in contextual bandits (Dimakopoulou et al. 2017) shows new policy algorithms that incorporate methods from the causal inference literature in order to reduce bias in the reward estimator and increase its robustness. In Siddiqi 2019, LinkedIn, Netflix, Facebook, and Dropbox described engineering systems that can support algorithmic policy making, how they are utilized for their products, and their deepening relationship to attribution, a well known causal inference problem in many industries. Similarly, fields like artificial intelligence are also dependent on a rich analysis of causal effects.

It is common to find mature software and engineering systems for an experimentation platform (Fabijan et al. 2017, Kohavi et al. 2013, Deng, Lu, and Litz 2017, Diamantopoulos et al. 2019). However, it is much less common to find mature software for statistics and causal inference that integrates into such systems. One of the main challenges is the lack of software dedicated to estimating causal effects that scales well. For example, policy algorithms train models over high dimensional feature sets, then use them to evaluate several actions and counterfactuals for different combinations of features, a task that demands a computationally efficient engine.
The lack of performant software for such a daunting task creates engineering risk, as well as slow and challenging iteration cycles.

In order to achieve broad adoption of causal inference methods in fields such as experimentation platforms and policy making, the methods need to be general, performant, and robust in software that is easy to use. This requires an interdisciplinary field across causal inference, algorithm design, and numerical computing, which we introduce as computational causal inference (CompCI). The field focuses on software that scales causal inference methods so that they are practical to use broadly in research and in engineering systems. We aim to improve research agility, as well as software that reports treatment effects for online experiments and produces policies that maximize the causal effect on a KPI. By adopting strategies for improving computational performance, causal models can be trained and evaluated efficiently, frequently 30 times faster than off-the-shelf strategies. In addition, there are strategies that can make a single machine scale well, greatly reducing the overhead and maintenance burden in large engineering systems. The combined simplicity and performance reduce friction to apply causal inference in policy making.

This paper discusses the exact algorithmic and human need for scalable causal effects, specifically in performant, general, and robust estimation. Our introduction of computational causal inference calls for community involvement to solve open challenges in software and methods research for causal inference. We review the state of causal effects software in the industry, with exact references to code. To lead an open discussion, we propose a general framework to structure a software library around causal effects, and specific ways to optimize fitting models and estimating distributions of treatment effects. Early development at Netflix shows promising results. Finally, we share the major challenges we hope the broader community will unite on and solve in the field of computational causal inference.

There are two significant classes of engineering systems that motivate the need for performant computation for causal effects: experimentation platforms, and algorithmic policy making engines.
First, an experimentation platform (XP) that reports on multiple online and controlled experiments needs to be able to estimate causal effects at scale. For each experiment, an XP models a variety of causal effects, for example the common average treatment effect, conditional average treatment effects, and time dynamic treatment effects, seen in Figure 1. These effects help the business understand its user base, different segments in the user base, and how they change over time. The volume of the data demands large amounts of memory, and the estimation of the treatment effects on such volume of data can be overwhelming.

Figure 1: Types of treatment effects. (a) Average treatment effect. (b) Conditional average treatment effects report the average effect per segment. (c) Time dynamic treatment effects report the average effect per segment per unit of time.
Ordinary Least Squares (OLS) is usually the foundation for measuring average treatment effects, and extends elegantly into conditional average effects and time dynamic effects (Wong, R. Lewis, and Wardrop 2019). The first step is to fit OLS, and the second is to query it for counterfactual predictions across all potential outcomes. The computational complexity for fitting OLS is usually O(np^2), where n is the number of observations and p is the number of features. In practice, an XP can encounter scenarios where n is extremely large. The OLS model requires interaction terms in order to estimate conditional average treatment effects, making p large. To measure treatment effects over time, we must observe users for multiple time periods, dramatically increasing the number of observations in the analysis; an analysis of causal effects that took n observations can easily become an analysis that takes 3n or 9n observations. After fitting such a large OLS model with covariates, X in R^{n x p}, and dependent variable, y, we must evaluate the model for different potential outcomes. Suppose the set of potential treatments is A = {A_0, A_1, ..., A_K}, with A_0 being the control experience. The conditional average treatment effects are the conditional differences E[y | A_i, X] - E[y | A_0, X] for all i in {1, ..., K}. For each conditional difference, the expectation scores the counterfactual feature matrix of size n x p, where the treatment variable is set to A_i. Generating these counterfactual feature matrices and predictions is again a memory and computationally heavy operation. An XP repeats this lengthy exercise for multiple dependent variables and for multiple experiments, culminating in large amounts of computation.

Second, policy algorithms support engineering systems through automated decision making by recommending actions that cause a system to incrementally reach a better state. They have similar computational complexity to that of an XP. Large applications in product recommendations and artwork have been discussed in Siddiqi 2019, Krishnan 2016 and Nelson 2016. The setup for policy algorithms begins with n users, and for each user we must decide an action from the set A = {A_0, A_1, ..., A_K}. Each user has features, x, and each action generates a reward, R(A, x), with respect to a KPI. A deterministic policy, pi(x), is a function that receives x and returns an action that is believed to generate the optimal reward. Given the current policy, pi_0, we want to know whether there are alternate policies that achieve a larger reward than R(pi_0(x), x), that is, we seek to optimize pi*(x) = argmax_{pi(x)} R(pi(x), x) - R(pi_0(x), x). This formulation can be thought of as a treatment effect, with pi_0 being the control policy and pi(x) the treatment policy.

There are many other ways causal effects problems overlap with policy algorithms, such as:

1. Identifying the best action that improves over pi_0 requires measuring treatment effects.
2. Personalized policies seek heterogeneity in treatment effects. Constant treatment effects yield policies that are also constant.
3. The effect of an action can be a function of time, and can be a function of the actions previously taken. This is similar to analyzing the causal effects of digital ads, which can vary in time and can have compounding or diminishing effects, for example in R. A. Lewis and Wong 2018.
4. A policy algorithm may suggest to take an action, pi*(x).
However, the action that is executed may be a different action, or no action at all. This is a case of noncompliance with the treatment, a common phenomenon in many other fields.
5. A policy algorithm usually assumes that all actions in A can be used with all n users. However, some users may not be qualified for certain actions. Furthermore, the list of actions they are qualified for may change over time. This is similar to modeling causal effects for an at-risk group.

To estimate personalized policies using causal effects, we first fit a causal model. Then, we query it across all possible actions to predict individual treatment effects conditioning on a user's features. One way to build a policy from the individual treatment effects is to find the actions that yield the largest treatment effect. The analysis of causal effects is similar to that in an XP, with the exception that an XP analyzes causal effects retrospectively, and policy algorithms predict causal effects prospectively. Policy algorithms naturally inherit all of the computational complexity that XPs have, and frequently have greater computational complexity than an XP.

The evaluation of a policy is different than the evaluation of a single treatment in an XP, and introduces greater computational complexity for policy engines. Policy algorithms assign variable treatments to different users, whereas an XP reports effects conditioning on a fixed treatment. In order to test if the personalized policy, pi*, is better than pi_0, we can test the hypothesis

H_0: sum_i [ R(pi*(x_i), x_i) - R(pi_0(x_i), x_i) ] = 0 against H_A: sum_i [ R(pi*(x_i), x_i) - R(pi_0(x_i), x_i) ] > 0.

The hypothesis test keeps the reward evaluation general, even in the case when data is autocorrelated. In this case, estimating the distribution of the sum of the treatment effects can be numerically challenging, for example in the case of clustered covariances for OLS (Newey and West 1986). In the more challenging case when a closed form solution for the distribution of the treatment effect does not exist, we may be able to defer to the bootstrap (Efron and R. J. Tibshirani 1994). However, the bootstrap requires fitting the model multiple times, further highlighting the causal inference and numerical challenges in policy engines.

In summary, experimentation platforms use rich causal effects to inform business strategy. These effects are numerically challenging to estimate. Policy engines are similar to XPs, utilizing causal effects for algorithmic decision making, but have even greater numerical challenges. Computational causal inference will provide a focus on scalable algorithms for causal inference, improving engineering systems for both fields.

Computational causal inference's focus on numerical computing delivers agility that enables people to innovate, be productive, and quickly recover from mistakes. This unique focus in CompCI simultaneously serves both the engineering needs in the industry as well as the needs to iterate quickly in research.

New industry engineering systems need to be assessed for risk and for their ability to scale before they are deployed. Sustaining an engineering system is a long term commitment, so the system needs to be capable of continuously scaling with the industry for years. Massive computations, such as the ones we outlined for XPs and policy engines, are risky, and it is unclear whether simply acquiring more hardware can solve scaling challenges.
The risk causes teams to debate the gains and costs of integrating causal inference into their systems. Often, there are three major costs to a team:

1. Instability or latency in the product for the end consumers.
2. The risk that scaling becomes too expensive in terms of money, hardware, and/or time, and will require a significant redesign in the future. The redesign may include the redesign of other engineering services in the ecosystem.
3. The associated loss in team productivity due to managing such large computations.

Alternatively, teams may create more scalable, but nuanced, approximations that deviate from rigorous mathematical frameworks in causal inference. Such deviations can create challenges in the future, where it becomes hard to extend, and hard to absorb new innovations from the causal inference field. CompCI preempts the scalability challenges by optimizing the numerical algorithms for causal effects, reducing the risk in developing and maintaining engineering systems that are based on causal inference. Secondly, CompCI's approach allows failed tasks to be restarted and rerun quickly, improving developer operations and maintainability of the system.

Fast computing helps researchers become productive and innovative. First, fast or interactive computing maximizes cognitive flow (Gray and Rumpe 2017). Scalable computation that runs on a single machine removes the cognitive overhead of managing distributed environments. Attention becomes concentrated on a single machine that returns results in seconds, facilitating a two way communication between the researcher and the data. This helps the researcher transform thought into code, and results into new thought, ultimately improving productivity. Second, fast computing empowers researchers to innovate by making failure less costly. Innovations are always preceded by a long series of failures, which can bear high mental and emotional costs. The challenges to success can be exacerbated when iterations take hours instead of seconds. Fast software not only saves time for the researcher, it makes it easier to recover from mistakes, lowering psychological barriers to innovate and amplifying the researcher's determination. To support the possibility of such an experience, we outline computational strategies in section 7 that greatly improve the performance of causal inference software.

Finally, CompCI provides a path for researchers and industry practitioners to use the same software, which is easy to iterate on and can run large problems on a single machine. This is a powerful productivity advantage for XPs and algorithmic policy making, which are receiving significant development from both academic and industry communities.
Computational causal inference is an interdisciplinary field with a broad audience. Its impact benefits engineers in industry who are developing large scale systems, as well as methods researchers. The relationship between the two communities becomes stronger by consolidating software methods, where CompCI software can be deployed in distributed environments, and on a single machine. Furthermore, the community is better able to leverage causal inference across wider applications.

There are several unsolved challenges in CompCI, and we believe a community of domain experts is required to solve them. As an interdisciplinary field, we invite causal inference scientists, experimenters, statisticians, policy makers, algorithms designers, and software engineers to help advance the methods, application, scalability, and deployment of causal effects. As examples, researchers with experience in causal identification with complex data, for example in econometrics, experimenters with experience analyzing imperfect experiments, for example in clinical trials with noncompliers and defiers, and engineers in numerical computing can have a tremendous impact on CompCI software. We discuss examples of major challenges in a later section.

The remainder of this paper describes ways the community can unite to strengthen the CompCI field. We take the initiative to begin the discussion on major topics, with the aim to evolve a solution together with other experts in the community.
Many resources have been devoted to improving the performance of machine learned models, and to integrating machine learned models with other engineering systems seamlessly. For instance, Xgboost has been designed from its inception to be a highly performant tree boosting algorithm (T. Chen and Guestrin 2016), and contributors have added methods to apply GPU acceleration (Mitchell and Frank 2017). Nvidia has invested in multiple iterations of cuDNN (Chetlur et al. 2014), a library of deep learning primitives. TensorFlow (Abadi et al. 2016) has also received much attention on computing (Mo et al. 2017, Sergeev and Del Balso 2018, Jia et al. 2018). However, fewer resources have been devoted to computational performance and engineering for causal effects.

Spark, Python, and R are large contributors to programmatic access to industry data science. Spark is a distributed computation engine that scales well for data engineering and parts of data science, though there is not much development for causal inference. Python and R have rich ecosystems for modeling causal effects, for example generalized linear models, linear mixed effects models, instrumental variables, matching, propensity scores, doubly robust estimators and regression discontinuity. The R library grf (Wager and Athey 2018) represents cutting edge research that estimates heterogeneous treatment effects using random forests.

At the time of writing, each of these models faces scaling challenges. However, combining best practices from numerical computing can greatly enhance scalability. For many problems, the improved performance affords the luxury of developing in single machine computing environments, and makes research and development more agile. Below, we highlight the state of common software implementations for causal effects models.
1. Ordinary Least Squares (OLS) assumes a model of the form y = beta_0 + A beta_1 + X beta_2 + epsilon, where epsilon ~ N(0, Sigma) and Sigma is block diagonal. It estimates the parameters using the normal equations, fitting beta_hat = (M^T M)^{-1} M^T y, where M = [1, A, X] in R^{n x (1 + K + p)} is the model matrix concatenating a ones vector for the intercept with the treatment variables, A, and covariates, X. The covariances of the parameters are cov(beta_hat) = (M^T M)^{-1} M^T Sigma M (M^T M)^{-1}.
2. To fit the model, both StatsModels and R will convert a dataframe to a model matrix, M, then run a matrix decomposition on M (Seabold and Perktold 2019a and R-Core 2019), for example the SVD or QR decompositions.
3. StatsModels fits this model using numpy arrays, which represent a dense matrix. R uses numeric vectors. Both data structures are dense, so neither language is optimized for storage when the feature matrix has many zeroes, nor are they optimized for sparse linear algebra.
4. We can compute the difference in counterfactuals, E[y | A = A_i, X] - E[y | A = A_0, X], by naively constructing two counterfactual matrices, M(A = A_i) and M(A = A_0). In these two matrices we set the treatment variable to A_i and A_0 respectively. Then the treatment effect is the difference in the predicted counterfactual outcomes, M(A = A_i) beta_hat - M(A = A_0) beta_hat. If there are K actions to evaluate, there would be K counterfactual matrices to generate, all of which would be dense matrices and would suffer from inefficient storage and linear algebra. (A small sketch of this naive approach appears at the end of this section.)

With these constraints it is difficult to use linear models to iterate and find treatment effect heterogeneity. On a problem with a large number of observations and p = 200 features, a dense linear algebra solver spends 30 minutes computing 1000 CATEs.

5.2 Two Stage Least Squares in StatsModels and R's AER

Two stage least squares (Angrist and Imbens 1995) is a model that can estimate a local average treatment effect when the treatment variable is endogeneous. The model estimates the system

y = beta_0 + A beta_1 + X beta_2 + epsilon,    (1)
A = gamma_0 + Z gamma_1 + X gamma_2 + nu.    (2)

This is estimated by first fitting an OLS model to the first stage: A = gamma_0 + Z gamma_1 + X gamma_2 + nu. The fitted values, A_hat = gamma_hat_0 + Z gamma_hat_1 + X gamma_hat_2, are used to estimate the second stage model: y = beta_0 + A_hat beta_1 + X beta_2 + epsilon. This common implementation is also not scalable for sparse data. When A and Z are sparse, the first stage can be solved efficiently with sparse linear algebra, but the implementations in StatsModels and AER (Seabold and Perktold 2019b and Kleiber and Zeileis 2008) use dense algebra. Furthermore, the design of the algorithm relies on materializing A_hat in memory, which is dense because gamma_hat is dense. By fitting two naive OLS models this implementation forces the second stage to be fit using dense methods, even when A and Z are originally sparse.

Generalized Random Forests (grf) (Athey, J. Tibshirani, Wager, et al. 2019) is a rich model based on random forests that can estimate heterogeneity in treatment effects. The software has a highly optimized C++ core and was designed with scalability in mind. However, heterogeneous treatment effect estimation is inherently complex, making it difficult for large problems. In practice, evaluating causal effects with K treatments and m KPIs requires K * m calls to grf (Zhou et al. 2020). A single call to grf on a problem with a large n, p = 10, and num.trees = 4000 takes 2 hours to fit the ensemble with 32 active cores.
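To make the storage and compute costs concrete, the following is a small illustrative sketch, not the implementation of any of the libraries above, of the naive workflow from items 1 and 4: build a dense model matrix, fit beta_hat with the normal equations, and score one dense counterfactual matrix per action. All sizes, column layouts, and helper names are assumptions made for the example.

import numpy as np

rng = np.random.default_rng(0)
n, p, K = 10_000, 5, 3                  # illustrative sizes: users, covariates, treatments

X = rng.normal(size=(n, p))             # covariates
a = rng.integers(0, K + 1, size=n)      # assigned treatment, 0 = control
A = np.eye(K + 1)[a][:, 1:]             # dense one-hot treatment indicators, control dropped
y = 1.0 + X @ rng.normal(size=p) + A @ np.array([0.1, 0.3, -0.2]) + rng.normal(size=n)

def model_matrix(A, X):
    # M = [1, A, X]: intercept, treatment indicators, covariates, all dense.
    return np.hstack([np.ones((len(X), 1)), A, X])

# Item 1: fit beta_hat = (M^T M)^{-1} M^T y with dense linear algebra.
M = model_matrix(A, X)
beta_hat = np.linalg.solve(M.T @ M, M.T @ y)

# Item 4: one dense counterfactual matrix per action, then difference the predictions.
M0 = model_matrix(np.zeros((n, K)), X)          # everyone assigned to control
effects = {}
for i in range(K):
    Ai = np.zeros((n, K)); Ai[:, i] = 1.0       # everyone assigned to treatment i
    Mi = model_matrix(Ai, X)
    effects[i] = np.mean(Mi @ beta_hat - M0 @ beta_hat)

print(effects)   # close to the simulated effects 0.1, 0.3, -0.2

Each loop iteration allocates another n x (1 + K + p) dense matrix, which is exactly the storage pattern the strategies in the following sections aim to eliminate.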
In order to create leverage, computational causal inference needs to generalize a software framework for computing different causal effects from different models, then needs to optimize that framework.

We begin our discussion by using the potential outcomes framework to measure treatment effects. First, we assume a model for the relationship between a KPI, y, a treatment variable, A, and other exogeneous features, X. For simplicity, let A be a binary indicator variable where A = 1 represents a user receiving the treatment experience, and A = 0 the control experience. We can estimate the difference between providing the treatment experience to the user, and not providing the treatment experience, by taking the difference in conditional expectations

E[y | A = 1, X] - E[y | A = 0, X].

If a model, y = f(A, X), has already been assumed a priori, then this definition for the conditional treatment effect is simple. However, experimenting across many models is difficult. There are many models that can estimate E[y | A, X]. Each of these has different requirements for the input data, each has different options, each has a different estimation strategy, and each has a different integration strategy into engineering systems. CompCI needs to generalize a software framework beyond the potential outcomes framework.

Design patterns in machine learning frameworks are leading examples for how software can democratize and provide programmatic access across many models. First, these frameworks have several built-in models, but also expose an API to define an arbitrary model as the minimization of a cost function. The frameworks then apply generic numerical optimization functions to these arbitrary models. The TensorFlow tutorial page (Tensorflow 2020) shows simple and composable code that allows the user to specify the form of a model, then train it without needing to derive an estimation strategy.

model = keras.Sequential([
    keras.layers.Flatten(input_shape=(28, 28)),
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dense(10)
])
model.compile(
    optimizer="adam",
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=["accuracy"]
)
model.fit(train_images, train_labels, epochs=10)

The framework then provides a single entrypoint to make predictions using an arbitrary model.

probability_model = tf.keras.Sequential([model, tf.keras.layers.Softmax()])
predictions = probability_model.predict(test_images)
This simplicity makes integration with other engineering systems seamless; deploying a change to the form of the model automatically changes the estimation strategy and predictions on new features, and is a single point of change for a developer.

CompCI seeks similar software that can generalize the software framing of causal inference problems, create a structured framework for computing causal effects, and make software deployment simple. Frameworks like TensorFlow already simplify the process of training an arbitrary model. After a model is trained, the conditional treatment effect is the difference in conditional expectations comparing the treatment experience to the control experience. However, causal inference problems have two additional layers of complexity that demand a powerful software framework.

First, there are conditions on when the conditional difference can be safely interpreted as a causal effect. Say the model is parameterized by theta_f so that y = f(A, X; theta_f). If the conditional difference is written as a function g(A, X; theta_g), with theta_g a subset of theta_f, it is a causal effect if theta_g is identified. In parametric models, a parameter theta* is identified (Koopmans 1949) if it has an estimator theta_hat* that is consistent for theta*, and the convergence of theta_hat* to theta* does not depend on other parameters. Identification is a property that varies by model, making it challenging for a framework to detect whether an arbitrary model has parameters that can be identified. In most cases it requires declaring assumptions about the data generating process, which should be made understandable to the framework in order to provide safe estimation of treatment effects. After determining a collection of models that have causal identification, the second software challenge is to estimate the distribution of the treatment effect from an arbitrary model. A possible solution to this is to estimate the distribution through the bootstrap (Efron and R. J. Tibshirani 1994). Together, arbitrary model training, safe identification of causal effects, the bootstrap, and the potential outcomes framework can create a general framework for computing treatment effects that can be leveraged in an experimentation platform and an algorithmic policy making engine.

In addition to providing a general framework for measuring causal effects, CompCI software must be scalable and performant. Without this second quality, the software will still be difficult to integrate into engineering systems. The previous State of the Industry section shows common demands in computing:

1. Optimize for sparse data.
2. Efficiently build, and predict with, counterfactual feature matrices.
3. Vectorize or parallelize for multiple KPIs and multiple treatments.
4. Estimate the distribution of the treatment effect efficiently.

Below we provide an overview of strategies that can address these themes; when combined, these strategies yield significant performance improvements to causal inference software. Greater details on each strategy can be found in our related CompCI papers.
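As an illustration of the model-agnostic pattern described above, and not an existing CompCI API, the sketch below treats the fitted model as a black box: any regressor with fit and predict can be scored on counterfactual copies of the data, and a plain bootstrap supplies the distribution of the effect. Identification (for example, a randomized treatment) is assumed, and the function names and simulated data are illustrative.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def average_treatment_effect(model_factory, A, X, y):
    # Fit an arbitrary model on [A, X], then take the difference in conditional
    # expectations E[y | A=1, X] - E[y | A=0, X] by scoring counterfactual copies.
    M = np.column_stack([A, X])
    model = model_factory().fit(M, y)
    M1 = M.copy(); M1[:, 0] = 1.0    # counterfactual: everyone treated
    M0 = M.copy(); M0[:, 0] = 0.0    # counterfactual: everyone on control
    return float(np.mean(model.predict(M1) - model.predict(M0)))

def bootstrap_effect(model_factory, A, X, y, n_boot=30, seed=0):
    # Plain bootstrap of the effect; the model is refit on each resample.
    rng = np.random.default_rng(seed)
    n = len(y)
    draws = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)
        draws.append(average_treatment_effect(model_factory, A[idx], X[idx], y[idx]))
    return np.array(draws)

# Illustrative usage with a simulated randomized treatment and a true effect of 0.5.
rng = np.random.default_rng(1)
n = 2_000
X = rng.normal(size=(n, 3))
A = rng.integers(0, 2, size=n).astype(float)
y = 0.5 * A + X @ np.array([1.0, -0.5, 0.2]) + rng.normal(size=n)

draws = bootstrap_effect(GradientBoostingRegressor, A, X, y)
print(draws.mean(), np.percentile(draws, [2.5, 97.5]))

The same two functions work unchanged if the model class is swapped, which is the property the framework needs; the cost is the repeated refitting, which motivates the more efficient strategies below.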
No matter the form of the causal model, a large volume of observations is a significant contributor to computational complexity. For a certain class of models, it is possible to reduce the volume of data while still preserving estimators for average effects, conditional average effects, and time dynamic effects. Since these effects are averages of counterfactual predictions, it is possible to aggregate data at the beginning of the modeling process and still return estimates of the treatment effects.

In the simplest example, the two-sample t-test can be recast as the simple least squares model y = alpha + A beta + epsilon. This models y's conditional means per unique condition: the treatment group, A = 1, and the control group, A = 0. The beta coefficient represents the average difference between the groups. The model can be estimated using size-n arrays, or its data can be aggregated to just the means and variances of the treatment and control groups. This simple example leads to an elegant generalization for data compression using conditional sufficient statistics. A larger linear model that conditions on more features, X, can operate on matrices with n rows, or its data can be aggregated to just the mean and variance per unique condition. A modification to the standard OLS algorithm allows linear models to be estimated on aggregated data without loss. Strategies for compressing for average effects, conditional average effects, and time dynamic effects are discussed in a related CompCI paper (Wong, Forsell, and R. Lewis 2020); a toy sketch of the compression idea appears at the end of this section.

There are three opportunities for sparse data optimization in the modeling stack. First, the creation of a feature matrix from data can be optimized for sparse features. Second, model training can be optimized for sparse algebra. Third, estimating the treatment effects can also be optimized for sparse algebra.

Data is frequently retrieved from a data warehouse in the form of a dataframe, which is then encoded as a feature matrix. A significant optimization is to convert data from warehouses directly into feature matrices, without constructing dataframes. For example, parquet files are serialized with the arrow format (Arrow 2020) for columnar storage, and are aligned with the common columnar storage formats for matrices. Software that constructs a feature matrix from a parquet file, eliminating overhead from dataframes, would have great impact.

Feature matrices frequently contain many zeroes, especially when A or X contain categorical variables that are one-hot encoded. The software library creating the feature matrix, M, must be optimized so that the feature matrix is sparse, which will decrease storage costs and will improve subsequent modeling steps. This is done in SparkML (Spark 2019), and R's Matrix library (Bates and Maechler 2017). After constructing M, estimating the model with M should optimize any linear algebra operations as sparse linear algebra. For example, high performance computing for linear models can be achieved using sparse feature matrices and sparse Cholesky decompositions in Eigen (Guennebaud, Jacob, et al. 2010).

Finally, the estimation of treatment effects through the difference in conditional expectations should also optimize for the creation of sparse feature matrices and sparse linear algebra. Because the difference in conditional expectations holds all X variables constant, and only varies the A variable, the conditional difference can be represented as operations on a sparse matrix.
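Returning to the compression strategy at the start of this section, the following toy sketch shows the lossless property for point estimates when all regressors are categorical, so that rows within a (segment, treatment) cell are identical. It is only an illustration of the idea, not the method of Wong, Forsell, and R. Lewis 2020, which also carries the variances needed for standard errors; the column names and sizes are assumptions.

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000
# Categorical covariate (4 segments) and a binary treatment: rows within a
# (segment, treatment) cell share identical regressors, so the cell mean,
# count, and variance are sufficient statistics for the OLS fit.
segment = rng.integers(0, 4, size=n)
treat = rng.integers(0, 2, size=n)
y = 1.0 + 0.5 * treat + 0.2 * segment + rng.normal(size=n)

df = pd.DataFrame({"segment": segment, "treat": treat, "y": y})

def design(d):
    # Model matrix [1, treat, segment] for this illustrative model.
    return np.column_stack([np.ones(len(d)), d["treat"], d["segment"]])

# Full-data OLS.
M, Y = design(df), df["y"].to_numpy()
beta_full = np.linalg.solve(M.T @ M, M.T @ Y)

# Compressed OLS: one row per unique condition, weighted by its count.
agg = df.groupby(["segment", "treat"], as_index=False).agg(y=("y", "mean"), w=("y", "size"))
Mc, Yc, w = design(agg), agg["y"].to_numpy(), agg["w"].to_numpy()
beta_compressed = np.linalg.solve(Mc.T @ (w[:, None] * Mc), Mc.T @ (w * Yc))

print(np.allclose(beta_full, beta_compressed))  # True: identical point estimates

The compressed fit operates on one row per unique condition, 8 rows here instead of 100,000, which is where the memory and compute savings come from.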
Estimating conditional average treatment effects requires constructing counterfactual feature matrices to simulate the difference between treatment and control experiences. Even with sparse data optimizations this can be a large computational task. A related CompCI paper, Wong, R. Lewis, and Wardrop 2019, shows it is possible to leverage structure in linear models to estimate conditional average effects across multiple treatments, without allocating memory for large counterfactual matrices, by reusing the model matrix used for training.
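The following is a small illustration of the structure such methods exploit, not the algorithm from Wong, R. Lewis, and Wardrop 2019 itself: in a linear model with treatment-by-covariate interactions, the difference between the two counterfactual predictions collapses to the treatment and interaction coefficients, so per-user effects can be computed from X alone without allocating counterfactual copies of M. The model form and sizes are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, p = 50_000, 3
X = rng.normal(size=(n, p))
a = rng.integers(0, 2, size=n).astype(float)      # single binary treatment for brevity

# Model: y = b0 + a*bA + X @ bX + (a * X) @ bI + noise  (treatment interacted with X).
M = np.column_stack([np.ones(n), a, X, a[:, None] * X])
beta_true = np.concatenate([[1.0, 0.5], [1.0, -0.5, 0.2], [0.3, 0.0, -0.1]])
y = M @ beta_true + rng.normal(size=n)
beta = np.linalg.solve(M.T @ M, M.T @ y)

# Naive route: build counterfactual matrices M1 and M0 and difference the predictions.
M1 = M.copy(); M1[:, 1] = 1.0; M1[:, 2 + p:] = X
M0 = M.copy(); M0[:, 1] = 0.0; M0[:, 2 + p:] = 0.0
cate_naive = M1 @ beta - M0 @ beta

# Structured route: the difference only involves the treatment and interaction
# coefficients, so it can be computed from X alone, with no counterfactual copies.
b_treat, b_inter = beta[1], beta[2 + p:]
cate_structured = b_treat + X @ b_inter

print(np.allclose(cate_naive, cate_structured))   # True

The structured route touches only the p interaction columns per user instead of rebuilding two n x (2 + 2p) matrices, and the same cancellation extends to multiple treatments.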
In both applications, an experimentation platform and an algorithmic policy making engine, there are multiple treatments to consider. The experimenter wants to analyze the effect of each treatment and identify the best one. There may also be multiple KPIs that are used to determine the best treatment. The evaluation of many different treatment effects for many outcomes can usually be done in a vectorized way, where computational overhead is minimized and iteration over KPIs and causal effects has minimal incremental computation. For example, OLS estimates a set of parameters that analyze multiple treatments simultaneously by one-hot encoding the treatment variable and using heteroskedasticity-consistent covariances (Eicker 1967; Huber 1967; White et al. 1980). The normal equations for OLS can be extended to analyze multiple KPIs simultaneously with minimal incremental computation by computing beta_hat = (M^T M)^{-1} M^T Y, where Y is a matrix of KPIs that share a common (M^T M)^{-1}.

We can estimate the sampling distribution of treatment effects generically using the bootstrap. To do this at scale, we may implement the bag of little bootstraps (Kleiner et al. 2014), an efficient way to compute the bootstrap by dividing the data into multiple small partitions, then bootstrapping within each partition. This method can be run in parallel and is scalable. Furthermore, it is generic and can be applied to models without knowing specific properties a priori.

By integrating into a general framework for measuring treatment effects, the bag of little bootstraps becomes an engineering abstraction that allows developers to focus on causal identification and parameter estimation, without having to write specialized functions for estimating the distribution of treatment effects. It is a fundamental component of a simple and unified framework.
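As a sketch of the procedure just described, a simplified reading of Kleiner et al. 2014 rather than their reference implementation, the bag of little bootstraps below estimates an interval for a difference-in-means treatment effect: each small partition is resampled with multinomial weights that sum to the full n, and the per-partition interval endpoints are averaged. Partition counts, the statistic, and the names are illustrative.

import numpy as np

def blb_ci(A, y, n_partitions=10, n_boot=50, alpha=0.05, seed=0):
    # Bag of little bootstraps for the difference-in-means treatment effect.
    # Each partition draws multinomial weights that sum to the full sample size n,
    # and the per-partition confidence bounds are averaged across partitions.
    rng = np.random.default_rng(seed)
    n = len(y)
    parts = np.array_split(rng.permutation(n), n_partitions)
    lowers, uppers = [], []
    for part in parts:
        Ab, yb, b = A[part], y[part], len(part)
        stats = []
        for _ in range(n_boot):
            # Multinomial weights: a size-n resample supported on the b rows.
            w = rng.multinomial(n, np.full(b, 1.0 / b))
            treated, control = w * Ab, w * (1 - Ab)
            stats.append((treated @ yb) / treated.sum() - (control @ yb) / control.sum())
        lo, hi = np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
        lowers.append(lo); uppers.append(hi)
    return np.mean(lowers), np.mean(uppers)

# Illustrative usage on simulated data with a true effect of 0.3.
rng = np.random.default_rng(1)
n = 200_000
A = rng.integers(0, 2, size=n).astype(float)
y = 0.3 * A + rng.normal(size=n)
print(blb_ci(A, y))   # interval should cover 0.3

Because each partition only holds n / n_partitions rows, the inner loop never materializes a full-size resample, and the partitions can be processed in parallel.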
In addition to the above strategies, CompCI should leverage conventional wisdom from high performance numericalcomputing.
Memory allocations and deallocations can consume a significant amount of time in numerical computing. For example, one software implementation for computing the variance of a vector can use the sum of the vector and the sum of its squared elements. Allocating a vector to represent the squared elements would be inefficient because that vector will be reduced through the sum function. Numerical algorithms should be designed with the end target in mind, and minimize memory allocations along the way.

Conditional average treatment effects can be thought of as the average treatment effect among a subset of users. This can be computed by taking the subset of the feature matrix, computing counterfactual predictions, then taking the difference. To minimize memory allocations, the subset of the feature matrix should not create a deep copy of the data; it should be a view of the original data. Among linear models, another implementation is to never subset the feature matrix directly, and instead multiply it by a vector of ones and zeros to select the observations that belong to a particular subset.
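A tiny illustration of the last point, with assumed per-user effect predictions already in hand: the subset average can be computed with a selection vector and two dot products on the original buffer, instead of materializing a copy of the subset.

import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
effects = rng.normal(0.1, 1.0, size=n)          # per-user treatment effect predictions
segment = rng.integers(0, 10, size=n)           # segment label per user

# Copying route: materializes a new array for the subset before averaging.
cate_copy = effects[segment == 3].mean()

# Selection-vector route: a 0/1 vector turns the subset average into two dot
# products on the original buffer, with no per-segment copy of the data.
s = (segment == 3).astype(np.float64)
cate_select = (s @ effects) / s.sum()

print(np.isclose(cate_copy, cate_select))       # True

The same trick applies to the feature matrix itself: s @ ((M_i - M_0) beta_hat) divided by s.sum() yields a segment's conditional effect without slicing M.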
Computations can be optimized by making use of cache memory. One way to do this is to load data so that it is accessed sequentially, improving spatial locality. For example, when computing the treatment effect for a specific subpopulation where X = x*, spatial locality can be improved by loading data that is sorted a priori so that users with X = x* are in consecutive memory blocks.

Netflix's experimentation platform has been investing heavily in CompCI software. Estimating treatment effects on large data with their software is approximately 30 times more performant than off-the-shelf libraries. Estimating 1000 conditional average effects with large n and p of roughly 200 returns in 10 seconds on a single machine.

As an emerging field, computational causal inference has a plethora of challenges, including generalizability of causal inference, numerical computation, statistics and probability theory, software design, and applications of causal inference. We believe experts from different domains are needed to solve remaining problems in CompCI. Below is a list of significant challenges in this new field that we wish to open to the community.

1. Structuring software to enable detection of causal identification can create safe, scalable, and programmatic access to causal effects. It is a challenge because the framework must know some properties of the models implemented. Previous literature exists for identification in causal graphs where the data generating process can be represented as a directed and acyclic graph (DAG) (Pearl 1995 and Tian and Pearl 2002). In such a graph, data are represented as nodes, and causal effects are represented as edges. The backdoor criterion allows an experimenter to query a DAG and know if the causal effect on y due to X is identified (the adjustment formula it licenses is recalled after this list). This can shift the responsibility of causal identification from the model to the data, making it easy for a framework for causal effects to adopt. However, it is unclear how identification in a DAG extends to causal graphs when relationships can be cyclic, which is a common phenomenon in machine learned systems that have immediate feedback loops.

2. One of the ways machine learning has been democratized is the fact that most cost functions for models are differentiable, and can be optimized through stochastic gradient descent (SGD) (Bottou 2010). This leads to minimal friction for developers that want to create, or iterate on, a model: given only a functional form and a cost function, a model can be estimated and predictions can be generated. Similarly, we could structure CompCI software around the class of models with causal identification that are also differentiable, then train them generically using SGD. However, SGD requires hyperparameter tuning, and it is unclear how the risk of poor convergence affects bias in the causal effects estimators. In particular, we do not have a robust understanding of the convergence specifically on the causal parameters of interest.

3. Tracing causal effects for n users over T time periods requires a dataset to grow to n * T observations. When computing the distribution of the treatment effect, a model must also acknowledge the autocorrelation in the data. This becomes both statistically and computationally challenging. A solution to this will prevent overconfidence in the estimates, and is also relevant for offline policy evaluation, since most methods make a simplifying assumption that the data is not autocorrelated.

4. In algorithmic policy making, policy makers may hypothesize that additional treatments have compounding or diminishing returns. While the first treatment can be randomized, experiencing a second treatment is not necessarily random; for example, there can be a selection bias in who returns for a second round of treatment. Developing software that can determine identification for marginal treatment effects is hard, but will create safer software, and can be used in common situations where a user returns for multiple treatments.
Instrumental variable methods can sometimes be used to estimate marginal effects when the first round of treatment is randomized but subsequent rounds are not, however these methods are difficult to scale. A solution for this is a significant contribution to attribution analysis.

5. The set of treatments that a user can experience can change over time and with context. When learning from historic data, we must distinguish between a user not experiencing treatment A, and treatment A not being available at that time. If we have the ability to randomize the data, we must also understand how the probability of experiencing treatment A varied with context, availability, and time. Software that understands conditional randomization, and that the conditions change in time, is extremely important for policy engines that evolve with a business over time.
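For reference, the backdoor adjustment mentioned in challenge 1 is the identity that makes a DAG query actionable: if a set of observed variables Z satisfies the backdoor criterion relative to (X, y), then the interventional distribution can be computed from purely observational quantities,

P(y | do(X = x)) = sum_z P(y | X = x, Z = z) P(Z = z),

so a framework that can verify the criterion on a declared graph can also construct the estimator mechanically.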
10 Conclusion
Computational causal inference, CompCI, is an interdisciplinary field across causal inference, algorithm design, and numerical computing. CompCI aims to unite the community of engineers, scientists, experimenters, statisticians and many others to develop mature software for causal inference. The field addresses engineering needs and human needs for scalability, and directly benefits the deepening relationship between experimentation and personalization in products and algorithms. For example, experimentation platforms and policy engines both use causal inference to drive innovation, automation, and personalization in a company.

We proposed that the design for CompCI software follow design principles from machine learning and the potential outcomes framework, which would separate the framing and identification of causal effects from the estimation strategy. By doing so we can develop a general and public framework for a wide variety of causal effects, then optimize its internal numerical engine privately. This software design would allow users to estimate causal effects by simply specifying the form of a model, and assumptions about the data that can be used to determine if effects are identified. Furthermore, we shared details of numerical computing that are crucial for improving the performance of causal inference algorithms. Mature CompCI software will minimize the amount of friction in researching, developing and applying causal effects to large datasets.

We have started the discussion and made a large investment in increasing the performance of causal inference algorithms. Other fields in physical sciences, social sciences, experimentation, statistics and engineering have local expertise that can be contributed to this interdisciplinary field. As an emerging field, we left several open challenges as a call for others to contribute and help develop this community.

References
[A+19] Susan Athey, Julie Tibshirani, Stefan Wager, et al. "Generalized random forests". In: The Annals of Statistics (2019).
[Aba+16] Martín Abadi et al. "TensorFlow: A system for large-scale machine learning". In: USENIX Symposium on Operating Systems Design and Implementation (OSDI). 2016, pp. 265–283.
[AI95] Joshua D Angrist and Guido W Imbens. "Two-stage least squares estimation of average causal effects in models with variable treatment intensity". In: Journal of the American Statistical Association (1995).
[Arr20] Apache Arrow. https://arrow.apache.org/. 2020.
[BM17] Douglas Bates and Martin Maechler. Matrix: Sparse and Dense Matrix Classes and Methods. R package version 1.2-12. 2017. url: https://CRAN.R-project.org/package=Matrix.
[Bot10] Léon Bottou. "Large-scale machine learning with stochastic gradient descent". In: Proceedings of COMPSTAT'2010. Springer, 2010, pp. 177–186.
[CG16] Tianqi Chen and Carlos Guestrin. "Xgboost: A scalable tree boosting system". In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM. 2016, pp. 785–794.
[Che+14] Sharan Chetlur et al. "cuDNN: Efficient primitives for deep learning". In: arXiv preprint arXiv:1410.0759 (2014).
[Dia+19] Nikos Diamantopoulos et al. "Engineering for a Science-Centric Experimentation Platform". In: arXiv preprint arXiv:1910.03878 (2019).
[Dim+17] Maria Dimakopoulou et al. "Estimation considerations in contextual bandits". In: arXiv preprint arXiv:1711.07077 (2017).
[DLL11] Miroslav Dudik, John Langford, and Lihong Li. "Doubly robust policy evaluation and learning". In: arXiv preprint arXiv:1103.4601 (2011).
[DLL17] Alex Deng, Jiannan Lu, and Jonathan Litz. "Trustworthy analysis of online A/B tests: Pitfalls, challenges and solutions". In: Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 2017, pp. 641–649.
[Eic67] Friedhelm Eicker. "Limit theorems for regressions with unequal and dependent errors". In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1967, pp. 59–82.
[ET94] Bradley Efron and Robert J Tibshirani. An introduction to the bootstrap. CRC Press, 1994.
[Fab+17] Aleksander Fabijan et al. "The benefits of controlled experimentation at scale". IEEE, 2017, pp. 18–26.
[G+10] Gaël Guennebaud, Benoît Jacob, et al. Eigen v3. http://eigen.tuxfamily.org. 2010.
[GR17] Jeff Gray and Bernhard Rumpe. The importance of flow in software development. 2017.
[Hub67] Peter J. Huber. "The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions". In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. University of California Press, 1967, pp. 221–234.
[Jia+18] Chengfan Jia et al. "Improving the performance of distributed TensorFlow with RDMA". In: International Journal of Parallel Programming (2018).
[Kle+14] Ariel Kleiner et al. "A scalable bootstrap for massive data". In: Journal of the Royal Statistical Society: Series B (Statistical Methodology) (2014).
[Koh+13] Ron Kohavi et al. "Online controlled experiments at large scale". In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2013, pp. 1168–1176.
[Koo49] Tjalling C Koopmans. "Identification problems in economic model construction". In: Econometrica, Journal of the Econometric Society (1949), pp. 125–144.
[Kri16] Gopal Krishnan. Selecting the best artwork for videos through A/B testing. 2016. url: https://medium.com/netflix-techblog/selecting-the-best-artwork-for-videos-through-a-b-testing-f6155c4595f6 (visited on 08/05/2019).
[KZ08] Christian Kleiber and Achim Zeileis. Applied Econometrics with R. ISBN 978-0-387-77316-2. New York: Springer-Verlag, 2008. url: https://CRAN.R-project.org/package=AER.
[Li+10] Lihong Li, Wei Chu, John Langford, and Robert E Schapire. "A contextual-bandit approach to personalized news article recommendation". In: Proceedings of the 19th International Conference on World Wide Web. 2010, pp. 661–670.
[Li+11] Lihong Li, Wei Chu, John Langford, and Xuanhui Wang. "Unbiased offline evaluation of contextual-bandit-based news article recommendation algorithms". In: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining. ACM. 2011, pp. 297–306.
[LW18] Randall A Lewis and Jeffrey Wong. "Incrementality bidding & attribution". In: Available at SSRN 3129350 (2018).
[MF17] Rory Mitchell and Eibe Frank. "Accelerating the XGBoost algorithm using GPU computing". In: PeerJ Computer Science (2017).
[Nel16] Nick Nelson. The Power Of A Picture. 2016. url: https://media.netflix.com/en/company-blog/the-power-of-a-picture (visited on 08/05/2019).
[NW86] Whitney K Newey and Kenneth D West. A simple, positive semi-definite, heteroskedasticity and autocorrelation consistent covariance matrix. 1986.
[Pea12] Judea Pearl. "The do-calculus revisited". In: arXiv preprint arXiv:1210.4852 (2012).
[Pea95] Judea Pearl. "Causal diagrams for empirical research". In: Biometrika (1995).
[RC19] R-Core. Source code for lm in R. 2019. url: https://github.com/wch/r-source/blob/master/src/library/stats/R/lm.R.
[SD18] Alexander Sergeev and Mike Del Balso. "Horovod: fast and easy distributed deep learning in TensorFlow". In: arXiv preprint arXiv:1802.05799 (2018).
[Sid19] Faisal Siddiqi. Infra for Contextual Bandits and Reinforcement Learning. 2019. url: https://netflixtechblog.com/ml-platform-meetup-infra-for-contextual-bandits-and-reinforcement-learning-4a90305948ef (visited on 06/06/2020).
[SP19a] Skipper Seabold and Josef Perktold. Source code for statsmodels.regression.linear_model.OLS. 2019. url: https://github.com/statsmodels/statsmodels/blob/master/statsmodels/regression/linear_model.py.
[SP19b] Skipper Seabold and Josef Perktold. Source code for statsmodels.sandbox.regression.gmm. 2019.
[Spa19] Apache Spark. ML Features. 2019. url: https://spark.apache.org/docs/latest/ml-features.html.
[Ten20] Tensorflow. Basic classification: Classify images of clothing. 2020 (visited on 06/20/2020).
[TP02] Jin Tian and Judea Pearl. "A general identification condition for causal effects". In: AAAI/IAAI. 2002, pp. 567–573.
[WA18] Stefan Wager and Susan Athey. "Estimation and inference of heterogeneous treatment effects using random forests". In: Journal of the American Statistical Association (2018).
[Whi+80] Halbert White et al. "A heteroskedasticity-consistent covariance matrix estimator and a direct test for heteroskedasticity". In: Econometrica (1980).
[WLW19] Jeffrey Wong, Randall A Lewis, and Matthew Wardrop. arXiv preprint arXiv:1910.01305 (2019).
[Zho+20] Zhengyuan Zhou et al. 2020.