Heterogeneous Treatment and Spillover Effects under Clustered Network Interference
HHeterogeneous Treatment and Spillover Effectsunder Clustered Network Interference
Falco J. Bargagli-Stoffi ∗ , Costanza Tort´u* , and Laura Forastiere † IMT School for Advanced Studies Lucca Katholieke Universiteit Leuven Yale University
Abstract
The bulk of causal inference studies rules out the presence of interference between units.However, in many real-world scenarios units are interconnected by social, physical or virtualties and the effect of a treatment can spill from one unit to other connected individuals inthe network. In these settings, interference should be taken into account to avoid biasedestimates of the treatment effect, but it can also be leveraged to save resources and providethe intervention to a lower percentage of the population where the treatment is more effectiveand where the effect can spill over to other susceptible individuals. In fact, different peoplemight respond differently not only to the treatment received but also to the treatment receivedby their network contacts. Understanding the heterogeneity of treatment and spillover effectscan help policy-makers in the scale-up phase of the intervention, it can guide the design oftargeting strategies with the ultimate goal of making the interventions more cost-effective, andit might even allow generalizing the level of treatment spillover effects in other populations. Inthis paper, we develop a machine learning method that makes use of tree-based algorithms andan Horvitz-Thompson estimator to assess the heterogeneity of treatment and spillover effectswith respect to individual, neighborhood and network characteristics in the context of clusterednetwork interference. We illustrate how the proposed binary tree methodology performs ina Monte Carlo simulation study. Additionally, we provide an application on a randomizedexperiment aimed at assessing the heterogeneous effects of information sessions on the uptakeof a new weather insurance policy in rural China.
Keywords: causal inference; potential outcomes; interference; social networks; machinelearning; heterogeneous effects
We are grateful for conversations with Andrea Canidio, Kenan Huremovi´c, Julie Josse, Fabrizia Mealli, KarimLounici, Michael Lechner, Hyun Min Kang, Stefan Wager, and for comments to participants at the “2020 Causalinference Opening Workshop (SAMSI Institute), at the “European Cooperation for Statistics and Network Data Sci-ence” (COSTNET) workshop, at the “Causal Machine Learning Workshop” organised by the University of St.Gallen,at the reading groups on “Causal Inference and Machine Learning” at Harvard University and on “Machine Learningand Networks in Economics” at IMT School for Advanced Studies Lucca, and to members of the
Missing Values andCausality Research Group at the ´Ecole Polytechnique. ∗ Equally contributing co-first authors. † Corresponding author: [email protected] a r X i v : . [ s t a t . M E ] S e p Introduction
According to Cox (1958), there is interference between different units if the outcome of one unitis affected by the treatment assignment of other units. In the case of policy interventions or socio-economic programs, interference may arise due to social, physical or virtual interactions. For in-stance, in the case of infectious diseases, unprotected individuals can still benefit from public healthmeasures undertaken in the rest of the population, such as vaccinations or preventive behaviors,because these interventions reduce the reservoir of infection (Nichol et al.; 1995; Bridges et al.;2000), the vector of transmission (Binka et al.; 1998; Howard et al.; 2000) and the number of sus-ceptible individuals (Broderick et al.; 2008; Kelso et al.; 2009; Kissler et al.; 2020). In labor market,job placement assistance can affect job seekers using this service, but it can also have an effect onother job seekers who are competing on the same labor market (McKenzie and Puerto; 2015). Ineducation, learning programs may spill over to untreated peers through knowledge transmissionpaths (Forastiere et al.; 2019; Schiltz et al.; 2019). In marketing, the exposure to an advertisementmight directly affect the consuming behavior of the exposed individuals, and indirectly affect otherindividuals that are influenced by the consuming choices of people in their social network (Par-shakov et al.; 2020). If the exposure to the advertisement takes place in social media, the spillovereffect might be also explained by the propagation of the advertisement to non-exposed users thatare virtually connected to the targeted ones (Chae et al.; 2017).In the presence of interference, the effect of the treatment status of other units on one’s outcomeis usually referred to as spillover effect . In economics and social sciences there has been increasinginterest in estimating spillover effects on networks in many different contexts: Duflo and Saez (2003)study the role of information and social interactions in retirement plan decisions in the academiccommunity; Cai et al. (2015) assess the spillover effect of training sessions on uptake of weatherinsurance in rural areas of China; Muralidharan and Sundararaman (2015) study the aggregateeffects of school choices while Imai et al. (2020) evaluate the effects of the Indian National healthinsurance.Spillovers are a crucial component in understanding the full impact of an intervention at thepopulation-level. In fact, the scale-up phase of a policy requires knowledge about the mechanism1f spillover, and how this would take place in the population where the intervention will be rolledout. Information about spillovers of public policies can also support decisions about how best todeliver interventions, and could be used to guide public funds allocation. Indeed, the presenceof beneficial spillover effects allows treating a lower percentage of the population, because theuntreated individuals would still benefit from the treatment provided to the targeted sample. Theuse of spillover effects to save resources could be further improved if we were able to target specificindividuals who would increase the overall impact of the intervention.A targeting (or seeding) strategy aims at delivering the intervention to certain individualssuch that the impact on the overall population is maximized (e.g., Valente; 2012; Kim et al.;2015; VanderWeele and Christakis; 2019; Montes et al.; 2020). Typically, seeding strategies aredesigned in settings where either an element of the intervention (e.g., information, flyers or couponsprovided during the intervention) or the outcome (e.g., the adoption of a behavior or a product)diffuse through the network. In these settings, the goal is the identification of the set of nodes inthe network that, if targeted, would maximize contagion or diffusion cascades. To do so, seedingstrategies are designed using information on the network structure and the dynamics of contagion ordiffusion. This ‘influence maximization’ problem is NP-hard and computer scientist have developedapproximate algorithms that usually rely on simplified contagion processes (Kempe et al.; 2003).Indeed, it is typically assumed that susceptibility to direct treatment and to others, as well as theinfluential power, are homogeneous across individuals.Here we take a different perspective. First, we investigate spillover effects of a unit’s treatmenton other units’ outcome without specifying the mechanism through which this might take place.Second, we focus on the assessment of the heterogeneity of susceptibility to direct and indirecttreatment. Understanding these heterogeneities can guide the design of targeting strategies aimedat making the interventions more cost-effective. When spillovers are not present, this can be achievedby targeting those with the highest treatment effect. In fact, in the field of personalized medicine it iswell known that individuals with different characteristics might respond differently to the treatment(e.g., Murphy; 2003; Chakraborty and Murphy; 2014; Kosorok and Laber; 2019). In the presence ofinterference, we also have that different people might be more or less susceptible to the treatmentreceived by other units. This means that not only the treatment effect but also spillover effects areheterogeneous. An assessment of the heterogeneity of treatment and spillover effect is crucial not2nly for designing a cost-effective roll-out of the intervention within the targeted population, butit can also allow a generalization of its impact to other populations.
In the past decade causal inference literature has experienced a growing interest in the mechanismof interference, leading to (i) the assessment of bias for causal effects estimated under the no-interference assumption (Sobel; 2006; Forastiere et al.; 2020), (ii) the design of experiments to eitheravoid or assess interference (Angelucci and Di Maro; 2015; Baird et al.; 2018; Kang and Imbens;2016), and (iii) the estimation of casual effects under interference. Estimators for treatment andspillover effects have first been developed under the assumption of partial interference, allowinginterference within groups but not across different groups (Hudgens and Halloran; 2008; Tchetgenand VanderWeele; 2012; Liu and Hudgens; 2014; Liu et al.; 2016; Forastiere et al.; 2016; Basse andFeller; 2018; Forastiere et al.; 2019). However, the assumption of group-level interference is ofteninvalid, too broad or not applicable. Hence, several works focus on the estimation of causal effectsin the context of units interconnected on networks, both in randomized experiments (Aronow andSamii; 2017; Athey et al.; 2018; Leung; 2020) and in observational studies (Ogburn et al.; 2017;Sofrygin and van der Laan; 2017; Forastiere et al.; 2018, 2020). In the context of social networks,even in randomized experiments where the treatment is randomized at the unit-level, exposureto other units’ treatment is not. Therefore the propensity of being exposed to different levels oftreatments among network contacts will depend on the network structure. Aronow and Samii (2017)developed a Horvitz-Thompson estimator to adjust for the imbalance in this propensity across unitsunder different individual and contacts’ treatment status.In parallel to this field of research on interference, in recent year, thanks to the availability ofincreased computing power and large data sets, researchers have started to think about advanceddata-driven methods to assess the heterogeneity of treatment effects with respect to large numbersof features. In this regard, there has been a large number of contributions in causal inferenceon subgroup analysis to investigate heterogeneous effect (e.g., Assmann et al.; 2000; Cook et al.;2004; Rothwell; 2005; Su et al.; 2009; Varadhan and Seeger; 2013; Ratkovic and Tingley; 2017).However, the standard methods for subgroup analysis have several drawbacks. In particular, (1)they strongly rely on the subjective decisions on the specific variables defining the heterogeneous3ub-populations; and (2) they fail to discover heterogeneities other than the ones that are a priori defined by the researchers. In addition, a data-driven method avoids potential problems relatedto cherry-picking the subgroups with extremely high/low treatment effects (Assmann et al.; 2000;Cook et al.; 2004). Hence, many data-driven algorithms for the estimation of heterogeneous causaleffects have been proposed in recent years (Hill; 2011; Foster et al.; 2011; Su et al.; 2012; Athey andImbens; 2016; Wager and Athey; 2018; Athey et al.; 2019; Lechner; 2018; Hahn et al.; 2020). Theunderlying aim of these methods is to detect ‘causal’ rules defining subsets of the study populationwhere the treatment effect for that subgroup deviates from the average treatment effect. Thisis done by selecting the most important features and their values that define a partition of thecovariate space (the tree) where the treatment effect is ‘significantly’ heterogeneous. Among thesealgorithms, many rely on extensions of the standard Classification and Regression Trees (CART)algorithm (Friedman et al.; 1984) and Random Forest (RF) algorithm (Breiman; 2001), and areadapted to different settings (Athey and Imbens; 2016; Zhang et al.; 2017; Lee et al.; 2018; Bargagli-Stoffi and Gnecco; 2018; Guber; 2018; Johnson et al.; 2019; Bargagli-Stoffi and Gnecco; 2020). Inparticular, in their seminal contribution Athey and Imbens (2016) develop the honest causal tree(HCT) methodology. HCT is a causal decision tree algorithm that is aimed at discovering theheterogeneity in causal effects through a single binary tree. HCT is constructed using a criterionfunction aimed at maximising the heterogeneity in causal effects at each split, while penalizingsplits with higher variance in the estimated conditional effects. In addition to a modification of thecriterion function, Athey and Imbens (2016) introduce the concept of honest splitting . The authorspropose to split the overall learning sample into a discovery (or training) set and an estimation set. The former is used to discover the heterogeneity in treatment effects, while the latter is usedto estimate these effects in the discovered sub-populations. The different role of the two sets avoidspotential spurious heterogeneity discovery due to overfitting of the learning sample. Building onthe HCT, Wager and Athey (2018) and Athey et al. (2019) introduce the causal forest (CF) andgeneralized random forest (GRF). Both CF and GRF are ensemble methods where multiple treesare built and combined to improve inference on the heterogeneous causal effects.HCT and similar tree-based methodologies have already been applied to various scenarios for For a recent review the reader can refer to Athey and Imbens (2019). The criterion function is the main difference between HCT and CART. Indeed, CART’s criterion function isaimed at minimizing the empirical predictive error at each split.
In this paper, we bridge the gap between the aforementioned two bodies of causal inference literatureby proposing a new algorithm for the discovery and estimation of heterogeneous treatment andspillover effects with respect to a large number of characteristics, including individual and networkfeatures. Our method is designed for randomized experiments affected by the presence of clustered5etwork interference, that is, units are organized in a clustered structure, with no interactionsbetween clusters and a network of connections within clusters (e.g., friendship networks withinschools). In addition, randomization is assumed at the individual-level, resulting in treated anduntreated units in the same cluster. Under this setting, spillover effects are confined within clustersand are assumed to take place on network interactions.Our proposed method, network causal tree (NCT), builds upon the causal trees proposed byAthey and Imbens (2016), by modifying the splitting criterion to target treatment and spillovereffects under interference. Splits are made so as to maximize the heterogeneity of the targetedcausal effect(s), treatment and/or spillover effects, across the population. This criterion relies onthe unbiasedness of the estimator of the effect(s) within each subset of the population. We firstcontribute to the existing literature on interference by developing an unbiased estimator for con-ditional treatment and spillover effects. We extend the Horvitz-Thompson estimator in Aronowand Samii (2017) to conditional causal effects under clustered network interference and we proveits consistency under the clustered network setting. This estimator is then used in our networkcausal trees to choose the binary splits that maximize the heterogeneity and to finally estimate theheterogeneous causal effects within the selected sub-populations.In order to use the selected partition of the covariate space and the estimated treatment andspillover effects to guide policies, the heterogeneous sub-populations should be identified basedon the causal effects that will be part of the decision rule. For instance, a policy that assignsthe treatment to those who benefit the most from it requires the discovery of the subsets of thepopulations with heterogeneous treatment effect. Alternatively, a policy that is designed to targetthose who will respond to both their own treatment and the neighbors’ treatment must identifysub-populations with high treatment and spillover effects. Hence, the proposed NCT methodologyis optimized to detect the heterogeneity in treatment and spillover effects (i) either simultaneously ,or (ii) separately . By reworking the criterion function of the seminal causal trees algorithm to ac-count for interference, we allow the algorithm to detect heterogeneity in treatment and/or spillovereffects. The discovery of the causal rules (the variables and their values defining heterogeneous sub-populations) representing the heterogeneity of one causal effect (treatment or spillover) is achievedby using a splitting criterion that maximizes the heterogeneity of that specific causal effect acrossthe partition. On the contrary, if our goal is to identify a partition of the covariate space that can6xplain the heterogeneity of multiple causal effects, we propose the use of a composite splitting func-tion that is designed to simultaneously maximize the heterogeneity in all the effects. This flexibilityallows scholars and policy-makers to customize their investigations depending on their targetinggoals. For instance, if a policy-makers wants to target individuals with the highest treatment effectand the lowest spillover effect (with the motivation that those with higher spillover effects canbenefit from the treatment received by others), the NCT algorithm should be implemented witha composite splitting function aimed at detecting subsets of the population where both treatmentand spillover effects are heterogeneous and the decision criterion can be applied. Conversely, if atargeting strategy is designed to target just individuals who would benefit the most from receivingthe treatment, regardless of other people’s assignment, a tree would be built using a single splittingcriterion targeted to maximize the treatment effect heterogeneity. Similarly, a single criterion tar-geted to a spillover effect would be used in the case of targeting rules only involving that spillovereffect.It is important to note that the use of our algorithm to design implementation strategies ispossible thanks to its high level of interpretability. Our network causal trees provide interpretableinference on heterogeneous treatment and spillover effects by discovering a set of causal rules thatcan be represented through a binary tree. As argued by Lee et al. (2018) and Lee et al. (2020) itis important to provide interpretable information on simple causal rules that can be targeted toimprove policy effectiveness and to ensure that stakeholders and policy-makers understand (and, inturn, trust) the functionality of these models. Valdes et al. (2016) claim that a learning algorithmis interpretable if one can explain its classification by a conjunction of conditional statements (i.e.,if-then rules). In this regard, tree-based algorithms based on if-then rules, such as the proposedNCT, are optimal for interpretability.To assess the performance of the proposed NCT algorithm, we run a series of Monte Carlosimulations. In particular, we investigate the performance of the proposed algorithm with respect totwo dimensions: its ability (i) to correctly identify the actual heterogeneous sub-populations and, (ii)to precisely estimate the conditional treatment and spillover effects. While the latter performanceassessment is quite standard in the literature, the former is critical for interpretable algorithmsfor heterogeneous causal effects (Bargagli-Stoffi et al.; 2019). Finally, we apply the proposed NCTalgorithm to a randomized experiment conducted in China to assess the impact of information7essions on the purchase of a new weather insurance policy (Cai et al.; 2015). Besides estimatingthe population average treatment and spillover effects (as already investigated in Cai et al. (2015)),our aim is to detect the strata of the population where one or both effects are heterogeneous andestimate these effects within these strata.The remainder of the paper is organized as follows. In Section 2 we introduce the notation, set-ting, and assumptions that we employ throughout the paper. In Section 3 we define the conditionalcausal effects in a general partition of the covariate space and develop a Horvitz-Thompson estima-tor. Section 4 presents the proposed network causal tree algorithm, which is based on effect-specifor composite splitting functions for causal effects under interference. We then conduct a simulationstudy to assess the performance of the algorithm and estimator under different scenarios in Section5 and we illustrate the application of the network causal tree on a randomized experiment in Sec-tion 6. Section 7 concludes the paper with a discussion of the proposed algorithm and directionsfor further research.
Let us consider a sample V of N units organized in K separate clusters. Let k ∈ K = [1 , . . . , K ] bethe cluster indicator and let i = 1 , . . . , n k be the unit indicator in each cluster k . Let now consider aconnection structure such that units belonging to the same cluster might share a link whereas unitsbelonging to different clusters are not connected. This network structure is represented by the graph G = ( V, E ), where V defines the set of nodes and E defined the set of edges, that is, the collectionof links between each connected pair of nodes. A clustered network G is in turn an ensemble of K disjoint sub-graphs: G k = ( V k , E k ) , k = 1 , . . . , K . The adjacency matrix A corresponding to thegraph G , is a block-diagonal matrix with K blocks, A k , k : 1 , . . . , K , where each element a ij,k isequal to 1 if there is a link between unit i and unit j in cluster k , that is, if the edge (cid:15) ij,k ∈ E k .Elements in A off the K blocks are equal to zero, indicating no links between units belonging todifferent clusters.Let now W ik ∈ { , } be a binary variable representing the treatment assigned to unit i incluster k and let Y ik be the observed outcome. We denote by W k and Y k the treatment and8utcome vectors in each cluster k . Similarly, W and Y denote the treatment and outcome vectorsin the whole sample. It is worth noting that the treatment status is not the same for all unitsbelonging to the same cluster. This treatment allocation corresponds to a unit-level randomizationwhere treatment is assigned to each unit of a cluster independently (as in a Bernoulli trial) or withsome level of dependency (as in a completely randomized trial) based on a probability distribution P ( W k ) (see Section 2.3 for further details). Moreover, for each unit ik we observe a vector X ik of P covariates (or pre-treatment variables) that are not influenced by the treatment assignment. Thevector of covariates might include individual characteristics (e.g., age, sex, socio-economic status,...), cluster-level characteristics (e.g., cluster size, location, ...), as well as network characteristicsrepresenting aggregated individual characteristics (e.g., average age or proportion of males andfemales, ...) or the network topology (e.g., degree, centrality, transitivity, ...).Figure 1 provides a graphical intuition on the clustered network structure and treatment assign-ments at the unit-level. Edges indicate links between units, within each cluster. Colors refer to theindividual treatment assignment: grey colored nodes represent treated units, while white coloredvertices indicate units assigned to the control group.Figure 1: Clustered Network Structure Following the potential outcome framework (Rubin; 1974; Holland; 1986), we denote by Y ik ( w )the potential outcome that unit i in cluster k would experience if the treatment vector W in the9hole sample were w , with w ∈ { , } N . Under the assumption of no-interference, the potentialoutcome could be indexed only by the individual treatment assignment W ik , that is, Y ik ( W ik = w ).In combination with the assumption of consistency , this assumption is known as Stable UnitsTreatment Value Assumption (SUTVA)(Rubin; 1986). The no-interference assumption is clearlyviolated in many real-world scenarios. For instance, the evaluation of the effect of vaccines shouldtake into account the fact that unprotected individuals would also benefit from high vaccinationcoverage around them.Here, we focus on a particular type of interference: clustered network interference . As we willshow later, focusing on this type of interference is critical to ensure asymptotic properties of theestimator for conditional causal effects as well as to allow the network causal tree to divide thesample into a training (or discovery) set and an estimation set (and a testing set, if applicable). The assumption of clustered network interference implies that: i) interference is restricted to nodesof the same cluster and interference between clusters is ruled out, that is, one’s outcome is onlyaffected by the treatment received by units belonging to the same cluster; ii) one’s outcome isaffected by a weighted function of the treatment status of potentially all units in their own cluster,with weights depending on the presence and possibly the value of network links.Let W k/i be the vector collecting the treatment status of all units in cluster k except unit i .Let g ( · ) : { , } n k − −→ ∆ ik be a function that maps a cluster assignment vector W k/i to anexposure value. Without loss of generality, we define it as a function of the dot product betweenthe cluster assignment vector and a vector of weights δ i ( A k , X k ), which in turn depends on theadjacency matrix A k and the covariate matrix X k , i.e., g ( W k/i , δ i ( A k , X k )) = f ( W k/i · δ i ( A k , X k )).For instance, the function g ( · ) could result in the number or proportion of treated units in a cluster.In this case the weight vector would be equal to δ i ( A k , X k ) = n k − or δ i ( A k , X k ) = ( n k − ) n k − ,respectively. Alternatively we could use the adjacency matrix to compute the geodesic distance d ( i, j ) between each pair of nodes in cluster k and let g ( W k/i , δ i ( A k , X k ) = (cid:80) n k − j =1 W jk d ( i,j ) . Thefunction g ( · ) is similar to the ‘effective treatments’ function in Manski (2013) and the ‘exposuremapping’ function in Aronow and Samii (2017), although it applies to the cluster treatment vectoronly. To ease notation, throughout we will omit the weight vector δ i ( A k , X k ) in the function g ( · ). Alternatively, network causal trees could also be extended to the case of one single network as long as theamount of dependency is limited to ensure the consistency of the estimator. Furthermore, to divide the sample intothe different sets needed for the causal tree algorithm, a community detection algorithm could be used to identifyseparate and densely connected communities
10e can now formalize the clustered network interference assumption as follows.
Assumption 1 (Clustered Network Interference) . Given a function g ( · ) : { , } n k − −→ ∆ ik , ∀ k ∈ K , ∀ i ∈ V k , and ∀ W , W (cid:48) ∈ { , } N such that W ik = W (cid:48) ik , g ( W k/i ) = g ( W (cid:48) k/i ) , the followingequality holds: Y ik ( W ) = Y ik ( W (cid:48) ) . Assumption 1 states that the outcome of a unit i in cluster k depends on the individual treat-ment W ik and a function of the treatment status of the other members of cluster k , i.e., g ( W k/i ),regardless of the specific treatment status of each member. This assumption can be viewed as an in-termediate assumption between (i) assuming no interference and (ii) making no assumptions aboutthe nature of interference. In a way, it is similar to the partial interference or the stratified interfer-ence in Hudgens and Halloran (2008), which are special cases of the clustered network interferenceassumption, with g ( W k/i ) = W k/i and g ( W k/i ) = (cid:80) n k j =1 W ij .Let G ik = g ( W k/i ), referred to throughout as network exposure . Under Assumption 1, eachunit has | ∆ ik | × Y ik ( w, g ), representing the potential outcome of unit ik under W ik = w and G ik = g ( W k/i ) = g .We also assume the following consistency assumption: Assumption 2 (Consistency) . Y ik = Y ik ( W ik , G ik ) . This assumption rules out different versions of the treatment and different ways in which a value ofthe network exposure can affect the outcome of a particular unit. Under a ‘finite sample perspective’,we assume the potential outcomes of each unit to be fixed but unknown, except for the observed Y ik ( W ik , G ik ). Therefore, the only source of randomness in the potential outcomes is given by therandom assignment to the treatment and the random network exposure induced by the randomcluster assignment.Assumptions 1 and 2 together are alternative to SUTVA when interference is present and islimited to within clusters. When the weight function δ i ( A k , X k ) is such that elements δ ij ( A k , X k ) =0 if j ∈ V k : a ij,k = 0, that is, units that are not directly connected to unit i receive a weight equal tozero, then interference is limited to the neighborhood N ik of each unit, with N ik = { j ∈ V k : a ij,k =1 } . In this case, Assumptions 1 and 2 correspond to the SUTNVA Assumption in Forastiere et al.112020). We denote by N gik the set of units defining the network exposure, that is, N gik = { j ∈ V k :if W (cid:48) jk = W jk then g ( W (cid:48) k/i ) = g ( W k/i ) , ∀ W (cid:48) k/i (cid:54) = W k/i } = { j ∈ V k : δ ij ( A k , X k ) (cid:54) = 0 } . In most ofthe literature on spillover effects this set is either the cluster k (Hudgens and Halloran; 2008) orthe neighborhood of unit i (Forastiere et al.; 2020). Alternative specifications are also possible andmight involve higher-order neighbors.For the purpose of assessing effect heterogeneity using tree-based methods, we will further makethe following assumption: Assumption 3 (Discrete Network Exposure) . We consider a discrete exposure mapping function g ( · ) : { , } n k − −→ ∆ ik ⊂ Z . Z is the set of integers. This assumption implies that the network exposure G ik is a discrete variableFor instance, we can define a binary network exposure based on a threshold function applied to thenumber of treated neighbors: G ik = (cid:18)(cid:0) (cid:88) j ∈N ik W jk (cid:1) ≥ q (cid:19) , (1)where q is a threshold. Hence, the g ( · ) exposure mapping function behaves like a threshold functionwhich sums the elements of its argument (i.e the treatment assignment vector that characterizes theneighborhood of each unit) taking the value of 1 if the resulting value exceeds a certain threshold(e.g., at least one treated neighbor is treated, the majority of the neighbors are treated, ...). Inour simulation study as well as in the application we have chosen the following definition: G ik = ( (cid:0)(cid:80) j ∈N ik W jk ) ≥ (cid:1) , that is, the network exposure is 1 if at least one network neighbor is treated.As a consequence, both the individual treatment and the network exposure are defined as binaryvariables, W ik ∈ { , } and G ik ∈ { , } . It follows that the support of the joint treatment variable( W ik , G ik ) is finite and comprises four possible realizations, given by the combination of the twomarginal domains. Hence, ( W ik , G ik ) ∈ { ( w, g ) = (0 , , (1 , , (0 , , (1 , } .A discrete network exposure is crucial for our causal tree algorithm, at least in the versionproposed in this paper. Indeed, the algorithm relies on the presence of enough observations foreach treatment and exposure value to allow the estimation of the causal effects. Depending onthe stopping rule which might rely on the accuracy of the estimation of conditional effects or onthe number of observations (see Section 4), if the sample size is not large enough with respect tothe number of categories of the network exposure and/or its distribution is non-uniform and highly12kewed, the network causal tree algorithm might result in a tree with low depth and low granularity,that is, with highly heterogeneous causal effects even within the terminal leaves. Therefore, themaximum number of categories for the network exposure depend on the sample size, the numberof covariates and their nature, as well as on the extent of the heterogeneity in the causal effects. In this work, we consider an experimental design with a unit-level randomization of the treatment,which is independent between clusters but might be dependent within them. Therefore, the treat-ment vector W is a random vector with probability distribution P ( W = w ) and the followingassumption holds. Assumption 4 (Independent treatment allocation between clusters) . P ( W = w ) = K (cid:89) k =1 P ( W k = w k ) where W k is the treatment vector in each cluster k . We denote by π Wik the unit-level probability that W ik is equal to 1, under the experimentaldesign in place. In a randomized experiment π Wik is known. In the case of a Bernoulli trial, whereeach unit is independently assigned to the individual treatment, π Wik is constant and equal to α . An example of a design with randomization independent between clusters but dependent withinclusters is that of a completely randomized experiment taking place in each cluster. In this case, π Wik would be equal to m k /n k , where m k is the fixed number of treated units, and the treatmentassignment for each unit does depend on the treatment status of other units.Since the network exposure is a deterministic function g ( · ) of the cluster assignment vector W k/i , then the randomization distribution P ( W = w ) induces, together with the definition ofthe function g ( · ), a probability distribution of the vector of network exposures G in the wholesample. Hence, the probability for a unit of being exposed to a specif value of the network exposure G ik = g given the individual treatment w , denoted by π G | Wik ( g | w ), is known and can in principle becomputed from the probability distribution P ( W ). Note that, we can drop the dependency fromthe individual treatment and write π Gik ( g ) when the randomization is independent between units. The unit-level assignment probability could also vary across clusters as in a two-stage randomization (Hudgensand Halloran; 2008). { , } × (cid:83) ik ∈ V ∆ ik be the domain of the joint individual and network treatment status,that is, ( w, g ) ∈ ∆. Let π ik ( w, g ) denote the marginal probability for unit ik of being assigned toindividual treatment w and being exposed to the network status g . This is equal to the expectedproportion of assignment vectors w inducing an individual treatment w and a network exposure g : π ik ( w, g ) = (cid:88) w ∈{ , } N ( W ik = w, G ik = g ) P ( W = w ) = ( π Wik ) w (1 − π Wik ) − w × π G | Wik ( g | w ) . (2)This marginal probability is a crucial component of the Horvitz-Thompson estimator for causaleffects under network interference. For instance, if the experimental design is a Bernoulli trialwith unit-level probability α and the network exposure is defined by a threshold function on theneighborhood as in Equation 1, then the joint probability could be computed as follows: π ik ( w, g ) = α w (1 − α ) − w × (cid:20) − h − (cid:88) l =0 (cid:18) N ik l (cid:19) p l (1 − p ) N ik − l (cid:21) g (cid:20) q − (cid:88) l =0 (cid:18) N ik l (cid:19) p l (1 − p ) N ik − l (cid:21) − g (3)where N ik is the number of neighbors (‘degree’) of unit ik .To deal with well-defined potential outcomes, we must assume that each unit has a non-zeroprobability of being exposed to each ( w, g ): Assumption 5 (Positivity) . π ik ( w, g ) > ∀ i ∈ N , k ∈ K and ∀ ( w, g ) ∈ ∆ . When π ik ( w, g ) = 0 for some units, then the average potential outcomes and causal effects involvingthese values z and g must be restricted to the subset of units for whom π ik ( w, g ) >
0. For instance,if the network exposure is defined as in Equation 1, then the positivity assumption is violated forunits who cannot be exposed to a value g , that is, those with a degree N ik lower than the threshold q . Consequently, the analysis must be restricted only to the subset of the population satisfying thepositivity criterion.The estimator that we propose below also requires the so-called pairwise exposure probabilities ,which describes the joint probability for pairs of units of being exposed to a given individualtreatment and network status. Hence, given specific exposure conditions ( w, g ) and ( w (cid:48) , g (cid:48) ) , apairwise exposure probability, denote by π ikjh ( w, g ; w (cid:48) , g (cid:48) ), quantifies the probability that the twoevents ( W ik = w, G ik = g ) and ( W jh = w (cid:48) , G jh = g (cid:48) ) occur, i.e., π ikjh ( w, g ; w (cid:48) , g (cid:48) ) = P ( W ik =14 , G ik = g, W jh = w (cid:48) , G jh = g (cid:48) ). In general, this can be written as: π ikjk (cid:48) ( w, g ; w (cid:48) , g (cid:48) ) = (cid:88) w ∈{ , } N ( W ik = w, G ik = g, W jk (cid:48) = w (cid:48) , G jk (cid:48) = g (cid:48) ) P ( W = w ) . (4)Under the event of both units being exposed to the same condition ( w, g ) we denote the pairwiseexposure probability by π ikjh ( w, g ).In the case of an experimental design assigning treatment independently between clusters, underthe clustered network interference the two events ( W ik = w, G ik = g ) and ( W jh = w (cid:48) , G jh = g (cid:48) ),with k (cid:54) = h , are independent and the pairwise exposure probability equals the product of the twojoint probabilities: π ikjh ( w, g ; w (cid:48) , g (cid:48) ) = π ik ( w, g ) × π jh ( w (cid:48) , g (cid:48) ). In the case of a Bernoulli trial andthe network exposure defined on the neighborhood only, this is also true for units i and j belongingto the same cluster, i.e., k = h , but are not connected and do not share any neighbors. In general,let N wgik = N gik ∪ ik . Even under independent treatment assignment, if N wgik ∩ N wgjk (cid:48) (cid:54) = 0 the jointtreatment of the units ik and jk (cid:48) will be dependent and π ikjk (cid:48) ( w, g ; w (cid:48) , g (cid:48) ) (cid:54) = π ik ( w, g ) π jk (cid:48) ( w (cid:48) , g (cid:48) ) (cid:54) =0. Note that if π ik ( w, g ) or π jh ( w (cid:48) , g (cid:48) ) = 0 ⇒ π ikjh ( w, g ; w (cid:48) , g (cid:48) ) = 0, but not the reverse. Indeed,the joint probability of the two events ( W ik = w, G ik = g ) and ( W jh = w (cid:48) , G jh = g (cid:48) ) might bezero if the network exposures G ik and G jh are defined on two subsets of units that coincide orinclude jh and ik , respectively. For example, if the network exposure is defined as in Equation 1with threshold equal to 1 (i.e., having at least one treated neighbor) then, if unit ik is treated, with W ik and belongs to the neighborhood N jh of unit jh , the network exposure G jh cannot be 0. Our ultimate goal is to detect the regions of the covariate space exhibiting a high level of het-erogeneity in the causal effects and estimate the causal effects of interest in these heterogeneousregions. In this section, we will focus on the definition and estimation of conditional treatmentand spillover effects and we will assume that the heterogeneous regions that we want to investigatehave already been identified, either a priori according to subject-matter knowledge or thanks to15ata-driven methods.Let us denote with Π a partition of the covariate space X into M non-overlapping regions:Π = { (cid:96) , . . . , (cid:96) M } , where (cid:83) Mm =1 (cid:96) m = X , and with (cid:96) ( x , Π) :
X →
Π a function that maps each vector x of the covariate space into a region. Let N ( (cid:96) m ) be the size of each region (cid:96) m , with m = 1 , . . . , M ,and let V k ( (cid:96) m ) be the subset of units belonging to region (cid:96) m in cluster k , with k = 1 , . . . , K . Inthe machine learning literature on CART, these non-overlapping regions are referred to as leaves .For consistency, throughout we will use this terminology, regardless of whether the partition Π hasbeen a priori defined or is the result of a tree-based algorithm. In addition, to ease notation, wewill drop the reference to the partition Π from the mapping function (cid:96) ( · ).When units are organized in a network, it is worth noting that a partition Π of the covariatespace divides the sample units into sub-populations according to similarities in their characteristics,regardless of their network distance. Hence, two connected units might belong to different regionsof the partition. However, in an homophilous network, where the probability of forming a linkdepends on the similarity in certain features and, hence, connected units are likely to share similarcharacteristics, a partition of the the covariate space is also likely to cluster connected units together.Figure 2: Partition of the covariate space with connected units.16iven a partition Π, we now define conditional average potential outcomes under each individualtreatment and network exposure condition ( w, g ) ∈ ∆. For the subset of units S m with covariatevectors x ∈ X that are mapped to the same region by the function (cid:96) ( x ), i.e., S m = { ik ∈ V : (cid:96) ( X ik ) = (cid:96) m } , we define the leaf-specific average potential outcome under treatment and exposurecondition ( w, g ) ∈ ∆ as follows: µ ( w,g ) ( (cid:96) ( x )) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 Y ik ( w, g ) ( X ik ∈ (cid:96) ( x )) . (5)Note that µ ( w,g ) ( (cid:96) ( x )) is a sample average, that is, it is the average potential outcome for all unitsin the sample V with a covariate vector mapped to the same region (cid:96) ( x ).Leaf-specific conditional average causal effect (CACE) can be defined by comparing averagepotential outcomes under two different conditions: τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 Y ik ( w, g ) ( X ik ∈ (cid:96) ( x )) − N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 Y ik ( w (cid:48) , g (cid:48) ) ( X ik ∈ (cid:96) ( x ))= µ ( w,g ) ( (cid:96) ( x )) − µ ( w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) . (6) We denote by T the set of possible contrast we are interested in. For instanceif both the individual treatment and the network exposure are binary, then T ⊆{ (1 ,
0; 0 , , (1 , , , , (0 ,
1; 0 , , (1 , , , , (1 , , , } . We define as leaf-specific treatment effects causal contrasts τ ( w,g ; w (cid:48) ,g ) ( (cid:96) ( x )) that keep the network exposure fixed at a level g while changingthe individual treatment from w (cid:48) to w , that is, when g = g (cid:48) . These represent causal effects of re-ceiving the treatment while the treatment status of all other units is kept fixed or is mapped to thesame network exposure g . On the contrary, we define as leaf-specific spillover effects causal contrasts τ ( w,g ; w,g (cid:48) ) ( (cid:96) ( x )) that keep the individual treatment fixed at a level w while changing the networkexposure from g (cid:48) to g , that is, when w = w (cid:48) . These spillover effects can be seen as causal effectsof a change in the treatment status of other units such that the network exposure also changes,while the individual treatment status is kept fixed. It should be emphasized that µ ( w,g ) ( (cid:96) ( x )) cor-responds to a unit-level intervention setting the treatment and network exposure of each unit tospecific values. The focus on these type of average potential outcomes, as opposed to the ones basedon population-level hypothetical interventions as in Hudgens and Halloran (2008), is due to ourpurpose of investigating heterogeneous responses to the individual treatment and network status17cross units with different characteristics. If one is interested in assessing the heterogeneity of theaverage response to the network exposure resulting from a hypothetical treatment allocation, ourapproach could be extended to marginalized causal effects as the ones in Forastiere et al. (2020). Here we develop an Horvitz-Thompson estimator for leaf-specific conditional average causal effects.The derivation of the proposed estimator builds upon the estimator for average causal effects undernetwork interference proposed by Aronow and Samii (2017).Following Horvitz and Thompson (1952) and Aronow and Samii (2017), a design-based esti-mator for the leaf-specific average potential outcome under individual treatment w and networkexposure g , µ ( w,g ) ( (cid:96) ( x )), can be expressed as: (cid:98) µ ( w,g ) ( (cid:96) ( x )) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 Y ik π ik ( w, g ) ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) (7)where π ik ( w, g ) denotes the probability of a given unit ik , that belongs to the leaf (cid:96) ( x ) (in thepartition Π), to be exposed to the treatment condition ( w, g ).The variance estimator of (cid:98) µ ( w,g ) ( (cid:96) ( x ) can be expressed as: (cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x ))[1 − π ik ( w, g )] (cid:20) Y ik π ik ( w, g ) (cid:21) + 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) ( W jk = w, G jk = g, X jk ∈ (cid:96) ( x )) × π ikjk ( w, g ) − π ik ( w, g ) π jk ( w, g ) π ikjk ( w, g ) Y ik π ik ( w, g ) Y jk π jk ( w, g ) . (8) This expression extends the variance estimator derived in Aronow and Samii (2017) (Equation 7)to the case of conditional average potential outcomes and clustered network interference. In fact, thesecond term in (8) includes the covariance between the individual treatment and network exposureof two units belonging to the same leaf (cid:96) ( x ). Under an experimental design with independenttreatment allocation between clusters and under clustered interference such covariance betweentwo units belonging to different clusters is zero and the second term should be restricted to units j in the same cluster as i . In addition, the covariance between the joint treatment of two units is18on-zero if the set of units defining the network exposure – e.g., the whole cluster or the unit’sneighborhood – is shared between them or includes them. Formally, if N wgik ∩ N wgjk (cid:48) (cid:54) = 0 the jointtreatment of the units ik and jk (cid:48) will be dependent, that is, π ikjk (cid:48) ( w, g ) − π ik ( w, g ) π jk (cid:48) ( w, g ) (cid:54) = 0 .Hence, two units belonging to the same leaf are more likely to have intersecting sets N wgik and N wgjk (cid:48) (e.g., shared neighbors) if the sets are homogeneous, that is, units belonging to these sets sharesimilar characteristics. Settings with homophilous networks are investigated in the appendix.An estimator for the leaf-specific conditional average causal effect of the exposure condition( w, g ) compared to the configuration ( w (cid:48) , g (cid:48) ) can be written as: (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) = (cid:98) µ ( w,g ) ( (cid:96) ( x )) − (cid:98) µ ( w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) . (9)The estimated variance of the estimator (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) can be decomposed as follows: (cid:98) V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17) = (cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) + (cid:98) V (cid:16)(cid:98) µ ( w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17) − (cid:104)(cid:98) C (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) , (cid:98) µ ( w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17)(cid:105) . (10)with the covariance estimator taking the following expression for the case when π ikjk ( w, g ; w (cid:48) , g (cid:48) ) > ∀ i, j, k : (cid:98) C (cid:16)(cid:98) µ w,g ( (cid:96) ( x )) , (cid:98) µ w (cid:48) ,g (cid:48) ( (cid:96) ( x )) (cid:17) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) ( W jk = w (cid:48) , G jk = g (cid:48) , X jk ∈ (cid:96) ( x )) π ikjk ( w, g ; w (cid:48) , g (cid:48) ) × [ π ikjk ( w, g ; w (cid:48) , g (cid:48) ) − π ik ( w, g ) π jk ( w (cid:48) , g (cid:48) )] Y ik π ik ( w, g ) Y jk π jk ( w (cid:48) , g (cid:48) ) − N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i (cid:104) ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) Y ik π ik ( w, g )+ ( W ik = w (cid:48) , G ik = g (cid:48) , X ik ∈ (cid:96) ( x )) Y ik π ik ( w (cid:48) , g (cid:48) ) (cid:105) . (11) Further details about the variance estimator of leaf-specific CACE can be found in Appendix B.
Here we will describe the properties of the Horvitz-Thompson estimator of leaf-specific causaleffects. Asymptotic results will rely on a growth process that is commonly assumed with clusterdata. In particular, we consider a sequence of nested samples V of size N, where V consists of Kseparate clusters V k of size n k , k = 1 , . . . , K . We let the sample size N −→ ∞ by letting the number19f clusters go to infinity, i.e., K −→ ∞ , while the cluster size n k , k = 1 , . . . , K remains fixed. Proposition 1 (Unbiasedness) . E (cid:104)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:105) = µ w,g ( (cid:96) ( x )) and E (cid:104)(cid:98) τ ( w,g,w (cid:48) g (cid:48) ) ( (cid:96) ( x )) (cid:105) = τ w,g ; w (cid:48) ,g (cid:48) ( (cid:96) ( x )) . Proof.
Proof in Appendix A.The unbiasedness of the estimator of leaf-specific CACE is conditional on the partition Π and thefunction (cid:96) ( · ). When building causal trees to assess the heterogeneity of causal effects, we will relyon this property to derive the splitting criterion and to estimate leaf-specific causal effects. Proposition 2 (The variance estimator of (cid:98) µ ( w,g ) is unbiased) . If π ikjk ( w, g ) > ∀ i, j, k then E (cid:104)(cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17)(cid:105) = V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x ) (cid:17) . The proof follows directly from the unbiasedness of the Horvitz-Thompson estimator. A conservativeestimator for the case when π ikjk ( w, g ) = 0 for some units can be found in Appendix B. Proposition 3 (The variance estimator of (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) is conservative) . E (cid:104)(cid:98) V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17)(cid:105) ≥ V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17) . A proof follows from Aronow and Samii (2017). The restriction on the covariances does not changethe proof.
Proposition 4 (Consistency) . Consider the asymptotic regime where the number of clusters K goto infinity, i.e., K −→ ∞ , while the cluster size remains bounded, i.e., n k ≤ B ( (cid:96) ( x )) ≤ B for someconstant B. In addition, assume that | Y ik ( w, g ) | /π ik ( w, g ) ≤ C < , ∀ i, k, w, g . Then as K −→ ∞ (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) p −→ τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) . The unbiasedness of the estimator (cid:98) τ w,g,w (cid:48) ,g (cid:48) ( (cid:96) ( x ) does not ensure the identification of subsets with the highestheterogeneity. The performance of the causal tree in identifying heterogeneous regions depends on the splittingcriterion, the algorithm and the sample. roof. See Appendix A.Note that cluster network interference and independent treatment allocation between clusters en-sure that the amount of dependence across units is limited. This limited independence is the con-dition required to ensure consistency (Aronow and Samii; 2017). Proposition 5 (Asymptotic Normality) . Given an cluster independent design and the clusterednetwork interference assumption, then: (cid:112) N ( (cid:96) ( x )) (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) − τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17) d −→ N (cid:18) , V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17)(cid:19) . An independent treatment allocation between clusters and the clustered network interferenceensure the limited dependence condition required in Aronow and Samii (2017). This conditionallows us to rely on a central limit theorem derived via Stein’s method (Chen and Shao; 2004) toachieve the asymptotic normality of the estimator. The variance estimators will depend on the sizeof the sample belonging to leaf (cid:96) ( x ) in each cluster, i.e., n k ( (cid:96) ( x )) ≤ B ( (cid:96) ( x )) ≤ B, k = 1 = . . . , K ,and the maximum conditional degree: D ( (cid:96) ) = max ik ∈ V : X ik ∈ (cid:96) ( x ) N gik ( (cid:96) ( x )) where N gik ( (cid:96) ) = N gik ∩ V k ( (cid:96) ( x )) . Given that these quantities are bounded, we can show that V (cid:16) V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17)(cid:17) = O (1 /N ( (cid:96) ( x ))), that is, the rate of convergence will be 1 / (cid:112) N ( (cid:96) ( x )), with N ( (cid:96) ( x )) ≤ KB ( (cid:96) ( x ))(the proof follows the one in Aronow and Samii (2017)).It is worth noting that, thanks to the asymptotic growth assumed here, Proposition 5 would stillhold if the covariates’ vector X would include network covariates defined both at the cluster-level orat the individual-level. This result would allow us to investigate the heterogeneity of causal effectswith respect to network characteristics, including variables defining a cluster network structure orthe structure of the network neighborhood around a node. Note that for the variance of the estimator to go to zero as N −→ ∞ one must have that: (cid:88) ik (cid:88) jk (cid:48) ( X ik ∈ (cid:96) ( x ) , X jk ∈ (cid:96) ( x )) (cid:0) π ikjk (cid:48) ( w, g ) − π ik ( w, g ) π jk (cid:48) ( w, g ) (cid:54) = 0 (cid:1) = o ( N ( (cid:96) ( x )) ) . This is guaranteed given that the joint treatment is independent between units in different clusters, that is, π ikjk (cid:48) ( w, g ) = π ik ( w, g ) π jk (cid:48) ( w, g ) , ∀ k (cid:54) = k (cid:48) , and given that the cluster size is bounded. Network Causal Trees for Heterogeneous Causal Effects underClustered Network Interference
In the previous section we have introduced and developed an estimator for causal effects conditionalon sub-populations of units defined by a partition Π of the covariate space X . Here we developa data-driven machine learning algorithm to identify the partition Π aimed at investigating theheterogeneity in the effects of interest.Our proposed algorithm, named Network Causal Tree (NCT), builds upon the Causal Tree(CT) algorithm introduced by Athey and Imbens (2016), which in turn finds its roots in theClassification and Regression Tree (CART) algorithm (Friedman et al.; 1984). CART is a widelyused nonparametric method to partition the feature space. It relies on a tree-based algorithmwhich recursively splits the sample. In particular, trees are constructed by recursively partitioningthe observations from the root (that contains all the observations in the learning samples) intotwo child nodes . This procedure is repeated until the tree reaches the final nodes which are called leaves . Because each node is always split into two sub-nodes, these trees are called binary trees .Binary trees are called regression trees when the outcome is a continuous variable, while they arecalled classification trees when the outcome is either a discrete or a binary variable. The aim of thetree construction is to identify heterogeneities in the relationship between the observed outcomeand the features to best predict the outcome variable. Therefore, splits are made with the aim ofminimizing the prediction error. With this aim different splitting criteria could be specified. Foradditional details on CART, we refer to the seminal paper by Friedman et al. (1984). Figure 3illustrates an example of binary partitioning in a simple case with just two predictors x ∈ [0 , x ∈ [0 , < . x > . l l No Yes l No YesFigure 3: (Left) An example of a binary tree. The internal nodes are labelled by their splitting rulesand the terminal nodes are labelled with the corresponding parameters l i .(Right) The corresponding partition of the sample space.clustered network interference, and (ii) it possibly models heterogeneity with respect to more thanone effect at the same time through a composite splitting criterion. In this Section, we describe andmotivate the splitting criteria for our NCT algorithm (Subsection 4.1) and its detailed structure(Subsection 4.2). The NCT algorithm is built to detect and estimate heterogeneous treatment and spillover effects, inthe presence of clustered network interference. Moreover, NCT is able to discover the heterogeneitywith respect to more than one estimand. Here, we present the three criteria that rule the splittingprocedure of NCT, targeted to single effects or multiple effects.
Let P be the space of partitions. Given a causal effect τ ( w,g ; w (cid:48) ,g (cid:48) ) , we can use recursive splitting tolook for the best partition Π ∈ P with respect to a splitting criterion Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π). Formally:Π = argmax Π ∈ P Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) . (12)23iven our goal of describing the relationship between the causal effect and the covariate space anddetecting subsets that exhibit a high level of heterogeneity, we can define a splitting criterion thatmaximizes accuracy in the prediction of conditional effects τ ( w,g ; w (cid:48) ,g (cid:48) ) ( X ik ) in the whole sample V.This translates into the minimization of expected value of the mean square error (MSE): Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) = − EM SE (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x ) , Π) (cid:17) = − E (cid:104)(cid:16) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( x ) − (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x , Π) (cid:17) (cid:105) (13)where the expected value is taken over the sampling distribution. When this splitting criterion isused to select the partition Π, we maximize the function in (13) evaluated in the sample used to buildthe tree, i.e., the training set . In this case, in the machine learning literature the objective functionis referred to as the in-sample splitting function and we denote this by Q in ( w,g ; w (cid:48) ,g (cid:48) ) (Π). As opposedto the EMSE of the observed outcome prediction, the true causal effect τ w,g ; w (cid:48) ,g (cid:48) ( x ) is unknown.However, we can use the training data to estimate the EMSE for the in-sample spitting rule. Thanksto the unbiasedness of the estimator (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x ) , Π) with respect to the population causal effect E [ τ ( w,g ; w (cid:48) ,g (cid:48) ) ( X ik ) | X ik ∈ (cid:96) ( x )] (Proposition 6 in Appendix A), following Athey and Imbens (2016)we can estimate the EMSE as follows: Q in ( w,g ; w (cid:48) ,g (cid:48) ) (Π) = − (cid:92) EM SE (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x ) , Π) (cid:17) = 1 N tr (cid:88) k ∈K tr n k (cid:88) i =1 (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( X ik , Π)) (cid:1) (14)where K tr is the subset of clusters belonging to the training set and N tr is the sample size. There-fore, the maximization of this splitting function results in the maximization of the heterogeneityacross leaves. In fact, if two sub-populations (cid:96) and (cid:96) have a different causal effect τ ( w,g ; w (cid:48) ,g (cid:48) ) , i.e., τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:54) = τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ), a partition Π that splits them would yield a higher Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π)than the partition Π c that combines the two sub-populations into one leaf (cid:96) (a simple proof canbe found in Appendix A).To avoid using the same information for selecting the partition and for the estimation, Atheyand Imbens (2016) propose to estimate the effects in a separate sample than the one used to buildthe tree. They call this an ‘honest’ causal tree. We denote by V est the estimation set and by V tr thetraining set (or discovery set) that is used to build the tree (by evaluating the splitting function). In standard CART the training set is a subset of the whole sample together with the testing set, which is usedto evaluate the objective function in order to choose the best partition selected in the training set that maximizesout-of-sample prediction accuracy. K tr and K est of the clusters in K . Hence, V est = (cid:83) k ∈K est V k and V tr = (cid:83) k ∈K tr V k = V /V est . This randomsplit of the sample avoids any dependencies between training and estimation sub-samples. In this‘honest’ version, the splitting function can be estimated as follows: Q in,H ( w,g ; w (cid:48) ,g (cid:48) ) (Π) = 1 N tr (cid:88) k ∈K tr n k (cid:88) i =1 (cid:0) ˆ τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( X ik , Π)) (cid:1) − (cid:16) N tr + 1 N est (cid:17) (cid:88) (cid:96) ∈ Π (cid:98) V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ; Π) (cid:17) (15)where N est = | V est | . The proof follows from Athey and Imbens (2016). In (15) we can see that thesplitting function is such that splits will be chosen so to maximize the heterogeneity across leavesas well as to minimize the average variance in the estimated effects. The idea is to identify themost heterogeneous partitions while introducing a penalization term which corrects the objectivefunction to minimize the leaf-specific variation in the estimated effect. This penalization term hasalso the effect of reducing the depth of the tree, because leaves with a small number of observations N ( (cid:96) ) will exhibit a higher variance. Hence, the final depth of the tree will depend on the samplesize as well as on the randomization design and the exposure mapping function. In addition tothis penalization, we will also add a stopping rule based on a minimum number of observations foreach condition ( w, g ) and ( w (cid:48) , g (cid:48) ) that we are comparing. This stopping rule is required to avoidhaving leaves where the effect τ ( w,g ; w (cid:48) ,g (cid:48) ) cannot be estimated because there are no observationswith observed treatment ( W ik , G ik ) equal to the values ( w, g ) or ( w (cid:48) , g (cid:48) ). We now introduce a composite splitting rule targeted to multiple causal estimands. When interfer-ence is in place targeting strategies might involve both treatment and spillover effects. For example,in settings with limited resources the treatment should be provided to those who would benefit fromit, i.e., with a non-zero treatment effect, whereas we could save resources by not giving the treat-ment to those who would benefit from other people being treated, namely units with high spillovereffects. For instance, this is the case in marketing interventions where we can provide advertise-ments only to those who would be affected and who are less likely to get the information fromsomeone else. Another interesting example can be found in the potential challenges of the COVID-19 vaccine distribution. Those at high risk of getting infected, even if those in close contact were25mmune, should be targeted. These would be individuals who are often in crowded spaces, either inthe workplace or in a social setting, and would likely get infected in these environments. Therefore,these individuals are characterized by a high treatment effect even with all close contacts beingtreated. On the contrary, those who are in contact with a low number of people and can greatlygain from having one of these contacts vaccinated could be left without the vaccine, at least in theearly stages of the distribution.For these kind of targeting strategies involving more than one causal effect we must partitionthe population into sub-groups that show high level of heterogeneity in all estimands of interest.Building a separate tree for each causal effect would provide us with different partitions that cannotbe used for the design of multi-effect strategies. Therefore, we propose a composite splitting functionthat would result in a tree that maximises heterogeneity in all the causal estimands of interest.This composite objective function is a weighted average of the effect-specific splitting functions: Q T (Π) = (cid:88) ( w,g ; w (cid:48) ,g (cid:48) ) ∈T γ ( w,g ; w (cid:48) ,g (cid:48) ) Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) with γ ( w,g ; w (cid:48) ,g (cid:48) ) = ω ( w,g ; w (cid:48) ,g (cid:48) ) (cid:0) ˆ τ ( w,g ; w (cid:48) ,g (cid:48) ) (cid:1) (16)where ω ( w,g ; w (cid:48) ,g (cid:48) ) ∈ [0 ,
1] is a customized weight for each estimand and ˆ τ ( w,g ; w (cid:48) ,g (cid:48) ) is the estimatedeffect in the whole sample. Each effect τ ( w,g ; w (cid:48) ,g (cid:48) ) , where ( w, g ; w (cid:48) , g (cid:48) ) ∈ T , contributes to theglobal objective function according to a specific weight γ ( w,g ; w (cid:48) ,g (cid:48) ) . γ ( w,g ; w (cid:48) ,g (cid:48) ) is proportional to acustomized weight ω ( w,g ; w (cid:48) ,g (cid:48) ) , which is set by the researcher according to the extent to which theestimand τ ( w,g ; w (cid:48) ,g (cid:48) ) is of interest, and is normalized by the the estimated effect in the whole sampleto rule out any dependence on the magnitude of the effect. The composite criterion requires that atleast two of the four weights are strictly greater than zero. A similar composite objective functioncan be derived from the splitting functions for the ‘honest’ causal trees: Q H T (Π) = (cid:88) ( w,g ; w (cid:48) ,g (cid:48) ) ∈T γ ( w,g ; w (cid:48) ,g (cid:48) ) Q H ( w,g ; w (cid:48) ,g (cid:48) ) (Π) with γ ( w,g ; w (cid:48) ,g (cid:48) ) = ω ( w,g ; w (cid:48) ,g (cid:48) ) (cid:0) ˆ τ ( w,g ; w (cid:48) ,g (cid:48) ) (cid:1) . (17) Compared with the standard HCT algorithm, the main novelties of NCT are the introduction ofinterference and the possibility of including more than one effect. Specifically, the extent to whicheach effect τ ( w,g ; w (cid:48) ,g (cid:48) ) , with ( w, g ; w (cid:48) , g (cid:48) ) ∈ T , contributes to the determination of the tree is specified26y the weight w ( w, g ; w (cid:48) , g (cid:48) ). Here we describe the key steps of the NCT algorithm, including therecursive partitioning based on the splitting functions and the stopping rules. The proposed algorithm takes mainly six elements as inputs:1. the sample V , which collects for each unit ik the individual treatment assignment status W ik ,the observed outcome Y ik and a vector of characteristics X ik ;2. the network information, which is fully described by the global adjacency matrix A , includingthe cluster-specific blocks A k ;3. the specification of the exposure mapping function g ( · ) which, together with the adjacencymatrix and possibly covariate matrix, will be translated in the computation of the observednetwork exposure G ik for each unit;4. the experimental design which will determine the computation of the probabilities π ik ( w, g )and π ik ( w, g, w (cid:48) , g (cid:48) );5. the weight ω w,g,w (cid:48) g (cid:48) for each causal effect;6. the specification of the two parameters: maximum depth , that is the maximum depth of thetree, and the minimum size , that is the minimum number of units falling in each exposurecondition ( w, g ) in each leaf.After some preliminary steps, the algorithm consists of two main steps. The first step is focused onthe selection of the partition, i.e. the tree, while the second step is concerned with the estimationof causal effects and returns point estimates and standard errors of the conditional average causaleffects, for all the comparisons of interest and within each leaf of the detected partition. We reportbelow the key steps of the NCT algorithm:0. Step 0 (Preliminaries): In a preliminary stage the algorithm computes the quantities andtools that will be used in the subsequent steps. In particular:(a) Given the adjacency blocks A k and potentially the covariate matrix X , for each unit the network exposure variable G ik is computed according to the rule expressed in Assumption3. 27b) The joint exposure probabilities π ik ( w, g ) and π ikjk (cid:48) ( w, g ; w (cid:48) , g (cid:48) ) are computed (as inSubsection 2.3) or estimated (as in Aronow and Samii (2017)).(c) Finally, the algorithm randomly splits the clusters between the training set V est and theestimation set V est . Step 1 (Tree Discovery): the first step of the algorithm sprouts the tree structure of the NCT,that is, it detects the relevant heterogeneous partitions. Note that this step is performed overthe discovery set only. In particular, the NCT algorithm works with the clusters belonging tothe set K tr and builds the tree using a binary recursive partitioning.(a) Recursive Partitioning. the algorithm grows a tree by maximizing the in-sample splittingcriterion at each binary split. At iteration r − r − = { x ∈ X : r − (cid:92) m =1 x m ∈ A h m m } h ∈{ L,R } r − where x m ∈ { x p } p =1 ,...,P is the feature that was split at iteration m and A Lm = { x m ≤ c m } , A Lm = { x m c m } for some cutoff point c m . The variable x m split at iteration m together with the cutoff point c m compose a node of the tree. At iteration r , the partitionwill be complemented with a split of a variable x r ∈ { x p } p =1 ,...,P / { x m } m =1 ...,r − at somecutoff point c r : Π r = { x ∈ X : r − (cid:92) m =1 x m ∈ A h m m (cid:92) x r ∈ A h r r } h ∈{ , } r . Among all the candidate splits x r and c r , the algorithm will choose the one that maxi-mizes the in-sample splitting function in (14) or (15).(b) Stopping Rule. The recursive partitioning stops when at least one stopping condition is met: (i) the NCT has reached the specified maximum depth ; (ii) the current split r generate at least one leaf (cid:96) where the set of units N ( (cid:96) ) ( w,g ) = { ik ∈ V tr : X ik ∈ (cid:96), W ik = w, G ik = g } with a number of observations | N ( (cid:96) ) ( w,g ) | lower than the specified minimumsize , for at least one exposure condition ( w, g ). Following Athey and Imbens (2016) we suggest to assign half of the clusters to the discovery sample and anotherhalf to the estimation sample. network causal tree which corresponds to a partition Π of the featurespace X into M leaves: Π = { (cid:96) , . . . , (cid:96) M } , with (cid:83) Mm :1 (cid:96) m = X and l ( x , Π) :
X →
Π.2.
Step 2 (Estimation): the second step of the algorithm takes as input the Network CausalTree Π built in Step 1 and computes all the point estimates, the standard errors and theconfidence intervals of the leaf-specific causal effects of interest in all its nodes (cid:96) ( x , Π). Thisis done using the Horvitz-Thompson estimator in Section 3. In the ‘honest’ version, at thisstage the NCT algorithm works with the clusters belonging to the set K est . Algorithm 1
Overview of the NCT algorithm • Inputs : i) observed data { W ik , Y ik , X ik } ik ∈ V ; ii) global adjacency matrix A , which comprisesthe cluster-specific blocks A k ; iii) experimental Design; iv) vector of weights ω ( w, g ; w (cid:48) , g (cid:48) ),where ( w, g ; w (cid:48) , g (cid:48) ) ∈ T ; v) tree parameters: maximum depth and minimum size . • Outputs : (1) a partition Π if the covariate space, and (2) point estimates, standard errorsand confidence intervals of the conditional average causal effects:1. Step 0 (Preliminaries): compute G ik and both the marginal and joint exposure proba-bilities π ik ( w, g ) and π ikjk (cid:48) ( w, g ; w (cid:48) , g (cid:48) ). Then, randomly assign clusters to discovery andestimation samples.2. Step 1 (Tree Discovery): using the discovery sample, build a tree according to the in-sample splitting criterion and stop when either the tree has reached its maximum depthor any additional split would generate leaves, which are not sufficiently representative ofthe four exposure conditions.3. Step 2 (Estimation): use the Horvitz-Thompson estimator on the estimation sample toestimate the leaf-specific CACE and their standard errors in each leaf. Our algorithm provides an interpretable method to detect and estimate heterogeneous effects in thepresence of clustered network interference. In this section we evaluate, through a set of simulations,the performance of the proposed algorithm with respect to both discovery and estimation. Inparticular, we investigate its ability to correctly identify the actual heterogeneous sub-populations,comparing the use of single or composite splitting functions, and we assess the performance ofthe Horvitz-Thompson estimator for leaf-specific treatment and spillover effects. While the latter29erformance assessment is quite standard in the literature, the former is critical for the developmentof interpretable algorithms for heterogeneous causal effects (Bargagli-Stoffi and Gnecco; 2020; Leeet al.; 2020). We evaluate the performance of the algorithm and the estimator in settings that differwith respect to three main factors: (1) the structure of the heterogeneity, (2) the extent of theeffect heterogeneity, and (3) the number of clusters. In Appendix C, we also consider two additionalfactors: (4) the correlation structure in the covariate matrix and (5) the presence of homophily inthe network structure. Regarding the structure of the heterogeneity, we are particularly interestedin settings where the structure of the causal tree representing heterogeneity is different for eachcausal effect. In particular, causal trees differ if they have different nodes corresponding to the splitof a feature, that is, if covariates driving the heterogeneity are different, or if they have differentterminal leaves where the causal effect is heterogeneous, i.e., non-zero. We call causal rules theseheterogeneous terminal leaves.For each simulation scenario we simulated M = 500 samples and applied our NCT algorithmto detect sub-populations with heterogeneous causal effects (or causal rules) and to estimate ourcausal effects of interest. To evaluate the performance of our composite splitting function underdifferent settings, splits rely on either effect-specific splitting criteria or on the composite function.All simulations are performed under Bernoulli trials, that is treatment is randomly assignedindependently to each unit with a fixed probability π Wik = α . In our simulation study, we alsoassume that interference only takes place at the neighborhood level and we choose the followingdefinition of the network exposure: G ik = (cid:18)(cid:0) (cid:88) j ∈N ik W jk (cid:1) ≥ (cid:19) , (18)that is, the network exposure of unit ik is 1 if at least one neighbor is treated. A binary networkexposure together with a binary individual treatment results in a joint treatment with four cate-gories – i.e., ( W ik , G ik ) ∈ { ( w, g ) = (0 , , (1 , , (0 , , (1 , } . The binary definition of the networkexposure is chosen to allow the growth of deeper trees. Given that the minimum size requirementstops the algorithm when the number of units in a child leaf is not enough to estimate a conditionalcausal effect, a joint treatment with four categories ensures that this stopping condition is unlikelyto be met during the first few splits.In addition, the assumption of neighborhood interference allows the computation of the marginal30nd joint probabilities without the need for intensive estimation procedures. In fact, the approximatealgorithm for estimating the marginal and joint probabilities proposed by Aronow and Samii (2017)is computationally demanding and could not be incorporated in our simulation study. However, inthe case of Bernoulli trial and network exposure defined as in (18), the probability π ik ( w, g ) canbe computed using the formula in (3) while the joint probability π ikjk (cid:48) ( w, g, w (cid:48) , g (cid:48) ) is simply theproduct of π ik ( w, g ) and π jk (cid:48) ( w (cid:48) , g (cid:48) ) if the two units ik and jk (cid:48) are independent. On the contrary,if N wgik and N wgjk (cid:48) (cid:54) = 0 overlap the joint treatment of the units ik and jk (cid:48) will be dependent, thatis, π ikjk (cid:48) ( w, g ; w (cid:48) , g (cid:48) ) (cid:54) = π ik ( w, g ) π jk (cid:48) ( w (cid:48) , g (cid:48) ). In this case, the joint probability can still be readilycomputed using combinatorics formulas on two overlapping sets. For each simulation m = 1 . . . , M we generated K clusters and simulated, within each cluster,Erd˝os-R´enyi random graphs (Erd˝os and R´enyi; 1959) with n k = 100 nodes and a fixed probability(0.01) to observe a link. Given the definition of the network exposure we removed isolated nodesfrom the analysis to make the Assumption 5 hold. W ik and any of the 10 covariates X ip weresampled from independent Bernoulli distributions with probability 0.5: W i ∼ Ber (0 .
5) and X ip ∼ Ber (0 . . In the simulation study we focus on two main effects: (i) the pure treatment effect τ (1 , , , and (ii)the pure spillover effect τ (0 , , . To ease notation, we denote by τ the treatment effect and by δ thespillover effect. After setting the value of these two conditional effects for each unit with covariates X ik (depending on the simulation scenarios), we generated the four different potential outcomes asfollows: Y ik (0 , ∼ N (0 , Y ik (1 , ∼ N (0 , Y ik (1 ,
0) = Y ik (0 ,
0) + τ ( X ik ); Y ik (0 ,
1) = Y ik (0 ,
0) + δ ( X ik ) . Y ik = (cid:88) w =0 1 (cid:88) g =0 ( W ik = w, G ik = g ) Y ik ( w, g ) . We now detail how we varied the three factors (1), (2), and (3). We simulated two differentscenarios with respect to the heterogeneity structure (1). In the first scenario we have: τ ( X ik ) = h if X ik ∈ (cid:96) = { X ik = 0 , X ik = 0 } ; − h if X ik ∈ (cid:96) = { X ik = 1 , X ik = 1 } ;0 otherwise; δ ( X ik ) = h if X ik ∈ (cid:96) = { X ik = 0 , X ik = 0 } ; − h if X ik ∈ (cid:96) = { X ik = 1 , X ik = 1 } ;0 otherwise . Hence, in this scenario the heterogeneity driving variables (HDV), i.e., X i and X i , are the samefor both the treatment effect τ and the spillover effect δ and the two causal rules overlap. In thesecond scenario, we introduce a change in the drivers of the heterogeneity in the following way: τ ( X ik ) = h if X ik ∈ (cid:96) τ = { X ik = 0 , X ik = 0 } ;3 h if X ik ∈ (cid:96) τ = { X ik = 0 , X ik = 1 } ;0 otherwise; δ ( X ik ) = h if X ik ∈ (cid:96) δ = { X ik = 1 , X ik = 0 } ;3 h if X ik ∈ (cid:96) δ = { X ik = 1 , X ik = 1 } ;0 otherwise . Hence, in the second scenario the heterogeneity drivers are different for the two causal effects.Specifically, we have: X i and X i for the treatment effects, and X i and X i for the spillovereffects. In addition, we have two causal rules for the treatment effect, namely { X ik = 0 , X ik = 0 } and { X ik = 0 , X ik = 1 } and two different causal rules for the spillover effect, namely { X ik =1 , X ik = 0 } and { X ik = 1 , X ik = 1 } . For each structural scenario we varied the effect size: h = ( r + 0 . r =1 . Figure 4 graphically represents the two simulations’ scenarios. Moreover, wechanged the number of clusters keeping their size fixed to K = (10 , , For each scenariowe built three NCTs: one tree implementing the composite splitting rule for the treatment andspillover effects as in (17) (with w (1 , , = w (0 , , = 0 . Q in,H (1 , , (Π) as in (15), and a third tree using the singular splittingrule for the spillover effect Q in,H (0 , , (Π) as in (15). Note that K/2 clusters will be assigned to the discovery sample and the remaining clusters will be in theestimation set as in Athey and Imbens (2016). = 0 x = 0 τ = − hδ = − h τ = 0 δ = 0Yes No x = 0 τ = 0 δ = 0 τ = hδ = h Yes NoYes No (a) First scenario. x = 0 x = 0 τ = 0 δ = 3 hτ = 0 δ = h Yes No x = 0 τ = 3 hδ = 0 τ = hδ = 0Yes NoYes No (b) Second scenario. Figure 4: Simulations’ scenarios.
In all the scenarios the performance of the NCT algorithm is evaluated using the following measures,averaged over the M generated data sets: • Average number of correctly discovered heterogeneous causal rules corresponding to the leavesof the generated NCTs (reported with respect to the effect sizes in Figures 5 and 6) in thediscovery sample; • Average conditional treatment and spillover effects estimated for each correctly detectedheterogeneous terminal leaf (reported as ˆ τ and ˆ δ in Tables 1, 2, 3 and 4); • Average standard errors estimated for each correctly detected heterogeneous terminal leaf(reported as ˆ se (ˆ τ ) and ˆ se (ˆ δ ) in Tables 1, 2, 3 and 4); • Monte Carlo bias in the estimation sample:Bias m ( V est ) = 1 N est (cid:88) k ∈K est n k (cid:88) i =1 (cid:16) τ ( w,g,w (cid:48) g (cid:48) ) ( X ik ) − (cid:98) τ ( w,g,w (cid:48) g (cid:48) ) ( (cid:96) ( X ik , , Π m ) , V est ) (cid:17) , Bias( V est ) = 1 M M (cid:88) m =1 Bias m ( V est )33here Π m is the partition selected in simulation m ; • Monte Carlo MSE in the estimation sample:MSE m ( V est ) = 1 N est (cid:88) k ∈K est n k (cid:88) i =1 (cid:16) τ ( w,g,w (cid:48) g (cid:48) ) ( X ik ) − (cid:98) τ ( w,g,w (cid:48) g (cid:48) ) ( (cid:96) ( X ik ) , Π m , V est ) (cid:17) , MSE( V est ) = 1 M M (cid:88) m =1 MSE m ( V est ); • Coverage, computed as the average proportion of units for whom where the estimated 95%confidence interval of the causal effect in the assigned leaf includes the true value:C m ( V est ) = 1 N est (cid:88) k ∈K est n k (cid:88) i =1 (cid:16) τ ( w,g,w (cid:48) g (cid:48) ) ( X ik ) ∈ (cid:99) CI (cid:16)(cid:98) τ ( w,g,w (cid:48) g (cid:48) ) ( (cid:96) ( X ik , Π m ) , V est ) (cid:17)(cid:17) , C( V est ) = 1 M M (cid:88) m =1 C m ( V est ) . We will first analyze the ability of the algorithm to correctly detect the heterogeneous subgroups inthe first simulation scenario, that is, when the heterogeneity is the same for the two causal effectsof interest. Figure 5 reports the average number of correctly discovered heterogeneous causal ruleswith composite splitting rule or effect-specific splitting rules targeted to the treatment effect or thespillover effect, in the case of 10, 20 and 30 clusters. Given that in this scenario the causal rulesdefining the heterogeneity is the same for both effects, all the splitting rules, targeted to a singleeffect or to both effects, have a similar performance and are able to detect the two heterogeneousleaves with a success rate that gets higher as the effect size increases. In addition, as the num-ber of cluster grows the minimum effect size allowing the algorithm to optimally discover all theheterogeneous sub-populations gets lower.Tables 1, 2, 3 report the results for the first scenario 4 for the performance of the estimatorin the correctly detected leaves. We only report the results of the estimation procedure on thetree built with the composite splitting rule, as the spitting rule would only affect the identificationof the heterogeneous sub-populations but not the estimation of the causal effects once these sub-34opulations are correctly detected. The estimator is able to estimate the heterogeneous treatmentand spillover effects without bias and its precision grows as the number of clusters increases. In-terestingly, NCT provides more accurate estimates of the heterogeneous spillover effects than thetreatment effects. This is due to the larger number of units with ( W ik = 0 , G ik = 1) than those with( W ik = 1 , G ik = 0), by definition of the network exposure (by design, units with ( W ik = 0 , G ik = 1)are on average 50% more than units with ( W ik = 1 , G ik = 0)). The higher precision in the esti-mation of the spillover effects is also reflected in the identification of the heterogeneous subgroups,which is more accurate when splits are targeted to the minimization of the MSE of the spillovereffect (Figure 5). . . . . .
10 Clusters
Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s . . . . .
20 Clusters
Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s . . . . .
30 Clusters
Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s NCT (composite)NCT (treatment)NCT (spillover)
Figure 5: Simulations’ results for correctly discovered leaves in the first scenario with 2 correctleaves and 10, 20 and 30 clusters, respectively. 35able 1: Simulations’ results for the first scenario (10 clusters)
Treatment EffectsEffect Size ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) MSE Bias Coverage0.1 0.353 0.416 -0.139 0.350 0.106 0.107 1.0001.1 1.108 0.514 -1.132 0.515 0.261 -0.012 0.9362.1 2.124 0.740 -2.156 0.741 0.570 -0.016 0.9413.1 3.004 1.003 -3.050 0.997 0.934 -0.023 0.9454.1 4.061 1.293 -4.179 1.287 1.843 -0.059 0.9285.1 5.201 1.565 -5.197 1.585 2.297 0.002 0.9466.1 6.179 1.889 -6.086 1.824 3.282 0.046 0.9317.1 7.125 2.140 -7.110 2.128 4.276 0.007 0.9218.1 8.011 2.378 -8.149 2.425 5.087 -0.069 0.9319.1 9.153 2.729 -9.128 2.713 6.803 0.012 0.92710.1 9.991 2.947 -9.916 2.974 8.007 0.038 0.934 Spillover Effects ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) MSE Bias Coverage0.1 0.225 0.380 -0.249 0.322 0.082 -0.012 0.9581.1 1.114 0.415 -1.103 0.402 0.155 0.005 0.9602.1 2.085 0.542 -2.138 0.545 0.230 -0.026 0.9673.1 3.047 0.716 -3.094 0.716 0.408 -0.023 0.9724.1 4.098 0.892 -4.079 0.882 0.646 0.009 0.9665.1 5.119 1.063 -5.054 1.057 0.862 0.032 0.9586.1 6.090 1.250 -6.045 1.242 1.132 0.022 0.9587.1 7.086 1.437 -7.099 1.442 1.542 -0.006 0.9738.1 8.075 1.618 -8.019 1.618 1.986 0.028 0.9619.1 8.936 1.811 -9.197 1.828 2.586 -0.130 0.95710.1 10.078 2.023 -10.055 2.026 2.730 0.011 0.971 Table 2: Simulations’ results for the first scenario (20 clusters)
Treatment EffectsEffect Size ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) MSE Bias Coverage0.1 0.053 0.286 -0.072 0.284 0.061 -0.010 0.9291.1 1.120 0.382 -1.102 0.377 0.145 0.009 0.9542.1 2.102 0.538 -2.114 0.538 0.270 -0.006 0.9493.1 3.105 0.710 -3.140 0.721 0.468 -0.018 0.9464.1 4.055 0.905 -4.198 0.947 0.848 -0.072 0.9485.1 5.058 1.100 -5.034 1.108 1.132 0.012 0.9406.1 6.155 1.332 -6.034 1.306 1.609 0.061 0.9337.1 7.058 1.519 -7.036 1.521 2.000 0.011 0.9548.1 8.141 1.737 -8.118 1.755 2.919 0.011 0.9619.1 9.130 1.952 -9.160 1.949 3.313 -0.015 0.95910.1 10.088 2.150 -10.071 2.142 4.021 0.008 0.947 Spillover Effects ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) MSE Bias Coverage0.1 0.085 0.234 -0.139 0.255 0.064 -0.027 0.9291.1 1.092 0.300 -1.106 0.297 0.075 -0.007 0.9672.1 2.072 0.393 -2.083 0.396 0.132 -0.006 0.9653.1 3.067 0.509 -3.084 0.511 0.208 -0.009 0.9734.1 4.124 0.641 -4.088 0.639 0.296 0.018 0.9715.1 5.118 0.772 -5.052 0.772 0.436 0.033 0.9746.1 6.082 0.900 -6.154 0.915 0.611 -0.036 0.9727.1 7.114 1.042 -7.101 1.040 0.802 0.007 0.9688.1 8.023 1.183 -8.156 1.185 1.027 -0.067 0.9759.1 9.106 1.312 -9.139 1.323 1.262 -0.017 0.97910.1 10.210 1.465 -10.151 1.460 1.472 0.029 0.976 Treatment EffectsEffect Size ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) MSE Bias Coverage0.1 0.092 0.238 -0.068 0.230 0.016 0.012 1.0001.1 1.083 0.311 -1.115 0.306 0.091 -0.016 0.9502.1 2.101 0.435 -2.107 0.436 0.170 -0.003 0.9493.1 3.104 0.584 -3.113 0.588 0.305 -0.005 0.9534.1 4.086 0.754 -4.114 0.746 0.546 -0.014 0.9465.1 5.168 0.921 -5.170 0.931 0.794 -0.001 0.9566.1 6.110 1.091 -6.059 1.074 1.041 0.025 0.9567.1 7.133 1.259 -7.135 1.251 1.422 -0.001 0.9608.1 8.078 1.420 -7.952 1.409 1.729 0.063 0.9469.1 9.199 1.618 -9.047 1.580 2.264 0.076 0.96110.1 10.148 1.763 -10.171 1.765 2.862 -0.012 0.958 Spillover Effects ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) MSE Bias Coverage0.1 0.022 0.208 -0.086 0.213 0.019 -0.032 0.9671.1 1.082 0.249 -1.108 0.244 0.056 -0.013 0.9642.1 2.080 0.322 -2.115 0.326 0.083 -0.017 0.9743.1 3.094 0.423 -3.065 0.418 0.132 0.015 0.9694.1 4.119 0.528 -4.083 0.525 0.209 0.018 0.9735.1 5.092 0.634 -5.110 0.637 0.263 -0.009 0.9836.1 6.085 0.745 -6.126 0.746 0.384 -0.020 0.9757.1 7.061 0.857 -7.160 0.860 0.517 -0.050 0.9778.1 8.101 0.974 -8.051 0.973 0.695 0.025 0.9799.1 9.111 1.088 -9.103 1.091 0.827 0.004 0.98410.1 10.184 1.215 -10.123 1.201 0.993 0.030 0.981 For the second scenario with different causal rules for each causal effect, we only report theresults for simulations with 30 clusters. Figure 6 depicts the average number of correctly discoveredheterogeneous causal rules with composite splitting rule or effect-specific splitting rules targeted tothe treatment effect or the spillover effect.When we are interested in building a tree that can represent the heterogeneity of all causaleffects simultaneously (left panel), the composite splitting rule NCT is able to correctly identify allthe heterogeneous causal rules (four in this example), while the other two NCT, targeted to eitherthe treatment effect or the spillover effect, only detect the two leaves where the correspondingcausal effect is heterogeneous. This is even more clear when looking at the other two plots, wherewe depict the ability to detect just the treatment effect rules (central panel) and the spillover effectrules (right panel). Indeed, when we are interested in subgroups that are heterogeneous with respectto only one causal effect, both the effect-specif spitting rule targeted to that effect or the compositespitting function can be used and perform similarly, while the use of the effect-specif spitting ruletargeted to the other effect results in a poor detection of the correct causal rules. The results fromthis second scenario show the clear added value of the composite splitting rule. Indeed, when theHDV are different for treatment and spillover effects implementing this splitting rule enables the37esearcher to correctly spot all the true causal rules simultaneously . Finally, Table 4 shows how theHorvitz-Thompson estimator performs well in estimating the conditional treatment and spillovereffects in the two corresponding heterogeneous leaves.
Composite Tree
Four LeavesEffect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s . . . . . Treatment Effects Tree
Two LeavesEffect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s . . . . . Spillover Effects Tree
Two LeavesEffect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s NCT (composite)NCT (treatment)NCT (spillover)
Figure 6: Simulations’ results for correctly discovered leaves in the second scenario with 30 clusters.Table 4: Simulations’ results for second scenario (30 clusters)
Treatment EffectsEffect Size ˆ τ (cid:96) τ ˆ se (ˆ τ (cid:96) τ ) ˆ τ (cid:96) τ ˆ se (ˆ τ (cid:96) τ ) MSE Bias Coverage0.1 0.194 0.254 0.351 0.241 0.059 0.072 0.9641.1 1.103 0.310 3.239 0.607 0.223 -0.029 0.9422.1 2.088 0.438 6.379 1.121 0.650 0.034 0.9523.1 3.100 0.589 9.317 1.634 1.226 0.008 0.9704.1 4.113 0.756 12.285 2.137 2.353 -0.001 0.9455.1 5.120 0.917 15.418 2.708 4.106 0.069 0.9446.1 6.106 1.074 18.263 3.159 5.469 -0.015 0.9477.1 7.086 1.241 21.312 3.712 6.246 -0.001 0.9558.1 8.167 1.433 24.257 4.209 8.790 0.012 0.9499.1 9.028 1.569 27.099 4.671 9.975 -0.136 0.95410.1 10.024 1.733 30.617 5.288 14.454 0.120 0.951 Spillover Effects ˆ δ (cid:96) δ ˆ se (ˆ δ (cid:96) δ ) ˆ δ (cid:96) δ ˆ se (ˆ δ (cid:96) δ ) MSE Bias Coverage0.1 0.086 0.187 0.281 0.209 0.041 -0.017 1.0001.1 1.089 0.248 3.321 0.444 0.105 0.005 0.9682.1 2.116 0.326 6.325 0.773 0.237 0.021 0.9773.1 3.120 0.421 9.313 1.114 0.491 0.016 0.9804.1 4.109 0.527 12.277 1.455 0.832 -0.007 0.9725.1 5.123 0.637 15.407 1.816 1.441 0.065 0.9766.1 6.119 0.746 18.341 2.159 1.990 0.030 0.9727.1 7.074 0.858 21.393 2.516 2.177 0.033 0.9798.1 8.095 0.976 24.312 2.854 3.168 0.003 0.9749.1 9.048 1.090 27.264 3.210 3.953 -0.044 0.97510.1 10.053 1.201 30.416 3.565 4.610 0.034 0.980 Empirical Application
In this section we provide an empirical application of the proposed methodology. We use data froma randomized experiment designed to assess the effectiveness of intensive information sessions topromote the uptake of a new weather insurance policy among farmers in rural China. The promotedpolicy is addressed to rice farmers and it is aimed at protecting their products from adverse weathershocks (Cai et al.; 2015). The sample consists of 4,569 households belonging to 47 different villages.Households were randomly assigned to the the intensive training session. Here, the individualtreatment variable W ik ∈ { , } for each household i living in village k represents the assignmentto the intensive ( W ik = 1 ) or simple ( W ik = 0) information session.Households are linked according to an observed village-specific friendship network, representedin Figure 7. Relationships between households belonging to different villages are negligible and wereremoved.Figure 7: Friendship network between households living in rural villages in China. Colors refer todifferent villages.Friendship relationships between households in a given rural village k are fully described by theadjacency matrix A k , where the generic element A k ( i, j ) equals 1 if the household i in village k has The original experiment also includes a village-level randomization on price variation and a second round ofsessions Cai et al. (2015). Here we only consider the household-level randomization to the first round of sessions. j in the same village k as friend. Therefore, the adjacency matrix is notsymmetric. We denote with N ik the set of nominated friends of unit ik , and with N ik the out-degree.Figure 8a shows the overall degree distribution, while Figure 8b represents the distribution of thenumber of treated neighbors. In this setting interference could take place because the informationreceived in the intensive information session could be transferred to network contacts. We assumehere an exposure to the network treatments only at the neighborhood level and, in particular, wedefine the network exposure variable as in (18). Hence, G ik is equal to 1 if a unit has at least onetreated friend, and 0 otherwise. This definition of the network exposure is justifiable by the factthat the decision of a farmer to buy the weather insurance might depend only on the informationeither received directly in the intensive session or indirectly by one of his friends. This is also whathas been found in Cai et al. (2015) where the only significant spillover effect is from first neighbors. (a) Number of neighbors (b) Number of treated neighbors Figure 8: Degree and treated neighbors distributionWe dropped from the analysis the 57 households without any friends. The total number ofremaining households is 4 , ,
535 belong to the control group and, thus, they haveundergone a less intensive session. 2 ,
075 households do not have any treated friends, while 2 ,
437 ofthem undertake friendship relationships with at least one treated household. The joint distributionof the individual and neighborhood treatments is summarized in the following Table 5.40able 5: Distribution of the joint treatment G i = 0 G i = 1 W i = 0 1,621 1,914 W i = 1 454 523Figures 9a and 9b represent the treatment distribution in four villages. In Figure 9a, nodes arecolored according to their individual treatment assignment, while in Figure 9b node colors refer tothe joint treatment status. (a) Individual treatment in four villages (b) Joint treatment in four villages: Figure 9: Zoom on four villagesThe outcome variable Y ik ∈ { , } represents the insurance uptake and equals 1 if the household i residing in village k purchased the insurance after the information session. At the end of theexperiment, 2 ,
495 families chose not to accept the proposed insurance policy, while 2 ,
017 werepositively persuaded by the session and accepted the weather insurance.To evaluate the heterogeneity of treatment and spillover effects, we included in the analysisall the observed characteristics that could plausibly moderate the effect of the intervention: threedummies representing the household’s production area ( reg1 , reg2 and reg3 , respectively); a binaryvariable which equals 1 if the head of the household is at least 50 years old ( age ); a binary variablebeing 1 if the household’s head has successfully completed high school ( educ ); a binary variabledistinguishing families who are strongly worried about weather phenomena ( prob dis , which isequal to 1 if perceived probability of a weather disaster happening in the coming year exceeds41 . averse ). In this section, we present the most relevant empirical findings. The following figures report thetrees built targeting single effects (Figure 10) and multiple effects (Figure 11). In each node, wealso report the estimates of the main treatment and spillover effects and the size of the selectedsub-populations. 20 villages were randomly assigned to the training set, while the remaining villageshave been allocated to the estimation set. Furthermore, we have set the maximum depth at 3 tomaintain a high level of interpretability, while we have set at 20 the minimum number of units thatmust be present in the child leaves for each of the four exposure conditions.We can see that, in the whole population, the treatment has a positive effect on insurance takeup, while the spillover effect is negligible. The most relevant heterogeneity drivers result to be theperceived probability of disaster and the production area (specifically, living in the third area).However, the estimated tree slightly changes under different specifications of the spitting rule. Fig-ures 10a and 10b separately represent the trees targeted to τ (1 ,
0; 0 ,
0) and τ (0 ,
1; 0 , τ (1 ,
0; 0 , τ (0 ,
1; 0 , T . In this application, the composite tree which considers both τ (1 ,
0; 0 , The random assignment of the villages to the training and estimation samples has been kept fixed in all thetrees. τ (0 ,
1; 0 ,
0) coincides with the tree based on τ (1 ,
0; 0 ,
0) only. Therefore, the latter leads thecomposite variability across partitions. Finally, the network causal tree that incorporates all thefour effects shows slightly different results: here the sub-population where the treatment resultsto be more effective is the one including older households who are currently settled in the secondproduction area.We can conclude that intensive training sessions encouraged Chinese rural households to takeup the the insurance policy. The main determinants of the heterogeneity in the treatment effect arethe production area and age. Indirect effects do not have a significant impact in this study, and,thus, composite criteria are mostly ruled by treatment effects. (a) NCT targeted to τ (1 ,
0; 0 ,
0) (b) NCT targeted to τ (0 ,
1; 0 , Figure 10: Network causal trees targeted to single effects43 a) NCT targeted to τ (1 ,
0; 0 ,
0) and τ (0 ,
1; 0 ,
0) (b) NCT targeted to all effects
Figure 11: Network causal trees targeted to multiple effects
Depending on our characteristics we might respond differently to a treatment or an intervention.Similarly, we might be more or less susceptible to the influence of other people who have received thetreatment. Understanding the heterogeneity of the effect of a treatment with the aim of targetingpeople who would benefit from it has been the focus of a recent field of research, especially appliedto medicine. Investigating how different people respond differently to the treatment received byothers can be crucial, particularly in settings with limited resources where spillover effects could beleveraged. In this paper, we have introduced a new algorithm to estimate heterogeneous causal treat-ment and spillover effects in the presence of clustered network interference. The proposed networkcausal tree model bridges the gap between two streams of causal inference literature: estimatorsfor causal effects under interference and tree-based methods for the discovery of heterogeneoussub-populations. We build upon the seminal algorithm proposed by Athey and Imbens (2016) toaccount for clustered network interference through a rework of the criterion function. Leaf-specificcausal effects are then estimated using the Horvitz-Thompson estimator proposed by Aronow andSamii (2017). 44he proposed NCT algorithm has enhanced interpretability and shows an excellent performancein a set of Monte Carlo simulations. In particular, the algorithm is able both to spot the relevantsources of heterogeneity in the data and to consistently estimate the conditional treatment andspillover effects. Moreover, we have introduced a composite splitting function that allows the re-searchers to simultaneously detect the sub-populations where both treatment and spillover effectsare heterogeneous. The identification of these multi-effect heterogeneous subgroups is crucial for thedesign of targeting strategies that involve multiple effects. For instance, in marketing campaignsa person might not be affected by an advertisement received directly by a company but mightbe susceptible to the advertisement received by her friends. In this case, resources could be savedby promoting the product among people who could be directly susceptible and letting those whocould be more influenced by their friends receive the advertisement indirectly. Our simulation studyshows that the use of such composite splitting rule is able to correctly detect all the heterogeneoussub-populations defined by both treatment and spillover effects, as well the ones defined by oneeffect only. Therefore, the selected partition of the population can be used to design strategieswhose objective function incorporates multiple effects, but can also be used a posteriori to targetsubgroups maximizing either a treatment or a spillover effect.When applied to real-world data, the NCT algorithm provided useful insights on the effectivenessof intensive training sessions among Chinese rural households on the uptake of a weather insurancepolicy. We found that the main characteristics responsible for the heterogeneity of the effects areperception of a possible disaster, the production area, and the age of the farmers. However, theseheterogeneity drivers play a different role with respect to the two main treatment and spillovereffects and when the tree is built using composite splitting function.Nevertheless, the proposed algorithm may suffer from the limitations common to tree-basedmethods: instability to the random allocation of units in the training sample and the potentialimpact of outlier observations in the node-specific estimations. Here, we propose an algorithm thatselects a single tree because of its high interpretability that plays a fundamental role in policydesign. Indeed, the single tree algorithms are suitable for the discovery of heterogeneous sub-populations, which can give useful insights into the main variables that drive the heterogeneityand can be used to design targeting strategies. Moreover, we did not find instability in neither thedetection nor the estimation of heterogeneous effects in the Monte Carlo simulations. Nonetheless,45he proposed algorithm could be extended to tree-based ensemble methods, following Wager andAthey (2018) and Athey et al. (2019). By averaging the estimates from many single trees, thisextension could enhance estimation precision at the cost of reduced interpretability (see Lee et al.(2020) for a discussion on the trade-off between accuracy and interpretability). In addition, ourapproach might be rearranged to deal with settings where the network cannot be partitioned intowell-defined and pre-specified clusters. However, in some network structures, clusters could bedetected by implementing a network-based community detection algorithm (Fortunato; 2010) andthe NCT algorithm could be applied on the detected communities, while the estimator should takeinto account the uncertainty in group membership .Here we assume that the network exposure variable is discrete, with the performance of thealgorithm being affected by the number of categories resulting from the exposure mapping function.In our simulation study and application we have used a binary neighborhood exposure, whichallowed us to grow deeper trees, reduce the number of possible causal effects, and have enoughobservations for each exposure condition to maintain the variance in a reasonable range. However,alternative specifications could be used. For instance, the network exposure could be defined as theproportion of treated neighbors, perhaps categorized into few bins. A more complex definition of thenetwork exposure, possibly resulting in a continuous variable, would require some methodologicaladjustments in the estimation strategy, but it would allow to model a wider ensemble of real-worldinterference mechanisms. We leave this extension to future work.Finally, further research is needed to use the selected partition to actually design targetingstrategies involving both treatment and spillover effects. Furthermore, these strategies should alsorely on the average susceptibility of network contacts as well as on heterogeneous influential power.To this end, new causal estimands could be defined and their heterogeneity should be investigatedand incorporated in the design of complex targeting rules, aimed at maximizing the effect of theintervention while saving resources. 46 eferences
Angelucci, M. and Di Maro, V. (2015).
Program evaluation and spillover effects , The World Bank.Arduini, T., Patacchini, E. and Rainone, E. (2019). Treatment effects with heterogeneous exter-nalities,
Journal of Business & Economic Statistics (doi: 10.1080/07350015.2019.1592755): 1–13.Arduini, T., Patacchini, E. and Rainone, E. (2020). Identification and estimation of network modelswith heterogeneous interactions,
Forthcoming at Advances in Econometrics .Aronow, P. M. and Samii, C. (2017). Estimating average causal effects under general interference,with application to a social network experiment, The Annals of Applied Statistics (4): 1912–1947.Assmann, S. F., Pocock, S. J., Enos, L. E. and Kasten, L. E. (2000). Subgroup analysis and other(mis) uses of baseline data in clinical trials, The Lancet (9209): 1064–1069.Athey, S., Eckles, D. and Imbens, G. W. (2018). Exact p-values for network interference,
Journalof the American Statistical Association (521): 230–240.Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal effects,
Proceedingsof the National Academy of Sciences (27): 7353–7360.Athey, S. and Imbens, G. W. (2019). Machine learning methods that economists should knowabout,
Annual Review of Economics : 685–725.Athey, S., Tibshirani, J., Wager, S. et al. (2019). Generalized random forests, The Annals ofStatistics (2): 1148–1178.Baird, S., Bohren, J. A., McIntosh, C. and ¨Ozler, B. (2018). Optimal design of experiments in thepresence of interference, Review of Economics and Statistics (5): 844–860.Bargagli-Stoffi, F. J., De Witte, K. and Gnecco, G. (2019). Heterogeneous causal effects with imper-fect compliance: a novel Bayesian machine learning approach, arXiv preprint arXiv:1905.12707 . 47argagli-Stoffi, F. J. and Gnecco, G. (2018). Estimating heterogeneous causal effects in the presenceof irregular assignment mechanisms,
In Proceedings of the 5th IEEE Conference in Data Scienceand Advanced Analytics (DSAA) pp. 1–10.Bargagli-Stoffi, F. J. and Gnecco, G. (2020). Causal tree with instrumental variable: an extensionof the causal tree framework to irregular assignment mechanisms,
International Journal of DataScience and Analytics (3): 315–337.Basse, G. and Feller, A. (2018). Analyzing two-stage experiments in the presence of interference, Journal of the American Statistical Association (521): 41–55.Binka, F. N., Indome, F. and Smith, T. (1998). Impact of spatial distribution of permethrin-impregnated bed nets on child mortality in rural northern ghana.,
The American Journal ofTropical Medicine and Hygiene (1): 80–85.Breiman, L. (2001). Random forests, Machine Learning (1): 5–32.Bridges, C. B., Thompson, W. W., Meltzer, M. I., Reeve, G. R., Talamonti, W. J., Cox, N. J., Lilac,H. A., Hall, H., Klimov, A. and Fukuda, K. (2000). Effectiveness and cost-benefit of influenzavaccination of healthy working adults: a randomized controlled trial, Journal of the AmericanMedical Association (13): 1655–1663.Broderick, M. P., Hansen, C. J. and Russell, K. L. (2008). Exploration of the effectiveness ofsocial distancing on respiratory pathogen transmission implicates environmental contributions,
The Journal of Infectious Diseases (10): 1420–1426.Cai, J., De Janvry, A. and Sadoulet, E. (2015). Social networks and the decision to insure,
AmericanEconomic Journal: Applied Economics (2): 81–108.Chae, I., Stephen, A. T., Bart, Y. and Yao, D. (2017). Spillover effects in seeded word-of-mouthmarketing campaigns, Marketing Science (1): 89–104.Chakraborty, B. and Murphy, S. (2014). Dynamic treatment regimes, Annual Review of Statisticsand Its Application : 447–464.Chen, L. H. Y. and Shao, Q.-M. (2004). Normal approximation under local dependence, The Annalsof Probability (3): 1985–2028. 48ockx, B., Lechner, M. and Bollens, J. (2019). Priority to unemployed immigrants? A causalmachine learning evaluation of training in Belgium, arXiv preprint arXiv:1912.12864 .Cook, D. I., Gebski, V. J. and Keech, A. C. (2004). Subgroup analysis in clinical trials, MedicalJournal of Australia (6): 289–291.Cox, D. R. (1958).
Planning of experiments. , Wiley: New York.Duflo, E. and Saez, E. (2003). The role of information and social interactions in retirement plan deci-sions: Evidence from a randomized experiment,
The Quarterly Journal of Economics (3): 815–842.Erd˝os, P. and R´enyi, A. (1959). On random graphs i,
Publicationes Mathematicae (290-297): 18.Forastiere, L., Airoldi, E. M. and Mealli, F. (2020). Identification and estimation of treatmentand interference effects in observational studies on networks, Forthcoming in the Journal of theAmerican Statistical Association. Preprint at arXiv:1609.06245 .Forastiere, L., Lattarulo, P., Mariani, M., Mealli, F. and Razzolini, L. (2019). Exploring encour-agement, treatment, and spillover effects using principal stratification, with application to afield experiment on teens museum attendance,
Journal of Business & Economic Statistics (doi:10.1080/07350015.2019.1647843): 1–15.Forastiere, L., Mealli, F. and VanderWeele, T. J. (2016). Identification and estimation of causalmechanisms in clustered encouragement designs: Disentangling bed nets using bayesian principalstratification,
Journal of the American Statistical Association (514): 510–525.Forastiere, L., Mealli, F., Wu, A. and Airoldi, E. (2018). Estimating causal effects under interferenceusing bayesian generalized propensity scores, arXiv preprint arXiv:1807.11038 .Fortunato, S. (2010). Community detection in graphs,
Physics reports (3-5): 75–174.Foster, J. C., Taylor, J. M. and Ruberg, S. J. (2011). Subgroup identification from randomizedclinical trial data,
Statistics in Medicine (24): 2867–2880.Friedman, J. H., Olshen, R. A., Stone, C. J. et al. (1984). Classification and Regression Trees, Belmont, CA: Wadsworth & Brooks . 49uber, R. (2018). Instrument Validity Tests with Causal Trees: With an Application to the Same-sex Instrument,
MEA Discussion Paper Series 201805 , Munich Center for the Economics ofAging (MEA) at the Max Planck Institute for Social Law and Social Policy.Hahn, P. R., Murray, J. S., Carvalho, C. M. et al. (2020). Bayesian regression tree models forcausal inference: regularization, confounding, and heterogeneous effects,
Forthcoming in BayesianAnalysis .Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference,
Journal of Computationaland Graphical Statistics (1): 217–240.Holland, P. W. (1986). Statistics and causal inference, Journal of the American Statistical Associ-ation (396): 945–960.Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacementfrom a finite universe, Journal of the American Statistical Association (260): 663–685.Howard, S., Omumbo, J., Nevill, C., Some, E., Donnelly, C. and Snow, R. (2000). Evidence for amass community effect of insecticide-treated bednets on the incidence of malaria on the kenyancoast, Transactions of the Royal Society of Tropical Medicine and Hygiene (4): 357–360.Hudgens, M. G. and Halloran, M. E. (2008). Toward causal inference with interference, Journal ofthe American Statistical Association (482): 832–842.Imai, K., Jiang, Z. and Malani, A. (2020). Causal inference with interference and noncompli-ance in two-stage randomized experiments,
Forthcoming in Journal of the American StatisticalAssociation .Johnson, M., Cao, J. and Kang, H. (2019). Detecting heterogeneous treatment effect with instru-mental variables, arXiv preprint arXiv:1908.03652 .Kang, H. and Imbens, G. (2016). Peer encouragement designs in causal inference with partialinterference and identification of local average network effects, arXiv preprint arXiv:1609.04464 . 50elso, J. K., Milne, G. J. and Kelly, H. (2009). Simulation suggests that rapid activation of socialdistancing can arrest epidemic development due to a novel strain of influenza,
BMC public health (1): 117.Kempe, D., Kleinberg, J. and Tardos, . (2003). Maximizing the spread of influence through asocial network, Proceedings of the Ninth ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining p. 137146.Kim, D., Hwong, A., Stafford, D., Hughes, D., O’Malley, J., Fowler, J. and Christakis, N. (2015).Social network targeting to maximise population behaviour change: A cluster randomised con-trolled trial,
Lancet : 145–153.Kissler, S. M., Tedijanto, C., Goldstein, E., Grad, Y. H. and Lipsitch, M. (2020). Projecting thetransmission dynamics of sars-cov-2 through the postpandemic period,
Science (6493): 860–868.Kosorok, M. R. and Laber, E. B. (2019). Precision medicine,
Annual Review of Statistics and ItsApplication (1): 263–286.Lechner, M. (2018). Modified causal forests for estimating heterogeneous causal effects, arXivpreprint arXiv:1812.09487 .Lee, K., Bargagli-Stoffi, F. J. and Dominici, F. (2020). Causal rule ensemble: Interpretable inferenceof heterogeneous treatment effects, Working paper .Lee, K., Small, D. S. and Dominici, F. (2018). Discovering effect modification and randomizationinference in air pollution studies, arXiv preprint arXiv:1802.06710 .Leung, M. P. (2020). Treatment and spillover effects under network interference,
The Review ofEconomics and Statistics (2): 368–380.Liu, L., Hudgens, M. and Becker-Dreps, S. (2016). On inverse probability-weighted estimators inthe presence of interference,
Biometrika (4): 829–842.Liu, L. and Hudgens, M. G. (2014). Large sample randomization inference of causal effects in thepresence of interference,
Journal of the American Statistical Association (505): 288–301.51anski, C. (2013). Identification of treatment response with social interactions,
EconometricsJournal (1): 1–23.McKenzie, D. and Puerto, S. (2015). The direct and spillover impacts of a business training programfor female entrepreneurs in Kenya, In Proceedings of the 2015 International Labor OrganizationConference pp. 1–37.Montes, F., Jaramillo, A. M., Meisel, J. D., Diaz-Guilera, A., Valdivia, J. A., Sarmiento, O. L. andZarama, R. (2020). Benchmarking seeding strategies for spreading processes in social networks:an interplay between influencers, topologies and sizes,
Scientific Reports (1): 3666.Muralidharan, K. and Sundararaman, V. (2015). The aggregate effect of school choice: Evidencefrom a two-stage experiment in india, The Quarterly Journal of Economics (3): 1011–1066.Murphy, S. A. (2003). Optimal dynamic treatment regimes,
Journal of the Royal Statistical Society:Series B (Statistical Methodology) (2): 331–355.Nichol, K. L., Lind, A., Margolis, K. L., Murdoch, M., McFadden, R., Hauge, M., Magnan, S. andDrake, M. (1995). The effectiveness of vaccination against influenza in healthy, working adults, New England Journal of Medicine (14): 889–893.Ogburn, E., Sofrygin, O., Diaz, I. and van der Laan, M. (2017). Causal inference for social networkdata, arXiv preprint arXiv:1705.08527 .Parshakov, P., Naidenova, I. and Barajas, A. (2020). Spillover effect in promotion: Evidence fromvideo game publishers and esports tournaments,
Journal of Business Research : 262–270.Ratkovic, M. and Tingley, D. (2017). Sparse estimation and uncertainty with application to sub-group analysis,
Political Analysis (1): 1–40.Rothwell, P. M. (2005). Subgroup analysis in randomised controlled trials: importance, indications,and interpretation, The Lancet (9454): 176–186.Rubin, D. B. (1974). Estimating causal effects of treatments in randomized and nonrandomizedstudies.,
Journal of Educational Psychology (5): 688.52ubin, D. B. (1986). Comment: Which ifs have causal answers, Journal of the American StatisticalAssociation (396): 961–962.Schiltz, F., Mazrekaj, D., Horn, D. and De Witte, K. (2019). Does it matter when your smartestpeers leave your class? Evidence from Hungary, Labour Economics : 79–91.Sobel, M. E. (2006). What do randomized studies of housing mobility demonstrate?, Journal ofthe American Statistical Association (476): 1398–1407.Sofrygin, O. and van der Laan, M. (2017). Semi-parametric estimation and inference for the meanoutcome of the single time-point intervention in a causally connected population,
Journal ofCausal Inference (1): 20160003.Su, X., Kang, J., Fan, J., Levine, R. A. and Yan, X. (2012). Facilitating score and causal inferencetrees for large observational studies, Journal of Machine Learning Research (Oct): 2955–2994.Su, X., Tsai, C.-L., Wang, H., Nickerson, D. M. and Li, B. (2009). Subgroup analysis via recursivepartitioning., Journal of Machine Learning Research (2): 141–158.Tchetgen, E. J. T. and VanderWeele, T. J. (2012). On causal inference in the presence of interfer-ence, Statistical Methods in Medical Research (1): 55–75.Valdes, G., Luna, J. M., Eaton, E., Simone II, C. B., Ungar, L. H. and Solberg, T. D. (2016).Mediboost: a patient stratification tool for interpretable decision making in the era of precisionmedicine, Scientific Reports : 37854.Valente, T. W. (2012). Network interventions, Science (6090): 49–53.VanderWeele, T. J. and Christakis, N. A. (2019). Network multipliers and public health,
Interna-tional Journal of Epidemiology (4): 1032–1037.Varadhan, R. and Seeger, J. D. (2013). Estimation and reporting of heterogeneity of treatmenteffects, Developing a Protocol for Observational Comparative Effectiveness Research: A User’sGuide , Agency for Healthcare Research and Quality (US).Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment effects usingrandom forests,
Journal of the American Statistical Association (523): 1228–1242.53ang, G., Li, J. and Hopp, W. J. (2017). Personalized health care outcome analysis of cardiovascularsurgical procedures,
Ross School of Business Paper (1336).Zhang, W., Le, T. D., Liu, L., Zhou, Z.-H. and Li, J. (2017). Mining heterogeneous causal effectsfor personalized cancer treatment,
Bioinformatics (15): 2372–2378.Zhao, J., Runfola, D. M. and Kemper, P. (2017). Quantifying heterogeneous causal treatment effectsin world bank development finance projects, Joint European Conference on Machine Learningand Knowledge Discovery in Databases , Springer, pp. 204–215.54 nline Appendix
A Proofs
Proposition 1 (Unbiasedness) . E [ (cid:98) µ ( w,g ) ( (cid:96) ( x ))] = µ ( w,g ) ( (cid:96) ( x )) . Proof. E [ (cid:98) µ ( w,g ) ] = E (cid:20) N K (cid:88) k =1 n k (cid:88) i =1 ( W ik = w, G ik = g, ) Y ik π ik (( w, g )) (cid:21) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 E (cid:2) ( W ik = w, G ik = g ) (cid:3) Y ik ( w, g ) π ik ( w, g )= µ ( w,g ) (19)where the expectation is over the randomization distribution of W ik and the induced distributionon G ik and the first equality holds by consistency. Proposition 6 (Population Unbiasedness) . The estimator is unbiased with respect to the populationmean of the potential outcomes: E [ (cid:98) µ ( w,g ) ( (cid:96) ( x ))] = E [ Y ik ( w, g ) | X ik ∈ (cid:96) ( x ))] = µ P ( w,g ) ( (cid:96) ( x )) and E [ (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x ))] = E [ Y ik ( w, g ) − Y ik ( w (cid:48) , g (cid:48) ) | X ik ∈ (cid:96) ( x ))]= E [ τ ( w,g ; w (cid:48) ,g (cid:48) ) ( X ik ) | X ik ∈ (cid:96) ( x ))] = µ P ( w,g ) ( (cid:96) ( x )) where the expected value is taken over the sampling distribution. roof. E [ (cid:98) µ ( w,g ) ] = E (cid:20) N K (cid:88) k =1 n k (cid:88) i =1 ( W ik = w, G ik = g, ) Y ik π ik (( w, g )) (cid:21) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 E (cid:2) ( W ik = w, G ik = g ) Y ik (cid:3) π ik ( w, g )= 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 E (cid:2) Y ik | W ik = w, G ik = g (cid:3) π ik ( w, g ) π ik ( w, g )= 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 E (cid:2) Y ik ( w, g ) (cid:3) = E (cid:2) Y ik ( w, g ) (cid:3) = µ P ( w,g ) . (20)The proof of unbiasedness of (cid:98) τ ( w,g,w (cid:48) ,g (cid:48) ) follows directly. Proposition 4 (Consistency) . Consider the asymptotic regime where the number of clusters K goto infinity, i.e., K −→ ∞ , while the cluster size remains bounded, i.e., n k ≤ B for some constantB. In addition, assume that | Y ik ( w, g ) | /π ik ( w, g ) ≤ C < , ∀ i, k, w, g . Then as K −→ ∞ (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) p −→ τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) . Proof.
As in Proposition 1 (cid:98) µ ( w,g )) ( (cid:96) ( x )) is unbiased. Hence, for consistency to hold we need to provethat the variance goes to 0 as N goes to infinity. Following Aronow and Samii (2017), it is easy toshow that the variance of (cid:98) µ ( w,g ) ( (cid:96) ( x )) is given by: V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 ( X ik ∈ (cid:96) ( x )) π ik ( w, g )[1 − π ik ( w, g )] (cid:20) Y ik ( w, g ) π ik ( w, g ) (cid:21) + 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i ( X ik ∈ (cid:96) ( x ) , X jk ∈ (cid:96) ( x )) × [ π ikjk ( w, g ) − π ik ( w, g ) π jk ( w, g ]) Y ik ( w, g ) π ik ( w, g ) Y jk ( w, g ) π jk ( w, g ) (21)Since | Y ik ( w, g ) | /π ik ( w, g ) ≤ C < (cid:96) ( x ) is bounded, i.e., n k ( (cid:96) ( x )) ≤ B ( (cid:96) ( x )) ≤ B , we have:( K × B ) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) ≤ C × K × B + C × K × B (cid:98) µ ( w,g ) ( (cid:96) ( x )) is therefore ensured since V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) −→ K −→ ∞ . Consis-tency of (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) follows by Slutsky’s Theorem. Proposition 7.
The partition Π (cid:63) such that Π (cid:63) = argmax Π ∈ P Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) = 1 N K (cid:88) k =1 n k (cid:88) i =1 (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( X ik , Π)) (cid:1) maximizes the heterogeneity across leaves.Proof. Let (cid:96) and (cid:96) be two sub-populations with a different causal effect τ ( w,g ; w (cid:48) ,g (cid:48) ) , i.e., τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:54) = τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ). Let Π be a partition that splits (cid:96) and (cid:96) into two leaves and letΠ c be the partition that combines the two sub-populations into one node (cid:96) . Then we have that Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) > Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π c ). The proofs follows from Jensen’s inequality. In fact, for partitionΠ the splitting function can be written as follows: Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) = 1 | (cid:96) | + | (cid:96) | (cid:88) ik ∈ (cid:96) ∪ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( X ik , Π)) (cid:1) = 1 | (cid:96) | + | (cid:96) | (cid:16) (cid:88) ik ∈ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:1) + (cid:88) ik ∈ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:1) (cid:17) For partition Π c we have: Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π c ) = 1 | (cid:96) | + | (cid:96) | (cid:88) ik ∈ (cid:96) ∪ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( X ik , Π c )) (cid:1) = 1 | (cid:96) | + | (cid:96) | (cid:88) ik ∈ (cid:96) ∪ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:1) = 1 | (cid:96) | + | (cid:96) | (cid:88) ik ∈ (cid:96) ∪ (cid:96) (cid:104) | (cid:96) | + | (cid:96) | (cid:16) (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) + (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:17)(cid:105) = (cid:104) | (cid:96) | + | (cid:96) | (cid:16) (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) + (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:17)(cid:105) where the second-last equality holds because of the properties of the Horvitz-Thompson estimator.Thanks to Jensen’s inequality1 | (cid:96) | + | (cid:96) | (cid:16) (cid:88) ik ∈ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:1) + (cid:88) ik ∈ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:1) (cid:17) ≥ (cid:104) | (cid:96) | + | (cid:96) | (cid:16) (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) + (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:17)(cid:105) Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) ≥ Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π c ). B Further Details of the Variance Estimator of Leaf-SpecificCACE
If in a generic leaf (cid:96) ( x ) there are some pairs of units ( i , j ) whose joint probability of the exposure con-dition ( w, g ) is zero (namely, π ikjk ( w, g ) = 0), the variance of (cid:92) µ ( w,g ) ( (cid:96) ( x )) can be estimated followinga result from Aronow and Samii (2017). Such estimator, denoted by (cid:98) V c (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) , is the sumof two components: (i) the estimated variance of leaf-specific potential outcomes, (cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x ) (cid:17) )in (8) for the case when π ikjk ( w, g ) > ∀ i, j, k , and (ii) a correction term (cid:98) A ( w,g ) ( (cid:96) ( x )): (cid:98) V c (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) = (cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) + (cid:98) A ( w,g ) ( (cid:96) ( x )) (22)where (cid:98) A ( w,g ) ( (cid:96) ( x )) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i : π ikjk ( w,g )=0 (cid:34) ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) Y ik π ik ( w, g ) ++ ( W jk = w, G jk = g, X ik ∈ (cid:96) ( x )) Y jk π jk ( w, g ) (cid:35) . (23)Note that the correction term (cid:98) A ( w,g ) ( (cid:96) ( x )) is zero if the leaf does not have pairs of units suchthat π ikjk ( w, g, w, g ) = 0. Furthermore, as in Aronow and Samii (2017), (cid:98) V c (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) is aconservative estimator of the leaf-specific variance, as the following holds: E (cid:34)(cid:98) V c (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17)(cid:35) = E (cid:34)(cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) + (cid:98) A ( w,g ) ( (cid:96) ( x )) (cid:35) ≥ V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) . (24)We now explicit the covariance (cid:98) C c (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) , (cid:98) µ ( w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17) in the case we have pairs of units( i, j ), whose joint probability of experiencing the conditions ( w, g ) and ( w (cid:48) , g (cid:48) ), respectively, is zero,that is, π ikjk ( w, g ; w (cid:48) , g (cid:48) ) = 0: 4 C (cid:16)(cid:98) µ w,g ( (cid:96) ( x )) , (cid:98) µ w (cid:48) ,g (cid:48) ( (cid:96) ( x )) (cid:17) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i : π ikjk ( w,g ; w (cid:48) ,g (cid:48) ) > ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) ( W jk = w (cid:48) , G jk = g (cid:48) , X jk ∈ (cid:96) ( x )) π ikjk ( w, g ; w (cid:48) , g (cid:48) ) × [ π ikjk ( w, g ; w (cid:48) , g (cid:48) ) − π ik ( w, g ) π jk ( w (cid:48) , g (cid:48) )] Y ik π ik ( w, g ) Y jk π jk ( w (cid:48) , g (cid:48) ) − N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i : π ikjk ( w,g ; w (cid:48) ,g (cid:48) )=0 (cid:104) ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) Y ik π ik ( w, g )+ ( W ik = w (cid:48) , G ik = g (cid:48) , X ik ∈ (cid:96) ( x )) Y ik π ik ( w (cid:48) , g (cid:48) ) (cid:105) . (25) C Additional Monte Carlo Simulations
We included in the simulation study two additional sets of simulations where (4) we introduce corre-lation between the covariates, and (5) we replace the Erd˝os-R´enyi model for network formation withan exponential random graph (ERGM) model introducing homophily within the clusters. These twoadditional simulations are conducted with 30 clusters and under the first scenario introduced inSection 5, where the heterogeneity is the same for the two causal effects of interest.
C.1 Correlated covariates
In Figure 12 we report the number of correctly detected leaves under low and high correlation(0.25 and 0.5), while in Tables 6 and 7 we report the estimated treatment and spillover effects withtheir standard error in the two heterogeneous leaves, together with the MSE, bias and coverageof the average treatment and spillover effects in the sample. From Figure 12 one can see that thecorrelation between covariates compromises the ability of the algorithm to correctly identify theheterogeneous subgroups. This is due to the fact that, as the covariates become more similar to eachother, it becomes harder for the algorithm to detect the true HDV. Such a problem is common toall tree-based algorithms. Hence, we argue that one should carefully check the correlation patternsbetween the variables to get a sense of the reliability of the discovered subgroups.5 . . . . . Correlation 0.25
Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s . . . . . Correlation 0.50
Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s NCT (composite)NCT (treatment)NCT (spillover)
Figure 12: Simulations’ results for correctly discovered leaves in the first scenario with correlatedcovariatesTable 6: Simulations’ results for the first scenario with correlated covariates (0.25)
Treatment EffectsEffect Size ˆ τ l ˆ se (ˆ τ l ) ˆ τ l ˆ se (ˆ τ l ) MSE Bias Coverage0.1 0.116 0.288 -0.230 0.225 0.071 -0.057 1.0001.1 1.093 0.280 -1.078 0.288 0.080 0.008 0.9462.1 2.084 0.403 -2.067 0.401 0.147 0.009 0.9393.1 3.112 0.553 -3.134 0.555 0.273 -0.011 0.9584.1 4.082 0.692 -4.140 0.699 0.400 -0.029 0.9615.1 5.126 0.843 -5.162 0.852 0.673 -0.018 0.9506.1 6.121 1.007 -6.157 1.008 1.000 -0.018 0.9487.1 7.019 1.161 -7.067 1.165 1.214 -0.024 0.9588.1 8.018 1.309 -8.085 1.309 1.557 -0.033 0.9599.1 9.221 1.509 -9.088 1.491 2.014 0.066 0.95310.1 10.214 1.655 -10.195 1.648 2.441 0.009 0.957 Spillover Effects ˆ δ l ˆ se (ˆ δ l ) ˆ δ l ˆ se (ˆ δ l ) MSE Bias Coverage0.1 0.069 0.185 -0.200 0.194 0.025 -0.065 0.9441.1 1.102 0.227 -1.090 0.227 0.046 0.006 0.9702.1 2.111 0.302 -2.117 0.302 0.078 -0.003 0.9723.1 3.093 0.392 -3.116 0.393 0.112 -0.011 0.9834.1 4.086 0.488 -4.093 0.490 0.169 -0.004 0.9795.1 5.131 0.589 -5.078 0.587 0.246 0.026 0.9716.1 6.071 0.692 -6.051 0.693 0.331 0.010 0.9797.1 7.109 0.801 -7.087 0.802 0.458 0.011 0.9808.1 8.092 0.906 -8.063 0.902 0.577 0.014 0.9769.1 9.110 1.016 -9.060 1.010 0.659 0.025 0.99010.1 10.122 1.123 -10.101 1.124 0.861 0.010 0.980 Treatment EffectsEffect Size ˆ τ l ˆ se (ˆ τ l ) ˆ τ l ˆ se (ˆ τ l ) MSE Bias Coverage0.1 0.129 0.223 -0.101 0.217 0.050 0.014 0.9091.1 1.093 0.268 -1.120 0.272 0.066 -0.014 0.9722.1 2.140 0.392 -2.114 0.378 0.158 0.013 0.9493.1 3.104 0.503 -3.065 0.505 0.241 0.019 0.9464.1 4.123 0.652 -4.188 0.664 0.399 -0.033 0.9605.1 5.081 0.795 -5.108 0.801 0.618 -0.013 0.9536.1 6.138 0.960 -6.146 0.943 0.825 -0.004 0.9597.1 7.154 1.110 -7.063 1.096 1.088 0.045 0.9568.1 8.093 1.227 -8.174 1.257 1.388 -0.041 0.9499.1 9.080 1.380 -9.116 1.389 1.637 -0.018 0.95910.1 10.086 1.529 -10.122 1.520 1.952 -0.018 0.952 Spillover Effects ˆ δ l ˆ se (ˆ δ l ) ˆ δ l ˆ se (ˆ δ l ) MSE Bias Coverage0.1 0.144 0.190 -0.202 0.175 0.030 -0.029 0.8641.1 1.091 0.214 -1.087 0.212 0.047 0.002 0.9522.1 2.103 0.282 -2.137 0.285 0.064 -0.017 0.9753.1 3.119 0.364 -3.072 0.365 0.097 0.024 0.9774.1 4.098 0.458 -4.090 0.458 0.156 0.004 0.9625.1 5.056 0.551 -5.101 0.553 0.218 -0.022 0.9766.1 6.127 0.654 -6.093 0.649 0.302 0.017 0.9807.1 7.093 0.750 -7.130 0.752 0.370 -0.018 0.9888.1 8.162 0.852 -8.038 0.844 0.476 0.062 0.9849.1 9.122 0.948 -9.026 0.943 0.571 0.048 0.98210.1 10.043 1.046 -10.177 1.052 0.698 -0.067 0.982 Nevertheless, for both correlation levels (0.25 and 0.50) the estimator seems to perform wellwithin correctly detected leaves (see Tables 6 and 7).
C.2 Network homophily within the clusters
Table 8 shows the results in the case of network homophily within the clusters. In this case, we findlarger standard errors than the original scenario reported in 3 without homophily. As a consequence,the Monte Carlo MSE is also slightly larger.Moreover, as we can see from Figure 13 there is a decrease in the ability of the algorithm todiscover the true leaves. Indeed, if one compares this Figure with the right panel of Figure 5, onecan see how the correct discovery of the true leaves is slower in the case with homophily network.This is due to the fact that the standard errors are larger than in the original first scenario.7 . . . . . Homophily
Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s NCT (composite)NCT (treatment)NCT (spillover)
Figure 13: Simulations’ results for correctly discovered leaves in the first scenario with homophilynetworkTable 8: Simulations’ results for the first scenario with network homophily within the clusters (30clusters)
Treatment EffectsEffect Size ˆ τ l ˆ se (ˆ τ l ) ˆ τ l ˆ se (ˆ τ l ) MSE Bias Coverage0.1 0.152 0.241 -0.108 0.259 0.045 0.022 1.0001.1 1.069 0.312 -1.094 0.315 0.100 -0.013 0.9532.1 2.130 0.454 -2.119 0.448 0.188 0.006 0.9593.1 3.149 0.608 -3.083 0.606 0.335 0.033 0.9554.1 4.099 0.775 -4.084 0.763 0.477 0.007 0.9595.1 5.115 0.946 -5.116 0.943 0.736 0.000 0.9616.1 6.162 1.127 -6.150 1.127 1.043 0.006 0.9707.1 7.076 1.285 -7.175 1.289 1.377 -0.049 0.9558.1 8.129 1.465 -8.000 1.446 1.752 0.064 0.9579.1 9.029 1.621 -9.136 1.629 2.249 -0.053 0.95310.1 10.163 1.814 -10.133 1.815 2.977 0.015 0.957 Spillover Effects ˆ δ l ˆ se (ˆ δ l ) ˆ δ l ˆ se (ˆ δ l ) MSE Bias Coverage0.1 0.084 0.221 -0.142 0.222 0.036 -0.029 0.9711.1 1.082 0.268 -1.100 0.274 0.069 -0.009 0.9572.1 2.102 0.370 -2.084 0.366 0.120 0.009 0.9643.1 3.116 0.483 -3.112 0.486 0.172 0.002 0.9744.1 4.065 0.603 -4.114 0.610 0.260 -0.025 0.9765.1 5.101 0.740 -5.077 0.737 0.389 0.012 0.9806.1 6.076 0.871 -6.045 0.869 0.538 0.015 0.9797.1 7.088 1.004 -7.054 0.997 0.738 0.017 0.9728.1 8.041 1.132 -8.150 1.142 0.936 -0.055 0.9719.1 9.157 1.275 -9.030 1.263 1.070 0.064 0.97610.1 10.066 1.408 -10.093 1.407 1.391 -0.014 0.982) MSE Bias Coverage0.1 0.084 0.221 -0.142 0.222 0.036 -0.029 0.9711.1 1.082 0.268 -1.100 0.274 0.069 -0.009 0.9572.1 2.102 0.370 -2.084 0.366 0.120 0.009 0.9643.1 3.116 0.483 -3.112 0.486 0.172 0.002 0.9744.1 4.065 0.603 -4.114 0.610 0.260 -0.025 0.9765.1 5.101 0.740 -5.077 0.737 0.389 0.012 0.9806.1 6.076 0.871 -6.045 0.869 0.538 0.015 0.9797.1 7.088 1.004 -7.054 0.997 0.738 0.017 0.9728.1 8.041 1.132 -8.150 1.142 0.936 -0.055 0.9719.1 9.157 1.275 -9.030 1.263 1.070 0.064 0.97610.1 10.066 1.408 -10.093 1.407 1.391 -0.014 0.982