[PDF] Heterogeneous Treatment and Spillover Effects under Clustered Network Interference

Abstract

The bulk of causal inference studies rules out the presence of interference between units. However, in many real-world settings units are interconnected by social, physical or virtual ties and the effect of a treatment can spill from one unit to other connected individuals in the network. In these settings, interference should be taken into account to avoid biased estimates of the treatment effect, but it can also be leveraged to save resources and provide the intervention to a lower percentage of the population where the treatment is more effective and where the effect can spill over to other susceptible individuals. In fact, different people might respond differently not only to the treatment received but also to the treatment received by their network contacts. Understanding the heterogeneity of treatment and spillover effects can help policy-makers in the scale-up phase of the intervention, it can guide the design of targeting strategies with the ultimate goal of making the interventions more cost-effective, and it might even allow generalizing the level of treatment spillover effects in other populations. In this paper, we develop a machine learning method that makes use of tree-based algorithms and an Horvitz-Thompson estimator to assess the heterogeneity of treatment and spillover effects with respect to individual, neighborhood and network characteristics in the context of clustered network interference. We illustrate how the proposed binary tree methodology performs in a Monte Carlo simulation study. Additionally, we provide an application on a randomized experiment aimed at assessing the heterogeneous effects of information sessions on the uptake of a new weather insurance policy in rural China.

Full PDF

HHeterogeneous Treatment and Spillover Eﬀectsunder Clustered Network Interference

Falco J. Bargagli-Stoﬃ ∗ , Costanza Tort´u* , and Laura Forastiere † IMT School for Advanced Studies Lucca Katholieke Universiteit Leuven Yale University

Abstract

The bulk of causal inference studies rules out the presence of interference between units.However, in many real-world scenarios units are interconnected by social, physical or virtualties and the eﬀect of a treatment can spill from one unit to other connected individuals inthe network. In these settings, interference should be taken into account to avoid biasedestimates of the treatment eﬀect, but it can also be leveraged to save resources and providethe intervention to a lower percentage of the population where the treatment is more eﬀectiveand where the eﬀect can spill over to other susceptible individuals. In fact, diﬀerent peoplemight respond diﬀerently not only to the treatment received but also to the treatment receivedby their network contacts. Understanding the heterogeneity of treatment and spillover eﬀectscan help policy-makers in the scale-up phase of the intervention, it can guide the design oftargeting strategies with the ultimate goal of making the interventions more cost-eﬀective, andit might even allow generalizing the level of treatment spillover eﬀects in other populations. Inthis paper, we develop a machine learning method that makes use of tree-based algorithms andan Horvitz-Thompson estimator to assess the heterogeneity of treatment and spillover eﬀectswith respect to individual, neighborhood and network characteristics in the context of clusterednetwork interference. We illustrate how the proposed binary tree methodology performs ina Monte Carlo simulation study. Additionally, we provide an application on a randomizedexperiment aimed at assessing the heterogeneous eﬀects of information sessions on the uptakeof a new weather insurance policy in rural China.

Keywords: causal inference; potential outcomes; interference; social networks; machinelearning; heterogeneous eﬀects

We are grateful for conversations with Andrea Canidio, Kenan Huremovi´c, Julie Josse, Fabrizia Mealli, KarimLounici, Michael Lechner, Hyun Min Kang, Stefan Wager, and for comments to participants at the “2020 Causalinference Opening Workshop (SAMSI Institute), at the “European Cooperation for Statistics and Network Data Sci-ence” (COSTNET) workshop, at the “Causal Machine Learning Workshop” organised by the University of St.Gallen,at the reading groups on “Causal Inference and Machine Learning” at Harvard University and on “Machine Learningand Networks in Economics” at IMT School for Advanced Studies Lucca, and to members of the

Missing Values andCausality Research Group at the ´Ecole Polytechnique. ∗ Equally contributing co-ﬁrst authors. † Corresponding author: [email protected] a r X i v : . [ s t a t . M E ] S e p Introduction

According to Cox (1958), there is interference between diﬀerent units if the outcome of one unitis aﬀected by the treatment assignment of other units. In the case of policy interventions or socio-economic programs, interference may arise due to social, physical or virtual interactions. For in-stance, in the case of infectious diseases, unprotected individuals can still beneﬁt from public healthmeasures undertaken in the rest of the population, such as vaccinations or preventive behaviors,because these interventions reduce the reservoir of infection (Nichol et al.; 1995; Bridges et al.;2000), the vector of transmission (Binka et al.; 1998; Howard et al.; 2000) and the number of sus-ceptible individuals (Broderick et al.; 2008; Kelso et al.; 2009; Kissler et al.; 2020). In labor market,job placement assistance can aﬀect job seekers using this service, but it can also have an eﬀect onother job seekers who are competing on the same labor market (McKenzie and Puerto; 2015). Ineducation, learning programs may spill over to untreated peers through knowledge transmissionpaths (Forastiere et al.; 2019; Schiltz et al.; 2019). In marketing, the exposure to an advertisementmight directly aﬀect the consuming behavior of the exposed individuals, and indirectly aﬀect otherindividuals that are inﬂuenced by the consuming choices of people in their social network (Par-shakov et al.; 2020). If the exposure to the advertisement takes place in social media, the spillovereﬀect might be also explained by the propagation of the advertisement to non-exposed users thatare virtually connected to the targeted ones (Chae et al.; 2017).In the presence of interference, the eﬀect of the treatment status of other units on one’s outcomeis usually referred to as spillover eﬀect . In economics and social sciences there has been increasinginterest in estimating spillover eﬀects on networks in many diﬀerent contexts: Duﬂo and Saez (2003)study the role of information and social interactions in retirement plan decisions in the academiccommunity; Cai et al. (2015) assess the spillover eﬀect of training sessions on uptake of weatherinsurance in rural areas of China; Muralidharan and Sundararaman (2015) study the aggregateeﬀects of school choices while Imai et al. (2020) evaluate the eﬀects of the Indian National healthinsurance.Spillovers are a crucial component in understanding the full impact of an intervention at thepopulation-level. In fact, the scale-up phase of a policy requires knowledge about the mechanism1f spillover, and how this would take place in the population where the intervention will be rolledout. Information about spillovers of public policies can also support decisions about how best todeliver interventions, and could be used to guide public funds allocation. Indeed, the presenceof beneﬁcial spillover eﬀects allows treating a lower percentage of the population, because theuntreated individuals would still beneﬁt from the treatment provided to the targeted sample. Theuse of spillover eﬀects to save resources could be further improved if we were able to target speciﬁcindividuals who would increase the overall impact of the intervention.A targeting (or seeding) strategy aims at delivering the intervention to certain individualssuch that the impact on the overall population is maximized (e.g., Valente; 2012; Kim et al.;2015; VanderWeele and Christakis; 2019; Montes et al.; 2020). Typically, seeding strategies aredesigned in settings where either an element of the intervention (e.g., information, ﬂyers or couponsprovided during the intervention) or the outcome (e.g., the adoption of a behavior or a product)diﬀuse through the network. In these settings, the goal is the identiﬁcation of the set of nodes inthe network that, if targeted, would maximize contagion or diﬀusion cascades. To do so, seedingstrategies are designed using information on the network structure and the dynamics of contagion ordiﬀusion. This ‘inﬂuence maximization’ problem is NP-hard and computer scientist have developedapproximate algorithms that usually rely on simpliﬁed contagion processes (Kempe et al.; 2003).Indeed, it is typically assumed that susceptibility to direct treatment and to others, as well as theinﬂuential power, are homogeneous across individuals.Here we take a diﬀerent perspective. First, we investigate spillover eﬀects of a unit’s treatmenton other units’ outcome without specifying the mechanism through which this might take place.Second, we focus on the assessment of the heterogeneity of susceptibility to direct and indirecttreatment. Understanding these heterogeneities can guide the design of targeting strategies aimedat making the interventions more cost-eﬀective. When spillovers are not present, this can be achievedby targeting those with the highest treatment eﬀect. In fact, in the ﬁeld of personalized medicine it iswell known that individuals with diﬀerent characteristics might respond diﬀerently to the treatment(e.g., Murphy; 2003; Chakraborty and Murphy; 2014; Kosorok and Laber; 2019). In the presence ofinterference, we also have that diﬀerent people might be more or less susceptible to the treatmentreceived by other units. This means that not only the treatment eﬀect but also spillover eﬀects areheterogeneous. An assessment of the heterogeneity of treatment and spillover eﬀect is crucial not2nly for designing a cost-eﬀective roll-out of the intervention within the targeted population, butit can also allow a generalization of its impact to other populations.

In the past decade causal inference literature has experienced a growing interest in the mechanismof interference, leading to (i) the assessment of bias for causal eﬀects estimated under the no-interference assumption (Sobel; 2006; Forastiere et al.; 2020), (ii) the design of experiments to eitheravoid or assess interference (Angelucci and Di Maro; 2015; Baird et al.; 2018; Kang and Imbens;2016), and (iii) the estimation of casual eﬀects under interference. Estimators for treatment andspillover eﬀects have ﬁrst been developed under the assumption of partial interference, allowinginterference within groups but not across diﬀerent groups (Hudgens and Halloran; 2008; Tchetgenand VanderWeele; 2012; Liu and Hudgens; 2014; Liu et al.; 2016; Forastiere et al.; 2016; Basse andFeller; 2018; Forastiere et al.; 2019). However, the assumption of group-level interference is ofteninvalid, too broad or not applicable. Hence, several works focus on the estimation of causal eﬀectsin the context of units interconnected on networks, both in randomized experiments (Aronow andSamii; 2017; Athey et al.; 2018; Leung; 2020) and in observational studies (Ogburn et al.; 2017;Sofrygin and van der Laan; 2017; Forastiere et al.; 2018, 2020). In the context of social networks,even in randomized experiments where the treatment is randomized at the unit-level, exposureto other units’ treatment is not. Therefore the propensity of being exposed to diﬀerent levels oftreatments among network contacts will depend on the network structure. Aronow and Samii (2017)developed a Horvitz-Thompson estimator to adjust for the imbalance in this propensity across unitsunder diﬀerent individual and contacts’ treatment status.In parallel to this ﬁeld of research on interference, in recent year, thanks to the availability ofincreased computing power and large data sets, researchers have started to think about advanceddata-driven methods to assess the heterogeneity of treatment eﬀects with respect to large numbersof features. In this regard, there has been a large number of contributions in causal inferenceon subgroup analysis to investigate heterogeneous eﬀect (e.g., Assmann et al.; 2000; Cook et al.;2004; Rothwell; 2005; Su et al.; 2009; Varadhan and Seeger; 2013; Ratkovic and Tingley; 2017).However, the standard methods for subgroup analysis have several drawbacks. In particular, (1)they strongly rely on the subjective decisions on the speciﬁc variables deﬁning the heterogeneous3ub-populations; and (2) they fail to discover heterogeneities other than the ones that are a priori deﬁned by the researchers. In addition, a data-driven method avoids potential problems relatedto cherry-picking the subgroups with extremely high/low treatment eﬀects (Assmann et al.; 2000;Cook et al.; 2004). Hence, many data-driven algorithms for the estimation of heterogeneous causaleﬀects have been proposed in recent years (Hill; 2011; Foster et al.; 2011; Su et al.; 2012; Athey andImbens; 2016; Wager and Athey; 2018; Athey et al.; 2019; Lechner; 2018; Hahn et al.; 2020). Theunderlying aim of these methods is to detect ‘causal’ rules deﬁning subsets of the study populationwhere the treatment eﬀect for that subgroup deviates from the average treatment eﬀect. Thisis done by selecting the most important features and their values that deﬁne a partition of thecovariate space (the tree) where the treatment eﬀect is ‘signiﬁcantly’ heterogeneous. Among thesealgorithms, many rely on extensions of the standard Classiﬁcation and Regression Trees (CART)algorithm (Friedman et al.; 1984) and Random Forest (RF) algorithm (Breiman; 2001), and areadapted to diﬀerent settings (Athey and Imbens; 2016; Zhang et al.; 2017; Lee et al.; 2018; Bargagli-Stoﬃ and Gnecco; 2018; Guber; 2018; Johnson et al.; 2019; Bargagli-Stoﬃ and Gnecco; 2020). Inparticular, in their seminal contribution Athey and Imbens (2016) develop the honest causal tree(HCT) methodology. HCT is a causal decision tree algorithm that is aimed at discovering theheterogeneity in causal eﬀects through a single binary tree. HCT is constructed using a criterionfunction aimed at maximising the heterogeneity in causal eﬀects at each split, while penalizingsplits with higher variance in the estimated conditional eﬀects. In addition to a modiﬁcation of thecriterion function, Athey and Imbens (2016) introduce the concept of honest splitting . The authorspropose to split the overall learning sample into a discovery (or training) set and an estimation set. The former is used to discover the heterogeneity in treatment eﬀects, while the latter is usedto estimate these eﬀects in the discovered sub-populations. The diﬀerent role of the two sets avoidspotential spurious heterogeneity discovery due to overﬁtting of the learning sample. Building onthe HCT, Wager and Athey (2018) and Athey et al. (2019) introduce the causal forest (CF) andgeneralized random forest (GRF). Both CF and GRF are ensemble methods where multiple treesare built and combined to improve inference on the heterogeneous causal eﬀects.HCT and similar tree-based methodologies have already been applied to various scenarios for For a recent review the reader can refer to Athey and Imbens (2019). The criterion function is the main diﬀerence between HCT and CART. Indeed, CART’s criterion function isaimed at minimizing the empirical predictive error at each split.

In this paper, we bridge the gap between the aforementioned two bodies of causal inference literatureby proposing a new algorithm for the discovery and estimation of heterogeneous treatment andspillover eﬀects with respect to a large number of characteristics, including individual and networkfeatures. Our method is designed for randomized experiments aﬀected by the presence of clustered5etwork interference, that is, units are organized in a clustered structure, with no interactionsbetween clusters and a network of connections within clusters (e.g., friendship networks withinschools). In addition, randomization is assumed at the individual-level, resulting in treated anduntreated units in the same cluster. Under this setting, spillover eﬀects are conﬁned within clustersand are assumed to take place on network interactions.Our proposed method, network causal tree (NCT), builds upon the causal trees proposed byAthey and Imbens (2016), by modifying the splitting criterion to target treatment and spillovereﬀects under interference. Splits are made so as to maximize the heterogeneity of the targetedcausal eﬀect(s), treatment and/or spillover eﬀects, across the population. This criterion relies onthe unbiasedness of the estimator of the eﬀect(s) within each subset of the population. We ﬁrstcontribute to the existing literature on interference by developing an unbiased estimator for con-ditional treatment and spillover eﬀects. We extend the Horvitz-Thompson estimator in Aronowand Samii (2017) to conditional causal eﬀects under clustered network interference and we proveits consistency under the clustered network setting. This estimator is then used in our networkcausal trees to choose the binary splits that maximize the heterogeneity and to ﬁnally estimate theheterogeneous causal eﬀects within the selected sub-populations.In order to use the selected partition of the covariate space and the estimated treatment andspillover eﬀects to guide policies, the heterogeneous sub-populations should be identiﬁed basedon the causal eﬀects that will be part of the decision rule. For instance, a policy that assignsthe treatment to those who beneﬁt the most from it requires the discovery of the subsets of thepopulations with heterogeneous treatment eﬀect. Alternatively, a policy that is designed to targetthose who will respond to both their own treatment and the neighbors’ treatment must identifysub-populations with high treatment and spillover eﬀects. Hence, the proposed NCT methodologyis optimized to detect the heterogeneity in treatment and spillover eﬀects (i) either simultaneously ,or (ii) separately . By reworking the criterion function of the seminal causal trees algorithm to ac-count for interference, we allow the algorithm to detect heterogeneity in treatment and/or spillovereﬀects. The discovery of the causal rules (the variables and their values deﬁning heterogeneous sub-populations) representing the heterogeneity of one causal eﬀect (treatment or spillover) is achievedby using a splitting criterion that maximizes the heterogeneity of that speciﬁc causal eﬀect acrossthe partition. On the contrary, if our goal is to identify a partition of the covariate space that can6xplain the heterogeneity of multiple causal eﬀects, we propose the use of a composite splitting func-tion that is designed to simultaneously maximize the heterogeneity in all the eﬀects. This ﬂexibilityallows scholars and policy-makers to customize their investigations depending on their targetinggoals. For instance, if a policy-makers wants to target individuals with the highest treatment eﬀectand the lowest spillover eﬀect (with the motivation that those with higher spillover eﬀects canbeneﬁt from the treatment received by others), the NCT algorithm should be implemented witha composite splitting function aimed at detecting subsets of the population where both treatmentand spillover eﬀects are heterogeneous and the decision criterion can be applied. Conversely, if atargeting strategy is designed to target just individuals who would beneﬁt the most from receivingthe treatment, regardless of other people’s assignment, a tree would be built using a single splittingcriterion targeted to maximize the treatment eﬀect heterogeneity. Similarly, a single criterion tar-geted to a spillover eﬀect would be used in the case of targeting rules only involving that spillovereﬀect.It is important to note that the use of our algorithm to design implementation strategies ispossible thanks to its high level of interpretability. Our network causal trees provide interpretableinference on heterogeneous treatment and spillover eﬀects by discovering a set of causal rules thatcan be represented through a binary tree. As argued by Lee et al. (2018) and Lee et al. (2020) itis important to provide interpretable information on simple causal rules that can be targeted toimprove policy eﬀectiveness and to ensure that stakeholders and policy-makers understand (and, inturn, trust) the functionality of these models. Valdes et al. (2016) claim that a learning algorithmis interpretable if one can explain its classiﬁcation by a conjunction of conditional statements (i.e.,if-then rules). In this regard, tree-based algorithms based on if-then rules, such as the proposedNCT, are optimal for interpretability.To assess the performance of the proposed NCT algorithm, we run a series of Monte Carlosimulations. In particular, we investigate the performance of the proposed algorithm with respect totwo dimensions: its ability (i) to correctly identify the actual heterogeneous sub-populations and, (ii)to precisely estimate the conditional treatment and spillover eﬀects. While the latter performanceassessment is quite standard in the literature, the former is critical for interpretable algorithmsfor heterogeneous causal eﬀects (Bargagli-Stoﬃ et al.; 2019). Finally, we apply the proposed NCTalgorithm to a randomized experiment conducted in China to assess the impact of information7essions on the purchase of a new weather insurance policy (Cai et al.; 2015). Besides estimatingthe population average treatment and spillover eﬀects (as already investigated in Cai et al. (2015)),our aim is to detect the strata of the population where one or both eﬀects are heterogeneous andestimate these eﬀects within these strata.The remainder of the paper is organized as follows. In Section 2 we introduce the notation, set-ting, and assumptions that we employ throughout the paper. In Section 3 we deﬁne the conditionalcausal eﬀects in a general partition of the covariate space and develop a Horvitz-Thompson estima-tor. Section 4 presents the proposed network causal tree algorithm, which is based on eﬀect-specifor composite splitting functions for causal eﬀects under interference. We then conduct a simulationstudy to assess the performance of the algorithm and estimator under diﬀerent scenarios in Section5 and we illustrate the application of the network causal tree on a randomized experiment in Sec-tion 6. Section 7 concludes the paper with a discussion of the proposed algorithm and directionsfor further research.

Let us consider a sample V of N units organized in K separate clusters. Let k ∈ K = [1 , . . . , K ] bethe cluster indicator and let i = 1 , . . . , n k be the unit indicator in each cluster k . Let now consider aconnection structure such that units belonging to the same cluster might share a link whereas unitsbelonging to diﬀerent clusters are not connected. This network structure is represented by the graph G = ( V, E ), where V deﬁnes the set of nodes and E deﬁned the set of edges, that is, the collectionof links between each connected pair of nodes. A clustered network G is in turn an ensemble of K disjoint sub-graphs: G k = ( V k , E k ) , k = 1 , . . . , K . The adjacency matrix A corresponding to thegraph G , is a block-diagonal matrix with K blocks, A k , k : 1 , . . . , K , where each element a ij,k isequal to 1 if there is a link between unit i and unit j in cluster k , that is, if the edge (cid:15) ij,k ∈ E k .Elements in A oﬀ the K blocks are equal to zero, indicating no links between units belonging todiﬀerent clusters.Let now W ik ∈ { , } be a binary variable representing the treatment assigned to unit i incluster k and let Y ik be the observed outcome. We denote by W k and Y k the treatment and8utcome vectors in each cluster k . Similarly, W and Y denote the treatment and outcome vectorsin the whole sample. It is worth noting that the treatment status is not the same for all unitsbelonging to the same cluster. This treatment allocation corresponds to a unit-level randomizationwhere treatment is assigned to each unit of a cluster independently (as in a Bernoulli trial) or withsome level of dependency (as in a completely randomized trial) based on a probability distribution P ( W k ) (see Section 2.3 for further details). Moreover, for each unit ik we observe a vector X ik of P covariates (or pre-treatment variables) that are not inﬂuenced by the treatment assignment. Thevector of covariates might include individual characteristics (e.g., age, sex, socio-economic status,...), cluster-level characteristics (e.g., cluster size, location, ...), as well as network characteristicsrepresenting aggregated individual characteristics (e.g., average age or proportion of males andfemales, ...) or the network topology (e.g., degree, centrality, transitivity, ...).Figure 1 provides a graphical intuition on the clustered network structure and treatment assign-ments at the unit-level. Edges indicate links between units, within each cluster. Colors refer to theindividual treatment assignment: grey colored nodes represent treated units, while white coloredvertices indicate units assigned to the control group.Figure 1: Clustered Network Structure Following the potential outcome framework (Rubin; 1974; Holland; 1986), we denote by Y ik ( w )the potential outcome that unit i in cluster k would experience if the treatment vector W in the9hole sample were w , with w ∈ { , } N . Under the assumption of no-interference, the potentialoutcome could be indexed only by the individual treatment assignment W ik , that is, Y ik ( W ik = w ).In combination with the assumption of consistency , this assumption is known as Stable UnitsTreatment Value Assumption (SUTVA)(Rubin; 1986). The no-interference assumption is clearlyviolated in many real-world scenarios. For instance, the evaluation of the eﬀect of vaccines shouldtake into account the fact that unprotected individuals would also beneﬁt from high vaccinationcoverage around them.Here, we focus on a particular type of interference: clustered network interference . As we willshow later, focusing on this type of interference is critical to ensure asymptotic properties of theestimator for conditional causal eﬀects as well as to allow the network causal tree to divide thesample into a training (or discovery) set and an estimation set (and a testing set, if applicable). The assumption of clustered network interference implies that: i) interference is restricted to nodesof the same cluster and interference between clusters is ruled out, that is, one’s outcome is onlyaﬀected by the treatment received by units belonging to the same cluster; ii) one’s outcome isaﬀected by a weighted function of the treatment status of potentially all units in their own cluster,with weights depending on the presence and possibly the value of network links.Let W k/i be the vector collecting the treatment status of all units in cluster k except unit i .Let g ( · ) : { , } n k − −→ ∆ ik be a function that maps a cluster assignment vector W k/i to anexposure value. Without loss of generality, we deﬁne it as a function of the dot product betweenthe cluster assignment vector and a vector of weights δ i ( A k , X k ), which in turn depends on theadjacency matrix A k and the covariate matrix X k , i.e., g ( W k/i , δ i ( A k , X k )) = f ( W k/i · δ i ( A k , X k )).For instance, the function g ( · ) could result in the number or proportion of treated units in a cluster.In this case the weight vector would be equal to δ i ( A k , X k ) = n k − or δ i ( A k , X k ) = ( n k − ) n k − ,respectively. Alternatively we could use the adjacency matrix to compute the geodesic distance d ( i, j ) between each pair of nodes in cluster k and let g ( W k/i , δ i ( A k , X k ) = (cid:80) n k − j =1 W jk d ( i,j ) . Thefunction g ( · ) is similar to the ‘eﬀective treatments’ function in Manski (2013) and the ‘exposuremapping’ function in Aronow and Samii (2017), although it applies to the cluster treatment vectoronly. To ease notation, throughout we will omit the weight vector δ i ( A k , X k ) in the function g ( · ). Alternatively, network causal trees could also be extended to the case of one single network as long as theamount of dependency is limited to ensure the consistency of the estimator. Furthermore, to divide the sample intothe diﬀerent sets needed for the causal tree algorithm, a community detection algorithm could be used to identifyseparate and densely connected communities

10e can now formalize the clustered network interference assumption as follows.

Assumption 1 (Clustered Network Interference) . Given a function g ( · ) : { , } n k − −→ ∆ ik , ∀ k ∈ K , ∀ i ∈ V k , and ∀ W , W (cid:48) ∈ { , } N such that W ik = W (cid:48) ik , g ( W k/i ) = g ( W (cid:48) k/i ) , the followingequality holds: Y ik ( W ) = Y ik ( W (cid:48) ) . Assumption 1 states that the outcome of a unit i in cluster k depends on the individual treat-ment W ik and a function of the treatment status of the other members of cluster k , i.e., g ( W k/i ),regardless of the speciﬁc treatment status of each member. This assumption can be viewed as an in-termediate assumption between (i) assuming no interference and (ii) making no assumptions aboutthe nature of interference. In a way, it is similar to the partial interference or the stratiﬁed interfer-ence in Hudgens and Halloran (2008), which are special cases of the clustered network interferenceassumption, with g ( W k/i ) = W k/i and g ( W k/i ) = (cid:80) n k j =1 W ij .Let G ik = g ( W k/i ), referred to throughout as network exposure . Under Assumption 1, eachunit has | ∆ ik | × Y ik ( w, g ), representing the potential outcome of unit ik under W ik = w and G ik = g ( W k/i ) = g .We also assume the following consistency assumption: Assumption 2 (Consistency) . Y ik = Y ik ( W ik , G ik ) . This assumption rules out diﬀerent versions of the treatment and diﬀerent ways in which a value ofthe network exposure can aﬀect the outcome of a particular unit. Under a ‘ﬁnite sample perspective’,we assume the potential outcomes of each unit to be ﬁxed but unknown, except for the observed Y ik ( W ik , G ik ). Therefore, the only source of randomness in the potential outcomes is given by therandom assignment to the treatment and the random network exposure induced by the randomcluster assignment.Assumptions 1 and 2 together are alternative to SUTVA when interference is present and islimited to within clusters. When the weight function δ i ( A k , X k ) is such that elements δ ij ( A k , X k ) =0 if j ∈ V k : a ij,k = 0, that is, units that are not directly connected to unit i receive a weight equal tozero, then interference is limited to the neighborhood N ik of each unit, with N ik = { j ∈ V k : a ij,k =1 } . In this case, Assumptions 1 and 2 correspond to the SUTNVA Assumption in Forastiere et al.112020). We denote by N gik the set of units deﬁning the network exposure, that is, N gik = { j ∈ V k :if W (cid:48) jk = W jk then g ( W (cid:48) k/i ) = g ( W k/i ) , ∀ W (cid:48) k/i (cid:54) = W k/i } = { j ∈ V k : δ ij ( A k , X k ) (cid:54) = 0 } . In most ofthe literature on spillover eﬀects this set is either the cluster k (Hudgens and Halloran; 2008) orthe neighborhood of unit i (Forastiere et al.; 2020). Alternative speciﬁcations are also possible andmight involve higher-order neighbors.For the purpose of assessing eﬀect heterogeneity using tree-based methods, we will further makethe following assumption: Assumption 3 (Discrete Network Exposure) . We consider a discrete exposure mapping function g ( · ) : { , } n k − −→ ∆ ik ⊂ Z . Z is the set of integers. This assumption implies that the network exposure G ik is a discrete variableFor instance, we can deﬁne a binary network exposure based on a threshold function applied to thenumber of treated neighbors: G ik = (cid:18)(cid:0) (cid:88) j ∈N ik W jk (cid:1) ≥ q (cid:19) , (1)where q is a threshold. Hence, the g ( · ) exposure mapping function behaves like a threshold functionwhich sums the elements of its argument (i.e the treatment assignment vector that characterizes theneighborhood of each unit) taking the value of 1 if the resulting value exceeds a certain threshold(e.g., at least one treated neighbor is treated, the majority of the neighbors are treated, ...). Inour simulation study as well as in the application we have chosen the following deﬁnition: G ik = ( (cid:0)(cid:80) j ∈N ik W jk ) ≥ (cid:1) , that is, the network exposure is 1 if at least one network neighbor is treated.As a consequence, both the individual treatment and the network exposure are deﬁned as binaryvariables, W ik ∈ { , } and G ik ∈ { , } . It follows that the support of the joint treatment variable( W ik , G ik ) is ﬁnite and comprises four possible realizations, given by the combination of the twomarginal domains. Hence, ( W ik , G ik ) ∈ { ( w, g ) = (0 , , (1 , , (0 , , (1 , } .A discrete network exposure is crucial for our causal tree algorithm, at least in the versionproposed in this paper. Indeed, the algorithm relies on the presence of enough observations foreach treatment and exposure value to allow the estimation of the causal eﬀects. Depending onthe stopping rule which might rely on the accuracy of the estimation of conditional eﬀects or onthe number of observations (see Section 4), if the sample size is not large enough with respect tothe number of categories of the network exposure and/or its distribution is non-uniform and highly12kewed, the network causal tree algorithm might result in a tree with low depth and low granularity,that is, with highly heterogeneous causal eﬀects even within the terminal leaves. Therefore, themaximum number of categories for the network exposure depend on the sample size, the numberof covariates and their nature, as well as on the extent of the heterogeneity in the causal eﬀects. In this work, we consider an experimental design with a unit-level randomization of the treatment,which is independent between clusters but might be dependent within them. Therefore, the treat-ment vector W is a random vector with probability distribution P ( W = w ) and the followingassumption holds. Assumption 4 (Independent treatment allocation between clusters) . P ( W = w ) = K (cid:89) k =1 P ( W k = w k ) where W k is the treatment vector in each cluster k . We denote by π Wik the unit-level probability that W ik is equal to 1, under the experimentaldesign in place. In a randomized experiment π Wik is known. In the case of a Bernoulli trial, whereeach unit is independently assigned to the individual treatment, π Wik is constant and equal to α . An example of a design with randomization independent between clusters but dependent withinclusters is that of a completely randomized experiment taking place in each cluster. In this case, π Wik would be equal to m k /n k , where m k is the ﬁxed number of treated units, and the treatmentassignment for each unit does depend on the treatment status of other units.Since the network exposure is a deterministic function g ( · ) of the cluster assignment vector W k/i , then the randomization distribution P ( W = w ) induces, together with the deﬁnition ofthe function g ( · ), a probability distribution of the vector of network exposures G in the wholesample. Hence, the probability for a unit of being exposed to a specif value of the network exposure G ik = g given the individual treatment w , denoted by π G | Wik ( g | w ), is known and can in principle becomputed from the probability distribution P ( W ). Note that, we can drop the dependency fromthe individual treatment and write π Gik ( g ) when the randomization is independent between units. The unit-level assignment probability could also vary across clusters as in a two-stage randomization (Hudgensand Halloran; 2008). { , } × (cid:83) ik ∈ V ∆ ik be the domain of the joint individual and network treatment status,that is, ( w, g ) ∈ ∆. Let π ik ( w, g ) denote the marginal probability for unit ik of being assigned toindividual treatment w and being exposed to the network status g . This is equal to the expectedproportion of assignment vectors w inducing an individual treatment w and a network exposure g : π ik ( w, g ) = (cid:88) w ∈{ , } N ( W ik = w, G ik = g ) P ( W = w ) = ( π Wik ) w (1 − π Wik ) − w × π G | Wik ( g | w ) . (2)This marginal probability is a crucial component of the Horvitz-Thompson estimator for causaleﬀects under network interference. For instance, if the experimental design is a Bernoulli trialwith unit-level probability α and the network exposure is deﬁned by a threshold function on theneighborhood as in Equation 1, then the joint probability could be computed as follows: π ik ( w, g ) = α w (1 − α ) − w × (cid:20) − h − (cid:88) l =0 (cid:18) N ik l (cid:19) p l (1 − p ) N ik − l (cid:21) g (cid:20) q − (cid:88) l =0 (cid:18) N ik l (cid:19) p l (1 − p ) N ik − l (cid:21) − g (3)where N ik is the number of neighbors (‘degree’) of unit ik .To deal with well-deﬁned potential outcomes, we must assume that each unit has a non-zeroprobability of being exposed to each ( w, g ): Assumption 5 (Positivity) . π ik ( w, g ) > ∀ i ∈ N , k ∈ K and ∀ ( w, g ) ∈ ∆ . When π ik ( w, g ) = 0 for some units, then the average potential outcomes and causal eﬀects involvingthese values z and g must be restricted to the subset of units for whom π ik ( w, g ) >

0. For instance,if the network exposure is deﬁned as in Equation 1, then the positivity assumption is violated forunits who cannot be exposed to a value g , that is, those with a degree N ik lower than the threshold q . Consequently, the analysis must be restricted only to the subset of the population satisfying thepositivity criterion.The estimator that we propose below also requires the so-called pairwise exposure probabilities ,which describes the joint probability for pairs of units of being exposed to a given individualtreatment and network status. Hence, given speciﬁc exposure conditions ( w, g ) and ( w (cid:48) , g (cid:48) ) , apairwise exposure probability, denote by π ikjh ( w, g ; w (cid:48) , g (cid:48) ), quantiﬁes the probability that the twoevents ( W ik = w, G ik = g ) and ( W jh = w (cid:48) , G jh = g (cid:48) ) occur, i.e., π ikjh ( w, g ; w (cid:48) , g (cid:48) ) = P ( W ik =14 , G ik = g, W jh = w (cid:48) , G jh = g (cid:48) ). In general, this can be written as: π ikjk (cid:48) ( w, g ; w (cid:48) , g (cid:48) ) = (cid:88) w ∈{ , } N ( W ik = w, G ik = g, W jk (cid:48) = w (cid:48) , G jk (cid:48) = g (cid:48) ) P ( W = w ) . (4)Under the event of both units being exposed to the same condition ( w, g ) we denote the pairwiseexposure probability by π ikjh ( w, g ).In the case of an experimental design assigning treatment independently between clusters, underthe clustered network interference the two events ( W ik = w, G ik = g ) and ( W jh = w (cid:48) , G jh = g (cid:48) ),with k (cid:54) = h , are independent and the pairwise exposure probability equals the product of the twojoint probabilities: π ikjh ( w, g ; w (cid:48) , g (cid:48) ) = π ik ( w, g ) × π jh ( w (cid:48) , g (cid:48) ). In the case of a Bernoulli trial andthe network exposure deﬁned on the neighborhood only, this is also true for units i and j belongingto the same cluster, i.e., k = h , but are not connected and do not share any neighbors. In general,let N wgik = N gik ∪ ik . Even under independent treatment assignment, if N wgik ∩ N wgjk (cid:48) (cid:54) = 0 the jointtreatment of the units ik and jk (cid:48) will be dependent and π ikjk (cid:48) ( w, g ; w (cid:48) , g (cid:48) ) (cid:54) = π ik ( w, g ) π jk (cid:48) ( w (cid:48) , g (cid:48) ) (cid:54) =0. Note that if π ik ( w, g ) or π jh ( w (cid:48) , g (cid:48) ) = 0 ⇒ π ikjh ( w, g ; w (cid:48) , g (cid:48) ) = 0, but not the reverse. Indeed,the joint probability of the two events ( W ik = w, G ik = g ) and ( W jh = w (cid:48) , G jh = g (cid:48) ) might bezero if the network exposures G ik and G jh are deﬁned on two subsets of units that coincide orinclude jh and ik , respectively. For example, if the network exposure is deﬁned as in Equation 1with threshold equal to 1 (i.e., having at least one treated neighbor) then, if unit ik is treated, with W ik and belongs to the neighborhood N jh of unit jh , the network exposure G jh cannot be 0. Our ultimate goal is to detect the regions of the covariate space exhibiting a high level of het-erogeneity in the causal eﬀects and estimate the causal eﬀects of interest in these heterogeneousregions. In this section, we will focus on the deﬁnition and estimation of conditional treatmentand spillover eﬀects and we will assume that the heterogeneous regions that we want to investigatehave already been identiﬁed, either a priori according to subject-matter knowledge or thanks to15ata-driven methods.Let us denote with Π a partition of the covariate space X into M non-overlapping regions:Π = { (cid:96) , . . . , (cid:96) M } , where (cid:83) Mm =1 (cid:96) m = X , and with (cid:96) ( x , Π) :

X →

Π a function that maps each vector x of the covariate space into a region. Let N ( (cid:96) m ) be the size of each region (cid:96) m , with m = 1 , . . . , M ,and let V k ( (cid:96) m ) be the subset of units belonging to region (cid:96) m in cluster k , with k = 1 , . . . , K . Inthe machine learning literature on CART, these non-overlapping regions are referred to as leaves .For consistency, throughout we will use this terminology, regardless of whether the partition Π hasbeen a priori deﬁned or is the result of a tree-based algorithm. In addition, to ease notation, wewill drop the reference to the partition Π from the mapping function (cid:96) ( · ).When units are organized in a network, it is worth noting that a partition Π of the covariatespace divides the sample units into sub-populations according to similarities in their characteristics,regardless of their network distance. Hence, two connected units might belong to diﬀerent regionsof the partition. However, in an homophilous network, where the probability of forming a linkdepends on the similarity in certain features and, hence, connected units are likely to share similarcharacteristics, a partition of the the covariate space is also likely to cluster connected units together.Figure 2: Partition of the covariate space with connected units.16iven a partition Π, we now deﬁne conditional average potential outcomes under each individualtreatment and network exposure condition ( w, g ) ∈ ∆. For the subset of units S m with covariatevectors x ∈ X that are mapped to the same region by the function (cid:96) ( x ), i.e., S m = { ik ∈ V : (cid:96) ( X ik ) = (cid:96) m } , we deﬁne the leaf-speciﬁc average potential outcome under treatment and exposurecondition ( w, g ) ∈ ∆ as follows: µ ( w,g ) ( (cid:96) ( x )) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 Y ik ( w, g ) ( X ik ∈ (cid:96) ( x )) . (5)Note that µ ( w,g ) ( (cid:96) ( x )) is a sample average, that is, it is the average potential outcome for all unitsin the sample V with a covariate vector mapped to the same region (cid:96) ( x ).Leaf-speciﬁc conditional average causal eﬀect (CACE) can be deﬁned by comparing averagepotential outcomes under two diﬀerent conditions: τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 Y ik ( w, g ) ( X ik ∈ (cid:96) ( x )) − N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 Y ik ( w (cid:48) , g (cid:48) ) ( X ik ∈ (cid:96) ( x ))= µ ( w,g ) ( (cid:96) ( x )) − µ ( w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) . (6) We denote by T the set of possible contrast we are interested in. For instanceif both the individual treatment and the network exposure are binary, then T ⊆{ (1 ,

0; 0 , , (1 , , , , (0 ,

1; 0 , , (1 , , , , (1 , , , } . We deﬁne as leaf-speciﬁc treatment eﬀects causal contrasts τ ( w,g ; w (cid:48) ,g ) ( (cid:96) ( x )) that keep the network exposure ﬁxed at a level g while changingthe individual treatment from w (cid:48) to w , that is, when g = g (cid:48) . These represent causal eﬀects of re-ceiving the treatment while the treatment status of all other units is kept ﬁxed or is mapped to thesame network exposure g . On the contrary, we deﬁne as leaf-speciﬁc spillover eﬀects causal contrasts τ ( w,g ; w,g (cid:48) ) ( (cid:96) ( x )) that keep the individual treatment ﬁxed at a level w while changing the networkexposure from g (cid:48) to g , that is, when w = w (cid:48) . These spillover eﬀects can be seen as causal eﬀectsof a change in the treatment status of other units such that the network exposure also changes,while the individual treatment status is kept ﬁxed. It should be emphasized that µ ( w,g ) ( (cid:96) ( x )) cor-responds to a unit-level intervention setting the treatment and network exposure of each unit tospeciﬁc values. The focus on these type of average potential outcomes, as opposed to the ones basedon population-level hypothetical interventions as in Hudgens and Halloran (2008), is due to ourpurpose of investigating heterogeneous responses to the individual treatment and network status17cross units with diﬀerent characteristics. If one is interested in assessing the heterogeneity of theaverage response to the network exposure resulting from a hypothetical treatment allocation, ourapproach could be extended to marginalized causal eﬀects as the ones in Forastiere et al. (2020). Here we develop an Horvitz-Thompson estimator for leaf-speciﬁc conditional average causal eﬀects.The derivation of the proposed estimator builds upon the estimator for average causal eﬀects undernetwork interference proposed by Aronow and Samii (2017).Following Horvitz and Thompson (1952) and Aronow and Samii (2017), a design-based esti-mator for the leaf-speciﬁc average potential outcome under individual treatment w and networkexposure g , µ ( w,g ) ( (cid:96) ( x )), can be expressed as: (cid:98) µ ( w,g ) ( (cid:96) ( x )) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 Y ik π ik ( w, g ) ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) (7)where π ik ( w, g ) denotes the probability of a given unit ik , that belongs to the leaf (cid:96) ( x ) (in thepartition Π), to be exposed to the treatment condition ( w, g ).The variance estimator of (cid:98) µ ( w,g ) ( (cid:96) ( x ) can be expressed as: (cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x ))[1 − π ik ( w, g )] (cid:20) Y ik π ik ( w, g ) (cid:21) + 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) ( W jk = w, G jk = g, X jk ∈ (cid:96) ( x )) × π ikjk ( w, g ) − π ik ( w, g ) π jk ( w, g ) π ikjk ( w, g ) Y ik π ik ( w, g ) Y jk π jk ( w, g ) . (8) This expression extends the variance estimator derived in Aronow and Samii (2017) (Equation 7)to the case of conditional average potential outcomes and clustered network interference. In fact, thesecond term in (8) includes the covariance between the individual treatment and network exposureof two units belonging to the same leaf (cid:96) ( x ). Under an experimental design with independenttreatment allocation between clusters and under clustered interference such covariance betweentwo units belonging to diﬀerent clusters is zero and the second term should be restricted to units j in the same cluster as i . In addition, the covariance between the joint treatment of two units is18on-zero if the set of units deﬁning the network exposure – e.g., the whole cluster or the unit’sneighborhood – is shared between them or includes them. Formally, if N wgik ∩ N wgjk (cid:48) (cid:54) = 0 the jointtreatment of the units ik and jk (cid:48) will be dependent, that is, π ikjk (cid:48) ( w, g ) − π ik ( w, g ) π jk (cid:48) ( w, g ) (cid:54) = 0 .Hence, two units belonging to the same leaf are more likely to have intersecting sets N wgik and N wgjk (cid:48) (e.g., shared neighbors) if the sets are homogeneous, that is, units belonging to these sets sharesimilar characteristics. Settings with homophilous networks are investigated in the appendix.An estimator for the leaf-speciﬁc conditional average causal eﬀect of the exposure condition( w, g ) compared to the conﬁguration ( w (cid:48) , g (cid:48) ) can be written as: (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) = (cid:98) µ ( w,g ) ( (cid:96) ( x )) − (cid:98) µ ( w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) . (9)The estimated variance of the estimator (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) can be decomposed as follows: (cid:98) V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17) = (cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) + (cid:98) V (cid:16)(cid:98) µ ( w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17) − (cid:104)(cid:98) C (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) , (cid:98) µ ( w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17)(cid:105) . (10)with the covariance estimator taking the following expression for the case when π ikjk ( w, g ; w (cid:48) , g (cid:48) ) > ∀ i, j, k : (cid:98) C (cid:16)(cid:98) µ w,g ( (cid:96) ( x )) , (cid:98) µ w (cid:48) ,g (cid:48) ( (cid:96) ( x )) (cid:17) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) ( W jk = w (cid:48) , G jk = g (cid:48) , X jk ∈ (cid:96) ( x )) π ikjk ( w, g ; w (cid:48) , g (cid:48) ) × [ π ikjk ( w, g ; w (cid:48) , g (cid:48) ) − π ik ( w, g ) π jk ( w (cid:48) , g (cid:48) )] Y ik π ik ( w, g ) Y jk π jk ( w (cid:48) , g (cid:48) ) − N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i (cid:104) ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) Y ik π ik ( w, g )+ ( W ik = w (cid:48) , G ik = g (cid:48) , X ik ∈ (cid:96) ( x )) Y ik π ik ( w (cid:48) , g (cid:48) ) (cid:105) . (11) Further details about the variance estimator of leaf-speciﬁc CACE can be found in Appendix B.

Here we will describe the properties of the Horvitz-Thompson estimator of leaf-speciﬁc causaleﬀects. Asymptotic results will rely on a growth process that is commonly assumed with clusterdata. In particular, we consider a sequence of nested samples V of size N, where V consists of Kseparate clusters V k of size n k , k = 1 , . . . , K . We let the sample size N −→ ∞ by letting the number19f clusters go to inﬁnity, i.e., K −→ ∞ , while the cluster size n k , k = 1 , . . . , K remains ﬁxed. Proposition 1 (Unbiasedness) . E (cid:104)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:105) = µ w,g ( (cid:96) ( x )) and E (cid:104)(cid:98) τ ( w,g,w (cid:48) g (cid:48) ) ( (cid:96) ( x )) (cid:105) = τ w,g ; w (cid:48) ,g (cid:48) ( (cid:96) ( x )) . Proof.

Proof in Appendix A.The unbiasedness of the estimator of leaf-speciﬁc CACE is conditional on the partition Π and thefunction (cid:96) ( · ). When building causal trees to assess the heterogeneity of causal eﬀects, we will relyon this property to derive the splitting criterion and to estimate leaf-speciﬁc causal eﬀects. Proposition 2 (The variance estimator of (cid:98) µ ( w,g ) is unbiased) . If π ikjk ( w, g ) > ∀ i, j, k then E (cid:104)(cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17)(cid:105) = V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x ) (cid:17) . The proof follows directly from the unbiasedness of the Horvitz-Thompson estimator. A conservativeestimator for the case when π ikjk ( w, g ) = 0 for some units can be found in Appendix B. Proposition 3 (The variance estimator of (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) is conservative) . E (cid:104)(cid:98) V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17)(cid:105) ≥ V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17) . A proof follows from Aronow and Samii (2017). The restriction on the covariances does not changethe proof.

Proposition 4 (Consistency) . Consider the asymptotic regime where the number of clusters K goto inﬁnity, i.e., K −→ ∞ , while the cluster size remains bounded, i.e., n k ≤ B ( (cid:96) ( x )) ≤ B for someconstant B. In addition, assume that | Y ik ( w, g ) | /π ik ( w, g ) ≤ C < , ∀ i, k, w, g . Then as K −→ ∞ (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) p −→ τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) . The unbiasedness of the estimator (cid:98) τ w,g,w (cid:48) ,g (cid:48) ( (cid:96) ( x ) does not ensure the identiﬁcation of subsets with the highestheterogeneity. The performance of the causal tree in identifying heterogeneous regions depends on the splittingcriterion, the algorithm and the sample. roof. See Appendix A.Note that cluster network interference and independent treatment allocation between clusters en-sure that the amount of dependence across units is limited. This limited independence is the con-dition required to ensure consistency (Aronow and Samii; 2017). Proposition 5 (Asymptotic Normality) . Given an cluster independent design and the clusterednetwork interference assumption, then: (cid:112) N ( (cid:96) ( x )) (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) − τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17) d −→ N (cid:18) , V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17)(cid:19) . An independent treatment allocation between clusters and the clustered network interferenceensure the limited dependence condition required in Aronow and Samii (2017). This conditionallows us to rely on a central limit theorem derived via Stein’s method (Chen and Shao; 2004) toachieve the asymptotic normality of the estimator. The variance estimators will depend on the sizeof the sample belonging to leaf (cid:96) ( x ) in each cluster, i.e., n k ( (cid:96) ( x )) ≤ B ( (cid:96) ( x )) ≤ B, k = 1 = . . . , K ,and the maximum conditional degree: D ( (cid:96) ) = max ik ∈ V : X ik ∈ (cid:96) ( x ) N gik ( (cid:96) ( x )) where N gik ( (cid:96) ) = N gik ∩ V k ( (cid:96) ( x )) . Given that these quantities are bounded, we can show that V (cid:16) V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17)(cid:17) = O (1 /N ( (cid:96) ( x ))), that is, the rate of convergence will be 1 / (cid:112) N ( (cid:96) ( x )), with N ( (cid:96) ( x )) ≤ KB ( (cid:96) ( x ))(the proof follows the one in Aronow and Samii (2017)).It is worth noting that, thanks to the asymptotic growth assumed here, Proposition 5 would stillhold if the covariates’ vector X would include network covariates deﬁned both at the cluster-level orat the individual-level. This result would allow us to investigate the heterogeneity of causal eﬀectswith respect to network characteristics, including variables deﬁning a cluster network structure orthe structure of the network neighborhood around a node. Note that for the variance of the estimator to go to zero as N −→ ∞ one must have that: (cid:88) ik (cid:88) jk (cid:48) ( X ik ∈ (cid:96) ( x ) , X jk ∈ (cid:96) ( x )) (cid:0) π ikjk (cid:48) ( w, g ) − π ik ( w, g ) π jk (cid:48) ( w, g ) (cid:54) = 0 (cid:1) = o ( N ( (cid:96) ( x )) ) . This is guaranteed given that the joint treatment is independent between units in diﬀerent clusters, that is, π ikjk (cid:48) ( w, g ) = π ik ( w, g ) π jk (cid:48) ( w, g ) , ∀ k (cid:54) = k (cid:48) , and given that the cluster size is bounded. Network Causal Trees for Heterogeneous Causal Eﬀects underClustered Network Interference

In the previous section we have introduced and developed an estimator for causal eﬀects conditionalon sub-populations of units deﬁned by a partition Π of the covariate space X . Here we developa data-driven machine learning algorithm to identify the partition Π aimed at investigating theheterogeneity in the eﬀects of interest.Our proposed algorithm, named Network Causal Tree (NCT), builds upon the Causal Tree(CT) algorithm introduced by Athey and Imbens (2016), which in turn ﬁnds its roots in theClassiﬁcation and Regression Tree (CART) algorithm (Friedman et al.; 1984). CART is a widelyused nonparametric method to partition the feature space. It relies on a tree-based algorithmwhich recursively splits the sample. In particular, trees are constructed by recursively partitioningthe observations from the root (that contains all the observations in the learning samples) intotwo child nodes . This procedure is repeated until the tree reaches the ﬁnal nodes which are called leaves . Because each node is always split into two sub-nodes, these trees are called binary trees .Binary trees are called regression trees when the outcome is a continuous variable, while they arecalled classiﬁcation trees when the outcome is either a discrete or a binary variable. The aim of thetree construction is to identify heterogeneities in the relationship between the observed outcomeand the features to best predict the outcome variable. Therefore, splits are made with the aim ofminimizing the prediction error. With this aim diﬀerent splitting criteria could be speciﬁed. Foradditional details on CART, we refer to the seminal paper by Friedman et al. (1984). Figure 3illustrates an example of binary partitioning in a simple case with just two predictors x ∈ [0 , x ∈ [0 , < . x > . l l No Yes l No YesFigure 3: (Left) An example of a binary tree. The internal nodes are labelled by their splitting rulesand the terminal nodes are labelled with the corresponding parameters l i .(Right) The corresponding partition of the sample space.clustered network interference, and (ii) it possibly models heterogeneity with respect to more thanone eﬀect at the same time through a composite splitting criterion. In this Section, we describe andmotivate the splitting criteria for our NCT algorithm (Subsection 4.1) and its detailed structure(Subsection 4.2). The NCT algorithm is built to detect and estimate heterogeneous treatment and spillover eﬀects, inthe presence of clustered network interference. Moreover, NCT is able to discover the heterogeneitywith respect to more than one estimand. Here, we present the three criteria that rule the splittingprocedure of NCT, targeted to single eﬀects or multiple eﬀects.

Let P be the space of partitions. Given a causal eﬀect τ ( w,g ; w (cid:48) ,g (cid:48) ) , we can use recursive splitting tolook for the best partition Π ∈ P with respect to a splitting criterion Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π). Formally:Π = argmax Π ∈ P Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) . (12)23iven our goal of describing the relationship between the causal eﬀect and the covariate space anddetecting subsets that exhibit a high level of heterogeneity, we can deﬁne a splitting criterion thatmaximizes accuracy in the prediction of conditional eﬀects τ ( w,g ; w (cid:48) ,g (cid:48) ) ( X ik ) in the whole sample V.This translates into the minimization of expected value of the mean square error (MSE): Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) = − EM SE (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x ) , Π) (cid:17) = − E (cid:104)(cid:16) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( x ) − (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x , Π) (cid:17) (cid:105) (13)where the expected value is taken over the sampling distribution. When this splitting criterion isused to select the partition Π, we maximize the function in (13) evaluated in the sample used to buildthe tree, i.e., the training set . In this case, in the machine learning literature the objective functionis referred to as the in-sample splitting function and we denote this by Q in ( w,g ; w (cid:48) ,g (cid:48) ) (Π). As opposedto the EMSE of the observed outcome prediction, the true causal eﬀect τ w,g ; w (cid:48) ,g (cid:48) ( x ) is unknown.However, we can use the training data to estimate the EMSE for the in-sample spitting rule. Thanksto the unbiasedness of the estimator (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x ) , Π) with respect to the population causal eﬀect E [ τ ( w,g ; w (cid:48) ,g (cid:48) ) ( X ik ) | X ik ∈ (cid:96) ( x )] (Proposition 6 in Appendix A), following Athey and Imbens (2016)we can estimate the EMSE as follows: Q in ( w,g ; w (cid:48) ,g (cid:48) ) (Π) = − (cid:92) EM SE (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x ) , Π) (cid:17) = 1 N tr (cid:88) k ∈K tr n k (cid:88) i =1 (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( X ik , Π)) (cid:1) (14)where K tr is the subset of clusters belonging to the training set and N tr is the sample size. There-fore, the maximization of this splitting function results in the maximization of the heterogeneityacross leaves. In fact, if two sub-populations (cid:96) and (cid:96) have a diﬀerent causal eﬀect τ ( w,g ; w (cid:48) ,g (cid:48) ) , i.e., τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:54) = τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ), a partition Π that splits them would yield a higher Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π)than the partition Π c that combines the two sub-populations into one leaf (cid:96) (a simple proof canbe found in Appendix A).To avoid using the same information for selecting the partition and for the estimation, Atheyand Imbens (2016) propose to estimate the eﬀects in a separate sample than the one used to buildthe tree. They call this an ‘honest’ causal tree. We denote by V est the estimation set and by V tr thetraining set (or discovery set) that is used to build the tree (by evaluating the splitting function). In standard CART the training set is a subset of the whole sample together with the testing set, which is usedto evaluate the objective function in order to choose the best partition selected in the training set that maximizesout-of-sample prediction accuracy. K tr and K est of the clusters in K . Hence, V est = (cid:83) k ∈K est V k and V tr = (cid:83) k ∈K tr V k = V /V est . This randomsplit of the sample avoids any dependencies between training and estimation sub-samples. In this‘honest’ version, the splitting function can be estimated as follows: Q in,H ( w,g ; w (cid:48) ,g (cid:48) ) (Π) = 1 N tr (cid:88) k ∈K tr n k (cid:88) i =1 (cid:0) ˆ τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( X ik , Π)) (cid:1) − (cid:16) N tr + 1 N est (cid:17) (cid:88) (cid:96) ∈ Π (cid:98) V (cid:16)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ; Π) (cid:17) (15)where N est = | V est | . The proof follows from Athey and Imbens (2016). In (15) we can see that thesplitting function is such that splits will be chosen so to maximize the heterogeneity across leavesas well as to minimize the average variance in the estimated eﬀects. The idea is to identify themost heterogeneous partitions while introducing a penalization term which corrects the objectivefunction to minimize the leaf-speciﬁc variation in the estimated eﬀect. This penalization term hasalso the eﬀect of reducing the depth of the tree, because leaves with a small number of observations N ( (cid:96) ) will exhibit a higher variance. Hence, the ﬁnal depth of the tree will depend on the samplesize as well as on the randomization design and the exposure mapping function. In addition tothis penalization, we will also add a stopping rule based on a minimum number of observations foreach condition ( w, g ) and ( w (cid:48) , g (cid:48) ) that we are comparing. This stopping rule is required to avoidhaving leaves where the eﬀect τ ( w,g ; w (cid:48) ,g (cid:48) ) cannot be estimated because there are no observationswith observed treatment ( W ik , G ik ) equal to the values ( w, g ) or ( w (cid:48) , g (cid:48) ). We now introduce a composite splitting rule targeted to multiple causal estimands. When interfer-ence is in place targeting strategies might involve both treatment and spillover eﬀects. For example,in settings with limited resources the treatment should be provided to those who would beneﬁt fromit, i.e., with a non-zero treatment eﬀect, whereas we could save resources by not giving the treat-ment to those who would beneﬁt from other people being treated, namely units with high spillovereﬀects. For instance, this is the case in marketing interventions where we can provide advertise-ments only to those who would be aﬀected and who are less likely to get the information fromsomeone else. Another interesting example can be found in the potential challenges of the COVID-19 vaccine distribution. Those at high risk of getting infected, even if those in close contact were25mmune, should be targeted. These would be individuals who are often in crowded spaces, either inthe workplace or in a social setting, and would likely get infected in these environments. Therefore,these individuals are characterized by a high treatment eﬀect even with all close contacts beingtreated. On the contrary, those who are in contact with a low number of people and can greatlygain from having one of these contacts vaccinated could be left without the vaccine, at least in theearly stages of the distribution.For these kind of targeting strategies involving more than one causal eﬀect we must partitionthe population into sub-groups that show high level of heterogeneity in all estimands of interest.Building a separate tree for each causal eﬀect would provide us with diﬀerent partitions that cannotbe used for the design of multi-eﬀect strategies. Therefore, we propose a composite splitting functionthat would result in a tree that maximises heterogeneity in all the causal estimands of interest.This composite objective function is a weighted average of the eﬀect-speciﬁc splitting functions: Q T (Π) = (cid:88) ( w,g ; w (cid:48) ,g (cid:48) ) ∈T γ ( w,g ; w (cid:48) ,g (cid:48) ) Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) with γ ( w,g ; w (cid:48) ,g (cid:48) ) = ω ( w,g ; w (cid:48) ,g (cid:48) ) (cid:0) ˆ τ ( w,g ; w (cid:48) ,g (cid:48) ) (cid:1) (16)where ω ( w,g ; w (cid:48) ,g (cid:48) ) ∈ [0 ,

1] is a customized weight for each estimand and ˆ τ ( w,g ; w (cid:48) ,g (cid:48) ) is the estimatedeﬀect in the whole sample. Each eﬀect τ ( w,g ; w (cid:48) ,g (cid:48) ) , where ( w, g ; w (cid:48) , g (cid:48) ) ∈ T , contributes to theglobal objective function according to a speciﬁc weight γ ( w,g ; w (cid:48) ,g (cid:48) ) . γ ( w,g ; w (cid:48) ,g (cid:48) ) is proportional to acustomized weight ω ( w,g ; w (cid:48) ,g (cid:48) ) , which is set by the researcher according to the extent to which theestimand τ ( w,g ; w (cid:48) ,g (cid:48) ) is of interest, and is normalized by the the estimated eﬀect in the whole sampleto rule out any dependence on the magnitude of the eﬀect. The composite criterion requires that atleast two of the four weights are strictly greater than zero. A similar composite objective functioncan be derived from the splitting functions for the ‘honest’ causal trees: Q H T (Π) = (cid:88) ( w,g ; w (cid:48) ,g (cid:48) ) ∈T γ ( w,g ; w (cid:48) ,g (cid:48) ) Q H ( w,g ; w (cid:48) ,g (cid:48) ) (Π) with γ ( w,g ; w (cid:48) ,g (cid:48) ) = ω ( w,g ; w (cid:48) ,g (cid:48) ) (cid:0) ˆ τ ( w,g ; w (cid:48) ,g (cid:48) ) (cid:1) . (17) Compared with the standard HCT algorithm, the main novelties of NCT are the introduction ofinterference and the possibility of including more than one eﬀect. Speciﬁcally, the extent to whicheach eﬀect τ ( w,g ; w (cid:48) ,g (cid:48) ) , with ( w, g ; w (cid:48) , g (cid:48) ) ∈ T , contributes to the determination of the tree is speciﬁed26y the weight w ( w, g ; w (cid:48) , g (cid:48) ). Here we describe the key steps of the NCT algorithm, including therecursive partitioning based on the splitting functions and the stopping rules. The proposed algorithm takes mainly six elements as inputs:1. the sample V , which collects for each unit ik the individual treatment assignment status W ik ,the observed outcome Y ik and a vector of characteristics X ik ;2. the network information, which is fully described by the global adjacency matrix A , includingthe cluster-speciﬁc blocks A k ;3. the speciﬁcation of the exposure mapping function g ( · ) which, together with the adjacencymatrix and possibly covariate matrix, will be translated in the computation of the observednetwork exposure G ik for each unit;4. the experimental design which will determine the computation of the probabilities π ik ( w, g )and π ik ( w, g, w (cid:48) , g (cid:48) );5. the weight ω w,g,w (cid:48) g (cid:48) for each causal eﬀect;6. the speciﬁcation of the two parameters: maximum depth , that is the maximum depth of thetree, and the minimum size , that is the minimum number of units falling in each exposurecondition ( w, g ) in each leaf.After some preliminary steps, the algorithm consists of two main steps. The ﬁrst step is focused onthe selection of the partition, i.e. the tree, while the second step is concerned with the estimationof causal eﬀects and returns point estimates and standard errors of the conditional average causaleﬀects, for all the comparisons of interest and within each leaf of the detected partition. We reportbelow the key steps of the NCT algorithm:0. Step 0 (Preliminaries): In a preliminary stage the algorithm computes the quantities andtools that will be used in the subsequent steps. In particular:(a) Given the adjacency blocks A k and potentially the covariate matrix X , for each unit the network exposure variable G ik is computed according to the rule expressed in Assumption3. 27b) The joint exposure probabilities π ik ( w, g ) and π ikjk (cid:48) ( w, g ; w (cid:48) , g (cid:48) ) are computed (as inSubsection 2.3) or estimated (as in Aronow and Samii (2017)).(c) Finally, the algorithm randomly splits the clusters between the training set V est and theestimation set V est . Step 1 (Tree Discovery): the ﬁrst step of the algorithm sprouts the tree structure of the NCT,that is, it detects the relevant heterogeneous partitions. Note that this step is performed overthe discovery set only. In particular, the NCT algorithm works with the clusters belonging tothe set K tr and builds the tree using a binary recursive partitioning.(a) Recursive Partitioning. the algorithm grows a tree by maximizing the in-sample splittingcriterion at each binary split. At iteration r − r − = { x ∈ X : r − (cid:92) m =1 x m ∈ A h m m } h ∈{ L,R } r − where x m ∈ { x p } p =1 ,...,P is the feature that was split at iteration m and A Lm = { x m ≤ c m } , A Lm = { x m c m } for some cutoﬀ point c m . The variable x m split at iteration m together with the cutoﬀ point c m compose a node of the tree. At iteration r , the partitionwill be complemented with a split of a variable x r ∈ { x p } p =1 ,...,P / { x m } m =1 ...,r − at somecutoﬀ point c r : Π r = { x ∈ X : r − (cid:92) m =1 x m ∈ A h m m (cid:92) x r ∈ A h r r } h ∈{ , } r . Among all the candidate splits x r and c r , the algorithm will choose the one that maxi-mizes the in-sample splitting function in (14) or (15).(b) Stopping Rule. The recursive partitioning stops when at least one stopping condition is met: (i) the NCT has reached the speciﬁed maximum depth ; (ii) the current split r generate at least one leaf (cid:96) where the set of units N ( (cid:96) ) ( w,g ) = { ik ∈ V tr : X ik ∈ (cid:96), W ik = w, G ik = g } with a number of observations | N ( (cid:96) ) ( w,g ) | lower than the speciﬁed minimumsize , for at least one exposure condition ( w, g ). Following Athey and Imbens (2016) we suggest to assign half of the clusters to the discovery sample and anotherhalf to the estimation sample. network causal tree which corresponds to a partition Π of the featurespace X into M leaves: Π = { (cid:96) , . . . , (cid:96) M } , with (cid:83) Mm :1 (cid:96) m = X and l ( x , Π) :

X →

Π.2.

Step 2 (Estimation): the second step of the algorithm takes as input the Network CausalTree Π built in Step 1 and computes all the point estimates, the standard errors and theconﬁdence intervals of the leaf-speciﬁc causal eﬀects of interest in all its nodes (cid:96) ( x , Π). Thisis done using the Horvitz-Thompson estimator in Section 3. In the ‘honest’ version, at thisstage the NCT algorithm works with the clusters belonging to the set K est . Algorithm 1

Overview of the NCT algorithm • Inputs : i) observed data { W ik , Y ik , X ik } ik ∈ V ; ii) global adjacency matrix A , which comprisesthe cluster-speciﬁc blocks A k ; iii) experimental Design; iv) vector of weights ω ( w, g ; w (cid:48) , g (cid:48) ),where ( w, g ; w (cid:48) , g (cid:48) ) ∈ T ; v) tree parameters: maximum depth and minimum size . • Outputs : (1) a partition Π if the covariate space, and (2) point estimates, standard errorsand conﬁdence intervals of the conditional average causal eﬀects:1. Step 0 (Preliminaries): compute G ik and both the marginal and joint exposure proba-bilities π ik ( w, g ) and π ikjk (cid:48) ( w, g ; w (cid:48) , g (cid:48) ). Then, randomly assign clusters to discovery andestimation samples.2. Step 1 (Tree Discovery): using the discovery sample, build a tree according to the in-sample splitting criterion and stop when either the tree has reached its maximum depthor any additional split would generate leaves, which are not suﬃciently representative ofthe four exposure conditions.3. Step 2 (Estimation): use the Horvitz-Thompson estimator on the estimation sample toestimate the leaf-speciﬁc CACE and their standard errors in each leaf. Our algorithm provides an interpretable method to detect and estimate heterogeneous eﬀects in thepresence of clustered network interference. In this section we evaluate, through a set of simulations,the performance of the proposed algorithm with respect to both discovery and estimation. Inparticular, we investigate its ability to correctly identify the actual heterogeneous sub-populations,comparing the use of single or composite splitting functions, and we assess the performance ofthe Horvitz-Thompson estimator for leaf-speciﬁc treatment and spillover eﬀects. While the latter29erformance assessment is quite standard in the literature, the former is critical for the developmentof interpretable algorithms for heterogeneous causal eﬀects (Bargagli-Stoﬃ and Gnecco; 2020; Leeet al.; 2020). We evaluate the performance of the algorithm and the estimator in settings that diﬀerwith respect to three main factors: (1) the structure of the heterogeneity, (2) the extent of theeﬀect heterogeneity, and (3) the number of clusters. In Appendix C, we also consider two additionalfactors: (4) the correlation structure in the covariate matrix and (5) the presence of homophily inthe network structure. Regarding the structure of the heterogeneity, we are particularly interestedin settings where the structure of the causal tree representing heterogeneity is diﬀerent for eachcausal eﬀect. In particular, causal trees diﬀer if they have diﬀerent nodes corresponding to the splitof a feature, that is, if covariates driving the heterogeneity are diﬀerent, or if they have diﬀerentterminal leaves where the causal eﬀect is heterogeneous, i.e., non-zero. We call causal rules theseheterogeneous terminal leaves.For each simulation scenario we simulated M = 500 samples and applied our NCT algorithmto detect sub-populations with heterogeneous causal eﬀects (or causal rules) and to estimate ourcausal eﬀects of interest. To evaluate the performance of our composite splitting function underdiﬀerent settings, splits rely on either eﬀect-speciﬁc splitting criteria or on the composite function.All simulations are performed under Bernoulli trials, that is treatment is randomly assignedindependently to each unit with a ﬁxed probability π Wik = α . In our simulation study, we alsoassume that interference only takes place at the neighborhood level and we choose the followingdeﬁnition of the network exposure: G ik = (cid:18)(cid:0) (cid:88) j ∈N ik W jk (cid:1) ≥ (cid:19) , (18)that is, the network exposure of unit ik is 1 if at least one neighbor is treated. A binary networkexposure together with a binary individual treatment results in a joint treatment with four cate-gories – i.e., ( W ik , G ik ) ∈ { ( w, g ) = (0 , , (1 , , (0 , , (1 , } . The binary deﬁnition of the networkexposure is chosen to allow the growth of deeper trees. Given that the minimum size requirementstops the algorithm when the number of units in a child leaf is not enough to estimate a conditionalcausal eﬀect, a joint treatment with four categories ensures that this stopping condition is unlikelyto be met during the ﬁrst few splits.In addition, the assumption of neighborhood interference allows the computation of the marginal30nd joint probabilities without the need for intensive estimation procedures. In fact, the approximatealgorithm for estimating the marginal and joint probabilities proposed by Aronow and Samii (2017)is computationally demanding and could not be incorporated in our simulation study. However, inthe case of Bernoulli trial and network exposure deﬁned as in (18), the probability π ik ( w, g ) canbe computed using the formula in (3) while the joint probability π ikjk (cid:48) ( w, g, w (cid:48) , g (cid:48) ) is simply theproduct of π ik ( w, g ) and π jk (cid:48) ( w (cid:48) , g (cid:48) ) if the two units ik and jk (cid:48) are independent. On the contrary,if N wgik and N wgjk (cid:48) (cid:54) = 0 overlap the joint treatment of the units ik and jk (cid:48) will be dependent, thatis, π ikjk (cid:48) ( w, g ; w (cid:48) , g (cid:48) ) (cid:54) = π ik ( w, g ) π jk (cid:48) ( w (cid:48) , g (cid:48) ). In this case, the joint probability can still be readilycomputed using combinatorics formulas on two overlapping sets. For each simulation m = 1 . . . , M we generated K clusters and simulated, within each cluster,Erd˝os-R´enyi random graphs (Erd˝os and R´enyi; 1959) with n k = 100 nodes and a ﬁxed probability(0.01) to observe a link. Given the deﬁnition of the network exposure we removed isolated nodesfrom the analysis to make the Assumption 5 hold. W ik and any of the 10 covariates X ip weresampled from independent Bernoulli distributions with probability 0.5: W i ∼ Ber (0 .

5) and X ip ∼ Ber (0 . . In the simulation study we focus on two main eﬀects: (i) the pure treatment eﬀect τ (1 , , , and (ii)the pure spillover eﬀect τ (0 , , . To ease notation, we denote by τ the treatment eﬀect and by δ thespillover eﬀect. After setting the value of these two conditional eﬀects for each unit with covariates X ik (depending on the simulation scenarios), we generated the four diﬀerent potential outcomes asfollows: Y ik (0 , ∼ N (0 , Y ik (1 , ∼ N (0 , Y ik (1 ,

0) = Y ik (0 ,

0) + τ ( X ik ); Y ik (0 ,

1) = Y ik (0 ,

0) + δ ( X ik ) . Y ik = (cid:88) w =0 1 (cid:88) g =0 ( W ik = w, G ik = g ) Y ik ( w, g ) . We now detail how we varied the three factors (1), (2), and (3). We simulated two diﬀerentscenarios with respect to the heterogeneity structure (1). In the ﬁrst scenario we have: τ ( X ik ) =  h if X ik ∈ (cid:96) = { X ik = 0 , X ik = 0 } ; − h if X ik ∈ (cid:96) = { X ik = 1 , X ik = 1 } ;0 otherwise; δ ( X ik ) =  h if X ik ∈ (cid:96) = { X ik = 0 , X ik = 0 } ; − h if X ik ∈ (cid:96) = { X ik = 1 , X ik = 1 } ;0 otherwise . Hence, in this scenario the heterogeneity driving variables (HDV), i.e., X i and X i , are the samefor both the treatment eﬀect τ and the spillover eﬀect δ and the two causal rules overlap. In thesecond scenario, we introduce a change in the drivers of the heterogeneity in the following way: τ ( X ik ) =  h if X ik ∈ (cid:96) τ = { X ik = 0 , X ik = 0 } ;3 h if X ik ∈ (cid:96) τ = { X ik = 0 , X ik = 1 } ;0 otherwise; δ ( X ik ) =  h if X ik ∈ (cid:96) δ = { X ik = 1 , X ik = 0 } ;3 h if X ik ∈ (cid:96) δ = { X ik = 1 , X ik = 1 } ;0 otherwise . Hence, in the second scenario the heterogeneity drivers are diﬀerent for the two causal eﬀects.Speciﬁcally, we have: X i and X i for the treatment eﬀects, and X i and X i for the spillovereﬀects. In addition, we have two causal rules for the treatment eﬀect, namely { X ik = 0 , X ik = 0 } and { X ik = 0 , X ik = 1 } and two diﬀerent causal rules for the spillover eﬀect, namely { X ik =1 , X ik = 0 } and { X ik = 1 , X ik = 1 } . For each structural scenario we varied the eﬀect size: h = ( r + 0 . r =1 . Figure 4 graphically represents the two simulations’ scenarios. Moreover, wechanged the number of clusters keeping their size ﬁxed to K = (10 , , For each scenariowe built three NCTs: one tree implementing the composite splitting rule for the treatment andspillover eﬀects as in (17) (with w (1 , , = w (0 , , = 0 . Q in,H (1 , , (Π) as in (15), and a third tree using the singular splittingrule for the spillover eﬀect Q in,H (0 , , (Π) as in (15). Note that K/2 clusters will be assigned to the discovery sample and the remaining clusters will be in theestimation set as in Athey and Imbens (2016). = 0 x = 0 τ = − hδ = − h τ = 0 δ = 0Yes No x = 0 τ = 0 δ = 0 τ = hδ = h Yes NoYes No (a) First scenario. x = 0 x = 0 τ = 0 δ = 3 hτ = 0 δ = h Yes No x = 0 τ = 3 hδ = 0 τ = hδ = 0Yes NoYes No (b) Second scenario. Figure 4: Simulations’ scenarios.

In all the scenarios the performance of the NCT algorithm is evaluated using the following measures,averaged over the M generated data sets: • Average number of correctly discovered heterogeneous causal rules corresponding to the leavesof the generated NCTs (reported with respect to the eﬀect sizes in Figures 5 and 6) in thediscovery sample; • Average conditional treatment and spillover eﬀects estimated for each correctly detectedheterogeneous terminal leaf (reported as ˆ τ and ˆ δ in Tables 1, 2, 3 and 4); • Average standard errors estimated for each correctly detected heterogeneous terminal leaf(reported as ˆ se (ˆ τ ) and ˆ se (ˆ δ ) in Tables 1, 2, 3 and 4); • Monte Carlo bias in the estimation sample:Bias m ( V est ) = 1 N est (cid:88) k ∈K est n k (cid:88) i =1 (cid:16) τ ( w,g,w (cid:48) g (cid:48) ) ( X ik ) − (cid:98) τ ( w,g,w (cid:48) g (cid:48) ) ( (cid:96) ( X ik , , Π m ) , V est ) (cid:17) , Bias( V est ) = 1 M M (cid:88) m =1 Bias m ( V est )33here Π m is the partition selected in simulation m ; • Monte Carlo MSE in the estimation sample:MSE m ( V est ) = 1 N est (cid:88) k ∈K est n k (cid:88) i =1 (cid:16) τ ( w,g,w (cid:48) g (cid:48) ) ( X ik ) − (cid:98) τ ( w,g,w (cid:48) g (cid:48) ) ( (cid:96) ( X ik ) , Π m , V est ) (cid:17) , MSE( V est ) = 1 M M (cid:88) m =1 MSE m ( V est ); • Coverage, computed as the average proportion of units for whom where the estimated 95%conﬁdence interval of the causal eﬀect in the assigned leaf includes the true value:C m ( V est ) = 1 N est (cid:88) k ∈K est n k (cid:88) i =1 (cid:16) τ ( w,g,w (cid:48) g (cid:48) ) ( X ik ) ∈ (cid:99) CI (cid:16)(cid:98) τ ( w,g,w (cid:48) g (cid:48) ) ( (cid:96) ( X ik , Π m ) , V est ) (cid:17)(cid:17) , C( V est ) = 1 M M (cid:88) m =1 C m ( V est ) . We will ﬁrst analyze the ability of the algorithm to correctly detect the heterogeneous subgroups inthe ﬁrst simulation scenario, that is, when the heterogeneity is the same for the two causal eﬀectsof interest. Figure 5 reports the average number of correctly discovered heterogeneous causal ruleswith composite splitting rule or eﬀect-speciﬁc splitting rules targeted to the treatment eﬀect or thespillover eﬀect, in the case of 10, 20 and 30 clusters. Given that in this scenario the causal rulesdeﬁning the heterogeneity is the same for both eﬀects, all the splitting rules, targeted to a singleeﬀect or to both eﬀects, have a similar performance and are able to detect the two heterogeneousleaves with a success rate that gets higher as the eﬀect size increases. In addition, as the num-ber of cluster grows the minimum eﬀect size allowing the algorithm to optimally discover all theheterogeneous sub-populations gets lower.Tables 1, 2, 3 report the results for the ﬁrst scenario 4 for the performance of the estimatorin the correctly detected leaves. We only report the results of the estimation procedure on thetree built with the composite splitting rule, as the spitting rule would only aﬀect the identiﬁcationof the heterogeneous sub-populations but not the estimation of the causal eﬀects once these sub-34opulations are correctly detected. The estimator is able to estimate the heterogeneous treatmentand spillover eﬀects without bias and its precision grows as the number of clusters increases. In-terestingly, NCT provides more accurate estimates of the heterogeneous spillover eﬀects than thetreatment eﬀects. This is due to the larger number of units with ( W ik = 0 , G ik = 1) than those with( W ik = 1 , G ik = 0), by deﬁnition of the network exposure (by design, units with ( W ik = 0 , G ik = 1)are on average 50% more than units with ( W ik = 1 , G ik = 0)). The higher precision in the esti-mation of the spillover eﬀects is also reﬂected in the identiﬁcation of the heterogeneous subgroups,which is more accurate when splits are targeted to the minimization of the MSE of the spillovereﬀect (Figure 5). . . . . .

10 Clusters

Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s . . . . .

20 Clusters

Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s . . . . .

30 Clusters

Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s NCT (composite)NCT (treatment)NCT (spillover)

Figure 5: Simulations’ results for correctly discovered leaves in the ﬁrst scenario with 2 correctleaves and 10, 20 and 30 clusters, respectively. 35able 1: Simulations’ results for the ﬁrst scenario (10 clusters)

Treatment EﬀectsEﬀect Size ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) MSE Bias Coverage0.1 0.353 0.416 -0.139 0.350 0.106 0.107 1.0001.1 1.108 0.514 -1.132 0.515 0.261 -0.012 0.9362.1 2.124 0.740 -2.156 0.741 0.570 -0.016 0.9413.1 3.004 1.003 -3.050 0.997 0.934 -0.023 0.9454.1 4.061 1.293 -4.179 1.287 1.843 -0.059 0.9285.1 5.201 1.565 -5.197 1.585 2.297 0.002 0.9466.1 6.179 1.889 -6.086 1.824 3.282 0.046 0.9317.1 7.125 2.140 -7.110 2.128 4.276 0.007 0.9218.1 8.011 2.378 -8.149 2.425 5.087 -0.069 0.9319.1 9.153 2.729 -9.128 2.713 6.803 0.012 0.92710.1 9.991 2.947 -9.916 2.974 8.007 0.038 0.934 Spillover Eﬀects ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) MSE Bias Coverage0.1 0.225 0.380 -0.249 0.322 0.082 -0.012 0.9581.1 1.114 0.415 -1.103 0.402 0.155 0.005 0.9602.1 2.085 0.542 -2.138 0.545 0.230 -0.026 0.9673.1 3.047 0.716 -3.094 0.716 0.408 -0.023 0.9724.1 4.098 0.892 -4.079 0.882 0.646 0.009 0.9665.1 5.119 1.063 -5.054 1.057 0.862 0.032 0.9586.1 6.090 1.250 -6.045 1.242 1.132 0.022 0.9587.1 7.086 1.437 -7.099 1.442 1.542 -0.006 0.9738.1 8.075 1.618 -8.019 1.618 1.986 0.028 0.9619.1 8.936 1.811 -9.197 1.828 2.586 -0.130 0.95710.1 10.078 2.023 -10.055 2.026 2.730 0.011 0.971 Table 2: Simulations’ results for the ﬁrst scenario (20 clusters)

Treatment EﬀectsEﬀect Size ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) MSE Bias Coverage0.1 0.053 0.286 -0.072 0.284 0.061 -0.010 0.9291.1 1.120 0.382 -1.102 0.377 0.145 0.009 0.9542.1 2.102 0.538 -2.114 0.538 0.270 -0.006 0.9493.1 3.105 0.710 -3.140 0.721 0.468 -0.018 0.9464.1 4.055 0.905 -4.198 0.947 0.848 -0.072 0.9485.1 5.058 1.100 -5.034 1.108 1.132 0.012 0.9406.1 6.155 1.332 -6.034 1.306 1.609 0.061 0.9337.1 7.058 1.519 -7.036 1.521 2.000 0.011 0.9548.1 8.141 1.737 -8.118 1.755 2.919 0.011 0.9619.1 9.130 1.952 -9.160 1.949 3.313 -0.015 0.95910.1 10.088 2.150 -10.071 2.142 4.021 0.008 0.947 Spillover Eﬀects ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) MSE Bias Coverage0.1 0.085 0.234 -0.139 0.255 0.064 -0.027 0.9291.1 1.092 0.300 -1.106 0.297 0.075 -0.007 0.9672.1 2.072 0.393 -2.083 0.396 0.132 -0.006 0.9653.1 3.067 0.509 -3.084 0.511 0.208 -0.009 0.9734.1 4.124 0.641 -4.088 0.639 0.296 0.018 0.9715.1 5.118 0.772 -5.052 0.772 0.436 0.033 0.9746.1 6.082 0.900 -6.154 0.915 0.611 -0.036 0.9727.1 7.114 1.042 -7.101 1.040 0.802 0.007 0.9688.1 8.023 1.183 -8.156 1.185 1.027 -0.067 0.9759.1 9.106 1.312 -9.139 1.323 1.262 -0.017 0.97910.1 10.210 1.465 -10.151 1.460 1.472 0.029 0.976 Treatment EﬀectsEﬀect Size ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) ˆ τ (cid:96) ˆ se (ˆ τ (cid:96) ) MSE Bias Coverage0.1 0.092 0.238 -0.068 0.230 0.016 0.012 1.0001.1 1.083 0.311 -1.115 0.306 0.091 -0.016 0.9502.1 2.101 0.435 -2.107 0.436 0.170 -0.003 0.9493.1 3.104 0.584 -3.113 0.588 0.305 -0.005 0.9534.1 4.086 0.754 -4.114 0.746 0.546 -0.014 0.9465.1 5.168 0.921 -5.170 0.931 0.794 -0.001 0.9566.1 6.110 1.091 -6.059 1.074 1.041 0.025 0.9567.1 7.133 1.259 -7.135 1.251 1.422 -0.001 0.9608.1 8.078 1.420 -7.952 1.409 1.729 0.063 0.9469.1 9.199 1.618 -9.047 1.580 2.264 0.076 0.96110.1 10.148 1.763 -10.171 1.765 2.862 -0.012 0.958 Spillover Eﬀects ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) ˆ δ (cid:96) ˆ se (ˆ δ (cid:96) ) MSE Bias Coverage0.1 0.022 0.208 -0.086 0.213 0.019 -0.032 0.9671.1 1.082 0.249 -1.108 0.244 0.056 -0.013 0.9642.1 2.080 0.322 -2.115 0.326 0.083 -0.017 0.9743.1 3.094 0.423 -3.065 0.418 0.132 0.015 0.9694.1 4.119 0.528 -4.083 0.525 0.209 0.018 0.9735.1 5.092 0.634 -5.110 0.637 0.263 -0.009 0.9836.1 6.085 0.745 -6.126 0.746 0.384 -0.020 0.9757.1 7.061 0.857 -7.160 0.860 0.517 -0.050 0.9778.1 8.101 0.974 -8.051 0.973 0.695 0.025 0.9799.1 9.111 1.088 -9.103 1.091 0.827 0.004 0.98410.1 10.184 1.215 -10.123 1.201 0.993 0.030 0.981 For the second scenario with diﬀerent causal rules for each causal eﬀect, we only report theresults for simulations with 30 clusters. Figure 6 depicts the average number of correctly discoveredheterogeneous causal rules with composite splitting rule or eﬀect-speciﬁc splitting rules targeted tothe treatment eﬀect or the spillover eﬀect.When we are interested in building a tree that can represent the heterogeneity of all causaleﬀects simultaneously (left panel), the composite splitting rule NCT is able to correctly identify allthe heterogeneous causal rules (four in this example), while the other two NCT, targeted to eitherthe treatment eﬀect or the spillover eﬀect, only detect the two leaves where the correspondingcausal eﬀect is heterogeneous. This is even more clear when looking at the other two plots, wherewe depict the ability to detect just the treatment eﬀect rules (central panel) and the spillover eﬀectrules (right panel). Indeed, when we are interested in subgroups that are heterogeneous with respectto only one causal eﬀect, both the eﬀect-specif spitting rule targeted to that eﬀect or the compositespitting function can be used and perform similarly, while the use of the eﬀect-specif spitting ruletargeted to the other eﬀect results in a poor detection of the correct causal rules. The results fromthis second scenario show the clear added value of the composite splitting rule. Indeed, when theHDV are diﬀerent for treatment and spillover eﬀects implementing this splitting rule enables the37esearcher to correctly spot all the true causal rules simultaneously . Finally, Table 4 shows how theHorvitz-Thompson estimator performs well in estimating the conditional treatment and spillovereﬀects in the two corresponding heterogeneous leaves.

Composite Tree

Four LeavesEffect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s . . . . . Treatment Effects Tree

Two LeavesEffect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s . . . . . Spillover Effects Tree

Two LeavesEffect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s NCT (composite)NCT (treatment)NCT (spillover)

Figure 6: Simulations’ results for correctly discovered leaves in the second scenario with 30 clusters.Table 4: Simulations’ results for second scenario (30 clusters)

Treatment EﬀectsEﬀect Size ˆ τ (cid:96) τ ˆ se (ˆ τ (cid:96) τ ) ˆ τ (cid:96) τ ˆ se (ˆ τ (cid:96) τ ) MSE Bias Coverage0.1 0.194 0.254 0.351 0.241 0.059 0.072 0.9641.1 1.103 0.310 3.239 0.607 0.223 -0.029 0.9422.1 2.088 0.438 6.379 1.121 0.650 0.034 0.9523.1 3.100 0.589 9.317 1.634 1.226 0.008 0.9704.1 4.113 0.756 12.285 2.137 2.353 -0.001 0.9455.1 5.120 0.917 15.418 2.708 4.106 0.069 0.9446.1 6.106 1.074 18.263 3.159 5.469 -0.015 0.9477.1 7.086 1.241 21.312 3.712 6.246 -0.001 0.9558.1 8.167 1.433 24.257 4.209 8.790 0.012 0.9499.1 9.028 1.569 27.099 4.671 9.975 -0.136 0.95410.1 10.024 1.733 30.617 5.288 14.454 0.120 0.951 Spillover Eﬀects ˆ δ (cid:96) δ ˆ se (ˆ δ (cid:96) δ ) ˆ δ (cid:96) δ ˆ se (ˆ δ (cid:96) δ ) MSE Bias Coverage0.1 0.086 0.187 0.281 0.209 0.041 -0.017 1.0001.1 1.089 0.248 3.321 0.444 0.105 0.005 0.9682.1 2.116 0.326 6.325 0.773 0.237 0.021 0.9773.1 3.120 0.421 9.313 1.114 0.491 0.016 0.9804.1 4.109 0.527 12.277 1.455 0.832 -0.007 0.9725.1 5.123 0.637 15.407 1.816 1.441 0.065 0.9766.1 6.119 0.746 18.341 2.159 1.990 0.030 0.9727.1 7.074 0.858 21.393 2.516 2.177 0.033 0.9798.1 8.095 0.976 24.312 2.854 3.168 0.003 0.9749.1 9.048 1.090 27.264 3.210 3.953 -0.044 0.97510.1 10.053 1.201 30.416 3.565 4.610 0.034 0.980 Empirical Application

In this section we provide an empirical application of the proposed methodology. We use data froma randomized experiment designed to assess the eﬀectiveness of intensive information sessions topromote the uptake of a new weather insurance policy among farmers in rural China. The promotedpolicy is addressed to rice farmers and it is aimed at protecting their products from adverse weathershocks (Cai et al.; 2015). The sample consists of 4,569 households belonging to 47 diﬀerent villages.Households were randomly assigned to the the intensive training session. Here, the individualtreatment variable W ik ∈ { , } for each household i living in village k represents the assignmentto the intensive ( W ik = 1 ) or simple ( W ik = 0) information session.Households are linked according to an observed village-speciﬁc friendship network, representedin Figure 7. Relationships between households belonging to diﬀerent villages are negligible and wereremoved.Figure 7: Friendship network between households living in rural villages in China. Colors refer todiﬀerent villages.Friendship relationships between households in a given rural village k are fully described by theadjacency matrix A k , where the generic element A k ( i, j ) equals 1 if the household i in village k has The original experiment also includes a village-level randomization on price variation and a second round ofsessions Cai et al. (2015). Here we only consider the household-level randomization to the ﬁrst round of sessions. j in the same village k as friend. Therefore, the adjacency matrix is notsymmetric. We denote with N ik the set of nominated friends of unit ik , and with N ik the out-degree.Figure 8a shows the overall degree distribution, while Figure 8b represents the distribution of thenumber of treated neighbors. In this setting interference could take place because the informationreceived in the intensive information session could be transferred to network contacts. We assumehere an exposure to the network treatments only at the neighborhood level and, in particular, wedeﬁne the network exposure variable as in (18). Hence, G ik is equal to 1 if a unit has at least onetreated friend, and 0 otherwise. This deﬁnition of the network exposure is justiﬁable by the factthat the decision of a farmer to buy the weather insurance might depend only on the informationeither received directly in the intensive session or indirectly by one of his friends. This is also whathas been found in Cai et al. (2015) where the only signiﬁcant spillover eﬀect is from ﬁrst neighbors. (a) Number of neighbors (b) Number of treated neighbors Figure 8: Degree and treated neighbors distributionWe dropped from the analysis the 57 households without any friends. The total number ofremaining households is 4 , ,

535 belong to the control group and, thus, they haveundergone a less intensive session. 2 ,

075 households do not have any treated friends, while 2 ,

437 ofthem undertake friendship relationships with at least one treated household. The joint distributionof the individual and neighborhood treatments is summarized in the following Table 5.40able 5: Distribution of the joint treatment G i = 0 G i = 1 W i = 0 1,621 1,914 W i = 1 454 523Figures 9a and 9b represent the treatment distribution in four villages. In Figure 9a, nodes arecolored according to their individual treatment assignment, while in Figure 9b node colors refer tothe joint treatment status. (a) Individual treatment in four villages (b) Joint treatment in four villages: Figure 9: Zoom on four villagesThe outcome variable Y ik ∈ { , } represents the insurance uptake and equals 1 if the household i residing in village k purchased the insurance after the information session. At the end of theexperiment, 2 ,

495 families chose not to accept the proposed insurance policy, while 2 ,

017 werepositively persuaded by the session and accepted the weather insurance.To evaluate the heterogeneity of treatment and spillover eﬀects, we included in the analysisall the observed characteristics that could plausibly moderate the eﬀect of the intervention: threedummies representing the household’s production area ( reg1 , reg2 and reg3 , respectively); a binaryvariable which equals 1 if the head of the household is at least 50 years old ( age ); a binary variablebeing 1 if the household’s head has successfully completed high school ( educ ); a binary variabledistinguishing families who are strongly worried about weather phenomena ( prob dis , which isequal to 1 if perceived probability of a weather disaster happening in the coming year exceeds41 . averse ). In this section, we present the most relevant empirical ﬁndings. The following ﬁgures report thetrees built targeting single eﬀects (Figure 10) and multiple eﬀects (Figure 11). In each node, wealso report the estimates of the main treatment and spillover eﬀects and the size of the selectedsub-populations. 20 villages were randomly assigned to the training set, while the remaining villageshave been allocated to the estimation set. Furthermore, we have set the maximum depth at 3 tomaintain a high level of interpretability, while we have set at 20 the minimum number of units thatmust be present in the child leaves for each of the four exposure conditions.We can see that, in the whole population, the treatment has a positive eﬀect on insurance takeup, while the spillover eﬀect is negligible. The most relevant heterogeneity drivers result to be theperceived probability of disaster and the production area (speciﬁcally, living in the third area).However, the estimated tree slightly changes under diﬀerent speciﬁcations of the spitting rule. Fig-ures 10a and 10b separately represent the trees targeted to τ (1 ,

0; 0 ,

0) and τ (0 ,

1; 0 , τ (1 ,

0; 0 , τ (0 ,

1; 0 , T . In this application, the composite tree which considers both τ (1 ,

0; 0 , The random assignment of the villages to the training and estimation samples has been kept ﬁxed in all thetrees. τ (0 ,

1; 0 ,

0) coincides with the tree based on τ (1 ,

0; 0 ,

0) only. Therefore, the latter leads thecomposite variability across partitions. Finally, the network causal tree that incorporates all thefour eﬀects shows slightly diﬀerent results: here the sub-population where the treatment resultsto be more eﬀective is the one including older households who are currently settled in the secondproduction area.We can conclude that intensive training sessions encouraged Chinese rural households to takeup the the insurance policy. The main determinants of the heterogeneity in the treatment eﬀect arethe production area and age. Indirect eﬀects do not have a signiﬁcant impact in this study, and,thus, composite criteria are mostly ruled by treatment eﬀects. (a) NCT targeted to τ (1 ,

0; 0 ,

0) (b) NCT targeted to τ (0 ,

1; 0 , Figure 10: Network causal trees targeted to single eﬀects43 a) NCT targeted to τ (1 ,

0; 0 ,

0) and τ (0 ,

1; 0 ,

0) (b) NCT targeted to all eﬀects

Figure 11: Network causal trees targeted to multiple eﬀects

Depending on our characteristics we might respond diﬀerently to a treatment or an intervention.Similarly, we might be more or less susceptible to the inﬂuence of other people who have received thetreatment. Understanding the heterogeneity of the eﬀect of a treatment with the aim of targetingpeople who would beneﬁt from it has been the focus of a recent ﬁeld of research, especially appliedto medicine. Investigating how diﬀerent people respond diﬀerently to the treatment received byothers can be crucial, particularly in settings with limited resources where spillover eﬀects could beleveraged. In this paper, we have introduced a new algorithm to estimate heterogeneous causal treat-ment and spillover eﬀects in the presence of clustered network interference. The proposed networkcausal tree model bridges the gap between two streams of causal inference literature: estimatorsfor causal eﬀects under interference and tree-based methods for the discovery of heterogeneoussub-populations. We build upon the seminal algorithm proposed by Athey and Imbens (2016) toaccount for clustered network interference through a rework of the criterion function. Leaf-speciﬁccausal eﬀects are then estimated using the Horvitz-Thompson estimator proposed by Aronow andSamii (2017). 44he proposed NCT algorithm has enhanced interpretability and shows an excellent performancein a set of Monte Carlo simulations. In particular, the algorithm is able both to spot the relevantsources of heterogeneity in the data and to consistently estimate the conditional treatment andspillover eﬀects. Moreover, we have introduced a composite splitting function that allows the re-searchers to simultaneously detect the sub-populations where both treatment and spillover eﬀectsare heterogeneous. The identiﬁcation of these multi-eﬀect heterogeneous subgroups is crucial for thedesign of targeting strategies that involve multiple eﬀects. For instance, in marketing campaignsa person might not be aﬀected by an advertisement received directly by a company but mightbe susceptible to the advertisement received by her friends. In this case, resources could be savedby promoting the product among people who could be directly susceptible and letting those whocould be more inﬂuenced by their friends receive the advertisement indirectly. Our simulation studyshows that the use of such composite splitting rule is able to correctly detect all the heterogeneoussub-populations deﬁned by both treatment and spillover eﬀects, as well the ones deﬁned by oneeﬀect only. Therefore, the selected partition of the population can be used to design strategieswhose objective function incorporates multiple eﬀects, but can also be used a posteriori to targetsubgroups maximizing either a treatment or a spillover eﬀect.When applied to real-world data, the NCT algorithm provided useful insights on the eﬀectivenessof intensive training sessions among Chinese rural households on the uptake of a weather insurancepolicy. We found that the main characteristics responsible for the heterogeneity of the eﬀects areperception of a possible disaster, the production area, and the age of the farmers. However, theseheterogeneity drivers play a diﬀerent role with respect to the two main treatment and spillovereﬀects and when the tree is built using composite splitting function.Nevertheless, the proposed algorithm may suﬀer from the limitations common to tree-basedmethods: instability to the random allocation of units in the training sample and the potentialimpact of outlier observations in the node-speciﬁc estimations. Here, we propose an algorithm thatselects a single tree because of its high interpretability that plays a fundamental role in policydesign. Indeed, the single tree algorithms are suitable for the discovery of heterogeneous sub-populations, which can give useful insights into the main variables that drive the heterogeneityand can be used to design targeting strategies. Moreover, we did not ﬁnd instability in neither thedetection nor the estimation of heterogeneous eﬀects in the Monte Carlo simulations. Nonetheless,45he proposed algorithm could be extended to tree-based ensemble methods, following Wager andAthey (2018) and Athey et al. (2019). By averaging the estimates from many single trees, thisextension could enhance estimation precision at the cost of reduced interpretability (see Lee et al.(2020) for a discussion on the trade-oﬀ between accuracy and interpretability). In addition, ourapproach might be rearranged to deal with settings where the network cannot be partitioned intowell-deﬁned and pre-speciﬁed clusters. However, in some network structures, clusters could bedetected by implementing a network-based community detection algorithm (Fortunato; 2010) andthe NCT algorithm could be applied on the detected communities, while the estimator should takeinto account the uncertainty in group membership .Here we assume that the network exposure variable is discrete, with the performance of thealgorithm being aﬀected by the number of categories resulting from the exposure mapping function.In our simulation study and application we have used a binary neighborhood exposure, whichallowed us to grow deeper trees, reduce the number of possible causal eﬀects, and have enoughobservations for each exposure condition to maintain the variance in a reasonable range. However,alternative speciﬁcations could be used. For instance, the network exposure could be deﬁned as theproportion of treated neighbors, perhaps categorized into few bins. A more complex deﬁnition of thenetwork exposure, possibly resulting in a continuous variable, would require some methodologicaladjustments in the estimation strategy, but it would allow to model a wider ensemble of real-worldinterference mechanisms. We leave this extension to future work.Finally, further research is needed to use the selected partition to actually design targetingstrategies involving both treatment and spillover eﬀects. Furthermore, these strategies should alsorely on the average susceptibility of network contacts as well as on heterogeneous inﬂuential power.To this end, new causal estimands could be deﬁned and their heterogeneity should be investigatedand incorporated in the design of complex targeting rules, aimed at maximizing the eﬀect of theintervention while saving resources. 46 eferences

Angelucci, M. and Di Maro, V. (2015).

Program evaluation and spillover eﬀects , The World Bank.Arduini, T., Patacchini, E. and Rainone, E. (2019). Treatment eﬀects with heterogeneous exter-nalities,

Journal of Business & Economic Statistics (doi: 10.1080/07350015.2019.1592755): 1–13.Arduini, T., Patacchini, E. and Rainone, E. (2020). Identiﬁcation and estimation of network modelswith heterogeneous interactions,

Forthcoming at Advances in Econometrics .Aronow, P. M. and Samii, C. (2017). Estimating average causal eﬀects under general interference,with application to a social network experiment, The Annals of Applied Statistics (4): 1912–1947.Assmann, S. F., Pocock, S. J., Enos, L. E. and Kasten, L. E. (2000). Subgroup analysis and other(mis) uses of baseline data in clinical trials, The Lancet (9209): 1064–1069.Athey, S., Eckles, D. and Imbens, G. W. (2018). Exact p-values for network interference,

Journalof the American Statistical Association (521): 230–240.Athey, S. and Imbens, G. (2016). Recursive partitioning for heterogeneous causal eﬀects,

Proceedingsof the National Academy of Sciences (27): 7353–7360.Athey, S. and Imbens, G. W. (2019). Machine learning methods that economists should knowabout,

Annual Review of Economics : 685–725.Athey, S., Tibshirani, J., Wager, S. et al. (2019). Generalized random forests, The Annals ofStatistics (2): 1148–1178.Baird, S., Bohren, J. A., McIntosh, C. and ¨Ozler, B. (2018). Optimal design of experiments in thepresence of interference, Review of Economics and Statistics (5): 844–860.Bargagli-Stoﬃ, F. J., De Witte, K. and Gnecco, G. (2019). Heterogeneous causal eﬀects with imper-fect compliance: a novel Bayesian machine learning approach, arXiv preprint arXiv:1905.12707 . 47argagli-Stoﬃ, F. J. and Gnecco, G. (2018). Estimating heterogeneous causal eﬀects in the presenceof irregular assignment mechanisms,

In Proceedings of the 5th IEEE Conference in Data Scienceand Advanced Analytics (DSAA) pp. 1–10.Bargagli-Stoﬃ, F. J. and Gnecco, G. (2020). Causal tree with instrumental variable: an extensionof the causal tree framework to irregular assignment mechanisms,

International Journal of DataScience and Analytics (3): 315–337.Basse, G. and Feller, A. (2018). Analyzing two-stage experiments in the presence of interference, Journal of the American Statistical Association (521): 41–55.Binka, F. N., Indome, F. and Smith, T. (1998). Impact of spatial distribution of permethrin-impregnated bed nets on child mortality in rural northern ghana.,

The American Journal ofTropical Medicine and Hygiene (1): 80–85.Breiman, L. (2001). Random forests, Machine Learning (1): 5–32.Bridges, C. B., Thompson, W. W., Meltzer, M. I., Reeve, G. R., Talamonti, W. J., Cox, N. J., Lilac,H. A., Hall, H., Klimov, A. and Fukuda, K. (2000). Eﬀectiveness and cost-beneﬁt of inﬂuenzavaccination of healthy working adults: a randomized controlled trial, Journal of the AmericanMedical Association (13): 1655–1663.Broderick, M. P., Hansen, C. J. and Russell, K. L. (2008). Exploration of the eﬀectiveness ofsocial distancing on respiratory pathogen transmission implicates environmental contributions,

The Journal of Infectious Diseases (10): 1420–1426.Cai, J., De Janvry, A. and Sadoulet, E. (2015). Social networks and the decision to insure,

AmericanEconomic Journal: Applied Economics (2): 81–108.Chae, I., Stephen, A. T., Bart, Y. and Yao, D. (2017). Spillover eﬀects in seeded word-of-mouthmarketing campaigns, Marketing Science (1): 89–104.Chakraborty, B. and Murphy, S. (2014). Dynamic treatment regimes, Annual Review of Statisticsand Its Application : 447–464.Chen, L. H. Y. and Shao, Q.-M. (2004). Normal approximation under local dependence, The Annalsof Probability (3): 1985–2028. 48ockx, B., Lechner, M. and Bollens, J. (2019). Priority to unemployed immigrants? A causalmachine learning evaluation of training in Belgium, arXiv preprint arXiv:1912.12864 .Cook, D. I., Gebski, V. J. and Keech, A. C. (2004). Subgroup analysis in clinical trials, MedicalJournal of Australia (6): 289–291.Cox, D. R. (1958).

Planning of experiments. , Wiley: New York.Duﬂo, E. and Saez, E. (2003). The role of information and social interactions in retirement plan deci-sions: Evidence from a randomized experiment,

The Quarterly Journal of Economics (3): 815–842.Erd˝os, P. and R´enyi, A. (1959). On random graphs i,

Publicationes Mathematicae (290-297): 18.Forastiere, L., Airoldi, E. M. and Mealli, F. (2020). Identiﬁcation and estimation of treatmentand interference eﬀects in observational studies on networks, Forthcoming in the Journal of theAmerican Statistical Association. Preprint at arXiv:1609.06245 .Forastiere, L., Lattarulo, P., Mariani, M., Mealli, F. and Razzolini, L. (2019). Exploring encour-agement, treatment, and spillover eﬀects using principal stratiﬁcation, with application to aﬁeld experiment on teens museum attendance,

Journal of Business & Economic Statistics (doi:10.1080/07350015.2019.1647843): 1–15.Forastiere, L., Mealli, F. and VanderWeele, T. J. (2016). Identiﬁcation and estimation of causalmechanisms in clustered encouragement designs: Disentangling bed nets using bayesian principalstratiﬁcation,

Journal of the American Statistical Association (514): 510–525.Forastiere, L., Mealli, F., Wu, A. and Airoldi, E. (2018). Estimating causal eﬀects under interferenceusing bayesian generalized propensity scores, arXiv preprint arXiv:1807.11038 .Fortunato, S. (2010). Community detection in graphs,

Physics reports (3-5): 75–174.Foster, J. C., Taylor, J. M. and Ruberg, S. J. (2011). Subgroup identiﬁcation from randomizedclinical trial data,

Statistics in Medicine (24): 2867–2880.Friedman, J. H., Olshen, R. A., Stone, C. J. et al. (1984). Classiﬁcation and Regression Trees, Belmont, CA: Wadsworth & Brooks . 49uber, R. (2018). Instrument Validity Tests with Causal Trees: With an Application to the Same-sex Instrument,

MEA Discussion Paper Series 201805 , Munich Center for the Economics ofAging (MEA) at the Max Planck Institute for Social Law and Social Policy.Hahn, P. R., Murray, J. S., Carvalho, C. M. et al. (2020). Bayesian regression tree models forcausal inference: regularization, confounding, and heterogeneous eﬀects,

Forthcoming in BayesianAnalysis .Hill, J. L. (2011). Bayesian nonparametric modeling for causal inference,

Journal of Computationaland Graphical Statistics (1): 217–240.Holland, P. W. (1986). Statistics and causal inference, Journal of the American Statistical Associ-ation (396): 945–960.Horvitz, D. G. and Thompson, D. J. (1952). A generalization of sampling without replacementfrom a ﬁnite universe, Journal of the American Statistical Association (260): 663–685.Howard, S., Omumbo, J., Nevill, C., Some, E., Donnelly, C. and Snow, R. (2000). Evidence for amass community eﬀect of insecticide-treated bednets on the incidence of malaria on the kenyancoast, Transactions of the Royal Society of Tropical Medicine and Hygiene (4): 357–360.Hudgens, M. G. and Halloran, M. E. (2008). Toward causal inference with interference, Journal ofthe American Statistical Association (482): 832–842.Imai, K., Jiang, Z. and Malani, A. (2020). Causal inference with interference and noncompli-ance in two-stage randomized experiments,

Forthcoming in Journal of the American StatisticalAssociation .Johnson, M., Cao, J. and Kang, H. (2019). Detecting heterogeneous treatment eﬀect with instru-mental variables, arXiv preprint arXiv:1908.03652 .Kang, H. and Imbens, G. (2016). Peer encouragement designs in causal inference with partialinterference and identiﬁcation of local average network eﬀects, arXiv preprint arXiv:1609.04464 . 50elso, J. K., Milne, G. J. and Kelly, H. (2009). Simulation suggests that rapid activation of socialdistancing can arrest epidemic development due to a novel strain of inﬂuenza,

BMC public health (1): 117.Kempe, D., Kleinberg, J. and Tardos, . (2003). Maximizing the spread of inﬂuence through asocial network, Proceedings of the Ninth ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining p. 137146.Kim, D., Hwong, A., Staﬀord, D., Hughes, D., O’Malley, J., Fowler, J. and Christakis, N. (2015).Social network targeting to maximise population behaviour change: A cluster randomised con-trolled trial,

Lancet : 145–153.Kissler, S. M., Tedijanto, C., Goldstein, E., Grad, Y. H. and Lipsitch, M. (2020). Projecting thetransmission dynamics of sars-cov-2 through the postpandemic period,

Science (6493): 860–868.Kosorok, M. R. and Laber, E. B. (2019). Precision medicine,

Annual Review of Statistics and ItsApplication (1): 263–286.Lechner, M. (2018). Modiﬁed causal forests for estimating heterogeneous causal eﬀects, arXivpreprint arXiv:1812.09487 .Lee, K., Bargagli-Stoﬃ, F. J. and Dominici, F. (2020). Causal rule ensemble: Interpretable inferenceof heterogeneous treatment eﬀects, Working paper .Lee, K., Small, D. S. and Dominici, F. (2018). Discovering eﬀect modiﬁcation and randomizationinference in air pollution studies, arXiv preprint arXiv:1802.06710 .Leung, M. P. (2020). Treatment and spillover eﬀects under network interference,

The Review ofEconomics and Statistics (2): 368–380.Liu, L., Hudgens, M. and Becker-Dreps, S. (2016). On inverse probability-weighted estimators inthe presence of interference,

Biometrika (4): 829–842.Liu, L. and Hudgens, M. G. (2014). Large sample randomization inference of causal eﬀects in thepresence of interference,

Journal of the American Statistical Association (505): 288–301.51anski, C. (2013). Identiﬁcation of treatment response with social interactions,

EconometricsJournal (1): 1–23.McKenzie, D. and Puerto, S. (2015). The direct and spillover impacts of a business training programfor female entrepreneurs in Kenya, In Proceedings of the 2015 International Labor OrganizationConference pp. 1–37.Montes, F., Jaramillo, A. M., Meisel, J. D., Diaz-Guilera, A., Valdivia, J. A., Sarmiento, O. L. andZarama, R. (2020). Benchmarking seeding strategies for spreading processes in social networks:an interplay between inﬂuencers, topologies and sizes,

Scientiﬁc Reports (1): 3666.Muralidharan, K. and Sundararaman, V. (2015). The aggregate eﬀect of school choice: Evidencefrom a two-stage experiment in india, The Quarterly Journal of Economics (3): 1011–1066.Murphy, S. A. (2003). Optimal dynamic treatment regimes,

Journal of the Royal Statistical Society:Series B (Statistical Methodology) (2): 331–355.Nichol, K. L., Lind, A., Margolis, K. L., Murdoch, M., McFadden, R., Hauge, M., Magnan, S. andDrake, M. (1995). The eﬀectiveness of vaccination against inﬂuenza in healthy, working adults, New England Journal of Medicine (14): 889–893.Ogburn, E., Sofrygin, O., Diaz, I. and van der Laan, M. (2017). Causal inference for social networkdata, arXiv preprint arXiv:1705.08527 .Parshakov, P., Naidenova, I. and Barajas, A. (2020). Spillover eﬀect in promotion: Evidence fromvideo game publishers and esports tournaments,

Journal of Business Research : 262–270.Ratkovic, M. and Tingley, D. (2017). Sparse estimation and uncertainty with application to sub-group analysis,

Political Analysis (1): 1–40.Rothwell, P. M. (2005). Subgroup analysis in randomised controlled trials: importance, indications,and interpretation, The Lancet (9454): 176–186.Rubin, D. B. (1974). Estimating causal eﬀects of treatments in randomized and nonrandomizedstudies.,

Journal of Educational Psychology (5): 688.52ubin, D. B. (1986). Comment: Which ifs have causal answers, Journal of the American StatisticalAssociation (396): 961–962.Schiltz, F., Mazrekaj, D., Horn, D. and De Witte, K. (2019). Does it matter when your smartestpeers leave your class? Evidence from Hungary, Labour Economics : 79–91.Sobel, M. E. (2006). What do randomized studies of housing mobility demonstrate?, Journal ofthe American Statistical Association (476): 1398–1407.Sofrygin, O. and van der Laan, M. (2017). Semi-parametric estimation and inference for the meanoutcome of the single time-point intervention in a causally connected population,

Journal ofCausal Inference (1): 20160003.Su, X., Kang, J., Fan, J., Levine, R. A. and Yan, X. (2012). Facilitating score and causal inferencetrees for large observational studies, Journal of Machine Learning Research (Oct): 2955–2994.Su, X., Tsai, C.-L., Wang, H., Nickerson, D. M. and Li, B. (2009). Subgroup analysis via recursivepartitioning., Journal of Machine Learning Research (2): 141–158.Tchetgen, E. J. T. and VanderWeele, T. J. (2012). On causal inference in the presence of interfer-ence, Statistical Methods in Medical Research (1): 55–75.Valdes, G., Luna, J. M., Eaton, E., Simone II, C. B., Ungar, L. H. and Solberg, T. D. (2016).Mediboost: a patient stratiﬁcation tool for interpretable decision making in the era of precisionmedicine, Scientiﬁc Reports : 37854.Valente, T. W. (2012). Network interventions, Science (6090): 49–53.VanderWeele, T. J. and Christakis, N. A. (2019). Network multipliers and public health,

Interna-tional Journal of Epidemiology (4): 1032–1037.Varadhan, R. and Seeger, J. D. (2013). Estimation and reporting of heterogeneity of treatmenteﬀects, Developing a Protocol for Observational Comparative Eﬀectiveness Research: A User’sGuide , Agency for Healthcare Research and Quality (US).Wager, S. and Athey, S. (2018). Estimation and inference of heterogeneous treatment eﬀects usingrandom forests,

Journal of the American Statistical Association (523): 1228–1242.53ang, G., Li, J. and Hopp, W. J. (2017). Personalized health care outcome analysis of cardiovascularsurgical procedures,

Ross School of Business Paper (1336).Zhang, W., Le, T. D., Liu, L., Zhou, Z.-H. and Li, J. (2017). Mining heterogeneous causal eﬀectsfor personalized cancer treatment,

Bioinformatics (15): 2372–2378.Zhao, J., Runfola, D. M. and Kemper, P. (2017). Quantifying heterogeneous causal treatment eﬀectsin world bank development ﬁnance projects, Joint European Conference on Machine Learningand Knowledge Discovery in Databases , Springer, pp. 204–215.54 nline Appendix

A Proofs

Proposition 1 (Unbiasedness) . E [ (cid:98) µ ( w,g ) ( (cid:96) ( x ))] = µ ( w,g ) ( (cid:96) ( x )) . Proof. E [ (cid:98) µ ( w,g ) ] = E (cid:20) N K (cid:88) k =1 n k (cid:88) i =1 ( W ik = w, G ik = g, ) Y ik π ik (( w, g )) (cid:21) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 E (cid:2) ( W ik = w, G ik = g ) (cid:3) Y ik ( w, g ) π ik ( w, g )= µ ( w,g ) (19)where the expectation is over the randomization distribution of W ik and the induced distributionon G ik and the ﬁrst equality holds by consistency. Proposition 6 (Population Unbiasedness) . The estimator is unbiased with respect to the populationmean of the potential outcomes: E [ (cid:98) µ ( w,g ) ( (cid:96) ( x ))] = E [ Y ik ( w, g ) | X ik ∈ (cid:96) ( x ))] = µ P ( w,g ) ( (cid:96) ( x )) and E [ (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x ))] = E [ Y ik ( w, g ) − Y ik ( w (cid:48) , g (cid:48) ) | X ik ∈ (cid:96) ( x ))]= E [ τ ( w,g ; w (cid:48) ,g (cid:48) ) ( X ik ) | X ik ∈ (cid:96) ( x ))] = µ P ( w,g ) ( (cid:96) ( x )) where the expected value is taken over the sampling distribution. roof. E [ (cid:98) µ ( w,g ) ] = E (cid:20) N K (cid:88) k =1 n k (cid:88) i =1 ( W ik = w, G ik = g, ) Y ik π ik (( w, g )) (cid:21) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 E (cid:2) ( W ik = w, G ik = g ) Y ik (cid:3) π ik ( w, g )= 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 E (cid:2) Y ik | W ik = w, G ik = g (cid:3) π ik ( w, g ) π ik ( w, g )= 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 E (cid:2) Y ik ( w, g ) (cid:3) = E (cid:2) Y ik ( w, g ) (cid:3) = µ P ( w,g ) . (20)The proof of unbiasedness of (cid:98) τ ( w,g,w (cid:48) ,g (cid:48) ) follows directly. Proposition 4 (Consistency) . Consider the asymptotic regime where the number of clusters K goto inﬁnity, i.e., K −→ ∞ , while the cluster size remains bounded, i.e., n k ≤ B for some constantB. In addition, assume that | Y ik ( w, g ) | /π ik ( w, g ) ≤ C < , ∀ i, k, w, g . Then as K −→ ∞ (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) p −→ τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) . Proof.

As in Proposition 1 (cid:98) µ ( w,g )) ( (cid:96) ( x )) is unbiased. Hence, for consistency to hold we need to provethat the variance goes to 0 as N goes to inﬁnity. Following Aronow and Samii (2017), it is easy toshow that the variance of (cid:98) µ ( w,g ) ( (cid:96) ( x )) is given by: V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 ( X ik ∈ (cid:96) ( x )) π ik ( w, g )[1 − π ik ( w, g )] (cid:20) Y ik ( w, g ) π ik ( w, g ) (cid:21) + 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i ( X ik ∈ (cid:96) ( x ) , X jk ∈ (cid:96) ( x )) × [ π ikjk ( w, g ) − π ik ( w, g ) π jk ( w, g ]) Y ik ( w, g ) π ik ( w, g ) Y jk ( w, g ) π jk ( w, g ) (21)Since | Y ik ( w, g ) | /π ik ( w, g ) ≤ C < (cid:96) ( x ) is bounded, i.e., n k ( (cid:96) ( x )) ≤ B ( (cid:96) ( x )) ≤ B , we have:( K × B ) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) ≤ C × K × B + C × K × B (cid:98) µ ( w,g ) ( (cid:96) ( x )) is therefore ensured since V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) −→ K −→ ∞ . Consis-tency of (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) follows by Slutsky’s Theorem. Proposition 7.

The partition Π (cid:63) such that Π (cid:63) = argmax Π ∈ P Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) = 1 N K (cid:88) k =1 n k (cid:88) i =1 (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( X ik , Π)) (cid:1) maximizes the heterogeneity across leaves.Proof. Let (cid:96) and (cid:96) be two sub-populations with a diﬀerent causal eﬀect τ ( w,g ; w (cid:48) ,g (cid:48) ) , i.e., τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:54) = τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ). Let Π be a partition that splits (cid:96) and (cid:96) into two leaves and letΠ c be the partition that combines the two sub-populations into one node (cid:96) . Then we have that Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) > Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π c ). The proofs follows from Jensen’s inequality. In fact, for partitionΠ the splitting function can be written as follows: Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) = 1 | (cid:96) | + | (cid:96) | (cid:88) ik ∈ (cid:96) ∪ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( X ik , Π)) (cid:1) = 1 | (cid:96) | + | (cid:96) | (cid:16) (cid:88) ik ∈ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:1) + (cid:88) ik ∈ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:1) (cid:17) For partition Π c we have: Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π c ) = 1 | (cid:96) | + | (cid:96) | (cid:88) ik ∈ (cid:96) ∪ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ( X ik , Π c )) (cid:1) = 1 | (cid:96) | + | (cid:96) | (cid:88) ik ∈ (cid:96) ∪ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:1) = 1 | (cid:96) | + | (cid:96) | (cid:88) ik ∈ (cid:96) ∪ (cid:96) (cid:104) | (cid:96) | + | (cid:96) | (cid:16) (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) + (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:17)(cid:105) = (cid:104) | (cid:96) | + | (cid:96) | (cid:16) (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) + (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:17)(cid:105) where the second-last equality holds because of the properties of the Horvitz-Thompson estimator.Thanks to Jensen’s inequality1 | (cid:96) | + | (cid:96) | (cid:16) (cid:88) ik ∈ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:1) + (cid:88) ik ∈ (cid:96) (cid:0)(cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:1) (cid:17) ≥ (cid:104) | (cid:96) | + | (cid:96) | (cid:16) (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) + (cid:88) ik ∈ (cid:96) (cid:98) τ ( w,g ; w (cid:48) ,g (cid:48) ) ( (cid:96) ) (cid:17)(cid:105) Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π) ≥ Q ( w,g ; w (cid:48) ,g (cid:48) ) (Π c ). B Further Details of the Variance Estimator of Leaf-SpeciﬁcCACE

If in a generic leaf (cid:96) ( x ) there are some pairs of units ( i , j ) whose joint probability of the exposure con-dition ( w, g ) is zero (namely, π ikjk ( w, g ) = 0), the variance of (cid:92) µ ( w,g ) ( (cid:96) ( x )) can be estimated followinga result from Aronow and Samii (2017). Such estimator, denoted by (cid:98) V c (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) , is the sumof two components: (i) the estimated variance of leaf-speciﬁc potential outcomes, (cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x ) (cid:17) )in (8) for the case when π ikjk ( w, g ) > ∀ i, j, k , and (ii) a correction term (cid:98) A ( w,g ) ( (cid:96) ( x )): (cid:98) V c (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) = (cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) + (cid:98) A ( w,g ) ( (cid:96) ( x )) (22)where (cid:98) A ( w,g ) ( (cid:96) ( x )) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i : π ikjk ( w,g )=0 (cid:34) ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) Y ik π ik ( w, g ) ++ ( W jk = w, G jk = g, X ik ∈ (cid:96) ( x )) Y jk π jk ( w, g ) (cid:35) . (23)Note that the correction term (cid:98) A ( w,g ) ( (cid:96) ( x )) is zero if the leaf does not have pairs of units suchthat π ikjk ( w, g, w, g ) = 0. Furthermore, as in Aronow and Samii (2017), (cid:98) V c (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) is aconservative estimator of the leaf-speciﬁc variance, as the following holds: E (cid:34)(cid:98) V c (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17)(cid:35) = E (cid:34)(cid:98) V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) + (cid:98) A ( w,g ) ( (cid:96) ( x )) (cid:35) ≥ V (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) (cid:17) . (24)We now explicit the covariance (cid:98) C c (cid:16)(cid:98) µ ( w,g ) ( (cid:96) ( x )) , (cid:98) µ ( w (cid:48) ,g (cid:48) ) ( (cid:96) ( x )) (cid:17) in the case we have pairs of units( i, j ), whose joint probability of experiencing the conditions ( w, g ) and ( w (cid:48) , g (cid:48) ), respectively, is zero,that is, π ikjk ( w, g ; w (cid:48) , g (cid:48) ) = 0: 4 C (cid:16)(cid:98) µ w,g ( (cid:96) ( x )) , (cid:98) µ w (cid:48) ,g (cid:48) ( (cid:96) ( x )) (cid:17) = 1 N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i : π ikjk ( w,g ; w (cid:48) ,g (cid:48) ) > ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) ( W jk = w (cid:48) , G jk = g (cid:48) , X jk ∈ (cid:96) ( x )) π ikjk ( w, g ; w (cid:48) , g (cid:48) ) × [ π ikjk ( w, g ; w (cid:48) , g (cid:48) ) − π ik ( w, g ) π jk ( w (cid:48) , g (cid:48) )] Y ik π ik ( w, g ) Y jk π jk ( w (cid:48) , g (cid:48) ) − N ( (cid:96) ( x )) K (cid:88) k =1 n k (cid:88) i =1 (cid:88) j (cid:54) = i : π ikjk ( w,g ; w (cid:48) ,g (cid:48) )=0 (cid:104) ( W ik = w, G ik = g, X ik ∈ (cid:96) ( x )) Y ik π ik ( w, g )+ ( W ik = w (cid:48) , G ik = g (cid:48) , X ik ∈ (cid:96) ( x )) Y ik π ik ( w (cid:48) , g (cid:48) ) (cid:105) . (25) C Additional Monte Carlo Simulations

We included in the simulation study two additional sets of simulations where (4) we introduce corre-lation between the covariates, and (5) we replace the Erd˝os-R´enyi model for network formation withan exponential random graph (ERGM) model introducing homophily within the clusters. These twoadditional simulations are conducted with 30 clusters and under the ﬁrst scenario introduced inSection 5, where the heterogeneity is the same for the two causal eﬀects of interest.

C.1 Correlated covariates

In Figure 12 we report the number of correctly detected leaves under low and high correlation(0.25 and 0.5), while in Tables 6 and 7 we report the estimated treatment and spillover eﬀects withtheir standard error in the two heterogeneous leaves, together with the MSE, bias and coverageof the average treatment and spillover eﬀects in the sample. From Figure 12 one can see that thecorrelation between covariates compromises the ability of the algorithm to correctly identify theheterogeneous subgroups. This is due to the fact that, as the covariates become more similar to eachother, it becomes harder for the algorithm to detect the true HDV. Such a problem is common toall tree-based algorithms. Hence, we argue that one should carefully check the correlation patternsbetween the variables to get a sense of the reliability of the discovered subgroups.5 . . . . . Correlation 0.25

Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s . . . . . Correlation 0.50

Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s NCT (composite)NCT (treatment)NCT (spillover)

Figure 12: Simulations’ results for correctly discovered leaves in the ﬁrst scenario with correlatedcovariatesTable 6: Simulations’ results for the ﬁrst scenario with correlated covariates (0.25)

Treatment EﬀectsEﬀect Size ˆ τ l ˆ se (ˆ τ l ) ˆ τ l ˆ se (ˆ τ l ) MSE Bias Coverage0.1 0.116 0.288 -0.230 0.225 0.071 -0.057 1.0001.1 1.093 0.280 -1.078 0.288 0.080 0.008 0.9462.1 2.084 0.403 -2.067 0.401 0.147 0.009 0.9393.1 3.112 0.553 -3.134 0.555 0.273 -0.011 0.9584.1 4.082 0.692 -4.140 0.699 0.400 -0.029 0.9615.1 5.126 0.843 -5.162 0.852 0.673 -0.018 0.9506.1 6.121 1.007 -6.157 1.008 1.000 -0.018 0.9487.1 7.019 1.161 -7.067 1.165 1.214 -0.024 0.9588.1 8.018 1.309 -8.085 1.309 1.557 -0.033 0.9599.1 9.221 1.509 -9.088 1.491 2.014 0.066 0.95310.1 10.214 1.655 -10.195 1.648 2.441 0.009 0.957 Spillover Eﬀects ˆ δ l ˆ se (ˆ δ l ) ˆ δ l ˆ se (ˆ δ l ) MSE Bias Coverage0.1 0.069 0.185 -0.200 0.194 0.025 -0.065 0.9441.1 1.102 0.227 -1.090 0.227 0.046 0.006 0.9702.1 2.111 0.302 -2.117 0.302 0.078 -0.003 0.9723.1 3.093 0.392 -3.116 0.393 0.112 -0.011 0.9834.1 4.086 0.488 -4.093 0.490 0.169 -0.004 0.9795.1 5.131 0.589 -5.078 0.587 0.246 0.026 0.9716.1 6.071 0.692 -6.051 0.693 0.331 0.010 0.9797.1 7.109 0.801 -7.087 0.802 0.458 0.011 0.9808.1 8.092 0.906 -8.063 0.902 0.577 0.014 0.9769.1 9.110 1.016 -9.060 1.010 0.659 0.025 0.99010.1 10.122 1.123 -10.101 1.124 0.861 0.010 0.980 Treatment EﬀectsEﬀect Size ˆ τ l ˆ se (ˆ τ l ) ˆ τ l ˆ se (ˆ τ l ) MSE Bias Coverage0.1 0.129 0.223 -0.101 0.217 0.050 0.014 0.9091.1 1.093 0.268 -1.120 0.272 0.066 -0.014 0.9722.1 2.140 0.392 -2.114 0.378 0.158 0.013 0.9493.1 3.104 0.503 -3.065 0.505 0.241 0.019 0.9464.1 4.123 0.652 -4.188 0.664 0.399 -0.033 0.9605.1 5.081 0.795 -5.108 0.801 0.618 -0.013 0.9536.1 6.138 0.960 -6.146 0.943 0.825 -0.004 0.9597.1 7.154 1.110 -7.063 1.096 1.088 0.045 0.9568.1 8.093 1.227 -8.174 1.257 1.388 -0.041 0.9499.1 9.080 1.380 -9.116 1.389 1.637 -0.018 0.95910.1 10.086 1.529 -10.122 1.520 1.952 -0.018 0.952 Spillover Eﬀects ˆ δ l ˆ se (ˆ δ l ) ˆ δ l ˆ se (ˆ δ l ) MSE Bias Coverage0.1 0.144 0.190 -0.202 0.175 0.030 -0.029 0.8641.1 1.091 0.214 -1.087 0.212 0.047 0.002 0.9522.1 2.103 0.282 -2.137 0.285 0.064 -0.017 0.9753.1 3.119 0.364 -3.072 0.365 0.097 0.024 0.9774.1 4.098 0.458 -4.090 0.458 0.156 0.004 0.9625.1 5.056 0.551 -5.101 0.553 0.218 -0.022 0.9766.1 6.127 0.654 -6.093 0.649 0.302 0.017 0.9807.1 7.093 0.750 -7.130 0.752 0.370 -0.018 0.9888.1 8.162 0.852 -8.038 0.844 0.476 0.062 0.9849.1 9.122 0.948 -9.026 0.943 0.571 0.048 0.98210.1 10.043 1.046 -10.177 1.052 0.698 -0.067 0.982 Nevertheless, for both correlation levels (0.25 and 0.50) the estimator seems to perform wellwithin correctly detected leaves (see Tables 6 and 7).

C.2 Network homophily within the clusters

Table 8 shows the results in the case of network homophily within the clusters. In this case, we ﬁndlarger standard errors than the original scenario reported in 3 without homophily. As a consequence,the Monte Carlo MSE is also slightly larger.Moreover, as we can see from Figure 13 there is a decrease in the ability of the algorithm todiscover the true leaves. Indeed, if one compares this Figure with the right panel of Figure 5, onecan see how the correct discovery of the true leaves is slower in the case with homophily network.This is due to the fact that the standard errors are larger than in the original ﬁrst scenario.7 . . . . . Homophily

Effect Size N u m be r o f C o rr e c t l y I den t i f i ed Lea v e s NCT (composite)NCT (treatment)NCT (spillover)

Figure 13: Simulations’ results for correctly discovered leaves in the ﬁrst scenario with homophilynetworkTable 8: Simulations’ results for the ﬁrst scenario with network homophily within the clusters (30clusters)

Treatment EﬀectsEﬀect Size ˆ τ l ˆ se (ˆ τ l ) ˆ τ l ˆ se (ˆ τ l ) MSE Bias Coverage0.1 0.152 0.241 -0.108 0.259 0.045 0.022 1.0001.1 1.069 0.312 -1.094 0.315 0.100 -0.013 0.9532.1 2.130 0.454 -2.119 0.448 0.188 0.006 0.9593.1 3.149 0.608 -3.083 0.606 0.335 0.033 0.9554.1 4.099 0.775 -4.084 0.763 0.477 0.007 0.9595.1 5.115 0.946 -5.116 0.943 0.736 0.000 0.9616.1 6.162 1.127 -6.150 1.127 1.043 0.006 0.9707.1 7.076 1.285 -7.175 1.289 1.377 -0.049 0.9558.1 8.129 1.465 -8.000 1.446 1.752 0.064 0.9579.1 9.029 1.621 -9.136 1.629 2.249 -0.053 0.95310.1 10.163 1.814 -10.133 1.815 2.977 0.015 0.957 Spillover Eﬀects ˆ δ l ˆ se (ˆ δ l ) ˆ δ l ˆ se (ˆ δ l ) MSE Bias Coverage0.1 0.084 0.221 -0.142 0.222 0.036 -0.029 0.9711.1 1.082 0.268 -1.100 0.274 0.069 -0.009 0.9572.1 2.102 0.370 -2.084 0.366 0.120 0.009 0.9643.1 3.116 0.483 -3.112 0.486 0.172 0.002 0.9744.1 4.065 0.603 -4.114 0.610 0.260 -0.025 0.9765.1 5.101 0.740 -5.077 0.737 0.389 0.012 0.9806.1 6.076 0.871 -6.045 0.869 0.538 0.015 0.9797.1 7.088 1.004 -7.054 0.997 0.738 0.017 0.9728.1 8.041 1.132 -8.150 1.142 0.936 -0.055 0.9719.1 9.157 1.275 -9.030 1.263 1.070 0.064 0.97610.1 10.066 1.408 -10.093 1.407 1.391 -0.014 0.982) MSE Bias Coverage0.1 0.084 0.221 -0.142 0.222 0.036 -0.029 0.9711.1 1.082 0.268 -1.100 0.274 0.069 -0.009 0.9572.1 2.102 0.370 -2.084 0.366 0.120 0.009 0.9643.1 3.116 0.483 -3.112 0.486 0.172 0.002 0.9744.1 4.065 0.603 -4.114 0.610 0.260 -0.025 0.9765.1 5.101 0.740 -5.077 0.737 0.389 0.012 0.9806.1 6.076 0.871 -6.045 0.869 0.538 0.015 0.9797.1 7.088 1.004 -7.054 0.997 0.738 0.017 0.9728.1 8.041 1.132 -8.150 1.142 0.936 -0.055 0.9719.1 9.157 1.275 -9.030 1.263 1.070 0.064 0.97610.1 10.066 1.408 -10.093 1.407 1.391 -0.014 0.982