G-Formula for Observational Studies with Partial Interference, with Application to Bed Net Use on Malaria

Abstract

Assessing population-level effects of vaccines and other infectious disease prevention measures is important to the field of public health. In infectious disease studies, one person's treatment may affect another individual's outcome, i.e., there may be interference between units. For example, use of bed nets to prevent malaria by one individual may have an indirect or spillover effect to other individuals living in close proximity. In some settings, individuals may form groups or clusters where interference only occurs within groups, i.e., there is partial interference. Inverse probability weighted estimators have previously been developed for observational studies with partial interference. Unfortunately, these estimators are not well suited for studies with large clusters. Therefore, in this paper, the parametric g-formula is extended to allow for partial interference. G-formula estimators are proposed of overall effects, spillover effects when treated, and spillover effects when untreated. The proposed estimators can accommodate large clusters and do not suffer from the g-null paradox that may occur in the absence of interference. The large sample properties of the proposed estimators are derived, and simulation studies are presented demonstrating the finite-sample performance of the proposed estimators. The Demographic and Health Survey from the Democratic Republic of the Congo is then analyzed using the proposed g-formula estimators to assess the overall and spillover effects of bed net use on malaria.


Kayla W. Kilpatrick* and Michael G. Hudgens
Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, U.S.A.
*[email protected]


Keywords— causal inference; g-formula; herd immunity; observational studies; spillover

In settings where individuals interact or are connected, one individual’s treatment status may affect another individual’s outcome, i.e., interference may be present between individuals [Cox, 1958]. Interference is common in infectious disease research. For instance, if one individual wears a mask, this could affect whether another individual develops COVID-19 (coronavirus disease 2019). In some settings, it may be reasonable to assume that individuals within a cluster (or group) may interfere with one another, but not with individuals in other clusters, i.e., there is “partial interference” [Sobel, 2006]. Clusters might entail households, villages, schools, or other hierarchical structures. For instance, when assessing the effect of an intervention or exposure in students, it may be reasonable to assume no interference between students in different schools. Under this partial interference setting, several methods have been proposed for drawing inference about causal estimands of treatment effects; e.g., see Tchetgen Tchetgen and VanderWeele [2012], Papadogeorgou et al. [2019], Barkley et al. [2020].

In the presence of interference, it is of interest to assess the effect of policies which alter the distribution of treatment in the population. For instance, in the Democratic Republic of the Congo, public health officials and policy makers may be interested in estimates of malaria risk for different levels of bed net usage in the population. In observational studies where partial interference is present, it may be unlikely that treatment selection among individuals in the same cluster is independent. For example, in household studies of vaccine effects, we might expect vaccine uptake to be positively correlated between individuals in the same household. Therefore, estimands that will be most relevant to policy makers need to account for possible within-cluster treatment selection dependence. Papadogeorgou et al. [2019] and Barkley et al. [2020] recently proposed such estimands and developed corresponding inferential methods using inverse probability weighted (IPW) estimators. These IPW estimators entail inverse weighting by an estimated group propensity score. Unfortunately, this approach is not well suited for large groups, because in practice the estimated group propensity score is often near zero when there are a large number of individuals in a group [Saul and Hudgens, 2017, Chakladar et al., 2019, Liu et al., 2019]. In the absence of interference, a commonly used alternative to the IPW estimator is the parametric g-formula, which entails combining outcome regression and standardization [Robins, 1986, Hernán and Robins, 2006]. This paper proposes an extension of the parametric g-formula for observational studies where partial interference may be present which is better suited for large clusters compared to IPW.

The proposed methods were motivated by the 2013-14 Democratic Republic of the Congo (DRC) Demographic and Health Survey (DHS), a nationally representative survey to gather information about fertility, maternal and child health, sexually transmitted infections, mosquito net (hereafter “bed net”) usage, malaria, and other health information [MPSMRM, MSP, and ICF International, 2014]. In the analysis presented below, population level effects of bed net use on malaria are assessed using data from the DRC DHS. Figure 1 displays province-level bed net use and the proportion of children who did not use bed nets with malaria. The DHS data were collected at the household level. For the analysis here, a single linkage agglomerative cluster method was used to group individuals into clusters based on their household global positioning system (GPS) coordinates, resulting in a total of 395 clusters with at least one child and measured spatial information and other covariates. After performing this clustering algorithm, covariates and bed net use data are available for approximately 87,500 individuals.
Malaria outcome data are available for about 7,500 children aged 6 to 59 months (for brevity, henceforth referred to as “children”). Among the clusters with at least one child who did not use a bed net, the prevalence of malaria in children who did not use bed nets is inversely associated with the proportion of bed net usage in the cluster (Spearman correlation $r_s$ = −[…], $p$ = 0.[…]). […] population-level effects of bed net use on malaria when varying the proportion of children who use bed nets.

Figure 1: Malaria bed net study in the Democratic Republic of the Congo. Left map: province-level bed net usage. Right map: prevalence of malaria in children who do not use bed nets.

The outline of the remainder of this paper is as follows. Section 2 presents the proposed extension of the g-formula to allow for partial interference. Section 3 presents the simulation results evaluating the performance of the proposed methods in finite samples. In Section 4, the proposed estimators are employed to assess the effect of bed net use on malaria using data from the DRC DHS. Section 5 concludes with a discussion.

Suppose data is observed on $m$ clusters of individuals, and let $N_i$ denote the number of individuals in cluster $i$. Suppose some individuals within each cluster may receive treatment (e.g., bed net) and denote the vector of binary treatment indicators in cluster $i$ as $A_i = (A_{i1}, A_{i2}, \ldots, A_{iN_i})$ with $A_{ij}$ representing the treatment indicator for individual $j$. Let $S_i = (\sum_{j=1}^{N_i} A_{ij})/N_i$ denote the proportion of treated individuals in cluster $i$. Let $Y_i$ represent the outcome at the cluster level. In general, $Y_i$ may be defined differently depending on the outcome of interest. For example, in the analysis of the DRC data, $Y_i$ may be defined as the proportion of children in a cluster with malaria. Let $L_i$ represent a vector of cluster-level baseline covariates, including $N_i$. Let $O_i = \{L_i, S_i, Y_i\}$ be the observed random variables for cluster $i$, and assume $O_1, \ldots, O_m$ are independent and identically distributed. For notational simplicity, the subscript $i$ is omitted when not needed.

Assume partial interference, i.e., there is no interference between clusters, but there may be interference between individuals within the same cluster. For example, in the DRC analysis, one individual’s bed net usage may affect whether or not another individual in the same cluster gets malaria. Let $\mathcal{A}(N_i)$ denote the set of all vectors of length $N_i$ with binary entries such that $a = (a_{i1}, a_{i2}, \ldots, a_{iN_i}) \in \mathcal{A}(N_i)$ is a vector of possible treatment statuses for a cluster of size $N_i$. For cluster $i$, let $Y_i^{a}$ represent the potential outcome if, possibly counter to fact, the cluster had been exposed to $a \in \mathcal{A}(N_i)$, such that $Y_i^{a} = Y_i$ when $A_i = a$.

In addition to partial interference, we also assume the cluster level potential outcomes depend only on the proportion of individuals treated, but not which particular individuals receive treatment.
That is, $Y_i^{a} = Y_i^{a'}$ for any two vectors $a, a' \in \mathcal{A}(N_i)$ such that $\sum_{j=1}^{N_i} a_{ij} = \sum_{j=1}^{N_i} a'_{ij}$; this type of assumption is sometimes referred to as “stratified interference” [Hudgens and Halloran, 2008]. For example, in the DRC analysis, we will assume that the prevalence of malaria in a cluster only depends on the proportion of bed net users, not which specific individuals use bed nets. For cluster $i$, let $Y_i^{s}$ denote the potential outcome for any $a$ such that $(\sum_{j=1}^{N_i} a_{ij})/N_i = s$. Assume exchangeability conditional on $L$ at the cluster level, i.e., $Y^{s} \perp S \mid L$.

Population-level effects of interventions such as bed nets can be defined by differences in expected outcomes when the distribution of treatment is altered. For example, in the absence of interference, the effect of treatment is often defined by the difference in expected outcomes when all individuals receive treatment versus when no individuals receive treatment. Here we consider stochastic policies where individuals receive treatment with some probability between 0 and 1. Define policy $\alpha$ to be the setting where the expected proportion of individuals in a cluster who receive treatment is $\alpha$, i.e., $E_\alpha(S) = \alpha$, where in general the subscript $\alpha$ denotes the counterfactual scenario in which the policy $\alpha$ is implemented. For example, the DRC analysis below considers policies where different proportions of individuals use bed nets.

The expected outcome in a group of individuals under policy $\alpha$ can be expressed as:

$$
\begin{aligned}
\mu_\alpha = E_\alpha(Y) &= \int_l \sum_{s \in \mathcal{S}} E_\alpha(Y \mid S = s, L = l)\, P_\alpha(S = s \mid L = l)\, dF_{\alpha L}(l) \\
&= \int_l \sum_{s \in \mathcal{S}} E_\alpha(Y^{s} \mid S = s, L = l)\, P_\alpha(S = s \mid L = l)\, dF_{\alpha L}(l)
\end{aligned}
\tag{1}
$$

where $\mathcal{S} = \{0, 1/n, 2/n, \ldots, 1\}$ and $F_{\alpha L}$ denotes the marginal distribution of baseline covariates under policy $\alpha$. The first line of (1) follows from the law of total expectation and the second line from causal consistency [Cole and Frangakis, 2009].
Effects of interest can be defined by contrasts in $\mu_\alpha$ for two policies $\alpha$ and $\alpha'$, e.g.,

$$\delta(\alpha, \alpha') = \mu_\alpha - \mu_{\alpha'}. \tag{2}$$

Here, effects are defined as a difference in average potential outcomes, but ratios or other contrasts could be used instead. A primary contrast of interest in the DRC analysis is the difference in the proportion of children infected with malaria under policies $\alpha$ versus $\alpha'$.

In the DRC analysis, we will consider three different effects of bed nets: the overall effect, the spillover effect when treated, and the spillover effect when untreated. All three effects have the form (2) but differ in how $Y_i$ is defined. The overall effect compares the average outcome among all individuals in a cluster under policies $\alpha$ versus $\alpha'$. As it is likely that populations of interest will include a mixture of individuals who would and who would not choose to receive treatment, the overall effect may be valuable for public health officials and policy makers in assessing the overall impact of increasing treatment coverage among a population. For inference about the overall effect, $Y_i$ is a summary measure of outcomes in all individuals in cluster $i$. For the malaria data analysis, $Y_i$ is defined to be the proportion of all children in a cluster with malaria.

Two different spillover effects are also considered. The spillover effect when untreated contrasts average outcomes when an individual is untreated under policy $\alpha$ versus policy $\alpha'$. For this effect, $Y_i$ may be defined by some summary measure of outcomes in untreated individuals. In the DRC analysis of the spillover effect in the untreated, $Y_i$ will be defined as the proportion of children who do not use bed nets with malaria. If there are no untreated individuals in the cluster, we adopt the convention $Y_i = 0$. Similarly, the spillover effect when treated contrasts average outcomes when an individual is treated under policy $\alpha$ versus policy $\alpha'$.
For the spillover effect when treated in the DRC analysis, $Y_i$ will be the proportion of children who use bed nets with malaria, with $Y_i = 0$ in clusters with no treated individuals.

Additional assumptions are made to draw inference about the estimands described above. Assume $F_L = F_{\alpha L}$, i.e., the distribution of the covariates is the same under the factual and counterfactual policies. Let $\pi_s = g^{-1}(\rho_0 + \rho_1 L)$, where $g$ is some monotone, user-specified link function such as logit or probit, and assume

$$P(S = s \mid L) = P(S = s \mid L; \rho) = \binom{N}{Ns} \pi_s^{Ns} (1 - \pi_s)^{N - Ns} \tag{3}$$

where $\rho = (\rho_0, \rho_1)$. Likewise, under policy $\alpha$, let $\pi_{s\alpha} = g^{-1}(\gamma_{0\alpha} + \gamma_{1\alpha} L)$ and assume

$$P_\alpha(S = s \mid L) = P_\alpha(S = s \mid L; \gamma) = \binom{N}{Ns} \pi_{s\alpha}^{Ns} (1 - \pi_{s\alpha})^{N - Ns} \tag{4}$$

where $\gamma = (\gamma_{0\alpha}, \gamma_{1\alpha})$. The parameters $\rho$ in (3) are identifiable from the observable data, whereas the counterfactual parameters $\gamma$ in (4) are not identifiable without additional assumptions. As in Barkley et al. [2020], assume $\rho_1 = \gamma_{1\alpha}$; this assumption implies rank preservation between clusters in treatment propensity. In other words, if treatment adoption is more likely in cluster $i$ than cluster $j$, then under counterfactual policy $\alpha$, treatment adoption will also be more likely in cluster $i$ than cluster $j$. It follows that $\pi_{s\alpha} = g^{-1}(\gamma_{0\alpha} + \rho_1 L)$ and $\gamma_{0\alpha}$ is the solution to

$$\int_l E_\alpha(S \mid L = l; \gamma_{0\alpha}, \rho_1)\, dF_L(l) - \alpha = 0 \tag{5}$$

where $E_\alpha(S \mid L = l; \gamma_{0\alpha}, \rho_1) = \pi_{s\alpha}$. Finally, let $\pi_y = g^{-1}(\beta_0 + \beta_1 L + \beta_2 S)$ and assume

$$E(Y \mid S = s, L = l) = E(Y \mid S = s, L = l; \beta) = \pi_y \tag{6}$$

where $\beta = (\beta_0, \beta_1, \beta_2)$. For simplicity, an interaction between $S$ and $L$ is omitted from the model of $E(Y \mid S = s, L = l)$ but could be included. Assume that the mean of $Y$ given $S, L$ is the same under the factual scenario and counterfactual scenario $\alpha$, i.e., $E(Y \mid S = s, L = l) = E_\alpha(Y \mid S = s, L = l)$.

2.3 Inference

Estimators for $\mu_\alpha$ can be constructed as follows.
First estimate the parameters $\rho = (\rho_0, \rho_1)$ of model (3) and $\beta = (\beta_0, \beta_1, \beta_2)$ of model (6) via maximum likelihood; denote these estimators by $\hat\rho = (\hat\rho_0, \hat\rho_1)$ and $\hat\beta = (\hat\beta_0, \hat\beta_1, \hat\beta_2)$. Next, for a given policy $\alpha$, let $\hat\gamma_{0\alpha}$ denote the estimator of $\gamma_{0\alpha}$ obtained by finding the solution to (5) with $F_L$ replaced by its empirical distribution, i.e., $m^{-1} \sum_{i=1}^{m} \hat E_\alpha(S \mid L_i; \gamma_{0\alpha}, \hat\rho_1) - \alpha = 0$ where $\hat E_\alpha(S \mid L_i; \gamma_{0\alpha}, \hat\rho_1) = g^{-1}(\gamma_{0\alpha} + \hat\rho_1 L_i)$. Let $\hat P_\alpha(S = s \mid L)$ denote (4) evaluated using $(\hat\gamma_{0\alpha}, \hat\rho_1)$, and let $\hat E(Y \mid S = s, L = l)$ denote (6) evaluated using $\hat\beta$. Then the g-formula estimator of $\mu_\alpha$ is

$$\hat\mu_\alpha = \int_l \sum_{s \in \mathcal{S}} \hat E(Y \mid S = s, L = l)\, \hat P_\alpha(S = s \mid L = l)\, d\hat F_L(l)$$

where $\hat F_L$ denotes the empirical distribution function of $L$, and the estimator for the effects of interest is $\hat\delta(\alpha, \alpha') = \hat\mu_\alpha - \hat\mu_{\alpha'}$. The estimators $\hat\rho$, $\hat\beta$, $\hat\mu_\alpha$, $\hat\mu_{\alpha'}$, and $\hat\delta(\alpha, \alpha')$ are solutions to unbiased estimating equations (see Appendix). Therefore, it follows from standard large-sample estimating equation theory that the estimators are consistent and asymptotically Normal [Stefanski and Boos, 2002]. The empirical sandwich estimators, which are consistent estimators of the asymptotic variances, can be used to construct Wald confidence intervals (CIs).

For the DRC malaria example, the methods described above may be applied directly if children are considered the population of interest and we ignore data collected from adults. Such an approach makes inference about counterfactual scenarios regarding the distribution of bed net usage in children and is agnostic to bed net use by others in the clusters. However, the DRC DHS includes bed net data for all individuals, which can be utilized to estimate the effects of bed net usage by all individuals on the risk of malaria in children.
To do so, the approach above can simply be modified by changing the definition of $S$ to be the proportion of all individuals in the cluster, not just children, who use bed nets. Alternatively, one may choose to model separately the proportion of children using bed nets (say $S_1$) and the proportion of other individuals in the cluster using bed nets (say $S_2$). In particular, the population mean estimand $\mu_\alpha$ may be expressed

$$\int_l \sum_{s_1 \in \mathcal{S}_1} \sum_{s_2 \in \mathcal{S}_2} E(Y \mid S_1 = s_1, S_2 = s_2, L = l)\, P_\alpha(S_1 = s_1 \mid L = l, S_2 = s_2)\, P_\alpha(S_2 = s_2 \mid L = l)\, dF_L(l)$$

where policy $\alpha$ is defined such that individuals in strata 1 and 2 are treated with the same probability: $E_\alpha(S_1) = E_\alpha(S_2) = E_\alpha(S) = \alpha$. Inference proceeds analogous to Sections 2.2–2.3, but with separate parametric models for $S_1$ given $L, S_2$ and for $S_2$ given $L$; such an approach is taken in the DRC bed net analysis in Section 4.

In the absence of interference, the parametric g-formula may give rise to the so-called g-null paradox. That is, certain parametric models are guaranteed to be misspecified under the null hypothesis of no treatment effect. As a result, the null hypothesis of no treatment effect will be incorrectly rejected with high probability when the sample size is large [Robins, 1986, Robins and Wasserman, 1997, Taubman et al., 2009].

For the setting considered in this paper, the null hypothesis is that the proportion treated $S$ has no effect on the outcome $Y$, or that $\mu_\alpha = \mu_{\alpha'}$ for any two policies $\alpha, \alpha'$. If $S$ has no effect on $Y$, then $\beta_2 = 0$ and $E(Y \mid S = s, L) = E(Y \mid L)$. Recall $E(Y \mid S = s, L) = E_\alpha(Y \mid S = s, L)$. Therefore (1) reduces to

$$\mu_\alpha = \int_l E(Y \mid L = l) \sum_{s \in \mathcal{S}} P_\alpha(S = s \mid L = l)\, dF_{\alpha L}(l) = \int_l E(Y \mid L = l)\, dF_L(l) \tag{7}$$

where the second equality follows because $\sum_{s \in \mathcal{S}} P_\alpha(S = s \mid L = l) = 1$ and $F_{\alpha L} = F_L$ by assumption. The right-hand side of (7) does not depend on $\alpha$, so the g-null paradox does not occur here.
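As a rough illustration of the plug-in construction in Sections 2.2–2.3 and of the g-null argument above, the following Python sketch evaluates $\hat\mu_\alpha$ for supplied parameter values. It assumes a logit link and a single scalar covariate per cluster. The function names (`solve_gamma0`, `mu_alpha`), the toy covariates, cluster sizes, and coefficient values are all hypothetical, and the maximum likelihood fitting step is omitted, so this is a sketch under stated assumptions and not the authors' implementation.

```python
import math

def expit(x):
    """Inverse logit link."""
    return 1.0 / (1.0 + math.exp(-x))

def solve_gamma0(alpha, rho1, covs, lo=-30.0, hi=30.0, tol=1e-12):
    """Solve the empirical version of (5) for the intercept by bisection.

    The mean of expit(g0 + rho1 * L_i) is increasing in g0, so bisection applies.
    """
    def excess(g0):
        return sum(expit(g0 + rho1 * l) for l in covs) / len(covs) - alpha
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if excess(mid) < 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def mu_alpha(alpha, rho1, beta, covs, sizes):
    """Plug-in g-formula mean: average over clusters of sum_s E(Y|s,L) P_alpha(s|L)."""
    beta0, beta1, beta2 = beta
    g0 = solve_gamma0(alpha, rho1, covs)
    total = 0.0
    for l, n in zip(covs, sizes):
        p = expit(g0 + rho1 * l)              # counterfactual treatment probability
        for k in range(n + 1):                # k treated out of n, so s = k / n
            pmf = math.comb(n, k) * p**k * (1.0 - p)**(n - k)   # model (4)
            total += expit(beta0 + beta1 * l + beta2 * (k / n)) * pmf  # model (6)
    return total / len(covs)

# Hypothetical inputs (not taken from the paper).
covs = [0.2, -1.0, 0.5, 1.3]
sizes = [8, 16, 20, 8]
rho1 = -0.5
beta = (-1.0, 0.3, -0.8)

delta = mu_alpha(0.6, rho1, beta, covs, sizes) - mu_alpha(0.4, rho1, beta, covs, sizes)
print("delta(0.6, 0.4) =", delta)

# Under the null (coefficient on S equal to zero) the mean does not depend on alpha.
beta_null = (-1.0, 0.3, 0.0)
null_gap = mu_alpha(0.7, rho1, beta_null, covs, sizes) - mu_alpha(0.3, rho1, beta_null, covs, sizes)
print("gap under the null =", null_gap)
```

With a negative coefficient on $S$, raising $\alpha$ lowers the estimated mean, and with that coefficient set to zero the two means agree up to floating-point error, mirroring (7).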
Simulation studies were conducted to evaluate the finite sample properties of the proposed g-formula estimator. Three separate simulation studies were conducted for the three target estimands: overall effect, spillover effect when treated, and spillover effect when not treated. For the overall effect simulation study, 1000 data sets each with $m = 125$ clusters were stochastically generated as follows:

(i) The number of individuals per cluster $N_i$ was simulated such that $P(N_i = 8)$ = 0.[…], $P(N_i = 16)$ = 0.[…], and $P(N_i = 20)$ = 0.[…].

(ii) […] $L_{1i}$ and $L_{2i}$ were generated, where $L_{1i}$ was Normal with mean 40 and standard deviation 10, and $L_{2i}$ was such that $P(L_{2i} = 0)$ = 5/[…], $P(L_{2i} = 1)$ = 3/[…], $P(L_{2i} = 2)$ = 4/[…], $P(L_{2i} = 3)$ = 5/[…], $P(L_{2i} = 4)$ = 1/[…].

(iii) […] $N_i$ and $\pi_{si} = \text{expit}(\rho_0 + \rho_1 L_{1i} + \rho_2 L_{2i})$ where $\rho$ = (logit(0.[…]), −[…], −[…]). […] $S_i$, was then calculated by dividing the number of treated individuals by $N_i$.

(iv) For each cluster, the outcome $Y_i$ was set equal to $X_i / N_i$ where $X_i$ was Binomial with parameters $N_i$ and $\pi_{yi} = \text{expit}(\beta_0 + \beta_1 L_{1i} + \beta_2 S_i + \beta_3 L_{2i})$ where $\beta$ = (logit(0.[…]), −[…], −[…], −[…]).

[…] $Y$ given $S$ and $L$, and of $S$ given $L$ were fit by maximum likelihood. The asymptotic variance of the estimators was estimated using the empirical sandwich variance estimator, and Wald 95% CIs were calculated with these variance estimates.

The true values of estimands for policies $\alpha \in$ {[…], […], […]} were calculated analytically for the data generating process described above. In particular, the true values of $\gamma_{0\alpha}$ are the solutions to (5) where $\pi_{s\alpha} = \text{expit}(\gamma_{0\alpha} + \rho_1 L_1 + \rho_2 L_2)$. The counterfactual probabilities $P_\alpha(S = s \mid L)$ for $s \in \mathcal{S}$ can then be computed via (4) based on the true values of $\gamma_{0\alpha}, \rho_1, \rho_2$. Similarly, $E(Y \mid S = s, L)$ for $s \in \mathcal{S}$ may be evaluated using (6) and the true value of $\beta$. Finally, the true values of $\mu_\alpha$ can be found using (1).

Results for the overall effect simulation study are given in the top third of Table 1.
The average bias of the proposed g-formula estimators was negligible, and the CIs contained the true parameter values for approximately 95% of the simulated datasets. The average of the estimated sandwich standard errors was approximately equal to the empirical standard errors, with standard error ratios of approximately 1.

The simulation study described above was repeated for the spillover effect when treated, with the following modification. In step (iv), the cluster outcome $Y_i$ was set equal to $X_i / (N_i S_i)$ where $X_i$ was Binomial with parameters $N_i S_i$ and $\pi_{yi}$. If there were no treated individuals in a cluster, then $Y_i$ was set to 0. Results for the g-formula estimator of the spillover effect when treated are presented in the middle part of Table 1. Results are similar to the overall effect, except the standard error for the g-formula estimator of the spillover effect when treated is larger because fewer individuals contribute to the outcome.

Finally, a third simulation study was conducted for the spillover effect when untreated. The simulation steps above were repeated, but with step (iv) modified such that the cluster outcome $Y_i$ was set equal to $X_i / \{N_i (1 - S_i)\}$ where $X_i$ was Binomial with parameters $N_i (1 - S_i)$ and $\pi_{yi}$, with $Y_i$ set to 0 if $S_i = 1$. Results are given in the bottom section of Table 1.

Table 1: Summary of simulation study results as described in Section 3. Truth: true value of the estimand targeted by the estimator. Bias: average bias of the g-formula estimates over 1000 datasets. Cov%: empirical coverage of Wald 95% CIs. ASE: average of estimated sandwich standard errors. ESE: empirical standard error. SER: ASE/ESE.

| Estimator | Truth | Bias | Cov% | ASE | ESE | SER |
|---|---|---|---|---|---|---|
| All Individuals | | | | | | |
| $\hat\mu_{\alpha}$, $\alpha$ = 0.[…] | […] | […] | […] | […] | […] | […] |
| $\hat\mu_{\alpha}$, $\alpha$ = 0.[…] | […] | […] | […] | […] | […] | […] |
| $\hat\mu_{\alpha}$, $\alpha$ = 0.[…] | […] | […] | […] | […] | […] | […] |
| $\hat\delta(\alpha, \alpha')$, $\alpha$ = 0.[…], $\alpha'$ = 0.4 | -0.038 | -0.001 | 94% | 0.0172 | 0.0180 | 0.95 |
| $\hat\delta(\alpha, \alpha')$, $\alpha$ = 0.[…], $\alpha'$ = 0.5 | -0.019 | -0.000 | 94% | 0.0084 | 0.0089 | 0.95 |
| $\hat\delta(\alpha, \alpha')$, $\alpha$ = 0.[…], $\alpha'$ = 0.4 | -0.019 | -0.000 | 94% | 0.0087 | 0.0091 | 0.96 |
| When Treated | | | | | | |
| $\hat\mu_{\alpha}$, $\alpha$ = 0.[…] | […] | […] | […] | […] | […] | […] |
| $\hat\mu_{\alpha}$, $\alpha$ = 0.[…] | […] | […] | […] | […] | […] | […] |
| $\hat\mu_{\alpha}$, $\alpha$ = 0.[…] | […] | […] | […] | […] | […] | […] |
| $\hat\delta(\alpha, \alpha')$, $\alpha$ = 0.[…], $\alpha'$ = 0.4 | -0.038 | 0.002 | 93% | 0.0255 | 0.0267 | 0.96 |
| $\hat\delta(\alpha, \alpha')$, $\alpha$ = 0.[…], $\alpha'$ = 0.5 | -0.019 | 0.001 | 93% | 0.0126 | 0.0132 | 0.96 |
| $\hat\delta(\alpha, \alpha')$, $\alpha$ = 0.[…], $\alpha'$ = 0.4 | -0.019 | 0.001 | 93% | 0.0129 | 0.0135 | 0.96 |
| When Untreated | | | | | | |
| $\hat\mu_{\alpha}$, $\alpha$ = 0.[…] | […] | […] | […] | […] | […] | […] |
| $\hat\mu_{\alpha}$, $\alpha$ = 0.[…] | […] | […] | […] | […] | […] | […] |
| $\hat\mu_{\alpha}$, $\alpha$ = 0.[…] | […] | […] | […] | […] | […] | […] |
| $\hat\delta(\alpha, \alpha')$, $\alpha$ = 0.[…], $\alpha'$ = 0.4 | -0.038 | 0.001 | 94% | 0.0248 | 0.0259 | 0.96 |
| $\hat\delta(\alpha, \alpha')$, $\alpha$ = 0.[…], $\alpha'$ = 0.5 | -0.019 | 0.000 | 94% | 0.0122 | 0.0127 | 0.96 |
| $\hat\delta(\alpha, \alpha')$, $\alpha$ = 0.[…], $\alpha'$ = 0.4 | -0.019 | 0.000 | 94% | 0.0126 | 0.0131 | 0.96 |
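The data-generating steps (i) to (iv) of the overall effect simulation can be sketched in Python as below. The cluster sizes (8, 16, 20), the Normal(40, 10) covariate, the five-level second covariate, and the binomial structure come from the description in this section; every numeric probability and coefficient in the code is a hypothetical stand-in, because those values are not available here. In particular, the weights 5, 3, 4, 5, 1 for the second covariate are simply normalized, which is an assumption. The names `rbinom` and `simulate_dataset` are illustrative only.

```python
import math
import random

def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))

def rbinom(rng, n, p):
    """Draw a Binomial(n, p) variate as a sum of n Bernoulli draws."""
    return sum(rng.random() < p for _ in range(n))

def simulate_dataset(m=125, seed=0):
    """Generate m clusters following steps (i)-(iv); all numeric values are hypothetical."""
    rng = random.Random(seed)
    size_probs = [0.4, 0.35, 0.25]        # hypothetical P(N_i = 8), P(N_i = 16), P(N_i = 20)
    rho = (0.4, -0.02, -0.2)              # hypothetical treatment-model coefficients
    beta = (0.5, -0.02, -1.0, -0.1)       # hypothetical outcome-model coefficients
    clusters = []
    for _ in range(m):
        # (i) cluster size
        n = rng.choices([8, 16, 20], weights=size_probs)[0]
        # (ii) baseline covariates; L2 weights are normalized by random.choices
        l1 = rng.gauss(40.0, 10.0)
        l2 = rng.choices([0, 1, 2, 3, 4], weights=[5, 3, 4, 5, 1])[0]
        # (iii) number treated is Binomial(N_i, pi_si); S_i is the proportion treated
        pi_s = expit(rho[0] + rho[1] * l1 + rho[2] * l2)
        s = rbinom(rng, n, pi_s) / n
        # (iv) cluster outcome Y_i = X_i / N_i with X_i ~ Binomial(N_i, pi_yi)
        pi_y = expit(beta[0] + beta[1] * l1 + beta[2] * s + beta[3] * l2)
        y = rbinom(rng, n, pi_y) / n
        clusters.append({"N": n, "L1": l1, "L2": l2, "S": s, "Y": y})
    return clusters

data = simulate_dataset()
print(len(data), data[0])
```

The spillover variants would change only step (iv), drawing $X_i$ among the $N_i S_i$ treated (or $N_i (1 - S_i)$ untreated) individuals and setting $Y_i = 0$ when that group is empty.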

Analysis of Bed Net Use on Malaria in the Democratic Republic of the Congo

The methods described above were applied to the DRC DHS survey to draw inference about the effects of bed nets on malaria in children when varying the proportion of children in this age range who use bed nets. As mentioned in Section 1, a single linkage agglomerative hierarchical cluster method [Everitt et al., 2011] was used to group households of individuals into clusters. The maximum distance between any two households in the same cluster was constrained to not exceed 10 kilometers. This distance was selected based on the maximum flight distance of an Anopheles mosquito [Janko et al., 2018]. The GPS coordinates used in the clustering algorithm were randomly displaced from the actual location to prevent participant identification. Rural clusters were displaced up to 5 kilometers, while urban clusters were displaced up to 2 kilometers [MPSMRM, MSP, and ICF International, 2014]. Using this clustering algorithm, there were 395 clusters with at least one child that were not missing spatial information and other covariates. Figure 2 displays the number of children per cluster, as well as the proportion of these children who used bed nets; on average, 55% of children utilized bed nets.

[Figure 2 plot; panels: Number of Children per Cluster (annotation: one cluster with 401 individuals), Proportion of Children Using Bed Nets per Cluster]
Figure 2: Malaria bed net study in the Democratic Republic of the Congo. Left panel: number of children with a measured malaria outcome per cluster. Right panel: proportion of children who used bed nets per cluster.
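To give a feel for the clustering step, the Python sketch below groups household coordinates by cutting a single-linkage hierarchy at a fixed merge distance, which is equivalent to taking connected components of the graph that links households no more than the threshold apart. This is an approximation under stated assumptions, not the authors' procedure: chaining can place two households in the same cluster even when they are farther apart than the threshold, so the sketch does not by itself enforce the 10 kilometer bound on within-cluster distances described above. The coordinates and the names `haversine_km` and `single_linkage_clusters` are hypothetical.

```python
import math

def haversine_km(p, q):
    """Great-circle distance in km between two (latitude, longitude) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2.0) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2.0) ** 2)
    return 2.0 * 6371.0 * math.asin(math.sqrt(a))

def single_linkage_clusters(points, threshold_km):
    """Label points by single-linkage clusters cut at threshold_km (union-find)."""
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if haversine_km(points[i], points[j]) <= threshold_km:
                parent[find(i)] = find(j)   # merge the two components

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(len(points))]

# Hypothetical household coordinates (latitude, longitude).
households = [(-4.30, 15.30), (-4.33, 15.32), (-4.38, 15.36), (-2.00, 18.00)]
cluster_ids = single_linkage_clusters(households, threshold_km=10.0)
print(cluster_ids)
print(round(haversine_km(households[0], households[2]), 1))
```

In this toy example the first three households end up in one cluster through chaining, although the first and third are more than 10 kilometers apart; a diameter constraint like the one in the text would need an additional check or a different linkage rule.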

Because malaria was measured only in children, $Y$, $S$, and $N$ for each cluster were defined based only on children with a measured outcome. Exchangeability was assumed conditional on the cluster-level proportion of women, as well as cluster-level averages of building materials (described below), urbanicity, altitude, age, temperature in the month of the survey, total precipitation in a 10 kilometer radius the month before the survey, and proportion of agricultural land cover within a 10 kilometer radius in 2013. The building material variable was defined similar to Levitz et al. [2018] where roof and wall materials were summed for each individual within a cluster. Natural materials were worth 0 points, rudimentary materials 1 point, and finished materials 2 points. Hence, for each individual, the building material variable was an integer between 0 and 4. The link $g = \text{logit}$ was used for fitting both the treatment and outcome models.

Figure 3 displays g-formula estimates of the population mean estimands over a range of policies $\alpha \in [0.1, 0.9]$ in all individuals, when treated, and when untreated. The left panel of Figure 3 shows that the overall risk of malaria decreases as $\alpha$ increases, which is not surprising since bed nets are known to protect against malaria and bed net usage increases with $\alpha$. The middle panel of Figure 3 demonstrates that the risk of malaria when treated also decreases as $\alpha$ increases, suggesting the presence of interference. In other words, treated individuals appear to benefit from others in their cluster also using bed nets. On the other hand, there appears to be little or no spillover effect when untreated (right panel of Figure 3).

Estimates of the overall effects, spillover effects when treated, and spillover effects when untreated for different policies $\alpha$ compared to the current factual policy $\alpha' = 0.55$ are displayed in Figure 4. These estimates approximate the expected change in the number of cases of malaria due to increasing or decreasing bed net use relative to current utilization. For example, $\hat\delta$($\alpha$ = 0.[…], $\alpha'$ = 0.55) = −0.056 (95% CI −[…], −[…]) […] $\delta$($\alpha$ = 0.[…], $\alpha'$ = 0.55) = −0.077 (95% CI −[…], −[…]) […] $\alpha$ = 0.[…] […] $\alpha'$ = 0.55 is −0.011 (95% CI −[…], […]).

[Figure 3 plot; panels: Everyone, Treated, Untreated; x-axis: α; y-axis: Estimated Population Means]
Figure 3: Estimates of the population mean estimands from the malaria bed net study. The proportion of treated children is denoted by policy $\alpha$. The shaded regions indicate 95% confidence intervals.

For sake of comparison, the Barkley et al. [2020] IPW estimator was also applied to the DRC DHS data to estimate the bed net effects. However, the mixed effects model used to estimate the group propensity scores did not converge, hence it was not possible to compute the IPW estimates. Given that the DRC data includes several large clusters, it is not surprising issues were encountered when attempting to compute the IPW estimator. A possible workaround would be to exclude the large clusters [Chakladar et al., 2019], but this would inefficiently discard data and limit generalizability of the results.

The results above are based on clustering of households such that the maximum distance between any two households in the same cluster was 10 km. Sensitivity analyses were performed where clusters were instead defined based on maximum distances of 5 km and 2.5 km. There were 415 clusters in the 5 km analysis and 449 clusters in the 2.5 km analysis that were not missing spatial information and had at least one child. Population mean estimates were very similar between the 2.5 km, 5 km and 10 km analyses; see Figure 5.

[Figure 4 plot; panels: Overall Effect, Spillover Effect When Treated, Spillover Effect When Untreated; x-axis: α; y-axis: Effect Estimates]
Figure 4: Estimated effects from the malaria bed net study. The proportion of treated children is denoted by policy $\alpha$. Effects contrast $\alpha$ with $\alpha' = 0.55$, the current factual policy. The shaded regions indicate point-wise 95% confidence intervals.

[Figure 5 plot; panels: Everyone, Treated, Untreated; x-axis: α; y-axis: Estimated Population Means]
Figure 5: Estimates of the population mean estimands from the malaria bed net study. The proportion of treated children is denoted by policy $\alpha$. Solid black lines represent 10 km, solid gray lines represent 5 km, and dashed lines represent 2.5 km clusters.

To investigate the effect of changing the proportion of the entire population who use bed nets, the 10 kilometer clusters were also analyzed using the methods from Section 2.4. The estimated population means for the general population policy compared to the children-only policy are shown in Figure 6. Changes in the general population policy are associated with greater changes in the mean outcome in all individuals and when treated compared to the children-only policy. However, the largest difference in estimated population means between the general population policy and the children-only policy is only 0.05. For the spillover effect when untreated, the estimates are approximately the same for both the children-only and general population policies.

[Figure 6 plot; panels: Everyone, Treated, Untreated; x-axis: α; y-axis: Estimated Population Means]
Figure 6: Estimates of the population mean estimands from the malaria bed net study for the children-only policy (solid lines) and general population policy (dashed lines).
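The Wald intervals reported for the contrasts in this section combine a point estimate with a sandwich standard error. As a small worked illustration, the Python sketch below forms $\hat\delta(\alpha, \alpha') = \hat\mu_\alpha - \hat\mu_{\alpha'}$ and its Wald interval from the estimated variances and covariance of the two means. All numbers are hypothetical and are not estimates from the DRC data; the function name `wald_ci_contrast` is likewise illustrative. In the paper the contrast is itself a component of the stacked estimating equations, so its variance can be read directly from the sandwich matrix; the variance formula used here is the equivalent expression for a difference of two estimators.

```python
import math
from statistics import NormalDist

def wald_ci_contrast(mu_a, mu_b, var_a, var_b, cov_ab, level=0.95):
    """Wald interval for mu_a - mu_b given variances and covariance of the two estimators."""
    delta = mu_a - mu_b
    # Var(mu_a - mu_b) = Var(mu_a) + Var(mu_b) - 2 Cov(mu_a, mu_b)
    se = math.sqrt(var_a + var_b - 2.0 * cov_ab)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)   # standard Normal quantile
    return delta, delta - z * se, delta + z * se

# Hypothetical inputs, for illustration only.
delta_hat, lower, upper = wald_ci_contrast(
    mu_a=0.30, mu_b=0.35, var_a=0.0004, var_b=0.0005, cov_ab=0.0003
)
print(delta_hat, lower, upper)
```

Because the two means are estimated from the same clusters, they are typically positively correlated, and ignoring the covariance term would tend to overstate the width of the interval.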

In the presence of partial interference, the proposed g-formula estimator is an alternative to existing IPW estimators, such as those proposed in Tchetgen Tchetgen and VanderWeele [2012]. The g-formula estimator can accommodate large clusters, unlike IPW estimators [Chakladar et al., 2019, Liu et al., 2019], and does not suffer from the g-null paradox that may occur in the absence of interference. Like the IPW estimators of Papadogeorgou et al. [2019] and Barkley et al. [2020], the proposed methods target counterfactual estimands which allow for within-cluster dependence of treatment selection and thus may be more relevant to policy makers. Consistency of the proposed g-formula estimator requires that the parametric models be correctly specified; future research could explore relaxing these parametric assumptions, perhaps by using semiparametric or nonparametric models. While motivated by infectious disease prevention studies, the g-formula methods developed in this paper are applicable in other settings where partial interference may be present.

Appendix

The g-formula estimators in Section 2.3 can be shown to be consistent and asymptotically Normal using standard large-sample estimating equation theory. Let $\theta = (\rho, \gamma_\alpha, \gamma_{\alpha'}, \beta, \mu_\alpha, \mu_{\alpha'}, \delta(\alpha, \alpha'))$. Estimating functions for $\hat\rho$ and $\hat\beta$ are given by score equations corresponding to the binomial models $P(S = s \mid L; \rho)$ and $P(Y = y \mid S = s, L; \beta)$. Denote these score equations by $\psi_\rho(O; \theta)$ and $\psi_\beta(O; \theta)$. For policy $\alpha$, let

$$\psi_{\gamma_\alpha}(O; \theta) = E_\alpha(S \mid L = l; \gamma_\alpha, \rho) - \alpha,$$

where $E_\alpha(S \mid L = l; \gamma_\alpha, \rho) = \operatorname{expit}(\gamma_\alpha + \rho L)$, and let

$$\psi_{\mu_\alpha}(O; \theta) = \sum_{s \in \mathcal{S}} E(Y \mid S = s, L; \beta)\, P_\alpha(S = s \mid L; \gamma_\alpha, \rho) - \mu_\alpha.$$

Define $\psi_{\delta(\alpha,\alpha')}(O; \theta) = \psi_{\mu_\alpha}(O; \theta) - \psi_{\mu_{\alpha'}}(O; \theta)$, and let $\psi_\theta = (\psi_\rho, \psi_{\gamma_\alpha}, \psi_{\gamma_{\alpha'}}, \psi_\beta, \psi_{\mu_\alpha}, \psi_{\mu_{\alpha'}}, \psi_{\delta(\alpha,\alpha')})^\top$. Then the estimator $\hat\theta = (\hat\rho, \hat\gamma_\alpha, \hat\gamma_{\alpha'}, \hat\beta, \hat\mu_\alpha, \hat\mu_{\alpha'}, \hat\delta(\alpha, \alpha'))$ is the solution to the vector estimating equation $\sum_{i=1}^{m} \psi_\theta(O_i; \theta) = 0$.

It is straightforward to show these estimating equations are unbiased. Because $\psi_\rho(O; \theta)$ and $\psi_\beta(O; \theta)$ are score equations, $\int \psi_\rho(O; \theta)\, dF_O(O) = 0$ and $\int \psi_\beta(O; \theta)\, dF_O(O) = 0$, where $F_O(O)$ denotes the distribution of the observed variables $O$. For policy $\alpha$, $\gamma_\alpha$ is the solution to (5), implying $E\{\psi_{\gamma_\alpha}(O; \theta)\} = 0$.
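As an illustrative aside, the estimating function $\psi_{\gamma_\alpha}$ can be read as a calibration step: given $\hat\rho$, the intercept $\gamma_\alpha$ is chosen so that the average of $\operatorname{expit}(\gamma_\alpha + \hat\rho L_i)$ over clusters equals the policy value $\alpha$. The Python sketch below solves the summed estimating equation by bisection. It is a minimal sketch under stated assumptions, not the authors' implementation: $L$ is treated as a single scalar covariate per cluster, $\hat\rho$ is taken as already estimated, and the function names and input values are hypothetical.

```python
import math


def expit(x):
    """Numerically stable inverse logit."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def psi_gamma_sum(gamma, rho, covariates, alpha):
    """Sum over clusters of psi_gamma = expit(gamma + rho * L_i) - alpha."""
    return sum(expit(gamma + rho * l) - alpha for l in covariates)


def solve_gamma(rho, covariates, alpha, lo=-50.0, hi=50.0, tol=1e-10):
    """Find gamma with psi_gamma_sum(gamma) = 0 by bisection.

    The summed estimating function is strictly increasing in gamma,
    so a sign change on [lo, hi] brackets a unique root.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie strictly between 0 and 1")
    if psi_gamma_sum(lo, rho, covariates, alpha) > 0:
        raise ValueError("lower bracket is too large")
    if psi_gamma_sum(hi, rho, covariates, alpha) < 0:
        raise ValueError("upper bracket is too small")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if psi_gamma_sum(mid, rho, covariates, alpha) < 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Hypothetical inputs (assumptions for illustration only).
covariates = [-1.2, -0.4, 0.0, 0.3, 0.9, 1.5]  # one scalar L_i per cluster
rho_hat = 0.8                                  # assumed fitted coefficient

gamma_half = solve_gamma(rho_hat, covariates, alpha=0.5)
mean_coverage = sum(expit(gamma_half + rho_hat * l) for l in covariates) / len(covariates)
```

In this sketch `mean_coverage` matches the requested policy value up to the bisection tolerance, and a larger `alpha` yields a larger solved intercept because the estimating function is monotone in `gamma`.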
Next note

$$\begin{aligned}
E\{\psi_{\mu_\alpha}(O; \theta)\} &= E\Big\{\sum_{s \in \mathcal{S}} E_\alpha(Y \mid S = s, L)\, P_\alpha(S = s \mid L)\Big\} - \mu_\alpha \\
&= E\Big\{\sum_{s \in \mathcal{S}} E_\alpha(Y^{s} \mid S = s, L)\, P_\alpha(S = s \mid L)\Big\} - \mu_\alpha \\
&= E\Big\{\sum_{s \in \mathcal{S}} E_\alpha(Y^{s} \mid L)\, P_\alpha(S = s \mid L)\Big\} - \mu_\alpha \\
&= \int_{l} \sum_{s \in \mathcal{S}} E_\alpha(Y^{s} \mid L = l)\, P_\alpha(S = s \mid L = l)\, dF_L(l) - \mu_\alpha \\
&= 0,
\end{aligned}$$

where the first equality holds assuming the $Y \mid S, L$ and $S \mid L$ models are correctly specified and that $E_\alpha(Y \mid S = s, L = l) = E(Y \mid S = s, L = l)$, the second equality by causal consistency, the third equality from conditional exchangeability, and the last equality from the definition of $\mu_\alpha$.

From standard large-sample estimating equation theory, it follows that under suitable regularity conditions, $\hat\theta \to_p \theta$ and $\sqrt{m}(\hat\theta - \theta) \to_d N(0, \Sigma)$, where $\Sigma = U^{-1} W (U^{-1})^\top$ for $U = E\{-\dot\psi_\theta(O; \theta)\}$, $\dot\psi_\theta(O; \theta) = \partial \psi_\theta(O; \theta) / \partial \theta^\top$, and $W = E\{\psi_\theta(O; \theta)^{\otimes 2}\}$ [Stefanski and Boos, 2002]. The asymptotic variance $\Sigma$ can be consistently estimated by the empirical sandwich variance estimator $\hat\Sigma = \hat U^{-1} \hat W (\hat U^{-1})^\top$, where $\hat U = m^{-1} \sum_{i=1}^{m} -\dot\psi_\theta(O_i; \hat\theta)$ and $\hat W = m^{-1} \sum_{i=1}^{m} \psi_\theta(O_i; \hat\theta)^{\otimes 2}$.

Acknowledgments

The authors thank Shaina Alexandria, Bryan Blette, M. Elizabeth Halloran, Sam Rosin, Bonnie Shook-Sa, and Jaffer Zaidi for providing comments on the manuscript. The authors also thank Mark Janko for providing temperature, precipitation, and agricultural density data. This work was partially supported by NIH grants R01 AI085073 and T32 ES007018.

Supplementary Material

Code and Data Availability:

References

B.G. Barkley, M.G. Hudgens, J.D. Clemens, M. Ali, and M.E. Emch. Causal inference from observational studies with clustered interference, with application to a cholera vaccine study. Annals of Applied Statistics, 14(3):1432–1448, 2020.

S. Chakladar, M.G. Hudgens, M.E. Halloran, J.D. Clemens, M. Ali, and M.E. Emch. Inverse probability weighted estimators of vaccine effects accommodating partial interference and censoring. arXiv preprint arXiv:1910.03536, 2019.

S.R. Cole and C.E. Frangakis. The consistency statement in causal inference: a definition or an assumption? Epidemiology, 20(1):3–5, 2009.

D.R. Cox. Planning of Experiments. New York: Wiley, 1958.

B.S. Everitt, S. Landau, M. Leese, and D. Stahl. Cluster Analysis. John Wiley, 5th edition, 2011.

M.A. Hernán and J.M. Robins. Estimating causal effects from epidemiological data. Journal of Epidemiology & Community Health, 60(7):578–586, 2006.

M.G. Hudgens and M.E. Halloran. Toward causal inference with interference. Journal of the American Statistical Association, 103(482):832–842, 2008. doi: 10.1198/016214508000000292.

M.M. Janko, S.R. Irish, B.J. Reich, M. Peterson, S.M. Doctor, M.K. Mwandagalirwa, J.L. Likwela, A.K. Tshefu, S.R. Meshnick, and M.E. Emch. The links between agriculture, Anopheles mosquitoes, and malaria risk in children younger than 5 years in the Democratic Republic of the Congo: a population-based, cross-sectional, spatial study. The Lancet Planetary Health, 2(2):e74–e82, 2018.

L. Levitz, M. Janko, K. Mwandagalirwa, K.L. Thwai, J.L. Likwela, A.K. Tshefu, M. Emch, and S.R. Meshnick. Effect of individual and community-level bed net usage on malaria prevalence among under-fives in the Democratic Republic of Congo. Malaria Journal, 17(1):39, 2018.

L. Liu, M.G. Hudgens, B. Saul, J.D. Clemens, M. Ali, and M.E. Emch. Doubly robust estimation in observational studies with partial interference. Stat, 8(1):e214, 2019.

République Démocratique du Congo Enquête Démographique et de Santé (EDS-RDC) 2013-2014 [Dataset]. CDPR61SD, CDGE61FL. Ministère du Plan et Suivi de la Mise en oeuvre de la Révolution de la Modernité (MPSMRM), Ministère de la Santé Publique (MSP), and ICF International, Rockville, Maryland, USA, 2014. Rockville, Maryland, USA: MPSMRM, MSP and ICF International [Producers], ICF [Distributor].

G. Papadogeorgou, F. Mealli, and C.M. Zigler. Causal inference with interfering units for cluster and population level treatment allocation programs. Biometrics, 75(3):778–787, 2019.

J. Robins. A new approach to causal inference in mortality studies with sustained exposure period - application to control of the healthy worker survivor effect. Mathematical Modelling, 7:1393–1512, 1986. ISSN 0270-0255. doi: 10.1016/0270-0255(86)90088-6.

J.M. Robins and L. Wasserman. Estimation of effects of sequential treatments by reparameterizing directed acyclic graphs. Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence, 1997.

B.C. Saul and M.G. Hudgens. A recipe for inferference: Start with causal inference. Add interference. Mix well with R. Journal of Statistical Software, 82(2), 2017.

M.E. Sobel. What do randomized studies of housing mobility demonstrate? Causal inference in the face of interference. Journal of the American Statistical Association, 101(476):1398–1407, 2006. doi: 10.1198/016214506000000636.

L.A. Stefanski and D.D. Boos. The calculus of M-estimation. The American Statistician, 56(1):29–38, 2002.

S.L. Taubman, J.M. Robins, M.A. Mittleman, and M.A. Hernán. Intervening on risk factors for coronary heart disease: an application of the parametric g-formula. International Journal of Epidemiology, 38(6):1599–1611, 2009.

E.J. Tchetgen Tchetgen and T.J. VanderWeele. On causal inference in the presence of interference. Statistical Methods in Medical Research, 21(1):55–75, 2012. doi: 10.1177/0962280210386779.