Challenges to estimating contagion effects from observational data

Abstract

A growing body of literature attempts to learn about contagion using observational (i.e. non-experimental) data collected from a single social network. While the conclusions of these studies may be correct, the methods rely on assumptions that are likely--and sometimes guaranteed to be--false, and therefore the evidence for the conclusions is often weaker than it seems. Developing methods that do not need to rely on implausible assumptions is an incredibly challenging and important open problem in statistics. Appropriate methods don't (yet!) exist, so researchers hoping to learn about contagion from observational social network data are sometimes faced with a dilemma: they can abandon their research program, or they can use inappropriate methods. This chapter will focus on the challenges and the open problems and will not weigh in on that dilemma, except to mention here that the most responsible way to use any statistical method, especially when it is well-known that the assumptions on which it rests do not hold, is with a healthy dose of skepticism, with honest acknowledgment and deep understanding of the limitations, and with copious caveats about how to interpret the results.



Elizabeth L. Ogburn

Johns Hopkins Bloomberg School of Public Health, 615 N. Wolfe St., Baltimore MD, e-mail: [email protected]; support from ONR grant N000141512343. arXiv [stat.AP].

Suppose that students attending the residential Faber College are measured and weighed at the start and close of each school year, and a complete social network census is taken, cataloguing all social ties among members of the student body. In addition, researchers have access to basic demographic covariates measured on each student. Researchers are interested in testing whether there is a contagion effect for body mass index (BMI): if one individual (the ego) gains (or loses) weight, does that make his or her social contacts (the alters) more likely to do the same? They are also interested in estimating the contagion effect if one exists: if an ego gains (or loses) weight, what is the expected increase (or decrease) in the alters’ body mass indices?

There are many different procedures one could use to test for or estimate a contagion effect, using different models, different assumptions, different sets of covariates, different ways of calculating intervals or uncertainty, and the list goes on. In order for a procedure to be useful, it has to satisfy two requirements. First, it has to isolate the causal effect of the ego’s change in BMI on the alters’ changes in BMI from potential other sources of similarity between the ego’s and the alters’ outcomes. This has to do with confounding, which is the subject of Section 4.

The second requirement for a useful analysis is that it must be generalizable to populations beyond the precise student body used in the analysis. We would like to be able to extrapolate what we learn about contagion from the Faber student body to contagion of BMI in similar college populations across different colleges or even across different years at Faber College. Assume that the student body we observe at Faber College is representative of these other student populations, that is, that the true underlying contagion effect for the observed sample of Faber students is the same as the true underlying contagion effect in the other college populations to which we want to extrapolate. Then one way to determine whether we are warranted in extrapolating from Faber students to the other similar groups of students is to calculate a confidence interval for the true contagion effect, based on a model of asymptotic growth of the sample. For example, if the sample is large enough that a central limit theorem approximately holds for the contagion effect estimate, then a Gaussian confidence interval around the sample mean is approximately valid. Under the assumption of the same true underlying contagion effect, our confidence that this interval covers the true contagion effect for Faber College students is the same as our confidence that it covers the true contagion effect for students at a different college or in a different year. As in many settings for statistical inference, asymptotics are appropriate not because we care about an infinite population but because they shed light on finite samples. This requires valid statistical inference, which is the subject of Section 5.

Questions about the influence one subject has on the outcome of another subject are inherently questions about causal effects: contagion is a causal effect on an ego’s outcome at time $t$ of his alter’s outcome at time $s$ for some $s < t$. Causal effects are defined in terms of potential or counterfactual outcomes (see e.g. Hernán, 2004; Rubin, 2005). In general, a unit-level potential outcome, $Y_i(z)$, is defined as the outcome that we would have observed for subject $i$ if we could have intervened to set that subject’s treatment or exposure $Z_i$ to value $z$. A contagion effect of interest for dyadic data might be a contrast of counterfactuals of the form $Y^t_{ego}(y^{t-1}_{alter})$; for example, $E[Y^t_{ego}(y) - Y^t_{ego}(y-1)]$ would be the expected difference in the ego’s counterfactual outcome at time $t$ had the alter’s outcome at time $t-1$ been set to $y$ compared to $y-1$. In data comprised of independent dyads this contagion effect is well-defined, but social networks represent a paradigmatic opportunity for interference, whereby one subject’s exposure may affect not only his own outcome but also the outcomes of his social contacts and possibly other subjects. Under interference, the traditional unit-level potential outcomes are not well-defined. Instead, $Y_i(\mathbf{z})$ is the outcome that we would have observed if we could have set the vector of exposures for the entire population, $\mathbf{Z}$, to $\mathbf{z} = (z_1, ..., z_n)$ where for each $i$, $z_i$ is in the support of $Z$. The causal inference literature distinguishes between interference, which is present when one subject’s treatment or exposure may affect others’ outcomes, and contagion, which is present when one subject’s outcome may influence or transmit to other subjects (e.g. Ogburn and VanderWeele, 2014a), but in fact they are usually intertwined. Consider three Faber students: Alex, Andy, and Ari, all friends with each other. Alex’s outcome at time $t$ depends on both Andy’s and Ari’s outcomes at time $t-1$, Andy’s outcome at time $t$ depends on Alex’s and Ari’s at time $t-1$, and Ari’s outcome at time $t$ depends on Alex’s and Andy’s at time $t-1$. This results in a situation that is hardly distinguishable from the hallmarks of interference: $Y^t_{Alex}(y^{t-1}_{Andy}, y^{t-1}_{Ari})$, $Y^t_{Andy}(y^{t-1}_{Alex}, y^{t-1}_{Ari})$, and $Y^t_{Ari}(y^{t-1}_{Alex}, y^{t-1}_{Andy})$ are potential outcomes that depend on multiple “treatments” and those treatments are overlapping across subjects. Furthermore, just as in settings with interference, a counterfactual outcome for node $i$ that omits some of the treatments to which node $i$ is exposed (i.e. the outcomes at time $t-1$ of $i$’s alters) is not well-defined. This has been overlooked in most of the literature on contagion in observational social network data, which generally focuses on alter-ego pairs, thereby inherently considering ill-defined counterfactuals like $Y^t_{Alex}(y^{t-1}_{Andy})$.

This points to an under-appreciated challenge for the study of contagion in a social network: simply defining the causal effect of interest. If researchers sample non-overlapping alter-ego dyads from the network then $Y^t_{ego}(y^{t-1}_{alter})$ may be well-defined, but if they wish to use all of the available data, comprised of overlapping dyads, causal effects must be defined in terms of all of the alters for a particular ego. In the latter case, we could define a contagion effect that compares the mean counterfactual outcome for an ego had the mean outcome among the alters been set to one value as opposed to a different value. For simplicity, in the remaining sections we will talk about alter-ego pairs rather than clusters of an ego with all of its alters. This is in keeping with the existing applied literature, but it is important to note that close attention should be paid in future work to the definition of causal contagion effects for non-dyadic data.
Numerous papers and researchers have addressed the definition of counterfactuals and causal effects in settings with interference (e.g. Aronow and Samii, 2012; Halloran and Struchiner, 1995; Halloran and Hudgens, 2011; Hong and Raudenbush, 2006; Hudgens and Halloran, 2008; Ogburn and VanderWeele, 2014a; Rosenbaum, 2007; Rubin, 1990; Sobel, 2006; Tchetgen Tchetgen and VanderWeele, 2012); similar attention should be paid to contagion effects.

Confounding is, loosely, the presence of a non-causal association that may be misinterpreted as a causal effect of one variable on another. Most commonly, confounding is due to the presence of a confounder that has a causal effect on both the hypothesized cause and the hypothesized effect. Such a confounder generates an association between the hypothesized cause and effect which, without careful analysis, could be taken as evidence of a causal effect. There are two types of confounding that are nearly ubiquitous and especially intransigent in the context of contagion effects in social networks: homophily is the tendency of people who are similar to begin with to share network ties, and environmental confounding is the tendency of people who share network ties to also share environmental exposures that could jointly affect their outcomes. We elucidate these two types of confounding below.


Consider the Faber College student body. Suppose that two students, Pat and Lee, meet in September and bond over the fact that they both used to be competitive runners but recently developed injuries that prevent them from running and from participating in other active hobbies they used to enjoy. Soon Pat and Lee are close friends. Over the course of a few months, the sedentary lifestyle catches up with Pat, who gains a considerable amount of weight. It takes longer for Lee, but by the close of the school year Lee has also gained a lot of weight. If you did not have access to the back story and only observed that Pat gained weight and then Pat’s close friend Lee did too, this would look like potential evidence of a causal effect of Pat’s change in BMI on Lee’s change in BMI. In fact, this is a case of homophily: unobserved covariates related to the propensity to gain weight (in this case, recent injury) caused Pat and Lee to become friends and also caused them to both undergo changes in BMI.

Some carefully considered studies attempt to control for all sources of homophily (see Shalizi and Thomas, 2011 for details and references), but this is generally not possible unless researchers have a high degree of control over data collection and can collect extremely rich (and therefore expensive!) data on the covariates that affect ties. Any traits that are related to the formation, duration, or strength of ties and to the outcome of interest must be measured. For some outcomes, such as infectious diseases, it may be possible to enumerate and observe all such traits, but for other outcomes, such as BMI, endless permutations of the Pat-and-Lee story are possible (e.g. friendship based on shared body norms, shared love of sugary snacks, shared appreciation for a particular celebrity whose BMI changes could affect both Pat and Lee’s, etc.), making it nearly impossible to control for all potentially confounding traits.
In addition to the challenge of enumerating the potentially confounding traits, there are huge costs to collecting such rich data, and available social network data are highly unlikely to include adequate covariates.

For these reasons, researchers have developed clever tricks to try to control for homophily using only data on the network and the outcome of interest. One such trick is to include both the alter’s and the ego’s outcomes at time $t-2$ in a regression of the ego’s outcome at time $t$ on the alter’s outcome at time $t-1$. The argument used to justify this method is that any traits related to tie formation and to the outcome are fully captured by the similarity in the alter’s and ego’s outcomes at time $t-2$; any association between the alter’s outcome at time $t-1$ and the ego’s outcome at time $t$ after controlling for this baseline similarity must be due to contagion. But the story of Pat and Lee demonstrates one flaw in this argument: baseline traits can affect outcome trajectories over time and so conditioning on the outcome at a single time point does not render all future outcome measures independent of the baseline covariates. Another flaw in the argument is that homophily operates not only through the propensity to form ties, but also through the propensity to maintain ties and through the strength of the ties; neither strength nor duration can be captured by past outcomes (Noel and Nyhan, 2011). Furthermore, Shalizi and Thomas (2011) demonstrated that, even if a baseline trait only affects friendship formation (not strength or duration), merely conditioning on the presence of a tie, which is inherent in all analyses focused on alter-ego pairs, creates a spurious association between the alter’s outcome at time $t-1$ and the ego’s outcome at time $t$. This is because the presence of a tie is a collider: a common effect of two variables, conditioning on which creates a spurious association between the two causes. (For an accessible review of colliders see Elwert and Winship, 2014.)

Another clever trick is to compare the strength of the association between an alter’s and an ego’s outcomes across different types of ties: undirected, or mutual; directed, with the ego naming the alter as a friend but not vice versa; and directed, with the alter naming the ego as a friend but not vice versa. Suppose Pat claims Lee as a friend but Lee does not claim Pat as a friend. Any similarity in baseline traits that Pat and Lee share is a symmetric relationship, the argument goes, and therefore if the regression of Pat’s BMI at time $t$ on Lee’s BMI at time $t-1$ shows a stronger association than the regression of Lee’s BMI at time $t$ on Pat’s BMI at time $t-1$, this is evidence of contagion. Unfortunately, this argument is also flawed (Lyons, 2011; Shalizi and Thomas, 2011). This is because, somewhat counterintuitively, similarity in baseline traits does not have to be symmetric. Suppose Pat claims Lee as a friend because Lee is the only person Pat knows who is going through a painful separation from running and other active hobbies, while Lee participates in a support group for recently injured former runners and considers only one participant, Lou, who has the exact same injury and prognosis, as a friend. By construction, even though Lee is the node with the most baseline similarity to Pat from among all of Pat’s potential friends, the reverse is not true: Lou, not Pat, is the node with the most similarity to Lee from among all of Lee’s potential friends. Therefore, if Lou’s outcome at time $t -$ …

Let’s turn to a different pair of Faber students, Cam and Sam, who both decided to move off campus to a neighborhood across town from the college. Over the course of the school year, both the grocery store and the gym in their neighborhood closed down and were replaced with fast food restaurants. Cam immediately starts taking every meal at the fast food joint and gains weight fairly quickly, while Sam holds out for several months, taking the bus to a distant grocery store, but when winter weather and final exams pile on, Sam, too, falls prey to the fast food marketing. By the end of the year both students have gained weight. This is confounding due to shared environment, another source of confounding that plagues attempts to learn about contagion from observational data. People who share network ties tend to live near each other, work together, pay attention to the same information, or work in the same industry, all of which can generate confounding due to shared environment (which need not be restricted to physical environment). Note that confounding due to shared environment is present whether Cam and Sam are friends because they live in the same neighborhood or they moved to the same neighborhood because they were friends. The distinction between homophily and shared environment is not always clear-cut; if Cam and Sam became friends because they lived in the same neighborhood that would simultaneously be an example of homophily and of shared environment. The same strategies described above for dealing with homophily have been used in an attempt to control for confounding due to shared environment, but similar reasoning controverts their effectiveness.

Cohen-Cole and Fletcher (2008) proposed controlling for confounding by shared environment by including fixed effects for “community” in regressions of an ego’s outcome at time $t$ on an alter’s outcome at time $t-1$. If all such confounding occurs due to clearly delineated and known communities, like well-defined neighborhoods in the example above, this is potentially a good solution, though in many cases the operative communities, or their membership, will likely be unknown.
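
The community fixed-effects idea can be illustrated with a minimal simulation sketch. Everything in it is an assumption made for illustration and is not taken from the chapter: dyads are nested in known communities, each community has a single shared environmental shock, there is no contagion at all, and the names `simulate_dyads`, `slope`, and `within_slope` are hypothetical. Subtracting community means before regressing is numerically equivalent to including community indicator variables.

```python
import random
import statistics


def simulate_dyads(n_communities, dyads_per_community, rng):
    """Return (community, alter_lagged, ego_outcome) triples with NO contagion.

    Assumption: the only link between an alter and an ego is the
    environmental shock shared by everyone in their community.
    """
    rows = []
    for c in range(n_communities):
        shock = rng.gauss(0.0, 1.0)  # shared environment, e.g. nearby fast food
        for _ in range(dyads_per_community):
            alter_lagged = shock + rng.gauss(0.0, 1.0)  # alter's change at t-1
            ego_outcome = shock + rng.gauss(0.0, 1.0)   # ego's change at t
            rows.append((c, alter_lagged, ego_outcome))
    return rows


def slope(x, y):
    """Least-squares slope of y on x (regression with an intercept)."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


def within_slope(rows):
    """Slope after subtracting community means (community fixed effects)."""
    by_community = {}
    for c, a, e in rows:
        by_community.setdefault(c, []).append((a, e))
    xs, ys = [], []
    for pairs in by_community.values():
        mean_a = statistics.fmean(a for a, _ in pairs)
        mean_e = statistics.fmean(e for _, e in pairs)
        xs.extend(a - mean_a for a, _ in pairs)
        ys.extend(e - mean_e for _, e in pairs)
    return slope(xs, ys)


rng = random.Random(1)
rows = simulate_dyads(50, 20, rng)
naive = slope([a for _, a, _ in rows], [e for _, _, e in rows])
fixed_effects = within_slope(rows)
print(f"naive slope: {naive:.2f}; with community fixed effects: {fixed_effects:.2f}")
```

In this idealized setting the naive slope is clearly positive even though no contagion was simulated, and the within-community slope is close to zero. The sketch only works because community membership is known exactly; with unknown or misclassified communities the adjustment would not remove the confounding.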

Suppose confounding is not an issue, because researchers at Faber were well-funded and prescient enough to collect data on every possible confounder of the contagion effect, and further suppose that the researchers have a model, maybe a regression, maybe a propensity-score based method (Aral et al, 2009), maybe some other model, that they believe gives an estimate of the causal contagion effect. We now turn to the question of how to perform valid statistical inference using a model fit to data from a social network. The issue of valid statistical inference is entirely separate from the issue of confounding or even contagion; it applies whether we want to estimate a simple mean or a complicated causal effect. The key points made in this section apply to anything that we want to estimate using social network data. Most estimators of causal effects, including the coefficient on the alter’s outcome at time $t-1$ in a regression of the ego’s outcome at time $t$, are closely related to sample means (to be technical, they are M-estimators), so all of the points made below apply.

Going back to Faber College, administrators are now interested in the simpler problem of estimating the mean BMI for the student body at the end of the school year. There are $n$ students, or nodes in the social network comprised of students, and each one furnishes an observed BMI measurement $Y_i$. Our goal is to perform valid (frequentist) statistical inference about the true mean $\mu$ of $Y$ using a sample mean $\bar{Y} = \frac{1}{n}\sum_{i=1}^{n} Y_i$ of dependent observations $\mathbf{Y} = (Y_1, ..., Y_n)$, where the dependence among observations is determined or informed by network structure. But for the dependence, this is a familiar problem. In general, when we want to use a sample mean to perform inference about a true mean, we take the sample mean as our point estimate, calculate a standard error for the sample mean, and tack on a confidence interval based on that standard error.
The unique challenge for the social network setting is the effect of dependence on the standard error. To keep things as simple as possible, let’s assume that $Y_1, ..., Y_n$ are identically, though not independently, distributed, so the mean of $Y_i$ is $\mu$ and the variance of $Y_i$ is $\sigma^2$, which we assume is finite, for all $i$. (In fact, it is easier to deal with observations that are not identically distributed than it is to deal with observations that are dependent, so relaxing this assumption is not too difficult.)

Recall that the standard error of $\bar{Y}$ is the square root of its variance, where

$$\mathrm{Var}(\bar{Y}) = \frac{1}{n^2}\mathrm{Var}\left(\sum_{i=1}^{n} Y_i\right) = \frac{1}{n^2}\left\{\sum_{i=1}^{n}\sigma^2 + \sum_{i \neq j}\mathrm{cov}(Y_i, Y_j)\right\} = \frac{\sigma^2}{n} + \frac{1}{n^2}\sum_{i \neq j}\mathrm{cov}(Y_i, Y_j).$$

When $Y_1, ..., Y_n$ are independent, the covariance term $\mathrm{cov}(Y_i, Y_j)$ is equal to 0 for all $i \neq j$ pairs, so the variance of $\bar{Y}$ is $\sigma^2/n$, which should be familiar from any introductory statistics or data analysis class. But when $Y_1, ..., Y_n$ are dependent, in particular when they are positively correlated (which is the type of dependence that we would expect to see in just about every social network setting), the variance of $\bar{Y}$ is bigger than $\sigma^2/n$ because it includes the term $\frac{1}{n^2}\sum_{i \neq j}\mathrm{cov}(Y_i, Y_j)$. Define $b_n = \frac{1}{n}\sum_{i \neq j}\mathrm{cov}(Y_i, Y_j)$. Then

$$\mathrm{var}(\bar{Y}) = \frac{\sigma^2}{n / \left(1 + \frac{b_n}{\sigma^2}\right)}$$

and we can see that the factor by which the variance of $\bar{Y}$ is bigger than what it would be if $Y_1, ..., Y_n$ were independent is $\left(1 + \frac{b_n}{\sigma^2}\right)$. We call $n / \left(1 + \frac{b_n}{\sigma^2}\right)$ the effective sample size of our sample of $n$ dependent observations $Y_1, ..., Y_n$. The effective sample size $n / \left(1 + \frac{b_n}{\sigma^2}\right)$ is smaller than the true sample size $n$; heuristically this is because each observation $Y_i$ contains some new information about the target of inference $\mu$ and some information that is rendered redundant by dependence.
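
The variance decomposition above can be checked numerically. The following is a small sketch under stated assumptions: the covariance matrix is treated as known (in practice it is not), all marginal variances are equal, and the six-node ring network with covariance 0.5 between neighbours is a made-up example. The function names `variance_of_mean` and `effective_sample_size` are hypothetical.

```python
def variance_of_mean(cov):
    """Var(Y-bar) = (1/n^2) * (sum of all entries of the covariance matrix)."""
    n = len(cov)
    return sum(sum(row) for row in cov) / n ** 2


def effective_sample_size(cov):
    """n / (1 + b_n / sigma^2), with b_n = (1/n) * sum_{i != j} cov(Y_i, Y_j).

    Assumes every node has the same marginal variance sigma^2 = cov[0][0].
    """
    n = len(cov)
    sigma2 = cov[0][0]
    b_n = sum(cov[i][j] for i in range(n) for j in range(n) if i != j) / n
    return n / (1 + b_n / sigma2)


# Hypothetical example: 6 nodes on a ring, neighbours positively correlated.
n, sigma2, neighbour_cov = 6, 1.0, 0.5
cov = [[0.0] * n for _ in range(n)]
for i in range(n):
    cov[i][i] = sigma2
    for j in ((i - 1) % n, (i + 1) % n):
        cov[i][j] = neighbour_cov

ess = effective_sample_size(cov)
var_dependent = variance_of_mean(cov)
var_if_independent = sigma2 / n
print(ess, var_dependent, var_if_independent)
```

In this toy network $b_n = 1$, so the effective sample size is $6 / (1 + 1) = 3$: the six correlated observations carry as much information about $\mu$ as three independent ones, and the variance of the mean is twice what the independence formula would give.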
Under independence each observation furnishes 1 “bit” of information about $\mu$, whereas under dependence each observation furnishes only $1 / \left(1 + \frac{b_n}{\sigma^2}\right)$ bit of information about $\mu$.

In order to explain the impact of this dependence on statistical inference, we first review the standard inferential procedure for independent data. When $Y_1, ..., Y_n$ are independent, a typical procedure would be to calculate an approximate 95% confidence interval for $\mu$ as $\bar{Y} \pm 1.96 \times \hat{\sigma}/\sqrt{n}$, where $\hat{\sigma}$ is the square root of an estimate of the variance of $Y$. The factor 1.96 is the 97.5th percentile of the standard Normal distribution; t-distribution quantiles could be used instead to account for the fact that $\sigma$ is estimated rather than known. This procedure relies on several preliminaries: (1) $\bar{Y}$ is unbiased for $\mu$, (2) $\bar{Y}$ is approximately Normally distributed, and (3) $\hat{\sigma}/\sqrt{n}$ is a good estimate of the standard error of $\bar{Y}$. These preliminaries hold, at least approximately, in most settings with independent data and moderate to large $n$. Dependence doesn’t affect (1), but it does affect (2) and (3).

When $Y_1, ..., Y_n$ are independent, the Central Limit Theorem (CLT) tells us that $\sqrt{n}(\bar{Y} - \mu)$ converges in distribution to a Normal distribution as $n \to \infty$. The factor $\sqrt{n}$ is called the rate of convergence and it is needed to make sure that the variance of $\sqrt{n}(\bar{Y} - \mu)$ is not 0, in which case $\sqrt{n}(\bar{Y} - \mu)$ would converge to a constant rather than a distribution, and is not infinite, in which case $\sqrt{n}(\bar{Y} - \mu)$ would not converge at all. The variance of $\bar{Y}$ (equivalently, the variance of $\bar{Y} - \mu$) is $\sigma^2/n$, so the variance of $\sqrt{n}(\bar{Y} - \mu)$ is $n \times (\sigma^2/n) = \sigma^2$, which is a positive, finite constant. When $Y_1, ..., Y_n$ are dependent, the rate of convergence may be different (slower) than $\sqrt{n}$.
(In fact, if the dependence is strong and widespread enough, the CLT may not hold at all; determining what types of social network dependence are consistent with the CLT is an important area for future study.) This is because the rate of convergence is determined by the effective sample size instead of by $n$: the variance of $\bar{Y}$ is $\sigma^2 / \left\{ n / \left(1 + \frac{b_n}{\sigma^2}\right) \right\}$, so (as long as a CLT holds) $\sqrt{n / \left(1 + \frac{b_n}{\sigma^2}\right)}\,(\bar{Y} - \mu)$ will converge to a Normal distribution as $n \to \infty$ and the rate of convergence is given by $\sqrt{n / \left(1 + \frac{b_n}{\sigma^2}\right)}$ rather than $\sqrt{n}$. Sometimes, in particular when $b_n$ is fixed as $n \to \infty$, this distinction will be meaningless. But sometimes, when $b_n$ grows with $n$, it is a meaningfully slower rate of convergence. (Note that $b_n / n$ must converge to 0 as $n \to \infty$ in order for a CLT to hold, so $b_n$ must grow slower than $n$.) This matters because it informs when the approximate Normality of the CLT kicks in, i.e. at what sample size it is safe to assume that $\bar{Y}$ is approximately Normally distributed. Many different rules of thumb exist for determining when approximate Normality holds; one popular rule of thumb is that $n = 30$ suffices. With dependent data, this number is larger, and sometimes considerably so. The effective sample size, rather than $n$, should be used to assess whether the sample size is large enough to approximate the distribution of $\bar{Y}$ with a Normal distribution. When researchers ignore dependence and rely on the Normal approximation in samples that have large enough $n$ but not large enough effective sample size, there is no reason to think that their 95% confidence intervals will have good coverage properties.

Ignoring dependence is most dangerous when estimating the standard error of $\bar{Y}$. Any estimate of $\mathrm{var}(\bar{Y})$ that is based only on the marginal variances $\sigma^2$ of $Y_i$ and ignores the covariances $\mathrm{cov}(Y_i, Y_j)$ will underestimate the standard error of $\bar{Y}$ when the observations are positively correlated, often severely. Inference that is based on an underestimated standard error is anticonservative: confidence intervals are narrower than they should be and p-values are lower than they should be, leading researchers to draw conclusions that are not in fact substantiated by the data. Even if each observation is dependent only on a fixed and finite number of other observations, so that dependence is asymptotically negligible and does not affect the rate of convergence of the CLT, in finite samples ignoring the covariance terms in $\mathrm{var}(\bar{Y})$ could still have substantial implications for inference. This is particularly a problem because no good solutions exist. Statisticians are good at dealing with dependence that arises due to space or time, or even other more complicated processes that can be expressed using Euclidean geometry. But dependence that is informed by a network is very different from these well-understood types of dependence, and, unfortunately, statisticians are only just beginning to develop methods for taking it into account.
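
The anticonservative behaviour described above can be seen in a small Monte Carlo sketch. The data-generating process is an assumption chosen for simplicity and is not the chapter's: nodes are grouped into disjoint friend groups that share a latent trait, which is a crude stand-in for network dependence. The names `naive_ci_covers` and `coverage` and all parameter values are hypothetical.

```python
import math
import random
import statistics


def naive_ci_covers(y, mu):
    """Does Y-bar +/- 1.96 * sigma-hat / sqrt(n), which ignores covariances, cover mu?"""
    n = len(y)
    y_bar = statistics.fmean(y)
    naive_se = statistics.stdev(y) / math.sqrt(n)
    return abs(y_bar - mu) <= 1.96 * naive_se


def coverage(n_groups, group_size, latent_sd, reps, rng):
    """Share of simulated samples whose naive 95% interval covers the true mean 0."""
    hits = 0
    for _ in range(reps):
        y = []
        for _ in range(n_groups):
            latent = rng.gauss(0.0, latent_sd)  # trait shared within a friend group
            y.extend(latent + rng.gauss(0.0, 1.0) for _ in range(group_size))
        hits += naive_ci_covers(y, 0.0)
    return hits / reps


rng = random.Random(0)
coverage_independent = coverage(20, 10, 0.0, 500, rng)  # no shared trait
coverage_dependent = coverage(20, 10, 1.0, 500, rng)    # shared latent trait
print(coverage_independent, coverage_dependent)
```

With no shared trait the naive interval covers the true mean in roughly 95% of samples; with the shared trait, coverage falls far below the nominal level even though $n = 200$ in both cases, because the effective sample size is much smaller than $n$.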
Most published research about social contagion uses regression models or generalized estimating equations (GEEs) to estimate contagion effects; though some of these models account for the dependence due to observing the same nodes over multiple time points, none of them account for dependence among nodes.

In the literature on spatial and temporal dependence, dependence is often implicitly assumed to be the result of latent traits that are more similar for observations that are close in Euclidean distance than for distant observations. This type of dependence is likely to be present in many network contexts as well. In networks, edges present opportunities to transmit traits or information, and contagion or influence is an important additional source of dependence that depends on the underlying network structure.

Latent trait dependence will be present in data sampled from a network whenever observations from nodes that are close to one another are more likely to share unmeasured traits than are observations from distant nodes. Homophily is a paradigmatic example of latent trait dependence. If the outcome under study in a social network has a genetic component, then we would expect latent variable dependence due to the fact that family members, who share latent genetic traits, are more likely to be close in social distance than people who are unrelated. If the outcome were affected by geography or physical environment, latent variable dependence could arise because people who live close to one another are more likely to be friends than those who are geographically distant. Of course, whether these traits are latent or observed they can create dependence, but if they are observed then conditioning on them renders observations independent, so only when they are latent do they result in dependence that requires new tools for statistical inference.
Just like in the spatial dependence context, there is often little reason to think that we could identify, let alone measure, all of these sources of dependence. The notions of latent sources of homophily or latent correlates of shared environment are familiar from the discussion of confounding, above, but there is an important distinction to be made between latent sources of confounding and latent sources of dependence: in order to be a source of unmeasured confounding, a latent trait must affect both the exposure (e.g. the alter’s outcome at time $t-1$) and the outcome (ego’s outcome at time $t$) of interest. In order to be a source of dependence, a latent trait must affect two or more outcomes of interest. Latent trait dependence is the most general form of dependence, in that it provides no structure that can be harnessed to propel inference. In order to make any progress towards valid inference in the presence of latent trait dependence, some structure must be assumed, namely that the range of influence of the latent traits is primarily local in the network and that any long-range effects are negligible.

Contagion or influence arises when the outcome under study is transmitted from node to node along edges in the network. The diagram in Figure 1 depicts contagion in a network with three nodes in which node 2 is connected to nodes 1 and 3 but there is no edge between 1 and 3. $Y^t_i$ represents the outcome for node $i$ at time $t$, and the unit of time is small enough that at most one transmission event can occur between consecutive time points. Dependence due to contagion has known, though possibly unobserved, structures that can sometimes be harnessed to facilitate inference; we touch on this briefly in Section 6. Crucially, whenever contagion is present so is dependence, and therefore statistical analysis must take dependence into account in order to result in valid inference.

Fig. 1 Dependence by contagion

Researchers have known for decades that learning about contagion from observational data is fraught with difficulty, perhaps most famously expressed by Manski (1993). Recent years have seen incremental methodological progress, but huge hurdles remain. Most of the constructive ideas in Shalizi and Thomas (2011) involve bounding contagion effects rather than attempting to point identify them; looking for bounds rather than point estimates is a general approach that could prove fruitful in the future. Indeed, Ver Steeg and Galstyan (2010) built upon the ideas in Shalizi and Thomas (2011) and were able to derive bounds on the association due to homophily on traits that do not change over time (“static homophily”). Another general approach is to make use of sensitivity analyses whenever an estimation procedure relies on assumptions that may not be realistic (e.g. VanderWeele, 2011). Some of the problems discussed above have solutions in some settings; below we discuss solutions that exploit features of specific settings rather than providing general approaches to the problem of estimating contagion effects. (Some of the material below was first published in Ogburn and Volfovsky, 2016.)

If it is possible to randomize some members of a social network to receive an intervention, and if it is known that an alter’s receiving an intervention can only affect the ego’s outcome through contagion (as opposed to directly; see Ogburn and VanderWeele 2014a for discussion), then problems of confounding and dependence can be entirely obviated. Randomization-based inference, pioneered by Fisher (Fisher, 1922) and applied to network-like settings by Rosenbaum (2007) and Bowers et al (2013), is founded on the very intuitive notion that, under the null hypothesis of no effect of treatment on any subject (sometimes called the sharp null hypothesis to distinguish it from other null hypotheses that may be of interest), the treated and control groups are random samples from the same underlying distribution. Randomization-based inference treats outcomes as fixed and treatment assignments as random variables: quantities that depend on the vector of treatment assignments are the only random variables in this paradigm. Therefore, dependence among outcomes is a non-issue. Typically this type of inference is reserved for hypothesis testing, though researchers have extended it to estimation. We leave the details, including several subtleties and challenges that are specific to the social network context, to a later chapter (see also Ogburn and Volfovsky, 2016 for a review).

Randomizing the formation of network ties themselves obviates confounding due to the effects of homophily on tie formation. A number of studies have taken advantage of naturally occurring randomizations of this kind, such as the assignment of students to dorm rooms (Sacerdote, 2000) or of children to classrooms (Kang, 2007). However, this does not suffice to control for the effects of homophily on tie strength or duration, or to control for confounding due to shared environment.

If researchers are willing to commit to certain types of parametric models, it may be possible to isolate contagion from confounding (Snijders et al, 2007). It is a reliance on strong parametric models, for example, that underpins mathematical modeling or agent-based modeling approaches to contagion (Burk et al, 2007; Snijders et al, 2010; Railsback and Grimm, 2011).

This might seem benign (after all, most statistical analyses rely on parametric models of one kind or another), but there is a fundamental difference between, for example, using a linear regression when the true underlying relationship is not linear, and relying on parametric models to identify a causal effect that is otherwise hopelessly confounded. In the first case, a misspecified model may bias the estimate we are interested in, often in ways that are well-understood, and often in proportion to the lack of fit of the model to the data (i.e. the worse the misspecification, the greater the bias). In the latter case, at least in the absence of a model-specific proof otherwise, any hint of misspecification undermines the causal interpretation we would like to be able to justify, and what looks like evidence of a causal effect could just be evidence of confounding. George Box’s oft-cited aphorism, “all models are wrong but some are useful,” justifies the use of misspecified parametric models in many settings, but when the parametric form of the model is the only bulwark against confounding, the model must (in the absence of a proof to the contrary) in fact be correct in order to be useful.

O’Malley et al (2014) proposed an instrumental variable (IV) solution to the problem of disentangling contagion from homophily. An instrument is a random variable, V, that affects exposure but has no effect on the outcome conditional on exposure. When the exposure-outcome relation suffers from unmeasured confounding but an instrument can be found that is not confounded with the outcome, IV methods can be used to recover valid estimates of the causal effect of the exposure on the outcome. In this case there is unmeasured confounding of the relation between an alter’s outcome at time t − 1 and an ego’s outcome at time t whenever there is homophily on unmeasured traits. Angrist and Pischke (2008), Greenland (2000), and Pearl (2000) provide accessible reviews of IV methods.

O’Malley et al (2014) propose using a gene that is known to be associated with the outcome of interest as an instrument. In their paper they focus on perhaps the most highly publicized claim of peer effects, namely that there are significant peer effects of body mass index (BMI) and obesity (Christakis and Fowler, 2007). If there is a gene that affects BMI but that does not affect other homophilous traits, then that gene is a valid instrument for the effect of an alter’s BMI on the ego’s BMI. The gene affects the ego’s BMI only through the alter’s manifest BMI (and it is independent of the ego’s BMI conditional on the alter’s BMI), and there is unlikely to be any confounding, measured or unmeasured, of the relation between an alter’s gene and the ego’s BMI.

There are two important challenges to this approach. First, the power to detect peer effects is dependent in part upon the strength of the instrument-exposure relation which, for genetic instruments, is often weak. Indeed, O’Malley et al (2014) reported low power for their data analyses. Second, in order to assess contagion at more than a single time point (i.e. the average effect of the alter’s outcomes on the ego’s outcomes up to that time point), multiple instruments are required. O’Malley et al (2014) suggest using a single gene interacted with age to capture time-varying gene expression, but this could further attenuate the instrument-exposure relation, and this method is not valid unless the effect of the gene on the outcome really does vary with time; if the gene-by-age interactions are highly collinear then they will fail to act as differentiated instruments for different time points.

When multiple independent networks are observed, the problems of confounding due to shared environment and of dependence may be considerably easier to deal with. A large literature on interference in causal inference is dedicated to inference in the setting where independent groups of individuals interact and affect one another within, but not between, groups; this is analogous to multiple independent social networks (see, e.g., Sobel, 2006; Hong and Raudenbush, 2006; Hudgens and Halloran, 2008; Tchetgen Tchetgen and VanderWeele, 2012; Liu and Hudgens, 2014). If environmental factors are shared within but not across networks, it may be possible to control for confounding by shared environment via a fixed effect for each network, as in Cohen-Cole and Fletcher (2008).
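As a rough illustration of the instrumental variable logic described in this section, the following simulation generates an alter's exposure and an ego's outcome that share an unmeasured trait, together with a binary instrument that affects only the alter's exposure. Everything here is an assumption made for illustration (the linear data-generating model, the true effect of 0.5, and all function and variable names); it is not the estimator or the data of O’Malley et al (2014). It compares the naive regression slope with the ratio (Wald) IV estimate.

```python
import numpy as np


def simulate(n=200_000, effect=0.5, instrument_strength=1.0, seed=0):
    """Hypothetical linear model with confounding by an unmeasured trait."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n)            # unmeasured homophilous trait
    v = rng.binomial(1, 0.3, size=n)  # alter's gene (the instrument)
    # Alter's earlier BMI depends on the gene and on the shared trait.
    a = instrument_strength * v + u + rng.normal(size=n)
    # Ego's later BMI depends on the alter's BMI and on the shared trait,
    # but not directly on the alter's gene.
    y = effect * a + u + rng.normal(size=n)
    return v, a, y


def naive_slope(exposure, outcome):
    """Slope from regressing the outcome on the exposure, ignoring confounding."""
    a = np.asarray(exposure, dtype=float)
    y = np.asarray(outcome, dtype=float)
    return np.cov(a, y)[0, 1] / np.var(a, ddof=1)


def wald_iv_estimate(instrument, exposure, outcome):
    """Ratio (Wald) IV estimator: cov(V, Y) / cov(V, A)."""
    v = np.asarray(instrument, dtype=float)
    a = np.asarray(exposure, dtype=float)
    y = np.asarray(outcome, dtype=float)
    return np.cov(v, y)[0, 1] / np.cov(v, a)[0, 1]


v, a, y = simulate()
print("naive slope:", naive_slope(a, y))          # biased away from 0.5
print("IV estimate:", wald_iv_estimate(v, a, y))  # close to 0.5 in this setup
```

Lowering `instrument_strength` toward zero makes the denominator of the ratio small and the IV estimate unstable, which is the weak-instrument problem noted above for genetic instruments.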

If researchers have reason to believe that there is no unmeasured homophily or features of shared environments that contribute to confounding or to dependence, i.e. if contagion is the only mechanism giving rise either to dependence or to associations among the outcomes of interest, then there are a few recent methodological advances that can be used to estimate contagion effects (van der Laan, 2012; Ogburn and VanderWeele, 2014b; Ogburn et al, 2017). Dependence due to contagion has known, though possibly unobserved, structures that can sometimes be harnessed to facilitate inference. Time and distance act as information barriers for dependence due to contagion, giving rise to many conditional independencies that can sometimes be used to make network dependence tractable. Two examples of the many conditional independencies that hold in Figure (1) are [Y t ⊥ Y t | Y t − , Y t − , Y t − , Y t − ] and [Y t − ⊥ Y t | Y t − ]. The first conditional independence statement illustrates the principle that outcomes measured at a particular time point are mutually independent conditional on all past outcomes. The second conditional independence statement illustrates the fact that outcomes sampled from two nonadjacent nodes are independent if the amount of time that passed between the two measurements was not sufficiently long for information to travel along the shortest path from one node to the other, conditional on any information that could have simultaneously influenced the sampled nodes (in this case Y t − ). Observing outcomes in a network on a fine enough time scale to observe all transmissions requires a richness of data that will not usually be available, and if the network under a contagious process is observed at a single time point, dependence due to contagion is indistinguishable from latent variable dependence and the structure is lost.

Acknowledgements

This work was funded by the Office of Naval Research grant N00014-15-1-2343.

References

Ali MM, Dwyer DS (2009) Estimating peer effects in adolescent smoking behavior: A longitudinal analysis. Journal of Adolescent Health 45(4):402–408
Angrist JD, Pischke JS (2008) Mostly harmless econometrics: An empiricist’s companion. Princeton University Press
Aral S, Muchnik L, Sundararajan A (2009) Distinguishing influence-based contagion from homophily-driven diffusion in dynamic networks. Proceedings of the National Academy of Sciences 106(51):21544–21549
Aronow PM, Samii C (2012) Estimating average causal effects under general interference. Tech. rep.
Besag J (1974) On spatial-temporal models and Markov fields. In: Transactions of the Seventh Prague Conference on Information Theory, Statistical Decision Functions, and Random Processes, Springer, pp 47–55
Bowers J, Fredrickson MM, Panagopoulos C (2013) Reasoning about interference between units: A general framework. Political Analysis 21(1):97–124
Burk WJ, Steglich CE, Snijders TA (2007) Beyond dyadic interdependence: Actor-oriented models for co-evolving social networks and individual behaviors. International Journal of Behavioral Development 31(4):397–404
Cacioppo JT, Fowler JH, Christakis NA (2009) Alone in the crowd: the structure and spread of loneliness in a large social network. Journal of Personality and Social Psychology 97(6):977
Christakis N, Fowler J (2007) The spread of obesity in a large social network over 32 years. New England Journal of Medicine 357(4):370–379
Christakis N, Fowler J (2008) The collective dynamics of smoking in a large social network. New England Journal of Medicine 358(21):2249–2258
Christakis N, Fowler J (2010) Social network sensors for early detection of contagious outbreaks. PLoS ONE 5(9):e12948
Cohen-Cole E, Fletcher JM (2008) Is obesity contagious? Social networks vs. environmental factors in the obesity epidemic. Journal of Health Economics 27(5):1382–1387
Elwert F, Winship C (2014) Endogenous selection bias: The problem of conditioning on a collider variable. Annual Review of Sociology 40:31–53

Fisher RA (1922) On the mathematical foundations of theoretical statistics. Philosophical Transactions of the Royal Society of London Series A, Containing Papers of a Mathematical or Physical Character 222:309–368
Goetzke F (2008) Network effects in public transit use: evidence from a spatially autoregressive mode choice model for New York. Urban Studies 45(2):407–417
Greenland S (2000) An introduction to instrumental variables for epidemiologists. International Journal of Epidemiology 29(4):722–729
Halloran M, Hudgens M (2011) Causal inference for vaccine effects on infectiousness. The University of North Carolina at Chapel Hill Department of Biostatistics Technical Report Series p 20
Halloran ME, Struchiner CJ (1995) Causal inference in infectious diseases. Epidemiology 6(2):142–151
Hernán MA (2004) A definition of causal effect for epidemiological research. Journal of Epidemiology and Community Health 58(4):265–271
Hong G, Raudenbush S (2006) Evaluating kindergarten retention policy. Journal of the American Statistical Association 101(475):901–910
Hudgens M, Halloran M (2008) Toward causal inference with interference. Journal of the American Statistical Association 103(482):832–842
Kang C (2007) Classroom peer effects and academic achievement: Quasi-randomization evidence from South Korea. Journal of Urban Economics 61(3):458–495
van der Laan MJ (2012) Causal inference for networks. UC Berkeley Division of Biostatistics Working Paper Series Working Paper 300
Lauritzen SL, Richardson TS (2002) Chain graph models and their causal interpretations. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 64(3):321–348
Lazer D, Rubineau B, Chetkovich C, Katz N, Neblo M (2010) The coevolution of networks and political attitudes. Political Communication 27(3):248–274
Lee LF (2004) Asymptotic distributions of quasi-maximum likelihood estimators for spatial autoregressive models. Econometrica 72(6):1899–1925
Lin X (2005) Peer effects and student academic achievement: an application of spatial autoregressive model with group unobservables. Unpublished manuscript, Ohio State University
Liu L, Hudgens MG (2014) Large sample randomization inference of causal effects in the presence of interference. Journal of the American Statistical Association 109(505):288–301
Lyons R (2011) The spread of evidence-poor medicine via flawed social-network analysis. Statistics, Politics, and Policy 2(1)
Manski CF (1993) Identification of endogenous social effects: The reflection problem. The Review of Economic Studies 60(3):531–542
Noel H, Nyhan B (2011) The unfriending problem: The consequences of homophily in friendship retention for causal estimates of social influence. Social Networks 33(3):211–218
Ogburn EL, VanderWeele TJ (2014a) Causal diagrams for interference. Statistical Science

Ogburn EL, VanderWeele TJ (2014b) Vaccines, contagion, and social networks. arXiv preprint arXiv:1403.1241
Ogburn EL, Volfovsky A (2016) Networks. In: P B, P D, M K, van der Laan MJ (eds) Handbook of Big Data, Chapman & Hall/CRC
Ogburn EL, O S, van der Laan MJ, I D (2017) Causal inference for social network data with contagion. Tech. rep., Johns Hopkins University
O’Malley AJ, Elwert F, Rosenquist JN, Zaslavsky AM, Christakis NA (2014) Estimating peer effects in longitudinal dyadic data using instrumental variables. Biometrics
O’Malley JA, Marsden PV (2008) The analysis of social networks. Health Services and Outcomes Research Methodology 8(4):222–269
Pearl J (2000) Causality: models, reasoning and inference. Cambridge Univ Press
Railsback SF, Grimm V (2011) Agent-based and individual-based modeling: a practical introduction. Princeton University Press
Rosenbaum P (2007) Interference between units in randomized experiments. Journal of the American Statistical Association 102(477):191–200
Rosenquist JN, Murabito J, Fowler JH, Christakis NA (2010) The spread of alcohol consumption behavior in a large social network. Annals of Internal Medicine 152(7):426–433
Rubin D (1990) On the application of probability theory to agricultural experiments. Essay on principles. Section 9. Comment: Neyman (1923) and causal inference in experiments and observational studies. Statistical Science 5(4):472–480
Rubin DB (2005) Causal inference using potential outcomes: Design, modeling, decisions. Journal of the American Statistical Association 100(469):322–331
Sacerdote B (2000) Peer effects with random assignment: Results for Dartmouth roommates. Tech. rep., National Bureau of Economic Research
Shalizi CR (2012) Comment on “Why and when ‘flawed’ social network analyses still yield valid tests of no contagion”. Statistics, Politics, and Policy 3(1)
Shalizi CR, Thomas AC (2011) Homophily and contagion are generically confounded in observational social network studies. Sociological Methods & Research 40(2):211–239
Snijders T, Steglich C, Schweinberger M (2007) Modeling the coevolution of networks and behavior.
Snijders TA, Van de Bunt GG, Steglich CE (2010) Introduction to stochastic actor-based models for network dynamics. Social Networks 32(1):44–60
Sobel M (2006) What do randomized studies of housing mobility demonstrate? Journal of the American Statistical Association 101(476):1398–1407
Tchetgen Tchetgen EJ, VanderWeele T (2012) On causal inference in the presence of interference. Statistical Methods in Medical Research 21(1):55–75
Thomas A (2013) The social contagion hypothesis: comment on ‘Social contagion theory: examining dynamic social networks and human behavior’. Statistics in Medicine 32(4):581–590
VanderWeele TJ (2011) Sensitivity analysis for contagion effects in social networks. Sociological Methods & Research 40(2):240–255